Jupyter notebook/lab is a fantastic tool for data analysis. I used it a lot in my daily research. But one of my habits makes me suffered a lot: I keep updating the same notebook for a long time, and to make the notebook neat, I always delete some of the cells, which are "redundant"/"useless" for me at certain time points (or I just delete them by accident). So in many cases, I wondered whether I could add version control to notebooks, so that when something wrong happens, I can rollback to the status before specific events.

Start to secure your notebooks

The basic idea is to add a hook to Jupyter whenever it finishes saving a file (especially notebook), which will:

  1. generate a git repo in the folder where the notebook locates (if there isn’t);
  2. copy the notebook to the repo (and convert notebook to python script);
  3. add all changes to git;
  4. commit modifications.

Note: In many cases, we will focus in a short time and make a lot of modifications to notebooks. To avoid making too many commits, it would be better if we can randomly skip some modifications. In this post, let's denote time span (in seconds) between last commit and current modification as \(ts\), then the probability is \(\min(1, \frac{ts}{3600})\) for the modified notebook to get committed.

Here are the steps to enable automatical backup (MUST install git first):

  1. Make sure you have a configuration file (jupyter_notebook_config.py) for Jupyter Notebook or Jupyter Lab. By default, it's located at ~/.jupyter. If are not sure, you can run jupyter --config-dir, which will tell you the file locates. If you don't have any configuration files, generate one with jupyter notebook --generate-config;
  2. Add the following codes to the begining of the configuration file:
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    61
    62
    63
    64
    65
    66
    67
    68
    69
    70
    71
    72
    73
    74
    75
    76
    import os
    from time import ctime, time
    from random import choices
    from subprocess import check_call, CalledProcessError
    from shutil import copyfile

    def post_save(model, os_path, contents_manager):
    """
    This function will do the following jobs:
    1. creating a new folder named vcs (version control system)
    2. converting jupyter notebooks to both python scripts and html files
    3. moving converted files to vcs folder
    4. keeping tracks with these files via git
    """
    if model['type'] != 'notebook':
    return
    logger = contents_manager.log

    d, fname = os.path.split(os_path)

    # skip new born notebooks
    if fname.startswith("Untitled"):
    return

    vcs_d = os.path.join(d, "vcs")
    if not os.path.exists(vcs_d):
    logger.info("Creating vcs folder at %s" % d)
    os.makedirs(vcs_d)

    try:
    check_call(['git', 'rev-parse'], cwd=vcs_d)
    except CalledProcessError:
    logger.info("Initiating git repo at %s" % d)
    check_call(['git', 'init'], cwd=vcs_d)

    def add_file_or_not(file_name, folder):
    file_path = os.path.join(folder, file_name)
    if os.path.exists(file_path):
    ctime = os.path.getctime(file_path)
    delta = time() - ctime
    delta = delta if delta >= 0 else 0
    # if file was modified in the past 1 hour,
    # then the new modification got 30% chance
    # to be saved
    prob = delta / 3600
    prob = prob if prob <= 1 else 1
    save_or_not = choices((0, 1), weights=(1-prob, prob), k=1)[0]
    if save_or_not:
    return True
    else:
    return False
    else:
    return True

    rfn, ext = os.path.splitext(fname)

    # in case notebook is not a python-based one (R,...)
    script_ext = ".py"
    with open(os_path) as fh:
    tmp = json.load(fh)
    script_ext = tmp["metadata"]["language_info"]["file_extension"]
    script_fn = rfn+script_ext

    if add_file_or_not(script_fn, d):
    check_call(['jupyter', 'nbconvert', '--to', 'script', fname], cwd=d)
    os.replace(os.path.join(d, script_fn), os.path.join(d, "vcs", script_fn))
    copyfile(os_path, os.path.join(d, "vcs", fname))

    check_call(['git', 'add', script_fn, fname], cwd=vcs_d)
    commit_msg = 'Autobackup for %s (%s)' % (fname, ctime())
    try:
    check_call(['git', 'commit', '-m', commit_msg], cwd=vcs_d)
    except CalledProcessError:
    pass
    else:
    logger.info("File too new to be traced.")
  3. Search for post_save_hook in the configuration file; uncomment this line (if you see a # at the begining of the line, remove it), and change it to:
    1
    c.FileContentsManager.post_save_hook = post_save
  4. Restart Jupyter Lab or Notebook.

Restore from disasters

The easiest way to restore jupyter notebook from backups is add the vcs folder to a GUI for git (like Sourcetree) and export the version that you want to roll back to. But in case GUIs are not available, you can use the following commands.

Let’s assume the notebook is a.ipynb, and it locates at /foo/bar, then the git repo is located at /foo/bar/vcs, you can:

  1. use git log --abbrev-commit to see the history of modifications;
  2. choose the potential commit according to the time of submission, and record the commit id (red boxes above);
  3. use git show COMMIT_ID:FILE_NAME > SAVE_TO to export the committed file to SAVE_TO. If you want to export the records jupyter, then in this case, you can replace FILE_NAME as a.ipynb; but if you just want to export python codes, then you can replace FILE_NAME with a.py.

A more comprehensive implementation

A better implementation of this function is to enable:

  1. backup/version control only in specific folders instead of everywhere;
  2. tracing other non-notebook files, like Python/R scripts;
  3. allowing pushing changes to remote repositories.

So by introducing the following two changes, the above three goals can be meet smoothly:

  1. Add a configuration file (backup.conf) in the folder that you want to enable version control, and below is a template for this file
    1
    2
    3
    4
    5
    6
    7
    8
    9
    [repo]
    local_repo_folder = vcs
    remote_repo =

    [backup]
    backup_notebooks = no
    backup_notebook_converted_scripts = yes
    backup_by_filetypes = yes
    backup_filetypes = .py|.R

The keys in the configuration file stand for:

  • local_repo_folder: name of the local folder that the git repo will be located;
  • remote_repo: remote address for the repo, leave it as blank if you don't want to push these changes to other servers;
  • backup_notebooks: set it to yes to enable tracing Notebooks in .ipynb format. NOTE: if your notebook contains a lot images or significant amount of outputs, then you may want to set it as no to save some space;
  • backup_notebook_converted_scripts: set it to yes to enable tracing Notebooks in .py format or any other format that the notebook is based-on;
  • backup_by_filetypes and backup_filetypes: set the first one to be yes to enable tracing other text files with extensions in backup_filetypes. In this example, all files ending with .py and .R will be traced (case-insensitive).
  1. Modify post_save:
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    61
    62
    63
    64
    65
    66
    67
    68
    69
    70
    71
    72
    73
    74
    75
    76
    77
    78
    79
    80
    81
    82
    83
    84
    85
    86
    87
    88
    89
    90
    91
    92
    93
    94
    95
    96
    97
    98
    99
    100
    101
    102
    103
    import os
    import json
    from configparser import ConfigParser
    from time import ctime, time
    from random import choices
    from subprocess import check_call, CalledProcessError
    from shutil import copyfile

    def post_save(model, os_path, contents_manager):
    """
    This function will do the following jobs:
    1. creating a new folder named vcs (version control system)
    2. converting jupyter notebooks to both python scripts and html files
    3. moving converted files to vcs folder
    4. keeping tracks with these files via git
    """
    d, fname = os.path.split(os_path)
    conf_file = os.path.join(d, "backup.conf")
    if not os.path.exists(conf_file):
    return
    config = ConfigParser()
    config.optionxform = str
    config.read(conf_file)
    supported_files = set([ft.lower() for ft in config.get("backup", "backup_filetypes").split("|")])
    logger = contents_manager.log

    vcs_d = os.path.join(d, config.get("repo", "local_repo_folder"))
    if not os.path.exists(vcs_d):
    logger.info("Creating vcs folder at %s" % d)
    os.makedirs(vcs_d)

    repo_init = 0
    try:
    check_call(['git', 'rev-parse'], cwd=vcs_d)
    except CalledProcessError:
    logger.info("Initiating git repo at %s" % d)
    check_call(['git', 'init'], cwd=vcs_d)
    if config.get("repo", "remote_repo") != "":
    check_call(['git', 'remote', 'add', 'origin', config.get("repo", "remote_repo")], cwd=vcs_d)
    repo_init = 1

    def add_file_or_not(file_name, folder):
    file_path = os.path.join(folder, file_name)
    if os.path.exists(file_path):
    ctime = os.path.getctime(file_path)
    delta = time() - ctime
    delta = delta if delta >= 0 else 0
    # if file was modified in the past 1 hour,
    # then the new modification got 30% chance
    # to be saved
    prob = delta / 3600
    prob = prob if prob <= 1 else 1
    save_or_not = choices((0, 1), weights=(1-prob, prob), k=1)[0]
    if save_or_not:
    return True
    else:
    return False
    else:
    return True

    rfn, ext = os.path.splitext(fname)
    lext = ext.lower()
    updated_files = []
    if model["type"] == "notebook":
    # skip new born notebooks
    if fname.startswith("Untitled"):
    return

    # in case notebook is not a python-based one (R,...)
    script_ext = ".py"
    with open(os_path) as fh:
    tmp = json.load(fh)
    script_ext = tmp["metadata"]["language_info"]["file_extension"]
    script_fn = rfn + script_ext

    if add_file_or_not(script_fn, d):
    if config.get("backup", "backup_notebooks") == "yes":
    copyfile(os_path, os.path.join(vcs_d, fname))
    updated_files.append(fname)
    if config.get("backup", "backup_notebook_converted_scripts") == "yes":
    check_call(['jupyter', 'nbconvert', '--to', 'script', fname], cwd=d)
    os.replace(os.path.join(d, script_fn), os.path.join(vcs_d, script_fn))
    updated_files.append(script_fn)
    else:
    logger.info("File too new to be traced.")
    elif config.get("backup", "backup_by_filetypes") == "yes" and lext in supported_files:
    copyfile(os_path, os.path.join(vcs_d, fname))
    updated_files.append(fname)

    if len(updated_files) > 0:
    cmd = ['git', 'add']
    cmd.extend(updated_files)
    check_call(cmd, cwd=vcs_d)
    commit_msg = 'Autobackup for %s (%s)' % (fname, ctime())
    try:
    check_call(['git', 'commit', '-m', commit_msg], cwd=vcs_d)
    if config.get("repo", "remote_repo") != "":
    if repo_init:
    check_call(['git', 'push', '--set-upstream', 'origin', 'master'], cwd=vcs_d)
    else:
    check_call(['git', 'push'], cwd=vcs_d)
    except CalledProcessError as e:
    logger.info(e)

References