Last week I needed to take a backup of my laptop's contents because I had to give it away for repair. When I started just plain copying my files to an external hard drive, I noticed that the git files (the files git uses to store a project's version history) were taking a very long time, since they were tiny in size but huge in number. So I decided to find a way to create a single file for each git-enabled project and then just copy that one file over to the external hard drive.
I came across two options provided by git itself:
- git archive
- git bundle
git archive creates just a snapshot of how the project is at that particular instant and does not contain any of the version history, so this option went out the window for me.
git bundle gave me the option to generate a single file containing all of my version history across all branches. This appeared to be the solution I was looking for.
But alas, bundle doesn't store any of the changes you have stashed in your local repo.
After much searching I couldn't find anything that gave me both the version history and all the stashed changes in a single file. So I decided to write my own Python script, which would create a zip file containing the bundle file generated by git bundle and patch files for all the stashed changes in my repo.
I have described how I made the program in this article; if you only want the script, there is a link to the gist at the bottom of this article.
Project breakdown
I decided to break the project down into parts and then conquer them one by one.
- Getting the directory for which the git backup needs to be generated
- Checking if that directory is actually the root directory of a git-enabled project
- Creating a git bundle file from within the Python script
- Checking if the project has any changes that haven't been committed or stashed yet
- Stashing any changes found in the step above
- Checking if the project has any stashed changes
- Creating individual patch files for all the stashed changes found in the previous step
- Creating a zip file which contains the bundle file from git and all the patch files created in the previous step
- Deleting the bundle and patch files, since they are already included in the zip file
Step 1: Getting the directory
I decided to give the user two options. They can either:
- Go to the directory via terminal and execute my script from there
- Provide the directory as a command-line argument to my script
import sys
import os

arguments = sys.argv[1:]
if len(arguments) == 0:
    directory_to_backup = os.getcwd()
else:
    directory_to_backup = arguments[0]
directory_to_backup = os.path.abspath(directory_to_backup)
In Python, sys.argv returns a list of the command-line arguments passed to our program. Since the first item in that list is always the name of our program, we ignore it and look at the rest of the list.
If the rest of the list is empty, we take the first use case into consideration, i.e. the user has launched our program from the directory which they want to back up. Therefore we use os.getcwd(), which returns the current working directory, which would be the directory from where our script was launched.
After that we use os.path.abspath(directory_to_backup) to clean up the path provided to us and convert it to an absolute path as used by the system.
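Just to illustrate what that normalisation does (the paths here are hypothetical, and the exact output depends on where you run it from):

# Illustration only: os.path.abspath resolves relative paths against the
# current working directory and normalises things like '..' and '.'
print(os.path.abspath('../my-project'))  # e.g. /home/user/my-project
print(os.path.abspath('.'))              # same as os.getcwd()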
Step 2: Checking if this is a root-directory for a git project
Next up on the list was to check whether the directory we were given is actually the root directory of a project which has git initialized in it. This is very important, since the rest of the script would simply crash if we tried to execute it on a folder which isn't actually a git project.
I decided to just check for the presence of a .git folder inside the root folder that was supplied to us, since git stores all of its data for a project in a .git folder inside that project. If you know of a better way to make this check, do let me know.
git_directory = os.path.join(directory_to_backup, '.git')
if os.path.exists(git_directory) and os.path.isdir(git_directory):
    create_backup_zip(directory_to_backup)
else:
    print(directory_to_backup + ' is not a git repository')
os.path.exists checks for the existence of git_directory, and os.path.isdir checks that it is actually a directory and not just a file named .git.
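One alternative I can think of (not what the script uses) would be to let GitPython do the detection: constructing a Repo raises InvalidGitRepositoryError when the path isn't a repository. A rough sketch, assuming we only want to accept the repository root itself:

# Alternative sketch, not used in the script: let GitPython decide.
from git import Repo
from git.exc import InvalidGitRepositoryError, NoSuchPathError

def is_git_repository(path):
    try:
        Repo(path)  # raises if `path` is not a git repository (or doesn't exist)
        return True
    except (InvalidGitRepositoryError, NoSuchPathError):
        return False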
Step 3: Creating a git bundle from within my script
Before this I had only ever created git bundles from the terminal, never from within a Python script, so I had no idea how to do it.
At first I decided that I would issue terminal commands from my script and capture their output. But then I thought that it probably wouldn't work well across platforms (I am looking at you, Windows).
I then went looking for a Python library to interact with git repositories. The first result Google came up with was GitPython.
I started reading its documentation and very quickly (like in under 5 minutes) started to get overwhelmed by all the git-related concepts it was throwing at me. But after around 10 minutes of searching, I found something helpful: I could get a handle on the Repo object for that git project and then execute my git commands using that handle.
from git import Repo

parent_directory, directory_name = os.path.split(directory_to_backup)
repository = Repo(directory_to_backup)
git_handle = repository.git  # lets us run git commands against this repo
git_bundle_file_name = os.path.join(parent_directory, (directory_name + '.bundle'))
git_handle.bundle('create', git_bundle_file_name, '--all')
This is the equivalent of:
git bundle create <git_bundle_name> --all
The --all flag is to include all the branches and not just the current branch.
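Since a backup is only useful if you can get the data back out, here is a quick sketch of how the bundle could be checked and restored later (this is my addition, not part of the script; the target directory name is made up). git bundle verify and cloning from a bundle file are standard git features, and GitPython exposes them through the same handle and through Repo.clone_from:

# Not part of the backup script: checking and restoring the bundle later.
# 'restored-project' is just a hypothetical target directory.
git_handle.bundle('verify', git_bundle_file_name)           # git bundle verify <file>
Repo.clone_from(git_bundle_file_name, 'restored-project')   # git clone <file> restored-project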
Step 4: Check for any changes that aren't committed or stashed
Next I needed to check for any changes that hadn't been committed and also hadn't been stashed, so that we could stash them.
Just running git status gives me a list of the changed files, but it also outputs other things to make it human-friendly. There is an option, --porcelain, which strips out all that extra info.
Some files may have been left untracked, and we don't want to create a new stash when the only uncommitted files are untracked ones. So we filter the lines by the value in their first column and remove any line whose first-column value is made up only of ?. I don't know what all the values in the first column mean, but I do know that ? signifies an untracked file. If our list of unstaged files is not empty after removing all the untracked files, we perform the next step; otherwise we skip it. The relevant check, taken from the full script at the end of the article, is shown below.
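(import re comes from Python's standard library; git_handle is the same handle we got from GitPython in step 3.)

import re

# '--porcelain' prefixes each line with a short status field; '??' marks an
# untracked file, so those lines are dropped before deciding whether to stash.
pattern_for_untracked_files_flag = re.compile(r'^\?+$')
non_staged_files_raw_output = git_handle.status('--porcelain')
if len(non_staged_files_raw_output.strip()) > 0:
    non_staged_files = list(filter(
        lambda x: pattern_for_untracked_files_flag.match(x[0]) is None,
        list(map(lambda _: _.strip().split(), non_staged_files_raw_output.split(os.linesep)))
    ))
else:
    non_staged_files = []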
Step 5: Stash any changes from working tree and staging area
If we find any changed files that aren't just untracked ones, we stash those changes. The command is a simple one.
git_handle.stash('push', '-m', 'git-backup-stash')
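If you want to confirm that the stash actually took the changes (my own sanity check, not something the script does), re-running the porcelain status should now show only untracked files, since stash push without -u leaves those alone:

# Optional sanity check, not in the original script.
remaining = git_handle.status('--porcelain').strip()
leftover_tracked = [line for line in remaining.split(os.linesep)
                    if line and not line.startswith('??')]
print(leftover_tracked)  # expected to be empty right after the stash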
Step 6: Check if the project has any stashed changes
Now we check if the project has any stashed changes. We do this so that we can determine whether or not we need to make any patch files. Git has a simple command to list all the stashes.
stashes = git_handle.stash('list').strip()
stash_list = stashes.split(os.linesep)
if (len(stashes) > 0) and (len(stash_list) > 0):
    # Create patch-files here
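Two details worth spelling out here (my reading of the code, not something spelled out elsewhere): splitting an empty string still gives a one-element list, which is why both length checks are needed, and each line of git stash list for a stash saved with a message looks roughly like stash@{N}: On <branch>: <message>.

# Why check both lengths? Splitting an empty string never gives an empty list:
print(''.split(os.linesep))  # ['']  -> length 1 even when there are no stashes
# A stash created with `git stash push -m <message>` is listed roughly as:
#   stash@{0}: On master: git-backup-stash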
Step 7: Create patch files for any stashes we have
Now we create patch files for any stashes that we have in our local repo. These files can then later be used to restore the changes that were stashed. Each stash entry gets its own patch file.
Format of the patch-file's name is as below:
<stash_name>:<branch_name>:<stash_message>.txt
patch_files = []
if (len(stashes) > 0) and (len(stash_list) > 0):
    patch_files_directory = os.path.join(parent_directory, 'patch-files')
    os.makedirs(patch_files_directory, exist_ok=True)
    for stash in stash_list:
        stash_name = stash.split(': ')[0]
        stash_branch_name = stash.split(': ')[1].split()[1]
        stash_message = ': '.join(stash.split(': ')[2:])
        patch_file_name = ':'.join([stash_name, stash_branch_name, stash_message]) + '.txt'
        patch_file_path = os.path.join(patch_files_directory, patch_file_name)
        patch_contents = git_handle.stash('show', '-p', stash_name)
        with open(patch_file_path, 'x') as f:
            f.write(patch_contents)
        patch_files.append(patch_file_path)
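To round this out, here is how one of these patch files could be applied again later (my addition, not part of the script; the paths are purely illustrative). git apply is the standard way to replay a diff, and GitPython forwards it through the repository's git handle:

# Not part of the backup script: re-applying a saved patch in a restored clone.
# Both paths below are hypothetical; use whatever you extracted from the zip.
restored = Repo('restored-project')
restored.git.apply('patch-files/stash@{0}:master:git-backup-stash.txt')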
Step 8: Create a zip-file of the bundle and all the patch files
Now all we need to do is combine the files above into a single zip file. First we write the bundle file created by git into the zip file. Then we add the patch files, if we have any, into a sub-folder called "patch-files" inside the zip file.
from zipfile import ZipFile

zip_file_name = directory_name + '.zip'
zip_file_path = os.path.join(parent_directory, zip_file_name)
with ZipFile(zip_file_path, 'x') as z:
    z.write(git_bundle_file_name, arcname=os.path.split(git_bundle_file_name)[1])
    for patch_file in patch_files:
        z.write(patch_file, arcname=os.path.join('patch-files', os.path.split(patch_file)[1]))
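If you want to double-check the archive afterwards (an optional step I added, not part of the script), ZipFile can list what ended up inside and test the entries:

# Optional check, not in the original script.
with ZipFile(zip_file_path) as z:
    print(z.namelist())  # e.g. ['my-project.bundle', 'patch-files/stash@{0}:master:git-backup-stash.txt']
    print(z.testzip())   # None means no corrupt entries were found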
Step 9: Deleting the bundle file and the patch files
Not so fast, buddy; we still need to delete the git bundle file and the patch files that we created. Since we have already included them in our zip file, we no longer need them and should clean them up.
import shutil

os.remove(git_bundle_file_name)
if len(patch_files) > 0:
    shutil.rmtree(patch_files_directory)
Full Code
#!/usr/bin/python3

import os
import sys
import re
import shutil

from git import Repo
from zipfile import ZipFile


def create_backup_zip(directory_to_backup):
    parent_directory, directory_name = os.path.split(directory_to_backup)

    repository = Repo(directory_to_backup)
    git_handle = repository.git

    git_bundle_file_name = os.path.join(parent_directory, (directory_name + '.bundle'))
    git_handle.bundle('create', git_bundle_file_name, '--all')

    pattern_for_untracked_files_flag = re.compile(r'^\?+$')
    non_staged_files_raw_output = git_handle.status('--porcelain')
    if len(non_staged_files_raw_output.strip()) > 0:
        non_staged_files = list(filter(
            lambda x: pattern_for_untracked_files_flag.match(x[0]) is None,
            list(map(lambda _: _.strip().split(), non_staged_files_raw_output.split(os.linesep)))
        ))
    else:
        non_staged_files = []

    if len(non_staged_files) > 0:
        git_handle.stash('push', '-m', 'git-backup-stash')

    stashes = git_handle.stash('list').strip()
    stash_list = stashes.split(os.linesep)

    patch_files = []
    if (len(stashes) > 0) and (len(stash_list) > 0):
        patch_files_directory = os.path.join(parent_directory, 'patch-files')
        os.makedirs(patch_files_directory, exist_ok=True)
        for stash in stash_list:
            stash_name = stash.split(': ')[0]
            stash_branch_name = stash.split(': ')[1].split()[1]
            stash_message = ': '.join(stash.split(': ')[2:])
            patch_file_name = ':'.join([stash_name, stash_branch_name, stash_message]) + '.txt'
            patch_file_path = os.path.join(patch_files_directory, patch_file_name)
            patch_contents = git_handle.stash('show', '-p', stash_name)
            with open(patch_file_path, 'x') as f:
                f.write(patch_contents)
            patch_files.append(patch_file_path)

    zip_file_name = directory_name + '.zip'
    zip_file_path = os.path.join(parent_directory, zip_file_name)
    with ZipFile(zip_file_path, 'x') as z:
        z.write(git_bundle_file_name, arcname=os.path.split(git_bundle_file_name)[1])
        for patch_file in patch_files:
            z.write(patch_file, arcname=os.path.join('patch-files', os.path.split(patch_file)[1]))

    os.remove(git_bundle_file_name)
    if len(patch_files) > 0:
        shutil.rmtree(patch_files_directory)

    print(zip_file_name + ' created')


if __name__ == "__main__":
    arguments = sys.argv[1:]
    if len(arguments) == 0:
        directory_to_backup = os.getcwd()
    else:
        directory_to_backup = arguments[0]
    directory_to_backup = os.path.abspath(directory_to_backup)

    git_directory = os.path.join(directory_to_backup, '.git')
    if os.path.exists(git_directory) and os.path.isdir(git_directory):
        create_backup_zip(directory_to_backup)
    else:
        print(directory_to_backup + ' is not a git repository')
GitHub gist: https://gist.github.com/VarunBarad/c291e98dd426b0da1322171290d7bbd0
That's all folks
This is my solution to the problem I faced of taking a backup of my projects which also included my unfinished work. If you have any suggestions regarding this or any other topic under the sky, contact me or tweet to me @varun_barad.