What is git?
git is a version control system (VCS)
git is a version control system developed by Linus Torvalds, who
is also the father of the Linux operating system, to facilitate the
development and maintenance of the Linux kernel. It helps programmers
to track the changes in the code and collaborate within a development team.
The story (ref1, ref2) goes that before
git, the Linux kernel
development team was using a proprietary VCS software, BitKeeper,
whose owner later threatened to withdraw the free license after
noticing that one team member, Andrew Tridgell, attempted to reverse
engineer the BitKeeper protocol. After weeks of failed mediation with
BitKeeper, Linus decided to create his own VCS. So he did,
within only a weekend’s time. About 2 months later,
git was used to
manage a new Linux kernel release.
git was initially created by and for programmers, and still
is mostly used by them. However, some of its features, particularly
the version control functionality, can be used to track changes in
files other than source code, and benefit researchers not majoring in
git as VCS over the file-renaming approach
The file-renaming approach
One of the core functionalities of
git is version control. We
probably all have had the experience of copying the work-in-progress document
and renaming the file to indicate a version change:
manuscript_draft_1.tex
manuscript_draft_2.tex
manuscript_draft_3.tex
manuscript_draft_3_minor_change.tex
manuscript_draft_4.tex
...
manuscript_final.tex
manuscript_final_with_minor.tex
manuscript_final_with_minor_new_figure.tex
manuscript_final_with_minor_new_figure_and_table.tex
...
Each of these suffixes labels a new version of the same document, and collectively they form a development history. Although this is the easiest and most natural version control approach, it has 4 major disadvantages:
- Version history quickly becomes obscure, and changes become difficult to track.
As the revisions pile up, it becomes increasingly difficult to summarize a new version into a few words that can fit into the file name. A numerical version number is nice and clean, but it doesn’t carry much information about what exactly has been changed. Over time (maybe after a month?), the file names start to lose their meanings to you, and you start to look at the modification time instead to order the versions and find the latest one.
- Difficult to implement "branches" in the versions
I’m already using the
git terminology. By "branches" I mean some kind
of alternative pathways in the development of the document. Say after
manuscript_draft_4.tex you decide to go back to
manuscript_draft_2.tex and start working from there,
so the subsequent development diverges into two separate
branches/pathways: one continuing the
manuscript_draft_4.tex line of
thought and the other going in the direction of
manuscript_draft_2.tex. How would you name the next version of the
manuscript_draft_2.tex branch? manuscript_draft_5.tex? That would be equally legit as
the next iteration of the
manuscript_draft_4.tex branch. I think you
get the idea. This suffix-appending scheme may
suffice if you have a single linear development pathway and each
new iteration is based only on its immediate predecessor.
"Alternative universes" are not easy to maintain in this manner.
- Prone to data loss and difficulties in maintaining multiple copies
Before I started using git, when I needed to maintain multiple copies of the
same document across different machines, for instance one on my office desktop
and another on my personal laptop, I used a USB flash drive to copy files
over. Other than the slight inconvenience involved in carrying the
drive around, plugging-in/out and copying/pasting, it worked fine.
Until one time I confused myself with which one was the newer version
and overwrote the newer with the older. Because versioning is done by
nothing but the creation of new files, once the files themselves are
lost or overwritten, one loses the version as well. It is also
inconvenient to maintain multiple copies of the same document
versions, as you have to copy all the version files over in order to
retain the entire version history.
- Duplicate copies take more disk space.
With hard drives becoming cheaper, some additional disk space usage is not a big deal today. However, the duplicates still clutter up your project folder and make collaboration less convenient.
With git as the version control system, there would be a single
manuscript.tex file in the project folder, and the different
versions are tracked and represented by different commits
(we will go deeper into such concepts later). Then the revision
history would look like this:
More information can be displayed if such one-liner descriptions are not informative enough:
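Since the screenshots are not reproduced here, below is a minimal sketch of how such history views can be produced. Everything in it (the temporary folder, the Demo User identity, the file contents and the commit messages) is made up for illustration and is not the actual history of this post's manuscript.

```shell
set -eu
# Throwaway repo in a temporary folder; identity and messages are placeholders.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "First draft." > manuscript.tex
git add manuscript.tex
git commit -q -m "Add first draft of the manuscript"

echo "A new figure." >> manuscript.tex
git commit -q -am "Add a figure on the model results"

# One line per commit: abbreviated hash plus the commit title.
git --no-pager log --oneline
# More detail: author, date, full message and per-file change statistics.
git --no-pager log --stat
# Filter by contributor, time period and file.
git --no-pager log --author="Demo User" --since="1 week ago" -- manuscript.tex
```

The single manuscript.tex stays in the folder the whole time; the versions live in the commit history instead of in the file names.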
You can browse through the commit history, perform a filtering by the
contributor’s name, the time period or track down some individual
files/folders. It also allows one to revert back to a previous version
and start working from there on, or compare the changes between any 2
commits. With proper usages of the
git commands, the entire editing
history of the document(s) can be recorded and there will be no loss of
data even if you mistakenly overwrite or delete the file(s). Maintenance
of multiple copies can also be achieved by pushing and pulling from a
remote repository on different machines.
In the next section I will introduce some basic concepts of git by
walking you through a common
git workflow. In so doing I will also
cover some of the most frequently used git commands. git is
available on all major platforms including Linux, macOS and
Windows, with a terminal command line interface or a graphical
interface. I’ll be covering only the Linux command line usage of
git. If you choose to use a GUI implementation of
git or in a
different OS, the exact procedures will show some differences, but the
basic concepts are the same.
Basic concepts and a sample workflow
Step 0: Install
git and do some configurations
git comes pre-installed in some Linux distros. In case not already
installed, it is recommended to install it from the distro’s software
repository. For instance, for Debian based distros, do
apt install git
For Arch based distros, use
pacman -S git
After installation, it is recommended to do some basic configurations:
git config --global user.name "<name>"
git config --global user.email "<email>"
git config --global core.editor vim
git config --global alias.ci commit
git config --global alias.st status
The --global flag makes these configurations global to all
repositories (we will talk about what
git repositories are in a
minute) of the current user of the system.
git has 3 levels of
configurations, stored in 3 different files:
|level|config file location|purpose|
|local|<repo>/.git/config|repository specific settings|
|global|~/.gitconfig|user specific settings|
|system|/etc/gitconfig (typical location on Linux)|system-wide settings for all users|
A local level config will override the global level one, which in turn overrides the system level one. Using the
--local flag, or not giving a level option to the
git config command,
will set the configs at the local level.
In those configs, user.name and
user.email define some user
information. These are used to label a contributor in a git
repository. This information is more important when there are multiple
contributors working on the same project. The alias.ci and
alias.st lines define aliases for the 2 most frequently used
git commands. They allow you to use git ci and git st
in the command line to achieve the same as if you had typed git commit and git status:
git ci is short for git commit, and git st is short for git status.
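To see the configuration levels and the aliases in action, here is a small sandboxed sketch. It points HOME at a temporary folder so that your real ~/.gitconfig is not touched; the names and email address are placeholders I made up.

```shell
set -eu
# Sandbox: use a temporary HOME so the real ~/.gitconfig stays untouched.
demo=$(mktemp -d)
export HOME="$demo/home"
unset XDG_CONFIG_HOME
mkdir -p "$HOME" "$demo/repo"

# Global level: written to $HOME/.gitconfig. Name and email are placeholders.
git config --global user.name "Global Name"
git config --global user.email "global@example.com"
git config --global alias.st status

cd "$demo/repo"
git init -q
# Local level: written to .git/config, overrides the global value in this repo only.
git config --local user.name "Local Name"

git config user.name              # effective value inside this repo
git config --global user.name     # value stored at the global level
git config --list --show-origin   # every setting together with the file it comes from
git st                            # the alias expands to "git status"
```

Inside the repo the local value wins, while the global one still applies to any other repo owned by the same user.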
All configuration options are stored in plain text files, and you can use
git config --global --edit
to open the configuration file (in this case,
~/.gitconfig) in a
text editor to modify these settings. Typically, you only need to do
such configurations once after a new
git installation on
a new working machine, then you can forget about it and focus on the
Step 1a: Create a repository
After setting up
git, the next step would be to create a new
git repository (
repo for short). A repo is a virtual container of the
files of a project. From a file system point of view, it is nothing
but a normal folder on the computer, whether it is running Linux,
macOS or Windows. In that repo folder,
git maintains a special
.git subfolder (note the leading dot in the directory name) where it
stores the repo configurations, metadata and the history of
changes. In short, all the magic of
git happens inside this
folder. Outside of
.git, you manage files just like in any normal
folder. Once you delete
.git, you delete the
git repo as well, and it
becomes truly an ordinary folder in the file system.
There is more than one way to create a
git repo, and it is possible to use
git as a standalone, offline version control system without
connecting to any remote server. However, we will be focusing on a
more typical workflow where you maintain a local copy of a
git repo in
one machine and a remote repo hosted in a hosting service.
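For completeness, the standalone, offline route mentioned above can be sketched as follows. The my_project folder name is hypothetical; the sketch only shows that a repo is a normal folder plus the .git subfolder.

```shell
set -eu
# Hypothetical project folder inside a temporary location.
project=$(mktemp -d)/my_project
mkdir -p "$project"
cd "$project"

git init -q                                   # creates the .git subfolder, nothing else
ls -a                                         # .git shows up next to your own files
inside=$(git rev-parse --is-inside-work-tree) # "true" as long as .git exists
echo "$inside"

rm -rf .git                                   # deleting .git leaves an ordinary folder behind
ls -a
```

No server or network connection is involved at any point here.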
Popular choices of
git repo hosting services include Github,
bitbucket and Gitlab. I used bitbucket initially, then migrated
over to Github after it opened private repos for free and I started to
experience some connectivity issues with bitbucket (the migration from
one service to another is super easy, just a few clicks in the web
interface). I had little experience with Gitlab. It allows you to
deploy your own host in your own server. In the rest of the post we
will be using Github as our hosting service.
After registering an account with Github, you create a new repository from the web interface of Github. Then it presents you a form like the following to fill:
Aside from Owner and Repository name, all other fields are optional.
In this demo, we named the repo
git_demo, and gave it a one-sentence description:
A demo repo to illustrate the use of git.
We toggled the
Private radio button to make it a private repo, and
chose to initialize the repo with a default README file. All these
settings, including the repo name, can be changed later on.
After clicking the
Create repository button, Github will do the
demanded job and present you with this screen after it finishes:
It can be seen that in the repo there is currently only one file
README.md that Github created for us. Its contents are displayed
below. Currently it only has a title "git_demo" and the description
line we added earlier on.
Then we click on the green button labeled
"Code", and copy this line
of code into the clipboard:
This is the URL address of our repo. NOTE that yours will be different.
Go to a terminal window, navigate into a folder where you want to store the repo, and type in the following command (paste the copied URL from clipboard):
git clone git@github.com:Xunius/git_demo.git
Again use your own URL in the command. Hit enter to execute it. Here is the screenshot of the output:
After running git clone, we navigated into the newly created git_demo
folder (now it is a
git repo), and showed its contents. You can see the
.git folder, and the README.md file.
Now we have a local, private copy of the repo, which is connected to and tracking the remote repo hosted in Github. Next we will be working with our own local copy of the repo. Imagine this is a collaborative work. Other members of the team will be working on their own copies as well.
Step 2: Make changes in the working directory
In the cloned repo, we do some serious, world-changing work:
We added a new
script1.py file, and appended a new line to the end of
README.md. In real world applications you will be writing some
code files or manuscript files, creating some images etc.
Now examine the current status of the repo, using the git status
command. (Remember that we have aliased status as
st). Here is the output:
git informs us that there is one modified file (
modified: README.md), under the
Changes not staged for commit:
category. And the
script1.py file is under the Untracked files:
category. What do these mean?
In a git repo, there is a staging area, which is a kind of
"preparation area" for one to prepare for the changes to be committed
(see the schematic in Figure 8).
You can make various changes to the files in the repo, like creating
new files/folders, modifying existing ones, renaming or deleting
things. All these happen in the working directory of the
repo, but nothing has been added to the staging area just yet.
After making some changes in the working directory, you can choose to add some specific changes, or all of them, to the staging area in preparation for a commit. You could also remove some files from the staging area, or make some more changes and add them to the staging area. Once you are happy with the changes in the staging area, you move onto the next step and commit these changes.
Therefore the general workflow is (see the schematic in Figure 8):
- a. make new changes.
- b. add to the staging area (git add).
- c. commit changes in the staging area (git commit).
In this demo, we would like to add all the new changes to the staging area. We can achieve this by using:
git add -A
git add --all
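NOTE that there are subtle differences between
git add -A,
git add * and
git add . (with a trailing dot). In short, git add * lets the shell expand the *, which skips hidden files, while git add . only covers the current folder and its subfolders. The most robust and recommended way is
git add -A.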
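The differences between the three forms are easier to see in a toy repo. The sketch below is only an illustration with made-up file names (top.txt, .hidden.txt, sub/nested.txt), and it assumes git 2.x behavior.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "demo" > README.md
git add README.md
git commit -q -m "Initial commit"

# Three new files: a normal one, a hidden one and one in a subfolder.
mkdir sub
touch top.txt .hidden.txt sub/nested.txt

# 1. git add *: the shell expands * before git runs, and the glob skips dotfiles.
git add *
git status --short        # .hidden.txt is still listed as untracked (??)
git reset -q              # unstage everything so the next variant starts clean

# 2. git add . run inside sub/: only that folder and below get staged.
(cd sub && git add .)
git status --short        # top.txt and .hidden.txt are still untracked
git reset -q

# 3. git add -A: stages every change in the whole repo, from any folder.
(cd sub && git add -A)
git status --short        # all three new files are staged
```

If you always run the command from the top folder of the repo, git add . and git add -A end up staging the same things; the difference only shows from within a subfolder.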
The below image shows the output of
git st after
git add -A:
Step 3: Make commits
git commit command
After finishing building the staging area, we can commit these new changes using:
git ci
(Remember that we have aliased commit as ci.)
git will open an editor for you to type in the commit
message, see Figure 10 below.
Since we have set the default
git editor to vim (see Step 0
above), it opened a
vim session for me to edit the commit message.
The very first line is the title of this commit, and for historical reasons, it is recommended to keep the title within 50 characters. This also encourages one to make more atomic changes (discussed shortly).
The lines starting with a
# sign are comments created by git. Everything
between the title line and the comments is a more detailed
description of the commit. You may choose to put some records on all
the changed files, or only those most important ones. The key idea is
to make it easier for someone else, or your future self, who would be reading the
git log to get a good knowledge about this particular commit.
Lastly, after one saves and closes this file, the commit is done.
Those changes in the staging area are committed into the repo’s history, while those changes you made in the working directory but not added to the staging area, if there are any, are not committed.
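Here is a small sketch of that last point: only what is in the staging area goes into the commit. It recreates the Step 2 changes in a throwaway repo; the contents of script1.py and README.md are my guesses for illustration, and -m is used to give the commit title directly instead of opening the editor.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "# git_demo" > README.md
git add README.md
git commit -q -m "Initial commit"

# Two changes in the working directory, mirroring Step 2.
echo 'print("Hello world!")' > script1.py
echo "A new line." >> README.md

git add script1.py                  # stage only the new script
git commit -q -m "Add script1.py"   # -m supplies the title without opening the editor

git status --short                  # README.md still shows up as modified
git --no-pager show --stat --oneline HEAD   # the commit contains script1.py only
```

The README.md edit stays in the working directory, ready to be staged and committed separately.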
A commit is like a snapshot of the repo: it records the state
of all the tracked files as they stand in the staging area at that moment. Things that have
been added by the
git add command get some kind of index updated for git
to track their status across the development stages. Things that are
added to the working directory but have not been added by git add
are not tracked, and a
git commit won’t affect such untracked files.
Over time, you will be making more and more commits, which would start
to form a trace of history, as shown in Figure 1. You can look
back at them using the
git log command.
For each commit there is a hash label, e.g.
6cc52c7 (an abbreviated form of the full commit hash), that uniquely
identifies the commit. The current commit also gets a special "label",
HEAD (all in capitals), that by default points to the latest commit.
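If you want to look at these labels yourself, the sketch below prints them for a toy repo. The notes.txt file and the commit messages are made up; your hashes will differ from anyone else's.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "one" > notes.txt
git add notes.txt
git commit -q -m "First commit"
echo "two" >> notes.txt
git commit -q -am "Second commit"

git rev-parse --short HEAD             # abbreviated hash of the latest commit
git rev-parse HEAD                     # the full hash behind it
git --no-pager log --oneline -1 HEAD~1 # HEAD~1 is the commit one step before HEAD
```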
Try making atomic commits
A good practice of using
git is to make atomic commits. "Atomic"
means that a single commit contains only some tightly related changes
about a very specific issue, a feature, or a topic. In our
demo changes made in Step 2, the new
script1.py file and the
README.md file are all about a new Python script added to
the project, so they qualify as an atomic change. If there had been
another new file doing something totally unrelated, it would be better
to leave that to another commit. You can see that the staging area is
there to help one build atomic changes: you can make a bunch of
changes to the working directory, but commit them little by little,
each time as an atomic commit.
The advantage of doing atomic commits is to make future maintenance easier. Imagine you make a big commit that bundles multiple new features/new chapters into the code/manuscript, and weeks later you realize there was an error in one of these features/chapters and you need to revert back to the commit where the mistake was made and correct it. Bundled commits make it difficult to pin-point where exactly the issue happened; and even if you do locate it, difficult to perform an accurate surgery that doesn’t affect other unrelated things.
Step 4 [optional] Undo local changes
I made this step optional because you don’t always need to do this.
If everything goes correctly, you can move onto step
5. However, we make mistakes, and
git provides us some mechanisms to
undo them. Note that these undo mechanisms are specific to the local
repository, before the changes have been pushed to the remote.
In step 2 we saw that
git has a staging area, and to
commit some changes there are 2 separate steps: adding to the staging
area and committing from the staging area. Therefore there are different
undo commands for different situations.
The 1st type is to undo the
git add process: you added some changes
to the staging area, but later decide to take them down. To achieve
this, use git restore --staged or
git reset HEAD. For instance, either
git restore --staged script1.py
or
git reset HEAD script1.py
takes the changes involving the
script1.py file out of the staging
area. To take everything down from the staging area, use either of:
git restore --staged .
git reset HEAD
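A quick sketch of unstaging in a throwaway repo, using both the newer and the older command. The file contents are made up, and git restore needs git 2.23 or newer.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "# git_demo" > README.md
git add README.md
git commit -q -m "Initial commit"

echo 'print("Hello world!")' > script1.py
echo "A new line." >> README.md
git add -A
git status --short                 # both changes are staged

git restore --staged script1.py    # newer command, git 2.23+
git reset -q HEAD README.md        # older equivalent
git status --short                 # nothing is staged; the edits are still in the files
```

Note that neither command touches the file contents; they only empty the staging area.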
Note that in Figure 9 after we ran the
git add -A command, the
git st output already offered you
git restore --staged <file> option:
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
If you read other tutorials online, you might find that a different option is given instead:
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)
This is because
git restore is a relatively new feature added in
version 2.23 (Aug 2019) (see this post). Before that,
git reset HEAD was the
method to take down some staged files. In versions from 2.23 onwards,
both methods work, therefore I included both in the schematic in
Figure 8.
Undo changes in working directory
git restore --staged or
git reset HEAD only takes down the changed files
from the staging area, but that doesn’t make them not "changed
files" – the changes are still there, in those files in the working
directory. What if you really screwed up and want to revert the
changes made to the files?
To achieve this, use
git restore <file>
or
git checkout <file>
to discard the changes made to
<file> and revert it to the last commit.
To discard all new changes in all files, use
git restore .
or
git checkout .
(see Figure 8).
Note that in Figure 7 after we made the new changes to README.md, the
git st output already offered you the
git restore <file> option:
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   README.md
Again, git restore is new in version 2.23, and git checkout
was the old way of doing this, but still works in new versions.
Also note that when you discard changes using
git restore or
git checkout, those changes are lost permanently. In case you want to
retain these changes for later, check out the
git stash command.
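As a rough sketch of the difference between shelving and discarding, the toy example below first stashes an edit, brings it back, and only then throws it away for good. The file content is made up.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "# git_demo" > README.md
git add README.md
git commit -q -m "Initial commit"

echo "experimental edit" >> README.md
git stash -q                  # shelve the edit; README.md goes back to the last commit
git status --short            # prints nothing, the working directory is clean
git --no-pager stash list     # the shelved edit is kept here
git stash pop -q              # bring the edit back into the working directory
git restore README.md         # now really discard it; this cannot be undone
```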
Undo both git add and changes in working directory
To reset the staging area and the working directory to match the most recent commit, use:
git reset --hard
(See the big yellow arrow in Figure 8).
This has the combined effects of
git restore --staged and
git checkout. After this, the staging area is clear, and no tracked file
in the working directory differs from the last
commit. Again, any uncommitted change you made is permanently lost, so use with caution.
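A minimal sketch of what git reset --hard does and does not touch, in a throwaway repo with made-up contents. The scratch.txt file is a hypothetical untracked file, added to show that untracked files are left alone.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "# git_demo" > README.md
git add README.md
git commit -q -m "Initial commit"

echo "staged edit" >> README.md
git add README.md
echo "unstaged edit" >> README.md
touch scratch.txt             # a new, untracked file

git status --short            # README.md has both a staged and an unstaged change
git reset -q --hard           # drop both; tracked files match the last commit again
git status --short            # only the untracked scratch.txt is listed
```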
If some changes have already been committed to the
git history, there
are 2 ways to "undo" them:
git commit --amend
This can be used to modify the most recent commit. Suppose after we
committed the changes made to
script1.py, we decide to let it print
Goodbye world! instead of
Hello world!, we could:
- a. edit the
script1.py file to implement the new changes.
- b. stage the change with git add script1.py, then run
git commit --amend --no-edit
Or, we are happy about the changes in both script1.py and
README.md, but only need to modify the commit message. Then simply run
git commit --amend
It will prompt you to edit the commit message again.
Note that git commit --amend doesn’t alter the most recent commit in place,
but replaces it with an entirely new one.
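You can convince yourself that the amended commit is a new one by comparing the hashes before and after. This is a sketch in a throwaway repo; the script contents follow the Hello world/Goodbye world example but are otherwise my own guess.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "# git_demo" > README.md
git add README.md
git commit -q -m "Initial commit"

echo 'print("Hello world!")' > script1.py
git add script1.py
git commit -q -m "Add script1.py"
before=$(git rev-parse HEAD)

echo 'print("Goodbye world!")' > script1.py
git add script1.py                  # the fix has to be staged first
git commit -q --amend --no-edit     # fold it into the latest commit, keep the message
after=$(git rev-parse HEAD)

echo "before: $before"
echo "after:  $after"               # a different hash: the commit was replaced
git --no-pager log --oneline        # still two commits in total
```

Because the old commit is replaced, it is best to amend only commits that have not been pushed yet.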
git revert <commit>
git revert <commit> undoes a specified commit. It does this not by going back to a previous commit, but by creating a new commit that applies the inverse of the changes introduced by the target commit, and appending it as the new HEAD (latest commit). When the target is the latest commit, the resulting state is the same as the one right before it.
For instance, after we committed the change that adds
README.md, the latest state is the one labeled HEAD in
Figure 11. To revert the previous commit, we can use
git revert HEAD
Note that it will get back to the state right before the target
state, i.e. the one labeled HEAD~1 in Figure 11. This tilde-number (
~n) expression is used to specify a commit relative to HEAD, in
this case, one step backwards from HEAD. Otherwise you will have to use
git log to find the hash label to specify the target commit.
As illustrated in Figure 11, this goes back to HEAD~1, and makes a copy of it and appends the copy to the end, so it becomes the new HEAD.
After running git revert, it will open an editor window, like
the one below, to enter the commit messages for this new commit. All the
contents shown in Figure 12 are filled automatically by
git and I
didn’t change anything.
Saving and quitting the editor window finishes the commit.
If you want to revert several steps back, you will have to use
git revert --no-commit <commit>..HEAD
git ci
This will revert everything from the HEAD back to the commit specified by <commit>, again by recreating that commit state and appending it to the end to make it the new HEAD.
The --no-commit flag tells
git not to prompt for a message for
each commit along the way, but revert all the way back to <commit> in
one go.
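Here is a sketch of the several-steps-back case in a toy repo with three commits. The script2.py file and all contents are hypothetical; HEAD~2 plays the role of <commit>, and -m gives the message directly instead of going through the editor.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "# git_demo" > README.md
git add README.md
git commit -q -m "Initial commit"
echo 'print("Hello world!")' > script1.py
git add script1.py
git commit -q -m "Add script1.py"
echo 'print("Hello again!")' > script2.py
git add script2.py
git commit -q -m "Add script2.py"

# Undo the last two commits in one go, without one editor prompt per commit.
git revert --no-commit HEAD~2..HEAD
git commit -q -m "Revert to the state of the initial commit"

git --no-pager log --oneline   # four commits: nothing was erased, one was appended
ls                             # only README.md is left in the working directory
```

For a single commit, git revert HEAD does the same job and opens the editor for the message as described above.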
Step 5 Push to remote
Suppose we have done a whole day’s work at the office but need to do some extra bit during off-work time. We can push the local repository to the remote. This is like "publishing" the local copy onto the remote "hub" such that other people, or you yourself from a different machine, can get access to the updated files and continue working from there.
First, do a
git st to check out the current status:
It shows that
Your branch is ahead of 'origin/master' by 2 commits.
  (use "git push" to publish your local commits)
Then use git push to push the local repo to remote. Below is a
screenshot showing the entire process:
Step 1b: Pull from remote
Suppose we have got back home and would like to continue working on the
project. If this is the first time working with
git on the home
computer, we will have to repeat the installation and setup procedures
as described in Step 0, and get a local copy of the repo by using the
git clone command. In all subsequent sessions we only need to
pull the latest updates from the remote repo to the local machine:
git pull
This downloads the changes from the remote repo instance to the local one. Therefore, the combination of git push and git pull achieves synchronization between multiple repos. Compared with the flash-drive-copy synchronization strategy, this is both more convenient and less error-prone. There is no danger of mistakenly overwriting something with an old copy, because it always pulls the latest.
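The push/pull round trip can be tried out without any hosting service. In the sketch below a bare repo on the local disk stands in for the Github remote, and the office and home folders stand in for the two machines; all of these names are assumptions for the demo.

```shell
set -eu
demo=$(mktemp -d)
cd "$demo"

# The "office" copy, with one commit in it.
git init -q office
cd office
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "# git_demo" > README.md
git add README.md
git commit -q -m "Initial commit"

# A bare repo on disk stands in for the hosted remote.
cd "$demo"
git clone -q --bare office remote.git
git -C office remote add origin "$demo/remote.git"

# First time at "home": clone from the remote.
git clone -q remote.git home

# More work at the office, then push.
cd "$demo/office"
echo "Work done at the office." >> README.md
git commit -q -am "Add office notes"
git push -q origin HEAD

# Later at home: pull the latest commits.
cd "$demo/home"
git pull -q
git --no-pager log --oneline
```

After the pull, both copies point at the same latest commit.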
This section introduces a typical
git workflow and some of the basic
concepts. To summarize the entire process:
If this is the 1st time you use
git to set up a project,
you will be using these steps:
- Step 0: Install and config git.
- Step 1a: Create a repo.
- Step 2: Make changes in the working directory.
- Step 3: Make commits.
- Step 4: [optional] Undo local changes.
- Step 5: Push to remote.
- Step 1b: Pull from remote.
Then, for subsequent sessions, you will be repeating the sequence of
Step 1b -> 2 -> 3 -> 4 -> 5 -> 1b.
git has many other commands, some of which are for collaborations
among a team. Since I mostly use
git as a single user those are not
covered in this post. Hopefully these are enough to get you started.
Work with branches
Previously I mentioned that
git can help manage different
development pathways, or different branches.
By default, the main branch is called master (yes I’m sticking to the "racist" terminology, thank you). When creating a brand new repo, that is also the only branch.
People typically create a new branch for new development work. For instance, correcting a bug, adding a new feature, or some experimental work that shouldn’t be integrated into master before it is fully functional. If you are doing some writing, whether it is academic writing, fiction or playwriting, you can use a new branch to experiment with a different structure of the article, a different plot of the story or a different ending of a character etc. Different branches are isolated from each other, like parallel universes, so that there is no danger of losing the original work if the experimentation doesn’t work out well. Once the work is fully developed, one can choose to merge the development branch into the master branch, and optionally delete the development branch.
Here are some commonly used branching commands in git:
git branch: list all local branches.
git branch -v: list all local branches and their respective latest commits.
git branch -a: list all local and remote branches.
git branch <branch_name>: create a new branch named <branch_name>.
git checkout <branch_name>: switch to branch named <branch_name>.
git checkout -b <branch_name>: create a new branch and switch to it.
git branch -d <branch_name>: delete the branch named <branch_name> from local repo.
git merge <branch_name>: merge the branch <branch_name> into the current branch.
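Put together, these commands give a branch workflow like the sketch below. The story.txt file and the new_ending branch name are made up; the name of the main branch is read from the repo rather than assumed, since it may be master or main depending on your setup.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "Once upon a time." > story.txt
git add story.txt
git commit -q -m "Start the story"

main_branch=$(git symbolic-ref --short HEAD)   # master or main, depending on your setup

git checkout -q -b new_ending                  # create a new branch and switch to it
echo "An alternative ending." >> story.txt
git commit -q -am "Try an alternative ending"

git checkout -q "$main_branch"                 # back on the main branch, story.txt is unchanged here
git merge -q new_ending                        # happy with the experiment: merge it in
git branch -d new_ending                       # the branch is no longer needed
git --no-pager branch -v
```

Had the experiment not worked out, skipping the merge would have left the main branch exactly as it was.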
There are some additional points that are worth noting:
No shame in using the copying-renaming method, and a mental change
To be fair, I still use the file copying-and-renaming method even in
git repos. It is the easiest and fastest way to create some
temporary versioning. For instance, when I need to compare multiple
different solutions to a problem, having multiple versions of the same
Python script allows me to execute each in a different Python session and
compare their results. There is no reason not to use a
method if it suits the problem best.
Aside from all the technical differences, I feel that there is also a
very important psychological factor involved in a VCS workflow like
git. Sometimes I would rush out a few computation scripts in a
single morning. It does feel like a good amount of work done, but it
also leaves quite a mental burden on me: I can’t help worrying
about the possible errors hidden inside those scripts. Are they
isolated mistakes? Or do they have some far-reaching impacts that could
possibly affect my subsequent analyses? Whenever there are some
significant changes happening in the repo during a short period of time, such
kind of anxiety would start to hurt. Then I won’t feel relieved until
I review every changed file and write up the commit messages. The
relief is almost instantaneous: once the commits are done, I know for
sure that they are "engraved" into the
git history, labeled,
ordered, tracked and isolated. I know where to find them and I know
they are only there, not anywhere else. I feel increasingly accustomed
to such kind of "milestone" and "registration" mental effects in my everyday usage of
git, and I feel that it does help me get more organized about
git works best with plain text files, not large binary files
When you manage file versions using the copy-renaming scheme, you
end up with multiple copies of the same file, taking up unnecessary
disk space. In
git, versions are stored much more compactly: unchanged files are stored only once, and similar versions of a file are compressed against each other, so that in effect only the changes between commits take up new space. This is more efficient not only for storage
but also for file transfers.
However, this compact storage works best for plain text
files, such as txt and md files. Source code in all
programming languages, including LaTeX source code, is plain text.
Note that PDF, DOC, ODF and image files are binary files,
not plain text.
git can generate binary diffs, but the result won’t
be human readable (see ref1, ref2).
It is also not encouraged to put large data files into a git repo:
things like videos, NetCDF or dat files are not suitable to be tracked by
git. If they are quite small in size, it is OK to save some
of them, but not in large numbers. Anything above
~ 50 MB
should probably be saved elsewhere.
One can create a
.gitignore file (note the leading dot) in the
repo to tell
git not to track some files/folders. This can be used
to prevent some temporary files from being tracked, for instance the
pyc files created by the Python interpreter, the
.swp files created by
vim, or the
build folder created by some compilers. For more information about
.gitignore, see ref1, ref2.
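A minimal .gitignore covering the three examples just mentioned could look like the one in the sketch below. The file names used to try it out (script1.pyc, build/output.o and so on) are made up.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q

# Patterns for the three kinds of temporary files mentioned above.
cat > .gitignore <<'EOF'
# compiled Python files
*.pyc
# vim swap files
*.swp
# build output folder
build/
EOF

mkdir build
touch script1.py script1.pyc .script1.py.swp build/output.o

git status --short                # lists only .gitignore and script1.py
git check-ignore -v script1.pyc   # shows which pattern caused the file to be ignored
```

The .gitignore file itself is usually committed, so that every copy of the repo ignores the same things.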
git restore and
git switch commands
We talked about the newly added
git restore command. It was meant to replace some functionalities of git checkout.
git switch is another newly introduced command
meant to be used to switch branches. For more information, see ref1,
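As a short sketch, the git switch counterparts of the git checkout branch commands look like this. The experiment branch name is made up, and git 2.23 or newer is needed.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "# git_demo" > README.md
git add README.md
git commit -q -m "Initial commit"

git switch -q -c experiment   # same effect as git checkout -b experiment
git switch -q -               # jump back to the previously checked out branch
git --no-pager branch         # the asterisk marks the branch you are on
```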
Write good commit messages
It is regarded as good practice to stick to a good format and write informative commit messages, even if you are the only person working on the project (because in some sense, your future self is not too different from a whole different person). Here are some guidelines regarding writing good commit messages: ref1, ref2.
With good commit messages, your
git log output can be used as a
log/journal of your research/project development. It can help you
trace back errors, or provide some lessons to help you progress
further in your career.
Manage multiple repos
It is possible that one has to manage multiple local repos on a single machine, each one for a different project. I have about 10 repos that I work with regularly, and about another 10 that I rarely touch anymore. As the number grows, it becomes more difficult to manage all those repos. In a future post I’ll share a script that I created that "scans" through a collection of repos and generates a report for me, telling me which one is lagging behind the remote, which has uncommitted changes etc. So stay tuned.