7 Version control with Git and GitHub
This reading is meant as a primer to the workshop. It introduces version control, Git, and GitHub, which are central to the workshop and to working with files on GitHub in general.
7.1 What is version control and Git?
In our work lives, we regularly work with files, such as creating or editing them. These files can be anything from text documents, to images, to code. When we work with files, we often make changes to them, and sometimes many changes. We might want to keep track of how our files change over time or “save” specific versions of the files. This tracking of file changes over time is known as version control.
Version control can be very useful for many reasons. For example, you might want to keep track of changes to a file so you can revert to a previous version if you make a mistake. Version control is especially useful when you are collaborating with others on a project, as everyone in the group might want to keep track of changes made or feedback given by different people.
But version control is also useful when you are working mostly alone on a project, since we humans tend to forget things. For instance, you might wonder why you made a certain change or what the file looked like at a certain point in time. With version control, you can go back to that version and check.
If a program can internally “track changes”, like Word does, you may have used that feature before, perhaps when getting feedback from others. At the file level (that is, without opening the file), you may have “tracked changes” informally by saving multiple versions of a file with different names, like in the example image below.
Does this way of saving files and keeping track of versions look familiar? The above image may exaggerate what some people’s versioning looks like, but there is some truth to it: It is the most common approach to “version control”.
This “informal” form of version control isn’t ideal because it involves multiple copies of the same file. It makes it difficult to keep track of specific changes and to find the right version of a file. Having multiple versions of the same file under different names, as in the image, shows how hard it is to track file changes manually and why formal version control is needed.
Luckily for us, there exist “formal” version control systems that automatically track changes to files. One of the world’s most popular version control systems is called Git. Git is used by millions of people around the world, including thousands of organisations. It is also used increasingly by researchers.
With Git you can create snapshots of file changes, known as commits. Each commit captures:
- What specific changes were made to the file or files.
- Who made the changes to the files.
- When they made the changes to the files.
Each commit also has a short message attached to it that can describe why the changes were made.
Git stores these commits in a history log. The history log allows you to quickly go back and explore the changes made to files, along with a message describing the changes. This is extremely useful when you revisit your own work after a long time and when you work in groups or with collaborators.
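To make the idea of commits and a history log more concrete, here is a minimal Python sketch. It is only a toy model, not how Git actually stores data: the `Commit` class, the `log` function, the author names, and the file contents are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Commit:
    """Toy stand-in for a commit (NOT Git's real storage format)."""
    changes: dict        # what: file name -> new content (hypothetical layout)
    author: str          # who made the changes
    timestamp: datetime  # when the changes were made
    message: str         # why the changes were made


def log(history):
    """Return one summary line per commit, newest first."""
    return [
        f"{c.timestamp:%Y-%m-%d} {c.author}: {c.message}"
        for c in reversed(history)
    ]


# A hypothetical history with two commits, oldest first.
history = [
    Commit({"notes.txt": "First draft\n"}, "Ada",
           datetime(2024, 1, 10, tzinfo=timezone.utc),
           "Add first draft of notes"),
    Commit({"notes.txt": "First draft\nFeedback added\n"}, "Ben",
           datetime(2024, 1, 12, tzinfo=timezone.utc),
           "Add supervisor feedback"),
]

for line in log(history):
    print(line)
```

Reading the printed lines from top to bottom is similar in spirit to browsing a file's history: each entry tells you who changed something, when, and why.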
Git only tracks changes to files within a specific folder (and its sub-folders). In Git terminology, this folder is called a repository (or a repo for short). The best way to use a repository is to store all files related to a specific project, like a research project or administration files for your lab or group, in the repository (the “folder”). This way, you can track all changes made to all files in the project. It keeps things more organised and self-contained, since everything related to a project is in one place.
Any type of file can be stored in a repository, including both text-based files and non-text-based files like Word documents or images. However, Git can only show the specific changes made to a file if it is text-based, like a .txt or .csv file, or code. Since these text-based files are literally only text characters, it is easy for the computer to show the exact changes to the exact lines of text. Files like images or Word documents (which actually aren’t just text) have no “lines” on which to track changes.
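The following sketch illustrates why text-based files are easy to compare line by line. It uses Python’s standard `difflib` module rather than Git itself, so the output only resembles what Git or GitHub would show. The file names and contents are invented, and `changed_lines` is a hypothetical helper.

```python
import difflib


def changed_lines(old, new):
    """Return only the removed ('-') and added ('+') lines between two versions."""
    diff = difflib.unified_diff(old, new, fromfile="old/notes.txt",
                                tofile="new/notes.txt")
    return [
        line for line in diff
        # Skip the '---' / '+++' file header lines, keep real changes.
        if line[0] in "+-" and not line.startswith(("+++", "---"))
    ]


# Two versions of a small, made-up text file, one list item per line.
old = ["title: My thesis\n", "status: draft\n", "pages: 40\n"]
new = ["title: My thesis\n", "status: submitted\n", "pages: 40\n"]

print("".join(changed_lines(old, new)))
```

Only the `status` line is reported as changed; the other lines are left out because they are identical in both versions. An image or a Word document has no such lines of plain text to compare.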
To understand how powerful formal version control like Git is, consider these questions:
- How many files of different versions of a scientific document or thesis do you have lying around after getting feedback from your supervisor or co-authors?
- Have you ever wanted to test an analysis in a file but ended up creating a new one to avoid modifying the original?
- Have you ever deleted something and wished you hadn’t?
All these problems can be fixed by using formal version control! There are many good reasons to use version control, especially in science:
- More organised files and folders, since you only need one version of each file.
- Easier collaboration, because you can work on a single file/folder in a single central location.
- Transparency of work done for others to see, which can protect against accusations of fraud or misconduct.
- Claim to first discovery, since you have a time-stamped history of your work.
- Easier to share your work with others, since you can share the repository with them.
7.2 What is GitHub then?
There are several ways to use Git. In this workshop, we will use GitHub, which is a website that hosts Git repositories and builds on Git’s core features. This means that your Git repositories can be stored on GitHub, and you can manage your files and projects using Git through GitHub’s web interface.
Everything we do in this workshop (including storing and managing files and folders) will happen through the GitHub website. Behind the scenes, GitHub will use Git to track the changes we make.
In the simplest terms, Git is software, while GitHub is a company and website that makes it easier to use Git and share Git repositories. For beginners, GitHub’s web interface has some advantages: you commit changes immediately after editing a file, and it’s easier to view changes and file history compared to using Git alone on your computer.
While we will only be interacting with Git on GitHub during this workshop, when you feel more comfortable with the concepts, you can eventually start using Git on your computer. Using Git on your computer has the benefit of being faster (you work locally, so you don’t need to wait for the internet) and more flexible (you can do more things with Git on your computer than on GitHub). Then you can use GitHub as a place to keep backups of your repository, to track tasks, and to make use of the other features GitHub has. How you would use Git locally with GitHub would look something like the figure below.
Using GitHub on its own is a great way to get started with Git. It allows you to learn the concepts of version control and Git without needing to install anything on your computer and without needing to learn some of the more technical details of Git. Since GitHub is a website, it also makes it easier to share your work and to collaborate with others. This is one of the main reasons why GitHub is so popular.
You may notice that GitHub sounds a bit like file syncing tools such as OneDrive or Dropbox. So how is GitHub different? Unlike OneDrive or Dropbox, GitHub (via Git) tracks line-level changes to files, not just file-level changes. This means you can see the specific changes made in a file, not just that it was changed. The messages you attach to commits can also help you keep track of why the changes were made.
OneDrive and Dropbox also handle conflicts in a simple way when syncing between the cloud and your computer: they either create a new file with some details appended to its name or overwrite the older version with whichever is newer. Git and GitHub, on the other hand, handle conflicts in a more involved way by showing you the conflicting changes and allowing you to resolve them as you want to.
File syncing tools are really good for easily sharing files within a team or group, but they aren’t as good for working together on files. That’s where GitHub shines. It’s built for working on files together, not just sharing them.
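The difference in conflict handling can be sketched with a toy example. The hypothetical `merge_lines` function below compares two edited versions against a common starting version, line by line, and only flags a conflict when both people changed the same line differently. This is a heavy simplification that assumes all versions have the same number of lines. Git’s real merging is more sophisticated, though it marks conflicts with similar `<<<<<<<`, `=======`, and `>>>>>>>` markers.

```python
def merge_lines(base, mine, theirs):
    """Toy line-by-line three-way merge for versions of equal length.

    Returns (merged, conflicts), where conflicts holds the indexes of
    lines that both sides changed in different ways.
    """
    if not (len(base) == len(mine) == len(theirs)):
        raise ValueError("this toy merge only handles equal-length versions")

    merged, conflicts = [], []
    for i, (b, m, t) in enumerate(zip(base, mine, theirs)):
        if m == t:      # both sides agree (unchanged or changed identically)
            merged.append(m)
        elif m == b:    # only "theirs" changed this line
            merged.append(t)
        elif t == b:    # only "mine" changed this line
            merged.append(m)
        else:           # both changed it differently: a person must decide
            merged.append(f"<<<<<<< mine\n{m}\n=======\n{t}\n>>>>>>> theirs")
            conflicts.append(i)
    return merged, conflicts


# Made-up example: a shared starting point and two people's edits.
base = ["Title", "Method A", "Result 1"]
mine = ["Title", "Method B", "Result 1"]
theirs = ["Title", "Method C", "Result 2"]

merged, conflicts = merge_lines(base, mine, theirs)
print("\n".join(merged))
print("Conflicting line indexes:", conflicts)
```

In this example the change to the last line is merged automatically because only one person touched it, while the second line is flagged for a human to resolve instead of one edit silently overwriting the other.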
7.3 Summary of Git and GitHub
- Using a formal version control system like Git can help you keep track of changes to your files and projects.
- A Git repository is a place where you store all the files for your project along with their history.
- GitHub is a website that hosts Git repositories, allowing you to store and share your files and projects online.
- Through GitHub you can manage your files and projects using Git.
So far, we have encountered the following terminology:
Term | Definition |
---|---|
Version control | The practice of tracking changes to files over time. |
Git | A widely used version control system that tracks changes to files and projects. |
(Git) Repository | A “project” with files that are stored and tracked by Git. |
Commit | A snapshot of changes made to file(s) in a repository. |
GitHub | A website that hosts Git repositories. |