7 Version control with Git and GitHub
This reading is meant as a primer for the workshop. It introduces version control, Git, and GitHub, which are the central concepts of the workshop.
7.1 What is version control?
Within academic and research contexts, most people’s work involves interacting with or changing files in one way or another. These files can be anything from text documents to images to code. When we change files, it can be useful to track what was changed and when it was changed. This tracking of changes to files is known as version control.
Version control can be very useful for many reasons, especially when you are collaborating with others on a project, since it allows everyone to track who changed what in the files you collaborate on.
But version control is also useful when you are working mostly alone on a project, as we humans tend to forget things. For instance, you may wonder why you made a certain change or want to revert to an earlier version of your file. Version control can help you with this by enabling you to see what was changed in your file and when (called the “history” of your file) and to view earlier versions of your files.
You may have used a file’s internal “track changes” features, like in Word, or maybe you have informally “tracked changes” on the file level by saving multiple versions of a file with different names, like in the example image below.
[Figure: A screenshot of a file explorer showing multiple versions of different files with names like “draft.txt”, “draft1.txt”, and “draft_final.txt”.]
Does the above image look familiar? While it may exaggerate what some people’s “versioning” looks like, it is the most common approach people use to “version control” their files.
This “informal” way to do version control isn’t ideal because it involves multiple copies of the same file; it makes it difficult to keep track of specific changes and find the right version of the files. Having multiple versions of the same file with slight modifications under different names, as in the image above, really highlights that it is hard to manually track file changes and that there is a need for more formal version control.
Luckily for us, there are “formal” version control systems that automatically track changes to files. So, let’s take a look at one such system: Git.
7.2 What is Git?
One of the world’s most popular version control systems is called Git. Git is used by millions of people around the world, including thousands of organisations. It is also used increasingly by researchers.
With Git you can create snapshots of file changes, known as commits. Each commit captures:
- What specific changes were made to the file or files.
- Who made the changes to the files.
- When they made the changes to the files.
Each commit also has a short message attached to it that can describe why the changes were made.
Git stores these commits in a history log. The history log allows you to quickly go back and explore each change made to the files, along with the individual commit messages. This is extremely useful when you revisit your own work after a long time and when you work in groups or with collaborators.
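To make the idea of commits and a history log more concrete, here is a minimal Python sketch. It is not how Git stores commits internally; the `Commit` class, the `commits_touching` function, and the example names, dates, and messages are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Commit:
    """A simplified stand-in for a Git commit (illustrative only)."""
    author: str              # who made the changes
    timestamp: datetime      # when the changes were made
    message: str             # why the changes were made
    changes: dict[str, str]  # what changed: file name -> short description


# The history log is an ordered list of commits, oldest first.
# All entries below are invented example data.
history = [
    Commit("Alice", datetime(2024, 3, 1, 9, 30), "Add first draft of methods",
           {"methods.txt": "added 12 lines"}),
    Commit("Bob", datetime(2024, 3, 2, 14, 0), "Fix typo in sample size",
           {"methods.txt": "changed 1 line"}),
]


def commits_touching(log: list[Commit], filename: str) -> list[Commit]:
    """Return every commit in the log that changed the given file."""
    return [commit for commit in log if filename in commit.changes]


for commit in commits_touching(history, "methods.txt"):
    print(f"{commit.timestamp:%Y-%m-%d} {commit.author}: {commit.message}")
```

Reading the log from top to bottom answers the who, what, when, and why questions for each change, which is what a real history log lets you do.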
Git only tracks changes to files within a specific folder (and its sub-folders). In Git terminology, this folder is called a repository (or a repo for short). The best way to use a repository is to store all files related to a specific project, like a research project or administration files for your lab or group, in the repository (the “folder”). This way, you can track all changes made to all files in the project. It keeps things more organised and self-contained, since everything related to a project is in one place.
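As a rough illustration of what it means to track everything inside one folder, the Python sketch below takes a “snapshot” of a folder and its sub-folders and then reports which files differ between two snapshots. Git’s real mechanism is more sophisticated; the functions `snapshot` and `changed_files` and the example file names are assumptions made for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot(folder: Path) -> dict[str, str]:
    """Map each file under `folder` (including sub-folders) to a hash of its content."""
    return {
        path.relative_to(folder).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """List files that were added, removed, or modified between two snapshots."""
    names = set(before) | set(after)
    return sorted(name for name in names if before.get(name) != after.get(name))


# Build a small throwaway "project" folder to demonstrate the idea.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "data").mkdir()
    (project / "notes.txt").write_text("first draft\n")
    (project / "data" / "results.csv").write_text("id,value\n1,3.2\n")

    before = snapshot(project)
    (project / "notes.txt").write_text("second draft\n")  # edit one file
    after = snapshot(project)

    print(changed_files(before, after))  # ['notes.txt']
```

Because the snapshot covers the whole folder, a change to any file in the project is noticed, including files in sub-folders such as `data/results.csv`.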
Any type of file can be stored in a repository, including text files and non-text files like Word documents or images. However, Git can only show the specific changes made to a file if it is text-based, like .txt, .csv, or code files. Since text-based files consist only of text characters, it is easier for the computer to show the exact changes to the exact lines of text. Files like images or Word documents are not just text, so they have no “lines” on which to track changes.
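The following Python sketch shows what “exact changes to the exact lines” can look like for a text file. It uses Python’s standard `difflib` module rather than Git itself, and the file contents and the helper `changed_lines` are invented for illustration.

```python
import difflib

# Two versions of the same small text file, one list item per line.
old_version = ["Participants: 40", "Method: survey", "Status: draft"]
new_version = ["Participants: 42", "Method: survey", "Status: final"]


def changed_lines(old: list[str], new: list[str]) -> list[str]:
    """Return only the removed ('-') and added ('+') lines between two versions."""
    diff = difflib.unified_diff(old, new, fromfile="before", tofile="after", lineterm="")
    return [
        line for line in diff
        # Keep changed lines, but skip the '---' and '+++' header lines.
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


for line in changed_lines(old_version, new_version):
    print(line)
# -Participants: 40
# +Participants: 42
# -Status: draft
# +Status: final
```

The unchanged line (“Method: survey”) is left out, so you can see at a glance which lines were edited. This kind of comparison only works because the file is made of lines of text.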
To understand how powerful formal version control like Git is, consider these questions:
- How many different versions of a scientific document or thesis do you have lying around after getting feedback from your supervisor or co-authors?
- Have you ever wanted to test out something in a file and ended up creating a new one to avoid modifying the original?
- Have you ever deleted something and wished you hadn’t?
All these problems can be fixed by using formal version control! Besides this, there are many good reasons to use version control, especially in science. Version control helps with:
- Keeping your files and folders more organised, since you only need one version of each file.
- Easier collaboration, because you can collaborate on a single file/folder in a single central location.
- Complete transparency of who did what and when, which can protect against accusations of fraud or misconduct.
- Claim to first discovery, since you have a time-stamped history of your work.
- Easing the process of sharing your work with others, since you can share the repository with them.
7.3 What is GitHub then?
There are several ways to use Git. In this workshop, we will use GitHub, which is a website that hosts Git repositories and builds on Git’s core features. What this means is that your Git repositories can be stored on GitHub, and you can manage your files and projects using Git through GitHub’s web interface.
Everything we do in this workshop (including storing and managing files and folders) will happen through the GitHub website. Behind the scenes, GitHub will use Git to track the changes we make.
In the simplest terms, Git is software, while GitHub is a company and website that makes it easier to use Git and share Git repositories. For beginners, GitHub’s web interface has some advantages: you commit changes immediately after editing a file, and it’s easier to view changes and file history compared to using Git alone on your computer.
While we will only be interacting with Git via GitHub during this workshop, when you feel more comfortable with the concepts, you can eventually start using Git on your computer (instead of via the GitHub website). Using Git on your computer has the benefit of being faster (you work locally, so you don’t need to wait for the internet) and more flexible (you can do more things with Git on your computer than on GitHub). Then you can use GitHub as a place to keep backups of your repository, collaborate with others, track tasks, and make use of the other features GitHub has. How you would use Git locally with GitHub looks something like the figure below.
Using GitHub on its own is a great way to get started with Git; it allows you to learn the concepts of version control and Git without needing to install anything on your computer and without needing to learn some of the more technical details of Git. Since GitHub is a website, it also makes it easier to share your work with others and to collaborate with others. This is one of the main reasons why GitHub is so popular.
You may notice that GitHub sounds a bit like file syncing tools such as OneDrive or Dropbox. So how is GitHub different? Unlike OneDrive or Dropbox, GitHub (via Git) tracks line-level changes to files, not just file-level changes, if you work with text-based files. This means you can see the specific changes made in a file, not just that it was changed. The messages you attach to commits also help you keep track of why the changes were made.
OneDrive and Dropbox use a simple way of handling conflicts (i.e., different changes to the same file) when syncing between the cloud and your computer: they either create a new file with some details appended to its name or overwrite the older version with whichever is newer. Git and GitHub, on the other hand, handle conflicts in a more involved way by showing you the changes and allowing you to resolve them as you want to. This means that with Git and GitHub, you have complete control over how conflicts are resolved.
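To illustrate the difference, here is a deliberately simplified Python sketch of line-by-line conflict detection. It is not Git’s actual merge algorithm: it assumes all versions have the same number of lines, and the function `merge_lines` and the example content are hypothetical.

```python
def merge_lines(base: list[str], mine: list[str], theirs: list[str]):
    """Merge two edited copies of a file, line by line.

    Simplifying assumption: all three versions have the same number of lines.
    Returns (merged_lines, conflicts), where `conflicts` lists the line
    numbers (starting at 1) that both people changed in different ways.
    """
    if not (len(base) == len(mine) == len(theirs)):
        raise ValueError("this sketch only handles versions of equal length")

    merged, conflicts = [], []
    for number, (b, m, t) in enumerate(zip(base, mine, theirs), start=1):
        if m == t:        # both copies agree (or nobody changed the line)
            merged.append(m)
        elif m == b:      # only the other person changed this line
            merged.append(t)
        elif t == b:      # only I changed this line
            merged.append(m)
        else:             # both changed it differently: a person must decide
            merged.append(b)
            conflicts.append(number)
    return merged, conflicts


base = ["Title: Study", "n = 40", "Status: draft"]
mine = ["Title: Study", "n = 42", "Status: draft"]
theirs = ["Title: Pilot study", "n = 45", "Status: draft"]

merged, conflicts = merge_lines(base, mine, theirs)
print(merged)     # ['Title: Pilot study', 'n = 40', 'Status: draft']
print(conflicts)  # [2]
```

Changes that do not overlap (the title) are combined automatically, while the line both people edited (line 2) is flagged so that a person can decide which version to keep, instead of one copy silently overwriting the other.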
File syncing tools are really good for easily sharing files within a team or group, but they aren’t as good for working together on files. That’s where GitHub shines. It’s built for working on files together, not just sharing them.
7.4 Why learn and use GitHub specifically?
When you’re learning something new, it helps to put it into context. While we’ve already covered some of this in the syllabus, we haven’t fully addressed what is probably one of your biggest questions: Why GitHub?
GitHub is an extremely popular and widely used online platform for managing Git repositories. As you read previously, Git repositories come with a lot of features to work with files and track changes to them. GitHub does much more than that. It can be used for:
- project management [1]
- running automatic workflows
- hosting websites
- collaborating with others, for example by providing a framework for reviewing each other’s work
[1] The links in this and other sections are included so you know where to go for more information, but are not required reading for this workshop.
GitHub also allows you to create tasks or notes (known as GitHub Issues) that are connected to your repository. We’ll create our own GitHub Issues later in this workshop.
While there are other platforms that do similar things, such as GitLab and Bitbucket, GitHub is more widely used than the others and has a much larger community. So, you will be able to find many different resources, tools, and projects on GitHub that you can use and learn from.
More and more researchers use GitHub to conduct their research, collaborate, and share their work. It is a powerful platform that not only facilitates collaboration but also makes your research more discoverable, since GitHub is extensively indexed by search engines and has its own search. Overall, it’s an excellent way to make your research more open and accessible.
In order to remain relevant and connected to the current and future research community, we need to embrace and use tools that improve how we do our work and how we disseminate our research. And GitHub is one of these tools.
In that context, this workshop is designed as a gentle introduction to using GitHub to manage the files related to your research and work. Before diving into some of the more advanced uses, it’s important to learn about and try out the basics of how Git and GitHub work and how to use them effectively. In this workshop, our focus is to cover the most fundamental and important features of GitHub that will help you get started with using it for your own work.
7.5 Summary of Git and GitHub
- Using a formal version control system like Git can help you keep track of changes to your files and projects.
- A Git repository is a place where you store all the files for your project along with their history.
- GitHub is a website that hosts Git repositories, allowing you to store and share your files and projects online.
- Through GitHub you can manage your files and projects using Git.
So far, we have encountered the following terminology:
| Term | Definition |
|---|---|
| Version control | The practice of tracking changes to files over time. |
| Git | A widely used version control system that tracks changes to files and projects. |
| (Git) Repository | A “project” with files that are stored and tracked by Git. |
| Commit | A snapshot of changes made to file(s) in a repository. |
| GitHub | A company and website that hosts Git repositories. |