Assignment 1

Goals: set yourself up on Github; start learning R; practice the statistical reasoning we did in class.

Preparing: Git and Github

From this point on, you’re going to use a “version control” system to do all your work. Version control is a bit like an “undo” feature, but far more sophisticated. The main thing a version control system gives you is a detailed record of everything you’ve ever done. If you have ever had the thought “I’m sure everything was working / was just the way I wanted it last Tuesday…”, this is the solution. (If you have not, you almost certainly will over the course of this class.) It also gives you the possibility to add multiple collaborators to a project. The advantage over shared folders (for example, Dropbox) for this purpose is that, since each change is logged, you know exactly who changed what when, and can go back to any previous state at any time.

Create a Github account

Github is a site that provides free hosting for code projects managed through Git (we will find out what this means in a minute). You store the files for a single project in what’s called a “repository.” We’ll discuss this more below.

The only condition for storing your project on Github is that all of the files be publicly available for viewing (not for modification), unless you pay. Since, as a researcher funded by public money, all your work should be publicly available anyway (and more and more people will bother you about this; and under increasingly many conditions this is legally required; and some journals will require it; and people in chemistry look at us like we’re criminals when we tell them we don’t keep a detailed record of literally every single thing we do, open to scrutiny; etc etc), this should be no problem for you. Sometimes people have legitimate concerns that their data or their analysis will be used by some nefarious competitor before they can publish. In linguistics as it is today, this fear is almost never well-founded. In fast-moving fields where people publish every couple of weeks, it may be a concern. At any rate, in the future, if you want to wait until after you’ve published your paper to make your work available to the world, you can pay for private hosting on Github, you can use BitBucket (another service), or you can set up a private Git server in your lab or research group, that allows you to control access.

In this class, however, you’re required to put everything in a public Github repository (and, yes, this means that you can copy each other, and, no, I don’t mind if you get ideas from one another, but if you copy and paste code it will be obvious, and you will often have to explain your code in words, which will also be pretty obvious if you copy).

So: if you don’t have one already, create an account at http://github.com/, and tell me your username (e.g., by email). If you don’t do this, I won’t be able to mark your assignment.

Create a repository

When you log in to Github, you should see a green button called “New repository.” Create a new repository called “assignment-1” with the “Initialize this repository with a README” box checked. Also have it create a “.gitignore” file (i.e., check that box). Once you create the repository, you’ll be taken to the main page of the repository, which will always show the README file. The README file is there to explain what is in the repository. Notice, however, that you can’t edit the README file from here.

What did you just do?

Git is a system for storing code. Actually, you can use it to store anything you’re working on. (Almost. There are certain things you should never use it for, or bad things will happen immediately. More on this later.) It stores things, however, in a complicated way, which you will get used to and come to appreciate—after you understand it.

The first thing to understand if you have never programmed before is that you will be storing your code in files. These files will have names, and you will be able to find them someplace, stored away on some computer, in some folder. These things will become clear soon. The data you work with will also be stored in files. If you work on our class server, these will be stored in a folder you have access to on the class server. Let’s call this collection of files your workspace. What git does is to help you create three different copies of the files you’re working on (the index or staging area; the repository or local repository; and the remote repository)—for reasons which will become clear.

In fact, counting your workspace, that makes four copies. Let’s call them The Four Books of Git.

I know we don’t know what they are yet. But hold tight. You have just created the last of the four: the remote repository.

Creating your local repository

I’m going to assume you’re working on the class server at Paris V (there are instructions on how to set up Git, R, and RStudio on your own computer at the end of the assignment, but that is optional). Log in using your username and password. (Keep a tab open with your new Github repository, though.)

In the upper right hand corner of RStudio, you will see a button that says “Project: (None)”. Click it to bring down a menu, and create a new project using the option “Version Control: Checkout a project from a version control repository.” Select “Git.” Remember that in the last step, we created a repository on Github, stored on the Github servers. That was one of the Four Books of Git. (Which one?) You’re about to ask RStudio to create the other three.

The remote repository is called “remote” because it’s stored in a different place, remote from the rest. In this case, on Github’s servers. The other Three Books of Git are stored in the same place as each other. They’re called “local” because they’re where you’re working. You have direct access to them from where you are working. Since you’re working on the class server, they’re not stored on the computer right in front of you (you won’t find copies on your hard drive or on the computers in the classroom), but they are stored on the computer you’re working on (i.e., the class server, to which you are connected). But since the remote repository is remote, you need to give RStudio its address. To find this, go back into the tab you left open to your new repository on Github, and click the green “Clone or download” button. Click on “Use HTTPS”. Then, copy the address given (it should be of the form “https://github.com/ … (etc etc) …”).

Now go back into RStudio and paste this in the top field. This is the location of the remote repository. It’s also asking you for a directory name. RStudio wants a name for the directory which will contain the other Three Books of Git. Call it “assignment-1”.

(Directory equals folder. It is a widely known fact that you can easily use computers quite ably your whole life without thinking too much about directories/folders and filesystems. I am of the opinion that files and folders are really not a very effective way for human beings to interact with computers for most tasks, so it’s not surprising that most people manifestly do not think much about what folders things are in, and get surprised when they’re asked a question about “what folder.” If you feel surprised at the idea that everything on a computer is in a folder/directory, read over the Wikipedia page for Directories.)

RStudio has just created a directory called “assignment-1.” (Again, assuming you’re working on the class server, this directory is on the class server, and not on your own computer.) This directory acts as your workspace (The First Book of Git). Hidden in this directory, in places you can’t easily see, are the index (The Second Book of Git) and the local repository (The Third Book of Git).
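For the curious, here is a rough command-line sketch of what just happened. It is not the exact procedure RStudio follows. To run without a network connection, it builds a throwaway local repository (the made-up names "seed" and "remote.git") as a stand-in for Github; on the class server you would give "git clone" the HTTPS address you copied from Github instead.

```shell
# Work in a throwaway directory so nothing real is touched.
scratch=$(mktemp -d)
cd "$scratch"

# Stand-in for the remote repository (assumption: on Github this already exists).
git init -q seed
cd seed
git config user.name "John Doe"
git config user.email "johndoe@univ-paris-diderot.fr"
echo "# assignment-1" > README.md
git add README.md
git commit -q -m "Initial commit"
cd ..
git clone -q --bare seed remote.git

# Cloning creates the other three Books of Git in one step.
git clone -q remote.git assignment-1
cd assignment-1

ls -a               # the workspace: README.md, plus a hidden .git directory
ls .git/index       # the index lives inside .git
git log --oneline   # the local repository: one commit so far
git remote -v       # the address of the remote repository
```

The hidden ".git" directory is where the Second and Third Books are kept. You should never need to edit anything in it by hand.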

You can see what is in this directory in the file browser in the lower right hand panel of RStudio. Notice that it contains the README file you created when you created your repository, called “README.md”. Using the file browser in the lower right hand panel of RStudio, you can see the contents of one folder at a time. You can browse what’s in the parent folder, which is your “Home directory” on the class server. To look at what’s in it, click the button marked “Home” at the top of the file browser. You’ll see that it contains the folder you just created (the “assignment-1” folder). Click on that folder to set the file browser to show its contents again.

Updating the Four Books of Git

The point of the workspace is just to let you work freely and make changes to what you’re doing: changes which are saved somewhere (in your workspace), but which aren’t being tracked. If you open up the “README.md” file by clicking on it in the file browser, you’ll see what it contains: just a title, the name of the repository. Since the README file should actually have some useful information in it, underneath that, you can add in some information: type in something explaining what this repository contains. (Hint: this repository contains your work for Assignment 1, so you could put, “This repository contains my work for Assignment 1.”) You can then save that file in the normal way that you would save things in most other programs (by clicking File > Save, or by clicking the floppy disk icon, or by hitting Control-S on Windows or Linux or Command-S on Mac).

These changes are not being tracked. If you make another change (for example, in the future, you’re probably going to forget what “Assignment 1” refers to, so you might want to change it to say “for Assignment 1, Stats, Fall 2017 (Paris 7)” instead of just “for Assignment 1”), and then save again, you will never be able to go back to the old version. (Yes, there is an Undo button in RStudio, but once you close RStudio, all will be forgotten.) Nor can anyone else see these changes—and that includes you, if you don’t happen to be connected to the class server. They are saved, but they aren’t “in the system.”

To get things “in the system,” you need to go open up the “Git” panel in RStudio. This is a little hidden: it’s the third tab in the upper right hand panel, next to “Environment” and “History”. Before RStudio will let you do anything, you need to tell it who you are, so that Git can save this information when it tracks your changes. Give Git your name and email address. In the Git panel, click “More” (it’s got a picture of a blue gear; make sure you click the one in the “Git” panel, though, as there’s also one of these in the file browser); now hit “Shell…”. This is access to the class server’s command line. Type:

git config --global user.name "John Doe"

and hit return (replace “John Doe” with your name). Now type:

git config --global user.email "johndoe@univ-paris-diderot.fr"

and hit return (replace "johndoe@univ-paris-diderot.fr" with your email).

Close the Shell window. Now you’re ready. You should never have to do this again, unless you set RStudio/Git up on another computer.
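If you want to check that Git recorded your identity, you can ask it to read the settings back. The sketch below does this inside a throwaway repository and leaves out "--global", so that it changes nothing in your real settings. In the Shell on the class server, "git config --global --get user.name" reads back the value you typed above.

```shell
# Throwaway repository, so your real settings are left alone.
scratch=$(mktemp -d)
cd "$scratch"
git init -q demo
cd demo

# Without --global, the setting applies to this one repository only.
git config user.name "John Doe"
git config user.email "johndoe@univ-paris-diderot.fr"

# Read the values back.
git config --get user.name
git config --get user.email
```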

You’ll see that there are a couple of things showing in the main “Git” panel. There is the “assignment-1.Rproj” file, which has yellow question marks next to it. And there is the “README.md” file, which has a blue “M” next to it.

The yellow question marks are Git telling you, “This file is unknown to me.” It’s in your workspace, but it hasn’t been registered anywhere else. In fact, RStudio created this file when you asked it to create the project folder for Assignment 1. It’s there to store some of your RStudio settings specific to this project, so that you can go back to working on Assignment 1 with RStudio open exactly the way you left it last time. I don’t usually save this in my Git repository unless I have a good reason to, but it would never really hurt anything, so you can click the empty checkbox to the left. You’ll see the yellow question marks turn into a green “A”. The “A” stands for “Added”. This means that this file is now in the index, the Second Book of Git.
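The Git panel shows the same information as the command "git status". As a rough sketch, run in a throwaway repository (the empty ".Rproj" file here is only a stand-in for the one RStudio creates), the short form of the command uses the same markers: "??" for unknown files, "M" for modified ones, and "A" for added ones.

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init -q assignment-1
cd assignment-1
git config user.name "John Doe"
git config user.email "johndoe@univ-paris-diderot.fr"

# Starting point: a README that is already in the local repository.
echo "# assignment-1" > README.md
git add README.md
git commit -q -m "Initial commit"

# Change the README, and create a file Git has never seen
# (an empty stand-in for the RStudio project file).
echo "This repository contains my work for Assignment 1." >> README.md
touch assignment-1.Rproj

git status --short   # README.md is marked M, the .Rproj file is marked ??

# Checking the box in RStudio corresponds to "git add".
git add assignment-1.Rproj
git status --short   # the .Rproj file is now marked A
```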

The purpose of the index is to let you choose what gets copied into the local repository, the Third Book of Git. The Git panel in RStudio gives you a list of all the files that are different in your workspace with respect to the local repository. The goal is to pick a single thing that you did (like, when you finish the answer to one of the questions on the assignment, or some sub-part of an answer to one of the questions on the assignment): a change that you would like to track. Remember that Git doesn’t track the individual changes you make to your workspace. It’s up to you to track them in the Books of Git, one at a time, so that you can come back to them later. It’s a good habit to start doing this every time you feel you’ve made some progress (even if later you discover that you were actually wrong, and the thing didn’t work). The index is a temporary space where you can select the changes you’ve made to the workspace that correspond to one small unit of progress. The first bit of progress you made was setting up the project in RStudio, so, having added this change to the index, you’re now ready to copy it into the local repository.

To do this, click the Commit button in the Git panel. This will show the changes that have been added to the index (a.k.a., “staged”), give you one last chance to change your mind about what goes in there, and then it will ask you to write a short message, called the “commit message” (obligatory). Conventionally, you write these messages in the imperative, as if they hadn’t been done yet: “Add RStudio project file” would be a good commit message. The first line of the commit message should be a short summary. If you want to give more detailed information, leave a blank line, and then write a longer explanation. If your commit message isn’t short, your commit may not be a single small unit of progress. (Remember: computers are dumb. In this case, the computer is sufficiently dumb that it can only bring back old changes if you stop and label them, and put them in the local repository. Otherwise it’s going to lose track of them. To match the computer on its level, you need to stop and think about what you’ve accomplished.) Once you’ve written your commit message, hit “Commit”.
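The same step can be done from the Shell. In this sketch (throwaway repository, made-up wording for the longer explanation), the first "-m" gives the short summary and the second "-m" gives the optional longer explanation; Git puts a blank line between them for you.

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init -q assignment-1
cd assignment-1
git config user.name "John Doe"
git config user.email "johndoe@univ-paris-diderot.fr"

# One small unit of progress: the project file, added to the index.
touch assignment-1.Rproj
git add assignment-1.Rproj

# Summary in the imperative, then a longer explanation.
git commit -q -m "Add RStudio project file" \
    -m "Stores the RStudio settings that are specific to this project."

git log -1 --format=%s   # the summary line
git log -1 --format=%b   # the longer explanation
git status --short       # prints nothing: the index has been wiped clean
```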

What was once in the Second Book of Git is now in the Third Book of Git. The Second Book of Git has been wiped clean. You see a message that tells you how Git went about updating the local repository in some detail that is not important, but which should convince you that Git is very efficient in its way of bookkeeping changes.

You did a second thing, of course, which was to update the README file. Make a second commit registering this change into the local repository. Give it a description like: “Add description of repository to README”.

Once you’ve done this, your Git panel should be empty, because your workspace should match your local repository. At the top of the Git panel, it will say, “Your branch is ahead of ‘origin/master’ by 2 commits.” The reference to ‘origin/master’ is a reference to the remote repository. The remote repository is stored on Github. In order to sync them up, hit “Push” (with the up arrow, for “upload”). You’ll be asked for the username and password you gave when you created your Github account. The Four Books of Git are in harmony. You can check this by refreshing that tab where you were viewing the Github repository.
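Here is a sketch of the same sequence in the Shell. It assumes a throwaway local repository ("remote.git", a made-up name) standing in for Github, so no username or password is involved. The point is only to show the local repository getting two commits ahead, and the push bringing the two back in line.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Stand-in for Github: a remote repository that already has one commit.
git init -q seed
cd seed
git config user.name "John Doe"
git config user.email "johndoe@univ-paris-diderot.fr"
echo "# assignment-1" > README.md
git add README.md
git commit -q -m "Initial commit"
cd ..
git clone -q --bare seed remote.git

# Your copy on the class server.
git clone -q remote.git assignment-1
cd assignment-1
git config user.name "John Doe"
git config user.email "johndoe@univ-paris-diderot.fr"

# The two commits from this section.
touch assignment-1.Rproj
git add assignment-1.Rproj
git commit -q -m "Add RStudio project file"
echo "This repository contains my work for Assignment 1." >> README.md
git add README.md
git commit -q -m "Add description of repository to README"

git status --short --branch    # the first line reports "ahead 2"
ahead_before=$(git rev-list --count '@{u}..HEAD')

# The Push button.
git push -q origin HEAD

ahead_after=$(git rev-list --count '@{u}..HEAD')
echo "ahead before push: $ahead_before, after push: $ahead_after"
```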

Looking at older versions

I won’t show you how to load them into your workspace (because you usually don’t want to do that), but you can look at the changes you’ve made by clicking, in the Github page, where it says “3 commits” (this is the total number of commits you’ve made up to now: one when you created the repository; another when you added the RStudio project file; and another when you modified the README). By clicking on the individual commits, you get to see the changes that you made at each step.

Notice that you did not sync with the remote repository three times (“push”). You only pushed once, just now. The Third and Fourth Books of Git don’t just contain your current work. A repository contains everything you have ever done, organized in this convenient fashion. You could also have viewed the same changes by looking at the local repository. Go back into RStudio, and, in the Git panel, hit “More” (the blue gear again). Click “Shell…”. Now type “git log”. You have the same list of commits, though without the detailed changes that the Github interface shows. (There is a command, “git diff”, that allows you to see them: try “git diff HEAD^ HEAD”.) What’s in the local repository, like what’s in the remote repository, is the whole history of your project.
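If you would like to try these two commands somewhere harmless first, here is a small sketch in a throwaway repository with two commits. "HEAD" means the latest commit and "HEAD^" the one just before it.

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init -q assignment-1
cd assignment-1
git config user.name "John Doe"
git config user.email "johndoe@univ-paris-diderot.fr"

echo "# assignment-1" > README.md
git add README.md
git commit -q -m "Initial commit"

echo "This repository contains my work for Assignment 1." >> README.md
git add README.md
git commit -q -m "Add description of repository to README"

git log --oneline      # one line per commit, newest first
git diff HEAD^ HEAD    # what the latest commit changed: lines added start with +
```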

Cloning and pulling

When you eventually collaborate, or, potentially for the nearer future, if you decide to install R on your own computer and work there, instead of on the class server, then the way you’ll stay synced up (with your collaborators or your friends) is to add, for each separate computer where you want a copy of the project, Three More Books of Git: a new workspace, index, and local repository.

Above, after you created your remote repository on Github, you then went into RStudio and created a new project. You gave it the address of the remote repository, and RStudio created Three Matching Books of Git in your directory on the class server. If you wanted to work on the project on your own copy on your own computer (not connecting to the class server, with actual copies of the workspace on your hard drive: for example, you’re taking a long flight soon); or if someone else wanted to work on the project; then you, or they, would do the same thing again on the appropriate computer. This is called “Cloning.” (If someone else wanted to Push, you’d need to add them as a collaborator on the project. By default, only your username and password will work for Push.)

If some changes are pushed to the remote repository, you will then probably want to Pull them, to update your local repository. You can try this now by hitting the “Pull” (down arrow) button in the Git panel, but nothing will happen, because nothing has been pushed to the remote repository that you don’t already have.
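To see a Pull that does something, you need a second set of Books. The sketch below fakes this on one machine: "remote.git" stands in for Github, and the two clones ("server-copy" and "laptop-copy", both made-up names) stand in for the class server and your own computer.

```shell
scratch=$(mktemp -d)
cd "$scratch"

# Stand-in for Github.
git init -q seed
cd seed
git config user.name "John Doe"
git config user.email "johndoe@univ-paris-diderot.fr"
echo "# assignment-1" > README.md
git add README.md
git commit -q -m "Initial commit"
cd ..
git clone -q --bare seed remote.git

# Two clones of the same remote repository.
git clone -q remote.git server-copy
git clone -q remote.git laptop-copy

# On the laptop: make a change, commit it, push it.
cd laptop-copy
git config user.name "John Doe"
git config user.email "johndoe@univ-paris-diderot.fr"
echo "Written on the plane." >> README.md
git add README.md
git commit -q -m "Add note to README"
git push -q origin HEAD
cd ..

# Back on the server: pull the change down.
# --ff-only accepts the update only if it simply adds new commits
# on top of what is already here (no merging needed).
cd server-copy
git pull -q --ff-only
tail -n 1 README.md    # prints: Written on the plane.
```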

There are a couple of skills you’d need to learn before working with other collaborators (they’re useful for working on your own code as well, but not essential at first). Namely, you’d need to learn to “merge” and to create separate “branches” within a repository. You can find out more about these in the tutorials below.

Using Github without having to enter your username and password (optional)

If you start to get annoyed by having to enter your username and password each time you Push, then you can set up a better way of proving who you are. Follow these steps from the Shell, accessible by hitting “More” in the Git panel in RStudio. (On the class server, you don’t need to follow the steps for MacOS, even if your computer is a Mac. The class server isn’t on MacOS, it’s running Ubuntu. Also, the “pbcopy” command won’t work. You have to type “cat ~/.ssh/id_rsa.pub”, and then copy the output to the clipboard yourself with the mouse.)

You’ll then have to change the way RStudio connects to your repository, from HTTPS to SSH. First, push your latest changes. Then, open a web browser tab to your repository on Github, and click the green “Clone or download” button. Instead of HTTPS, make sure that “Use SSH” is selected. Copy the address given (it should be of the form “git@github.com: … (etc etc) …”). In RStudio, go into the Shell, and remove your existing HTTPS connection to the remote repository:

git remote remove origin

Now add the new SSH connection:

git remote add origin [ADDRESS YOU COPIED FROM GITHUB]
git push -u origin master

You can close the Shell and you should be able to Pull and Push from RStudio without a username or password.
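If you want to check which address your repository is using, before or after the switch, "git remote -v" prints it. The sketch below goes through the switch in a throwaway repository without contacting Github. USERNAME is a placeholder, not a real account; your own addresses are the ones you copied from Github.

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init -q assignment-1
cd assignment-1

# Before: the HTTPS address (USERNAME is a placeholder).
git remote add origin https://github.com/USERNAME/assignment-1.git
git remote -v

# The switch described above.
git remote remove origin
git remote add origin git@github.com:USERNAME/assignment-1.git

# After: the SSH address.
git remote -v
```

Git also has a one-step command, "git remote set-url origin [ADDRESS]", which changes the address in place instead of removing the connection and adding it back.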

If you set up Git on another computer, and you don’t want to have to use your username and password from there, you’ll need to copy your SSH private key to that computer. That’s not included in this tutorial, but I can give you a hand if you’re not sure how. Then just do this again on that computer.

Big binary files and Git’s history (boring, but not optional)

One last thing about Git. I told you at the beginning that you can use Git to store almost anything. Here is the “almost”: you shouldn’t use Git to store large files, especially not large binary files. We’ll get back to the binary part in a second. The “large” part means, “more than a few megabytes”, at least when it applies to binary files.

First, what I’m telling you concretely is that you shouldn’t put large files in the local repository or remote repository. If you accidentally put one in the index (by checking the box and getting the “A” for “Added” in RStudio), you can always uncheck it. When you think about the fact that you want to be able to distribute your code and work across multiple computers or collaborators, you can already see that it would be a bit of a pain if, for every new computer you add, you have to first download a repository that’s really big.

But it’s actually worse than this with Git, because of the “binary” part. “Binary” as opposed to “text” files is something you can understand better by reading this article and the Wikipedia pages linked therein. Some examples of binary files are images, PDFs, Word documents, audio files, and compressed files like Zip files or TGZ files. Some examples of text files are source code, R-markdown files (which you’ll learn about today), most HTML files that make up web pages (but later in this assignment I’m going to give you a warning about some of the ones that R creates, which are partly binary), data that’s saved in CSV (comma-separated value) files, or any document you’ve written that you can open up with Notepad on Windows, TextWrangler on Mac, or GEdit on Linux, or Emacs or vi—these programs are all called “text editors” because they won’t save what you’ve written as a binary file. They’ll just save the actual sequence of letters/numbers/spaces that you’ve written directly. That’s a text file.

So, remember that Git saves your whole history in the repositories. For text files, it’s good at doing this efficiently. When you put a change to a file in the index and then commit, it will scan through the files to see what’s been updated, and only save those changes. When you add a file, it will analyze the file and find an efficient way to store it in the first place. When you “remove” a file, it will not be in the latest version, but it will still be there in the history inside the repository.
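You can see for yourself that a “removed” file is still in the history. In this sketch (throwaway repository, made-up file name "notes.txt"), the file is committed, then removed, and then read back out of the previous commit.

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init -q demo
cd demo
git config user.name "John Doe"
git config user.email "johndoe@univ-paris-diderot.fr"

echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "Add notes"

# Remove the file from the workspace and from the latest version.
git rm -q notes.txt
git commit -q -m "Remove notes"

ls                          # prints nothing: the file is gone from the workspace
git show HEAD^:notes.txt    # prints: first draft
```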

But, for binary files, these processes are not efficient at all. Git won’t be able to find the changes if you make changes to binary files. It will just store both copies, the old one, and the new one, completely. That means that every time you make a change to that binary file, it’s going to increase the size of the repository by a lot (even if the old version is no longer in your workspace, on any computer, anywhere; it’s in your repository). On top of that, it’s just not very good at storing binary files efficiently at all, and because, whenever you do a push, it likes to go in and do a bit of internal reorganizing, this means that the simple fact of adding a large binary file to your repository is going to slow down literally every single time you push. Because it’s going to go back and look over that file in detail to see if it can store it better. It can’t. But it’s dumb. And, because Git stores your whole history, even if you remove the file from the remote repository, and all the local repositories on any computer in the world, this will still happen.

Large text files aren’t so bad. They only cause the problem of the initial pull to be slow. It’s probably not the best idea, because it’s inconvenient, but it only happens once. So if you wanted to store a bunch of data as a CSV in your repository, it wouldn’t cause too many problems. If you stored it as a compressed file, or if you were doing speech corpus work and stored the original audio files, it probably would. Large binary files, say, over five megabytes, are Not Fit for Git. This will come up later in this assignment.
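One way to avoid committing a large file by accident is to look for them before you commit, and to list them in the “.gitignore” file so that Git stops offering them to you. The sketch below is only an illustration: the file names are made up, the 7 MB file is filled with zeros rather than real audio, and the "-size +5M" test assumes the version of "find" that comes with Ubuntu or Mac.

```shell
scratch=$(mktemp -d)
cd "$scratch"
git init -q demo
cd demo

# A 7 MB stand-in for an audio recording, and a small CSV file.
head -c 7000000 /dev/zero > recording.wav
echo "word,count" > data.csv

# List files over 5 megabytes, skipping Git's own hidden directory.
find . -type f -size +5M ! -path './.git/*'

# Tell Git to ignore all .wav files in this repository.
echo "*.wav" >> .gitignore

git status --short    # recording.wav is no longer listed
```

The ignored file stays in your workspace; it just never makes it into the index, so it can’t end up in the repository.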

Learn more about Git

Git isn’t only in RStudio. It’s its own tool. You may find it useful to try out other tutorials about Git, which will teach you different things (including some things we didn’t talk about), in a different way. Here are a few.

http://r-bio.github.io/intro-git-rstudio/

https://www.youtube.com/watch?v=uUuTYDg9XoI

https://try.github.io/levels/1/challenges/1

Exercise 1: Getting started with RMarkdown files

In this class, you’re going to get used to working using RMarkdown files. RMarkdown files are a way of doing what’s called “literate programming”. This will make more sense if you first look at an example. This is a lesson from my friend Joe’s stats course at the LSA this year.

You can see that the document I sent you to is a web page. And it has a bunch of plots in it, to explain linear regression. You also saw that it has a bunch of R code in it (for example, right at the top, under “~2 minute setup”). It turns out that this web page was actually automatically generated. Joe didn’t manually paste those plots and that R code into the web page, the way he would have if he were typing his notes up in, for example, Word, or LaTeX. Joe didn’t do that (I don’t do that either). Those plots were produced by R, and all Joe did to put them in the document was to write the R code that made them. He wrote them into an RMarkdown file, which was then automatically run and converted into the HTML file for the webpage. He could have also made a PDF, or a Word document, without ever leaving RStudio.

Here’s a snippet of Joe’s RMarkdown file (which you can also download by clicking “Code” in the upper left hand corner of the page), in the form of a screenshot of what he probably saw when he was editing it: