Learning Club 00: Set up your development environment (Getting started with R)

A few weeks ago I became aware of Renee’s (owner of the blog Becoming a data scientist) plan to start a data science learning club and I thought it was a cool idea. In the learning club she will post activities, and the first one was about setting up your development environment: Activity 00: Set up your development environment. Since she is a Python user, she recently asked on Twitter whether someone could help her find R resources.

Another user (Ryan) and I replied, and since Renee encourages us to share the results of each biweekly exercise/task on our own blogs and/or in the forum, I thought I’d summarize:

  • Recommended/required downloads
  • How to install R packages from CRAN, bioconductor and github
  • Recommended reading
  • My setup

Recommended/required downloads

To get started with R you need the R framework installed on your computer. You can either install base R or Revolution R Open (RRO). RRO is based on and 100% compatible with R. The nice thing on Windows is that you can switch between R versions easily (which I did for some performance experiments). This is not easily possible on Mac OS X, because it just overwrites your R installation if you install RRO.

R

For installing base R select your mirror https://cran.r-project.org/mirrors.html and download R for your operating system (or just google “download R” and you will get there).

For some Linux distros R might be in the repos, but I have only used it on Mac and Windows.

Since RRO is based on R, base R is sometimes a little more up to date right after a new release (until RRO releases a version based on the newest R), but that is the only advantage of base R I see.

Revolution R Open

INSTEAD of base R, you can use Revolution R Open by the company Revolution Analytics.

It has some advantages (some performance improvements on matrix operations, checkpoints can be set for packages, …). It is completely compatible with all R code but currently one version (3.2.2) behind base R (3.2.3). The main advantage I see (and the reason why I now use RRO) is that the packages you install come by default from a snapshot taken on the day your RRO version was released. It has happened to me several times that code stopped working after I moved it to another machine with the same R version, simply because some packages there were newer or older. This cannot happen with RRO: when you use the same RRO version you get the same package versions. In case you need a newer one, you can pass the date to the function checkpoint. Read more about this awesome feature.
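A minimal sketch of what passing a date to checkpoint could look like. The wrapper name use_snapshot and the example date are made up for illustration, and I am assuming the checkpoint package is installed and its snapshot server is reachable:

```r
# Hypothetical wrapper around checkpoint(); not part of any package.
use_snapshot <- function(snapshot_date) {
  # as.Date() errors early if the date is malformed
  snapshot_date <- as.Date(snapshot_date)
  if (!requireNamespace("checkpoint", quietly = TRUE)) {
    stop("the checkpoint package is not installed")
  }
  # Use package versions as they were on the given day
  checkpoint::checkpoint(format(snapshot_date))
}

# Example call (illustrative date, triggers downloads):
# use_snapshot("2016-01-01")
```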

RStudio

Some people use R in the console but I highly recommend using RStudio. The main advantages I see compared to using the console are that you have:

  • An area for plots
  • A view on the file system
  • An overview of the environment
  • A wizard for creating projects, packages, shiny projects, …

But there are a lot more features and everyone needs to find out for themselves which are useful.

Attention Python users: There is Rodeo, an RStudio clone for Python…

Installing packages from CRAN, bioconductor and github

There are two main repositories, CRAN and bioconductor. Packages can also be installed from github. All three options are explained below.

CRAN

To check it out go to https://cran.r-project.org.
To install packages in R type:

install.packages("myPackageName")
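If you run the same script several times, you probably do not want to reinstall a package each time. A small sketch of one way to handle that; install_if_missing is a helper name I made up, not a base R function:

```r
# Hypothetical helper: install a CRAN package only if it is not there yet.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report whether the package is available afterwards
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so nothing is downloaded here
install_if_missing("stats")
```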

bioconductor

bioconductor contains packages mainly designed for biological data, but they can often also be used for other purposes, so everyone should at least have heard about it. The nasty thing is packages are installed in a different way.

To install a package type:

source("https://bioconductor.org/biocLite.R")
biocLite("myPackageName")
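On newer setups (as far as I know, R 3.5 and later with bioconductor 3.8 and later) the biocLite script has been replaced by the BiocManager package from CRAN. A sketch under that assumption; install_bioc is a name I made up for illustration:

```r
# Hypothetical helper for newer R/bioconductor versions.
install_bioc <- function(pkg) {
  # BiocManager itself is installed from CRAN
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  # Installs the bioconductor package matching your R version
  BiocManager::install(pkg)
}

# Example call (downloads the package):
# install_bioc("myPackageName")
```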

github

Some people also host packages on github, and for some packages on CRAN (which has certain release dates) newer versions might be available on the package’s github page.
To be able to install packages from github you first need the package devtools from CRAN.

install.packages("devtools")
library(devtools) # that's how you load it
install_github("userName/repoName") # github user (or organisation) and repository name

Install from within RStudio

If you are in RStudio you can also go to Tools > Install Packages …, where you can install packages from CRAN or a package you have downloaded to your computer from somewhere.

If installing fails

Sometimes for some packages I get the error package ‘abc’ is not available (for R version 123) although I know it is available. First check whether the package really is in CRAN/bioconductor or whether you just used the wrong repo (this has happened to me before 😉).
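The check can also be done from R itself. A minimal sketch: is_in_repo is a helper name I made up, and it only looks at the repositories configured in getOption("repos"), so bioconductor packages will not show up unless those repositories are configured too:

```r
# Hypothetical helper: is a package listed in the configured repositories?
# The default argument queries the repositories over the network.
is_in_repo <- function(pkg, available = available.packages()) {
  # available.packages() returns a matrix with one row per package,
  # and the row names are the package names
  pkg %in% rownames(available)
}

# Example call (needs an internet connection):
# is_in_repo("Matrix")
```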

The easiest way to install it anyway is to download the tar.gz file from the repo (e.g. for package fabia on bioconductor go to fabia on bioconductor or for Matrix on CRAN go to Matrix on CRAN and click the link next to package source). Then follow the explanation I gave above for “Install from within RStudio”.
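The same thing works without the RStudio dialog. A sketch, assuming you already downloaded the tar.gz file; install_from_tarball is a made-up helper name, and packages containing compiled code additionally need build tools on your machine:

```r
# Hypothetical helper: install a source package from a local tar.gz file.
install_from_tarball <- function(path) {
  stopifnot(file.exists(path))
  # repos = NULL tells R to treat the first argument as a file path
  install.packages(path, repos = NULL, type = "source")
}

# Example call (the path is a placeholder):
# install_from_tarball("path/to/package.tar.gz")
```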

Load packages

There are two ways for loading packages:

library(packageName)
require(packageName)

The differences are explained very nicely in this blogpost (did not know that until now!), summary:

  • library() throws an error if the loading does not work
  • require() does not throw an error, it returns TRUE/FALSE to indicate success

So you should use library() when providing functions for other people, so they will know why something fails, unless you need to do something like this:

if (require(myPackage)) {
    # do something with the package
} else {
    # do something without the package
}
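To see the difference yourself, you can try both with a package that does not exist. A small sketch; the package name is deliberately made up:

```r
pkg <- "thisPackageDoesNotExist123"  # made-up name, not a real package

# require() only warns and returns FALSE
loaded <- suppressWarnings(
  require(pkg, character.only = TRUE, quietly = TRUE)
)
print(loaded)

# library() stops with an error, which is caught here for demonstration
outcome <- tryCatch({
  library(pkg, character.only = TRUE)
  "loaded"
}, error = function(e) "library() raised an error")
print(outcome)
```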

One more thing: to get help for a function, e.g. plot(), type:

?plot

Then a help page will pop up. To get examples, type:

example(plot)
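If you only remember part of a function name, apropos() from base R can help. A short sketch:

```r
# List all visible objects whose name starts with "read.csv"
matches <- apropos("^read\\.csv")
print(matches)

# For a full-text search of the help pages, type for example:
# ??"linear model"
```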

Recommended reading

The first two things people should get started with are data types and how to get data into R (I assume here that you already know at least basic programming and know how loops, if-conditions and so on work; if not, please leave a comment and I will help you find resources).

Get data into R

There are several other types of data you can use in R (other file types, databases, …) but those are the two I encounter most often.
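As one common case, here is a minimal sketch of reading a CSV file with base R. Picking CSV is my assumption for this example, and the file and column names are made up; the file is written to a temporary location first so the example is self-contained:

```r
# Write a tiny example file to a temporary location
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = 1:3, name = c("a", "b", "c")),
  csv_path,
  row.names = FALSE
)

# Read it back; stringsAsFactors = FALSE keeps text columns as character
dat <- read.csv(csv_path, stringsAsFactors = FALSE)
str(dat)
```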

Datatypes

r-tutor.com provides an introduction to basic data types like numeric, integer, character, … More interesting are matrices, data frames and lists (the latter can be counterintuitive sometimes).
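A quick sketch of these types, including the list behaviour that I find counterintuitive. All variable names are made up for illustration:

```r
x_num <- 3.14   # class "numeric"
x_int <- 2L     # class "integer" (note the L suffix)
x_chr <- "R"    # class "character"

# A matrix holds one data type, arranged in rows and columns
m <- matrix(1:6, nrow = 2)

# A data frame can hold a different type in each column
df <- data.frame(id = 1:2, label = c("a", "b"), stringsAsFactors = FALSE)

# A list can hold anything, but single and double brackets behave differently
lst <- list(number = 1, text = "a", vec = 1:3)
class(lst["vec"])    # "list": single brackets return a smaller list
class(lst[["vec"]])  # "integer": double brackets return the element itself
```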

Unfortunately they do not have a tutorial on factors, which are also often complicated to use. Some time ago I wrote a basic tutorial on factors in R.
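One example of why factors can be confusing: converting a factor of numbers directly to numeric gives you the internal level codes, not the values. A short sketch:

```r
f <- factor(c("10", "20", "10", "30"))
levels(f)                    # "10" "20" "30"

# Direct conversion returns the level codes
codes <- as.numeric(f)       # 1 2 1 3

# Going through character gives the actual values
values <- as.numeric(as.character(f))  # 10 20 10 30
```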

Courses

Datacamp provides really cool courses. The one I started consists of short videos with explanations and an online code editor with small tasks to do.

swirl is an R course that you run directly from your R console. I haven’t tried it but it sounds like a cool concept and several people have already recommended it to me.

Further collections of resources

Many other people have already collected resources; these are the ones I find especially useful:

Slightly advanced users

If you are already slightly advanced, these are the packages that everyone recommends:

  • dplyr: helps when working with data frames (to my shame: I still haven’t used it) (there is a cheatsheet)
  • magrittr: I’m super excited about this piping operator and when you follow #rstats you’ll know others are excited about it too.
  • ggplot2: makes plots look really nice in R (there is also a cheatsheet)
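To give an idea of what the piping operator does, here is a small sketch comparing nested calls with a pipe. It only runs if magrittr is installed, and the numbers are made up:

```r
if (requireNamespace("magrittr", quietly = TRUE)) {
  library(magrittr)

  values <- c(1, 4, 9)

  # Nested calls are read from the inside out
  nested <- round(mean(sqrt(values)), 1)

  # With the pipe the same steps are read from left to right
  piped <- values %>% sqrt() %>% mean() %>% round(1)

  print(identical(nested, piped))
}
```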

My setup

My main setup is a Surface 3 with RStudio and Revolution R Open 3.2.2. Since it does not have much RAM (4GB), for bigger computations I use our MacPro (16GB RAM) on which Patrick installed Windows 10. On this machine I also use RStudio and RRO 3.2.2.
