Introduction

In this session, you will learn the basic R syntax, including:

  1. General syntax
  2. Basic operations
  3. Object & data types
  4. Flow controls
  5. Introduction to the tidyverse and piping.

Brief Introduction: Base-R

In contrast to more general-purpose programming languages, base-R syntax is already strongly geared towards working with data. Many slicing and dicing functionalities that require additional libraries elsewhere (e.g. pandas in Python) come out-of-the-box in R, which is a dedicated statistical programming language. However, over the last years the traditional R syntax has to a large extent been replaced by the tidy principles pushed by RStudio and implemented in their tidyverse ecosystem.

We will not use too much base-R syntax in our time together. However, it is still important to know the basics, since there are always situations here and there where you just cannot avoid it.

Basics

Assignments

You can assign a value to an object using assign(), <-, or =.

x <- 3; y <- 4  # example values, assumed here; they reproduce the output below
z <- x + 17*y   # Assignment
z   
[1] 71
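
As a minimal sketch, the three assignment forms mentioned above lead to the same result. The object names a, b and d are chosen only for illustration.

```r
# Three equivalent ways to bind the value 10 to a name
a <- 10           # the conventional assignment operator
b = 10            # also works at the top level
assign("d", 10)   # functional form; the name is passed as a string

identical(a, b)   # TRUE
identical(b, d)   # TRUE
```

In everyday code <- is the usual choice; assign() is mostly handy when the object name itself is only known at run time.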

Value comparisons

Comparisons return boolean values: TRUE or FALSE (often abbreviated to T and F)

x <= y 
[1] TRUE

Special constants: NA, NULL, Inf, -Inf, NaN

NA indicates missing or undefined data

mean(c(1, 2, NA, 4, 5), na.rm = TRUE)
[1] 3
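
To see why na.rm = TRUE is needed, here is a small sketch showing how NA propagates through calculations. The vector name vals is hypothetical.

```r
vals <- c(1, 2, NA, 4, 5)   # hypothetical example vector

is.na(vals)                 # FALSE FALSE  TRUE FALSE FALSE
sum(is.na(vals))            # 1 missing value
mean(vals)                  # NA, because one value is unknown
mean(vals, na.rm = TRUE)    # 3, the mean of 1, 2, 4 and 5
```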

NULL indicates an empty object, e.g. a null/empty list

10 + NULL     # arithmetic with NULL returns an empty object (length zero)
numeric(0)
is.null(NULL) # check if NULL
[1] TRUE

Inf and -Inf represent positive and negative infinity. They can be returned by mathematical operations like division of a non-zero number by zero.

5/0
[1] Inf
is.finite(5/0) # Check if a number is finite
[1] FALSE
is.infinite(5/0) # Check if a number is infinite
[1] TRUE

NaN (Not a Number) - the result of an operation that cannot be reasonably defined

is.nan(0/0)
[1] TRUE

Object classes

Vectors

v1 <- c(1, 5, 11, 33)       # Numeric vector, length 4
v1
[1]  1  5 11 33
v2 <- c("hello","world")    # Character vector, length 2 (a vector of strings)
v2
[1] "hello" "world"
v3 <- c(TRUE, TRUE, FALSE)  # Logical vector, same as c(T, T, F)
v3
[1]  TRUE  TRUE FALSE

Combining different types of elements in one vector will coerce the elements to the least restrictive type:

v4 <- c(v1,v2,v3,"boo")     # All elements turn into strings
v4
 [1] "1"     "5"     "11"    "33"    "hello" "world" "TRUE"  "TRUE"  "FALSE" "boo"  

Element-wise operations:

v1 + c(1,7) 
[1]  2 12 12 40
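
The result above relies on recycling: the shorter vector is repeated until it matches the length of the longer one. A minimal sketch, re-creating v1 so that the block runs on its own (short and long are names used only here):

```r
v1 <- c(1, 5, 11, 33)

# The shorter vector c(1, 7) is recycled to match the length of v1
short <- v1 + c(1, 7)
long  <- v1 + c(1, 7, 1, 7)

identical(short, long)   # TRUE
v1 * 2                   # a single number is recycled too: 2 10 22 66
```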

Mathematical operations:

cor(v1,v1*5) 
[1] 1

Logical operations:

v1 > 2       # Each element is compared to 2, returns logical vector
[1] FALSE  TRUE  TRUE  TRUE
v1==v2       # Are corresponding elements equivalent, returns logical vector.
[1] FALSE FALSE FALSE FALSE
v1!=v2       # Are corresponding elements *not* equivalent? Same as !(v1==v2)
[1] TRUE TRUE TRUE TRUE
(v1>2) | (v2>0)   # | is the boolean OR, returns a vector.
[1] TRUE TRUE TRUE TRUE
(v1>2) & (v2>0)   # & is the boolean AND, returns a vector.
[1] FALSE  TRUE  TRUE  TRUE
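(v1>2) || (v2>0)  # || is the scalar OR: older R versions compare only the first elements and return a single value
[1] TRUE
(v1>2) && (v2>0)  # && is the scalar AND: older R versions compare only the first elements and return a single value
[1] FALSE

NOTE: Since R 4.3.0, || and && throw an error when one side has more than one element, so the two calls above only run in older versions.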
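
To collapse a whole logical vector into a single TRUE or FALSE, base R offers any() and all(). A minimal sketch, re-creating v1 so that the block runs on its own (n is a name used only here):

```r
v1 <- c(1, 5, 11, 33)

any(v1 > 2)   # TRUE: at least one element is greater than 2
all(v1 > 2)   # FALSE: the first element (1) is not

# With single values, && and || work in every R version
n <- 5
(n > 2) && (n < 10)   # TRUE
(n < 2) || (n > 10)   # FALSE
```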

Addressing vector elements:

v1[v1>3] 
[1]  5 11 33

NOTE: If you are used to languages that index from 0 (e.g. Python), R will surprise you by indexing from 1.
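
A short sketch of the most common ways to index a vector, using a hypothetical vector v:

```r
v <- c(10, 20, 30, 40)   # hypothetical example vector

v[1]          # 10: the first element has index 1, not 0
v[c(1, 3)]    # 10 30: select several positions
v[-1]         # 20 30 40: a negative index drops that position
v[v > 15]     # 20 30 40: select by a logical condition
v[length(v)]  # 40: the last element
```

Note that a negative index means "everything except", which differs from Python, where -1 addresses the last element.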

To add more elements to a vector, simply assign them values.

v1[6:10] <- 6:10
v1
 [1]  1  5 11 33 NA  6  7  8  9 10

We can also directly assign the vector a length:

length(v1) <- 15 # the last 5 elements are added as missing data: NA
v1
 [1]  1  5 11 33 NA  6  7  8  9 10 NA NA NA NA NA

Factors

Factors are used to store categorical data.

eye.col.v <- c("brown", "green", "brown", "blue", "blue", "blue")         #vector
eye.col.f <- factor(c("brown", "green", "brown", "blue", "blue", "blue")) #factor
eye.col.v
[1] "brown" "green" "brown" "blue"  "blue"  "blue" 
eye.col.f
[1] brown green brown blue  blue  blue 
Levels: blue brown green

R will identify the different levels of the factor, i.e. all distinct values. The data is stored internally as integers, with each number corresponding to a factor level.

levels(eye.col.f)  # The levels (distinct values) of the factor (categorical variable)
[1] "blue"  "brown" "green"
as.numeric(eye.col.f)  # The factor as numeric values: 1 is  blue, 2 is brown, 3 is green
[1] 2 3 2 1 1 1
as.numeric(eye.col.v)  # The character vector, however, cannot be coerced to numeric (NAs, with a warning)
[1] NA NA NA NA NA NA
as.character(eye.col.f)  
[1] "brown" "green" "brown" "blue"  "blue"  "blue" 
as.character(eye.col.v) 
[1] "brown" "green" "brown" "blue"  "blue"  "blue" 
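
Because factors are stored as integer codes, converting a factor that contains numbers needs some care. A minimal sketch with hypothetical data (years is a name used only here):

```r
years <- factor(c("2010", "2005", "2010", "2020"))   # hypothetical data

levels(years)                     # "2005" "2010" "2020"
as.numeric(years)                 # 2 1 2 3: the internal level codes
as.numeric(as.character(years))   # 2010 2005 2010 2020: the actual values
```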

Matrices & Arrays

A matrix is a vector with dimensions:

m <- rep(1, 20)   # a vector of twenty 1s
dim(m) <- c(5,4)  # give it dimensions: 5 rows, 4 columns
m
     [,1] [,2] [,3] [,4]
[1,]    1    1    1    1
[2,]    1    1    1    1
[3,]    1    1    1    1
[4,]    1    1    1    1
[5,]    1    1    1    1

Create a matrix using matrix():

m <- matrix(data=1, nrow=5, ncol=4)  # same matrix as above, 5x4, full of 1s
m <- matrix(1,5,4)                       # same matrix as above (lazy style)
dim(m)                                # What are the dimensions of m?
[1] 5 4
m
     [,1] [,2] [,3] [,4]
[1,]    1    1    1    1
[2,]    1    1    1    1
[3,]    1    1    1    1
[4,]    1    1    1    1
[5,]    1    1    1    1

Create a matrix by combining vectors:

m <- cbind(1:5, 5:1, 5:9)  # Bind 3 vectors as columns, 5x3 matrix
m <- rbind(1:5, 5:1, 5:9)  # Bind 3 vectors as rows, 3x5 matrix
m <- matrix(1:10,10,10)
m
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    1    1    1    1    1    1    1    1    1     1
 [2,]    2    2    2    2    2    2    2    2    2     2
 [3,]    3    3    3    3    3    3    3    3    3     3
 [4,]    4    4    4    4    4    4    4    4    4     4
 [5,]    5    5    5    5    5    5    5    5    5     5
 [6,]    6    6    6    6    6    6    6    6    6     6
 [7,]    7    7    7    7    7    7    7    7    7     7
 [8,]    8    8    8    8    8    8    8    8    8     8
 [9,]    9    9    9    9    9    9    9    9    9     9
[10,]   10   10   10   10   10   10   10   10   10    10

Select matrix elements:

m[-1,]     # all rows *except* the first one
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    2    2    2    2    2    2    2    2    2     2
 [2,]    3    3    3    3    3    3    3    3    3     3
 [3,]    4    4    4    4    4    4    4    4    4     4
 [4,]    5    5    5    5    5    5    5    5    5     5
 [5,]    6    6    6    6    6    6    6    6    6     6
 [6,]    7    7    7    7    7    7    7    7    7     7
 [7,]    8    8    8    8    8    8    8    8    8     8
 [8,]    9    9    9    9    9    9    9    9    9     9
 [9,]   10   10   10   10   10   10   10   10   10    10

Conditional operations

m[m > 3] 
 [1]  4  5  6  7  8  9 10  4  5  6  7  8  9 10  4  5  6  7  8  9 10  4  5  6  7  8  9 10  4  5  6  7  8  9 10  4  5
[38]  6  7  8  9 10  4  5  6  7  8  9 10  4  5  6  7  8  9 10  4  5  6  7  8  9 10  4  5  6  7  8  9 10

Other matrix manipulation

m * m  # element-wise multiplication
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    1    1    1    1    1    1    1    1    1     1
 [2,]    4    4    4    4    4    4    4    4    4     4
 [3,]    9    9    9    9    9    9    9    9    9     9
 [4,]   16   16   16   16   16   16   16   16   16    16
 [5,]   25   25   25   25   25   25   25   25   25    25
 [6,]   36   36   36   36   36   36   36   36   36    36
 [7,]   49   49   49   49   49   49   49   49   49    49
 [8,]   64   64   64   64   64   64   64   64   64    64
 [9,]   81   81   81   81   81   81   81   81   81    81
[10,]  100  100  100  100  100  100  100  100  100   100
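
The * operator multiplies matrices element by element. For the matrix product of linear algebra, R uses %*% instead. A small sketch on a hypothetical 2x2 matrix (small is a name used only here):

```r
small <- matrix(1:4, nrow = 2)   # filled column by column: rows are (1 3) and (2 4)

small * small     # element-wise: rows are (1 9) and (4 16)
small %*% small   # matrix product: rows are (7 15) and (10 22)
t(small)          # transpose: rows are (1 2) and (3 4)
```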

Arrays: more than 2 dimensions

Created with the array() function:

a <- array(data=1:18,dim=c(3,3,2)) # 3d with dimensions 3x3x2
a <- array(1:18,c(3,3,2))          # the same array
a
, , 1

     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

, , 2

     [,1] [,2] [,3]
[1,]   10   13   16
[2,]   11   14   17
[3,]   12   15   18

Since this array has 3 dimensions, a third index can be used for slicing & dicing.

a[1,3,2]
[1] 16
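
Leaving an index empty selects everything along that dimension, just as with matrices. A minimal sketch, re-creating the array a from above so that the block runs on its own:

```r
a <- array(1:18, dim = c(3, 3, 2))

dim(a)       # 3 3 2
a[, , 2]     # the whole second 3x3 layer
a[1, , 2]    # first row of the second layer: 10 13 16
a[1, 3, 2]   # a single element: 16
```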

Lists

Lists are collections of objects (e.g. of strings, vectors, matrices, other lists, etc.)

l1$boo 
 [1]  1  5 11 33 NA  6  7  8  9 10 NA NA NA NA NA
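
As a sketch of how such a list can be created and accessed: the list below is hypothetical, and only the element name boo is taken from the text above.

```r
# Hypothetical list; only the element name `boo` appears in the text above
example_list <- list(boo = c(1, 5, 11), foo = "hello", moo = TRUE)

example_list$boo        # access by name: 1 5 11
example_list[["foo"]]   # double brackets return the element itself: "hello"
example_list[1]         # single brackets return a list of length 1
length(example_list)    # 3

empty <- list()         # an empty list that can be filled later
empty[[3]] <- "third"   # elements 1 and 2 are created as NULL
length(empty)           # 3
```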

Add more elements to a list:

l3[[1]] <- 11 # add an element to the empty list l3
l4[[3]] <- c(22, 23) # add a vector as element 3 in the empty list l4. 
                     # Since we added element 3, elements 1 & 2 will be generated and empty (NULL)
l1[[5]] <- "More elements!" # The list l1 had 4 elements, we're adding a 5th here.
l1[[8]] <- 1:11 # We added an 8th element, but not 6th or 7th. Those will be created empty (NULL)
l1$Something <- "A thing"  # Adds a ninth element - "A thing", named "Something"

Data Frames

The data frame is a special kind of list used for storing dataset tables. Think of rows as cases, columns as variables. Each column is a vector or factor.

Note: While base R uses the data.frame, we will later use the tibble when working with the tidyverse. It works the same way but changes some annoying behaviors of the original data type (e.g. strings are never interpreted as factors by default, and there are no rownames). More on that later.

Creating a dataframe:
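
The definition of dfr1 is not shown here. A minimal sketch that is consistent with the outputs in this section could look as follows. The name of the first column (ID) is an assumption, as only its values 1 to 4 appear in the text.

```r
# Reconstructed for illustration; the column name ID is an assumption
dfr1 <- data.frame(
  ID        = 1:4,
  FirstName = c("Jesper", "Jonas", "Pernille", "Helle"),
  Female    = c(FALSE, FALSE, TRUE, TRUE),
  Age       = c(22, 33, 44, 55)
)

dim(dfr1)     # 4 rows, 4 columns
names(dfr1)   # "ID" "FirstName" "Female" "Age"
```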

dfr1$FirstName
[1] "Jesper"   "Jonas"    "Pernille" "Helle"   

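Notice that R versions before 4.0.0 would treat such a column as a categorical variable and store it as a factor, not a character vector, which I find annoying. Since R 4.0.0, strings are kept as character vectors by default, as in the output above. In older versions, you can tell R you don’t like factors from the start using stringsAsFactors=FALSE. The tibble (introduced later) never converts strings to factors.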

dfr2 <- data.frame(FirstName=c("John","Jim","Jane","Jill"), stringsAsFactors=FALSE)
dfr2$FirstName   # Success: not a factor.
[1] "John" "Jim"  "Jane" "Jill"

Access elements of the data frame. The notation is dfr[row, column]. Rows can be accessed by number or condition, columns by number or name. Alternatively, columns can be accessed by dfr$column.

dfr1[1,]   # First row, all columns
dfr1[,1]   # First column, all rows
[1] 1 2 3 4
dfr1$Age   # Age column, all rows
[1] 22 33 44 55
dfr1[1:2,3:4] # Rows 1 and 2, columns 3 and 4 - the gender and age of Jesper & Jonas
dfr1[c(1,3),] # Rows 1 and 3, all columns

Find the names of everyone over the age of 30 in the data:

dfr1[dfr1$Age>30,2]
[1] "Jonas"    "Pernille" "Helle"   

Find the average age of all females in the data:

mean(dfr1[dfr1$Female==TRUE,4])
[1] 49.5

Flow Control (loops & friends)

Loops are powerful little helpers that repeat the same operation while iterating over a number of items.

If statements: if (condition) expr1 else expr2

x <- 5; y <- 10
if (x==0) y <- 0 else y <- y/x  
y
[1] 2

for loops: for (variable in sequence) expr

for (i in 1:x)  { print(paste("OMG, i just counted to", i)) }
[1] "OMG, i just counted to 1"
[1] "OMG, i just counted to 2"
[1] "OMG, i just counted to 3"
[1] "OMG, i just counted to 4"
[1] "OMG, i just counted to 5"
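
A common pattern is to loop over the positions of a vector and store the results in a pre-allocated vector. A minimal sketch with hypothetical names (values, squares):

```r
values  <- c(2, 4, 6)                 # hypothetical input
squares <- numeric(length(values))    # pre-allocate the result vector

for (i in seq_along(values)) {
  squares[i] <- values[i]^2           # store the result at position i
}

squares       # 4 16 36
values^2      # the vectorised equivalent gives the same result
```

For simple arithmetic like this, the vectorised form is usually shorter and faster. Loops earn their place when each step depends on the previous one.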

While loop: while (condition) expr

while (x > 0) {print(x); x <- x-1;}
[1] 5
[1] 4
[1] 3
[1] 2
[1] 1

Repeat loop: repeat expr, use break to exit the loop

repeat { print(x); x <- x+1; if (x>7) break}
[1] 0
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7

R troubleshooting

While I generate many (and often very creative) errors in R, there are three simple things that will most often go wrong for me. Those include:

  • Capitalization. R is case sensitive - a graph vertex named “Jack” is not the same as one named “jack”. The function rowSums won’t work as “rowsums” or “RowSums”.
  • Object class. While many functions are willing to take anything you throw at them, some will still surprisingly require a character vector or a factor instead of a numeric vector, or a matrix instead of a data frame. Functions will also occasionally return results in an unexpected format.
  • Package namespaces. Occasionally problems will arise when different packages contain functions with the same name. R may warn you about this by saying something like “The following object(s) are masked from ‘package:igraph’” as you load a package. One way to deal with this is to call functions from a package explicitly using ‘::’. For instance, if function ‘blah’ is present in packages A and B, you can call A::blah.
# install.packages('dplyr')
# library(dplyr)          # load a package
# detach(package:dplyr)   # detach a package
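
As a small sketch of the ‘::’ notation, here is a call to a function from the stats package, which ships with every R installation:

```r
# Call a function through its package namespace, without relying on the search path
stats::median(c(1, 5, 9))   # 5

# The same function is found without the prefix because stats is attached by default
median(c(1, 5, 9))          # 5
```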

For more advanced troubleshooting, check out try(), tryCatch(), and debug().

?tryCatch
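
A minimal sketch of how tryCatch() can keep a script running when a call fails. The helper safe_log is hypothetical and only serves as an illustration.

```r
# Hypothetical helper: returns NA instead of stopping when log() fails
safe_log <- function(x) {
  tryCatch(
    log(x),
    warning = function(w) NA_real_,   # e.g. log(-1) warns "NaNs produced"
    error   = function(e) NA_real_    # e.g. log("a") throws an error
  )
}

safe_log(1)     # 0
safe_log(-1)    # NA
safe_log("a")   # NA
```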

Generally, just using ?functionyouwonderabout often solves problems. There you can review the function’s arguments, inputs, outputs, syntax etc.

R 2.0: The Tidyverse

What is it all about?

Base R comes with quite a lot of functionality for slicing and dicing data, and there also exists a myriad of specialized packages for more tricky data manipulation. To read others’ code and examples, as well as to perform some special operations, you should all be able to use standard R syntax.

However, factors, the [row, column] syntax and so forth are not very comfortable or intuitive. Further, for more tricky operations such as certain aggregations, one has to rely on a variety of packages, which often come with their own syntax.

The good news is: The efforts of a small set of key developers (foremost Hadley Wickham) have led to the development of the tidyverse, an ecosystem of R packages particularly designed for data science applications. All packages share an underlying design philosophy, common API, grammar, and data structures.

Among the most amazing contributions here is dplyr, a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges. I use dplyr for 90% of my data-manipulation tasks for the following reasons:

  • The underlying code is optimized in C++, which often makes it faster than base R
  • It consistently unifies the grammar of data manipulation to a small set of operations, which can be flexibly combined to master almost every task
  • It is designed to work neatly with the %>% pipe-operator of magrittr (more on that later)
  • Its syntax is very similar to the logic of SQL and other data-management languages
  • It has expanded far beyond its original 5 verbs, and now replaces most base R commands with optimized, clever, and high-performance alternatives
  • It works neatly with many databases, such as SQL databases (with the add-on packages DBI and dbplyr)

I will not touch on all packages there, but the complete tidyverse covers almost all issues of data manipulation. The packages all operate under the same logic, are fast, and are usually your best choice for almost any given problem. Particularly dplyr is enormously powerful, and has a lot more functions than the basics I cover here. So, for every given problem, your first questions (to yourself or Stack Overflow) should be:

  1. Is there a way to solve my problem in dplyr?
  2. If not, is there another tidyverse package dedicated to this problem?

For the sake of illustration, I will load every package of the tidyverse one-by-one when we need it. However, normally I just load library(tidyverse) all at once, since I often need a lot of these packages anyhow.

library(tidyverse) # Collection of all the good stuff like dplyr, ggplot2 etc.
library(magrittr) # For extra piping operators (e.g. %<>%)

Tibbles

Tibbles are the tidyverse version of the traditional dataframe. They work in almost exactly the same way, with some small differences that are usually seen as an improvement from a data science perspective:

  1. Strings are by default not recoded as factors
  2. Rownames are dropped
  3. The default print delivers a more convenient overview.

They can be created in 3 different ways.

  1. Creating them from scratch with tibble()
  2. Using explicitly the as_tibble() function on a table
  3. Some dplyr functions return a tibble when applied to a dataframe (for example group_by()); most others keep the class of their input.
head(iris) # a dataframe
head(as_tibble(iris))
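
A minimal sketch of the first two ways, assuming the tibble package (part of the tidyverse) is installed. The data in people is made up for illustration.

```r
library(tibble)   # part of the tidyverse

# Hypothetical data, created from scratch with tibble()
people <- tibble(
  name = c("Ann", "Bob", "Cleo"),
  age  = c(31, 45, 27)
)

class(people$name)           # "character": strings are not turned into factors
is_tibble(people)            # TRUE
is_tibble(iris)              # FALSE: iris is an ordinary dataframe
is_tibble(as_tibble(iris))   # TRUE after explicit conversion
```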

It is usually the preferred format for data science projects in R.

Piping

In traditional R syntax, data manipulations are carried out one by one. For example, one would first assign a new variable x$numbers <- 1:5, then maybe manipulate it x$numbers <- x$numbers * 2, and subset it x <- x[x$numbers > 4, ]. dplyr makes use of magrittr’s pipes, written as %>%.

A pipe means: take the output of its left-hand side and insert it as the first input of the function on the right-hand side. Accordingly, all dplyr functions follow the syntax that their first input is always the data to be manipulated. Therefore, they can all be “piped”.
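
A minimal sketch of this rule on a plain numeric vector, assuming the magrittr package is installed:

```r
library(magrittr)   # provides %>%

# Nested call: read from the inside out
sum(sqrt(c(1, 4, 9)))                   # 6

# Piped call: read from left to right
c(1, 4, 9) %>% sqrt() %>% sum()         # 6

# Additional arguments go after the piped-in first argument
c(1.234, 5.678) %>% round(digits = 1)   # 1.2 5.7
```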

x <- tibble(numbers = 1:5) 

Let’s say we want to multiply all numbers by 2, and THEN subset the data for observations with a number larger than 4. In base R, we could do the following:

y <- x
y$numbers <- y$numbers * 2   # multiply the column by 2
y <- y[y$numbers > 4, ]      # keep only rows with numbers > 4
y

With pipes, the same thing could be written as follows (don’t worry about the other syntax yet):

x %>%
  mutate(numbers = numbers * 2) %>%
  filter(numbers > 4)

It basically reads like:

  • Create a dataframe (to be precise, a tibble) with the variable “numbers” and assign the values 1:5.
  • THEN multiply them by 2.
  • THEN subset the dataframe to only rows with a number value higher than 4.

At first this does not look so intuitive, but it will become second nature. Using pipes facilitates fast, reproducible and easily readable coding practices, and all of you are encouraged to adopt them.

Note: %>% pipes do not automatically assign their output to the left-hand side object, meaning the original dataset will not be overwritten. To do that, there are two ways:

  1. Assign the output back to the original data with <-
  2. Use magrittr’s %<>% operator, meaning: assign and pipe.

# This will create an output, but not change x
x %>%
  filter(numbers > 5)

# This will re-assign x
x <- x %>%
  filter(numbers > 5)
# is equivalent to
x %<>%
  filter(numbers > 5) 

In conclusion: The pipe basically passes a dataframe on between functions in the following way:

# Only pseudo code here, does not run
x %>% fun(na.rm = TRUE)

# Is equivalent to
fun(x, na.rm = TRUE)

# While
x %<>% fun()
# Is equivalent to
x <- fun(x)

Piping also provides a better overview of the flow of actions compared to nested functions:

# Nested functions
went_to_bed(had_dinner(programmed_some_r(had_lunch(programmed_some_r(had_breakfast(got_up(day)))))))

# vs pipes
day %>%
  got_up() %>%
  had_breakfast() %>%
  programmed_some_r() %>%
  had_lunch() %>%
  programmed_some_r() %>%
  had_dinner() %>%
  went_to_bed()

Handling special data formats

It is not part of this introductory lecture, but you might soon have to deal with two common special formats: date-times (time-codes) and strings (text). When that point comes, just check the following to get started (and, if necessary, branch out to the further sources suggested there):

  • Strings: R for Data Science (Grolemund & Wickham) Chapter 14
  • DateTimes: R for Data Science (Grolemund & Wickham) Chapter 16

Additional Info

R, notebooks & markdown

While many people prefer to work with R scripts, computational notebooks are, especially in the data science community, more popular among R users. This is mainly done in the R Markdown format, which combines markdown notation with executable code and result outputs. All the pretty html notebooks I create for you are also done in that way. For further information and to get started, check:

R, google colab & co.

Google Colab does not officially support R kernels. However, there is a little trick to make R run with Colab.

If you want to start from scratch, do the following:

  • You can simply run the demo.ipynb from IRkernel Github
  • Make changes and then save a copy to your Google Drive.
  • You can also see all 3 example notebooks here.

If you already have an R-Markdown notebook:

  • Use the IRkernel to create a .ipynb out of your .rmd
  • If you don’t want to do it locally, use this Colab notebook instead. Just upload the .rmd, run the code (alter the filename), and download the resulting .ipynb. This can now be uploaded to Colab.

Endnotes

References

Further infos

Session Info

sessionInfo()
