In this session, you will learn the basic R syntax, including the tidyverse and piping.
In contrast to more general-purpose programming languages, base-R syntax is already very much geared towards working with data. Many slicing and dicing functionalities that elsewhere require additional libraries (e.g. pandas in Python) come with R out of the box, since it is a dedicated statistical programming language. However, over the last years the traditional R syntax has largely been replaced by the tidy principles promoted by RStudio and implemented in their tidyverse ecosystem.
We will not use too much base-R syntax in our time together. However, it is still important to know the basics, since there are always situations where you just cannot avoid it.
You can assign a value to an object using assign(), <-, or =.
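As a minimal sketch, the three forms look like this. The values 3 and 4 are an assumption: the definitions of x and y are not shown here, and these values are merely consistent with the results printed below.

```r
x <- 3            # arrow assignment (assumed value)
assign("y", 4)    # assign() takes the object name as a string (assumed value)
w = x + y         # '=' also assigns at the top level; w is a hypothetical name
w
```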
z <- x + 17*y # Assignment
z
[1] 71
Comparisons return boolean values: TRUE or FALSE (often abbreviated to T and F)
x <= y
[1] TRUE
Special constants include NA, NULL, Inf, -Inf, and NaN.
NA indicates missing or undefined data
mean(c(1, 2, NA, 4, 5), na.rm = TRUE)
[1] 3
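A short sketch of how NA behaves, using only base R: any arithmetic involving NA yields NA, which is why mean() needs na.rm = TRUE to return a number. The vector vals is a hypothetical example.

```r
vals <- c(1, 2, NA, 4, 5)   # hypothetical example vector
vals + 1                    # NA propagates through arithmetic
is.na(vals)                 # logical vector marking the missing entries
mean(vals)                  # NA, because one element is missing
sum(is.na(vals))            # number of missing values
```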
NULL indicates an empty object, e.g. a null/empty list
10 + NULL # using NULL in arithmetic returns an empty object (length zero)
numeric(0)
is.null(NULL) # check if NULL
[1] TRUE
Inf and -Inf represent positive and negative infinity. They can be returned by mathematical operations like division of a non-zero number by zero.
5/0
[1] Inf
is.finite(5/0) # Check if a number is finite
[1] FALSE
is.infinite(5/0) # Check if a number is infinite
[1] TRUE
NaN (Not a Number): the result of an operation that cannot be reasonably defined, such as 0/0
is.nan(0/0)
[1] TRUE
v1 <- c(1, 5, 11, 33) # Numeric vector, length 4
v1
[1] 1 5 11 33
v2 <- c("hello","world") # Character vector, length 2 (a vector of strings)
v2
[1] "hello" "world"
v3 <- c(TRUE, TRUE, FALSE) # Logical vector, same as c(T, T, F)
v3
[1] TRUE TRUE FALSE
Combining different types of elements in one vector will coerce the elements to the least restrictive type:
v4 <- c(v1,v2,v3,"boo") # All elements turn into strings
v4
[1] "1" "5" "11" "33" "hello" "world" "TRUE" "TRUE" "FALSE" "boo"
Element-wise operations:
v1 + c(1,7)
[1] 2 12 12 40
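The result above relies on recycling: the shorter vector c(1, 7) is repeated until it matches the length of v1. A minimal sketch that spells this out (v1 is redefined here so that the block runs on its own; short and recycled are hypothetical names):

```r
v1 <- c(1, 5, 11, 33)
short <- c(1, 7)                                  # the shorter operand
recycled <- rep(short, length.out = length(v1))   # 1 7 1 7
v1 + recycled                                     # same result as v1 + c(1, 7)
identical(v1 + short, v1 + recycled)
```

R issues a warning when the longer length is not a multiple of the shorter one, which is usually a sign of a mistake.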
Mathematical operations:
cor(v1,v1*5)
[1] 1
Logical operations:
v1 > 2 # Each element is compared to 2, returns logical vector
[1] FALSE TRUE TRUE TRUE
v1==v2 # Are corresponding elements equivalent, returns logical vector.
[1] FALSE FALSE FALSE FALSE
v1!=v2 # Are corresponding elements *not* equivalent? Same as !(v1==v2)
[1] TRUE TRUE TRUE TRUE
(v1>2) | (v2>0) # | is the boolean OR, returns a vector.
[1] TRUE TRUE TRUE TRUE
(v1>2) & (v2>0) # & is the boolean AND, returns a vector.
[1] FALSE TRUE TRUE TRUE
(v1>2) || (v2>0) # || is the scalar OR: older R versions use only the first element of each side; R >= 4.3.0 raises an error for operands longer than 1
[1] TRUE
(v1>2) && (v2>0) # && is the scalar AND: same rule, only the first elements are compared in older R versions
[1] FALSE
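If a single TRUE or FALSE summarising a whole vector is what you need, any() and all() are an explicit alternative. A small sketch (v1 is redefined so the block is self-contained):

```r
v1 <- c(1, 5, 11, 33)
any(v1 > 2)                  # TRUE: at least one element is greater than 2
all(v1 > 2)                  # FALSE: the first element is not
any(v1 > 2) && all(v1 > 0)   # both operands have length 1, so && is safe here
```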
Addressing vector elements:
v1[v1>3]
[1] 5 11 33
NOTE: If you are used to languages indexing from 0 (e.g. Python), R will surprise you by indexing from 1.
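A few more indexing patterns, as a sketch (v1 is redefined so the block runs on its own). Note that a negative index drops elements in R rather than counting from the end as in Python.

```r
v1 <- c(1, 5, 11, 33)
v1[1]            # first element
v1[2:3]          # elements 2 and 3
v1[c(1, 4)]      # elements 1 and 4
v1[-1]           # everything except the first element
v1[length(v1)]   # last element
```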
To add more elements to a vector, simply assign them values.
v1[6:10] <- 6:10
v1
[1] 1 5 11 33 NA 6 7 8 9 10
We can also directly assign the vector a length:
length(v1) <- 15 # the last 5 elements are added as missing data: NA
v1
[1] 1 5 11 33 NA 6 7 8 9 10 NA NA NA NA NA
Factors are used to store categorical data.
eye.col.v <- c("brown", "green", "brown", "blue", "blue", "blue") #vector
eye.col.f <- factor(c("brown", "green", "brown", "blue", "blue", "blue")) #factor
eye.col.v
[1] "brown" "green" "brown" "blue" "blue" "blue"
eye.col.f
[1] brown green brown blue blue blue
Levels: blue brown green
R will identify the different levels of the factor, i.e. all distinct values. The data is stored internally as integers, each number corresponding to a factor level.
levels(eye.col.f) # The levels (distinct values) of the factor (categorical variable)
[1] "blue" "brown" "green"
as.numeric(eye.col.f) # The factor as numeric values: 1 is blue, 2 is brown, 3 is green
[1] 2 3 2 1 1 1
as.numeric(eye.col.v) # The character vector, however, cannot be coerced to numeric (NAs, with a warning)
[1] NA NA NA NA NA NA
as.character(eye.col.f)
[1] "brown" "green" "brown" "blue" "blue" "blue"
as.character(eye.col.v)
[1] "brown" "green" "brown" "blue" "blue" "blue"
A matrix is a vector with dimensions:
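One way to obtain the matrix printed below is to attach a dim attribute to a plain vector. This is a sketch under the assumption that m was built this way; the defining code is not shown here.

```r
m <- rep(1, 20)      # a plain numeric vector of twenty 1s
dim(m) <- c(5, 4)    # give it 5 rows and 4 columns; it is now a matrix
is.matrix(m)
```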
m
[,1] [,2] [,3] [,4]
[1,] 1 1 1 1
[2,] 1 1 1 1
[3,] 1 1 1 1
[4,] 1 1 1 1
[5,] 1 1 1 1
Create a matrix using matrix():
m <- matrix(data=1, nrow=5, ncol=4) # same matrix as above, 5x4, full of 1s
m <- matrix(1,5,4) # same matrix as above (lazy style)
dim(m) # What are the dimensions of m?
[1] 5 4
m
[,1] [,2] [,3] [,4]
[1,] 1 1 1 1
[2,] 1 1 1 1
[3,] 1 1 1 1
[4,] 1 1 1 1
[5,] 1 1 1 1
Create a matrix by combining vectors:
m <- cbind(1:5, 5:1, 5:9) # Bind 3 vectors as columns, 5x3 matrix
m <- rbind(1:5, 5:1, 5:9) # Bind 3 vectors as rows, 3x5 matrix
m <- matrix(1:10,10,10)
m
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 1 1 1 1 1 1 1 1 1
[2,] 2 2 2 2 2 2 2 2 2 2
[3,] 3 3 3 3 3 3 3 3 3 3
[4,] 4 4 4 4 4 4 4 4 4 4
[5,] 5 5 5 5 5 5 5 5 5 5
[6,] 6 6 6 6 6 6 6 6 6 6
[7,] 7 7 7 7 7 7 7 7 7 7
[8,] 8 8 8 8 8 8 8 8 8 8
[9,] 9 9 9 9 9 9 9 9 9 9
[10,] 10 10 10 10 10 10 10 10 10 10
Select matrix elements:
m[-1,] # all rows *except* the first one
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 2 2 2 2 2 2 2 2 2 2
[2,] 3 3 3 3 3 3 3 3 3 3
[3,] 4 4 4 4 4 4 4 4 4 4
[4,] 5 5 5 5 5 5 5 5 5 5
[5,] 6 6 6 6 6 6 6 6 6 6
[6,] 7 7 7 7 7 7 7 7 7 7
[7,] 8 8 8 8 8 8 8 8 8 8
[8,] 9 9 9 9 9 9 9 9 9 9
[9,] 10 10 10 10 10 10 10 10 10 10
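Other common selections follow the same [row, column] pattern. A brief sketch (m is redefined so the block is self-contained):

```r
m <- matrix(1:10, 10, 10)
m[2, 3]        # row 2, column 3
m[2, ]         # the whole second row
m[, 2]         # the whole second column
m[1:2, 4:6]    # rows 1-2, columns 4-6 (a 2x3 matrix)
```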
Conditional operations
m[m > 3]
[1] 4 5 6 7 8 9 10 4 5 6 7 8 9 10 4 5 6 7 8 9 10 4 5 6 7 8 9 10 4 5 6 7 8 9 10 4 5
[38] 6 7 8 9 10 4 5 6 7 8 9 10 4 5 6 7 8 9 10 4 5 6 7 8 9 10 4 5 6 7 8 9 10
Other matrix manipulation
m * m
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 1 1 1 1 1 1 1 1 1
[2,] 4 4 4 4 4 4 4 4 4 4
[3,] 9 9 9 9 9 9 9 9 9 9
[4,] 16 16 16 16 16 16 16 16 16 16
[5,] 25 25 25 25 25 25 25 25 25 25
[6,] 36 36 36 36 36 36 36 36 36 36
[7,] 49 49 49 49 49 49 49 49 49 49
[8,] 64 64 64 64 64 64 64 64 64 64
[9,] 81 81 81 81 81 81 81 81 81 81
[10,] 100 100 100 100 100 100 100 100 100 100
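Note that m * m multiplies element by element. A brief sketch of the difference to the matrix product, using a small hypothetical matrix s so the numbers stay readable:

```r
s <- matrix(1:4, nrow = 2)   # filled column by column: rows are (1, 3) and (2, 4)
s * s                        # element-wise squares
s %*% s                      # matrix product
t(s)                         # transpose
```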
Arrays are created with the array() function:
a <- array(data=1:18,dim=c(3,3,2)) # 3d with dimensions 3x3x2
a <- array(1:18,c(3,3,2)) # the same array
a
, , 1
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
, , 2
[,1] [,2] [,3]
[1,] 10 13 16
[2,] 11 14 17
[3,] 12 15 18
Since this array has three dimensions, a third index can be used for slicing and dicing.
a[1,3,2]
[1] 16
Lists are collections of objects (e.g. of strings, vectors, matrices, other lists, etc.)
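A sketch of how the lists used below might have been created. Their definitions are not shown here, so everything except the element name boo is an assumption, including the three other elements of l1 and their names.

```r
v1 <- c(1, 5, 11, 33, NA, 6:10, rep(NA, 5))   # v1 as it stands at this point
# Four elements of different types; foo, moo and zoo are hypothetical
l1 <- list(boo = v1, foo = "a string", moo = c(TRUE, FALSE), zoo = 1:3)
l3 <- list()   # an empty list
l4 <- list()   # another empty list
length(l1)
```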
l1$boo
[1] 1 5 11 33 NA 6 7 8 9 10 NA NA NA NA NA
Add more elements to a list:
l3[[1]] <- 11 # add an element to the empty list l3
l4[[3]] <- c(22, 23) # add a vector as element 3 in the empty list l4.
# Since we added element 3, elements 1 & 2 will be generated and empty (NULL)
l1[[5]] <- "More elements!" # The list l1 had 4 elements, we're adding a 5th here.
l1[[8]] <- 1:11 # We added an 8th element, but not 6th or 7th. Those will be created empty (NULL)
l1$Something <- "A thing" # Adds a ninth element - "A thing", named "Something"
The data frame is a special kind of list used for storing dataset tables. Think of rows as cases, columns as variables. Each column is a vector or factor.
Note: While base R uses the data.frame, we will later, when working with the tidyverse, use the tibble instead, which works the same way but changes some annoying behaviors of the original data type (e.g. no default conversion of strings to factors, no row names; more on that later).
Creating a dataframe:
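The code that builds dfr1 is not shown here. A minimal reconstruction consistent with the outputs in this section could look as follows; the column name ID and the Female values are assumptions inferred from those outputs.

```r
dfr1 <- data.frame(
  ID        = 1:4,                                       # assumed column name
  FirstName = c("Jesper", "Jonas", "Pernille", "Helle"),
  Female    = c(FALSE, FALSE, TRUE, TRUE),               # assumed values
  Age       = c(22, 33, 44, 55)
)
str(dfr1)   # compact overview of columns and their types
```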
dfr1$FirstName
[1] "Jesper" "Jonas" "Pernille" "Helle"
Notice that in R versions before 4.0.0, data.frame() treated strings like this as a categorical variable and stored them as a factor, not a character vector, unless you set stringsAsFactors=FALSE. I find that annoying. Since R 4.0.0 the default is stringsAsFactors=FALSE, and the tibble (introduced later) never converts strings to factors.
dfr2 <- data.frame(FirstName=c("John","Jim","Jane","Jill"), stringsAsFactors=FALSE)
dfr2$FirstName # Success: not a factor.
[1] "John" "Jim" "Jane" "Jill"
Access elements of the data frame. Notation is dfr[row, column].
Rows can be accessed by number or condition, columns by number or name. Alternatively, columns can be accessed by dfr$column.
dfr1[1,] # First row, all columns
dfr1[,1] # First column, all rows
[1] 1 2 3 4
dfr1$Age # Age column, all rows
[1] 22 33 44 55
dfr1[1:2,3:4] # Rows 1 and 2, columns 3 and 4 - the gender and age of Jesper & Jonas
dfr1[c(1,3),] # Rows 1 and 3, all columns
Find the names of everyone over the age of 30 in the data
dfr1[dfr1$Age>30,2]
[1] "Jonas" "Pernille" "Helle"
Find the average age of all females in the data:
mean(dfr1[dfr1$Female==TRUE,4])
[1] 49.5
Loops are powerful little helpers for repeating the same operation over a number of items.
If statements: if (condition) expr1 else expr2
x <- 5; y <- 10
if (x==0) y <- 0 else y <- y/x
y
[1] 2
for loops: for (variable in sequence) expr
for (i in 1:x) { print(paste("OMG, i just counted to", i)) }
[1] "OMG, i just counted to 1"
[1] "OMG, i just counted to 2"
[1] "OMG, i just counted to 3"
[1] "OMG, i just counted to 4"
[1] "OMG, i just counted to 5"
While loop: while (condition) expr
while (x > 0) {print(x); x <- x-1;}
[1] 5
[1] 4
[1] 3
[1] 2
[1] 1
Repeat loop: repeat expr, use break to exit the loop
repeat { print(x); x <- x+1; if (x>7) break}
[1] 0
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
While I generate many (and often very creative) errors in R, there are three simple things that most often go wrong for me. Those include:
Capitalization: R is case-sensitive, so rowSums won’t work as “rowsums” or “RowSums”.
# install.packages('dplyr') # install a package
# library(dplyr) # load a package
# detach(package:dplyr) # detach a package
For more advanced troubleshooting, check out try(), tryCatch(), and debug().
?tryCatch
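As a minimal sketch of tryCatch(), the hypothetical helper safe_log() below returns NA instead of stopping or warning. It only uses base R; the function name and the choice to return NA are illustrative assumptions.

```r
# Return NA_real_ whenever log() signals a warning or an error
safe_log <- function(x) {
  tryCatch(
    log(x),
    warning = function(w) NA_real_,   # e.g. log(-1) warns "NaNs produced"
    error   = function(e) NA_real_    # e.g. log("a") is an error
  )
}

safe_log(10)    # works as usual
safe_log(-1)    # the warning is caught, NA is returned
safe_log("a")   # the error is caught, NA is returned
```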
Generally, just using ?functionyouwonderabout often solves problems. There you can review the function's arguments, inputs, outputs, syntax etc.
Base R comes with quite some functionality for slicing and dicing data, and there also exists a myriad of specialized packages for trickier data manipulation. To read others’ code and examples, as well as to perform some special operations, you should all be able to use standard R syntax.
However, factors, the [row, column] syntax, and so forth are not very comfortable or intuitive. Further, for trickier operations such as certain aggregations, one has to rely on a variety of packages, which often come with their own syntax.
The good news is: the efforts of a small set of key developers (foremost Hadley Wickham) have led to the development of the tidyverse, an ecosystem of R packages designed particularly for data science applications. All packages share an underlying design philosophy, common API, grammar, and data structures.
Among the most amazing contributions here is dplyr, a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges. I use dplyr for 90% of my data-manipulation tasks for the following reasons:
- It is implemented in C++, making it faster than most base R
- It works with the %>% pipe-operator of magrittr (more on that later)
- Its syntax resembles SQL and other data-management languages
- It can work with SQL databases (with addon packages DBI and dbplyr)
I will not touch on all packages there, but the complete tidyverse covers almost all issues of data manipulation. They all operate under the same logic, are fast, and are usually your best choice for almost any given problem. dplyr in particular is enormously powerful and has a lot more functions than the basics I cover here. So, for every given problem, your first question (to yourself or Stack Overflow) should be:
1: Is there a way to solve my problem in dplyr?
2: If not, is there another tidyverse package dedicated to this problem?
For the sake of illustration, I will load every package of the tidyverse one by one when we need it. However, normally I just load library(tidyverse) all at once, since I need a lot of these packages often anyhow.
library(tidyverse) # Collection of all the good stuff like dplyr, ggplot2 etc.
library(magrittr) # For extra-piping operators (eg. %<>%)
Tibbles are the tidyverse version of the traditional dataframe. They work in largely the same way, with some small differences that are usually seen as an improvement from a data science perspective.
They can be created in 3 different ways:
1: Directly, with tibble()
2: By calling the as_tibble() function on a table
3: By using certain dplyr functions on a dataframe (e.g. group_by()), whose result is a tibble
head(iris) # a dataframe
head(as_tibble(iris))
The tibble is usually the preferred format for data science projects in R.
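One practical difference between the two can be sketched as follows, assuming the tibble package (part of the tidyverse) is installed: selecting a single column with [ , ] drops a data frame to a plain vector, while a tibble stays a tibble. The objects df and tb are hypothetical examples.

```r
library(tibble)

df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)
tb <- tibble(id = 1:3, name = c("a", "b", "c"))

class(df[, "name"])   # a character vector: the data frame structure is dropped
class(tb[, "name"])   # still a tibble with one column
```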
In traditional R syntax, data manipulations are carried out one by one. For example, one would first assign a new variable x$numbers <- 1:5, then maybe manipulate it x$numbers <- x$numbers * 2, and subset it x <- x[x$numbers > 4, ]. dplyr makes use of magrittr’s pipes, written like %>%.
A pipe means: take the output of its left-hand side and insert it as the first input in the function on the right-hand side. Accordingly, all dplyr functions follow the convention that their first input is always the data to be manipulated. Therefore, they can all be “piped”.
x <- tibble(numbers = 1:5)
Let’s say we want to multiply all numbers by 2, and THEN subset the data for observations with a number larger than 4. In base R, we could do the following:
y <- x
y$numbers <- y$numbers * 2
y <- y[y$numbers > 4, ]
y
With pipes, we could instead write the following (don’t worry about the other syntax yet):
x %>%
  mutate(numbers = numbers * 2) %>%
  filter(numbers > 4)
It basically reads like:
1: Take the data x (a tibble) with the variable “numbers”, holding the values 1:5.
2: Then multiply numbers by 2.
3: Then keep only the observations where numbers is larger than 4.
At first this does not look very intuitive, but it will become second nature. Using pipes facilitates fast, reproducible and easily readable coding practices, and all of you are encouraged to go on with that.
Note: %>% pipes do not automatically assign their output to the left-hand side object, meaning the original dataset will not per se be overwritten. To do that, there are two ways:
1: At the start of the pipe, assign the output to the original data with <-
2: At the start of the pipe, use magrittr’s %<>% operator, meaning: assign and pipe.
# This will create an output, but not change x
x %>%
filter(numbers > 5)
# This will re-assign x
x <- x %>%
filter(numbers > 5)
# is equivalent to
x %<>%
filter(numbers > 5)
In conclusion: The pipe basically passes a dataframe on between functions in the following way:
# Only pseudo code here, does not run
x %>% fun(na.rm = TRUE)
# Is equivalent to
fun(x, na.rm = TRUE)
# While
x %<>% fun()
# Is equivalent to
x <- fun(x)
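These equivalences can be checked with a small runnable sketch. It assumes the magrittr package is installed; v and w are hypothetical example vectors, and mean() and sqrt() merely stand in for fun().

```r
library(magrittr)

v <- c(1, 2, NA, 4, 5)
a <- v %>% mean(na.rm = TRUE)   # piped call
b <- mean(v, na.rm = TRUE)      # the same call written the classic way
identical(a, b)

w <- c(4, 9, 16)
w %<>% sqrt()                   # same as w <- sqrt(w)
w
```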
Piping also provides a better overview of the flow of actions compared to nested functions:
# Nested functions
went_to_bed(had_dinner(programmed_some_r(had_lunch(programmed_some_r(had_breakfast(got_up(day)))))))
# vs pipes
day %>%
got_up() %>%
had_breakfast() %>%
programmed_some_r() %>%
had_lunch() %>%
programmed_some_r() %>%
had_dinner() %>%
went_to_bed()
It is not part of this introductory lecture, but you might soon find that you have to deal with two common formats, date-times (time codes) and strings (text). When that point comes, just check the following to get started (and, if necessary, branch out to the further sources suggested):
While many people prefer to work with R scripts, computational notebooks are, especially in the data science community, more popular among R users. This is mainly done in the R Markdown format, which combines markdown notation with executable code and result outputs. All the pretty html notebooks I create for you are also done that way. For further information and to get started, check:
Google Colab does not officially support R kernels. However, there is a little trick to make R run with Colab.
If you want to start from scratch, do the following:
- Use the demo.ipynb from the IRkernel GitHub
If you already have an R Markdown notebook:
- Create an .ipynb out of your .rmd
- With your .rmd, run the code (alter the filename), and download the resulting .ipynb. This can now be uploaded to Colab.
sessionInfo()