Statistical Programming with R

Recap

  • So far we have learned the basics of programming in R:
    • How to assign elements to objects (<-)
    • How to run code
    • How to save R-scripts
    • How to manage projects in RStudio
    • How to create notebooks or markdown HTML files
    • How to find help
  • Now we will add a few more building blocks

Lecture B - Datatypes and syntax

  • In this lecture we will look at the following subjects:
    • Basic datatypes
    • Vectors, matrices, data frames, arrays, and lists
    • Logical operators
    • Factors
    • Missingness

But first…

  • By default, RStudio saves all objects in your environment on exit and restores them at startup
  • This is bad for reproducibility…
  • To disable it:
    • Click on Tools
    • Global Options…
    • Remove checkmark in “Restore .RData into workspace at startup”
    • Set the option “Save workspace to .RData on exit” to “Never”
  • This plays nicely together with “Restart R” (Ctrl + Shift + F10)

Datatypes (and conversion)

Datatypes

  • The smallest object in R is called an atomic element.
  • Three basic datatypes:
    • Numeric (double, integer and complex)
    • Character
    • Logical

Datatypes

  • Numeric
a_number <- 5.2
  • Numeric operators
a_number+2
## [1] 7.2
  • Is it numeric?
is.numeric(a_number)
## [1] TRUE

Datatypes

  • Character
a_character <- "A"
a_sentence <- "A full sentence"
not_a_number <- "5.2"
  • Difference in printing
not_a_number
## [1] "5.2"
  • Is it a character?
is.character(a_character)
## [1] TRUE

Datatypes

  • Logical
logic <- TRUE
no_logic <- FALSE
  • Logical operators
TRUE & FALSE
## [1] FALSE
TRUE | FALSE
## [1] TRUE

Conversion

  • You can convert types with the as.<datatype>() functions, e.g. as.numeric():
a_number + not_a_number
## Error in a_number + not_a_number: non-numeric argument to binary operator
a_number + as.numeric(not_a_number)
## [1] 10.4
  • Not all conversions are possible
as.numeric(a_sentence)
## Warning: NAs introduced by coercion
## [1] NA

Conversion

  • You can always convert
    • Logical to numeric (FALSE=0 and TRUE=1)
    • Numeric to logical (FALSE if numeric is 0 else always TRUE)
    • Numeric to character (6.7 will become "6.7")
  • You can convert character to numeric if it contains ONLY a valid number (e.g. "5.2")
  • You can convert character to logical if it is ONLY a recognised spelling such as "TRUE" or "FALSE"
  • If R does not know how to convert, you get a missing value (NA)
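The conversion rules above can be tried out directly. The following is a minimal sketch; the variable names are made up for illustration and the input strings are arbitrary examples.

```r
# Conversions that always succeed
num_from_true  <- as.numeric(TRUE)     # 1
num_from_false <- as.numeric(FALSE)    # 0
lgl_from_zero  <- as.logical(0)        # FALSE
lgl_from_other <- as.logical(-2.5)     # TRUE: any non-zero number
chr_from_num   <- as.character(6.7)    # "6.7"

# Conversions that depend on the content of the string
num_ok  <- as.numeric("5.2")           # 5.2
lgl_ok  <- as.logical("TRUE")          # TRUE
lgl_bad <- as.logical("yes")           # NA: not a recognised spelling

# "5,2" is not a valid number for R; without suppressWarnings()
# R would also print the "NAs introduced by coercion" warning
num_bad <- suppressWarnings(as.numeric("5,2"))  # NA
```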

Check datatype

  • If you do not know the datatype you can use the typeof-function to get it
typeof(a_number)
## [1] "double"
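As a small illustration of what typeof() reports for the basic datatypes, the sketch below applies it to a few literal values. The L suffix, which asks R for an integer instead of a double, is not used elsewhere in these slides.

```r
typeof(5.2)       # "double"
typeof(5L)        # "integer": the L suffix requests an integer
typeof(1:5)       # "integer": the colon operator yields integers
typeof(2 + 3i)    # "complex"
typeof("A")       # "character"
typeof(TRUE)      # "logical"

# Doubles and integers both count as numeric
is.numeric(5.2)   # TRUE
is.numeric(5L)    # TRUE
```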

Vectors and matrices

Numerical vectors

  • Several elements of the same type, e.g. the numbers from 1 to 5, can be concatenated into a vector with c()
a <- c(1, 2, 3, 4, 5)
a
## [1] 1 2 3 4 5

Character vectors

  • Similarly we can add a character to our vector
a.new <- c(a, "A")
a.new
## [1] "1" "2" "3" "4" "5" "A"
  • Notice the difference with a from the previous slide
a
## [1] 1 2 3 4 5
  • This is because vectors can only contain one datatype!

Vectors

  • To call single elements in vectors, e.g. if we want to refer to the third element of a, we type
a[3]
## [1] 3
  • Or several elements
a[c(1,3)]
## [1] 1 3
  • This is called indexing and used to be the way to subset data - we will see an easier way to do this later…

Matrices

  • A rectangular organization of n x m elements of same datatype in n rows and m columns is known as a matrix, both in R and more generally
c <- matrix(a, nrow = 5, ncol = 2)
c
##      [,1] [,2]
## [1,]    1    1
## [2,]    2    2
## [3,]    3    3
## [4,]    4    4
## [5,]    5    5
  • Notice that R is recycling a to fill all 10 cells without letting us know!
  • Also notice that the name of the matrix could cause trouble, because c is also the name of the function c()!
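Both remarks can be checked with a short sketch. The names values and m are chosen here for illustration; the last lines reuse the name c on purpose to show what happens, and then remove it again.

```r
values <- c(1, 2, 3, 4, 5)

# 5 values for 10 cells: R silently reuses the values for the second column
m <- matrix(values, nrow = 5, ncol = 2)
identical(m[, 1], m[, 2])   # TRUE

# Naming an object c does not break the function c():
# when c is called as a function, R skips objects that are not functions
c <- m
still_works <- c(1, 2)      # the vector 1 2
is.matrix(c)                # TRUE: but c now also refers to our matrix

# Clean up; a name such as m avoids the confusion altogether
rm(c)
```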

  • Referencing elements in a matrix is done by indices, e.g. the first row is called by
c[1, ]
## [1] 1 1

  • The second column is called by
c[, 2]
## [1] 1 2 3 4 5

  • The intersection of the first row and second column is called by
c[1, 2]
## [1] 1
  • In short, the square brackets [] are used to call elements, rows, columns (and much more beyond the scope of this course)

Trying to mix numerics and chars

  • If we add a character column to matrix c everything becomes chars
cbind(c, letters[1:5])
##      [,1] [,2] [,3]
## [1,] "1"  "1"  "a" 
## [2,] "2"  "2"  "b" 
## [3,] "3"  "3"  "c" 
## [4,] "4"  "4"  "d" 
## [5,] "5"  "5"  "e"
  • Vectors and matrices are 1D or 2D structures holding elements of the same type (numeric, character or logical)
  • If we try to mix two types R will automatically cast in the following order:
    logical->numeric->character
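The casting order can be seen by mixing types inside c(). This is a minimal sketch with made-up values; the object names are illustrative only.

```r
lgl_num <- c(TRUE, FALSE, 2)   # logical -> numeric: 1 0 2
num_chr <- c(1.5, "a")         # numeric -> character: "1.5" "a"
lgl_chr <- c(TRUE, "a")        # logical -> character: "TRUE" "a"

typeof(lgl_num)                # "double"
typeof(num_chr)                # "character"

# The same casting makes it possible to count TRUE values
sum(c(TRUE, TRUE, FALSE))      # 2
```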

Data frames

Data frames

  • To keep variables of different types in a 2D structure, the simplest answer is a data frame
d <- data.frame("V1" = rnorm(5),
                "V2" = rnorm(5, mean = 5, sd = 2), 
                "V3" = letters[1:5])
d
##            V1       V2 V3
## 1 -0.56047565 8.430130  a
## 2 -0.23017749 5.921832  b
## 3  1.55870831 2.469878  c
## 4  0.07050839 3.626294  d
## 5  0.12928774 4.108676  e
  • Here a data frame is constructed from two randomly generated samples from the normal distribution, where \(V1\) is standard normal and \(V2 \sim N(5,2^2)\), together with a character vector

Data frames

  • Data frames can contain all datatypes
  • Each column can only contain one datatype
  • Each row has the same columns
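To see these three properties on a concrete object, the sketch below builds a small data frame with hypothetical columns (id, score, passed, name) and inspects the datatype of each column. stringsAsFactors = FALSE is set explicitly so that the result does not depend on the R version.

```r
df <- data.frame(id     = 1:3,
                 score  = c(2.5, 3.1, 4.0),
                 passed = c(TRUE, FALSE, TRUE),
                 name   = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

# One datatype per column, but different datatypes across columns
col_classes <- sapply(df, class)
col_classes   # id: "integer", score: "numeric", passed: "logical", name: "character"

# Every row has the same 4 columns
dim(df)       # 3 4
```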

Data frames

  • The way to obtain the third row from the data frame d is:
d[3, ]
##         V1       V2 V3
## 3 1.558708 2.469878  c

Data frames

  • Calling columns in data frames can be done in precisely the same way
d[, "V2"]
## [1] 8.430130 5.921832 2.469878 3.626294 4.108676
d[, 2]
## [1] 8.430130 5.921832 2.469878 3.626294 4.108676

Data frames

  • And the intersection can be called, just like with matrices:
d[3:5,"V2"]
## [1] 2.469878 3.626294 4.108676
  • Notice the 3:5 notation. In R this is shorthand for the integers from 3 to 5
3:5
## [1] 3 4 5

Data frames

  • However, we can also use $ to call variable names in data frame objects
d$V2
## [1] 8.430130 5.921832 2.469878 3.626294 4.108676
  • Because d$V2 is itself a vector, we can index it directly
d$V2[3:4]
## [1] 2.469878 3.626294
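The different ways of calling a column return the same vector. The sketch below checks this on a small, hypothetical data frame; it also shows the drop = FALSE argument of the square brackets, which is not covered in these slides and keeps the result a data frame.

```r
df <- data.frame(V1 = c(10, 20, 30),
                 V2 = c("x", "y", "z"),
                 stringsAsFactors = FALSE)

# Three equivalent ways to get column V1 as a vector
identical(df[, "V1"], df$V1)     # TRUE
identical(df[["V1"]], df$V1)     # TRUE

# Indexing the resulting vector
second_third <- df$V1[2:3]       # 20 30

# Keep the one-column result as a data frame instead of a vector
one_col <- df[, "V1", drop = FALSE]
is.data.frame(one_col)           # TRUE
```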

Beyond two dimensions: Arrays and lists

Arrays

  • If you wish to use single datatype objects that have more than two dimensions, an array would be a suitable object.

  • The following code yields a 3-dimensional array, which may be thought of as 2 layers of a 3x4 matrix:

e <- array(1:24, dim = c(3, 4, 2))
e
## , , 1
## 
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
## 
## , , 2
## 
##      [,1] [,2] [,3] [,4]
## [1,]   13   16   19   22
## [2,]   14   17   20   23
## [3,]   15   18   21   24

Arrays

  • The square bracket identification works similarly to the identification of matrices and data frames, but with the added dimension(s).
  • So to get the element in the first row of the third column in the second matrix you type
e[1, 3, 2]
## [1] 19
  • This is the downside of an array: it is a series of matrices, meaning that characters and numerical elements may not be mixed…
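A short sketch of array indexing, using an array built the same way as e but stored under the illustrative name arr. The position calculation at the end assumes R's column-major storage: rows run fastest, then columns, then layers.

```r
arr <- array(1:24, dim = c(3, 4, 2))
dim(arr)                      # 3 4 2

elem   <- arr[1, 3, 2]        # 19: row 1, column 3, layer 2
layer2 <- arr[, , 2]          # the whole second 3x4 matrix
row2   <- arr[2, , ]          # row 2 of both layers: a 4x2 matrix

# The same element via its position in the underlying vector
pos <- 1 + (3 - 1) * 3 + (2 - 1) * 3 * 4
arr[pos]                      # 19
```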

Arrays

  • So, we get problems if we replace the second matrix in the array by a character version of that same matrix
e[, , 2] <- as.character(e[, , 2])
e
## , , 1
## 
##      [,1] [,2] [,3] [,4]
## [1,] "1"  "4"  "7"  "10"
## [2,] "2"  "5"  "8"  "11"
## [3,] "3"  "6"  "9"  "12"
## 
## , , 2
## 
##      [,1] [,2] [,3] [,4]
## [1,] "13" "16" "19" "22"
## [2,] "14" "17" "20" "23"
## [3,] "15" "18" "21" "24"
  • As with matrices, the entire array is cast to characters…

The solution: Lists

  • Lists are just what the name says (i.e. lists), and the elements of a list can be of different types
  • For example, create a simple list by typing
f <- list(a)
f
## [[1]]
## [1] 1 2 3 4 5
  • Elements or objects within lists can be called by using double square brackets, e.g. the first (and only) element in list f is vector a
f[[1]]
## [1] 1 2 3 4 5
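The difference between single and double square brackets on a list is easy to miss. In this minimal sketch (with the illustrative name lst), single brackets return a shorter list, while double brackets return the element itself.

```r
lst <- list(c(1, 2, 3, 4, 5))

sub_list <- lst[1]       # still a list, containing one element
element  <- lst[[1]]     # the numeric vector itself

class(sub_list)          # "list"
class(element)           # "numeric"

# Only the extracted element can be indexed like a vector
third <- lst[[1]][3]     # 3
```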

List

  • We can simply add an object or element to an existing list
f[[2]] <- d
f
## [[1]]
## [1] 1 2 3 4 5
## 
## [[2]]
##            V1       V2 V3
## 1 -0.56047565 8.430130  a
## 2 -0.23017749 5.921832  b
## 3  1.55870831 2.469878  c
## 4  0.07050839 3.626294  d
## 5  0.12928774 4.108676  e
  • Now our list consists of a vector and a data frame!

List

  • We can add names to the list as follows
names(f) <- c("vector", "data frame")
f
## $vector
## [1] 1 2 3 4 5
## 
## $`data frame`
##            V1       V2 V3
## 1 -0.56047565 8.430130  a
## 2 -0.23017749 5.921832  b
## 3  1.55870831 2.469878  c
## 4  0.07050839 3.626294  d
## 5  0.12928774 4.108676  e

List

  • With named elements we can call the vector a from the list as follows
f[[1]]
## [1] 1 2 3 4 5
f[["vector"]]
## [1] 1 2 3 4 5
f$vector
## [1] 1 2 3 4 5

Lists in lists

  • A construct more useful than it initially seems is a list of lists
  • Here we have a list with two identical elements
g <- list(f, f)

Lists in lists

  • To call the vector from the second list within the list g, use either of the following
g[[2]][[1]]
## [1] 1 2 3 4 5
g[[2]]$vector
## [1] 1 2 3 4 5

Logical operators

Logical operators

  • Logical operators are operators that evaluate to either TRUE or FALSE
  • The most common operators include
    • == (equal to)
    • != (not equal to)
    • < (smaller than)
    • > (greater than)
    • <= (smaller than or equal to)
    • >= (greater than or equal to)
  • The operators may be combined using | (OR) as well as & (AND)
  • Typing ! before a logical expression negates it (TRUE becomes FALSE and vice versa)
  • There are more operations, but these are the most useful
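These operators work element by element on a vector. The sketch below applies them to an illustrative vector x.

```r
x <- c(1, 2, 3, 4, 5)

x > 3              # FALSE FALSE FALSE  TRUE  TRUE
x == 3             # FALSE FALSE  TRUE FALSE FALSE
!(x > 3)           #  TRUE  TRUE  TRUE FALSE FALSE

# Combining comparisons
between <- x >= 2 & x <= 4    # FALSE  TRUE  TRUE  TRUE FALSE
outside <- x < 2 | x > 4      #  TRUE FALSE FALSE FALSE  TRUE
```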

Logical operators

  • Logical operators are great for subsetting, e.g. if we would like elements out of matrix c that are larger than 3, we would type:
c[c > 3]
## [1] 4 5 4 5
  • Why does subsetting a matrix with a logical statement return a vector? Look at the statement itself:
c > 3
##       [,1]  [,2]
## [1,] FALSE FALSE
## [2,] FALSE FALSE
## [3,] FALSE FALSE
## [4,]  TRUE  TRUE
## [5,]  TRUE  TRUE
  • The number of TRUE values may differ between columns, so the selected elements need not form a rectangle. A vector is therefore the more appropriate return value
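In matrix c both columns happen to contain the same number of TRUE values. The sketch below uses a hypothetical matrix m where the counts differ, which shows why the result cannot be a matrix.

```r
m <- matrix(c(1, 5, 2,
              6, 7, 8), nrow = 3)   # columns: (1, 5, 2) and (6, 7, 8)

# Number of elements larger than 4 in each column
hits_per_col <- colSums(m > 4)      # 1 3

# One hit in column 1 and three in column 2 cannot form a rectangle,
# so the selected elements come back as a vector
selected <- m[m > 4]                # 5 6 7 8

# The row and column positions of the hits, if they are needed
which(m > 4, arr.ind = TRUE)
```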

Logical operators

  • If we would like the elements that are smaller than 3 or larger than 3, we could type
c[c < 3 | c > 3]
## [1] 1 2 4 5 1 2 4 5
  • Analysing the query we realize that this is equivalent to asking for the elements not equal to 3
c[c != 3] #c not equal to 3
## [1] 1 2 4 5 1 2 4 5

Logical operators

  • To understand the mechanism, see how c != 3 actually returns a Boolean matrix
c != 3
##       [,1]  [,2]
## [1,]  TRUE  TRUE
## [2,]  TRUE  TRUE
## [3,] FALSE FALSE
## [4,]  TRUE  TRUE
## [5,]  TRUE  TRUE
  • And then recall the structure of c
c
##      [,1] [,2]
## [1,]    1    1
## [2,]    2    2
## [3,]    3    3
## [4,]    4    4
## [5,]    5    5

Factors

Factor

  • A factor is a vector with predefined “allowed” values (levels)
f <- factor(c("A", "B", "A", "B", "D"),
            levels = c("A", "B", "C"))
f
## [1] A    B    A    B    <NA>
## Levels: A B C
  • Why is “D” not included?
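One way to explore this question is to compare the values with the levels. In the sketch below (with the illustrative names grp and grp_all), a value that is not among the levels becomes NA, and listing it as a level keeps it.

```r
grp <- factor(c("A", "B", "A", "B", "D"),
              levels = c("A", "B", "C"))

levels(grp)    # "A" "B" "C": "D" is not an allowed value
is.na(grp)     # FALSE FALSE FALSE FALSE  TRUE

# Adding "D" to the levels keeps the value
grp_all <- factor(c("A", "B", "A", "B", "D"),
                  levels = c("A", "B", "C", "D"))
anyNA(grp_all) # FALSE
```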

Factor

  • Factors can be useful for categorical variables
fruits <- factor(c("Apple", "Potato"),
                 levels = c("Apple", "Banana"))
fruits
## [1] Apple <NA> 
## Levels: Apple Banana

Beware of factors - they are not always what they seem!

Consider a data frame A:

##   V1 V2
## 1  1  4
## 2  2  3
## 3  3  2
## 4  4  1

We can try to add elements of A:

A[2,1] + A[2,2]
## Warning in Ops.factor(A[2, 1], A[2, 2]): '+' not meaningful for factors
## [1] NA

But it seems logical enough to add them!

typeof(A[2,1]) 
## [1] "integer"
typeof(A[2,2])
## [1] "integer"

Something weird is going on here!

A[2,1] 
## [1] 2
## Levels: 1 2 3 4
A[2,2]
## [1] 3
## Levels: 1 2 3 4

The problem is caused by the elements of A being factors.

A[2,2]
## [1] 3
## Levels: 1 2 3 4

It’s the keyword Levels that tells us that A is in fact a data frame of factors.

Factors do cause a lot of trouble for this exact reason. So when you troubleshoot, check if your elements actually have the type that you think they have!
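Since typeof() reports "integer" for a factor, class() or is.factor() is the safer check. The slides do not show how A was created, so the construction below is an assumption that merely reproduces the printed values.

```r
# Assumed construction of A: two factor columns with levels 1 to 4
A <- data.frame(V1 = factor(1:4), V2 = factor(4:1))

typeof(A[2, 1])    # "integer": the internal storage of a factor
class(A[2, 1])     # "factor": what the object actually is
is.factor(A$V1)    # TRUE

# Check all columns at once
col_classes <- sapply(A, class)   # V1: "factor", V2: "factor"
str(A)
```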

Factors in R

  • Sometimes R converts data to factors automatically when reading in the data.

    • This is not as big a problem as it used to be, but it is still a good idea to check
  • Since R 4.0.0, data.frame() and read.csv() no longer convert strings to factors by default (stringsAsFactors = FALSE); older versions do, unless you use a data structure called a tibble.

  • Sometimes we have to do it ourselves (if we want the data to be factors).

  • You can make an element into a factor by using the function as.factor()

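  • If you want something not to be a factor, you have to convert it back to the type of element you need it to be. Go via as.character(), because as.integer() on a factor returns the internal level codes, which only happen to match the values here:

as.integer(as.character(A[2,1])) + as.integer(as.character(A[2,2]))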
## [1] 5
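Converting a factor straight to integer only gives the right answer when the level codes coincide with the values, as they do for A. The sketch below uses a hypothetical factor x where they do not, and shows two conversions that recover the original numbers.

```r
x <- factor(c("10", "20", "10"))

# as.integer() returns the internal level codes, not the labels
codes <- as.integer(x)                   # 1 2 1

# Convert via character to recover the numbers
values <- as.integer(as.character(x))    # 10 20 10

# Equivalent: convert the levels once, then index them by the codes
values2 <- as.integer(levels(x))[x]      # 10 20 10
```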

Then why use factors?

  • A factor is a data structure used for fields that take only a predefined, finite number of values.

  • Factors categorise the data and store the categories as levels.

  • Factors can be built from both integer and string values, and are useful in data that has a limited number of unique values.

  • A factor gives a quick overview of the unique values of the data.
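The quick overview can be obtained with levels() and table(). The sketch below uses a made-up factor colour; note that table() also reports levels that do not occur in the data.

```r
colour <- factor(c("red", "blue", "red", "green", "red"),
                 levels = c("blue", "green", "red", "yellow"))

levels(colour)    # "blue" "green" "red" "yellow"
nlevels(colour)   # 4

# Counts per level, including the unused level "yellow"
counts <- table(colour)
counts            # blue 1, green 1, red 3, yellow 0
```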

Functions

A few final notes

Missingness - Special numerics

  • Expressions that have no representation in real number space (at least not without tremendous analytical effort) result in a “Not a Number”
0 / 0
## [1] NaN
  • Another special value is “Infinity”
1/0
## [1] Inf
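A few more special cases, together with the functions that test for them, as a minimal sketch:

```r
-1 / 0             # -Inf
Inf - Inf          # NaN: the difference is undefined

is.nan(0 / 0)      # TRUE
is.finite(1 / 0)   # FALSE
is.infinite(1 / 0) # TRUE

# NaN also counts as missing
is.na(0 / 0)       # TRUE
```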

Missingness - Not Available

  • Unknown values are coded as NA/“Not Available”
h <- c(1, 2, NA, 4, 5)
h
## [1]  1  2 NA  4  5
  • NA is a special value that can be included in vectors of any type
j <- c("A", "B", NA, "D")
j
## [1] "A" "B" NA  "D"
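Missing values are found with is.na(), not with ==. A short sketch, reusing the values of h under the illustrative name v:

```r
v <- c(1, 2, NA, 4, 5)

is.na(v)                 # FALSE FALSE  TRUE FALSE FALSE
n_missing <- sum(is.na(v))     # 1
where     <- which(is.na(v))   # 3

# Comparing with NA gives NA, because the value is unknown
NA == NA                 # NA
v == NA                  # NA NA NA NA NA

# NA does not change the datatype of the vector
typeof(v)                # "double"
typeof(c("A", NA))       # "character"
```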

Working with NA

  • The mean of a set of numbers that includes at least one NA cannot be computed and is therefore also missing
mean(c(1, 2, NA, 4, 5))
## [1] NA
  • Many functions have an argument called na.rm that removes the missing values first:
mean(c(1, 2, NA, 4, 5), na.rm = TRUE)
## [1] 3
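The na.rm argument is available in several other summary functions as well. The sketch below shows a few of them, along with removing the missing values by hand; v is an illustrative name.

```r
v <- c(1, 2, NA, 4, 5)

sum(v)                   # NA
total   <- sum(v, na.rm = TRUE)    # 12
largest <- max(v, na.rm = TRUE)    # 5

# Removing the missing values explicitly gives the same mean
complete <- v[!is.na(v)]           # 1 2 4 5
mean(complete)                     # 3
```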

Another note: Beware of comparisons that include floats

(3 - 2.9)
## [1] 0.1
(3 - 2.9) <= 0.1
## [1] FALSE
(3 - 2.9) - .1
## [1] 8.326673e-17
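A common way around this is to compare with a small tolerance instead of exactly. The sketch below uses all.equal(), which does this by default, and a manual check; the tolerance tol is a value chosen here for illustration, not a recommendation from the slides.

```r
x <- 3 - 2.9

x == 0.1                      # FALSE: the two doubles differ slightly

# all.equal() allows a small numerical difference
same <- isTRUE(all.equal(x, 0.1))   # TRUE

# Manual version with an explicit (assumed) tolerance
tol <- 1e-8
abs(x - 0.1) < tol            # TRUE
x <= 0.1 + tol                # TRUE
```

Wrapping all.equal() in isTRUE() matters, because all.equal() returns a description of the difference rather than FALSE when the values are not equal.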

Practical

How to approach the next practical

  • Aim to do the exercises without looking at the answers
    • Use the answers to evaluate your work
    • Use the help to identify the workings of functions
  • If this does not work out then switch to the answer-based practical
  • In any case feel free to ask for help when needed
    • Do not struggle for too long since we only have limited time!