Practical B

Exercises

In this practical we are going to play around with the different types of elements in R.

Make two vectors: one named vec1 with values 1 through 6 and one named vec2 with letters A through F.

vec1 <- c(1, 2, 3, 4, 5, 6)
vec2 <- c("A", "B", "C", "D", "E", "F")

To create a vector we used c(), which combines (concatenates) its arguments into a single vector. A vector is just an ordered series of elements of the same type, here numbers or letters.

Create two matrices, one from vec1 and one from vec2. The dimensions for both matrices are 3 rows by 2 columns.

mat1 <- matrix(vec1, nrow = 3, ncol = 2)
mat2 <- matrix(vec2, nrow = 3, ncol = 2)

To create a matrix we used matrix(). For a matrix we need to specify the dimensions (in this case 3 rows and 2 columns) and the input (in this case vec1 or vec2) needs to match these dimensions.
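By default matrix() fills the matrix column by column, which is why vec1 ends up as 1, 2, 3 in the first column. A minimal sketch of the difference made by the byrow argument (the object names by_col and by_row are only illustrative):

```r
vec1 <- c(1, 2, 3, 4, 5, 6)

# Default: values are filled column by column
by_col <- matrix(vec1, nrow = 3, ncol = 2)

# byrow = TRUE: values are filled row by row instead
by_row <- matrix(vec1, nrow = 3, ncol = 2, byrow = TRUE)

by_col[1, ]  # first row: 1 4
by_row[1, ]  # first row: 1 2
```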

Inspect your vectors and matrices. Are they all numeric?

vec1

## [1] 1 2 3 4 5 6

vec2

## [1] "A" "B" "C" "D" "E" "F"

mat1

##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

mat2

##      [,1] [,2]
## [1,] "A"  "D" 
## [2,] "B"  "E" 
## [3,] "C"  "F"

vec1 and mat1 contain numbers and vec2 and mat2 contain characters.
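Instead of judging the type by eye, we can also ask R directly. The following sketch rebuilds the objects from above and queries them with is.numeric(), is.character() and dim():

```r
vec1 <- c(1, 2, 3, 4, 5, 6)
vec2 <- c("A", "B", "C", "D", "E", "F")
mat1 <- matrix(vec1, nrow = 3, ncol = 2)
mat2 <- matrix(vec2, nrow = 3, ncol = 2)

is.numeric(vec1)    # TRUE
is.numeric(mat1)    # TRUE
is.character(vec2)  # TRUE
is.character(mat2)  # TRUE
dim(mat1)           # 3 2 (rows, columns)
```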

Make a matrix from both vec1 and vec2 with 6 rows and 2 columns. Inspect this matrix.

mat3 <- matrix(c(vec1, vec2), 6, 2)
is.matrix(mat3)

## [1] TRUE

mat3

##      [,1] [,2]
## [1,] "1"  "A" 
## [2,] "2"  "B" 
## [3,] "3"  "C" 
## [4,] "4"  "D" 
## [5,] "5"  "E" 
## [6,] "6"  "F"

mat3b <- cbind(vec1, vec2)
is.matrix(mat3b)

## [1] TRUE

mat3b

##      vec1 vec2
## [1,] "1"  "A" 
## [2,] "2"  "B" 
## [3,] "3"  "C" 
## [4,] "4"  "D" 
## [5,] "5"  "E" 
## [6,] "6"  "F"

If one or more elements in the matrix are characters, all other elements are converted to characters as well. A matrix can hold only one type of element: either all numeric or all character. Notice that the second approach (the column bind approach from mat3b) returns a matrix where the column names are already set to the names of the bound objects.
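The conversion already happens inside c(), before the matrix is built. A small sketch (the name combined is only illustrative) that shows the coercion and how a single column can be turned back into numbers:

```r
vec1 <- c(1, 2, 3, 4, 5, 6)
vec2 <- c("A", "B", "C", "D", "E", "F")

# Mixing numbers and characters coerces everything to character
combined <- c(vec1, vec2)
class(combined)        # "character"

mat3 <- matrix(combined, 6, 2)
is.numeric(mat3)       # FALSE

# The first column can be converted back explicitly
as.numeric(mat3[, 1])  # 1 2 3 4 5 6
```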

To solve the problem of numbers being stored as characters we can create a dataframe. A dataframe is essentially a matrix in which each column can have its own type, so numeric and character columns can sit side by side. The use of a dataframe is often preferred over the use of a matrix in R, except for purposes where pure numerical calculations are done, such as in matrix algebra. However, most datasets do contain character information and a dataframe would normally be your preferred choice when working with your own collected datasets in R.

Make a dataframe called dat3 where vec1 and vec2 are both columns. Name the columns V1 and V2, respectively. Use function data.frame().

dat3 <- data.frame(V1 = vec1, V2 = vec2)
dat3

##   V1 V2
## 1  1  A
## 2  2  B
## 3  3  C
## 4  4  D
## 5  5  E
## 6  6  F

Again, make a dataframe called dat3b where vec1 and vec2 are both columns. Name the columns V1 and V2, respectively. Use function as.data.frame() on the matrix obtained from Question 4.

This is a tricky situation. At face value, everything may seem to be in order. But, be aware that the code

dat3b <- as.data.frame(mat3b)
dat3b

##   vec1 vec2
## 1    1    A
## 2    2    B
## 3    3    C
## 4    4    D
## 5    5    E
## 6    6    F

names(dat3b) <- c("V1", "V2")
dat3b

##   V1 V2
## 1  1  A
## 2  2  B
## 3  3  C
## 4  4  D
## 5  5  E
## 6  6  F

does not work properly (at least not as intended), because the matrix nature of mat3b turned everything into character values and the numerical nature of vec1 has been lost. It may appear to be working, but if we check whether column 1 is numeric, it turns out not to be.

Check if the first column in the data frames from Question 5 and Question 6 are indeed numeric. If not, determine what they are.

is.numeric(dat3[, 1])

## [1] TRUE

is.numeric(dat3b[, 1])

## [1] FALSE

The first column in the data frame dat3b obtained from Question 6 is indeed not numeric. Instead of asking “Is object X of type Y?” we can simply ask R to report the type of the object of interest:

typeof(dat3b[,1])

## [1] "character"
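If the numbers are needed as numbers, the column can be converted explicitly. This is a minimal sketch rather than part of the original solution; going through as.character() first is a precaution so that the same line also behaves sensibly if the column happens to be a factor:

```r
vec1 <- c(1, 2, 3, 4, 5, 6)
vec2 <- c("A", "B", "C", "D", "E", "F")

dat3b <- as.data.frame(cbind(vec1, vec2))
names(dat3b) <- c("V1", "V2")

# Convert the first column back to numeric
dat3b$V1 <- as.numeric(as.character(dat3b$V1))

is.numeric(dat3b$V1)  # TRUE
dat3b$V1              # 1 2 3 4 5 6
```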

Previous versions of these practicals had to deal with factors at this point, since there was an implicit stringsAsFactors = TRUE in, for example, the as.data.frame function. For good reasons, the behavior of R in this regard has changed. (If you are interested, see the following blogpost by Kurt Hornik.)

Select 1) the third row, 2) the second column and 3) the intersection of these two in the dataframe dat3 that you have created in Question 5.

dat3[3, ] #3rd row

##   V1 V2
## 3  3  C

dat3[, 2] #2nd column

## [1] "A" "B" "C" "D" "E" "F"

dat3$V2   #also 2nd column

## [1] "A" "B" "C" "D" "E" "F"

dat3[3,2] #intersection

## [1] "C"

The [3,2] index is very useful. The first number (before the comma) represents the row and the second number (after the comma) represents the column. For a vector there are no two dimensions and only one dimension can be called. For example, vec1[3] would yield 3. Try it.

Columns can also be called by the $ sign, but only if a name has been assigned. With dataframes assigning names happens automatically.
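A few related ways of indexing, sketched on a fresh copy of dat3 (stringsAsFactors = FALSE is set explicitly here only so that the result does not depend on the R version):

```r
dat3 <- data.frame(V1 = c(1, 2, 3, 4, 5, 6),
                   V2 = c("A", "B", "C", "D", "E", "F"),
                   stringsAsFactors = FALSE)

dat3[3, "V2"]            # "C": same cell as dat3[3, 2], column selected by name
dat3[["V2"]]             # the column as a vector, like dat3$V2
dat3[, 2, drop = FALSE]  # keeps the result a one-column data frame
```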

A useful function to inspect the structure of a dataframe is str(). Try running it.

str(dat3)

## 'data.frame':    6 obs. of  2 variables:
##  $ V1: num  1 2 3 4 5 6
##  $ V2: chr  "A" "B" "C" "D" ...

Inspecting the structure of your data is vital, as you probably have imported your data from some other source. If we, at a later stage, start analyzing our data without the correct measurement level, we may run into problems. One problem that often occurs is that categorical variables (factors in R) are not coded as such (or the reverse).

Imagine that the first variable V1 in our dataframe dat3 is not coded correctly, but actually represents grouping information about cities. Convert the variable to a factor and add the labels Utrecht, New York, London, Singapore, Rome and Cape Town.

dat3$V1 <- factor(dat3$V1,
                  labels = c("Utrecht", "New York", "London", "Singapore", "Rome", "Cape Town"))
dat3

##          V1 V2
## 1   Utrecht  A
## 2  New York  B
## 3    London  C
## 4 Singapore  D
## 5      Rome  E
## 6 Cape Town  F

Rerunning the str function confirms that we have now converted the first column to a factor:

str(dat3)

## 'data.frame':    6 obs. of  2 variables:
##  $ V1: Factor w/ 6 levels "Utrecht","New York",..: 1 2 3 4 5 6
##  $ V2: chr  "A" "B" "C" "D" ...

Note that this conversion was rather lazy, and that the factor function offers more controls if needed.
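As an illustration of that extra control, the levels argument states explicitly which code belongs to which label, so the mapping does not depend on the order in which the codes happen to appear. The vector codes below is hypothetical example data:

```r
# Hypothetical city codes, deliberately not in sorted order
codes <- c(3, 1, 2, 1)

city <- factor(codes,
               levels = c(1, 2, 3),
               labels = c("Utrecht", "New York", "London"))

as.character(city)  # "London" "Utrecht" "New York" "Utrecht"
levels(city)        # "Utrecht" "New York" "London"
as.integer(city)    # 3 1 2 1 (the underlying codes)
```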

Open the dataset boys.RData. You need to download this file and put it in the project folder.

If you have organized your code in a project, the boys.RData file will be visible in the Files panel, where you can just click on it and confirm that you want to load the data file into your global environment. However, the preferred way is (almost) always to do this using code:

load("boys.RData")
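load() restores objects under the names they were saved with and invisibly returns those names. Because boys.RData may not be at hand, the sketch below demonstrates this with a small, made-up object written to a temporary file (example_data and tmp_file are illustrative names):

```r
# Save a small example object to a temporary .RData file
tmp_file <- tempfile(fileext = ".RData")
example_data <- data.frame(x = 1:3)
save(example_data, file = tmp_file)

# Remove it from the workspace, then load it back
rm(example_data)
loaded_names <- load(tmp_file)

loaded_names            # "example_data"
exists("example_data")  # TRUE
```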

Yet another option is to double-click the boys.RData file on your machine, e.g. in the Windows Explorer (right-click and open with RStudio if it opens in R rather than in RStudio by default).

Most packages have datasets included. Since we have not learned to load packages yet, you are presented with such a data set in a workspace. Open the boys dataset (it is from package mice, by the way) by typing boys in the console, and subsequently by using the function View(). Tip: to understand the numbers you are looking at, consult the documentation by typing ?mice::boys

The output is not displayed here as it is simply too large.

Using View() is well suited for inspecting datasets that are large (but not huge). View() opens the dataset in a spreadsheet-like window (like Excel), but you cannot edit the content.

Find out the dimensions of the boys data set and inspect the first and final 6 cases in the data set.

To do it numerically, find out what the dimensions of the boys dataset are.

dim(boys)

## [1] 748   9

There are 748 cases on 9 variables. To select the first and last six cases, use

boys[c(1:6,743:748), ]

##         age   hgt    wgt   bmi   hc  gen  phb tv   reg
## 3     0.035  50.1  3.650 14.54 33.7 <NA> <NA> NA south
## 4     0.038  53.5  3.370 11.77 35.0 <NA> <NA> NA south
## 18    0.057  50.0  3.140 12.56 35.2 <NA> <NA> NA south
## 23    0.060  54.5  4.270 14.37 36.7 <NA> <NA> NA south
## 28    0.062  57.5  5.030 15.21 37.3 <NA> <NA> NA south
## 36    0.068  55.5  4.655 15.11 37.0 <NA> <NA> NA south
## 7410 20.372 188.7 59.800 16.79 55.2 <NA> <NA> NA  west
## 7418 20.429 181.1 67.200 20.48 56.6 <NA> <NA> NA north
## 7444 20.761 189.1 88.000 24.60   NA <NA> <NA> NA  west
## 7447 20.780 193.5 75.400 20.13   NA <NA> <NA> NA  west
## 7451 20.813 189.0 78.000 21.83 59.9 <NA> <NA> NA north
## 7475 21.177 181.8 76.500 23.14   NA <NA> <NA> NA  east

or, more efficiently:

head(boys, n=6)

##      age  hgt   wgt   bmi   hc  gen  phb tv   reg
## 3  0.035 50.1 3.650 14.54 33.7 <NA> <NA> NA south
## 4  0.038 53.5 3.370 11.77 35.0 <NA> <NA> NA south
## 18 0.057 50.0 3.140 12.56 35.2 <NA> <NA> NA south
## 23 0.060 54.5 4.270 14.37 36.7 <NA> <NA> NA south
## 28 0.062 57.5 5.030 15.21 37.3 <NA> <NA> NA south
## 36 0.068 55.5 4.655 15.11 37.0 <NA> <NA> NA south

tail(boys)

##         age   hgt  wgt   bmi   hc  gen  phb tv   reg
## 7410 20.372 188.7 59.8 16.79 55.2 <NA> <NA> NA  west
## 7418 20.429 181.1 67.2 20.48 56.6 <NA> <NA> NA north
## 7444 20.761 189.1 88.0 24.60   NA <NA> <NA> NA  west
## 7447 20.780 193.5 75.4 20.13   NA <NA> <NA> NA  west
## 7451 20.813 189.0 78.0 21.83 59.9 <NA> <NA> NA north
## 7475 21.177 181.8 76.5 23.14   NA <NA> <NA> NA  east

The functions head() and tail() are very useful. For example, comparing their output suggests that the data are very likely sorted by age.
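Hard-coding row numbers such as 743:748 only works when the number of rows is known. A sketch of a version that adapts to the size of the data, shown on a made-up data frame toy so that it runs without the boys data:

```r
# Made-up data frame with 10 rows
toy <- data.frame(id = 1:10, value = (1:10)^2)
n <- nrow(toy)

# First and last three rows, computed from nrow() instead of typed in
first_last <- toy[c(1:3, (n - 2):n), ]
first_last$id     # 1 2 3 8 9 10

# The same rows via head() and tail()
head(toy, n = 3)$id  # 1 2 3
tail(toy, n = 3)$id  # 8 9 10
```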

It seems that the boys data are sorted based on age. Verify this.

To verify whether the data are indeed sorted, we can run the following command to test the complement of that statement. Remember that we can always search the help for functions: e.g. we could have searched here for ?sort and we would quickly have ended up at the function is.unsorted(), which tests whether an object is not sorted.

is.unsorted(boys$age)

## [1] FALSE

This returns FALSE, indicating that the boys’ age is not unsorted, which is the same as saying that it is sorted (double negative). To directly test if it is sorted, we could have used

!is.unsorted(boys$age)

## [1] TRUE

which tests if the data are not unsorted. In other words, the values TRUE and FALSE under is.unsorted() turn into FALSE and TRUE under !is.unsorted(), respectively.
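The behavior of is.unsorted() is easy to explore on small vectors. The values below are made up; the strictly argument additionally treats ties as unsorted:

```r
ages <- c(0.5, 1.2, 1.2, 3.4)      # non-decreasing, with one tie
shuffled <- c(3.4, 0.5, 1.2)       # clearly out of order

is.unsorted(ages)                  # FALSE: sorted, ties are allowed
is.unsorted(ages, strictly = TRUE) # TRUE: ties count as unsorted here
is.unsorted(shuffled)              # TRUE

# order() gives the positions that would sort the vector
sorted <- shuffled[order(shuffled)]
!is.unsorted(sorted)               # TRUE
```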

Inspect the boys dataset with str(). Use one or more functions to find distributional summary information (at least information about the minimum, the maximum, the mean and the median) for all of the variables. Give the standard deviation for age and bmi. Tip: make use of the help (?) and help search (??) functionality in R.

str(boys)

## 'data.frame':    748 obs. of  9 variables:
##  $ age: num  0.035 0.038 0.057 0.06 0.062 0.068 0.068 0.071 0.071 0.073 ...
##  $ hgt: num  50.1 53.5 50 54.5 57.5 55.5 52.5 53 55.1 54.5 ...
##  $ wgt: num  3.65 3.37 3.14 4.27 5.03 ...
##  $ bmi: num  14.5 11.8 12.6 14.4 15.2 ...
##  $ hc : num  33.7 35 35.2 36.7 37.3 37 34.9 35.8 36.8 38 ...
##  $ gen: Ord.factor w/ 5 levels "G1"<"G2"<"G3"<..: NA NA NA NA NA NA NA NA NA NA ...
##  $ phb: Ord.factor w/ 6 levels "P1"<"P2"<"P3"<..: NA NA NA NA NA NA NA NA NA NA ...
##  $ tv : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ reg: Factor w/ 5 levels "north","east",..: 4 4 4 4 4 4 4 3 3 2 ...

summary(boys) # summary info

##       age              hgt              wgt              bmi       
##  Min.   : 0.035   Min.   : 50.00   Min.   :  3.14   Min.   :11.77  
##  1st Qu.: 1.581   1st Qu.: 84.88   1st Qu.: 11.70   1st Qu.:15.90  
##  Median :10.505   Median :147.30   Median : 34.65   Median :17.45  
##  Mean   : 9.159   Mean   :132.15   Mean   : 37.15   Mean   :18.07  
##  3rd Qu.:15.267   3rd Qu.:175.22   3rd Qu.: 59.58   3rd Qu.:19.53  
##  Max.   :21.177   Max.   :198.00   Max.   :117.40   Max.   :31.74  
##                   NA's   :20       NA's   :4        NA's   :21     
##        hc          gen        phb            tv           reg     
##  Min.   :33.70   G1  : 56   P1  : 63   Min.   : 1.00   north: 81  
##  1st Qu.:48.12   G2  : 50   P2  : 40   1st Qu.: 4.00   east :161  
##  Median :53.00   G3  : 22   P3  : 19   Median :12.00   west :239  
##  Mean   :51.51   G4  : 42   P4  : 32   Mean   :11.89   south:191  
##  3rd Qu.:56.00   G5  : 75   P5  : 50   3rd Qu.:20.00   city : 73  
##  Max.   :65.00   NA's:503   P6  : 41   Max.   :25.00   NA's :  3  
##  NA's   :46                 NA's:503   NA's   :522

sd(boys$age)  # standard deviation for age

## [1] 6.894052

sd(boys$bmi)  # standard deviation for bmi

## [1] NA

sd(boys$bmi, na.rm = TRUE) # standard deviation for bmi (observed cases)

## [1] 3.053421

Note that bmi contains 21 missing values, as can be seen in the summary information. Therefore we need to use na.rm = TRUE to calculate the standard deviation on the observed cases only.
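The effect of missing values can be reproduced on a small made-up data frame. The sketch also shows one possible way to get the standard deviation of several numeric columns at once with sapply():

```r
# Made-up data with one missing bmi value
toy <- data.frame(age = c(1, 5, 9, 13),
                  bmi = c(14.5, NA, 17.2, 20.1))

sum(is.na(toy$bmi))        # 1 missing value
sd(toy$bmi)                # NA, because of the missing value
sd(toy$bmi, na.rm = TRUE)  # computed on the three observed values

# Standard deviation of every column, ignoring missing values
sd_all <- sapply(toy, sd, na.rm = TRUE)
names(sd_all)              # "age" "bmi"
```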

Select all boys that are 20 years or older. How many are there?

Logical values (TRUE vs FALSE) are a very powerful tool in R. For example, we can select the rows (respondents) in the data that are 20 years or older by putting a logical condition within the row index of the dataset:

boys2 <- boys[boys$age >= 20, ]
nrow(boys2)

## [1] 12

or, alternatively,

boys2.1 <- subset(boys, age >= 20)
nrow(boys2.1)

## [1] 12

To understand the first construction, we can do this in steps by constructing a logical vector a:

a <- boys$age >= 20
tail(a, n=100)

##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [97]  TRUE  TRUE  TRUE  TRUE

sum(a)

## [1] 12

boys_20 <- boys[a,]
nrow(boys_20)

## [1] 12
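The same mechanics on a short made-up vector: in arithmetic, TRUE counts as 1 and FALSE as 0, so sum() counts the matches and mean() gives their proportion, while which() returns their positions.

```r
ages <- c(18.2, 19.7, 20.0, 20.4, 21.1)  # made-up ages

a <- ages >= 20
a          # FALSE FALSE TRUE TRUE TRUE
sum(a)     # 3: number of TRUE values
mean(a)    # 0.6: proportion of TRUE values
which(a)   # 3 4 5: positions of the TRUE values
ages[a]    # 20.0 20.4 21.1
```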

Select all boys that are older than 19, but younger than 19.5. How many are there?

boys3 <- boys[boys$age > 19 & boys$age < 19.5, ]
nrow(boys3)

## [1] 18

or, alternatively,

boys3.2 <- subset(boys, age > 19 & age < 19.5)
nrow(boys3.2)

## [1] 18

What is the mean age of boys younger than 15 years of age that do not live in region north?

mean(boys$age[boys$age < 15 & boys$reg != "north" ], na.rm = TRUE)

## [1] 6.044461

or, alternatively,

mean(subset(boys, age < 15 & reg != "north")$age, na.rm=TRUE)

## [1] 6.044461

The mean age is 6.04 years.
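The na.rm = TRUE in the bracket version matters because reg contains a few missing values: a comparison with NA yields NA, and indexing with NA returns NA. A sketch on made-up data; note that subset() drops rows where the condition is NA, whereas bracket indexing keeps them as NA:

```r
# Made-up data with one missing region
toy <- data.frame(age = c(3, 10, 16, 8),
                  reg = factor(c("north", "south", "west", NA)))

cond <- toy$age < 15 & toy$reg != "north"
cond                                 # FALSE TRUE FALSE NA

toy$age[cond]                        # 10 NA
mean(toy$age[cond])                  # NA
mean(toy$age[cond], na.rm = TRUE)    # 10

# subset() silently drops the row where the condition is NA
mean(subset(toy, age < 15 & reg != "north")$age)  # 10
```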

End of practical B.
