In this practical we are going to play around with the different types of elements in R.
Create a vector named vec1 with values 1 through 6 and one named vec2 with letters A through F.
vec1 <- c(1, 2, 3, 4, 5, 6)
vec2 <- c("A", "B", "C", "D", "E", "F")
To create a vector we used c(), which stands for ‘concatenation’. A vector is just a series of numbers or letters.
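As a small aside, the same vectors can be built without typing every element. The sketch below uses the colon operator and the built-in constant LETTERS; the names vec1_alt, vec2_alt and joined are made up for this illustration and are not part of the practical.

```r
# Alternative ways to build the same vectors (illustrative names only)
vec1_alt <- 1:6            # integer sequence 1, 2, ..., 6
vec2_alt <- LETTERS[1:6]   # first six capital letters

# c() also joins existing vectors end to end
joined <- c(1:3, 4:6)

length(vec1_alt)                                       # 6
all(vec1_alt == c(1, 2, 3, 4, 5, 6))                   # TRUE
identical(vec2_alt, c("A", "B", "C", "D", "E", "F"))   # TRUE
typeof(vec1_alt)                                       # "integer"; c(1, 2, 3) would be "double"
```

Note that 1:6 is stored as integer while c(1, 2, 3, 4, 5, 6) is stored as double. For this practical the difference does not matter.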
Create two matrices, one from vec1 and one from vec2. The dimensions for both matrices are 3 rows by 2 columns.
mat1 <- matrix(vec1, nrow = 3, ncol = 2)
mat2 <- matrix(vec2, nrow = 3, ncol = 2)
To create a matrix we used matrix(). For a matrix we need to specify the dimensions (in this case 3 rows and 2 columns) and the input (in this case vec1 or vec2) needs to match these dimensions.
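It may help to see how matrix() places the input into the requested dimensions. By default the values are filled column by column; the byrow argument switches this to row by row. The object names by_col and by_row are only for this sketch.

```r
vec1 <- c(1, 2, 3, 4, 5, 6)

by_col <- matrix(vec1, nrow = 3, ncol = 2)                 # default: fill column by column
by_row <- matrix(vec1, nrow = 3, ncol = 2, byrow = TRUE)   # fill row by row

dim(by_col)    # 3 2
by_col[1, ]    # first row: 1 4
by_row[1, ]    # first row: 1 2

# Giving only one dimension lets R work out the other from the input length
dim(matrix(vec1, nrow = 3))   # 3 2
```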
vec1
## [1] 1 2 3 4 5 6
vec2
## [1] "A" "B" "C" "D" "E" "F"
mat1
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
mat2
## [,1] [,2]
## [1,] "A" "D"
## [2,] "B" "E"
## [3,] "C" "F"
vec1 and mat1 contain numbers and vec2 and mat2 contain characters.
Make a matrix from both vec1 and vec2 with 6 rows and 2 columns. Inspect this matrix.
mat3 <- matrix(c(vec1, vec2), 6, 2)
is.matrix(mat3)
## [1] TRUE
mat3
## [,1] [,2]
## [1,] "1" "A"
## [2,] "2" "B"
## [3,] "3" "C"
## [4,] "4" "D"
## [5,] "5" "E"
## [6,] "6" "F"
or
mat3b <- cbind(vec1, vec2)
is.matrix(mat3b)
## [1] TRUE
mat3b
## vec1 vec2
## [1,] "1" "A"
## [2,] "2" "B"
## [3,] "3" "C"
## [4,] "4" "D"
## [5,] "5" "E"
## [6,] "6" "F"
If one or more elements in the matrix represent characters, all other elements are also converted to characters. A matrix can hold only one type of element: either all numeric or all character. Notice that the second approach (the column bind approach from mat3b) returns a matrix where the column names are already set to the names of the bound objects.
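The conversion to characters already happens when the two vectors are combined, before any matrix is made. A short sketch (the name combined is only used here) that checks the storage type with typeof():

```r
vec1 <- c(1, 2, 3, 4, 5, 6)
vec2 <- c("A", "B", "C", "D", "E", "F")

combined <- c(vec1, vec2)   # mixing numbers and letters in one vector
typeof(combined)            # "character": the numbers were coerced
combined[1:3]               # "1" "2" "3"

mat3b <- cbind(vec1, vec2)
colnames(mat3b)             # "vec1" "vec2"
typeof(mat3b)               # "character"
```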
To solve the problem of numbers being stored as characters we can create a dataframe. A dataframe is essentially a matrix in which each column can have its own type, so numeric and character columns can sit side by side. The use of a dataframe is often preferred over the use of a matrix in R, except for purposes where pure numerical calculations are done, such as in matrix algebra. However, most datasets do contain character information and a dataframe would normally be your preferred choice when working with your own collected datasets in R.
Make a dataframe called dat3 where vec1 and vec2 are both columns. Name the columns V1 and V2, respectively. Use function data.frame().
dat3 <- data.frame(V1 = vec1, V2 = vec2)
dat3
## V1 V2
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
## 6 6 F
Make a dataframe called dat3b where vec1 and vec2 are both columns. Name the columns V1 and V2, respectively. Use function as.data.frame() on the matrix obtained from Question 4.
This is a tricky situation. At face value, everything may seem to be in order. But be aware that the code
dat3b <- as.data.frame(mat3b)
dat3b
## vec1 vec2
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
## 6 6 F
names(dat3b) <- c("V1", "V2")
dat3b
## V1 V2
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 D
## 5 5 E
## 6 6 F
does not work properly (at least not as intended), as the matrix nature of mat3b turned everything into a character value and you have lost the numerical nature of vec1. It may appear to be working, but if we check whether column 1 is numerical, it turns out not to be the case.
is.numeric(dat3[, 1])
## [1] TRUE
is.numeric(dat3b[, 1])
## [1] FALSE
The first column in dataframe dat3b is indeed not numeric. Instead of asking “Is object X of type Y?” we can simply ask R to report the type of the object of interest:
typeof(dat3b[,1])
## [1] "character"
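If the numeric nature of the first column is needed after all, one possible repair is to convert that column back with as.numeric(). This is a sketch only: the copy dat3b_fixed is a name made up here, and the comments assume a recent version of R in which as.data.frame() keeps strings as characters.

```r
vec1 <- c(1, 2, 3, 4, 5, 6)
vec2 <- c("A", "B", "C", "D", "E", "F")
dat3b <- as.data.frame(cbind(vec1, vec2))
names(dat3b) <- c("V1", "V2")

dat3b_fixed <- dat3b                           # work on a copy (illustrative name)
dat3b_fixed$V1 <- as.numeric(dat3b_fixed$V1)   # "1" becomes 1, and so on

sapply(dat3b, typeof)         # both columns are character
sapply(dat3b_fixed, typeof)   # V1 is now double, V2 is still character
is.numeric(dat3b_fixed$V1)    # TRUE
```

Building the dataframe directly with data.frame(), as was done for dat3, avoids the problem in the first place.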
Previous versions of these practicals had to deal with factors at this point, since there was an implicit stringsAsFactors = TRUE in, for example, the as.data.frame function. For good reasons, the behavior of R in this regard has changed: since R 4.0.0 the default is stringsAsFactors = FALSE. (If you are interested, see the blog post by Kurt Hornik.)
Select the third row, the second column and the intersection of these two in the dataframe dat3 that you have created in Question 5.
dat3[3, ] #3rd row
## V1 V2
## 3 3 C
dat3[, 2] #2nd column
## [1] "A" "B" "C" "D" "E" "F"
dat3$V2 #also 2nd column
## [1] "A" "B" "C" "D" "E" "F"
dat3[3,2] #intersection
## [1] "C"
The [3,2] index is very useful. The first number (before the comma) represents the row and the second number (after the comma) represents the column. For a vector there are no two dimensions and only one dimension can be called. For example, vec1[3] would yield 3. Try it.
Columns can also be called by the $ sign, but only if a name has been assigned. With dataframes assigning names happens automatically.
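A few more indexing variations may be worth trying. The sketch below rebuilds dat3 so that it runs on its own and shows selection by name, selection of several rows, negative indices, and the drop argument, which controls whether a single column comes back as a vector or as a dataframe.

```r
dat3 <- data.frame(V1 = c(1, 2, 3, 4, 5, 6),
                   V2 = c("A", "B", "C", "D", "E", "F"))

dat3[, "V2"]              # column by name, returned as a vector
dat3[["V2"]]              # the same column, list-style extraction
dat3[, 2, drop = FALSE]   # keeps the result as a one-column dataframe
dat3[c(1, 3), ]           # several rows at once
dat3[-1, ]                # everything except the first row
```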
A useful function to inspect the structure of a dataframe is str(). Try running it.
str(dat3)
## 'data.frame': 6 obs. of 2 variables:
## $ V1: num 1 2 3 4 5 6
## $ V2: chr "A" "B" "C" "D" ...
Inspecting the structure of your data is vital, as you probably have imported your data from some other source. If we, at a later stage, start analyzing our data without the correct measurement level, we may run into problems. One problem that often occurs is that categorical variables (factors in R) are not coded as such (or the reverse).
Suppose that the first variable V1 in our dataframe dat3 is not coded correctly, but actually represents grouping information about cities. Convert the variable to a factor and add the labels Utrecht, New York, London, Singapore, Rome and Cape Town.
dat3$V1 <- factor(dat3$V1,
                  labels = c("Utrecht", "New York", "London", "Singapore", "Rome", "Cape Town"))
dat3
## V1 V2
## 1 Utrecht A
## 2 New York B
## 3 London C
## 4 Singapore D
## 5 Rome E
## 6 Cape Town F
Rerunning the str function confirms that we have now converted the data in the first column to a factor:
str(dat3)
## 'data.frame': 6 obs. of 2 variables:
## $ V1: Factor w/ 6 levels "Utrecht","New York",..: 1 2 3 4 5 6
## $ V2: chr "A" "B" "C" "D" ...
Note that this conversion was rather lazy, and that the factor function offers more controls if needed.
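One of those controls is the levels argument, which states explicitly which stored code belongs to which label. The sketch below uses a made-up vector city_codes, deliberately out of order, to show that the labels follow the codes and not the position in the data.

```r
city_codes <- c(3, 1, 2, 1)   # hypothetical codes, deliberately not in order

cities <- factor(city_codes,
                 levels = c(1, 2, 3),                           # the codes as stored
                 labels = c("Utrecht", "New York", "London"))   # one label per code

as.character(cities)   # "London" "Utrecht" "New York" "Utrecht"
levels(cities)         # "Utrecht" "New York" "London"
table(cities)          # counts per city
as.integer(cities)     # back to the underlying codes: 3 1 2 1
```

Stating levels and labels together makes the mapping visible in the code, which is safer than relying on the sorted order of the values.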
Open the workspace boys.RData. You need to download this file and put it in the project folder.
If you have organized your code in a project, the boys.RData file will be visible in the Files panel, where you can just click on it and confirm that you want to load the data file into your global environment. However, the preferred way is (almost) always to do this using code:
load("boys.RData")
Yet another option is to double-click the boys.RData file on your machine, e.g. in the Windows Explorer (right-click and open with RStudio if it does not open by default in RStudio, but in R).
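To see what load() actually does, the sketch below writes a small made-up dataframe called toy to a temporary .RData file (a stand-in for boys.RData) and loads it again. Loading into a separate environment, here called env, is optional; it is used only to show that load() restores objects under their original names and returns those names.

```r
# Write a small object to a temporary .RData file (stand-in for boys.RData)
toy <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))
tmp <- tempfile(fileext = ".RData")
save(toy, file = tmp)

# Load into a separate environment so nothing in the workspace is overwritten
env <- new.env()
loaded <- load(tmp, envir = env)

loaded                     # names of the restored objects: "toy"
ls(env)                    # "toy"
all.equal(env$toy, toy)    # TRUE: the restored copy matches the original
```

Because load() silently overwrites objects with the same name, checking the returned names can prevent surprises.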
Inspect the boys dataset (it is from package mice, by the way) by typing boys in the console, and subsequently by using the function View().
Tip: To understand the numbers you are looking at, look at the documentation by typing ?mice::boys
The output is not displayed here as it is simply too large.
Using View() is well suited for inspecting datasets that are large (but not huge). View() opens the dataset in a spreadsheet-like window (like Excel), but you cannot edit the content.
Find out the dimensions of the boys data set and inspect the first and final 6 cases in the data set.
To do it numerically, find out what the dimensions of the boys dataset are.
dim(boys)
## [1] 748 9
There are 748 cases on 9 variables. To select the first and last six cases, use
boys[c(1:6,743:748), ]
## age hgt wgt bmi hc gen phb tv reg
## 3 0.035 50.1 3.650 14.54 33.7 <NA> <NA> NA south
## 4 0.038 53.5 3.370 11.77 35.0 <NA> <NA> NA south
## 18 0.057 50.0 3.140 12.56 35.2 <NA> <NA> NA south
## 23 0.060 54.5 4.270 14.37 36.7 <NA> <NA> NA south
## 28 0.062 57.5 5.030 15.21 37.3 <NA> <NA> NA south
## 36 0.068 55.5 4.655 15.11 37.0 <NA> <NA> NA south
## 7410 20.372 188.7 59.800 16.79 55.2 <NA> <NA> NA west
## 7418 20.429 181.1 67.200 20.48 56.6 <NA> <NA> NA north
## 7444 20.761 189.1 88.000 24.60 NA <NA> <NA> NA west
## 7447 20.780 193.5 75.400 20.13 NA <NA> <NA> NA west
## 7451 20.813 189.0 78.000 21.83 59.9 <NA> <NA> NA north
## 7475 21.177 181.8 76.500 23.14 NA <NA> <NA> NA east
or, more efficiently:
head(boys, n=6)
## age hgt wgt bmi hc gen phb tv reg
## 3 0.035 50.1 3.650 14.54 33.7 <NA> <NA> NA south
## 4 0.038 53.5 3.370 11.77 35.0 <NA> <NA> NA south
## 18 0.057 50.0 3.140 12.56 35.2 <NA> <NA> NA south
## 23 0.060 54.5 4.270 14.37 36.7 <NA> <NA> NA south
## 28 0.062 57.5 5.030 15.21 37.3 <NA> <NA> NA south
## 36 0.068 55.5 4.655 15.11 37.0 <NA> <NA> NA south
tail(boys)
## age hgt wgt bmi hc gen phb tv reg
## 7410 20.372 188.7 59.8 16.79 55.2 <NA> <NA> NA west
## 7418 20.429 181.1 67.2 20.48 56.6 <NA> <NA> NA north
## 7444 20.761 189.1 88.0 24.60 NA <NA> <NA> NA west
## 7447 20.780 193.5 75.4 20.13 NA <NA> <NA> NA west
## 7451 20.813 189.0 78.0 21.83 59.9 <NA> <NA> NA north
## 7475 21.177 181.8 76.5 23.14 NA <NA> <NA> NA east
The functions head() and tail() are very useful functions. For example, from looking at the output of both functions we can observe that the data are very likely sorted based on age.
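The numerical approach does not have to rely on hard-coded row numbers such as 743:748. A sketch with a made-up dataframe toy (a stand-in for boys) that derives the last rows from nrow() and compares the result with head() and tail():

```r
toy <- data.frame(id = 1:20, value = (1:20)^2)   # hypothetical stand-in for boys

n <- nrow(toy)
first_last <- toy[c(1:3, (n - 2):n), ]   # first and last three rows

# The same rows via head() and tail()
same_rows <- rbind(head(toy, n = 3), tail(toy, n = 3))

first_last$id                                # 1 2 3 18 19 20
identical(first_last$id, same_rows$id)       # TRUE
```

The parentheses in (n - 2):n matter, because the colon operator binds more tightly than the minus sign.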
It seems that the boys data are sorted based on age. Verify this.
To verify if the data are indeed sorted, we can run the following command to test the complement of that statement. Remember that we can always search the help for functions: e.g. we could have searched here for ?sort and we would quickly have ended up at function is.unsorted(), which tests whether an object is not sorted:
is.unsorted(boys$age)
## [1] FALSE
which returns FALSE, indicating that boys’ age is not unsorted, which is the same as saying that it is sorted (double negative). To directly test if it is sorted, we could have used
!is.unsorted(boys$age)
## [1] TRUE
which tests if the data are not unsorted. In other words, the values TRUE and FALSE under is.unsorted() turn into FALSE and TRUE under !is.unsorted(), respectively.
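To get a feeling for what is.unsorted() counts as sorted, it may help to try it on a few tiny vectors. The sketch below also uses the strictly argument and a manual check with diff(); the vector x holds the first four ages shown in the output above.

```r
is.unsorted(c(1, 2, 2, 3))                    # FALSE: ties are allowed by default
is.unsorted(c(1, 2, 2, 3), strictly = TRUE)   # TRUE: ties now count as unsorted
is.unsorted(c(3, 1, 2))                       # TRUE

# A more explicit check: every value is at least as large as the one before it
x <- c(0.035, 0.038, 0.057, 0.060)
all(diff(x) >= 0)                             # TRUE
```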
Inspect the boys dataset with str(). Use one or more functions to find distributional summary information (at least information about the minimum, the maximum, the mean and the median) for all of the variables. Give the standard deviation for age and bmi.
Tip: make use of the help (?) and help search (??) functionality in R.
str(boys)
## 'data.frame': 748 obs. of 9 variables:
## $ age: num 0.035 0.038 0.057 0.06 0.062 0.068 0.068 0.071 0.071 0.073 ...
## $ hgt: num 50.1 53.5 50 54.5 57.5 55.5 52.5 53 55.1 54.5 ...
## $ wgt: num 3.65 3.37 3.14 4.27 5.03 ...
## $ bmi: num 14.5 11.8 12.6 14.4 15.2 ...
## $ hc : num 33.7 35 35.2 36.7 37.3 37 34.9 35.8 36.8 38 ...
## $ gen: Ord.factor w/ 5 levels "G1"<"G2"<"G3"<..: NA NA NA NA NA NA NA NA NA NA ...
## $ phb: Ord.factor w/ 6 levels "P1"<"P2"<"P3"<..: NA NA NA NA NA NA NA NA NA NA ...
## $ tv : int NA NA NA NA NA NA NA NA NA NA ...
## $ reg: Factor w/ 5 levels "north","east",..: 4 4 4 4 4 4 4 3 3 2 ...
summary(boys) # summary info
##       age              hgt              wgt              bmi
##  Min.   : 0.035   Min.   : 50.00   Min.   :  3.14   Min.   :11.77
##  1st Qu.: 1.581   1st Qu.: 84.88   1st Qu.: 11.70   1st Qu.:15.90
##  Median :10.505   Median :147.30   Median : 34.65   Median :17.45
##  Mean   : 9.159   Mean   :132.15   Mean   : 37.15   Mean   :18.07
##  3rd Qu.:15.267   3rd Qu.:175.22   3rd Qu.: 59.58   3rd Qu.:19.53
##  Max.   :21.177   Max.   :198.00   Max.   :117.40   Max.   :31.74
##                   NA's   :20       NA's   :4        NA's   :21
##        hc          gen        phb            tv           reg
##  Min.   :33.70   G1  : 56   P1  : 63   Min.   : 1.00   north: 81
##  1st Qu.:48.12   G2  : 50   P2  : 40   1st Qu.: 4.00   east :161
##  Median :53.00   G3  : 22   P3  : 19   Median :12.00   west :239
##  Mean   :51.51   G4  : 42   P4  : 32   Mean   :11.89   south:191
##  3rd Qu.:56.00   G5  : 75   P5  : 50   3rd Qu.:20.00   city : 73
##  Max.   :65.00   NA's:503   P6  : 41   Max.   :25.00   NA's :  3
##  NA's   :46                 NA's:503   NA's   :522
sd(boys$age) # standard deviation for age
## [1] 6.894052
sd(boys$bmi) # standard deviation for bmi
## [1] NA
sd(boys$bmi, na.rm = TRUE) # standard deviation for bmi (observed cases)
## [1] 3.053421
Note that bmi contains 21 missing values, as can be seen, e.g., by looking at the summary information. Therefore we need to use na.rm = TRUE to calculate the standard deviation on the observed cases only.
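The effect of na.rm is easiest to see on a short vector. The values in x below are made up for this sketch; one of them is missing.

```r
x <- c(14.5, 11.8, NA, 14.4, 15.2)   # hypothetical values with one missing

sum(is.na(x))           # number of missing values: 1
mean(x)                 # NA, because one value is unknown
mean(x, na.rm = TRUE)   # mean of the four observed values
sd(x, na.rm = TRUE)     # standard deviation of the four observed values

# na.omit() drops the missing values first and gives the same result
all.equal(sd(x, na.rm = TRUE), sd(na.omit(x)))   # TRUE
```

Returning NA by default is a deliberate safeguard: R does not silently ignore missing data unless we ask it to.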
Logical values (TRUE vs FALSE) are a very powerful tool in R. For example, we can select just the rows (respondents) in the data that are 20 years or older by putting a logical expression within the row index of the dataset:
boys2 <- boys[boys$age >= 20, ]
nrow(boys2)
## [1] 12
or, alternatively,
boys2.1 <- subset(boys, age >= 20)
nrow(boys2.1)
## [1] 12
To understand the first construction, we can do this in steps by constructing a logical vector a:
a <- boys$age >= 20
tail(a, n=100)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [97] TRUE TRUE TRUE TRUE
sum(a)
## [1] 12
boys_20 <- boys[a,]
nrow(boys_20)
## [1] 12
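The same steps can be traced on a vector that is small enough to check by hand. The ages below are made up for this sketch.

```r
age <- c(18.2, 20.4, 19.7, 21.1, 20.0)   # hypothetical ages

a <- age >= 20   # FALSE TRUE FALSE TRUE TRUE
sum(a)           # TRUE counts as 1, so this is the number of matches: 3
mean(a)          # proportion of matches: 0.6
which(a)         # positions of the matches: 2 4 5
age[a]           # the matching values: 20.4 21.1 20.0
```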
In the same way, we can select all boys that are older than 19 but younger than 19.5:
boys3 <- boys[boys$age > 19 & boys$age < 19.5, ]
nrow(boys3)
## [1] 18
or, alternatively,
boys3.2 <- subset(boys, age > 19 & age < 19.5)
nrow(boys3.2)
## [1] 18
What is the mean age of boys younger than 15 who do not live in region north?
mean(boys$age[boys$age < 15 & boys$reg != "north"], na.rm = TRUE)
## [1] 6.044461
or, alternatively,
mean(subset(boys, age < 15 & reg != "north")$age, na.rm=TRUE)
## [1] 6.044461
The mean age is 6.04 years.
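The two approaches agree here, but they treat missing values in the condition differently, which is why na.rm = TRUE is needed in the first one. A sketch with a made-up dataframe toy in which one region is missing:

```r
toy <- data.frame(age = c(5, 12, 9, 30),
                  reg = c("north", "south", NA, "west"))   # hypothetical data

keep <- toy$age < 15 & toy$reg != "north"
keep                      # FALSE TRUE NA FALSE

nrow(toy[keep, ])         # 2: the NA produces an extra row filled with NA
nrow(subset(toy, keep))   # 1: subset() treats NA as FALSE

mean(toy$age[keep], na.rm = TRUE)   # 12
mean(subset(toy, keep)$age)         # 12
```

With square brackets a missing condition yields a missing row, whereas subset() simply drops it.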
End of practical B.