Program

In this lecture we will look at:

  • Packages
  • Reading CSV, SAS and Excel-files
  • Writing CSV and Excel
  • Reading and writing to a database
  • Creating functions

Packages

Packages

  • Until now we have worked with only the functionality in base
    • All R-installations have it
    • Powerful, robust and basic
  • But often we need more - that’s where packages come in…

What is a package?

  • Packages are a collection of functions and datasets
  • Extends the functionality of R
    • Read different file-formats (Excel, CSV, JSON, SAS…)
    • Create more advanced statistical models
    • Create beautiful figures, reports and webpages
    • Introduce more convenient methods for data manipulation
    • … and much more
  • An example is the package mice which can be used for Multivariate Imputation by Chained Equations (don’t worry about this, but is quite useful)

Packages

  • Installation
install.packages("mice")
  • Load it
library(mice)
## Warning: package 'mice' was built under R version 4.4.2
  • Use it
boys[1:5,]
##      age  hgt  wgt   bmi   hc  gen  phb tv   reg
## 3  0.035 50.1 3.65 14.54 33.7 <NA> <NA> NA south
## 4  0.038 53.5 3.37 11.77 35.0 <NA> <NA> NA south
## 18 0.057 50.0 3.14 12.56 35.2 <NA> <NA> NA south
## 23 0.060 54.5 4.27 14.37 36.7 <NA> <NA> NA south
## 28 0.062 57.5 5.03 15.21 37.3 <NA> <NA> NA south

Installation

  • New packages are installed with the function install.packages
  • It searches the CRAN-repository for packages with the provided name
  • More than 19470 packages. Generally of high quality.
  • Automatically install dependencies if needed
install.packages("mice")
## Warning: package 'mice' is in use and will not be installed

CRAN

Load it

  • Once a package is installed you can load it with library.
  • This attaches it to your session and you can use all the functions and datasets in it
  • But you can not see it in your environment?
library(mice)
boys[1:3,]
##      age  hgt  wgt   bmi   hc  gen  phb tv   reg
## 3  0.035 50.1 3.65 14.54 33.7 <NA> <NA> NA south
## 4  0.038 53.5 3.37 11.77 35.0 <NA> <NA> NA south
## 18 0.057 50.0 3.14 12.56 35.2 <NA> <NA> NA south
ls()
## character(0)

Load it

If for some reasons you want to add the boys data set to your environment, just do it like this:

data (boys)

Now it is there:

ls()
## [1] "boys"

Load it

  • It is hidden underneath your global environment, so all the available functions do not clutter up you workspace
  • But you can use everything in it as though it is
  • See help-file for package with ?mice

See contents of package

  • To see all the functions in a package you can write help(package = "mice")
  • Takes you to the index help-page for the package
HTML5 Icon

Use it

  • Now we can call the functions and datasets in the package as other object we have created. For instance we now have access to the boys-dataset without loading the boys.RData
boys[1:5,]
##      age  hgt  wgt   bmi   hc  gen  phb tv   reg
## 3  0.035 50.1 3.65 14.54 33.7 <NA> <NA> NA south
## 4  0.038 53.5 3.37 11.77 35.0 <NA> <NA> NA south
## 18 0.057 50.0 3.14 12.56 35.2 <NA> <NA> NA south
## 23 0.060 54.5 4.27 14.37 36.7 <NA> <NA> NA south
## 28 0.062 57.5 5.03 15.21 37.3 <NA> <NA> NA south

Use it

  • To only load a single object from a package use the package::object-notation
  • Then you do not have to load the entire package
mice::boys[1:3,]
##      age  hgt  wgt   bmi   hc  gen  phb tv   reg
## 3  0.035 50.1 3.65 14.54 33.7 <NA> <NA> NA south
## 4  0.038 53.5 3.37 11.77 35.0 <NA> <NA> NA south
## 18 0.057 50.0 3.14 12.56 35.2 <NA> <NA> NA south

Finding packages

  • Can be overwhelming
  • Good places to start:
  • Else Google is your friend, but do not take the first and best!
    • Is it a popular package?
    • Check who the creator is and decide if you trust them
    • Is the project still maintained (i.e. when was the last update on github or where-ever the project lives)

Getting more help

  • Many of the popular packages have more user-friendly documentation called vignettes
browseVignettes("readr")

Getting more help

Getting more help

  • Most popular packages should have either a good help page, a vignette or a cheat sheet… Or preferably all of them
  • …but often when you have problems the best source of help is the internet.

Packages

We will use these packages in this lecture

## For reading and writing files
library(readr)
library(haven)
library(openxlsx)
## Warning: package 'openxlsx' was built under R version 4.4.2
## For working with databases
library(DBI)
library(RSQLite)
## Warning: package 'RSQLite' was built under R version 4.4.1

Reading datafiles

Path to files

  • When you start R it has a Working Directory. It is either your systems userdirectory or the directory of your project.
  • To see your current Working Directory use the function getwd
getwd()
## [1] "C:/Users/Morten"
  • R will first look for files in your Working Directory

Example

  • I have created a small textfile called “example.txt” and put it in the path “C:/Users/Morten/example.txt”
  • R can now find it easily because it is in my working directory.
  • You just need to write:
file("example.txt")
  • Alternatively I could write the full path
  • Notice that we use slash (/) and not backslash(\):
file("C:/Users/Morten/example.txt")
  • This is called a relative and an absolute path.

Absolute vs. relative path

As a rule of thumb:

  • When the file is in your project folder you should use relative paths
    • This makes your project portable and avoids unnecessary long paths
  • If the file is a system file or on a shared drive you should use absolute paths

Writing to files

  • Often you would like to export your datasets into a permanent storage once you are done working with them. An example we already used was boys.RData.
  • Today we will export the same dataset to some more common formats.

Writing to files

  • Can be simple
# CSV (using the package readr)
readr::write_csv2(boys, "files/boys.csv")

# Excel (using the package openxlsx)
openxlsx::write.xlsx(boys, "files/boys.xlsx")

# STATA (using the package haven)
haven::write_dta(boys, "files/boys.dta")
  • All these functions have a lot of arguments!

Reading from files

  • Also simple
# CSV
boys_csv <- readr::read_csv2("files/boys.csv")

# Excel
boys_xlsx <- openxlsx::read.xlsx("files/boys.xlsx")

# SAS
boys_dta <- haven::read_dta("files/boys.dta")
  • Notice that we use write_csv2/read_csv2 and not write_csv/read_csv.
  • This uses ; to separate values instead of , which can be problematic in our region

Reading from files

  • But something has changed
boys_csv[1:5,]
## # A tibble: 5 × 9
##     age   hgt   wgt   bmi    hc gen   phb      tv reg  
##   <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr>
## 1 0.035  50.1  3.65  14.5  33.7 <NA>  <NA>     NA south
## 2 0.038  53.5  3.37  11.8  35   <NA>  <NA>     NA south
## 3 0.057  50    3.14  12.6  35.2 <NA>  <NA>     NA south
## 4 0.06   54.5  4.27  14.4  36.7 <NA>  <NA>     NA south
## 5 0.062  57.5  5.03  15.2  37.3 <NA>  <NA>     NA south
  • CSV-files does not have the datatype factor and it is cast as character
  • Always inspect your data after importing!

Practical (tasks 1 - 6)

Databases

Databases

  • Tabular data can be stored in relational databases such as
    • Oracle (ROracle)
    • PostGres (RPostgresql)
    • MySql (RMysql)
    • SQLite (RSQLite)
  • Working with databases is an entire course in itself, but we will look at how to use the DBI-package to interact with SQLite.
  • Very similar syntax when using other database-backends

DBI

  • Short for DataBase Interface
  • Strives to make a common interface to all the different backends
  • All functions start with db and then a verb s.a. dbConnect
  • Most major backends are supported.
  • Workflow:
    • Connect to database
    • Do operations (writing or reading data)
    • Disconnect
  • Well maintained and good documentation at dbi.r-dbi.org

R and Databases

HTML5 Icon

Connecting

  • We will use SQLite as an example since we can run it ourselves on our computers.
  • To connect to the database use dbConnect:
library(DBI)
con <- dbConnect(drv = RSQLite::SQLite(),
                 dbname = "files/boys_database.sqlite")

con
## <SQLiteConnection>
##   Path: C:\data\08. International Consulting\ ...\files\boys_database.sqlite
##   Extensions: TRUE
  • The database is a simple file in SQLite.
  • The argument dbname could be any valid pathname

Write to database

  • To write a dataset to a database use dbWriteTable
dbWriteTable(con,
             name = "boys_table",
             value = boys)
  • We can now see that a table has been create
dbListTables(con)
## [1] "boys_table"

Reading from database

  • To read a table into R we use dbReadTable
boys_from_db <- dbReadTable(con, "boys_table")
boys_from_db[1:5,]
##     age  hgt  wgt   bmi   hc  gen  phb tv   reg
## 1 0.035 50.1 3.65 14.54 33.7 <NA> <NA> NA south
## 2 0.038 53.5 3.37 11.77 35.0 <NA> <NA> NA south
## 3 0.057 50.0 3.14 12.56 35.2 <NA> <NA> NA south
## 4 0.060 54.5 4.27 14.37 36.7 <NA> <NA> NA south
## 5 0.062 57.5 5.03 15.21 37.3 <NA> <NA> NA south

SQL-queries

  • You can also use SQL to make more specific and powerful queries from your database
  • It is beyond the scope of this course, but if you work with databases it is worth it to learn some basic SQL
  • To send a query to the database and fetch the result use dbGetQuery
old_boys <- dbGetQuery(con, "SELECT * FROM BOYS_TABLE WHERE age > 20")
old_boys[1:5,]
##      age   hgt  wgt   bmi   hc  gen  phb tv   reg
## 1 20.010 170.0 68.8 23.80 55.5   G5   P6 25  east
## 2 20.030 178.6 71.0 22.25 57.2   G5   P5 25  city
## 3 20.032 184.0 73.0 21.56 56.0 <NA> <NA> NA north
## 4 20.117 188.7 89.4 25.10 58.1   G5   P6 25  east
## 5 20.281 185.1 81.1 23.67 58.8   G5   P6 20 south

Deleting tables

  • To remove a table from the database use dbRemoveTable
dbRemoveTable(con, "boys_table")

Disconnecting

  • When you are done working with your database, remember to disconnect:
dbDisconnect(con)

Other databases

  • SQlite is often not a good candidate for a database. When working with official statistics you want a more featurecomplete database running on a dedicated server.
  • Everything apart from dbConnect is the same no matter what kind of database-backend you use.
  • You have such a server, maybe an Oracle server?
  • A good place to start is solutions.posit.co/connections/db/databases/

Practical (tasks 7 - 14)