Import data ⬆️

# Import data <br> ⬆️

---

layout: true
  
<div class="my-footer">
<span>
<a href="https://rstd.io/bootcamper" target="_blank">rstd.io/bootcamper</a>
</span>
</div>

---

# Reading rectangular data into R

---

.pull-left[
<img src="img/readr.png" width="80%" />
]
.pull-right[
<img src="img/readxl.png" width="80%" />
]

---

## readr

- `read_csv()` - comma delimited files
- `read_csv2()` - semicolon separated files (common in countries where , is used as the decimal place)
- `read_tsv()` - tab delimited files
- `read_delim()` - reads in files with any delimiter
- `read_fwf()` - fixed width files
- `read_table()` - common variation of fixed width files where columns are separated by white space
- ...

---

## Reading data

```r
nobel <- read_csv(file = "data/nobel.csv")
```

```
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   id = col_double(),
##   year = col_double(),
##   born_date = col_date(format = ""),
##   died_date = col_date(format = ""),
##   share = col_double()
## )
```

```
## See spec(...) for full column specifications.
```

---

```
## ── Data Summary ────────────────────────
##                            Values
## Name                       nobel 
## Number of rows             935   
## Number of columns          26    
## _______________________          
## Column type frequency:           
##   character                21    
##   Date                     2     
##   numeric                  3     
## ________________________         
## Group variables                  
## 
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────
##    skim_variable         n_missing complete_rate   min   max empty n_unique whitespace
##  1 firstname                     0        1          2    59     0      720          0
##  2 surname                      29        0.969      2    26     0      851          0
##  3 category                      0        1          5    10     0        6          0
##  4 affiliation                 250        0.733      4   110     0      303          0
##  5 city                        255        0.727      4    27     0      185          0
##  6 country                     254        0.728      3    14     0       27          0
##  7 gender                        0        1          3     6     0        3          0
##  8 born_city                    28        0.970      3    29     0      613          0
##  9 born_country                 28        0.970      3    28     0       80          0
## 10 born_country_code            28        0.970      2     2     0       77          0
## 11 died_city                   327        0.650      4    29     0      303          0
## 12 died_country                321        0.657      3    16     0       48          0
## 13 died_country_code           321        0.657      2     2     0       46          0
## 14 overall_motivation          918        0.0182    55   114     0        7          0
## 15 motivation                    0        1         24   337     0      656          0
## 16 born_country_original        28        0.970      3    52     0      122          0
## 17 born_city_original           28        0.970      3    36     0      616          0
## 18 died_country_original       321        0.657      3    35     0       52          0
## 19 died_city_original          327        0.650      4    29     0      303          0
## 20 city_original               255        0.727      4    27     0      185          0
## 21 country_original            254        0.728      3    35     0       29          0
## 
## ── Variable type: Date ─────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate min        max        median     n_unique
## 1 born_date            33         0.965 1817-11-30 1997-07-12 1916-06-28      885
## 2 died_date           308         0.671 1903-11-01 2019-08-07 1983-03-09      616
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate    mean      sd    p0   p25   p50   p75  p100 hist 
## 1 id                    0             1  475.   278.        1  234.   470  716.   969 ▇▇▇▇▇
## 2 year                  0             1 1970.    33.3    1901 1947   1976 1999   2018 ▃▃▅▆▇
## 3 share                 0             1    1.99   0.936     1    1      2    3      4 ▇▇▁▅▂
```
]

---

## Writing data

- Write a file

```r
df <- tribble(
  ~x, ~y,
  1,  "a",
  2,  "b",
  3,  "c"
)

write_csv(df, path = "data/df.csv")
```

- Check that it got written out

```r
fs::dir_ls("data")
```

```
## data/df-na.csv           data/df.csv              data/edi-airbnb.csv      data/favourite-food.xlsx 
## data/nobel.csv           data/sales               data/sales.xlsx
```

---

Read in the `nobels.csv` file from the `data-raw/` folder, split into two (STEM and 
non-STEM), and write out these two pieces to `data/`.

* Import data from `data-raw/`.
* Create a new data frame, `nobel_stem`, that filters for the STEM fields 
(Physics, Medicine, Chemistry, and Economics).
* Create another data frame, `nobel_nonstem`, that filters for the remaining 
fields.
* Write out the two data frames to `nobel-stem.csv` and `nobel-nonstem.csv`, 
respectively, to `data/`.

**Hint:** Use the `%in%` operator when filtering.
]

.footnote[
RStudio Cloud workspace for this bootcamp  
is at [rstd.io/bootcamper-cloud](https://rstd.io/bootcamper-cloud).
]
---

# Variable names

---

```r
edi_airbnb <- read_csv("data/edi-airbnb.csv")
names(edi_airbnb)
```

```
##  [1] "ID"                   "Price"                "neighbourhood"        "accommodates"        
##  [5] "Number of bathrooms"  "Number of Bedrooms"   "n beds"               "Review Scores Rating"
##  [9] "Number of reviews"    "listing_url"
```

... but R doesn't allow spaces in variable names

```r
ggplot(edi_airbnb, aes(x = Number of bathrooms, y = Price)) +
  geom_point()
```

```
## Error: <text>:1:35: unexpected symbol
## 1: ggplot(edi_airbnb, aes(x = Number of
##                                       ^
```

---

## Option 1 - Define column names

```r
edi_airbnb_col_names <- read_csv("data/edi-airbnb.csv",
  col_names = c("id", "price", "neighbourhood", "accommodates",
                "bathroom", "bedroom", "bed", 
                "review_scores_rating", "n_reviews", "url"))
```

```
## Parsed with column specification:
## cols(
##   id = col_character(),
##   price = col_character(),
##   neighbourhood = col_character(),
##   accommodates = col_character(),
##   bathroom = col_character(),
##   bedroom = col_character(),
##   bed = col_character(),
##   review_scores_rating = col_character(),
##   n_reviews = col_character(),
##   url = col_character()
## )
```

```r
names(edi_airbnb_col_names)
```

```
##  [1] "id"                   "price"                "neighbourhood"        "accommodates"        
##  [5] "bathroom"             "bedroom"              "bed"                  "review_scores_rating"
##  [9] "n_reviews"            "url"
```
]

---

## Option 2 - Format text to snake_case

```r
edi_airbnb_cleaned_names <- edi_airbnb %>%
  janitor::clean_names()

names(edi_airbnb_cleaned_names)
```

```
##  [1] "id"                   "price"                "neighbourhood"        "accommodates"        
##  [5] "number_of_bathrooms"  "number_of_bedrooms"   "n_beds"               "review_scores_rating"
##  [9] "number_of_reviews"    "listing_url"
```

---

# Variable types

---

```r
read_csv("data/df-na.csv")
```

```
## # A tibble: 9 x 3
##   x     y              z     
##   <chr> <chr>          <chr> 
## 1 1     a              hi    
## 2 <NA>  b              hello 
## 3 3     Not applicable 9999  
## 4 4     d              ola   
## 5 5     e              hola  
## 6 .     f              whatup
## # … with 3 more rows
```
]

---

## Option 1. Explicit NAs

```r
read_csv("data/df-na.csv", 
         na = c("", "NA", ".", "9999", "Not applicable"))
```

```
## # A tibble: 9 x 3
##       x y     z     
##   <dbl> <chr> <chr> 
## 1     1 a     hi    
## 2    NA b     hello 
## 3     3 <NA>  <NA>  
## 4     4 d     ola   
## 5     5 e     hola  
## 6    NA f     whatup
## # … with 3 more rows
```
]

---

## Option 2. Specify column types

```r
read_csv("data/df-na.csv", 
  col_types = list(col_double(), col_character(), col_character()))
```

```
## Warning: 1 parsing failure.
## row col expected actual             file
##   6   x a double      . 'data/df-na.csv'
```

```
## # A tibble: 9 x 3
##       x y              z     
##   <dbl> <chr>          <chr> 
## 1     1 a              hi    
## 2    NA b              hello 
## 3     3 Not applicable 9999  
## 4     4 d              ola   
## 5     5 e              hola  
## 6    NA f              whatup
## # … with 3 more rows
```
]

---

## Column types

**type function**  | **data type**
------------------ | -------------
`col_character()`  | character
`col_date()`       | date
`col_datetime()`   | POSIXct (date-time)
`col_double()`     | double (numeric)
`col_factor()`     | factor
`col_guess()`      | let readr guess (default)
`col_integer()`    | integer
`col_logical()`    | logical
`col_number()`     | numbers mixed with non-number characters
`col_numeric()`    | double or integer
`col_skip()`       | do not read
`col_time()`       | time

---

**Exercise 1**
* Read in the Excel file called `favourite-food.xlsx` from the `data-raw/` folder.
* Clean up `NA`s and make sure you're happy with variable types.
* Convert SES (socio economic status) to a factor variables with levels in the 
following order: `Low`, `Middle`, `High`.
* Write out the resulting data frame to `favourite-food.csv` in the `data/` folder.
* Finally, read `favourite-food.csv` back in from  
the `data/` folder and observe the variable types.  
Are they as you left them?
]

.footnote[
RStudio Cloud workspace for this bootcamp  
is at [rstd.io/bootcamper-cloud](https://rstd.io/bootcamper-cloud).
]

---

## `read_rds()` and `write_rds()`

- CSVs can be unreliable for saving interim results if there is specific 
variable type information you want to hold on to.
- An alterive is RDS files, you can read and write them with `read_rds()` and 
`write_rds()`, respectively.

```r
read_rds(path)
write_rds(x, path)
```

---

**Exercise 2**
* Repeat the first three steps from Exercise 1.
* Write out the resulting data frame to `favourite-food.rds` in the `data/` folder.
* Read `favourite-food.rds` back in from  the `data/` folder and observe the 
variable types.  Are they as you left them?
]

.footnote[
RStudio Cloud workspace for this bootcamp  
is at [rstd.io/bootcamper-cloud](https://rstd.io/bootcamper-cloud).
]

---

Load the `sales.xlsx` file from the `data-raw/` folder, using appropriate 
arguments for the `read_excel()` function such that it looks like the following.
If done, see next slide for stretch goal.
]

```
## # A tibble: 9 x 2
##   id      n    
##   <chr>   <chr>
## 1 Brand 1 n    
## 2 1234    8    
## 3 8721    2    
## 4 1822    3    
## 5 Brand 2 n    
## 6 3333    1    
## # … with 3 more rows
```
]

.footnote[
RStudio Cloud workspace for this bootcamp  
is at [rstd.io/bootcamper-cloud](https://rstd.io/bootcamper-cloud).
]

---

**Stretch goal:** Manipulate the sales data such that it looks like the 
following.
]

```
## # A tibble: 7 x 3
##   brand      id     n
##   <chr>   <dbl> <dbl>
## 1 Brand 1  1234     8
## 2 Brand 1  8721     2
## 3 Brand 1  1822     3
## 4 Brand 2  3333     1
## 5 Brand 2  2156     3
## 6 Brand 2  3987     6
## # … with 1 more row
```
]

---

# Importing many files

---

```r
sales_files <- fs::dir_ls("data/sales")
sales_files
```

```
## data/sales/01-sales.csv data/sales/02-sales.csv data/sales/03-sales.csv
```

```r
library(vroom)
sales <- vroom(sales_files, id = "filename")
sales
```

```
## # A tibble: 19 x 6
##   filename                month    year brand  item     n
##   <chr>                   <chr>   <dbl> <dbl> <dbl> <dbl>
## 1 data/sales/01-sales.csv January  2019     1  1234     3
## 2 data/sales/01-sales.csv January  2019     1  8721     9
## 3 data/sales/01-sales.csv January  2019     1  1822     2
## 4 data/sales/01-sales.csv January  2019     2  3333     1
## 5 data/sales/01-sales.csv January  2019     2  2156     9
## 6 data/sales/01-sales.csv January  2019     2  3987     6
## # … with 13 more rows
```

---

## vroom vroom!!

.pull-left[
<img src="img/vroom.png" width="80%" style="display: block; margin: auto;" />
]
.pull-right[
- **vroom** is most useful for reading large amounts of data in, fast!
- and it has nice bells-and-whistles like delimiter guessing, reading many files in at once, etc.
- Learn more at [vroom.r-lib.org](https://vroom.r-lib.org/)
]

---

# Other types of data

---

## Other types of data

- **googlesheets4:** Google Sheets
- **haven**: SPSS, Stata, and SAS files
- **DBI**, along with a database specific backend (e.g. RMySQL, RSQLite, RPostgreSQL etc): allows you to run SQL queries against a database and return a data frame
- **jsonline**: JSON
- **xml2**: xml
- **rvest**: web scraping
- **httr**: web APIs
- **sparklyr**: data loaded into spark