class: center, middle, inverse, title-slide # Import data
⬆️ --- layout: true <div class="my-footer"> <span> <a href="https://rstd.io/bootcamper" target="_blank">rstd.io/bootcamper</a> </span> </div> --- class: middle # Reading rectangular data into R --- class: middle .pull-left[ <img src="img/readr.png" width="80%" /> ] .pull-right[ <img src="img/readxl.png" width="80%" /> ] --- ## readr - `read_csv()` - comma delimited files - `read_csv2()` - semicolon separated files (common in countries where , is used as the decimal place) - `read_tsv()` - tab delimited files - `read_delim()` - reads in files with any delimiter - `read_fwf()` - fixed width files - `read_table()` - common variation of fixed width files where columns are separated by white space - ... --- ## Reading data ```r nobel <- read_csv(file = "data/nobel.csv") ``` ``` ## Parsed with column specification: ## cols( ## .default = col_character(), ## id = col_double(), ## year = col_double(), ## born_date = col_date(format = ""), ## died_date = col_date(format = ""), ## share = col_double() ## ) ``` ``` ## See spec(...) for full column specifications. ``` --- .tiny[ ``` ## ── Data Summary ──────────────────────── ## Values ## Name nobel ## Number of rows 935 ## Number of columns 26 ## _______________________ ## Column type frequency: ## character 21 ## Date 2 ## numeric 3 ## ________________________ ## Group variables ## ## ── Variable type: character ──────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate min max empty n_unique whitespace ## 1 firstname 0 1 2 59 0 720 0 ## 2 surname 29 0.969 2 26 0 851 0 ## 3 category 0 1 5 10 0 6 0 ## 4 affiliation 250 0.733 4 110 0 303 0 ## 5 city 255 0.727 4 27 0 185 0 ## 6 country 254 0.728 3 14 0 27 0 ## 7 gender 0 1 3 6 0 3 0 ## 8 born_city 28 0.970 3 29 0 613 0 ## 9 born_country 28 0.970 3 28 0 80 0 ## 10 born_country_code 28 0.970 2 2 0 77 0 ## 11 died_city 327 0.650 4 29 0 303 0 ## 12 died_country 321 0.657 3 16 0 48 0 ## 13 died_country_code 321 0.657 2 2 0 46 0 ## 14 overall_motivation 918 0.0182 55 114 0 7 0 ## 15 motivation 0 1 24 337 0 656 0 ## 16 born_country_original 28 0.970 3 52 0 122 0 ## 17 born_city_original 28 0.970 3 36 0 616 0 ## 18 died_country_original 321 0.657 3 35 0 52 0 ## 19 died_city_original 327 0.650 4 29 0 303 0 ## 20 city_original 255 0.727 4 27 0 185 0 ## 21 country_original 254 0.728 3 35 0 29 0 ## ## ── Variable type: Date ───────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate min max median n_unique ## 1 born_date 33 0.965 1817-11-30 1997-07-12 1916-06-28 885 ## 2 died_date 308 0.671 1903-11-01 2019-08-07 1983-03-09 616 ## ## ── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────────────── ## skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist ## 1 id 0 1 475. 278. 1 234. 470 716. 969 ▇▇▇▇▇ ## 2 year 0 1 1970. 33.3 1901 1947 1976 1999 2018 ▃▃▅▆▇ ## 3 share 0 1 1.99 0.936 1 1 2 3 4 ▇▇▁▅▂ ``` ] --- ## Writing data - Write a file ```r df <- tribble( ~x, ~y, 1, "a", 2, "b", 3, "c" ) write_csv(df, path = "data/df.csv") ``` - Check that it got written out ```r fs::dir_ls("data") ``` ``` ## data/df-na.csv data/df.csv data/edi-airbnb.csv data/favourite-food.xlsx ## data/nobel.csv data/sales data/sales.xlsx ``` --- .your-turn[ Start the assignment 07 - Import data and open `01-import-export.Rmd`. Read in the `nobels.csv` file from the `data-raw/` folder, split into two (STEM and non-STEM), and write out these two pieces to `data/`. * Import data from `data-raw/`. * Create a new data frame, `nobel_stem`, that filters for the STEM fields (Physics, Medicine, Chemistry, and Economics). * Create another data frame, `nobel_nonstem`, that filters for the remaining fields. * Write out the two data frames to `nobel-stem.csv` and `nobel-nonstem.csv`, respectively, to `data/`. **Hint:** Use the `%in%` operator when filtering. ]
10
:
00
.footnote[ RStudio Cloud workspace for this bootcamp is at [rstd.io/bootcamper-cloud](https://rstd.io/bootcamper-cloud). ] --- class: middle # Variable names --- ```r edi_airbnb <- read_csv("data/edi-airbnb.csv") names(edi_airbnb) ``` ``` ## [1] "ID" "Price" "neighbourhood" "accommodates" ## [5] "Number of bathrooms" "Number of Bedrooms" "n beds" "Review Scores Rating" ## [9] "Number of reviews" "listing_url" ``` -- ... but R doesn't allow spaces in variable names ```r ggplot(edi_airbnb, aes(x = Number of bathrooms, y = Price)) + geom_point() ``` ``` ## Error: <text>:1:35: unexpected symbol ## 1: ggplot(edi_airbnb, aes(x = Number of ## ^ ``` --- ## Option 1 - Define column names .small[ ```r edi_airbnb_col_names <- read_csv("data/edi-airbnb.csv", col_names = c("id", "price", "neighbourhood", "accommodates", "bathroom", "bedroom", "bed", "review_scores_rating", "n_reviews", "url")) ``` ``` ## Parsed with column specification: ## cols( ## id = col_character(), ## price = col_character(), ## neighbourhood = col_character(), ## accommodates = col_character(), ## bathroom = col_character(), ## bedroom = col_character(), ## bed = col_character(), ## review_scores_rating = col_character(), ## n_reviews = col_character(), ## url = col_character() ## ) ``` ```r names(edi_airbnb_col_names) ``` ``` ## [1] "id" "price" "neighbourhood" "accommodates" ## [5] "bathroom" "bedroom" "bed" "review_scores_rating" ## [9] "n_reviews" "url" ``` ] --- ## Option 2 - Format text to snake_case ```r edi_airbnb_cleaned_names <- edi_airbnb %>% janitor::clean_names() names(edi_airbnb_cleaned_names) ``` ``` ## [1] "id" "price" "neighbourhood" "accommodates" ## [5] "number_of_bathrooms" "number_of_bedrooms" "n_beds" "review_scores_rating" ## [9] "number_of_reviews" "listing_url" ``` --- class: middle # Variable types --- .discussion[ Which class is `x`? Why? ] .pull-left[ <br><br><br> <img src="img/df-na.png" width="100%" /> ] .pull-right[ ```r read_csv("data/df-na.csv") ``` ``` ## # A tibble: 9 x 3 ## x y z ## <chr> <chr> <chr> ## 1 1 a hi ## 2 <NA> b hello ## 3 3 Not applicable 9999 ## 4 4 d ola ## 5 5 e hola ## 6 . f whatup ## # … with 3 more rows ``` ] --- ## Option 1. Explicit NAs ```r read_csv("data/df-na.csv", na = c("", "NA", ".", "9999", "Not applicable")) ``` .pull-left[ <br> <img src="img/df-na.png" width="100%" /> ] .pull-right[ ``` ## # A tibble: 9 x 3 ## x y z ## <dbl> <chr> <chr> ## 1 1 a hi ## 2 NA b hello ## 3 3 <NA> <NA> ## 4 4 d ola ## 5 5 e hola ## 6 NA f whatup ## # … with 3 more rows ``` ] --- ## Option 2. Specify column types .midi[ ```r read_csv("data/df-na.csv", col_types = list(col_double(), col_character(), col_character())) ``` ``` ## Warning: 1 parsing failure. ## row col expected actual file ## 6 x a double . 'data/df-na.csv' ``` ``` ## # A tibble: 9 x 3 ## x y z ## <dbl> <chr> <chr> ## 1 1 a hi ## 2 NA b hello ## 3 3 Not applicable 9999 ## 4 4 d ola ## 5 5 e hola ## 6 NA f whatup ## # … with 3 more rows ``` ] --- ## Column types **type function** | **data type** ------------------ | ------------- `col_character()` | character `col_date()` | date `col_datetime()` | POSIXct (date-time) `col_double()` | double (numeric) `col_factor()` | factor `col_guess()` | let readr guess (default) `col_integer()` | integer `col_logical()` | logical `col_number()` | numbers mixed with non-number characters `col_numeric()` | double or integer `col_skip()` | do not read `col_time()` | time --- .your-turn[ In the assignment 07 - Import data and open `02-read-excel.Rmd`. **Exercise 1** * Read in the Excel file called `favourite-food.xlsx` from the `data-raw/` folder. * Clean up `NA`s and make sure you're happy with variable types. * Convert SES (socio economic status) to a factor variables with levels in the following order: `Low`, `Middle`, `High`. * Write out the resulting data frame to `favourite-food.csv` in the `data/` folder. * Finally, read `favourite-food.csv` back in from the `data/` folder and observe the variable types. Are they as you left them? ]
15
:
00
.footnote[ RStudio Cloud workspace for this bootcamp is at [rstd.io/bootcamper-cloud](https://rstd.io/bootcamper-cloud). ] --- ## `read_rds()` and `write_rds()` - CSVs can be unreliable for saving interim results if there is specific variable type information you want to hold on to. - An alterive is RDS files, you can read and write them with `read_rds()` and `write_rds()`, respectively. ```r read_rds(path) write_rds(x, path) ``` --- .your-turn[ In the assignment 07 - Import data, open `02-read-excel.Rmd`. **Exercise 2** * Repeat the first three steps from Exercise 1. * Write out the resulting data frame to `favourite-food.rds` in the `data/` folder. * Read `favourite-food.rds` back in from the `data/` folder and observe the variable types. Are they as you left them? ]
05
:
00
.footnote[ RStudio Cloud workspace for this bootcamp is at [rstd.io/bootcamper-cloud](https://rstd.io/bootcamper-cloud). ] --- .your-turn[ In the assignment 07 - Import data, open `03-tidy-fun.Rmd`. Load the `sales.xlsx` file from the `data-raw/` folder, using appropriate arguments for the `read_excel()` function such that it looks like the following. If done, see next slide for stretch goal. ] .midi[ ``` ## # A tibble: 9 x 2 ## id n ## <chr> <chr> ## 1 Brand 1 n ## 2 1234 8 ## 3 8721 2 ## 4 1822 3 ## 5 Brand 2 n ## 6 3333 1 ## # … with 3 more rows ``` ]
15
:
00
.footnote[ RStudio Cloud workspace for this bootcamp is at [rstd.io/bootcamper-cloud](https://rstd.io/bootcamper-cloud). ] --- .your-turn[ 07 - Import data / `03-tidy-fun.Rmd`: **Stretch goal:** Manipulate the sales data such that it looks like the following. ] .midi[ ``` ## # A tibble: 7 x 3 ## brand id n ## <chr> <dbl> <dbl> ## 1 Brand 1 1234 8 ## 2 Brand 1 8721 2 ## 3 Brand 1 1822 3 ## 4 Brand 2 3333 1 ## 5 Brand 2 2156 3 ## 6 Brand 2 3987 6 ## # … with 1 more row ``` ] --- class: middle # Importing many files --- ```r sales_files <- fs::dir_ls("data/sales") sales_files ``` ``` ## data/sales/01-sales.csv data/sales/02-sales.csv data/sales/03-sales.csv ``` ```r library(vroom) sales <- vroom(sales_files, id = "filename") sales ``` ``` ## # A tibble: 19 x 6 ## filename month year brand item n ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 data/sales/01-sales.csv January 2019 1 1234 3 ## 2 data/sales/01-sales.csv January 2019 1 8721 9 ## 3 data/sales/01-sales.csv January 2019 1 1822 2 ## 4 data/sales/01-sales.csv January 2019 2 3333 1 ## 5 data/sales/01-sales.csv January 2019 2 2156 9 ## 6 data/sales/01-sales.csv January 2019 2 3987 6 ## # … with 13 more rows ``` --- ## vroom vroom!! .pull-left[ <img src="img/vroom.png" width="80%" style="display: block; margin: auto;" /> ] .pull-right[ - **vroom** is most useful for reading large amounts of data in, fast! - and it has nice bells-and-whistles like delimiter guessing, reading many files in at once, etc. - Learn more at [vroom.r-lib.org](https://vroom.r-lib.org/) ] --- class: middle # Other types of data --- ## Other types of data - **googlesheets4:** Google Sheets - **haven**: SPSS, Stata, and SAS files - **DBI**, along with a database specific backend (e.g. RMySQL, RSQLite, RPostgreSQL etc): allows you to run SQL queries against a database and return a data frame - **jsonline**: JSON - **xml2**: xml - **rvest**: web scraping - **httr**: web APIs - **sparklyr**: data loaded into spark