class: center, middle, inverse, title-slide

# Visualize data
---
layout: true

<div class="my-footer">
<span>
<a href="https://rstd.io/bootcamper" target="_blank">rstd.io/bootcamper</a>
</span>
</div>

---
class: middle

# Data visualization

---

## Data visualization

> *"The simple graph has brought more information to the data analyst's mind than any other device." - John Tukey*

- Data visualization is the creation and study of the visual representation of data.
- There are many tools for visualizing data (R is one of them), and many approaches/systems within R for making data visualizations (**ggplot2** is one of them, and that's what we're going to use).

---

## ggplot2 `\(\in\)` tidyverse

.pull-left[
<img src="img/ggplot2-part-of-tidyverse.png" width="80%" style="display: block; margin: auto;" />
]

.pull-right[
```r
library(tidyverse)
```
- **ggplot2** is tidyverse's data visualization package
- The `gg` in "ggplot2" stands for Grammar of Graphics
- It is inspired by the book **Grammar of Graphics** by Leland Wilkinson
]

---

## Grammar of Graphics

A grammar of graphics is a tool that enables us to concisely describe the components of a graphic.

<img src="img/grammar-of-graphics.png" width="60%" style="display: block; margin: auto;" />

.footnote[
Source: [BloggoType](http://bloggotype.blogspot.com/2016/08/holiday-notes2-grammar-of-graphics.html)
]

---

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  labs(title = "Mass vs. height of Starwars characters",
       x = "Height (cm)", y = "Weight (kg)")
```

```
## Warning: Removed 28 rows containing missing values (geom_point).
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-5-1.png" width="60%" />

---

.discussion[
- What are the functions doing the plotting?
- What is the dataset being plotted?
- Which variable is on the x-axis and which variable is on the y-axis?
- What does the warning mean?
]

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  labs(
    title = "Mass vs. height of Starwars characters",
    x = "Height (cm)", y = "Weight (kg)"
  )
```

```
## Warning: Removed 28 rows containing missing values (geom_point).
```

---

.discussion[
What does `geom_smooth()` do?
]

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
* geom_smooth() +
  labs(title = "Mass vs. height of Starwars characters",
       x = "Height (cm)", y = "Weight (kg)")
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-7-1.png" width="50%" />

---

## Hello ggplot2!

- `ggplot()` is the main function in ggplot2
- Plots are constructed in layers
- Structure of the code for plots can be summarized as

```r
ggplot(data = [dataset], mapping = aes(x = [x-variable], y = [y-variable])) +
   geom_xxx() +
   other options
```

- For help with ggplot2
  + [ggplot2.tidyverse.org](http://ggplot2.tidyverse.org/)
  + [ggplot cheat sheet](https://github.com/rstudio/cheatsheets/blob/master/data-visualization-2.1.pdf)

---
class: middle

# Visualizing Star Wars

---

## Dataset terminology

- Each row is an **observation**
- Each column is a **variable**

```
## # A tibble: 87 x 15
##    name  height  mass hair_color skin_color eye_color birth_year sex   gender homeworld
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>  <chr>
##  1 Luke…    172    77 blond      fair       blue            19   male  mascu… Tatooine
##  2 C-3PO    167    75 <NA>       gold       yellow         112   none  mascu… Tatooine
##  3 R2-D2     96    32 <NA>       white, bl… red             33   none  mascu… Naboo
##  4 Dart…    202   136 none       white      yellow          41.9 male  mascu… Tatooine
##  5 Leia…    150    49 brown      light      brown           19   fema… femin… Alderaan
##  6 Owen…    178   120 brown, gr… light      blue            52   male  mascu… Tatooine
##  7 Beru…    165    75 brown      light      blue            47   fema… femin… Tatooine
##  8 R5-D4     97    32 <NA>       white, red red             NA   none  mascu… Tatooine
##  9 Bigg…    183    84 black      light      brown           24   male  mascu… Tatooine
## 10 Obi-…    182    77 auburn, w… fair       blue-gray       57   male  mascu… Stewjon
## # … with 77 more rows, and 5 more variables: species <chr>, films <list>,
## #   vehicles <list>, starships <list>, hair_color2 <fct>
```

---

## Luke Skywalker

![luke-skywalker](img/luke-skywalker.png)

---

## What's in the Star Wars data?

```r
glimpse(starwars)
```

```
## Rows: 87
## Columns: 15
## $ name        <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Organa", "…
## $ height      <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 228, 180,…
## $ mass        <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.0, 84.0,…
## $ hair_color  <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", NA, "blac…
## $ skin_color  <chr> "fair", "gold", "white, blue", "white", "light", "light", "light", …
## $ eye_color   <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue", "red", …
## $ birth_year  <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, 41.9, 64…
## $ sex         <chr> "male", "none", "none", "male", "female", "male", "female", "none",…
## $ gender      <chr> "masculine", "masculine", "masculine", "masculine", "feminine", "ma…
## $ homeworld   <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "Tatooine"…
## $ species     <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Human", "Dro…
## $ films       <list> [<"The Empire Strikes Back", "Revenge of the Sith", "Return of the…
## $ vehicles    <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imperial S…
## $ starships   <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1", <>, <>…
## $ hair_color2 <fct> Other, NA, NA, none, brown, Other, brown, NA, black, Other, Other, …
```

---

## Another look at Star Wars data

.pull-left[
The **skimr** package provides summary statistics the user can skim quickly to understand their data.
```r
library(skimr)
skim(starwars)
```
]

.pull-right[
<img src="img/skimr.png" width="50%" style="display: block; margin: auto;" />
]

---

.xsmall[
```
## ── Data Summary ────────────────────────
##                            Values
## Name                       starwars
## Number of rows             87
## Number of columns          15
## _______________________
## Column type frequency:
##   character                8
##   factor                   1
##   list                     3
##   numeric                  3
## ________________________
## Group variables
##
## ── Variable type: character ────────────────────────────────────────────────
##   skim_variable n_missing complete_rate   min   max empty n_unique whitespace
## 1 name                  0         1         3    21     0       87          0
## 2 hair_color            5         0.943     4    13     0       12          0
## 3 skin_color            0         1         3    19     0       31          0
## 4 eye_color             0         1         3    13     0       15          0
## 5 sex                   4         0.954     4    14     0        4          0
## 6 gender                4         0.954     8     9     0        2          0
## 7 homeworld            10         0.885     4    14     0       48          0
## 8 species               4         0.954     3    14     0       37          0
##
## ── Variable type: factor ───────────────────────────────────────────────────
##   skim_variable n_missing complete_rate ordered n_unique top_counts
## 1 hair_color2           5         0.943 FALSE          4 non: 37, bro: 18, Oth: 14, bla: 13
##
## ── Variable type: list ─────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate n_unique min_length max_length
## 1 films                 0             1       24          1          7
## 2 vehicles              0             1       11          0          2
## 3 starships             0             1       17          0          5
##
## ── Variable type: numeric ──────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean    sd    p0   p25   p50   p75  p100
## 1 height                6         0.931 174.   34.8    66 167     180 191     264
## 2 mass                 28         0.678  97.3 169.     15  55.6    79  84.5  1358
## 3 birth_year           44         0.494  87.6 155.      8  35      52  72     896
```
]

---

## What's in the Star Wars data?

.pull-left[
.discussion[
How many rows and columns does this dataset have? What does each row represent? What does each column represent?
]

```r
?starwars
```
]

.pull-right[
<img src="img/starwars-help-annotated.png" width="100%" />
]

---

## Mass vs. height

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point()
```

```
## Warning: Removed 28 rows containing missing values (geom_point).
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-16-1.png" width="60%" />

---

.your-turn[
- Go to RStudio Cloud and start the second assignment: `02 - Visualize Data`.
- Open the first R Markdown file: `01-starwars.Rmd`.
- Answer the first two questions, and if time allows also the third one.
- But, a mini R Markdown review before you get started!
]
.footnote[
RStudio Cloud workspace for this bootcamp is at [rstd.io/bootcamper-cloud](https://rstd.io/bootcamper-cloud).
]

---

## Labels

.small[
```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
* labs(title = "Mass vs. height of Starwars characters",
*      x = "Height (cm)",
*      y = "Weight (kg)")
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-18-1.png" width="70%" />
]

---

## Mass vs. height

.discussion[
How would you describe this relationship? What other variables would help us understand data points that don't follow the overall trend? Who is the not so tall but really chubby character?
]

<img src="02-visualize-data_files/figure-html/unnamed-chunk-19-1.png" width="60%" />

---

## Jabba!

<img src="img/jabbaplot.png" width="768" />

---

## Additional variables

We can map additional variables to various features of the plot:

- aesthetics
    - shape
    - colour
    - fill
    - size
    - alpha (transparency)
- faceting: small multiples displaying different subsets

---
class: middle

# Aesthetics

---

## Aesthetics options

Visual characteristics of plotting characters that can be **mapped to a specific variable** in the data are

- `color`
- `size`
- `fill`
- `shape`
- `alpha` (transparency)

---

## Mass vs. height + gender

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass, color = gender)) +
  geom_point()
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-21-1.png" width="70%" />

---

## Mass vs. height + gender

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass, color = gender,
*                                     size = birth_year)) +
  geom_point()
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-22-1.png" width="65%" />

---

## Fix it up!

.midi[
```r
ggplot(data = starwars, mapping = aes(x = height, y = mass, color = gender, size = birth_year)) +
  geom_point(alpha = 0.7) +
  labs(title = "Mass vs. height of Starwars characters",
       subtitle = "by gender and birth year",
       x = "Height (cm)", y = "Weight (kg)",
       color = "Gender", size = "Birth year") +
  theme_minimal() +
  theme(legend.direction = "horizontal", legend.position = "bottom", legend.box = "vertical")
```
]

---

![](02-visualize-data_files/figure-html/unnamed-chunk-24-1.png)<!-- -->

---

## Mass vs. height + gender

Let's now increase the size of all points *not* based on the values of a variable in the data, i.e. **set** size instead of **map** size:

.midi[
```r
ggplot(data = starwars, mapping = aes(x = height, y = mass, color = gender)) +
* geom_point(size = 2)
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-25-1.png" width="63%" />
]

---

## Aesthetics summary

- Continuous variables are measured on a continuous scale
- Discrete variables are measured (or often counted) on a discrete scale

aesthetics    | discrete                 | continuous
------------- | ------------------------ | ------------
color         | rainbow of colors        | gradient
size          | discrete steps           | linear mapping between area and value
shape         | different shape for each | *shouldn't (and doesn't) work*

- Use aesthetics to map features of a plot to a variable; define features in the geom for customization **not** mapped to a variable

---
class: middle

# Faceting

---

## Faceting

- Smaller plots that display different subsets of the data
- Useful for exploring conditional relationships and large data

---

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
* facet_grid(. ~ gender) +
  geom_point() +
  labs(title = "Mass vs. height of Starwars characters",
*      subtitle = "Faceted by gender")
```

![](02-visualize-data_files/figure-html/unnamed-chunk-26-1.png)<!-- -->

---

.your-turn[
Look through the next three slides titled Facet 1, 2, and 3, and describe what each plot displays. Think about how the code relates to the output.

**Note:** The plots in the next few slides do not have proper titles, axis labels, etc. because we want you to figure out what's happening in the plots. But you should always label your plots!
]
---

### Facet 1

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  facet_grid(gender ~ .)
```

![](02-visualize-data_files/figure-html/unnamed-chunk-28-1.png)<!-- -->

---

### Facet 2

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  facet_grid(. ~ gender)
```

![](02-visualize-data_files/figure-html/unnamed-chunk-29-1.png)<!-- -->

---

### Facet 3

```r
ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point() +
  facet_wrap(~ eye_color)
```

![](02-visualize-data_files/figure-html/unnamed-chunk-30-1.png)<!-- -->

---

## Facet summary

- `facet_grid()`:
    - 2d grid
    - `rows ~ cols`
    - use `.` for no split
- `facet_wrap()`: 1d ribbon wrapped into 2d

---
class: middle

# Why do we visualize?

---

.discussion[
Do you see anything out of the ordinary?
]

![](02-visualize-data_files/figure-html/unnamed-chunk-31-1.png)<!-- -->

---

.discussion[
How are people reporting lower vs. higher values of FB visits?
]

![](02-visualize-data_files/figure-html/unnamed-chunk-32-1.png)<!-- -->

---
class: middle

# Identifying variables

---

## Types of variables

- **Numerical variables** can be classified as **continuous** or **discrete** based on whether or not the variable can take on an infinite number of values or only non-negative whole numbers, respectively.
- If the variable is **categorical**, we can determine if it is **ordinal** based on whether or not the levels have a natural ordering. R uses the term **factor** for most categorical data.
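
As a rough illustration of how these variable types can look in R, here is a minimal base-R sketch. The vectors `heights`, `n_films`, `size_class`, and `eye` are made-up toy data (assumptions for this example, not columns of `starwars`).

```r
# Hypothetical toy vectors, invented for illustration only
heights <- c(172.5, 167.0, 96.3)             # continuous numerical (double)
n_films <- c(5L, 6L, 7L)                     # discrete numerical (integer counts)

# Ordinal categorical: the levels have a natural ordering
size_class <- factor(c("small", "large", "medium"),
                     levels = c("small", "medium", "large"),
                     ordered = TRUE)

# Categorical without a natural ordering: a plain (unordered) factor
eye <- factor(c("blue", "yellow", "red"))

is.numeric(heights)     # TRUE
is.integer(n_films)     # TRUE
is.ordered(size_class)  # TRUE
is.ordered(eye)         # FALSE
size_class < "large"    # TRUE FALSE TRUE (comparison follows the level order)
```

Only ordered factors support comparisons such as `<`, which is one practical reason to decide early whether a categorical variable is ordinal.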
---
class: middle

# Visualizing numerical data

---

## Histograms

```r
ggplot(data = starwars, mapping = aes(x = height)) +
  geom_histogram(binwidth = 10)
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-33-1.png" width="75%" />

---

## Density plots

```r
ggplot(data = starwars, mapping = aes(x = height)) +
  geom_density()
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-34-1.png" width="75%" />

---

## Box plots

```r
ggplot(data = starwars, mapping = aes(y = height)) +
  geom_boxplot()
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-35-1.png" width="60%" />

---
class: middle

# Visualizing relationships between numerical and categorical data

---

## Side-by-side box plots

```r
ggplot(data = starwars, mapping = aes(y = height, x = gender)) +
  geom_boxplot()
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-36-1.png" width="75%" />

---

## Scatter plot...

This is not a great representation of these data.

```r
ggplot(data = starwars, mapping = aes(y = height, x = gender)) +
  geom_point()
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-37-1.png" width="60%" />

---

## Violin plots

```r
ggplot(data = starwars, mapping = aes(y = height, x = gender)) +
  geom_violin()
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-38-1.png" width="75%" />

---

## Jitter plot

```r
ggplot(data = starwars, mapping = aes(y = height, x = gender)) +
  geom_jitter()
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-40-1.png" width="75%" />

---

## Beeswarm plots

```r
library(ggbeeswarm)

ggplot(data = starwars, mapping = aes(y = height, x = gender)) +
  geom_beeswarm()
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-41-1.png" width="70%" />

---
class: middle

# Visualizing categorical data

---

## Bar plots

```r
ggplot(data = starwars, mapping = aes(x = gender)) +
  geom_bar()
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-42-1.png" width="70%" />

---

## Segmented bar plots, counts
.midi[
```r
ggplot(data = starwars, mapping = aes(x = gender, fill = hair_color)) +
  geom_bar()
```

![](02-visualize-data_files/figure-html/unnamed-chunk-43-1.png)<!-- -->
]

---

## Recode hair color

Using the `fct_lump_min()` function from the **forcats** package, which is also part of the **tidyverse**, we lump hair colors that appear fewer than 10 times into a single "Other" level.

```r
starwars <- starwars %>%
  mutate(
    hair_color2 = fct_lump_min(hair_color, min = 10)
  )
```

---

## Segmented bar plots, counts

```r
ggplot(data = starwars, mapping = aes(y = gender, fill = hair_color2)) +
  geom_bar()
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-45-1.png" width="65%" />

---

## Segmented bar plots, proportions

```r
ggplot(data = starwars, mapping = aes(y = gender, fill = hair_color2)) +
  geom_bar(position = "fill") +
  labs(x = "proportion")
```

<img src="02-visualize-data_files/figure-html/unnamed-chunk-46-1.png" width="65%" />

---

.discussion[
Which bar plot is a more useful representation for visualizing the relationship between gender and hair color?
]

.pull-left[
<img src="02-visualize-data_files/figure-html/unnamed-chunk-47-1.png" width="95%" />
]

.pull-right[
<img src="02-visualize-data_files/figure-html/unnamed-chunk-48-1.png" width="95%" />
]

---

.your-turn[
- Go to RStudio Cloud and start the second assignment: `02 - Visualize Data`
- Open the first R Markdown file: `02-why.visualize.Rmd`
- Knit the document
]
.footnote[ RStudio Cloud workspace for this bootcamp is at [rstd.io/bootcamper-cloud](https://rstd.io/bootcamper-cloud). ]