HW 08 - Exploring the GSS

Photo by Mauro Mora on Unsplash Photo by Mauro Mora on Unsplash

The GSS gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviours, and attributes. Hundreds of trends have been tracked since 1972. In addition, since the GSS adopted questions from earlier surveys, trends can be followed for up to 70 years.

The GSS contains a standard core of demographic, behavioural, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events.

In this assignment we analyze data from the 2016 GSS, using it to estimate values of population parameters of interest about US adults.1 Smith, Tom W, Peter Marsden, Michael Hout, and Jibum Kim. General Social Surveys, 1972-2016 [machine-readable data file] /Principal Investigator, Tom W. Smith; Co-Principal Investigator, Peter V. Marsden; Co-Principal Investigator, Michael Hout; Sponsored by National Science Foundation. -NORC ed.- Chicago: NORC at the University of Chicago [producer and distributor]. Data accessed from the GSS Data Explorer website at gssdataexplorer.norc.org.

Getting started

Go to the course GitHub organization and locate your homework repo, clone it in RStudio and open the R Markdown document. Knit the document to make sure it compiles without errors.

Warm up

Before we introduce the data, let’s warm up with some simple exercises. Update the YAML of your R Markdown file with your information, knit, commit, and push your changes. Make sure to commit with a meaningful commit message. Then, go to your repo on GitHub and confirm that your changes are visible in your Rmd and md files. If anything is missing, commit and push again.

Packages

We’ll use the tidyverse package for much of the data wrangling and visualisation and the data lives in the dsbox package. These packages are already installed for you. You can load them by running the following in your Console:

library(tidyverse)
library(dsbox)

Data

The data can be found in the dsbox package, and it’s called gss16. Since the dataset is distributed with the package, we don’t need to load it separately; it becomes available to us when we load the package. You can find out more about the dataset by inspecting its documentation, which you can access by running ?gss16 in the Console or using the Help menu in RStudio to search for gss16. You can also find this information here.

Exercises

Part 1: Harassment at work

In 2016, the GSS added a new question on harassment at work. The question is phrased as the following.

Over the past five years, have you been harassed by your superiors or co-workers at your job, for example, have you experienced any bullying, physical or psychological abuse?

Answers to this question are stored in the harass5 variable in our dataset.

  1. What are the possible responses to this question and how many respondents chose each of these answers?

  2. What percent of the respondents for whom this question is applicable
    (i.e. excluding NAs and Does not applys) have been harassed by their superiors or co-workers at their job.

🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

Part 2: Time spent on email

The 2016 GSS also asked respondents how many hours and minutes they spend on email weekly. The responses to these questions are recorded in the emailhr and emailmin variables. For example, if the response is 2.5 hrs, this would be recorded as emailhr = 2 and emailmin = 30.

  1. Create a new variable called email that combines these two variables to reports the number of minutes the respondents spend on email weekly.

  2. Visualize the distribution of this new variable. Find the mean and the median number of minutes respondents spend on email weekly. Is the mean or the median a better measure of the typical among of time Americans spend on email weekly? Why?

  3. Create another new variable, snap_insta that is coded as “Yes” if the respondent reported using any of Snapchat (snapchat) or Instagram (instagrm), and “No” if not. If the recorded value was NA for both of these questions, the value in your new variable should also be NA.

  4. Calculate the percentage of Yes’s for snap_insta among those who answered the question, i.e. excluding NAs.

  5. What are the possible responses to the question Last week were you working full time, part time, going to school, keeping house, or what? and how many respondents chose each of these answers? Note that this information is stored in the wrkstat variable.

  6. Fit a model predicting email (number of minutes per week spent on email) from educ (number of years of education), wrkstat, and snap_insta. Interpret the slopes for each of these variables.

  7. Create a predicted values vs. residuals plot for this model. Are there any issues with the model? If yes, describe them.

🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.

Part 3: Political views and science research

The 2016 GSS also asked respondents whether they think of themselves as liberal or conservative (polviews) and whether they think science research is necessary and should be supported by the federal government (advfront).

Even if it brings no immediate benefits, scientific research that advances the frontiers of knowledge is necessary and should be supported by the federal government.

And possible responses to this question are Strongly agree, Agree, Disagree, Strongly disagree, Don’t know, No answer, Not applicable.

We hear a lot of talk these days about liberals and conservatives. I’m going to show you a seven-point scale on which the political views that people might hold are arranged from extremely liberal–point 1–to extremely conservative–point 7. Where would you place yourself on this scale?

Note: The levels of this variables are spelled inconsistently: “Extremely liberal” vs. “Extrmly conservative”. Since this is the spelling that shows up in the data, you need to make sure this is how you spell the levels in your code.

And possible responses to this question are Extremely liberal, Liberal, Slightly liberal, Moderate, Slghtly conservative, Conservative, Extrmly conservative. Responses that were originally Don’t know, No answer and Not applicable are already mapped to NAs upon data import.

  1. In a new variable, recode advfront such that Strongly Agree and Agree are mapped to "Yes", and Disagree and Strongly disagree are mapped to "No". The remaining levels can be left as is. Don’t overwrite the existing advfront, instead pick a different, informative name for your new variable.

  2. In a new variable, recode polviews such that Extremely liberal, Liberal, and Slightly liberal, are mapped to "Liberal", and Slghtly conservative, Conservative, and Extrmly conservative disagree are mapped to "Conservative". The remaining levels can be left as is. Make sure that the levels are in a reasonable order. Don’t overwrite the existing polviews, instead pick a different, informative name for your new variable.

  3. Create a visualization that displays the relationship between these two new variables and interpret it.

🧶 ✅ ⬆️ Knit, commit, and push your changes to GitHub with an appropriate commit message. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards and review the md document on GitHub to make sure you’re happy with the final state of your work.