Instructions for working with the World Values Survey in RStudio

Working with the World Values Survey in RStudio [1]

The purpose of replication analysis is to determine whether a particular result gleaned from one sample of data can be reproduced with a second sample of similar data. If the results using the same variables from two different samples are similar, we can be more confident that a conclusion we have reached based on the data analysis is likely correct. However, if the results differ, it is necessary to assess critically where the problem lies. Most likely, it lies with one or both of the samples used in the analysis. For example, even though the researchers collecting the data may have tried to achieve a representative sample based on randomization, it is possible that one or both samples fall short of this important goal. When the ability to generate a true random sample is compromised, the sample is likely not representative of the population, and its results may not be replicable.

Furthermore, it is important to consider the time period in which data was collected for each sample; this is especially true for survey data. The seventh wave of the World Values Survey data for the United States was collected during 2017, while the Dataprac data was collected in the United States in 2019. The difference in time between the two datasets may influence the results to some degree. Importantly, how survey respondents thought about the issues asked in the survey may have differed between the two time periods.

Sample size also influences the results. The seventh wave of the World Values Survey for the United States includes nearly 2,600 observations (2,596), while the Dataprac has a little over 1,000. If both samples are representative, this difference alone makes the World Values Survey superior because the margin of error for any particular sample statistic is lower when the number of observations is higher. This makes the estimates from the larger sample more precise (and hence more likely to be significant if there is indeed a true relationship between variables in the population from which a sample was generated).
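To make the link between sample size and precision concrete, the short sketch below computes an approximate 95% margin of error for a sample proportion at the two sample sizes. It assumes simple random sampling and a worst-case proportion of 0.5; the function name margin_of_error and the rounded sample sizes are illustrative and are not part of either survey.

```r
# Approximate 95% margin of error for a sample proportion,
# assuming a simple random sample (an illustrative simplification).
margin_of_error <- function(n, p = 0.5, z = 1.96) {
  z * sqrt(p * (1 - p) / n)
}

# Rounded sample sizes taken from the text
moe_wvs <- margin_of_error(2600)  # World Values Survey, United States
moe_dp  <- margin_of_error(1000)  # Dataprac

# Express both as percentage points
round(100 * c(WVS = moe_wvs, Dataprac = moe_dp), 1)
```

Under these assumptions the margin of error is about 1.9 percentage points for the larger sample and about 3.1 for the smaller one.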

It is also possible that some of the questions and potential responses differ between the two surveys. While the overwhelming majority of the questions in the Dataprac and the World Values Survey are worded exactly alike, there are a few subtle differences. For example, the question “Which party would you vote for if there were a national election tomorrow?” is worded the same way in both surveys, but the possible responses differ slightly. Specifically, respondents for the World Values Survey were given additional options from which to choose (such as the Libertarian Party and the Green Party), while the respondents in the Dataprac were not. This slight discrepancy might also compromise the comparability of the responses. (Overall, however, the wording of the questions in the Dataprac and the World Values Survey is mostly the same.)

Finally, it is important to recognize that obtaining a ‘significant’ result from one sample while the result from the second sample is ‘insignificant’ does not necessarily mean that the significant result is more correct. Rather, it is important to weigh the issues described above to judge which dataset might be more representative and hence a more adequate reflection of the population from which the sample was derived.

Downloading the World Values Survey for replication analysis:

Go to worldvaluessurvey.org to find the following page.

Click on Data & Documentation on the left-hand side of the page.

From here click on Documentation/Downloads.

Click on Wave 7 (2017-2020).

From here, find the WVS-7 Master Questionnaire 2017-2020 English.pdf. This will display all the questions as they were asked in the survey. It will also show you how the responses to each question were coded in the dataset. Note that some responses that are unique to the United States (like ethnicity or which party would you vote for if there were a national election tomorrow) are not included in the generic questionnaire. See the bottom of this document for a list of the specific codes for these two particular variables.

After you study the questionnaire, continue to scroll down to the bottom of the page to find “Statistical Data Files” and click on the first file, “WVS Cross-National Wave 7 R v1 6 zip.”

Here you will need to enter your name and the reason you are downloading the data. You could enter “university coursework requirement” or something similar. Select “Academic Research Project” under Intended Use. Check the “I have read the ‘Conditions of Use’” box before clicking Download.

This will download a zipped file. Click “Extract All” to save the file on your computer. Before you open the extracted file, right-click on the file’s name and change the long file name to WVS (leaving it as an .rds file). This will make working with the dataset much easier. Now, double-click on the extracted file (not the zipped folder) and you will see a prompt to select a program. Scroll down to choose RStudio (since you already have it on your computer) and click OK. Now double-click on the WVS R file and then click OK when asked if you want to Load R Object.
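If double-clicking does not load the file, an .rds file can also be read from the Console with base R's readRDS function. The sketch below is self-contained: it writes a tiny made-up data frame to a temporary .rds file and reads it back, so the file path, the object names, and the values are placeholders rather than the real survey. For the actual data, you would give readRDS the location of your renamed WVS file instead.

```r
# Create a tiny stand-in data frame (made-up values, not real WVS data)
toy <- data.frame(B_COUNTRY = c(840, 840, 124), Q262 = c(29, 51, 34))

# Save it as an .rds file in a temporary location
toy_path <- file.path(tempdir(), "WVS.rds")
saveRDS(toy, toy_path)

# Read the .rds file back into a new object
WVS_toy <- readRDS(toy_path)

# Number of rows (observations) and columns (variables)
dim(WVS_toy)
```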

You now have the WVS survey loaded into RStudio. This file, however, contains every observation in the World Values Survey from every country (over 70,000 observations!). What we need to do now is filter the data so that only respondents from the United States remain.

The first step is to tell R to load the tidyverse library. (You already installed tidyverse when you worked with Chapter 7, so you will not need to install the package a second time, but you do need to call the library every time you reopen RStudio.)

library(tidyverse)

Next, create a duplicate of this dataset. Call the new dataset “US” to indicate that the data is for the United States.

US<-WVS

To filter out all other countries so that only the United States remains, type the following command. The number 840 represents the United States. The filter command will tell RStudio to utilize observations from country 840 only, i.e. the United States.

US<-WVS%>%filter(B_COUNTRY==840)

After you press Enter, you should see that a second dataset has been added to the Environment pane at the top right of your RStudio screen; the US dataset has 2,596 observations, representing all respondents in the United States.
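A quick way to confirm that a filter worked is to count the rows and tabulate the country code afterwards. The sketch below uses a small made-up data frame (WVS_toy) in place of the full survey; only the B_COUNTRY variable name and the code 840 come from the text, and every other value is invented for illustration.

```r
library(dplyr)  # dplyr is part of the tidyverse

# Made-up stand-in for the full survey: 840 = United States, other codes = other countries
WVS_toy <- data.frame(B_COUNTRY = c(840, 124, 840, 484, 840),
                      Q262      = c(22, 40, 35, 61, 58))

# Keep only respondents whose country code is 840
US_toy <- WVS_toy %>% filter(B_COUNTRY == 840)

nrow(US_toy)             # number of observations that remain
table(US_toy$B_COUNTRY)  # should list only the code 840
```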

Once you have the US dataset loaded, you should create a copy of the dataset that you can work with (just like you did with the Dataprac). Let’s call the copy US1. From this point forward, we will work with the US1 dataset. Type:

US1<-US

Next we need to limit the dataset to the appropriate research population. However, since you are no longer working with the Dataprac, you will need to know what the new variables are in the World Values Survey. Refer to the Excel file in the online compendium for a list of all the DP variables and their corresponding numbers in the WVS (denoted by the letter Q). The picture below is of the Excel file with the WVS and DP values displayed (and the individual codes for most of the variables, too).

To limit the dataset to people aged 35 and below, we need to know that age is now Q262 in the World Values Survey. To limit the dataset, use the following code:

US1<-filter(US1,Q262<=35)

You should see that US1 on the top right now has 1,048 observations; these are the respondents aged 35 and below. [2]
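Because filter keeps only the rows where the condition is TRUE, respondents with a missing age are dropped along with those older than 35. The sketch below illustrates this on a made-up data frame (the object names ages_toy and young_toy are hypothetical) and shows two simple checks you could run after filtering.

```r
library(dplyr)  # dplyr is part of the tidyverse

# Made-up ages, including one missing value
ages_toy <- data.frame(Q262 = c(18, 35, 36, NA, 27))

# Keep respondents aged 35 and below; the NA row is dropped as well
young_toy <- filter(ages_toy, Q262 <= 35)

nrow(young_toy)       # number of respondents kept
max(young_toy$Q262)   # oldest remaining respondent should be 35 or younger
```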

From here you can use the tutorials from Chapters 7, 8, and 9 to complete your replication analysis.

Remember that you are working with three variables in the dataset: one dependent variable and two independent variables. For example, let’s work with social status as the first independent variable (DP9 – now Q287), the left/right political spectrum (DP70 – now Q240) as the second independent variable, and views on corruption (DP56 – now Q112) as the dependent variable. Since social status is a nominal-level variable with more than two categories, we need to transform it into a binary as we create the new variable.

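To create a binary variable for social status, we can use the same recode command from Chapter 7, transforming codes 1, 2, and 3 to 0 and codes 4 and 5 to 1. Check the codes in the questionnaire before you interpret the result: in the WVS-7 master questionnaire, Q287 runs from 1 (upper class) to 5 (lower class), so with this recode 0 marks the upper and middle classes and 1 marks the working and lower classes. Be sure to give the variable a new name, like socialstatusbinary. After you recode the variable, use the cbind command to add it to the dataset US1.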

socialstatusbinary<-recode(US1$Q287,'1'=0,'2'=0,'3'=0,'4'=1,'5'=1)

US1<-cbind(US1,socialstatusbinary)
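As one possible alternative to recode followed by cbind, the same binary variable can be created in a single step with mutate and if_else from the tidyverse. This is a sketch on made-up data (the data frame status_toy is hypothetical), not the method used in the text; it reproduces the mapping above (values 1 to 3 become 0, values 4 and 5 become 1) and leaves missing values as missing.

```r
library(dplyr)  # dplyr is part of the tidyverse

# Made-up social status codes, including one missing value
status_toy <- data.frame(Q287 = c(1, 2, 3, 4, 5, NA))

# Create the binary variable and attach it to the data frame in one step
status_toy <- status_toy %>%
  mutate(socialstatusbinary = if_else(Q287 <= 3, 0, 1))

# Cross-tabulate old and new codes to confirm the mapping
table(status_toy$Q287, status_toy$socialstatusbinary, useNA = "ifany")
```

The cross-tabulation is a useful habit after any recode, since it shows at a glance which original codes ended up in which new category.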

For the frequency distribution for your nominal-level variables (like the binary variable), you may opt to use an alternate command, which will provide the valid percentages of each category in the variable, rather than the actual percentages for each value. Remember from Chapter 7 that the valid percentage removes missing values. (This was not an issue in the Dataprac since the Dataprac had no missing values.) Below is the new code to display the frequency distribution with valid percentages:

prop.table(table(US1$socialstatusbinary))

This will display the valid proportions for each of the two categories in the social status binary variable, removing any missing observations. Multiply a proportion by 100 to express it as a valid percentage.
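The sketch below shows one way to turn the output into rounded percentages and to see how many observations are missing. It uses a made-up binary vector (binary_toy) rather than the survey data, so the numbers are purely illustrative.

```r
# Made-up binary variable with two missing values
binary_toy <- c(0, 0, 1, NA, 1, 0, NA, 0)

# Valid percentages: table() drops missing values by default
valid_pct <- round(100 * prop.table(table(binary_toy)), 1)
valid_pct

# Counts including the missing values, for comparison
table(binary_toy, useNA = "ifany")
```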

For summary statistics, use the following code. Note that you must include an additional line of (different) code to generate the standard deviation for variables in the World Values Survey because of the missing values. The first pair of commands generates summary statistics for the left/right spectrum (Q240) and the second pair generates summary statistics for views on corruption (Q112).

summary(US1$Q240)

sd(US1$Q240,na.rm=TRUE)

summary(US1$Q112)

sd(US1$Q112,na.rm=TRUE)
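To see why na.rm=TRUE is needed, the sketch below runs sd on a made-up vector (scores_toy) that contains one missing value. Without the argument the result is NA; with it, the missing value is set aside before the calculation.

```r
# Made-up scores with one missing value
scores_toy <- c(1, 5, 10, NA, 4)

sd(scores_toy)                            # returns NA because of the missing value
sd_valid <- sd(scores_toy, na.rm = TRUE)  # standard deviation of the four valid scores
sd_valid

sum(is.na(scores_toy))                    # how many values are missing
```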

For the difference in means test (using social status (socialstatusbinary) as the binary independent variable and views on corruption (Q112) as the dependent variable):

t.test(filter(US1,socialstatusbinary==0)$Q112,filter(US1,socialstatusbinary==1)$Q112)
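The same test can also be written with R's formula interface, which some readers find easier to read. The sketch below is an alternative on made-up data (the data frame ttest_toy is hypothetical), not the command used in the text; it places the dependent variable on the left of the tilde and the binary grouping variable on the right.

```r
# Made-up data: a binary grouping variable and an outcome
ttest_toy <- data.frame(
  socialstatusbinary = c(0, 0, 0, 0, 1, 1, 1, 1),
  Q112               = c(7, 8, 9, 8, 5, 6, 4, 6)
)

# Difference in means test using the formula interface
result <- t.test(Q112 ~ socialstatusbinary, data = ttest_toy)

result$estimate  # mean of Q112 in each group
result$p.value   # p-value for the difference in means
```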

For the correlation coefficient (using the left/right ideological spectrum as the independent variable and views on corruption as the dependent variable):

cor.test(US1$Q240,US1$Q112)
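When either variable has missing values, cor.test uses only the pairs in which both values are present. The sketch below illustrates this on two made-up vectors (x_toy and y_toy); the number of pairs used can be recovered from the degrees of freedom, which equal the number of pairs minus two.

```r
# Made-up vectors, each with one missing value in a different position
x_toy <- c(1, 3, 5, 7, NA, 9)
y_toy <- c(2, 3, 6, 6, 8, NA)

res <- cor.test(x_toy, y_toy)

res$estimate                                # the correlation coefficient
pairs_used <- unname(res$parameter) + 2     # degrees of freedom + 2 = complete pairs
pairs_used
```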

For regression, the code for the two single-variable regressions and the multivariate regression is as follows. For each model, you must create the model, generate a summary for it, and then produce the confidence interval for each coefficient:

Model1<-lm(US1$Q112~US1$socialstatusbinary)

summary(Model1)

confint(Model1)

To get the number of observations used in the model, take the total number of observations in the dataset, as seen in the top right corner by the US1 dataset and then subtract the number of “observations deleted due to missingness” from the summary of Model 1. This manipulation is necessary because of the missing observations present in the World Values Survey. The “dim” function we used in Chapter 9 for the Dataprac will not work here because dim will list all observations in the dataset (including missing ones).
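As a cross-check on the subtraction described above, base R's nobs function reports how many observations a fitted model actually used. The sketch below demonstrates it on a made-up data frame (reg_toy) with missing values in both variables; this is offered as an optional check, not as the procedure from the text.

```r
# Made-up data with one missing value in each variable
reg_toy <- data.frame(
  Q112               = c(7, 8, NA, 5, 6, 9),
  socialstatusbinary = c(0, 0, 1, 1, NA, 0)
)

# lm() drops rows with a missing value in any model variable
toy_model <- lm(Q112 ~ socialstatusbinary, data = reg_toy)

nrow(reg_toy)    # all rows in the dataset, including incomplete ones
nobs(toy_model)  # rows actually used to fit the model
```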

Model2<-lm(US1$Q112~US1$Q240)

summary(Model2)

confint(Model2)

Refer to the instructions above to obtain the number of observations in the analysis.

Model3<-lm(US1$Q112~US1$socialstatusbinary+US1$Q240)

summary(Model3)

confint(Model3)

Refer to the instructions above to obtain the number of observations in the analysis.

Be advised that RStudio and tidyverse can be tricky. Go slowly as you work with the data.

A few tips:

If RStudio automatically brings up your dataset or variables when you begin typing the dataset or variable names, use your mouse to click on the appropriate selection in the pull down menu, rather than typing all the values out yourself. One small mistake in your typing will result in an error message.

Keep in mind that there are many ways to perform manipulations in the different ‘packages’ in R; your instructor may have different ideas for how to do these same manipulations using a different package. R is continually evolving and requires the user to be aware of the most recent updates and additions. Also, consider using some of the ‘cheat sheets’ that you can find online to help you do more with R.

Special codes in the World Values Survey [3]

Q290: Ethnic group

840001= US: White, non-Hispanic

840002= US: Black, non-Hispanic

840003= US: Other, non-Hispanic

840004= US: Hispanic

840005= US: Two plus, non-Hispanic

DP14/Q223: Which party would you vote for if there were a national election tomorrow?

840001= USA: Republican Party

840002= USA: Democratic Party

840004= USA: Libertarian

840006= USA: Green Party of the United States

1 Note that there are many ways to perform basic manipulations in RStudio. What is presented here is what the author of the text considers the most basic way to work with the World Values Survey in RStudio. It is possible, however, that your instructor has a different way to perform these manipulations.

2 If the filter command doesn’t work (due to conflicting packages), try the following code:

US1<-dplyr::filter(US1, Q262<=35)

3 The World Values Survey utilizes these special country-specific codes for certain variables that were reproduced in the Dataprac: ethnicity and which party would you vote for if there were a national election tomorrow. Note that these are both nominal-level variables that would require transformation into a binary variable for use as an independent variable. The country code for the United States in the World Values Survey is 840, thus every special country-specific code begins with this number.