Step 1: Download the practice dataset (Data for regression example) from the OUP website

The dataset used for the running example in the chapter (the professor who wanted to understand which independent variables influence students’ test grades) is online as an Excel file (with the exciting name “Data for regression example”). Open RStudio and make sure the tidyverse is loaded (see the instructions from Chapter 7). Every time you open RStudio, you will need to type library(tidyverse) to tell RStudio that you would like to use the tidyverse code. R is case-sensitive, so the package name must be typed in lowercase. Then download the regression example dataset from OUP’s website and open it in RStudio (DataforRegressionExampleRStudio). Once it is open, you should see a small dataset with 30 observations and 4 variables (Grade on Test, Hours Studied, Interest in Political Science, and Attendance).

Step 2: Running single variable regression equations

Let’s begin by following the example in Chapter 9. The professor would like to know how study time (measured in hours) influences students’ grades on the test. To perform a single variable regression, you first need to create a Model in RStudio; we’ll call it Model1. To do this, type the following into RStudio (don’t forget the tilde (~) in between the two variables) and then press enter:

Model1<-lm(DataforRegressionExampleRStudio$GradeonTest~DataforRegressionExampleRStudio$HoursStudied)

You have now told RStudio that Model1 is a linear model (lm) with one dependent variable (the first variable, Grade on Test) and one independent variable (the second variable, Hours Studied). Next, tell RStudio to display the results. To do this, type the following and press enter (make sure the ‘s’ in summary is lowercase).

summary(Model1)

These results provide most of the information you will need about this first regression model. First, notice the estimate of the constant (called Intercept in R). Here we see that the constant is 63.2. This means that studying 0 hours is associated with a predicted grade of 63 points.

Now consider the 3 S’s (sign, size, and significance) of the coefficient for Hours Studied. The coefficient itself is found under the constant (intercept): here we see that the coefficient for Hours Studied is +2.02. The sign is positive (meaning that as hours studied increase, grades also increase) and the size is a little over two points, meaning that for each additional hour studied, a student’s grade is expected to increase by about two points. To determine whether this coefficient is significant, look across the Hours Studied row (the last line of the coefficients table) to the final column, Pr(>|t|), which is the p value used to decide whether the coefficient is significantly different from zero. Here R reports the p value in scientific notation as 9.78e-09, that is, 0.00000000978, which rounds to .000. Since this value is below the threshold of .05, we can reject the null hypothesis that the true coefficient is zero in the population from which this sample was taken.
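If you prefer to read the sign, size, and significance directly rather than scanning the printed output, the coefficient table can be pulled out of the summary. The following is a minimal sketch, not the chapter's own procedure. Because the OUP file is not reproduced here, it uses a simulated stand-in dataset (the name exam and every value in it are assumptions), so its numbers will not match the 2.02 reported above.

```r
# Simulated stand-in for the OUP dataset (an assumption; values are made up).
set.seed(1)
exam <- data.frame(HoursStudied = round(runif(30, 0, 15)))
exam$GradeonTest <- 63 + 2 * exam$HoursStudied + rnorm(30, sd = 7.5)

# data = exam lets the formula use bare column names instead of exam$...
fit <- lm(GradeonTest ~ HoursStudied, data = exam)

# coef(summary()) returns the coefficient table as a matrix.
coef_table <- coef(summary(fit))
print(coef_table)

# Sign and size: the estimate for Hours Studied.
hours_estimate <- coef_table["HoursStudied", "Estimate"]

# Significance: the p value, compared with the .05 threshold.
hours_p <- coef_table["HoursStudied", "Pr(>|t|)"]
print(hours_p < 0.05)
```

The column names "Estimate", "Std. Error", "t value", and "Pr(>|t|)" are the same headings you see in the printed summary.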

There are other important pieces of information in these tables as well. First, when you create your table, you will need the standard error for each coefficient, which is located next to the independent variable’s coefficient value. In this example, for Hours Studied, we see that the standard error is .252. The next important piece of information is the Adjusted R Squared, which is on the bottom right of the second-to-last line of the output. Here we see that it is .686, which means that Hours Studied explains about 69% of the variation in Grade on Test. Finally, you will need the number of observations (the ‘n’) included in the analysis. To obtain the number of observations in the dataset in R, type the following and press enter:

dim(DataforRegressionExampleRStudio)

The dim (dimension) command displays the number of observations and the number of variables in a dataset. From this, we see that the Data for Regression Example dataset has 30 observations (students) and 4 variables. Thus, the number of observations for this analysis is 30.
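Note that dim counts the rows in the dataset, while lm silently drops any row with a missing value in a variable used by the model. In a dataset with missing values, the two counts can differ, and the base R function nobs reports the number actually used. The sketch below illustrates this with a simulated stand-in dataset in which two values are deliberately set to missing (both the data and the missing values are assumptions made for illustration, not a feature of the OUP file).

```r
# Simulated stand-in for the OUP dataset (an assumption; values are made up).
set.seed(1)
exam <- data.frame(HoursStudied = round(runif(30, 0, 15)))
exam$GradeonTest <- 63 + 2 * exam$HoursStudied + rnorm(30, sd = 7.5)

# Deliberately blank out two values to imitate missing data.
exam$HoursStudied[c(3, 17)] <- NA

fit <- lm(GradeonTest ~ HoursStudied, data = exam)

# Rows and columns in the dataset.
print(dim(exam))

# Observations actually used to estimate the model.
print(nobs(fit))
```

With no missing values, as in the 30-student example, the two numbers agree.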

There are other statistics you may wish to review that may not be included in your final table. See the chapter for a description of these statistics. First is the standard error of the estimate, which appears in the summary(Model1) output; in RStudio this is called the “Residual standard error,” which we see from the results is 7.547.
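To see where the residual standard error comes from, it can be recomputed by hand as the square root of the sum of squared residuals divided by the residual degrees of freedom. The sketch below does this on a simulated stand-in dataset (an assumption, so the result will not be 7.547) and compares it with the value R stores, which the base R function sigma returns.

```r
# Simulated stand-in for the OUP dataset (an assumption; values are made up).
set.seed(1)
exam <- data.frame(HoursStudied = round(runif(30, 0, 15)))
exam$GradeonTest <- 63 + 2 * exam$HoursStudied + rnorm(30, sd = 7.5)

fit <- lm(GradeonTest ~ HoursStudied, data = exam)

# Residual standard error as reported in the summary output.
reported_rse <- sigma(fit)

# The same quantity by hand: sqrt(sum of squared residuals / residual df).
# With 30 observations and 2 estimated coefficients, the residual df is 28.
manual_rse <- sqrt(sum(residuals(fit)^2) / df.residual(fit))

print(c(reported = reported_rse, manual = manual_rse))
```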

Next is the confidence interval for each coefficient. To obtain the confidence interval for each coefficient used in the analysis, type the following and press enter:

confint(Model1)

This displays the 95% confidence interval for the Hours Studied coefficient; here we see that the interval is between 1.5 and 2.5.
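The interval is the coefficient plus or minus a critical t value times the standard error. With the figures reported above (coefficient 2.02, standard error .252, and 30 - 2 = 28 degrees of freedom, for which the critical t value is roughly 2.05), this gives about 2.02 ± 0.52, consistent with the 1.5 to 2.5 shown by confint. The sketch below carries out the same calculation on a simulated stand-in dataset (an assumption, so its interval will differ) and checks it against confint.

```r
# Simulated stand-in for the OUP dataset (an assumption; values are made up).
set.seed(1)
exam <- data.frame(HoursStudied = round(runif(30, 0, 15)))
exam$GradeonTest <- 63 + 2 * exam$HoursStudied + rnorm(30, sd = 7.5)

fit <- lm(GradeonTest ~ HoursStudied, data = exam)
coef_table <- coef(summary(fit))

estimate  <- coef_table["HoursStudied", "Estimate"]
std_error <- coef_table["HoursStudied", "Std. Error"]

# Critical t value for a 95% interval with the model's residual df.
t_crit <- qt(0.975, df = df.residual(fit))

manual_ci <- c(lower = estimate - t_crit * std_error,
               upper = estimate + t_crit * std_error)
print(manual_ci)

# confint gives the same interval; level = 0.95 is its default.
builtin_ci <- confint(fit, "HoursStudied", level = 0.95)
print(builtin_ci)
```

Changing level (for example, level = 0.99) widens or narrows the interval accordingly.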

Note that the standard error of the estimate and each coefficient’s confidence interval are not usually reported when researchers create new regression tables like the one presented in the Paper Progress section of Chapter 9 for their papers. See the Paper Progress section of Chapter 9 to see which pieces of information you will need for the new regression table you will create based on the statistics in these tables.
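If you would like RStudio to gather the pieces for your table in one place, something like the following may help. It is only a sketch on a simulated stand-in dataset (an assumption), and the names reg_table and model_stats are hypothetical; the layout of your final table should follow the Paper Progress section, not this output.

```r
# Simulated stand-in for the OUP dataset (an assumption; values are made up).
set.seed(1)
exam <- data.frame(HoursStudied = round(runif(30, 0, 15)))
exam$GradeonTest <- 63 + 2 * exam$HoursStudied + rnorm(30, sd = 7.5)

fit <- lm(GradeonTest ~ HoursStudied, data = exam)
coefs <- coef(summary(fit))

# One row per term: coefficient, standard error, and p value.
reg_table <- data.frame(
  term        = rownames(coefs),
  coefficient = round(coefs[, "Estimate"], 2),
  std_error   = round(coefs[, "Std. Error"], 3),
  p_value     = signif(coefs[, "Pr(>|t|)"], 3),
  row.names   = NULL
)
print(reg_table)

# Model-level figures: Adjusted R Squared and the number of observations.
model_stats <- c(adj_r_squared = round(summary(fit)$adj.r.squared, 3),
                 n = nobs(fit))
print(model_stats)
```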

Now the professor wants to switch independent variables and include Interest in Political Science as the independent variable. To do this in RStudio, create a new model (Model 2) by typing the following commands, pressing enter after each one:

Model2<- lm(DataforRegressionExampleRStudio$GradeonTest~DataforRegressionExampleRStudio$InterestinPoliticalScience)

summary(Model2)

confint(Model2)

We already know that the number of observations in the Data for Regression Example dataset is 30.

Finally, for the last variable in the professor’s example (attendance), you simply repeat the steps from above to create a new model (Model3) for the Attendance variable and create a summary for it. Below we also generated the confidence interval for the Attendance coefficient. Type the following and press enter after each command:

Model3<-lm(DataforRegressionExampleRStudio$GradeonTest~DataforRegressionExampleRStudio$Attendance)

summary(Model3)

confint(Model3)

Now, to run a regression with several independent variables at the same time, you simply add the independent variables to a new model. Below we create Model4. Notice the plus signs (+) in between the independent variables. You can also generate the confidence interval for each coefficient with the confint command.

Model4<-lm(DataforRegressionExampleRStudio$GradeonTest~DataforRegressionExampleRStudio$HoursStudied+DataforRegressionExampleRStudio$InterestinPoliticalScience+DataforRegressionExampleRStudio$Attendance)

summary(Model4)

confint(Model4)

See Chapter 9 for a description of what these statistics mean.
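Typing the dataset name before every variable makes the Model4 command long. The lm function also accepts a data argument, which lets you name the dataset once and use the bare variable names in the formula. The sketch below compares the two forms on a simulated stand-in dataset (the name exam, the values, and the assumed scales for interest and attendance are all assumptions). The estimates are identical; only the labels in the output change.

```r
# Simulated stand-in for the OUP dataset (an assumption; values and scales are made up).
set.seed(1)
exam <- data.frame(
  HoursStudied               = round(runif(30, 0, 15)),
  InterestinPoliticalScience = sample(1:5, 30, replace = TRUE),
  Attendance                 = sample(0:10, 30, replace = TRUE)
)
exam$GradeonTest <- 55 + 2 * exam$HoursStudied +
  1.5 * exam$InterestinPoliticalScience + 0.5 * exam$Attendance +
  rnorm(30, sd = 7)

# The style used in this guide: dataset$variable for every term.
model_long <- lm(exam$GradeonTest ~ exam$HoursStudied +
                   exam$InterestinPoliticalScience + exam$Attendance)

# The shorter style: name the dataset once with data =.
model_short <- lm(GradeonTest ~ HoursStudied + InterestinPoliticalScience +
                    Attendance, data = exam)

print(coef(model_long))
print(coef(model_short))
```

Either style works with summary and confint in the same way.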

Now that we have worked with the professor’s example, let’s turn to the DatapracRStudio dataset. First, import the DatapracRStudio dataset into RStudio. Keep in mind that the following examples do not limit the dataset to any particular research population.

You can perform the same steps to generate statistics for linear regression equations using the variables in the DatapracRStudio dataset. Let’s work with the example from the text, using a binary variable for education and the number of children (DP8) as independent variables to explain differences in democratic satisfaction (DP71), the dependent variable.

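First, you will need to create a binary variable for education from the education variable (DP5). For this new variable, the values of 1, 2 and 3 will become one category (0 = less than a 2-year college degree) and 4, 5, 6, 7 and 8 will become a second category (1 = 2-year degree or more education). The recode command used here comes from the tidyverse, so make sure the tidyverse is loaded first.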

Educationbinary<-recode(DatapracRStudio$DP5,'1'=0,'2'=0,'3'=0,'4'=1,'5'=1,'6'=1,'7'=1,'8'=1)

Now add this new variable to the dataset using the cbind command:

DatapracRStudio<-cbind(DatapracRStudio,Educationbinary)
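It is worth checking that the new variable was coded as intended. The sketch below shows one way to do so, along with a base R alternative to recode that relies on the same cut point (4 and above becomes 1). It uses a small made-up data frame called survey in place of DatapracRStudio (an assumption, including the 1 to 8 coding taken from the description above and one deliberately missing value).

```r
# Made-up stand-in for DatapracRStudio (an assumption): DP5 coded 1 to 8,
# plus one missing value to show how it is handled.
survey <- data.frame(DP5 = c(1, 2, 3, 4, 5, 6, 7, 8, NA))

# Base R alternative to recode: 4 or higher becomes 1, lower becomes 0.
# Missing values stay missing.
survey$Educationbinary <- ifelse(survey$DP5 >= 4, 1, 0)

# Cross-tabulate the original and new variables to check the coding.
check <- table(survey$DP5, survey$Educationbinary, useNA = "ifany")
print(check)
```

In the cross-tabulation, values 1 to 3 should fall only in the 0 column and values 4 to 8 only in the 1 column.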

Next, create a model for the first single variable regression. (Since we used numbers for the previous examples, we will use letters for these models to keep them separate; you can give the models any name you like.)

ModelA<-lm(DatapracRStudio$DP71~DatapracRStudio$Educationbinary)

summary(ModelA)

confint(ModelA)

Continue by creating new models and using the summary and confint commands to generate the regression results.

ModelB<-lm(DatapracRStudio$DP71~DatapracRStudio$DP8)

summary(ModelB)

confint(ModelB)

ModelC<-lm(DatapracRStudio$DP71~DatapracRStudio$Educationbinary+DatapracRStudio$DP8)

summary(ModelC)

confint(ModelC)
