Chapter 9 Stata tutorial

Step 1: Download the practice dataset (Data for regression example) from the OUP website

The dataset that was used for the running example in the chapter (for the professor who wanted to understand which independent variables influence students’ test grades) is online as a .dta file (it has the exciting name “Data for regression example”). Open Stata and then download and open this file (DataforRegressionExampleStata). Once opened, you should see a small dataset with 30 observations and 4 variables (Grade on Test, Hours Studied, Interest in Political Science, and Attendance).
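If you prefer typing commands to using the menus, the opening step above can be sketched as follows. This is a minimal sketch that assumes the downloaded file was saved in Stata's current working directory under the name DataforRegressionExampleStata.dta; adjust the path if you saved it elsewhere.

```stata
* Assumption: the file sits in the current working directory under this name
use "DataforRegressionExampleStata.dta", clear

* List the variables and the number of observations (you should see 30 and 4)
describe

* Show the mean, standard deviation, minimum, and maximum of each variable
summarize
```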

Step 2: Running single variable regression equations

We’ll begin by following the examples in Chapter 9. The professor would like to know how study time (measured in hours) influences students’ grades on the test. To perform a single variable regression in Stata, type the following and press enter (notice that the dependent variable is listed first, followed by the single independent variable):

regress GradeonTest HoursStudied

This will produce the following results:

These results provide crucial information about the regression model. First, notice the constant in the bottom line of the results (_cons). It is 63.2. This means that studying 0 hours is associated with a predicted grade of 63 points.

Now consider the 3 S’s (sign, size, and significance) of the coefficient for hours studied. The coefficient (Coef.) for Hours Studied is found directly above the constant: here it is +2.02. The sign is positive (meaning that as hours studied increases, grades also increase) and the size is a little over two points, meaning that for each additional hour studied, a student’s grade is expected to increase by about two points. To determine whether this coefficient is significant, look across the Hours Studied row of the coefficient table to the P>|t| column. This is the p value for the Hours Studied coefficient. Here we see that the p value is .000 (Stata rounds to three decimal places, so the value is smaller than .0005, not exactly zero). Since this value is below the threshold of .05, the null hypothesis that the true coefficient in the population from which this sample was taken is zero can be rejected.
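To see how the constant and the coefficient combine into a predicted grade, you can ask Stata to do the arithmetic. The sketch below assumes the practice dataset is loaded; the value of 5 hours is an arbitrary illustration, not a number from the chapter.

```stata
* Assumes the practice dataset is already open
regress GradeonTest HoursStudied

* Predicted grade for a student who studied 0 hours (this is the constant)
display _b[_cons]

* Predicted grade for a student who studied 5 hours (5 is an arbitrary example)
display _b[_cons] + 5*_b[HoursStudied]
```

With the rounded values reported above, the second calculation is 63.2 + (2.02 × 5), or roughly 73 points.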

There are other important pieces of information in these results as well. First, when you create your table, you will need the standard error for each coefficient, which is located next to the independent variable’s coefficient value. In this example, the standard error for Hours Studied is .252. Next is the number of observations included in the analysis. This information is at the top right of the results (Number of obs), where we see that 30 students were represented in this regression model. The next important piece of information is the Adjusted R Squared (.686), which is four rows below the number of observations. See the chapter for what these statistics represent. You will need to report these specific numbers when you create your regression table. You can also find the standard error of the estimate (which Stata calls the Root MSE, here 7.54) directly below the Adjusted R Squared, and the confidence interval for the Hours Studied coefficient at the right-hand end of its row, but these statistics are not usually reported when researchers create regression tables for their papers. See the Paper Progress section of Chapter 9 to see which pieces of information you will need for the regression table you will create based on these results.
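The numbers you need for your regression table can also be pulled out of Stata directly instead of being read off the screen. The following is a minimal sketch, assuming the practice dataset is loaded; the text labels inside the quotation marks are only illustrative.

```stata
* Re-run the model without printing the full output
quietly regress GradeonTest HoursStudied

* Coefficient and standard error for Hours Studied
display "Coefficient:    " %6.3f _b[HoursStudied]
display "Standard error: " %6.3f _se[HoursStudied]

* Model-level statistics stored by regress
display "Observations:   " e(N)
display "Adj. R-squared: " %5.3f e(r2_a)
display "Root MSE:       " %5.2f e(rmse)
```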

Now the professor wants to switch independent variables and include Interest in Political Science as the independent variable. To do this in Stata, type the following and then press enter:

regress GradeonTest InterestinPoliticalScience

This produces the following results:

For the last variable in the professor’s example (attendance), repeat the step above with Attendance as the independent variable. Type the following and press enter:

regress GradeonTest Attendance

Now, to estimate a regression equation with several independent variables at the same time, simply list all the independent variables after the dependent variable. Type the following and then press enter:

regress GradeonTest HoursStudied InterestinPoliticalScience Attendance
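If you would like to see the four models side by side before building your regression table, one option is to store each set of results and then display them together. This is a sketch, not the chapter's required procedure; the names m1 through m4 are hypothetical labels chosen for this example.

```stata
* Assumes the practice dataset is already open; m1-m4 are hypothetical names
quietly regress GradeonTest HoursStudied
estimates store m1

quietly regress GradeonTest InterestinPoliticalScience
estimates store m2

quietly regress GradeonTest Attendance
estimates store m3

quietly regress GradeonTest HoursStudied InterestinPoliticalScience Attendance
estimates store m4

* Coefficients with standard errors beneath them, plus N and Adjusted R Squared
estimates table m1 m2 m3 m4, b(%9.2f) se(%9.3f) stats(N r2_a)
```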

See Chapter 9 for a description of what these statistics mean.

Now that we have worked with the professor’s example, let’s turn to the DatapracStata. You can perform the same steps to generate statistics for linear regression equations using the variables in the DatapracStata. Let’s work with the example from the text, using education (as a binary variable) and the number of children (DP8) as independent variables to explain differences in democratic satisfaction (DP71), the dependent variable. Be sure that education is a binary variable: the values 1, 2, and 3 of DP5 are combined into a single category coded 0 (respondents who have not completed a two-year college degree), and the values 4, 5, 6, 7, and 8 are combined into a second category coded 1 (respondents who have completed at least a two-year college degree). To create this variable, type the following commands and press enter after each one:

generate educationbinary = DP5

recode educationbinary (1=0)(2=0)(3=0)(4=1)(5=1)(6=1)(7=1)(8=1)
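Before running the regressions, it is worth confirming that the recode did what you intended. The check below is a sketch that assumes the two commands above have already been run; it cross-tabulates the original variable against the new one.

```stata
* Each original value 1-3 should fall under 0, and each value 4-8 under 1
* The missing option also shows any missing values, which recode leaves unchanged
tabulate DP5 educationbinary, missing
```

Any original value that appears in a column other than 0 or 1 was not covered by the recode rules and should be examined before you continue.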

Now that you have the education binary variable, type the following commands and then press enter after each one:

regress DP71 educationbinary

regress DP71 DP9

regress DP71 educationbinary DP9

See Chapter 9 for a description of the results from these tests.