Final Project Information
The final project consists of conducting data analysis in the pursuit of answering a research question. These reports are typically in the 5 to 15 pages range, some of which consists of tables and plots. As we get closer to the due dates for the final report, we will provide a rubric we will use for grading along with a template for producing the reports. Below, we outline the steps involved in this project.
Step 1: Finding a data source
The biggest part of the final project in Gov 51 is finding a data source. You are, of course, free to use any of the data from the QSS book (found in the qss
package you have installed). Here are some other resources and repositories with data sets:
- List of links to political science data sets
- Harvard Dataverse - Social Science
- Data.gov - Data sets released by the US government
- Data published by FiveThirtyEight
- Harvard OpenData Group Directory
- Harvard Program on Survey Research Data Collections
- Roper Center : Public Opinion in the US
- Pew Research Center Data Sets
If you find a data set that you think is interesting, but you have problems loading the data set into R, please contact the course staff. R can load almost anything, so we can likely help you.
We also have the following specific datasets available to use:
- American National Election Survey, 2016: Data Codebook
- Civil Wars: Data Codebook
- Political Violence: Data Codebook
- Fox News rollout: Data Codebook
- Afrobarometer: Data Codebook
General advice for choosing data sources
- If you want to analyze the relationship between X and Y, make sure that these two variables are included in the data set. If you want to look at effects for subgroups, make sure there is a variable that you can use for subsetting.
- Try to look for a ‘codebook’ or some other document that explains what the variables mean and how they are coded.
- For most projects, preparing the data for analysis takes longer than the actual analysis itself. Try to find a data set where you do not need to extensively recode / clean up the data before you run your analyses, this makes the final project easier.
- In similar vein, if the data set is greater than about 50MB (this is not a hard cutoff), R commands and analyses tend to take longer.
- Data from experiments is usually simple to analyze, since the analysis commonly involves simple comparisons of group means.
Step 2: Proposal
You (or your group) should write a one-paragraph note to describe what data set you will use and what your tentative research question is. Your research question should ask how one dependent variable is related to one or more independent variables. That is, your research question should be able to be answered by a regression analysis. In this paragraph, you should do the following:
- State your research question.
- Formulate a hypothesis related to the research question. This hypothesis should be rooted in some sort of theory. In other words, you need to present a plausible story why the hypothesis might be true. Often, this is in the form of a behaviorial or institutional explanation. As social scientists, we are not interested in idiosyncratic or personality driven explanations; we want to understand systematic patterns and relationships!
- Describe your explanatory variable(s) of interest and how it is measured. Importantly, we need to observe variation in this variable in order to study it!
- Describe your outcome variable of interest and how it is measured.
- What observed pattern in the data would provide support for your hypothesis? More importantly, what observed pattern would disprove your hypothesis?
For instance, this would be a comprehensive paragraph that address each of these points in detail:
Does unified government enhance legislative productivity? In this study, I plan to examine the extent to which periods of unified government produce more landmark laws. I hypothesize that legislative productivity increases during periods of unified government in which one party controls both Houses of Congress and the presidency relative to periods of divided government. During periods of unified government, I expect that it is more likely for major bills to pass both Houses and gain the president’s signature. During periods of divided government, it is more difficult to reach a consensus around legislation that can pass each House and gain the president’s approval. My sample is comprised of each of the 79th (1945-1946) through 103rd (1993-1994) Congresses. My unit of analysis is a Congress (e.g., the 88th Congress). The explanatory variable of interest is whether there is unified government (both Houses and the presidency are controlled by the same party) or divided government. The variable is coded =1 for unified government and =0 for divided government. My outcome variable is the count of landmark pieces of legislation passed in a given Congress. For instance, if the variable were coded =11, it would mean that 11 pieces of landmark legislation were signed into law in that Congress. This variable is measured from David Mayhew’s data set on landmark legislation and relies on Mayhew’s expert knowledge to classify legislation as “landmark.” If I observe greater landmark legislative productivity under unified government relative to divided government, this would provide support for my hypothesis. If, on the other hand, I observe less productivity or the same level of productivity under unified government, this would provide evidence against my hypothesis. When I run my regression of the count of landmark legislation on the unified government indicator variable, a positive, significant coefficient would indicate support for my hypothesis.
Your group’s paragraph might be less detailed and may not refer to the political science literature, but it should address the above items.
Note that you are not fully committing to any specific question or data in this exercise. If you want to change data or questions later, that is fine. This is just a milestone to keep you on track. This proposal is due on Wednesday, October 28th, 2020 at 11:59pm ET. You should upload the PDF of your proposal to Gradescope, where you can indicate your group.
You can find a template Rmd file for the proposal here: final-project-proposal.Rmd
Step 3: Produce analyses
The next step will be to create an Rmarkdown file that loads the data you have selected, runs any preprocessing that you need to conduct (i.e., prepares your data for analysis), produces summary statistics (or plots) of the relevant dependent and independent variables, and conducts the main regression of interest for the project.
The Rmd file and the resulting knitted PDF should be uploaded to Gradescope by Wednesday, November 18th at 11:59pm ET. Note that these analyses do not have to be final. You may change them between this milestone and the final report. We just want to make sure you are making some progress.
Step 4: Write up final report
The final report will include the following sections: (1) an introduction where you introduce the research question and hypothesis and briefly describe why it is interesting; (2) a data section that briefly describes the data source, describes how the key dependent and independent variables are measured (e.g., a survey, statistical model, or expert coding), and also produces a plot that summarizes the dependent variable; (3) a results section that contains a scatterplot of the main relationship of interest and output for the main regression of interest; and (4) a brief (one paragraph) concluding section that summarizes your results, assesses the extent to which you find support for your hypothesis, describes limitations of your analysis and threats to inference, and states how your analysis could be improved (e.g., improved data that would be useful to collect).
For the data section, you should note if your research design is cross-sectional (most projects will be of this type) or one of the other designs we discussed (randomized experiment, before-and-after, differences-in-differences). For the results section, you should interpret (in plain English** the main coefficient of interest in your regression. You should also comment on the statistical significance of the estimated coefficient and whether or not you believe the coefficient to represent a causal effect.
This final report will be due on December 14th, 2020 at 11:59pm ET