The words you are searching are inside this book. To get more targeted content, please make full-text search by clicking here.
Discover the best professional documents and content resources in AnyFlip Document Base.
Search
Published by Jonathan Kropko, 2018-01-11 15:54:09

syllabus

syllabus

UNIVERSITY OF VIRGINIA SPRING SEMESTER 2018

Analysis of Political Data Using R

PLAD 4500

Tuesday/Thursday 4:00-5:15pm, New Cabell Hall 187

Instructor:
Jonathan Kropko

Office
Gibson Hall 383

What is R and What Can it Do? Office hours
Fridays, 9am -
R, along with Python, is one of two primary coding languages 11am, and by
used in the growing field of data science. It’s used for appointment
statistical research in many fields such as political science.
Email

[email protected]

Office phone
(434) 924-4660

Data analysis and statistics have been primary Before the course begins, please
tools for research in political science for more download the following free software,
than 60 years. In the early days, researchers or update them to the latest versions:
had to program their analyses on stacks of
punch cards and take them to the single and 12
enormous computer on campus, feeding the
cards one at a time through a slot and praying R R STUDIO
that none of the cards get jammed.
Available from the An excellent user
Starting about 20 years ago, researchers Comprehensive R interface for R
started to use software that can be operated Archive Network: https://
digitally on a local machine. Most cran.r-project.org/
available at: https://
researchers used software named SPSS, SAS, www.rstudio.com/
or Stata.

1

PLAD 4500: ANALYSIS OF POLITICAL DATA SPRING SEMESTER 2018

The problem with these software packages is that
they are proprietary. That means two things.
First, it means that these software packages are
slow to update and incorporate new statistical
methods. Second, it means that the software is
expensive.

While these software packages are still widely
used, things are changing quickly. More and more
researchers, both within academia and outside of
it, are using open-source platforms for statistical
computing. R has become the primary choice for statistics, and python has become the primary
choice for other applications in data science.

There are several reasons why R is now our best option. First, R is and will continue to be absolutely
100% free. Second, R produces beautiful, logical, interactive graphics. Third, R is the most widely
used software platform by serious statisticians, and when researchers develop a new statistical
method, they usually will write a new R library to implement the method and release it to the public.
So R’s functionality increases at about the same rate as new research in statistics. Fourth, R very
recently developed powerful and user-friendly tools for data management and for easily weaving
code and results into a text document. Finally, knowledge of R is an important skill that will get the
attention of employers.

R PACKAGES

R comes with a large set of commonly used functions for statistics and data management. But for more
specialized topics we will need to download a package of additional functions from the CRAN
repository. For example, one additional package we will use is ggplot2, which is a powerful engine for
producing visualizations and graphics. To download this package open RStudio and follow these steps:
1. RStudio’s interface includes several different windows. Find the

window that says Console in the upper-left corner and click inside
this window. You should see a cursor that will allow you to type.
2. Type install.packages("ggplot2") and push enter. The ggplot2
package will now be downloaded. (You only have to do this one
time after updating R)
3. To use the functions inside the ggplot2 package, type
library(ggplot2) in the console. Note the lack of

quotation marks. Now you can use the functions inside ggplot2.

download or update other packages. (You will have to do this
each time you start an R session to use ggplot)

2

PLAD 4500: ANALYSIS OF POLITICAL DATA SPRING SEMESTER 2018

Course Organization 
 R: THE WILD WEST OF STATS
COMPUTING
We are going to build up to cool things, but we
have to take our time and start from the basics. No one person
That might get frustrating. Early on the lab or company
assignments may not be very interesting. Bear creates R. It’s a
with me. It's better than accelerating the course collective effort
and leaving people behind. This course is of thousands of
divided into five parts: researchers in
many fields
1. Introduction to R and RStudio: we’ll get around the
used to the user interface, learn about world. When a researcher develops a new
object oriented programming, and scripts. technique to conduct statistics or work
We’ll talk about how to use the help with data, he or she writes R code to
resources well and post on webpages like perform this task and distributes it through
Stack Overflow. And we’ll learn about knitr, CRAN or another repository. That’s one of
an R package for creating documents that the best things about R.
weave text, code, and the results of the code
all in one HTML or PDF document. But the drawback is that there are often
many, many ways to do the same thing in
2. Advanced data cleaning techniques: most R. The approaches we will discuss are not
data is messy. And no further analysis is the only way to perform the tasks we need
possible until the data have been cleaned. to accomplish. Whenever possible, I will
We’ll work to reformat data to a tidy format try to present the approaches in class that
by managing the observations and variables, are easiest to teach and to understand,
reshaping, and merging datasets together. that use fewer lines of code, and run more
quickly. You might find another way to do
3. Descriptive statistics and graphics: we’ll these things, and that's great. But if you
cover the functions to calculate means, stick to the ways we talk about in class, I
medians, variances, quantiles, and can more easily follow your work and give
correlations, and we will also learn to use you full credit.
ggplot, a package to create beautiful data
visualizations. 5. Analyzing text as data: all of the work we
will do will lead up to an introduction to
4. Linear and logistic regression: regressions using text as data. In text analysis, we take
are tools for assessing whether one variable a document, such as the U.S. Constitution,
has an effect on another. We’ll learn how to the presidential state of the union
run, interpret, and graph the results of these addresses, or speeches by delegates at the
regressions. UN, and use R to extract information to use
as data. This is a cutting edge topic in
political science and a core topic in data
science.


3

PLAD 4500: ANALYSIS OF POLITICAL DATA SPRING SEMESTER 2018

Readings

There are three textbooks for this course. Please complete the readings prior to class.


R for Data Science
by Hadley Wickham and Garrett Grolemund
A free, online copy is available here (for now):
http://r4ds.had.co.nz/

Text Mining with R
by Julia Slige and David Robinson,
A free, online copy is available here (for now):
https://www.tidytextmining.com/

Political Analysis Using R
by James E. Monogan
Available from the UVA bookstore or through
Amazon: https://www.amazon.com/Political-
Analysis-Using-James-Monogan/dp/3319234455/

Course Website

All grades, labs, and readings will be posted on the UVa Collab site for the course, accessible at
https://collab.itc.virginia.edu/portal. If you are officially enrolled in the course, the course website
should already be accessible to you. If you are not officially enrolled, please speak to me so I can
arrange for you to have access to the course material.


Equipment

If you own a laptop computer capable of running R, please bring it to class so that you can work on
labs in class. If you do not have a computer capable of running R, let me know so we can make other
arrangements.


4

PLAD 4500: ANALYSIS OF POLITICAL DATA SPRING SEMESTER 2018

Assessment


Your grade will be determined by two factors:

• Your performance on nine lab reports (65% of the grade)

• Your final data analysis project (35% of the grade)

There is no midterm or final exam in this course. The final grades will be determined from the final
percents as follows:


Labs Lab In class Due date

There will be 9 lab sessions during the semester, each on a Feb 8
Feb 15
Thursday. The labs will help you practice the techniques in R we 1 Feb 1 Feb 22
March 8
will discuss in class. We will devote class time to working on 2 Feb 8 March 29
labs so that I can be available to answer questions about the lab April 12
April 19
and help get everyone started. You will work in groups of 3 on 3 Feb 15 April 26
May 3
the labs. You and your group will write a lab report together 4 March 1
using the R Markdown language and the knitr package in R, 5 March 22
saved as an HTML file, to be submitted on Collab before class

the following Thursday. I will assign groups, and I will rotate 6 April 5

who gets grouped with whom throughout the semester. I only 7 April 12
need one lab report per group. Labs will be graded out of 10

points each, based on the following criteria: 8 April 19

• Completeness (4 points) — is every part of the lab 9 April 26
completed? Are the techniques employed correctly?

• Communication (4 points) — are the results communicated clearly and accurately? Does the
lab report demonstrate that the authors understand the intuition of the techniques and code?

• Formatting (2 points) — is the lab report formatted cleanly? How close are the tables and
graphics to publishable quality?

Chelsea Goforth will serve as the grader for this course, and she will assign the grades for the labs.
You should feel free to contact her at [email protected] with questions about the course.

Final Data Project

At the end of the semester you will complete a data management and analysis project. I will give you
multiple data sources, some of which may be quite messy. You will need to load, clean, and combine
the data, perform an analysis and compile a clear report on the results of the analysis. I will provide
more detail about this project later in the semester.

5

PLAD 4500: ANALYSIS OF POLITICAL DATA SPRING SEMESTER 2018

Collaboration and Cheating

The following actions are examples of cheating:

• fabricating the results of statistical analyses,

• plagiarizing any part of any lab report or any part of the term paper, including prose, graphs,
tables, and equations,

• directly copying answers on labs from another student, or allowing answers to be copied.

Although you are working in groups, I want you each to work together on the entire lab, rather
than dividing the work between the three of you.

If you are unsure about the difference between collaboration on labs and copying, please come speak
to me before there are any potential problems.

Any student who is caught cheating on an assignment will be reported to the Honor Committee.


Schedule and Readings

R4DS = R for Data Science; PAUR = Political Analysis Using R; TMWR = Text Mining With R

Week Date Topic Reading

1 Thursday, January 18 What we can do with R?

Tuesday, January 23 Data, R, R Studio, packages and CRAN PAUR ch. 1, R4DS ch 1, 4

2 Objects and scripts R4DS ch. 6

Thursday, January 25 Getting help, using R markdown and knitr R4DS ch. 27, 29, 30
Tuesday, January 30
Lab 1
3
Tidy data, opening and saving data files, R4DS section 12.1, 12.2,
Thursday, February 1
Tuesday, February 6 managing rows and columns ch. 5

4 Lab 2

Thursday, February 8 Managing categorical objects (factors) R4DS ch. 15
Tuesday, February 13
Lab 3
5
Strings, dates and times, using pipes R4DS ch. 14, 16, 18
Thursday, February 15
Tuesday, February 20 Reshaping R4DS ch. 12

6

Thursday, February 22

6

PLAD 4500: ANALYSIS OF POLITICAL DATA SPRING SEMESTER 2018

Tuesday, February 27 Merging R4DS ch. 13 (not 13.7)

7 Lab 4

Thursday, March 1 SPRING BREAK
Tuesday, March 6
SPRING BREAK
8
Descriptive statistics PAUR ch. 4, 5
Thursday, March 8
Tuesday, March 13 ggplot R4DS ch. 3, 28

9 ggplot, continued, interactive graphics https://plot.ly/r/#fundamentals

Thursday, March 15 https://cran.r-project.org/web/
Tuesday, March 20 packages/googleVis/vignettes/
googleVis_examples.html
10
Lab 5
Thursday, March 22
Tuesday, March 27 Statistical models, bivariate linear PAUR ch. 6
regression
11
Inference and confidence TBD
Thursday, March 30
Tuesday, April 3 Multiple regression, control variables, TBD
model fit
12
Lab 6
Thursday, April 5
Tuesday, April 10 Logistic regression, probability plots PAUR section 7.1

13 Lab 7

Thursday, April 12 Text as data TMWR ch. 1, 3
Tuesday, April 17
Lab 8
14
Sentiment analysis TMWR ch. 2
Thursday, April 19
Tuesday, April 24 Lab 9

15 N-grams, predictive modeling, topic TMWR ch. 4-6
modeling
Thursday, April 26
16 Tuesday May 1

7


Click to View FlipBook Version