94-842: Programming in R for Analytics, Fall 2019

Course Description

This course introduces students to R, a widely used statistical programming language. Students will learn to manipulate data objects, produce graphics, analyse data using common statistical methods, and generate reproducible statistical reports. They will also gain experience in applying these acquired skills in various public policy areas.

By the end of the class, students will be able to:
  • Use RStudio, read R documentation, and write R scripts.
  • Import, export and manipulate data.
  • Produce statistical summaries of continuous and categorical data.
  • Produce basic graphics using standard functions, and produce more advanced graphics using the ggplot2 library.
  • Perform common hypothesis tests and run simple regression models in R.
  • Produce reports of statistical analyses in R Markdown/R Notebooks.

License


All of the course materials on this page are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Textbooks

There are no required textbooks for this class. However, I highly recommend the freely available R for Data Science by Garrett Grolemund and Hadley Wickham.

Helpful resources

There are many resources online that may help you to learn R. A few that are particularly relevant for this course are listed below.

Course Work

Your grade in this course will be determined by a series of 5 weekly homework assignments (40%), lab submissions (10%), quizzes (10%), and a final project (40%).

Assignments

Each weekly assignment will take the form of a single R Markdown file, that is, code snippets integrated with captions and other narrative. Except where otherwise noted, assignments are due on Wednesdays at 1:30pm ET on the dates indicated on Canvas.

Your assignment score for the course will be calculated by averaging your four (4) highest homework scores. That is, your lowest homework score will not count toward your grade.
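
The drop-lowest rule can be illustrated with a minimal Python sketch. The function name and the sample scores are hypothetical; the sketch assumes each homework is scored out of 10 (as in the rubric) and that a missed homework is recorded as 0.

```python
def assignment_score(homework_scores):
    """Average the four highest of five homework scores (each out of 10)."""
    if len(homework_scores) != 5:
        raise ValueError("expected exactly 5 homework scores")
    # Sort in descending order and keep the top four; the lowest is dropped.
    top_four = sorted(homework_scores, reverse=True)[:4]
    return sum(top_four) / 4


# Hypothetical scores: the third homework was missed and is recorded as 0.
example_scores = [9.0, 8.5, 0.0, 10.0, 7.5]
print(assignment_score(example_scores))  # 8.75
```

In this example the missed homework is the one that gets dropped, so it does not lower the average.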

Each homework assignment will have 5 problems, each of which may have several parts. Your score for each assignment will be assigned according to the scheme outlined in the rubric below.

Homework rubric

Total: 10 points

Correctness: Each homework will have 5 problems, which will often have multiple parts. Each of the 5 problems will be worth 2 points. Deductions will be made at the discretion of the grader.

Style: Coding style is very important. With the exception of Homework 1, you will receive a deduction of up to 1 point if you do not adhere to good coding style.

  • No deduction if your homework is submitted with:
    • good, consistent coding style
    • appropriate use of variables
    • appropriate use of functions
    • good commenting
    • good choice of variable names
    • appropriate use of inline code chunks
  • -0.5 if coding style is acceptable, but fails on a couple of the criteria above.
  • -1 if coding style is overall poor and fails to adhere to many of the above criteria.


Participation

Lab activities

The lab session is scheduled for Fridays. Lab attendance is encouraged, but is not mandated due to the challenges this would present for students in remote timezones. During the lab sessions, students will get hands-on practice with the week's material by working on assigned lab activities. Members of the teaching staff will be available over Zoom to introduce the activities and to answer any questions you may have. Tasks may include but are not limited to: running or modifying code from the lecture, pair coding, or completing short coding exercises. In weeks when Friday sessions are cancelled due to holidays, you are still required to submit the labs in order for them to count toward your "participation" score.

All thirteen (13) scheduled lectures will have an associated lab component. Your Lab participation score for the course will be calculated based on the number of labs that you submit, as indicated in the table below.
Labs submitted    0-4    5-7    9-10    11-13
Score             0      5      7.5     10

Quizzes

There will be 4 short quizzes scheduled during the later weeks of class. Dates and times will be announced in advance. The purpose of these quizzes is to assess your understanding of various concepts that are central to the class. Your score on the quizzes will count for 10% of your final grade.

Final project


The final project for the class will ask you to explore a broad policy question using a large publicly available dataset. This project is intended to provide students with the complete experience of going from a study question and a rich data set to a full statistical report. Students will be expected to (a) explore the data to identify important variables; (b) perform statistical analyses to address the policy question; (c) produce tabular and graphical summaries to support their findings; and (d) write a report describing their methodological approach, findings, and limitations thereof.

While students may work in small groups to decide on appropriate statistical methodology and graphical/tabular summaries, each student will be required to produce and submit their own code and final report.

Regardless of grading basis, students must receive a score of at least 50% on the final project in order to pass the class.

Course Grading

Your final course grade will be calculated according to the following breakdown.
Assignments       40%
Participation     10%
Quizzes           10%
Final project     40%
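
As a rough illustration of how the breakdown combines, here is a minimal Python sketch. It assumes each component has already been converted to a percentage between 0 and 100; the function names and the sample student are hypothetical, and the sketch does not cover how percentages map to letter grades, which this page does not specify.

```python
# Weights taken from the grading breakdown above.
WEIGHTS = {
    "assignments": 0.40,
    "participation": 0.10,
    "quizzes": 0.10,
    "final_project": 0.40,
}


def course_percentage(component_percentages):
    """Weighted course total, given each component as a percentage (0-100)."""
    return sum(WEIGHTS[name] * component_percentages[name] for name in WEIGHTS)


def meets_project_minimum(component_percentages):
    """Check the rule that the final project score must be at least 50%."""
    return component_percentages["final_project"] >= 50


# Hypothetical student, for illustration only.
example = {
    "assignments": 87.5,
    "participation": 100,
    "quizzes": 80,
    "final_project": 90,
}
print(course_percentage(example))
print(meets_project_minimum(example))
```

Keeping the project minimum as a separate check mirrors the policy: a high weighted total does not by itself satisfy the 50% requirement on the final project.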

Late submission

Homework is to be submitted by 1:30pm ET on Wednesdays on the due date indicated, unless an alternate due date is announced.
Late homework will not be accepted for credit.

Note that your lowest homework score will not count toward your grade, so you can miss one homework without it counting toward your course grade.

Collaboration

You are encouraged to discuss homework problems with your fellow students. However, the work you submit must be your own. You must acknowledge in your submission any help received on your assignments. That is, you must include a comment in your homework submission that clearly states the name of the student, book, or online reference from which you received assistance.

Submissions that fail to properly acknowledge help from other students or non-class sources will receive no credit. Copied work will receive no credit. Any and all violations will be reported to Heinz College administration.

All students are expected to comply with the CMU policy on academic integrity. This policy can be found online at http://www.cmu.edu/academic-integrity/.

What constitutes plagiarism in a coding class?

The course collaboration policy allows you to discuss the problems with other students, but requires that you complete the work on your own. Every line of text and line of code that you submit must be written by you personally. You may not refer to another student's code, or a "common set of code" while writing your own code. You may, of course, copy/modify lines of code that you saw in lecture or lab.

The following discussion of code copying is taken from the Computer Science and Engineering Department at the University of Washington. You may find this discussion helpful in understanding the bounds of the collaboration policy.

"[It is] important to make sure that the assistance you receive consists of general advice that does not cross the boundary into using code or answers written by someone else. It is fine to discuss ideas and strategies, but you should be careful to write your programs on your own."

"You must not share actual program code with other students. In particular, you should not ask anyone to give you a copy of their code or, conversely, give your code to another student who asks you for it; nor should you post your solutions on the web, in public repositories, or any other publicly accessible place. [You may not work out a full communal solution on a whiteboard/blackboard/paper and then transcribe the communal code for your submission.] Similarly, you should not discuss your algorithmic strategies to such an extent that you and your collaborators end up turning in [essentially] the same code. Discuss ideas together, but do the coding on your own."

"Modifying code or other artifacts does not make it your own. In many cases, students take deliberate measures -- rewriting comments, changing variable names, and so forth -- to disguise the fact that their work is copied from someone else. It is still not your work. Despite such cosmetic changes, similarities between student solutions are easy to detect. Programming style is highly idiosyncratic, and the chance that two submissions would be the same except for changes of the sort made easy by a text editor is vanishingly small. In addition to solutions from previous years or from other students, you may come across helpful code on the Internet or from other sources outside the class. Modifying it does not make it yours."

"[I] allow exceptions in certain obvious instances. For example, you might be assigned to work with a project team. In that case, developing a solution as a team is expected. The instructor might also give you starter code, or permit use of local libraries. Anything which the instructor explicitly gives you doesn't normally need to be cited. Likewise, help you receive from course staff doesn't need to be cited."

If you have any questions about any of the course policies, please don't hesitate to ask. You may post your questions on Piazza or ask me directly.

Policies

Computing:

The statistical computing package we will use in this course is R, which is available on many campus computers. You may download your own copy from http://www.r-project.org. You are required to complete your assignments in R Markdown, which RStudio supports very well.

Communication:

Assignments and class information will be posted on Canvas and the class website.

Email:

The Piazza forum should be used for general course-related questions that may be of interest to others in the class. For other types of questions (e.g., to report illness, request various permissions) please contact Prof. Chouldechova via email.
Please include the course code 94842 in the subject line of your email.

Disability Services:

If you have a disability and need special accommodations in this class, please contact the instructor. You may also want to contact the Disability Resources office at 412-268-2013.

Tentative Schedule

Date
Topic
Due
Week 1: Introduction and Basics
Lecture 1 Introductions. Installing R on personal machines. Retrieving R packages.

Basics of R, RStudio, R Markdown.

Basic data types and operations: numbers, characters and composites.

Vectors, creating sequences, common functions.

Homework 0 assigned.

Lecture 1 notes [Rmd] [slides]

Lab 1 [Rmd] [html]

Lab 1 Solutions [Rmd] [html]

Lecture 2 Importing tabular data.

Simple summaries of categorical and continuous data.

R style basics

Lecture 2 notes [Rmd] [slides]

Lab 2 [Rmd] [html]


Lab 2 Solutions [Rmd] [html]
Week 2: Data frames, functions, loops, if/else
Lecture 3 More on data frames and lists.

Writing functions in R.

If/else statements.


Lecture 3 notes [slides] [Rmd]

Lab 3 [Rmd] [html]
Lab 3 Solutions [Rmd] [html]

Lecture 4
A common data cleaning task.

Functions

If-else statements


Lecture 4 notes [slides] [Rmd]

An Introduction to Factors in R

Lab 4 [Rmd] [html]
Lab 4 Solutions [Rmd] [html]
HW 1
Week 3: Data summaries and Graphics
Lecture 5
Multivariate statistical summaries

For/while loops.

Loop alternatives


Lecture 5 notes [Rmd] [html]

Lab 5 [Rmd] [html]
Lab 5 Solutions [Rmd] [html]

Lecture 6
Multivariate statistical summaries

Introduction to ggplot2 graphics


Lecture 6 notes [Rmd] [html]

Lab 6 [Rmd] [html]

Lab 6 Solutions [Rmd] [html]
Homework 3 assigned.
HW 2
Week 4: Statistical tests and models
Lecture 7
ggplot2 graphics


Lecture 7 notes
We wrapped up the Lecture 6 materials.


Only one lab this week! Continuing to call it Lab 7 even though it appears under the Lecture 8 materials.
Lecture 8
Testing for differences in means between two groups

QQ plots

Tests for 2x2 tables


Lecture 8 notes [Rmd] [html]

Lab 7 [Rmd] [html]

Lab 7 Solutions [Rmd] [html]

Supplement: Statistical significance testing [Rmd] [html]


Homework 4 assigned.
HW 3
Week 5: Linear regression
Lecture 9
Tests for 2x2 tables

Tests for jxk tables

Plotting error bars

Linear regression


Lecture 9 notes [Rmd] [html]

Lab 9 [Rmd] [html]

Lab 9 Solutions [Rmd] [html]

Supplement: Shiny apps
Ordinary least squares
Basic diagnostics

Supplement: diagnostic plots for lm objects
[Rmd] [html]

No lecture today
THANKSGIVING - NO CLASS
HW 4
Week 6: Regression, more graphics
Lecture 10
Linear regression

Lecture 10 notes [html] [Rmd]

Lab 10 [Rmd] [html]

Lab 10 Solutions [Rmd] [html]

Lecture 11
Interpreting categorical variables in regression

Interaction terms in regression


Lecture 11 notes [Rmd] [html]

Lab 11 [Rmd] [html]

Lab 11 Solutions [Rmd] [html]

Week 7: Interactive Graphics and Prediction
Lecture 12
Interaction terms in regression

Stratified regression models

Lecture 12 notes [html] [Rmd] [Rpres] [project slides]


HW 5
Lecture 13
Interactive graphics in R

Lecture 13 notes [Rmd] [html] [shiny]

[Rpres] [project slides]