Course Description

This course introduces students to R, a widely used statistical programming language. Students will learn to manipulate data objects, produce graphics, analyse data using common statistical methods, and generate reproducible statistical reports. They will also gain experience in applying these acquired skills in various policy areas.

By the end of the class, students learn to:

Course Info

Instructor: Jeremy C. Weiss , where yyy=jeremyweiss

Office: HBH 2101F

Office Hours: Jeremy C. Weiss, Thursdays 1pm, virtual (see Canvas for link)

Teaching Assistants (zzz), append @andrew.cmu.edu to zzz:

  • Shannon Dutchie (sdutchie)
  • Shri Ragavan (sragavan)
  • Qianying Zhao (qianying)
  • Jen Andre (jandre)
  • Mary Kubinski (mkubinsk)

This Website: http://www.andrew.cmu.edu/~jweiss2/21f_r/

All course materials will be posted on this site.

Homework submission: Assignments to be submitted via Canvas.

Prerequisites: Students must be enrolled in a graduate program in Heinz College. Special permission can be granted by the College.

License

All of the course materials on this page are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Textbooks

There are no required textbooks for this class. However, I highly recommend the freely available R for Data Science by Garrett Grolemund and Hadley Wickham.

Helpful resources

There are many resources online that may help you to learn R. A few that are particularly relevant for this course are listed below.

  • R Style guide
  • An Introduction to Factors in R
  • ggplot2 cheatsheet
  • The odds ratio: calculation, usage, and interpretation
  • Fisher’s exact test
  • Pearson’s Chi-squared test

This course is adapted from courses taught by Prof. Alexandra Chouldechova and Prof. Zig Zdziarski.

Course Work

Your grade in this course will be determined by a series of 5 weekly homework assignments (40%), lab submissions (10%), quizzes (10%), and a final project (40%).

Assignments

Each weekly assignment will take the form of a single R Markdown text file, that is, code snippets integrated with captions and other narrative. Except where otherwise noted, assignments are due on Wednesdays at 1:30pm ET on the dates indicated on Canvas.

Your assignment score for the course will be calculated by averaging your four (4) highest homework scores. That is, your lowest homework score will not count toward your grade.
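As a rough illustration of the drop-lowest rule, the R sketch below averages the four highest of five homework scores. The scores and the helper name `assignment_score` are made up for this example, and treating a missed homework as a score of 0 is an assumption; this is not the official grading code.

```r
# Hypothetical homework scores out of 10 (made-up values for illustration)
homework_scores <- c(8, 9.5, 6, 10, 7.5)

# Average the highest scores after dropping the lowest one(s).
# `assignment_score` is an illustrative helper, not part of any course package.
assignment_score <- function(scores, n_dropped = 1) {
  n_kept <- length(scores) - n_dropped
  kept_scores <- sort(scores, decreasing = TRUE)[seq_len(n_kept)]
  mean(kept_scores)
}

assignment_score(homework_scores)  # returns 8.75
```

Here the lowest score (6) is dropped, and the remaining scores 10, 9.5, 8, and 7.5 average to 8.75 out of 10.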

Each homework assignment will have 5 problems, each of which may have several parts. Your score for each assignment will be assigned according to the scheme outlined in the rubric below.

Homework rubric

Total: 10 points

Correctness: Each of the 5 problems is worth 2 points, and problems will often have multiple parts. Deductions will be made at the discretion of the grader.

Style: Coding style is very important. With the exception of Homework 1, you will receive a deduction of up to 1 point if you do not adhere to good coding style.

No deduction if your homework is submitted with:

  • good, consistent coding style
  • appropriate use of variables
  • appropriate use of functions
  • good commenting
  • good choice of variable names
  • appropriate use of inline code chunks

-0.5 if coding style is acceptable, but fails on a couple of the criteria above.
-1 if coding style is overall poor and fails to adhere to many of the above criteria.
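
To make the style criteria more concrete, here is a small R sketch of what code meeting them might look like. The data values, the function `fahrenheit_to_celsius`, and all variable names are invented for illustration; the R Style guide listed under Helpful resources remains the reference for what graders expect.

```r
# Convert temperatures from Fahrenheit to Celsius.
# A short function replaces repeated copy-pasted arithmetic.
fahrenheit_to_celsius <- function(temp_fahrenheit) {
  (temp_fahrenheit - 32) * 5 / 9
}

# Descriptive variable names make the intent clear (hypothetical data)
daily_highs_fahrenheit <- c(68, 77, 86)
daily_highs_celsius <- fahrenheit_to_celsius(daily_highs_fahrenheit)

# Store the summary in a variable so it can be reused later in the report
mean_high_celsius <- mean(daily_highs_celsius)
```

In an R Markdown report, a value such as `mean_high_celsius` could then be reported through an inline code chunk rather than typed by hand, so the text stays in sync with the data.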

Participation

Lab activities

The lab session is scheduled for Fridays. Lab attendance is encouraged but not mandatory, because requiring it would be difficult for students in remote time zones. During the lab sessions, students will get hands-on practice with the week’s material by working on assigned lab activities. Members of the teaching staff will be available over Zoom to introduce the activities and to answer any questions you may have. Tasks may include, but are not limited to, running or modifying code from the lecture, pair coding, or completing short coding exercises. In weeks when Friday sessions are cancelled due to holidays, you are still required to submit the labs in order for them to count toward your participation score.

All thirteen (13) scheduled lectures will have an associated lab component. Your Lab participation score for the course will be calculated based on the number of labs that you submit, as indicated in the table below.

Participation score

Labs submitted    Score
0-4               0
5-7               5
9-10              7.5
11-13             10

Quizzes

There will be 3-4 short quizzes scheduled during the later weeks of class. Dates and times will be announced in advance. The purpose of these quizzes is to assess your understanding of various concepts that are central to the class. Your score on the quizzes will count for 10% of your final grade.

Final project

The final project for the class will ask you to explore a broad policy question using a large publicly available dataset. This project is intended to provide students with the complete experience of going from a study question and a rich data set to a full statistical report. Students will be expected to (a) explore the data to identify important variables; (b) perform statistical analyses to address the policy question; (c) produce tabular and graphical summaries to support their findings; and (d) write a report describing their methodological approach, findings, and limitations thereof.

Regardless of grading basis, students must receive a score of at least 50% on the final project in order to pass the class.

Course Grading

Your final course grade will be calculated according to the following breakdown.

  • Assignments 40%
  • Participation 10%
  • Quizzes 10%
  • Final project 40%
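
The arithmetic behind this breakdown can be sketched in R as follows. The component scores are hypothetical, each expressed as a percentage, and the variable names are chosen for this example only; how percentages map to letter grades is not specified here and is left out.

```r
# Course weights taken from the breakdown above
grade_weights <- c(assignments = 0.40, participation = 0.10,
                   quizzes = 0.10, final_project = 0.40)

# Hypothetical component scores, each on a 0-100 percentage scale
component_scores <- c(assignments = 87.5, participation = 100,
                      quizzes = 80, final_project = 90)

# Weighted sum of the components; indexing by name keeps the order aligned
final_percentage <- sum(grade_weights * component_scores[names(grade_weights)])

# The syllabus requires at least 50% on the final project to pass the class
meets_project_minimum <- component_scores[["final_project"]] >= 50

final_percentage        # 89 for the made-up scores above
meets_project_minimum   # TRUE
```

With these made-up scores the contributions are 35 + 10 + 8 + 36 = 89, and the final project score clears the 50% minimum.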

Late submission

Homework is to be submitted by 1:30pm ET on Wednesdays on the due date indicated, unless an alternate due date is announced. Late homework will not be accepted for credit.

Note that your lowest homework score will not count toward your grade, so you can miss one homework without it counting toward your course grade.

Collaboration

You are encouraged to discuss homework problems with your fellow students. However, the work you submit must be your own. You must acknowledge in your submission any help received on your assignments. That is, you must include a comment in your homework submission that clearly states the name of the student, book, or online reference from which you received assistance.

Submissions that fail to properly acknowledge help from other students or non-class sources will receive no credit. Copied work will receive no credit. Any and all violations will be reported to Heinz College administration.

All students are expected to comply with the CMU policy on academic integrity. This policy can be found online at http://www.cmu.edu/academic-integrity/.

What constitutes plagiarism in a coding class?

The course collaboration policy allows you to discuss the problems with other students, but requires that you complete the work on your own. Every line of text and line of code that you submit must be written by you personally. You may not refer to another student’s code, or a “common set of code” while writing your own code. You may, of course, copy/modify lines of code that you saw in lecture or lab.

The following discussion of code copying is taken from the Computer Science and Engineering Department at the University of Washington. You may find this discussion helpful in understanding the bounds of the collaboration policy.

“[It is] important to make sure that the assistance you receive consists of general advice that does not cross the boundary into using code or answers written by someone else. It is fine to discuss ideas and strategies, but you should be careful to write your programs on your own.”

“You must not share actual program code with other students. In particular, you should not ask anyone to give you a copy of their code or, conversely, give your code to another student who asks you for it; nor should you post your solutions on the web, in public repositories, or any other publicly accessible place. [You may not work out a full communal solution on a whiteboard/blackboard/paper and then transcribe the communal code for your submission.] Similarly, you should not discuss your algorithmic strategies to such an extent that you and your collaborators end up turning in [essentially] the same code. Discuss ideas together, but do the coding on your own.”

“Modifying code or other artifacts does not make it your own. In many cases, students take deliberate measures – rewriting comments, changing variable names, and so forth – to disguise the fact that their work is copied from someone else. It is still not your work. Despite such cosmetic changes, similarities between student solutions are easy to detect. Programming style is highly idiosyncratic, and the chance that two submissions would be the same except for changes of the sort made easy by a text editor is vanishingly small. In addition to solutions from previous years or from other students, you may come across helpful code on the Internet or from other sources outside the class. Modifying it does not make it yours.”

“[I] allow exceptions in certain obvious instances. For example, you might be assigned to work with a project team. In that case, developing a solution as a team is expected. The instructor might also give you starter code, or permit use of local libraries. Anything which the instructor explicitly gives you doesn’t normally need to be cited. Likewise, help you receive from course staff doesn’t need to be cited.”

If you have any questions about any of the course policies, please don’t hesitate to ask. You may post your questions on Piazza or ask me directly.

Other Policies

Computing: The statistical computing package we will use in this course is R, which is available on many campus computers. You may download your own copy from http://www.r-project.org. You are required to complete your assignments in R Markdown, which RStudio supports well.

Communication: Assignments and class information will be posted on Canvas and the class website.

Email: The Piazza forum should be used for general course-related questions that may be of interest to others in the class. For other types of questions (e.g., to report illness, request various permissions) please contact Dr. Weiss via email. Please include the course code 94842 in the subject line of your email.

Disability Services: If you have a disability and need special accommodations in this class, please contact the instructor. You may also want to contact the Disability Resources office at 412-268-2013.

Schedule

Note 1: Links will go live as the course progresses.
Note 2: The course schedule is subject to change.
Note 3: Recordings of classes (at least 1 section per lecture) are here: recordings

Time Content Assignments
Week 1: Introduction and basics

Lecture 1 Introductions. Installing R on personal machines. Retrieving R packages.

Basics of R, RStudio, R Markdown.

Basic data types and operations: numbers, characters and composites.

Vectors, creating sequences, common functions.

Homework 0 assigned.

Lecture 1 notes Rmd slides

Lab 1 Rmd html

Lab 1 Solutions Rmd html

Lecture 2 Importing tabular data.

Simple summaries of categorical and continuous data.

R style basics

Lecture 2 notes Rmd slides

Lab 2 Rmd html

Lab 2 Solutions Rmd html

Week 2: Data frames, functions, loops, if/else

Lecture 3 More on data frames and lists.

Writing functions in R.

If/else statements.

Lecture 3 notes Rmd slides

Lab 3 Rmd html

Lab 3 Solutions Rmd html

Lecture 4
A common data cleaning task.

Functions

If-else statements

Lecture 4 notes Rmd slides

Lab 4 Rmd html

Lab 4 Solutions Rmd html

HW 1 due

Week 3: Data summaries and Graphics

Lecture 5
Multivariate statistical summaries

For/while loops.

Loop alternatives

Lecture 5 notes html

Lecture 6
Multivariate statistical summaries

Introduction to ggplot2 graphics

Lecture 6 notes Rmd html

Lab 6 Rmd html

Lab 6 Solutions Rmd html

Homework 3 assigned.

HW 2 due

Week 4: Statistical tests and models

Lecture 7
ggplot2 graphics

Lab 7 Rmd html

Lab 7 Solutions Rmd html

Lecture 8
Testing for differences in means between two groups

QQ plots

Tests for 2x2 tables

Lecture 8 notes R html

Lab 8 Rmd

Lab 8 Solutions Rmd html

HW 3 due

Quiz Friday

Week 5: Linear regression

Lecture 9
Tests for 2x2 tables

Tests for jxk tables

Plotting error bars

Linear regression

Lecture 9 notes Rmd html

Lab 9 Rmd html

Lab 9 Solutions Rmd html

Lecture 10
Linear regression

Lecture 10 notes html Rmd

Lab 10 Rmd html

Lab 10 Solutions Rmd html

HW 4 due, Quiz Friday

Week 6: Regression, more graphics

Lecture 11
Interpreting categorical variables in regression

Interaction terms in regression

Lecture 11 notes Rmd html

Lab 11 Rmd html

Lab 11 Solutions Rmd html

HW 5 due Wednesday

Week 7: Interactive graphics

Lecture 12
Final project introduction

Lecture 12 notes html Rmd

Lab 12 Rmd html

Lab 12 solutions Rmd html

Lecture 13
Shiny visualizations

Lecture 13 notes Rmd html R demo

Lab 13 R

Lab 13 solutions R

Quiz Friday