94-842: Programming in R for Analytics, Fall 2017

Course Description

This course introduces students to R, a widely used statistical programming language. Students will learn to manipulate data objects, produce graphics, analyse data using common statistical methods, and generate reproducible statistical reports. They will also gain experience in applying these acquired skills in various public policy areas.

By the end of the class, students will be able to:
  • Use RStudio, read R documentation, and write R scripts.
  • Import, export and manipulate data.
  • Produce statistical summaries of continuous and categorical data.
  • Produce basic graphics using standard functions, and produce more advanced graphics using the ggplot2 library.
  • Perform common hypothesis tests, and run simple regression models in R.
  • Produce reports of statistical analyses in R Markdown.

License


All of the course materials on this page are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Textbooks

While there are no required textbooks for this class, the following references are highly recommended. Students may find it useful to own a personal copy of one or two of the texts below.

Recommended textbooks

Helpful resources

There are many resources online that may help you to learn R. A few that are particularly relevant for this course are listed below.

Course Work

Your grade in this course will be determined by a series of 5 weekly homework assignments (35%), lab participation (10%), quizzes (10%) and a final project (45%).

Assignments

Weekly assignments will take the form of a single R Markdown text file: namely, code snippets integrated with captions and other narrative. All assignments are due Thursdays at 4:20pm on the dates indicated in the Sidebar.

Your assignment score for the course will be calculated by averaging your four (4) highest homework scores. That is, your lowest homework score will not count toward your grade.
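
As a rough illustration of the drop-lowest rule, the calculation could be sketched in R as follows. The scores are made-up values and `homework_average` is a hypothetical helper, not part of the course materials.

```r
# Hypothetical helper: average the `keep` highest homework scores,
# so that with five assignments only the single lowest is dropped.
homework_average <- function(scores, keep = 4) {
  stopifnot(length(scores) >= keep)
  top_scores <- sort(scores, decreasing = TRUE)[seq_len(keep)]
  mean(top_scores)
}

# Made-up scores (out of 10) for five weekly assignments
hw_scores <- c(9, 7.5, 10, 6, 8.5)
homework_average(hw_scores)  # 8.75
```

Here the lowest score (6) is discarded and the remaining four scores are averaged.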

Each homework assignment will have 5 problems, each of which may have several parts. Your score for each assignment will be assigned according to the scheme outlined in the rubric below.

Homework rubric

Total: 10 points

Correctness: Each problem will be worth 2 points. Deductions will be made at the discretion of the grader.

Knitting: -0.5 deduction if the Rmd file you submit does not knit correctly (i.e., if there are errors and no HTML file is produced when the grader attempts to knit your Rmd file).

  • If your Rmd file fails to knit, you will be contacted by the grader and will be given 24 hours to resubmit your homework. You will need to trace the source of the error(s) and correct them.

    Style: Coding style is very important. With the exception of Homework 1, you will receive a deduction of up to 1 point if you do not adhere to good coding style.

    • No deduction if your homework is submitted with:
      • good, consistent coding style
      • appropriate use of variables
      • appropriate use of functions
      • good commenting
      • good choice of variable names
      • appropriate use of inline code chunks
    • -0.5 if coding style is acceptable, but fails on a couple of the criteria above.
    • -1 if coding style is overall poor and fails to adhere to many of the above criteria.


    Participation

    Lab activities

    The Lab session is scheduled for Fridays from 4:30 - 5:50PM. Lab attendance is mandatory, and counts toward your course grade. During the lab sessions, students will get hands-on practice with the week's material by working on assigned lab activities. Members of the teaching staff will be present to introduce the activities and to answer any questions you may have. Tasks may include but are not limited to: running or modifying code from the lecture, pair coding, or completing short coding exercises. During weeks where Friday sessions are cancelled due to holidays (50th anniversary celebration, Thanksgiving), you are still expected to attempt and submit the labs.

    All thirteen (13) scheduled lectures will have an associated lab component. Your Lab participation score for the course will be calculated based on the number of labs that you submit, as indicated in the table below.
    Labs submitted:       0-4    5-7    9-10    11-13
    Participation score:  0      5      7.5     10

    Quizzes

    There will be 3 short quizzes scheduled during the later weeks of class. Dates and times will be announced in advance. The purpose of these quizzes is to assess your understanding of various concepts that are central to the class. Your score on the quizzes will count for 10% of your final grade.

    Final project


    The final project for the class will ask you to explore a broad policy question using a large publicly available dataset. This project is intended to provide students with the complete experience of going from a study question and a rich data set to a full statistical report. Students will be expected to (a) explore the data to identify important variables; (b) perform statistical analyses to address the policy question; (c) produce tabular and graphical summaries to support their findings; and (d) write a report describing their methodological approach, findings, and limitations thereof.

    While students may work in small groups to decide on appropriate statistical methodology and graphical/tabular summaries, each student will be required to produce and submit their own code and final report.

    Regardless of grading basis, students must receive a score of at least 50% on the final project in order to pass the class.

    Course Grading

    Your final course grade will be calculated according to the following breakdown.
    Assignments      35%
    Participation    10%
    Quizzes          10%
    Final project    45%
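
For a sense of how these weights combine, a minimal sketch in R is shown below. It assumes each component score is expressed as a percentage from 0 to 100; the student scores are invented and `course_percentage` is a hypothetical helper. The syllabus does not state how the combined percentage maps to a letter grade, so the sketch stops at the weighted total.

```r
# Component weights taken from the grading breakdown above
weights <- c(assignments = 0.35, participation = 0.10,
             quizzes = 0.10, final_project = 0.45)

# Hypothetical helper: weighted sum of component percentages (0-100)
course_percentage <- function(scores, weights) {
  stopifnot(setequal(names(scores), names(weights)))
  sum(scores[names(weights)] * weights)
}

# Made-up component scores for one student
scores <- c(assignments = 87.5, participation = 100,
            quizzes = 80, final_project = 90)
course_percentage(scores, weights)  # 89.125

# Separate pass requirement: at least 50% on the final project
scores[["final_project"]] >= 50  # TRUE
```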

    Late submission

    Homework is to be submitted by 4:20pm on Thursdays on the due date indicated.
    Late homework will not be accepted for credit.

    Note that your lowest homework score will not count toward your grade, so you can miss one homework without it counting toward your course grade.

    Collaboration

    You are encouraged to discuss homework problems with your fellow students. However, the work you submit must be your own. You must acknowledge in your submission any help received on your assignments. That is, you must include a comment in your homework submission that clearly states the name of the student, book, or online reference from which you received assistance.

    Submissions that fail to properly acknowledge help from other students or non-class sources will receive no credit. Copied work will receive no credit. Any and all violations will be reported to Heinz College administration.

    All students are expected to comply with the CMU policy on academic integrity. This policy can be found online at http://www.cmu.edu/academic-integrity/.

    What constitutes plagiarism in a coding class?

    The course collaboration policy allows you to discuss the problems with other students, but requires that you complete the work on your own. Every line of text and line of code that you submit must be written by you personally. You may not refer to another student's code, or a "common set of code" while writing your own code. You may, of course, copy/modify lines of code that you saw in lecture or lab.

    The following discussion of code copying is taken from the Computer Science and Engineering Department at the University of Washington. You may find this discussion helpful in understanding the bounds of the collaboration policy.

    "[It is] important to make sure that the assistance you receive consists of general advice that does not cross the boundary into using code or answers written by someone else. It is fine to discuss ideas and strategies, but you should be careful to write your programs on your own."

    "You must not share actual program code with other students. In particular, you should not ask anyone to give you a copy of their code or, conversely, give your code to another student who asks you for it; nor should you post your solutions on the web, in public repositories, or any other publicly accessible place. [You may not work out a full communal solution on a whiteboard/blackboard/paper and then transcribe the communal code for your submission.] Similarly, you should not discuss your algorithmic strategies to such an extent that you and your collaborators end up turning in [essentially] the same code. Discuss ideas together, but do the coding on your own."

    "Modifying code or other artifacts does not make it your own. In many cases, students take deliberate measures -- rewriting comments, changing variable names, and so forth -- to disguise the fact that their work is copied from someone else. It is still not your work. Despite such cosmetic changes, similarities between student solutions are easy to detect. Programming style is highly idiosyncratic, and the chance that two submissions would be the same except for changes of the sort made easy by a text editor is vanishingly small. In addition to solutions from previous years or from other students, you may come across helpful code on the Internet or from other sources outside the class. Modifying it does not make it yours."

    "[I] allow exceptions in certain obvious instances. For example, you might be assigned to work with a project team. In that case, developing a solution as a team is expected. The instructor might also give you starter code, or permit use of local libraries. Anything which the instructor explicitly gives you doesn't normally need to be cited. Likewise, help you receive from course staff doesn't need to be cited."

    If you have any questions about any of the course policies, please don't hesitate to ask. You may post your questions on Piazza or ask me directly.

    ISLE research

    For this class, I am conducting research on a new statistics lab format called ISLE. This research will involve deploying ISLE-enabled lab activities. You will not be asked to do anything above and beyond the normal learning activities and assignments that are part of this course. You are free not to participate in this research, and your participation will have no influence on your grade for this course or your academic career at CMU. Participants will not receive any compensation. The data collected as part of this research will include student grades. All analyses of data from participants’ coursework will be conducted after the course is over and final grades are submitted. The Eberly Center may provide support on this research project regarding data analysis and interpretation. To minimize the risk of breach of confidentiality, the Eberly Center will never have access to data from this course containing your personal identifiers. All data will be analyzed in de-identified form and presented in the aggregate, without any personal identifiers. Please contact me at the email address located in the course page sidebar if you have questions or concerns about your participation.

    Policies

    Computing:

    The statistical computing package we will use in this course is R, which is available on many campus computers. You may download your own copy from http://www.r-project.org. We require that you use R Markdown to complete your assignments; RStudio supports R Markdown very well.

    Laptop Policy:

    Students are expected to participate in class, either on their own laptops or on the provided lab machines.

    Communication:

    Assignments and class information will be posted on Canvas and the class website.

    Email:

    The Piazza forum should be used for general course-related questions that may be of interest to others in the class. For other types of questions (e.g., to report illness, request various permissions) please contact Prof. Chouldechova via email.
    Please include the course code 94842 in the subject line of your email.

    Disability Services:

    If you have a disability and need special accommodations in this class, please contact the instructor. You may also want to contact the Disability Resources office at 8-2013.

    Tentative Schedule

    Date    Topic    Due
    Week 1: Introduction and Basics
    Tue 10/24 Introductions. Installing R on personal machines. Retrieving R packages.

    Basics of R, RStudio, R Markdown.

    Basic data types and operations: numbers, characters and composites.

    Vectors, creating sequences, common functions.

    Homework 0 assigned.

    Lecture 1 notes [Rpres] [slides]

    Lab 1 [Rmd] [html]

    Thu 10/26 Importing tabular data.

    Simple summaries of categorical and continuous data.

    R style basics

    Lecture 2 notes [Rpres] [slides] [Rmd] [html]

    Lab 2 [ISLE lab link]

    Week 2: Data frames, functions, loops, if/else
    Tue 10/31 More on data frames and lists.

    Writing functions in R.

    If/else statements.


    Lecture 3 notes [Rpres] [slides]

    Lab 3 [ISLE link]
    Thu 11/2
    A common data cleaning task.

    For/while loops.

    Using apply() to iterate over data.

    Using with() to specify environment.


    Lecture 4 notes [Rpres] [slides] [Rmd] [html]

    An Introduction to Factors in R

    Lab 4 [ISLE link]
    Due: HW 1
    Week 3: Data summaries and Graphics
    Tue 11/7
    Introduction to plyr

    Multivariate statistical summaries

    Introduction to ggplot2 graphics


    Lecture 5 notes [Rpres] [slides] [Rmd] [html]

    Practice exercises [Rmd] [html]

    Thu 11/9
    ggplot2


    Lecture 6 notes [Rmd] [html]

    Lab 6 [ISLE link]
    Homework 3 assigned.
    Due: HW 2
    Week 4: Statistical tests and models
    Tue 11/14
    Testing for differences in means between two groups

    QQ plots

    Tests for 2x2 tables

    Plotting confidence intervals


    Lecture 7 notes [Rmd] [html]

    Lab 7 [ISLE link]

    Supplement: Statistical significance testing [Rmd] [html]

    Thu 11/16
    ANOVA

    Linear regression

    Assessing multicollinearity

    Diagnosing and interpreting regression


    Lecture 8 notes [Rmd] [html]

    Lab 8 [ISLE link]

    Supplement: diagnostic plots for lm objects [Rmd] [html]


    Homework 4 assigned.
    Due: HW 3
    Week 5: Linear regression
    Tue 11/21
    More linear regression


    Lecture 9 notes [Rmd] [html]
    [proj.Rmd] [proj.html]

    Lab 9 [ISLE link]
    Supplement: Shiny apps
    Ordinary least squares
    Basic diagnostics

    Final project assigned.
    Thu 11/23 Thanksgiving holiday, no class. Due: HW 4
    Week 6: plyr, Logistic regression
    Tue 11/28
    Interpreting categorical variables in regression

    Interaction terms in regression


    Lecture 10 notes [Rmd] [html]

    Thu 11/30
    plyr

    Final project discussion


    plyr package: split-apply-combine

    ggplot practice
    Lecture 11 notes [html] [Rmd]

    Week 7: dplyr and Interactive Graphics
    Tue 12/05
    Introduction to dplyr

    Lecture 12 notes [html] [Rmd] [Rpres] [project slides]




    Due: HW 5
    Thu 12/07
    Course summary

    More R packages

    Shiny

    Lecture 13 notes [Rpres] [slides] [shiny]