Getting started

Read through the project descriptions for each of the 6 projects to identify the project that you and your team would like to work on. You can download the data for the projects by navigating through the file directory at the following link:

http://www.andrew.cmu.edu/user/achoulde/95791/projects/

Grading rubric

Your submission will be graded according to the following rubric

Project A: Predicting Home Sale Price

Associated data sets

AmesHousing.csv 

Data description

It’s 2011 and you are consulting for a local real estate agent in Ames, Iowa. The agent has assembled detailed data on house characteristics and sales price for every home sold in the area from 2006-2010. She has asked for your assistance with developing a sales price prediction model that she can use to better price her listings and to give her clients advice on worthwhile home improvement projects.

A complete description of the data elements can be found here

Start by taking all of the 2010 data and holding it out as a validation set. You should not use this data for training your model. This data will be used to evaluate the ability of your model to forecast sales prices on newly listed homes.
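The hold-out split described above can be sketched in a few lines of Python. This is only an illustration: the field name `yr_sold` is a hypothetical stand-in for whatever the sale-year column is called in the data description, and the records are made up.

```python
def split_by_year(rows, year_key="yr_sold", holdout_year=2010):
    """Split sale records into training rows and a held-out validation set.

    `year_key` is an assumed column name; replace it with the real one.
    """
    train = [r for r in rows if int(r[year_key]) != holdout_year]
    holdout = [r for r in rows if int(r[year_key]) == holdout_year]
    return train, holdout


# Toy records standing in for rows read from AmesHousing.csv (values are made up).
sales = [
    {"yr_sold": "2008", "price": 150000},
    {"yr_sold": "2010", "price": 175000},
    {"yr_sold": "2006", "price": 210000},
]
train, holdout = split_by_year(sales)
```

Any model selection or tuning would then use `train` only, with `holdout` touched once at the end.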

Key tasks

  1. Perform an exploratory data analysis to identify general trends in sales prices across time, neighbourhood, and home characteristics.

  2. Using the data on 2006-2009 house sales, build a sales price prediction model that is as accurate as possible. Evaluate its performance using appropriate validation methods on just the 2006-2009 data. Compare the estimated prediction accuracy to your model's actual performance on the 2010 data.

    (Tip: For this problem you’ll want to think about performance metrics other than MSE, for instance MAE (mean absolute error), relative error ((prediction - actual) / actual), and others.)

  3. What home features are most predictive of sales price? How are those features related to price?

  4. It’s not uncommon for the accuracy of a house price prediction model to vary geographically. Does your model appear to do equally well across the different neighbourhoods, or are your predictions worse in some regions?

  5. Clients are often interested in knowing whether particular home improvements can increase the value of their home beyond the cost of the home improvement. Common improvements include adding bathrooms, remodeling kitchens, finishing basements, and putting on new roofs. Create a “renovation value” calculator that predicts the expected bump in sales price for each of these types of home improvements. Apply it to listings in the 2010 data where some of these improvements are possible. Are there any cases where you would recommend particular home improvements to the clients? For this question you will want to look up typical costs for these types of renovations.
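As a rough illustration of the metrics mentioned in the tip for task 2, and of the per-neighbourhood comparison in task 4, the sketch below computes MAE, signed relative error, and MAE within each group. The prices and neighbourhood labels are invented for the example, and the function names are not part of any library.

```python
from collections import defaultdict


def mae(actual, predicted):
    """Mean absolute error, in the same units as the sale price."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def relative_errors(actual, predicted):
    """Signed relative error (prediction - actual) / actual for each sale."""
    return [(p - a) / a for a, p in zip(actual, predicted)]


def mae_by_group(actual, predicted, groups):
    """MAE computed separately within each group (e.g. neighbourhood)."""
    buckets = defaultdict(list)
    for a, p, g in zip(actual, predicted, groups):
        buckets[g].append(abs(p - a))
    return {g: sum(errs) / len(errs) for g, errs in buckets.items()}


# Made-up prices and neighbourhood labels, for illustration only.
actual = [100000, 200000, 150000, 250000]
predicted = [110000, 190000, 150000, 300000]
hoods = ["A", "A", "B", "B"]

overall = mae(actual, predicted)
rel = relative_errors(actual, predicted)
per_hood = mae_by_group(actual, predicted, hoods)
```

A large gap between groups in `per_hood` would suggest the model predicts some neighbourhoods less well than others, though small groups give noisy estimates.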

Project B: Criminal Recidivism Prediction

Associated data sets

ProPublica's COMPAS data: https://github.com/propublica/compas-analysis

Data description

Many jurisdictions around the United States are using risk assessment instruments (RAIs) in helping judges make bail decisions. Pre-trial RAIs are often statistical models that try to predict the likelihood that an individual will commit a crime if released on bail pending their court date. For the past several years, the courts in Broward County, Florida have been using one of the RAIs in the COMPAS suite to inform their decisions. In May of 2016 an investigative journalism team at ProPublica published a report that analysed whether COMPAS might be racially biased. You will use the publicly released data from their analysis in performing the key tasks below.

Key tasks

  1. The COMPAS tool was not developed on the Broward County population. Going forward, the County is considering developing their own RAI to replace COMPAS. Using the available data, construct an RAI for predicting two-year recidivism. Evaluate the predictive performance of your model. What are the most important predictors of recidivism?

  2. Construct an RAI for predicting violent recidivism. Evaluate the predictive performance of your model. What are the most important predictors of violent recidivism? How do they compare to the important predictors of general recidivism?

  3. Are your RAIs from (1) and (2) equally predictive across race/ethnicity groups? How about across age and sex groups?

  4. Compare your RAIs to the COMPAS RAI. Do your RAIs perform better or worse than COMPAS? Do your RAIs produce similar classifications to COMPAS? Can you identify any systematic differences between your classifications and those of COMPAS?
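One possible way to approach task 3 is to compare error rates across groups. The sketch below computes false positive and false negative rates within each group from binary labels and binary predictions. It is a minimal example under stated assumptions: the group labels and data are invented, and these two rates are only one choice among several reasonable measures (AUC or calibration within groups are others).

```python
from collections import defaultdict


def error_rates_by_group(y_true, y_pred, groups):
    """False positive and false negative rates within each group.

    y_true and y_pred hold 0/1 values; a rate is None when its
    denominator (actual negatives or actual positives) is zero.
    """
    counts = defaultdict(lambda: {"fp": 0, "tn": 0, "fn": 0, "tp": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        # "t"/"f" = prediction correct or not; "p"/"n" = predicted class.
        key = ("t" if t == p else "f") + ("p" if p == 1 else "n")
        counts[g][key] += 1
    rates = {}
    for g, c in counts.items():
        negatives = c["fp"] + c["tn"]
        positives = c["fn"] + c["tp"]
        rates[g] = {
            "fpr": c["fp"] / negatives if negatives else None,
            "fnr": c["fn"] / positives if positives else None,
        }
    return rates


# Invented outcomes, predictions, and group labels.
y_true = [0, 0, 1, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]
groups = ["g1", "g1", "g1", "g1", "g2", "g2", "g2", "g2"]
rates = error_rates_by_group(y_true, y_pred, groups)
```

The same function can be applied to the COMPAS classifications for task 4, once a score threshold has been chosen to turn them into 0/1 predictions.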

Project C: Predicting flight delays

Associated data set

All_PIT_2006.csv        

Data description

On-time performance is a very important criterion used to measure the quality of customer service by commercial transportation companies. Airlines, operating at very tight margins of profit these days, are vitally interested in maintaining high levels of customer satisfaction despite operational constraints due to staffing reductions and tighter schedules. Monitoring on-time performance and detecting patterns of delays are therefore of particular interest to them. Consumer advocacy groups and individual travelers could also benefit from being aware of systematic or predictable delays, either to reduce the risk of disruptions to their planned trips or to put pressure on service providers to fix the root causes of the problems.

The data in the associated file All_PIT_2006.csv, provided here as an example, is an extract from the Airline On-Time Performance Data made available through the Bureau of Transportation Statistics of the U.S. Department of Transportation. It reflects commercial flight activity to and from Pittsburgh International Airport (airport code PIT) throughout 2006. It covers only flights by U.S. certified air carriers that each account for at least one percent of domestic scheduled passenger revenues. A more complete collection, with all such records dating back to 1987 and covering many more airports, can be obtained from the source.

Key tasks

  1. Retrieve the latest available 12-month On-Time Performance data for PIT from the web site cited above. Summarize and review the data to identify and visually present potentially interesting aspects of it.

  2. Which of the available data features are useful in predicting flight delays at PIT airport?

  3. Suppose you are asked to design an analytical component of a personalized flight delay warning system. Can the available data be used to build such a tool? What aspects of it may be useful? Demonstrate your ideas with examples of analysis of the data focusing on specific anticipated use cases. Discuss limitations of the data at hand and evaluate availability and potential utility of alternative sources of relevant information.

  4. What has changed at PIT airport since 2006 when it comes to flight delays?

Project D: Detecting network intrusions

Associated data sets

network_traffic.csv 

Data description

A data dictionary is contained in the Data_Description.doc file in the project subfolder.

XYZ Bank is a large and profitable bank in Saint Louis, Missouri. Like any large corporation, XYZ Bank has a very large and intricate infrastructure that supports its networking system. A Network Analyst recently discovered unusual network activity. Poring over a year’s worth of logs, the bank's team of analysts then discovered many instances of anomalous network activity that resulted in significant sums of money being siphoned from bank accounts. The Chief Networking Officer has come to your group for help in developing a system that can automatically detect and warn of such known, as well as other unknown, anomalous network activities.

The network_traffic.csv file is a synopsis of logged network activity. It contains labeled examples of benign network sessions as well as examples of sessions involving intrusions. Note that the data likely contain many different intrusion types. The data_description.txt file provides explanations of each of the attributes found in the network_traffic dataset.

Key tasks

  1. Determine if it is possible to differentiate between the labeled intrusions and benign sessions.

  2. Is it possible to identify different types of intrusions? If so, which values of which attributes in the data correlate with the specific types of intrusions?

  3. Develop and implement a systematic approach to detect instances of intrusions in log files. Your system will need to be able to take a new network_traffic log file and determine the existence of known patterns of intrusions as well as anomalies which may be indicative of new and unknown intrusion patterns.

  4. Evaluate the detection power of your system.

  5. Can your intrusion detector be used in real time? It would need to be able to receive data about a current session and, within seconds, determine whether it is likely to be an intrusion of a previously seen type or an anomaly potentially signifying an as-yet-unseen intrusion mode. What information should be exchanged via the user interface of such a system?
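For the "unknown intrusion" part of tasks 3 and 5, one very simple baseline is to profile the benign sessions and flag any new session that deviates strongly from that profile. The sketch below does this with per-feature z-scores. It is only a starting point under loud assumptions: it presumes each session has already been reduced to a list of numeric features, the threshold of 3.0 is arbitrary, and a real detector would likely need something richer.

```python
from statistics import mean, pstdev


def fit_benign_profile(benign_rows):
    """Per-feature (mean, standard deviation) over benign sessions.

    A zero standard deviation is replaced by 1.0 to avoid division by zero.
    """
    columns = list(zip(*benign_rows))
    return [(mean(col), pstdev(col) or 1.0) for col in columns]


def anomaly_score(row, profile):
    """Largest absolute z-score of any feature relative to the benign profile."""
    return max(abs(x - m) / s for x, (m, s) in zip(row, profile))


def is_anomalous(row, profile, threshold=3.0):
    """Flag a session whose score exceeds the (arbitrary) threshold."""
    return anomaly_score(row, profile) > threshold


# Invented numeric features for five benign sessions.
benign = [[10, 1], [12, 1], [11, 1], [9, 1], [13, 1]]
profile = fit_benign_profile(benign)

typical = is_anomalous([11, 1], profile)
unusual = is_anomalous([100, 1], profile)
```

Scoring one session is a handful of arithmetic operations, so a check of this kind would not be the bottleneck for real-time use; the harder questions are which features to compute per session and how to set the threshold.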

Project E: Customer Life-Time Value

Associated data sets

ltv.csv

Data description

Your client is an online greeting card company. The company offers monthly subscriptions at a rate of $1 per month for access to their eCard website. The client is interested in understanding the life-time value (ltv) of their customers.

The life-time value of a customer is defined as the total revenue earned by the company over the course of their relationship with the customer.

The enclosed (synthetic) data represent usage statistics for 10,000 customers. Usage is summarized at a daily level and covers a period of 4 years from 2011-01-01 to 2014-12-31.

The following is a description of each field captured in the enclosed data set.

  • id: A unique user identifier
  • status: Subscription status:
    • ‘0’- new,
    • ‘1’- open,
    • ‘2’- cancelation event
  • gender: User gender
    • ‘M’- male
    • ‘F’- female
  • date: Date on which user ‘id’ logged into the site
  • pages: Number of pages visited by user ‘id’ on date ‘date’
  • onsite: Number of minutes spent on site by user ‘id’ on date ‘date’
  • entered: Flag indicating whether or not user entered the send order path on date ‘date’
  • completed: Flag indicating whether the user completed the order (sent an eCard)
  • holiday: Flag indicating whether at least one completed order included a holiday themed card
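Using the definition of life-time value above and the $1 monthly rate, the revenue observed so far for one customer can be sketched as follows. Several things here are assumptions rather than facts about the data: status codes are treated as integers, a customer's first record is taken as the subscription start, and billing is approximated as one fee per started 30-day period.

```python
from datetime import date
from math import ceil

MONTHLY_FEE = 1.0                       # $1 per month, from the description
OBSERVATION_END = date(2014, 12, 31)    # last day covered by the data


def revenue_to_date(events):
    """Revenue observed so far for one customer.

    events: list of (status, date) pairs, with status 0 = new, 1 = open,
    2 = cancelation event. Returns (revenue, cancelled).
    """
    start = min(d for _, d in events)
    cancel_dates = [d for s, d in events if s == 2]
    cancelled = bool(cancel_dates)
    end = min(cancel_dates) if cancelled else OBSERVATION_END
    # Assumption: one fee per started 30-day period, with at least one fee.
    months = max(1, ceil((end - start).days / 30))
    return months * MONTHLY_FEE, cancelled


# Invented event histories for two customers.
churned = [(0, date(2011, 1, 1)), (1, date(2011, 2, 10)), (2, date(2011, 3, 1))]
active = [(0, date(2014, 12, 1))]

churned_revenue = revenue_to_date(churned)
active_revenue = revenue_to_date(active)
```

For customers who have not cancelled, this figure is only a lower bound on ltv, since their relationship with the company continues past the end of the data.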

Key tasks

  1. Develop an attrition model to predict whether a customer will cancel their subscription in the near future. Characterize your model performance.

  2. Develop a model for estimating the ltv of a customer. Characterize your model performance.

  3. Develop a customer segmentation scheme. Include in this scheme the identification of sleeping customers, that is, those who are no longer active but have not canceled their account.
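A first pass at identifying sleeping customers might look like the sketch below: a customer is flagged if they have no cancelation event and their most recent login is older than some cutoff. The 90-day cutoff is a hypothetical choice made for illustration, and the inputs are assumed to have been derived from ltv.csv beforehand.

```python
from datetime import date


def find_sleeping(last_login, cancelled_ids, as_of, inactive_days=90):
    """Ids with no cancelation whose last login is more than
    `inactive_days` before `as_of`. The default cutoff is arbitrary.
    """
    return sorted(
        cid
        for cid, last in last_login.items()
        if cid not in cancelled_ids and (as_of - last).days > inactive_days
    )


# Invented last-login dates; customer 3 has a cancelation event.
last_login = {
    1: date(2014, 12, 20),
    2: date(2014, 6, 1),
    3: date(2014, 5, 1),
}
sleeping = find_sleeping(last_login, cancelled_ids={3}, as_of=date(2014, 12, 31))
```

The cutoff could instead be chosen from the data, for example by looking at the distribution of gaps between consecutive logins among customers who later returned.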