Applications of Textual Analysis in Social Sciences
Instructor: Lee Gao
Office: GSIA 301
Phone: 617-955-0727
Email: lilig@andrew.cmu.edu
Office Hours: Mon and Wed, 2:00pm–3:00pm, or by appointment
Dates: March 25th, April 1st, 8th, and 22nd, 2016
Time: 10:00am–11:30am
Location: Posner 384
Syllabus
You are welcome to sign up!



WORKSHOP DESCRIPTION

The goal of this workshop is to describe the "big picture" of current popular methods in computational linguistics, introduce popular off-the-shelf software packages, and lead students to develop basic textual analysis skills through weekly homework and a data analysis project. Techniques introduced in this workshop have been used in financial news sentiment analysis, political ideology estimation, and more.

REFERENCES

Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. (2008). Introduction to Information Retrieval. Cambridge University Press.

David J. C. MacKay, (2007). Information Theory, Inference and Learning Algorithms. Cambridge University Press.

Martin, James H., and Daniel Jurafsky. (2008). Speech and Language Processing (2nd Edition). Prentice Hall.

John E. Hopcroft, Rajeev Motwani and Jeffrey D. Ullman, (2006). Introduction to Automata Theory, Languages, and Computation (3rd Edition).

Christopher Bishop (2006). Pattern Recognition and Machine Learning. Springer.

Liu, Bing. (2015). Sentiment analysis: Mining opinions, sentiments, and emotions. Cambridge University Press.

Loughran, Tim, and Bill McDonald. (2015). Textual analysis in accounting and finance: A survey.

Li, Feng. (2010). Textual analysis of corporate disclosures: A survey of the literature.

SOFTWARE
 This workshop emphasizes putting machine learning and natural language processing techniques into practice. There are rich open-source toolboxes available for common textual analysis tasks. The software packages used in this class are invoked from Python and R. Please follow the procedures in HW0 to install Python and R on your computer.
 Useful Python packages (How to install Python packages): Jupyter Notebook, numpy, scipy, pandas, scikit-learn, statsmodels, lxml, gensim, Word2Vec
 Useful R packages (How to install R packages): glmnet, mnir, tm

COMMUNICATION

The best way to get in touch with me is by email: lilig@andrew.cmu.edu. You can also attend my office hours or arrange an appointment to meet me in my office or contact me by phone (see details above).

I greatly value your feedback on any aspects of this workshop. Please feel free to contact me in person or by email with suggestions. I have also set up an anonymous feedback system
here. You can use it throughout the workshop to communicate any concern or question regarding the materials, the class, etc. I will try to address your comments as promptly as possible.

ASSIGNMENTS (OPTIONAL)

There will be 4 assignments. As this workshop carries no credit, assignments are not required, but they are encouraged: they help students digest the theory and practice the empirical methods covered in the lectures. Each assignment will be released around each lecture day, covering the material taught in class, and the solution to the previous week's assignment will be released at the same time. The assignments will not be graded.

DATA ANALYSIS PROJECT (OPTIONAL)

There will be 1 data analysis project, which is also optional. Students are encouraged to apply the methods learned in class to real research problems. Students who plan to work on a project should submit a proposal by the second lecture (April 1st). Due to the short span of this workshop, that is pretty demanding! The proposal should be less than 1 page, and it does not need to cover all the methodological details; it is very unlikely that you will have a clear idea of which methods will be appropriate for your research after just 1 lecture. The proposal only needs to cover the research idea and the data set you plan to use. You are also welcome to come to my office hours, and I will suggest some ideas to try.

The midterm report is due on April 15th (no class; submit to me by email). The midterm report should be about 3–5 pages, including an introduction, literature review, data set description, preliminary results, and plans for the future.

The final report is due on April 29th (no class; submit to me by email). The final report should be about 7–8 pages, including an introduction, literature review, data set description, results, and conclusion. After the submission, I will give you my feedback as soon as possible.

USEFUL LINKS


WEEK 0
 Assignment 0 (due on March 25):
Install Python (Windows, OS X, Linux),
Jupyter Notebook (Windows, OS X, Linux);
R (Windows, OS X, Linux),
RStudio (Windows, OS X, Linux)

WEEK 1: FRIDAY, MARCH 25
 Topics: Text Preprocessing Techniques, Vector Representations
 Lecture Slides: lecture1_slides
 Working Sample: html, Jupyter Notebook, Movie Review Data
 Assignment 1 (due on April 1): Problem Set, Python Code, data
 Assignment 1 Solution: html, Jupyter Notebook
 Readings:

O'Connor, B. T. (2014). Statistical Text Analysis for Social Science.

Tetlock, P. (2007). Giving content to investor sentiment: The role of media in the stock market. The Journal of Finance 62.3: 1139–1168.

Tetlock, P., Saar‐Tsechansky, M., and Macskassy, S. (2008). More than words: Quantifying language to measure firms' fundamentals. The Journal of Finance 63.3: 1437–1467.

Loughran, T., and McDonald, B. (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10‐Ks. The Journal of Finance 66.1: 35–65.

Jegadeesh, N., and Wu, D. (2013). Word power: A new approach for content analysis. Journal of Financial Economics 110.3: 712–729.

Monroe, B., Colaresi, M., and Quinn, K. (2008). Fightin' Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict. Political Analysis 16.4: 372–403.

Pang, Bo, and Lillian Lee. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2.1–2: 1–135.
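
The Week 1 topics above (text preprocessing and vector representations) can be sketched in a few lines of plain Python. This is not one of the course's working samples, just a minimal stdlib-only illustration with invented example sentences:

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())

docs = [
    "The movie was great, really great!",
    "The plot was dull and the acting was worse.",
]

# Build a shared vocabulary, then represent each document as a
# bag-of-words count vector over that vocabulary.
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
vectors = [[Counter(tokenize(doc))[term] for term in vocab] for doc in docs]

print(vocab)
print(vectors)
```

scikit-learn's CountVectorizer performs the same job at scale, with options for n-grams and stop-word removal.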

WEEK 2: FRIDAY, APRIL 1
 Topics: Dimension Reductions, Graphical Models
 Lecture Slides: lecture2_slides
 Working Sample: html, Jupyter Notebook, Movie Review Data
 Assignment 2 (due on April 8): Problem Set, data
 Assignment 2 Solution: html, Jupyter Notebook
 Readings:

Dumais, S. T. (2004). Latent semantic analysis. Annual Review of Information Science and Technology 38.1: 188–230.

Blei, D., Ng, A. and Jordan, M. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research 3: 993–1022.

Blei, D. and McAuliffe, J. (2007). Supervised Topic Models. Advances in Neural Information Processing Systems, 121–128.

Griffiths, T., and Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences 101 (suppl. 1): 5228–5235.

Taddy, M. (2013). Multinomial Inverse Regression for Text Analysis. Journal of the American Statistical Association 108(503): 755–770.

Rabinovich, M., and Blei, D. (2014). The Inverse Regression Topic Model. Proceedings of the 31st International Conference on Machine Learning, 199–207.
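
For a taste of the dimension-reduction topic, here is a small sketch (not one of the course's working samples; the toy count matrix is invented) of latent semantic analysis as a truncated SVD of a term-document matrix, in the spirit of Dumais (2004):

```python
import numpy as np

# Toy term-document count matrix: rows are terms, columns are documents.
# Terms 0-1 co-occur in documents 0 and 2; terms 2-3 in documents 1 and 3.
X = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 3.0, 0.0, 2.0],
    [0.0, 1.0, 0.0, 1.0],
])

# Latent semantic analysis: keep only the top-k singular directions of X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
docs_k = (np.diag(s[:k]) @ Vt[:k]).T   # k-dimensional document representations
terms_k = U[:, :k] * s[:k]             # k-dimensional term representations

print(docs_k.shape, terms_k.shape)
```

On real corpora you would build X as a sparse matrix and use a sparse truncated SVD rather than the dense decomposition shown here.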

WEEK 3: FRIDAY, APRIL 8
 Topics: Support Vector Machines, Penalized Regressions
 Lecture Slides: lecture3_slides
 Working Sample: html, Jupyter Notebook, Bond Premia Data
 Assignment 3 (due on April 22): Problem Set, Movie Review Data, Bond Premia Data
 Assignment 3 Solution: html, Jupyter Notebook
 Readings:

Smola, A. J., and Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing 14.3: 199–222.

Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, Volume 58, Issue 1, 267–288.

Zou, H. and Hastie, T. (2005). Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society, Series B, Volume 67, Part 2, 301–320.

Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least Angle Regression. The Annals of Statistics, Vol. 32, No. 2, 407–499.

Friedman, J., Hastie, T. and Tibshirani, R. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, Vol. 33(1), 1–22.
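
To make the penalized-regression topic concrete, here is a minimal numpy sketch (not one of the course's working samples; the data are simulated) of fitting the lasso by coordinate descent with soft-thresholding, the update at the heart of glmnet:

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator: shrink z toward zero by gamma."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n)||y - Xb||^2 + lam*||b||_1 by cyclic coordinate descent,
    assuming the columns of X are roughly standardized."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove every predictor's contribution except j's.
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j / n, lam) / (X[:, j] @ X[:, j] / n)
    return beta

# Simulated sparse problem: only the first two coefficients are nonzero.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(100)
beta = lasso_cd(X, y, lam=0.1)
print(np.round(beta, 2))
```

glmnet layers warm starts along a lambda path, active-set screening, and cross-validation on top of this same coordinate update.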

WEEK 4: FRIDAY, APRIL 22
 Topics: Neural networks, Word Vector Embeddings
 Lecture Slides: lecture4_slides
 Working Sample: html, Jupyter Notebook, Movie Review Data
 Assignment 4 (due on April 29): Problem Set, Movie Review Data
 Assignment 4 Solution: html, Jupyter Notebook
 Readings:

Le, Quoc V. (2015). A Tutorial on Deep Learning Part 1: Nonlinear Classifiers and The Backpropagation Algorithm.

Le, Quoc V. (2015). A Tutorial on Deep Learning Part 2: Autoencoders, Convolutional Neural Networks and Recurrent Neural Networks.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013). Distributed Representations of Words and Phrases and Their Compositionality. In Advances in Neural Information Processing Systems (pp. 3111–3119).

Levy, O., and Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems (pp. 2177–2185).

Tang, D., Wei, F., Yang, N., Zhou, M., Liu, T., and Qin, B. (2014). Learning Sentiment-Specific Word Embedding for Twitter Sentiment Classification. In ACL (1) (pp. 1555–1565).

von Luxburg, U. (2007). A Tutorial on Spectral Clustering. Statistics and Computing 17.4: 395–416.
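
As a pocket-sized illustration of the Week 4 theme, here is a sketch (not one of the course's working samples; the toy corpus is invented) of the Levy and Goldberg (2014) view of word embeddings: factorize a word co-occurrence matrix with SVD instead of training word2vec directly:

```python
import numpy as np

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within a +/-2-word window.
window = 2
C = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            C[idx[w], idx[corpus[j]]] += 1

# Levy and Goldberg show skip-gram embeddings implicitly factorize a
# (shifted) PMI matrix; here we factorize log(1 + counts) with truncated SVD.
U, s, Vt = np.linalg.svd(np.log1p(C))
k = 2
embeddings = U[:, :k] * s[:k]  # one k-dimensional vector per vocabulary word

for w in ("cat", "dog", "rug"):
    print(w, np.round(embeddings[idx[w]], 2))
```

gensim's Word2Vec trains the skip-gram model directly; the factorization view above is how Levy and Goldberg explain why it works.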

