Hello! I am Sam David




"I Dream, Believe, Do, Repeat"


Who Am I?

I am a graduate student at Carnegie Mellon University, pursuing a Master's in Biomedical Engineering. Earlier, I graduated from the Indian Institute of Technology Madras with a Bachelor of Technology in Chemical Engineering. I'm looking for full-time opportunities in the field of Data Analytics and Machine Learning.

Sam David

Research Interests

My interests revolve around Machine Learning, Stochastics and Data Science.

Having had a strong attachment to stochastics since the early days of my career, I was drawn to the idea of applying it to big data. The ability to combine data and analytical skills to find and interpret rich data sources motivates me to pursue research in this field. I am currently working on applying machine learning and stochastic methods to big data.

Skills

R, Python, C/C++, MySQL, Tableau, Matlab, HTML/CSS, Java, Git, Bash

Certifications

Machine Learning, Data Science Specialization, MySQL, R, Tableau, Java.


Education

Carnegie Mellon University

Aug 2015 - Dec 2016

Master of Science

Machine Learning, Big Data Analytics, Data Visualization, Modern Regression, Neural Signal Processing, Image Analysis

My graduate projects focused on analyzing big data using machine learning techniques such as deep convolutional neural networks, SVM, PCA, factor analysis, Kalman filters and AdaBoost. The details of these projects are given below.

Indian Institute of Technology Madras

Aug 2010 - May 2014

Bachelor of Technology

Modelling of Particulate Processes, Computational Techniques, Multivariate Data Analysis, Numerical Methods, Linear Algebra, Statistics

My undergraduate projects focused on modelling molecular particles using statistical methods such as Monte Carlo and molecular dynamics to predict phase behavior. The details of these projects are given below.

Projects


Monte Carlo simulation using the Kawasaki Ising Model

Monte Carlo analysis is a means of statistically evaluating mathematical functions using random samples. Monte Carlo methods can be used to solve any problem having a probabilistic interpretation. We used this method to predict the phase behavior of molecular particles.

A coarse-grained model is used to obtain parameters for the molecular particles, and a random number is generated to move a particle. The move is accepted according to the probability given by the long-range Kawasaki Ising model. The algorithm is also modified to account for local moves, for which the probability is given by an exponential term. The algorithm is validated by comparing the results with known properties of the molecules. For further details, please refer to the article linked below.
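The exchange move at the heart of the method can be sketched in a few lines of Python. This is a minimal one-dimensional illustration with Metropolis acceptance, not the coarse-grained model from the project; the function name and lattice setup are hypothetical.

```python
import math
import random

def kawasaki_step(spins, beta, J=1.0, rng=random):
    """One Kawasaki (spin-exchange) Monte Carlo move on a 1-D periodic lattice.

    A neighbouring pair of opposite spins is proposed for exchange and accepted
    with the Metropolis probability min(1, exp(-beta * dE)). Assumes a lattice
    of at least 4 sites with spins in {+1, -1}.
    """
    n = len(spins)
    i = rng.randrange(n)
    j = (i + 1) % n                      # neighbouring site (periodic boundary)
    if spins[i] == spins[j]:
        return False                     # exchanging equal spins changes nothing
    # Energy change of the swap for a nearest-neighbour Ising model,
    # E = -J * sum_k s_k * s_{k+1}: only the bonds (left, i) and (j, right)
    # change sign, since the (i, j) bond is symmetric under the exchange.
    left, right = (i - 1) % n, (j + 1) % n
    dE = 2.0 * J * (spins[left] * spins[i] + spins[j] * spins[right])
    if dE <= 0 or rng.random() < math.exp(-beta * dE):
        spins[i], spins[j] = spins[j], spins[i]
        return True
    return False
```

Because spins are exchanged rather than flipped, the total magnetization of the lattice is conserved, which is what distinguishes Kawasaki dynamics from ordinary single-spin-flip Metropolis moves.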

Monte Carlo
Kawasaki Exchange method
Simulation

Article link

Named Entity recognition using Hidden Markov Model

Named Entity Recognition is a sequence labelling task that seeks to identify elements in text from specific categories. The labels carry BIO specifiers (begin, inside, and outside). In our case there are four categories: Person, Organization, Location and Miscellaneous. There are nine labels in total: eight for the cross-product of the four categories with the begin and inside specifiers, and one for the outside specifier.

I learned the HMM parameters from the labelled data, implemented the Viterbi decoding algorithm, and compared it to a simple baseline that makes independent predictions for each label in the sequence. The train and test data sets are pre-processed by indexing both the tags and the words and substituting integers for them.
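As an illustration of the decoding step, here is a minimal pure-Python Viterbi implementation over dictionary-based HMM parameters. It is a sketch, not the project's NER code; the parameter layout is an assumption.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence for an observation sequence.

    start_p[s], trans_p[s][t] and emit_p[s][o] are probabilities, e.g.
    estimated by counting transitions and emissions in labelled data.
    """
    # V[t][s] = (best probability of any path ending in state s at time t,
    #            the predecessor state on that path)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        V.append({s: max(((V[-1][prev][0] * trans_p[prev][s] * emit_p[s][o], prev)
                          for prev in states), key=lambda x: x[0])
                  for s in states})
    # Backtrack from the best final state
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for layer in reversed(V[1:]):
        best = layer[best][1]
        path.append(best)
    return path[::-1]
```

On the standard two-state "healthy/fever" textbook example, this decodes the observations ('normal', 'cold', 'dizzy') to the path ['Healthy', 'Healthy', 'Fever'].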

Machine Learning
Natural Language Processing
Hidden Markov Model


AdaBoost algorithm for data classification

AdaBoost is used in conjunction with many other types of learning algorithms to improve their performance. The output of the other learning algorithms ('weak learners') is combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of those instances misclassified by previous classifiers.

In this project, I implemented decision stumps as base learners to classify a data set consisting of positive and negative examples. AdaBoost is then used to combine the weighted sum of the weak learners into a strong classifier. We also showed that the margin keeps increasing with every iteration of the algorithm and that a large margin reduces the generalization error.
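The stump-plus-reweighting loop can be sketched in plain Python for one-dimensional data. The function names and the toy data are illustrative, not the project's actual data set.

```python
import math

def train_stump(xs, ys, w):
    """Best threshold stump h(x) = sign if x >= thr else -sign, under weights w."""
    best = None
    for thr in sorted(set(xs)) + [max(xs) + 1]:
        for sign in (1, -1):
            pred = [sign if x >= thr else -sign for x in xs]
            err = sum(wi for wi, p, y in zip(w, pred, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

def adaboost(xs, ys, rounds=10):
    """AdaBoost: reweight examples toward past mistakes, combine stumps by alpha."""
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, thr, sign = train_stump(xs, ys, w)
        err = max(err, 1e-10)                    # guard against a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)  # stump weight in the final sum
        ensemble.append((alpha, thr, sign))
        # Increase the weights of misclassified points, then renormalise
        w = [wi * math.exp(-alpha * y * (sign if x >= thr else -sign))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]

    def predict(x):
        score = sum(a * (s if x >= t else -s) for a, t, s in ensemble)
        return 1 if score >= 0 else -1
    return predict
```

On the non-stump-separable pattern x = [0, 1, 2, 3], y = [+1, -1, -1, +1], no single threshold stump is correct, but three boosting rounds already combine into a classifier that labels all four points correctly.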

Adaboost
Decision Stump
Machine Learning


Article classification using supervised learning

Convolutional neural network (CNN, or ConvNet) models have been shown to be effective for various natural language processing (NLP) problems, achieving excellent results in semantic parsing, search query retrieval, sentence modeling, classification, prediction, and other traditional NLP tasks.

In this project, a CNN was used for a text classification problem. A collection of text articles from the publications The Economist and The Onion was given. The goal is to learn a classifier that can distinguish between articles from each publication. A dictionary containing the set of all possible words is extracted from the articles, and for each article a feature vector is created with value 1 if the word appears and 0 otherwise. The network, consisting of two convolutional layers, two pooling layers and one ReLU layer, is trained using the training data. The test classification accuracy is 98%. The neural network shown is not completely representative of the one used in the project.
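The binary feature-vector construction described above can be sketched as follows; the function names are hypothetical and the CNN itself is omitted.

```python
def build_vocab(articles):
    """Dictionary of all distinct (lower-cased, whitespace-split) words, sorted."""
    return sorted({w for a in articles for w in a.lower().split()})

def binary_features(article, vocab):
    """Binary bag-of-words: 1 if the vocabulary word appears in the article, else 0."""
    words = set(article.lower().split())
    return [1 if w in words else 0 for w in vocab]
```

Each article thus becomes a fixed-length 0/1 vector over the shared vocabulary, which is the input representation the classifier is trained on.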

Machine Learning
Convolutional Neural Network
Natural Language processing


Hotel Recommendations using Expedia Data sets

Expedia is the parent company of several online travel brands, including Expedia.com, Hotels.com, Hotwire.com, etc. Expedia wants to provide personalized hotel recommendations to its users. With hundreds of millions of users every month, this is not an easy task.

Hotel recommendations are made by assigning a score to each hotel based on the user's data, such as origin city, destination city, number of people, site name, etc. Data visualization is also used to verify the score assigned to each hotel.

Data Analysis
Data Visualization
Statistics


Monkey arm movement prediction

Neural data offers an opportunity to answer questions about how the brain works. The monkey's neural data is recorded while it plans a movement of its arm to the target location indicated by a cursor. The recorded data is then linked with the corresponding coordinates (x and y) of the target location and the velocities in the x and y directions.

A Kalman filter is used to model the neural data after pre-processing it with PCA. The parameters of the model, such as the mean position estimate and the covariance matrix, are obtained during the training phase. The arm location predicted from the test data yielded an accuracy of 98%.
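A scalar predict/update cycle illustrates the Kalman filter recursion involved. The real decoder is multivariate, with matrices in place of these scalars; the default noise values below are placeholders, not the project's fitted parameters.

```python
def kalman_step(x, P, z, A=1.0, Q=0.01, H=1.0, R=0.1):
    """One predict/update cycle of a scalar Kalman filter.

    x, P : previous state estimate and its variance
    z    : new measurement
    A, Q : state transition coefficient and process noise variance
    H, R : observation coefficient and measurement noise variance
    """
    # Predict: propagate the state and its uncertainty forward
    x_pred = A * x
    P_pred = A * P * A + Q
    # Update: blend the prediction and the measurement via the Kalman gain
    K = P_pred * H / (H * P_pred * H + R)
    x_new = x_pred + K * (z - H * x_pred)
    P_new = (1 - K * H) * P_pred
    return x_new, P_new
```

Each step pulls the estimate toward the new measurement in proportion to the gain K, and the estimate variance shrinks toward a steady state as evidence accumulates.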

Machine Learning
Kalman Filter
Neural Data

See Live Demo

Avito duplicate Ad detection

Avito is one of the largest and fastest-growing online classifieds sites. Avito hosts a high volume of listings, and competitive sellers often go to great lengths to get their wares noticed. Avito is looking to develop a model that can automatically spot duplicate ads, to ensure buyers can easily find what they are looking for.

Pairs of ads are labelled 0 or 1 in a response variable, indicating whether the pair is a duplicate or not. A linear model is then fitted between the response and features of the data sets such as latitude, longitude, location ID, images, etc. A binomial model is fitted using the glm function in R. The probability of the response is then obtained for the test values, with a threshold of 0.5.

R Programming
Binomial Distribution
glm package


Handwritten digit classification using logistic regression

Logistic regression is a regression model where the dependent variable (DV) is categorical. The binary logistic model is used to estimate the probability of a binary response based on one or more predictor (or independent) variables (features).

The MNIST data set containing handwritten digits is used to train the logistic regression classifier. The two digits chosen for the classification study are 4 and 7, and the binary classifier is trained to differentiate between them. The test classification accuracy was 99%.
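A minimal gradient-descent sketch of binary logistic regression on a single feature illustrates the idea; the actual MNIST pipeline uses a 784-dimensional weight vector per image, and the learning rate and epoch count here are arbitrary.

```python
import math

def train_logistic(xs, ys, lr=0.1, epochs=500):
    """Binary logistic regression (one feature + bias) by batch gradient descent.

    ys are 0/1 labels; returns (w, b) for p(y=1|x) = sigmoid(w*x + b).
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted probability
            gw += (p - y) * x                         # gradient of the log-loss
            gb += (p - y)
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

def predict(x, w, b):
    """Classify by thresholding the linear score at 0 (probability 0.5)."""
    return 1 if w * x + b >= 0 else 0
```

On a small linearly separable set such as x = [-2, -1, 1, 2] with labels [0, 0, 1, 1], the learned decision boundary separates the two classes.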

Classification
Machine Learning
Logistic Regression

See Live Demo

Expectation-Maximization (EM) algorithm for Gaussian Mixture Models (GMM)

In statistics, an expectation–maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step.

In this project, I implemented the EM algorithm for a GMM on the neural data sets to identify the neuron responsible for each recorded spike. The data was processed using PCA prior to applying the EM algorithm. I computed the cross-validated likelihood for different numbers of clusters to find the best fit. Previously, I had implemented k-means on the same data set to cluster the data.
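The E/M alternation can be illustrated for a two-component, one-dimensional mixture with unit variances and equal priors. This is a deliberately simplified sketch; the full spike-sorting version also updates covariances and mixing weights and runs on the PCA-reduced data.

```python
import math

def em_gmm_1d(data, mu, iters=50):
    """EM for a two-component 1-D Gaussian mixture (unit variances, equal priors).

    mu is an initial guess (mu1, mu2); returns the refined means.
    """
    mu1, mu2 = mu
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point
        r = []
        for x in data:
            p1 = math.exp(-0.5 * (x - mu1) ** 2)
            p2 = math.exp(-0.5 * (x - mu2) ** 2)
            r.append(p1 / (p1 + p2))
        # M-step: responsibility-weighted means
        mu1 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
        mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / sum(1 - ri for ri in r)
    return mu1, mu2
```

Run on two well-separated groups of points, the means converge to (approximately) the group averages, which is the behavior the spike-sorting application relies on.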

EM algorithm
Gaussian Mixture Models
Neural Data


Maximum Likelihood parameters and Decision boundary

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model given data.

I derived ML parameters and decision boundaries for Gaussian (class-specific covariance), Gaussian (shared covariance) and Poisson models, and implemented them on neural data sets for classification.
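For the Gaussian and Poisson models, the ML estimates have closed forms, e.g. (a sketch; the project estimates these per class to build the decision boundaries):

```python
def gaussian_mle(data):
    """Closed-form ML estimates for a Gaussian: sample mean and (biased) variance."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    return mu, var

def poisson_mle(counts):
    """The ML estimate of the Poisson rate is simply the sample mean of the counts."""
    return sum(counts) / len(counts)
```

Note that the ML variance divides by n rather than n - 1, so it is biased; with class-specific covariances the resulting Gaussian decision boundary is quadratic, while a shared covariance makes it linear.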

Classification
Machine Learning
Maximum Likelihood

See Live Demo

Statistical Analysis on Neural Data

Performed statistical analysis on neural data using the following plots: spike histogram, tuning curve, count distribution, Fano factor, inter-spike interval distribution and coefficient of variation.
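Two of these statistics are simple ratios and can be sketched directly; the spike counts and inter-spike intervals below are hypothetical inputs, not the project's recordings.

```python
def fano_factor(spike_counts):
    """Fano factor: variance-to-mean ratio of spike counts across trials
    (1.0 for a Poisson process)."""
    n = len(spike_counts)
    mean = sum(spike_counts) / n
    var = sum((c - mean) ** 2 for c in spike_counts) / n
    return var / mean

def coeff_of_variation(isis):
    """Coefficient of variation of inter-spike intervals: std / mean."""
    n = len(isis)
    mean = sum(isis) / n
    var = sum((t - mean) ** 2 for t in isis) / n
    return var ** 0.5 / mean
```

Values near 1 for either statistic indicate Poisson-like variability in the spike train, while values near 0 indicate highly regular firing.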

Data Visualization
Neural Data
Data analysis


Probabilistic PCA and Factor Analysis

Implemented PPCA and FA on the neural data sets. Visualized the data by projecting it into a two-dimensional space and compared the results with PCA.

PPCA
FA
Data Visualization

See Live Demo

Contact

My Work Location


2355, Eldridge Street,
APT SA,
Pittsburgh, PA, USA.
Call: +1-715-497-9060
Email: schristd@andrew.cmu.edu