My interests revolve around Machine Learning, Stochastics and Data Science.
I have had a strong attachment to Stochastics since the early days of my career, and the idea of applying it to Big Data has always attracted me. The ability to combine data and analytical skills to find and interpret rich data sources motivates me to pursue further research in this field. I am currently working on applying Machine Learning and Stochastic methods to Big Data.
R, Python, C/C++, MySQL, Tableau, Matlab, HTML/CSS, Java, Git, Bash
Machine Learning, Data Science Specialization, MySQL, R, Tableau, Java.
Machine Learning, Big Data Analytics, Data Visualization, Modern Regression, Neural Signal Processing, Image Analysis
My graduate projects focused on analyzing big data using Machine Learning techniques such as Deep Convolutional Neural Networks, SVM, PCA, Factor Analysis, Kalman Filters and AdaBoost. The details of the projects are given below.
Modelling of Particulate Processes, Computational Techniques, Multivariate Data Analysis, Numerical Methods, Linear Algebra, Statistics
My undergraduate projects focused on modelling molecular particles using statistical methods such as Monte Carlo and Molecular Dynamics to predict phase behavior. The details of the projects are given below.
Monte Carlo analysis is a means of statistically evaluating mathematical functions using random samples. Monte Carlo methods can be used to solve any problem having a probabilistic interpretation. We used this method to predict the phase behavior of molecular particles.
A coarse-grained model is used to obtain parameters for the molecular particles, and a random number is generated to move a particle. The move is accepted according to the probability given by the long-range Kawasaki Ising model. The algorithm is also modified to account for local moves, for which the acceptance probability is given by an exponential term. The algorithm is validated by comparing the results with known properties of the molecules. For further details, click here to read the article.
Kawasaki Exchange method
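The accept/reject logic described above can be sketched as a Kawasaki-exchange Metropolis step on a periodic 2D Ising lattice. This is a minimal illustration, not the project's code: the lattice size, coupling `J`, and inverse temperature `beta` are illustrative, and the project's long-range and coarse-grained details are omitted.

```python
import math
import random

def kawasaki_step(lattice, L, beta, J=1.0, rng=random):
    """One Kawasaki-exchange move on an L x L periodic Ising lattice:
    propose swapping two neighbouring opposite spins and accept with
    the Metropolis probability min(1, exp(-beta * dE))."""
    i, j = rng.randrange(L), rng.randrange(L)
    di, dj = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
    k, l = (i + di) % L, (j + dj) % L
    if lattice[i][j] == lattice[k][l]:
        return False  # exchanging equal spins changes nothing

    def site_energy(a, b, exclude):
        # nearest-neighbour Ising energy of site (a, b), skipping the
        # bond to the swap partner (that bond is unchanged by the swap)
        e = 0.0
        for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            na, nb = (a + da) % L, (b + db) % L
            if (na, nb) != exclude:
                e -= J * lattice[a][b] * lattice[na][nb]
        return e

    e_before = site_energy(i, j, (k, l)) + site_energy(k, l, (i, j))
    lattice[i][j], lattice[k][l] = lattice[k][l], lattice[i][j]
    e_after = site_energy(i, j, (k, l)) + site_energy(k, l, (i, j))
    dE = e_after - e_before
    if dE <= 0 or rng.random() < math.exp(-beta * dE):
        return True
    # reject: undo the swap
    lattice[i][j], lattice[k][l] = lattice[k][l], lattice[i][j]
    return False
```

Because Kawasaki moves exchange spins rather than flip them, the total magnetization (particle count) is conserved, which is what makes this dynamics appropriate for phase-behavior studies.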
Named Entity Recognition is a sequence labelling task that seeks to identify elements in text from specific categories. The labels use BIO specifiers (begin, inside, and outside). In our case there are four categories: Person, Organization, Location and Miscellaneous. There are nine labels in total: eight for the cross-product of the four categories with the begin and inside specifiers, plus one for the outside specifier.
Implemented the Viterbi decoding algorithm for an HMM whose parameters were learned from the labelled data, and compared it to a simple baseline that makes an independent prediction for each label in the sequence. The train and test data sets are pre-processed by indexing both the tags and the words and substituting integers for them.
Natural Language Processing
Hidden Markov Model
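The Viterbi decoding step above can be sketched as follows. This is a minimal log-space implementation with a toy two-tag example; the probability tables and tag names are made up for illustration and are not the NER model's actual parameters.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence for `obs` under an HMM, computed in
    log space with dynamic programming and backpointers."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
          for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # best previous state to transition into s
            prev = max(states, key=lambda p: V[t - 1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t - 1][prev] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s][obs[t]]))
            back[t][s] = prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```

Unlike the independent-prediction baseline, the backpointer pass makes each tag depend on its neighbours through the transition probabilities.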
AdaBoost is used in conjunction with many other types of learning algorithms to improve their performance. The outputs of the other learning algorithms ('weak learners') are combined into a weighted sum that represents the final output of the boosted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are tweaked in favor of instances misclassified by previous classifiers.
In this project, I implemented decision stumps as base learners to classify a data set consisting of positive and negative examples. AdaBoost is then used to combine the weighted sum of the weak learners into a strong classifier. We also showed that the margin keeps increasing with every iteration of the algorithm, and that a large margin reduces the generalization error.
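The stump-plus-reweighting loop described above can be sketched as follows, assuming one-dimensional inputs with labels in {-1, +1} (the project's actual features and data are not reproduced here):

```python
import math

def train_adaboost(X, y, n_rounds=10):
    """AdaBoost with threshold decision stumps as weak learners.
    X: list of floats, y: list of labels in {-1, +1}."""
    n = len(X)
    w = [1.0 / n] * n                 # uniform initial weights
    ensemble = []                     # (alpha, threshold, polarity)
    for _ in range(n_rounds):
        best = None
        for thr in sorted(set(X)):    # candidate thresholds from the data
            for pol in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if (pol if xi > thr else -pol) != yi)
                if best is None or err < best[0]:
                    best = (err, thr, pol)
        err, thr, pol = best
        err = max(err, 1e-10)         # guard against a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, pol))
        # re-weight: boost the misclassified points, then normalize
        w = [wi * math.exp(-alpha * yi * (pol if xi > thr else -pol))
             for xi, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble

def predict(ensemble, x):
    s = sum(a * (p if x > t else -p) for a, t, p in ensemble)
    return 1 if s >= 0 else -1
```

On data that no single stump can separate, a few boosting rounds already drive the training error to zero, which is where the growing-margin behaviour mentioned above shows up.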
Convolutional neural network (CNN, or ConvNet) models have been shown to be effective for various natural language processing (NLP) problems, achieving excellent results in semantic parsing, search query retrieval, sentence modeling, classification, prediction, and other traditional NLP tasks.
In this project, a CNN was used for a text classification problem. A collection of text articles from the publications The Economist and The Onion was given, and the goal is to learn a classifier that can distinguish between articles from each publication. A dictionary containing the set of all words is extracted from the articles, and for each article a feature vector is created with value 1 if the word appears and 0 otherwise. The network had two convolutional layers, two pooling layers and one ReLU layer, and was trained on the training data. The test classification accuracy was 98%. The network shown is not completely representative of the one used in the project.
Convolutional Neural Network
Natural Language processing
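The binary bag-of-words features described above (1 if a dictionary word appears in the article, 0 otherwise) can be sketched as follows; the example articles below are invented for illustration, and the CNN itself is not reproduced here.

```python
def build_vocabulary(articles):
    """Collect the set of all words seen across the corpus and assign
    each word a fixed index (sorted order keeps it deterministic)."""
    vocab = sorted({word for text in articles for word in text.lower().split()})
    return {word: idx for idx, word in enumerate(vocab)}

def binary_features(text, vocab):
    """Binary bag-of-words vector: 1 if the vocabulary word appears."""
    present = set(text.lower().split())
    return [1 if word in present else 0 for word in vocab]
```

These fixed-length vectors are what a downstream classifier (here, the two-layer CNN) consumes.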
Expedia is the parent company of several online travel brands, including Expedia.com, Hotels.com and Hotwire.com. Expedia wants to provide personalized hotel recommendations to its users. With hundreds of millions of users every month, this is not an easy task.
Hotel recommendations are made by assigning a score to each hotel based on user data such as origin city, destination city, number of people and site name. Data visualization is also used to verify the score assigned to each hotel.
Neural data offers an opportunity to answer questions about how the brain works. A monkey's neural data is recorded while it plans a movement of its arm to a target location indicated by a cursor. The recorded data is then linked with the corresponding coordinates (x and y) of the target location and the velocities in the x and y directions.
A Kalman filter is used to model the neural data after pre-processing it with PCA. The parameters of the model, such as the mean position estimate and the covariance matrix, are obtained during the training phase. The arm location predicted from the test data yielded an accuracy of 98%.
Kalman Filter
Neural Signal Processing
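The predict/update cycle of the filter described above can be sketched as follows. This is a generic linear Kalman filter step; the state layout (position and velocity), transition matrices, and noise covariances used in the project are not reproduced, so all matrices below are placeholders.

```python
import numpy as np

def kalman_step(x, P, z, A, Q, H, R):
    """One predict/update cycle of a linear Kalman filter.
    x, P : previous state estimate and its covariance
    z    : new observation (e.g. PCA-reduced neural features)
    A, Q : state transition model and process-noise covariance
    H, R : observation model and observation-noise covariance"""
    # predict: propagate the state and its uncertainty forward
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # update: blend the prediction with the new observation
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```

Run once per time bin of neural data, the filter yields a smoothed trajectory estimate rather than independent per-bin guesses.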
Avito is one of the largest and fastest-growing online classifieds sites. Avito hosts a high volume of listings, and competitive sellers often go to great lengths to get their wares noticed. Avito is looking to develop a model that can automatically spot duplicate ads, to ensure buyers can easily find what they are looking for.
Pairs of ads are labelled 0 or 1 in a response variable, indicating whether the pair is a duplicate. A linear model is then fitted between the response and features of the data set such as latitude, longitude, location ID and images. A binomial model is fitted using the glm function in R, and the probability of the response is then obtained for the test values, with a threshold of 0.5.
Logistic regression is a regression model where the dependent variable (DV) is categorical. The binary logistic model is used to estimate the probability of a binary response based on one or more predictor (or independent) variables (features).
The MNIST data set of handwritten digits is used to train the logistic regression classifier. The two digits chosen for the classification study are 4 and 7, and the binary classifier is trained to differentiate between them. The test classification accuracy was 99%.
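A minimal sketch of binary logistic regression fitted by batch gradient descent. Synthetic two-cluster data stands in for the MNIST 4s and 7s (the real images are not loaded here), and the learning rate and epoch count are illustrative.

```python
import numpy as np

def train_logistic(X, y, lr=0.1, epochs=500):
    """Fit binary logistic regression by gradient descent on the log-loss.
    X: (n, d) feature matrix, y: (n,) labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid probabilities
        grad_w = X.T @ (p - y) / len(y)           # gradient of mean log-loss
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict_logistic(X, w, b, threshold=0.5):
    """Classify as 1 when the estimated probability exceeds the threshold."""
    return (1.0 / (1.0 + np.exp(-(X @ w + b))) >= threshold).astype(int)
```

The same 0.5 threshold on the estimated probability is what turns the regression output into a hard class label.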
In statistics, an expectation–maximization (EM) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. The EM iteration alternates between performing an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step.
In this project, I implemented the EM algorithm for a Gaussian mixture model (GMM) on the neural data sets, to identify the neuron responsible for each recorded spike. The data was processed with PCA prior to applying the EM algorithm, and the cross-validated likelihood was computed for different numbers of clusters to find the best fit. Previously, I implemented k-means on the same data set to cluster the data.
Gaussian Mixture Models
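The E-step/M-step alternation described above can be sketched for a one-dimensional GMM (the project's data is multidimensional after PCA; the quantile-based initialization here is a simplifying assumption, not the project's):

```python
import numpy as np

def em_gmm_1d(x, k, iters=100):
    """EM for a 1-D Gaussian mixture with k components.
    Returns the mixture weights, means, and variances."""
    # initialize means spread across the data quantiles
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    var = np.full(k, np.var(x))
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities r[n, j] ∝ pi_j * N(x_n | mu_j, var_j)
        d = x[:, None] - mu[None, :]
        dens = pi * np.exp(-0.5 * d ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances from responsibilities
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var
```

For spike sorting, each recovered component plays the role of one putative neuron, and each spike is assigned to the component with the highest responsibility.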
In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model given data.
Derived maximum likelihood parameters and decision boundaries for Gaussian (class-specific covariance), Gaussian (shared covariance) and Poisson models, and implemented them to classify neural data sets.
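The shared-covariance Gaussian case above can be sketched as follows: pooling one ML covariance across both classes makes the resulting decision boundary linear. The synthetic data in the test is illustrative, not the project's neural data.

```python
import numpy as np

def fit_shared_cov_gaussian(X, y):
    """ML estimates for a two-class Gaussian model with shared covariance.
    X: (n, d) features, y: (n,) labels in {0, 1}."""
    means = {c: X[y == c].mean(axis=0) for c in (0, 1)}
    # pooled ML covariance: deviations taken from each class's own mean
    centered = np.vstack([X[y == c] - means[c] for c in (0, 1)])
    cov = centered.T @ centered / len(X)
    priors = {c: np.mean(y == c) for c in (0, 1)}
    return means, cov, priors

def classify(X, means, cov, priors):
    """Assign the class with the larger Gaussian log-discriminant."""
    inv = np.linalg.inv(cov)
    scores = []
    for c in (0, 1):
        d = X - means[c]
        # log prior minus squared Mahalanobis distance (shared constant dropped)
        scores.append(np.log(priors[c]) - 0.5 * np.sum(d @ inv * d, axis=1))
    return (scores[1] > scores[0]).astype(int)
```

With a class-specific covariance instead, the Mahalanobis terms no longer cancel quadratically and the boundary becomes a quadratic surface.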
Performed statistical analysis of neural data using the following plots: spike histogram, tuning curve, count distribution, Fano factor, inter-spike interval distribution and coefficient of variation.
Implemented probabilistic PCA (PPCA) and factor analysis (FA) on the neural data sets. Visualized the data by projecting it into a two-dimensional space and compared the results with PCA.
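The PCA baseline used for comparison above can be sketched as a projection onto the top eigenvectors of the sample covariance (the PPCA and FA fits themselves are not reproduced here):

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project data onto its top principal components, found as the
    leading eigenvectors of the sample covariance matrix."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)
    vals, vecs = np.linalg.eigh(cov)                     # ascending order
    top = vecs[:, np.argsort(vals)[::-1][:n_components]] # largest first
    return Xc @ top
```

Projecting high-dimensional neural features onto two components like this is what makes the side-by-side visual comparison with PPCA and FA possible.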
2355 Eldridge Street,
Pittsburgh, PA, USA.