200 hrs (Udacity Official)
10 - 12 months (Udacity Official)
4 weeks
(Projects are listed in reverse chronological order).
The host, Memorial Sloan Kettering Cancer Center (MSKCC), maintains the database OncoKB to share knowledge of mutation effects among oncologists. Currently, the interpretation of genetic mutations is done manually: a clinical pathologist has to review and classify every single genetic mutation based on evidence from text-based clinical literature, which is a very time-consuming task. Given my combined interest in biology and data science, I wanted to use my expertise to develop a machine learning algorithm that, using this expert-annotated knowledge base as a baseline, automatically classifies genetic variations.
This was my very first competition since I started my journey into data science in the middle of last year, and it has been an unforgettable, fast-paced learning experience. The rigorous background research, heated forum discussions, sometimes unpredictable model responses, and the hectic, exhausting feature-engineering work altogether gave me a real taste of applying machine learning to an actual need. The competition ended on October 2, 2017, and I am very happy to have finished in the top 5% among 1300+ teams on my novice trial as a reward for the hard work I put in. If you are interested, you can check my documentation for more details about the solution.
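At its core, the task is multi-class text classification over the clinical literature attached to each variant. Below is a minimal sketch of that idea with scikit-learn; the file names are the competition's, but the plain TF-IDF plus logistic regression model is only an illustrative stand-in for my actual feature engineering and ensembling.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# The competition data ships as training_variants (ID, Gene, Variation,
# Class) and training_text (ID||Text), joined on the ID column.
variants = pd.read_csv("training_variants")
text = pd.read_csv("training_text", sep="\|\|", engine="python",
                   skiprows=1, names=["ID", "Text"])
train = variants.merge(text, on="ID")

X_train, X_val, y_train, y_val = train_test_split(
    train["Text"], train["Class"], test_size=0.2,
    random_state=0, stratify=train["Class"])

# TF-IDF over the clinical text feeding a multinomial classifier;
# the competition is scored on multi-class log loss.
clf = make_pipeline(
    TfidfVectorizer(max_features=50000, stop_words="english"),
    LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("validation log loss:", log_loss(y_val, clf.predict_proba(X_val)))
```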
I find it rather interesting to see how machine learning can help in routine biological research. There is a great deal of repetitive work in a junior researcher’s career. In my own project, which uses biophysical methods to study the structural and functional characteristics of a pathological protein, I am exposed to NMR data every day. It often takes weeks or even months to manually assign the amino acid fingerprints on an HSQC spectrum to the primary structure by comparing chemical shifts with empirical values, which is tedious and mechanical. I am therefore very interested in applying the knowledge and skills I learned here to change this situation.
Note
- The mentioned Kaggle competition is here.
- All the files mentioned can be found in the data folder or accessed here.
- My Kaggle profile is here.
- Research and investigate a real-world problem of interest.
- Accurately apply specific machine learning algorithms and techniques.
- Properly analyze and visualize your data and results for validity.
- Quickly pick up unfamiliar libraries and techniques.
- Prioritize and implement numerous ideas and hypotheses.
- Document the work and write a structured report.
[0] Kaggle Competition: Personalized Medicine
[1] OncoKB: A Precision Oncology Knowledge Base
[3] Predicting the Functional Consequences of Somatic Missense Mutations Found in Tumors
[4] Predicting the functional impact of protein mutations: application to cancer genomics
[5] tmVar: A text mining approach for extracting sequence variants in biomedical literature
[6] TaggerOne: Joint Named Entity Recognition and Normalization with Semi-Markov Models
[7] GNormPlus: An Integrative Approach for Tagging Gene, Gene Family and Protein Domain
[8] Personalised Medicine - EDA with tidy R
[10] Brief insight on Genetic variations
[11] Human Genome Variation Society
[12] Official external data and pre-trained models thread
[13] Key Sentences Extraction ideas
[15] Introduction to Ensembling/Stacking in Python
[16] Titanic Top 4% with ensemble modeling
[17] Entrez-biopython
In this project, I classified images from the CIFAR-10 dataset, which consists of airplanes, dogs, cats, and other objects. I first preprocessed the images by normalizing the image features and one-hot encoding the labels, then applied the concepts and techniques I had learned to build a network of convolutional, max pooling, dropout, and fully connected layers and trained it on all the samples. At the end, I checked and optimized the neural network's predictions on sample images.
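Here is a minimal sketch of that pipeline, written with Keras for brevity; the original project builds the layers in TensorFlow directly, and the layer sizes below are illustrative.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Load CIFAR-10, normalize pixel values to [0, 1], one-hot encode labels.
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)

# Convolutional, max pooling, dropout, and fully connected layers,
# as described above.
model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.5),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))
```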
In this project, I applied reinforcement learning techniques to a self-driving agent in a simplified world to help it reach its destinations reliably within the allotted time. I first investigated the environment the agent operates in by constructing a very basic driving implementation. Once my agent operated successfully within the environment, I identified each possible state the agent can be in, considering such things as traffic lights and oncoming traffic at each intersection. With the states identified, I implemented a Q-Learning algorithm to guide the self-driving agent towards its destination within the allotted time. Finally, I improved upon the Q-Learning algorithm to find the best configuration of learning and exploration factors, ensuring the agent reaches its destinations with consistently positive results.
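The heart of the agent is the tabular Q-learning update. A minimal sketch follows, with the environment's state encoding, action set, and reward signal stubbed out as assumptions rather than the project's exact implementation:

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.5, 0.2, 0.1   # learning, discount, exploration rates
actions = [None, "forward", "left", "right"]
Q = defaultdict(float)                   # Q[(state, action)] -> estimated value

def choose_action(state):
    # Epsilon-greedy policy: explore occasionally, otherwise act greedily.
    # A state might be a tuple such as (light, oncoming, next_waypoint).
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    # Standard Q-learning rule:
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next
                                   - Q[(state, action)])
```

Tuning alpha, gamma, and epsilon is exactly the "configuration of learning and exploration factors" mentioned above.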
In this project, I applied unsupervised learning techniques to product spending data collected from customers of a wholesale distributor in Lisbon, Portugal, to identify customer segments hidden in the data. I first explored the data by selecting a small subset of samples and determining whether any product categories correlate highly with one another. Afterwards, I preprocessed the data by scaling each product category and then identifying (and removing) unwanted outliers. With the clean customer spending data, I applied PCA transformations and implemented clustering algorithms to segment the transformed customer data. Finally, I compared the segmentation found with an additional labeling and considered ways this information could assist the wholesale distributor with future service changes (a sketch of this workflow follows the list below).
- Apply preprocessing techniques such as feature scaling and outlier detection.
- Interpret data points that have been scaled, transformed, or reduced from PCA.
- Analyze PCA dimensions and construct a new feature space.
- Optimally cluster a set of data to find hidden patterns in a dataset.
- Assess information given by cluster data and use it in a meaningful way.
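Here is a minimal sketch of the PCA-plus-clustering workflow, assuming the UCI Wholesale customers data in a customers.csv file; the Gaussian mixture model mirrors the project, but the snippet omits the outlier-removal step for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# Keep only the six product-spending categories.
data = pd.read_csv("customers.csv").drop(["Channel", "Region"], axis=1)

# Log-transform the heavily skewed spending features, then project
# onto the first two principal components.
log_data = np.log(data)
reduced = PCA(n_components=2).fit_transform(log_data)

# Fit Gaussian mixture models and use the silhouette coefficient
# to choose the number of customer segments.
for k in (2, 3, 4):
    labels = GaussianMixture(n_components=k, random_state=0).fit_predict(reduced)
    print(k, silhouette_score(reduced, labels))
```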
In this project, I applied supervised learning techniques and an analytical mind to data collected from the U.S. census to help CharityML (a fictitious charity organization) identify the people most likely to donate to their cause. I first explored the data to learn how the census data is recorded. Next, I applied a series of transformations and preprocessing techniques to manipulate the data into a workable format. I then evaluated several supervised learners on the data and picked the one best suited for the solution. Afterwards, I optimized the model as the solution for CharityML. Finally, I explored the chosen model and its predictions under the hood, and found that it performed quite well given the data (a sketch of this workflow follows the list below).
- Identify when preprocessing is needed, and how to apply it.
- Establish a benchmark for a solution to the problem.
- Investigate whether a candidate solution model is adequate for the problem.
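A minimal sketch of the evaluate-then-optimize workflow, assuming the project's census.csv with an income label column; the gradient boosting learner and the parameter grid here are illustrative stand-ins for the model comparison described above.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split

# Encode the census features numerically and binarize the income label.
data = pd.read_csv("census.csv")
y = (data["income"] == ">50K").astype(int)
X = pd.get_dummies(data.drop("income", axis=1))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Tune the chosen learner against F0.5, which weights precision more
# heavily -- appropriate when soliciting a non-donor costs the charity money.
scorer = make_scorer(fbeta_score, beta=0.5)
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [3, 5]},
    scoring=scorer, cv=3)
grid.fit(X_train, y_train)
print("test F0.5:", fbeta_score(y_test, grid.predict(X_test), beta=0.5))
```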
In this project, I applied supervised machine learning concepts and techniques to data collected on housing prices in the Boston, Massachusetts area to predict the selling price of a new home. I explored the data to obtain important features and descriptive statistics about the dataset, split the data into testing and training subsets, chose the most suitable performance metric for this problem, and finally built a reasonably well-performing model (a sketch of this workflow follows the list below).
- Use NumPy to investigate the latent features of a dataset.
- Analyze various learning performance plots for variance and bias.
- Determine the best-guess model for predictions from unseen data.
- Evaluate a model's performance on unseen data using previous data.
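A minimal sketch of this workflow, assuming the project's housing.csv with RM, LSTAT, and PTRATIO features and a MEDV selling-price target:

```python
import pandas as pd
from sklearn.metrics import make_scorer, r2_score
from sklearn.model_selection import (GridSearchCV, ShuffleSplit,
                                     train_test_split)
from sklearn.tree import DecisionTreeRegressor

# Split features from the selling-price target.
data = pd.read_csv("housing.csv")
X, y = data.drop("MEDV", axis=1), data["MEDV"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# R^2 serves as the performance metric; grid-search the tree depth over
# shuffled cross-validation splits to balance bias and variance.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
grid = GridSearchCV(DecisionTreeRegressor(random_state=0),
                    param_grid={"max_depth": list(range(1, 11))},
                    scoring=make_scorer(r2_score), cv=cv)
grid.fit(X_train, y_train)
print("best depth:", grid.best_params_["max_depth"])
print("test R^2:", r2_score(y_test, grid.predict(X_test)))
```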