200 hrs (Udacity Official)
10 - 12 months (Udacity Official)
4 weeks
(Projects are listed in reverse chronological order).
The host, Memorial Sloan Kettering Cancer Center (MSKCC), maintains the database OncoKB to share knowledge of mutation effects among oncologists. Currently, the interpretation of genetic mutations is done manually: a clinical pathologist has to review and classify every single genetic mutation based on evidence from text-based clinical literature, which is a very time-consuming task. Given my combined interest in biology and data science, I wanted to use my expertise to develop a machine learning algorithm that, using this expert-annotated knowledge base as a baseline, automatically classifies genetic variations.
This was my very first competition since I started my journey into data science in the middle of last year, and it has been an unforgettable, fast-paced learning experience. The rigorous background research, heated forum discussions, sometimes unpredictable model responses, and the hectic, exhausting feature-engineering work altogether gave me a real taste of applying machine learning to an actual need. The competition ended on October 2, 2017, and I am very happy to have finished in the top 5% among 1300+ teams on my novice trial as a reward for the hard work I put in. If you are interested, you can check my documentation for more details about the solution.
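At its core, the task is multi-class text classification over the clinical literature attached to each variant. Below is a minimal sketch of that idea with scikit-learn; the file names are the competition's, but the plain TF-IDF plus logistic regression model is only an illustrative stand-in for my actual feature engineering and ensembling.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# The competition data ships as training_variants (ID, Gene, Variation,
# Class) and training_text (ID||Text), joined on the ID column.
variants = pd.read_csv("training_variants")
text = pd.read_csv("training_text", sep="\|\|", engine="python",
                   skiprows=1, names=["ID", "Text"])
train = variants.merge(text, on="ID")

X_train, X_val, y_train, y_val = train_test_split(
    train["Text"], train["Class"], test_size=0.2,
    random_state=0, stratify=train["Class"])

# TF-IDF over the clinical text feeding a multinomial classifier;
# the competition is scored on multi-class log loss.
clf = make_pipeline(
    TfidfVectorizer(max_features=50000, stop_words="english"),
    LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("validation log loss:", log_loss(y_val, clf.predict_proba(X_val)))
```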
I find it rather interesting to see how machine learning can help in routine biological research. There is a great deal of repetitive work in a junior researcher’s career. In my own project, which uses biophysical methods to study the structural and functional characteristics of a pathological protein, I am exposed to NMR data every day. It often takes weeks or even months to manually assign the amino acid fingerprints on an HSQC spectrum to the primary structure by comparing chemical shifts with empirical values, which is tedious and mechanical. I am therefore very interested in applying the knowledge and skills I learned here to change this situation.
Note
- The mentioned Kaggle competition is here.
- All the files mentioned can be found in the data folder or accessed here.
- My Kaggle profile is here.
- Research and investigate a real-world problem of interest.
- Accurately apply specific machine learning algorithms and techniques.
- Properly analyze and visualize your data and results for validity.
- Quickly pick up unfamiliar libraries and techniques.
- Prioritize and implement numerous ideas and hypotheses.
- Document the work and write a structured report.
[0] Kaggle Competition: Personalized Medicine
[1] OncoKB: A Precision Oncology Knowledge Base
[3] Predicting the Functional Consequences of Somatic Missense Mutations Found in Tumors
[4] Predicting the functional impact of protein mutations: application to cancer genomics
[5] tmVar: A text mining approach for extracting sequence variants in biomedical literature
[6] TaggerOne: Joint Named Entity Recognition and Normalization with Semi-Markov Models
[7] GNormPlus: An Integrative Approach for Tagging Gene, Gene Family and Protein Domain
[8] Personalised Medicine - EDA with tidy R
[10] Brief insight on Genetic variations
[11] Human Genome Variation Society
[12] Official external data and pre-trained models thread
[13] Key Sentences Extraction ideas
[15] Introduction to Ensembling/Stacking in Python
[16] Titanic Top 4% with ensemble modeling
[17] Entrez-biopython
In this project, I classified images from the CIFAR-10 dataset, which consists of airplanes, dogs, cats, and other objects. I first preprocessed the images by normalizing the image features and one-hot encoding the labels, then applied the concepts and techniques I had learned to build a network of convolutional, max pooling, dropout, and fully connected layers and trained it on all the samples. At the end, I checked and optimized the neural network's predictions on sample images.
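Here is a minimal sketch of that pipeline, written with Keras for brevity; the original project builds the layers in TensorFlow directly, and the layer sizes below are illustrative.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Load CIFAR-10, normalize pixel values to [0, 1], one-hot encode labels.
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)

# Convolutional, max pooling, dropout, and fully connected layers,
# as described above.
model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dropout(0.5),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))
```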
In this project, I applied reinforcement learning techniques to a self-driving agent in a simplified world to help it reach its destinations reliably within the allotted time. I first investigated the environment the agent operates in by constructing a very basic driving implementation. Once my agent operated successfully within the environment, I identified each possible state the agent can be in, considering such things as traffic lights and oncoming traffic at each intersection. With the states identified, I implemented a Q-Learning algorithm to guide the self-driving agent towards its destination within the allotted time. Finally, I improved upon the Q-Learning algorithm to find the best configuration of learning and exploration factors, ensuring the agent reaches its destinations with consistently positive results.
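The heart of the agent is the tabular Q-learning update. A minimal sketch follows, with the environment's state encoding, action set, and reward signal stubbed out as assumptions rather than the project's exact implementation:

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.5, 0.2, 0.1   # learning, discount, exploration rates
actions = [None, "forward", "left", "right"]
Q = defaultdict(float)                   # Q[(state, action)] -> estimated value

def choose_action(state):
    # Epsilon-greedy policy: explore occasionally, otherwise act greedily.
    # A state might be a tuple such as (light, oncoming, next_waypoint).
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    # Standard Q-learning rule:
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next
                                   - Q[(state, action)])
```

Tuning alpha, gamma, and epsilon is exactly the "configuration of learning and exploration factors" mentioned above.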
In this project, I applied unsupervised learning techniques to product spending data collected from customers of a wholesale distributor in Lisbon, Portugal, to identify customer segments hidden in the data. I first explored the data by selecting a small subset of samples and determining whether any product categories correlate highly with one another. Afterwards, I preprocessed the data by scaling each product category and then identifying (and removing) unwanted outliers. With the clean customer spending data, I applied PCA transformations and implemented clustering algorithms to segment the transformed customer data. Finally, I compared the segmentation found with an additional labeling and considered ways this information could assist the wholesale distributor with future service changes (a sketch of this workflow follows the list below).
- Apply preprocessing techniques such as feature scaling and outlier detection.
- Interpret data points that have been scaled, transformed, or reduced from PCA.
- Analyze PCA dimensions and construct a new feature space.
- Optimally cluster a set of data to find hidden patterns in a dataset.
- Assess information given by cluster data and use it in a meaningful way.
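Here is a minimal sketch of the PCA-plus-clustering workflow, assuming the UCI Wholesale customers data in a customers.csv file; the Gaussian mixture model mirrors the project, but the snippet omits the outlier-removal step for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

# Keep only the six product-spending categories.
data = pd.read_csv("customers.csv").drop(["Channel", "Region"], axis=1)

# Log-transform the heavily skewed spending features, then project
# onto the first two principal components.
log_data = np.log(data)
reduced = PCA(n_components=2).fit_transform(log_data)

# Fit Gaussian mixture models and use the silhouette coefficient
# to choose the number of customer segments.
for k in (2, 3, 4):
    labels = GaussianMixture(n_components=k, random_state=0).fit_predict(reduced)
    print(k, silhouette_score(reduced, labels))
```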
In this project, I applied supervised learning techniques and an analytical mind to data collected from the U.S. census to help CharityML (a fictitious charity organization) identify the people most likely to donate to their cause. I first explored the data to learn how the census data is recorded. Next, I applied a series of transformations and preprocessing techniques to manipulate the data into a workable format. I then evaluated several supervised learners on the data and picked the one best suited for the solution. Afterwards, I optimized the model as the solution for CharityML. Finally, I explored the chosen model and its predictions under the hood, and found that it performed quite well given the data (a sketch of this workflow follows the list below).
- Identify when preprocessing is needed, and how to apply it.
- Establish a benchmark for a solution to the problem.
- Investigate whether a candidate solution model is adequate for the problem.
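A minimal sketch of the evaluate-then-optimize workflow, assuming the project's census.csv with an income label column; the gradient boosting learner and the parameter grid here are illustrative stand-ins for the model comparison described above.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split

# Encode the census features numerically and binarize the income label.
data = pd.read_csv("census.csv")
y = (data["income"] == ">50K").astype(int)
X = pd.get_dummies(data.drop("income", axis=1))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Tune the chosen learner against F0.5, which weights precision more
# heavily -- appropriate when soliciting a non-donor costs the charity money.
scorer = make_scorer(fbeta_score, beta=0.5)
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [3, 5]},
    scoring=scorer, cv=3)
grid.fit(X_train, y_train)
print("test F0.5:", fbeta_score(y_test, grid.predict(X_test), beta=0.5))
```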
In this project, I applied supervised machine learning concepts and techniques to data collected on housing prices in the Boston, Massachusetts area to predict the selling price of a new home. I explored the data to obtain important features and descriptive statistics about the dataset, split the data into testing and training subsets, chose the most suitable performance metric for this problem, and finally built a reasonably well-performing model (a sketch of this workflow follows the list below).
- Use NumPy to investigate the latent features of a dataset.
- Analyze various learning performance plots for variance and bias.
- Determine the best-guess model for predictions from unseen data.
- Evaluate a model's performance on unseen data using previous data.
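A minimal sketch of this workflow, assuming the project's housing.csv with RM, LSTAT, and PTRATIO features and a MEDV selling-price target:

```python
import pandas as pd
from sklearn.metrics import make_scorer, r2_score
from sklearn.model_selection import (GridSearchCV, ShuffleSplit,
                                     train_test_split)
from sklearn.tree import DecisionTreeRegressor

# Split features from the selling-price target.
data = pd.read_csv("housing.csv")
X, y = data.drop("MEDV", axis=1), data["MEDV"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# R^2 serves as the performance metric; grid-search the tree depth over
# shuffled cross-validation splits to balance bias and variance.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
grid = GridSearchCV(DecisionTreeRegressor(random_state=0),
                    param_grid={"max_depth": list(range(1, 11))},
                    scoring=make_scorer(r2_score), cv=cv)
grid.fit(X_train, y_train)
print("best depth:", grid.best_params_["max_depth"])
print("test R^2:", r2_score(y_test, grid.predict(X_test)))
```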