This repository documents a machine learning project for the classification of resumes into different categories using the UpdatedResumeDataset.csv dataset. The project follows a systematic approach, including pre-processing, exploratory data analysis (EDA), text cleaning, stopwords removal, feature extraction, and model evaluation.
- Loaded the dataset using the pandas library to initiate the project.
- Examined the distribution of resume categories within the dataset using the value_counts() method. This step provides an initial understanding of the dataset's class distribution.
- Visualized the count of resumes for each category using a horizontal bar plot. This plot provides a clear representation of the number of resumes in each category, aiding in identifying any class imbalances.
- Generated a pie chart illustrating the percentage distribution of resumes across different categories. This visualization helps in grasping the proportional contribution of each category to the overall dataset.
- Standardized the text data by converting all resume text to lowercase.
- Developed a cleaning function to remove special characters, URLs, "RT" markers, punctuation, and extra whitespace.
- Stored the cleaned text in a new column for further analysis.
- Utilized the nltk library to tokenize the cleaned resume text and identify the most common words.
- Created a Word Cloud to visually represent the most frequently occurring words in the resume text.
- Encoded the categorical variable 'Category' into numerical values using label encoding.
- Utilized the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer to convert the cleaned resume text into numerical feature vectors. This process involves tokenizing documents, learning vocabulary, and calculating inverse document frequency weightings.
- Split the data into training and testing sets.
- Implemented a Naive Bayes Classifier, specifically the MultinomialNB model, to train on the feature vectors and make predictions.
- Evaluated the model's performance by calculating accuracy and providing a detailed classification report.
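
The text-cleaning step described above could look roughly like the following. This is a minimal sketch, not the notebook's actual function: the name `clean_resume` and the exact regular expressions are assumptions chosen for illustration, using only Python's standard library.

```python
import re
import string


def clean_resume(text: str) -> str:
    """Hypothetical cleaner: lowercase, then strip URLs, RT markers,
    punctuation, non-ASCII characters, and extra whitespace."""
    text = text.lower()
    # Remove URLs before punctuation is stripped, while they are still intact.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # Remove standalone "rt" tokens (the text is already lowercase).
    text = re.sub(r"\brt\b", " ", text)
    # Replace every ASCII punctuation character with a space.
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    # Replace any remaining non-ASCII (special) characters with a space.
    text = re.sub(r"[^\x00-\x7f]", " ", text)
    # Collapse runs of whitespace into a single space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


sample = "RT Visit https://example.com NOW!!  Skills: Python, SQL"
print(clean_resume(sample))
```

One caveat with this approach: stripping all punctuation also reduces tokens such as "c++" to "c", so a real pipeline may want to protect skill names like these before cleaning.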
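
The encoding, vectorization, splitting, training, and evaluation steps listed above can be sketched with scikit-learn as follows. The toy resumes, category names, and parameter values (`test_size`, `random_state`, `sublinear_tf`, `stop_words`) are made up for illustration and are not taken from the notebook. The sketch also fits the vectorizer on the training split only, which is a choice made here to keep test vocabulary out of training and may differ from the order used in the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the cleaned resume text and its 'Category' labels.
texts = [
    "python machine learning pandas statistics",
    "deep learning python tensorflow models",
    "data analysis sql python visualization",
    "statistics regression python numpy",
    "recruitment payroll employee onboarding",
    "employee relations hiring policies",
    "talent acquisition interviews payroll",
    "hr policies training employee engagement",
]
categories = ["Data Science"] * 4 + ["HR"] * 4

# Label-encode the category names into integers.
encoder = LabelEncoder()
y = encoder.fit_transform(categories)

# Hold out 25% of the documents, keeping class proportions equal.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    texts, y, test_size=0.25, random_state=42, stratify=y
)

# Learn vocabulary and IDF weights on the training text, then reuse them.
vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words="english")
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

# Train Multinomial Naive Bayes on the TF-IDF features and predict.
model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
report = classification_report(
    y_test,
    y_pred,
    labels=list(range(len(encoder.classes_))),
    target_names=list(encoder.classes_),
    zero_division=0,
)
print(f"Accuracy: {accuracy:.2f}")
print(report)
```

With the real dataset, `texts` and `categories` would come from the cleaned-text and 'Category' columns of the DataFrame instead of the hard-coded lists.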
Explore the complete project, including the Jupyter notebook with code and visualizations, at Project Link.
Feel free to adapt and use the provided code for your own resume classification tasks. For any questions or suggestions, please open an issue or reach out. Happy exploring!