GitHub - hanzalaareeb/Data-Extraction-NLP: project involves extracting text from articles using web scraping techniques and performing text analysis to compute various linguistic variables

Data Extraction and NLP Text Analysis

This project extracts text from articles using web scraping and analyzes that text to compute linguistic variables such as sentiment scores and readability metrics. It is written in Python and uses libraries such as BeautifulSoup, Selenium, and others to crawl the web and analyze the data.

Objective

The goal of this project is to extract textual data from provided URLs and analyze the text to compute a range of text-based variables including:

- Positive Score
- Negative Score
- Polarity Score
- Subjectivity Score
- Average Sentence Length
- Percentage of Complex Words
- Gunning Fog Index
- Average Number of Words Per Sentence
- Complex Word Count
- Word Count
- Syllables Per Word
- Personal Pronouns
- Average Word Length

Features

Data Extraction: Scrapes articles from the URLs listed in an Excel sheet (Input.xlsx). Only the article title and text are extracted; irrelevant content such as headers, footers, and advertisements is excluded.
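As an illustration of keeping only the title and body text, here is a minimal sketch. It is not the code in Extraction.py: to stay dependency-free it uses the standard library's `html.parser` as a stand-in for BeautifulSoup, and it assumes the title is the first `<h1>`, the body is made of `<p>` elements, and anything inside `header`, `footer`, `nav`, `aside`, `script`, or `style` is irrelevant. The names `ArticleParser` and `extract_article` are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: containers whose content is never part of the article.
SKIP_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class ArticleParser(HTMLParser):
    """Collects the first <h1> as the title and every <p> outside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped container
        self.current = None   # "h1" or "p" while text is being collected
        self.buffer = []
        self.title = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and self.current is None and tag in ("h1", "p"):
            self.current = tag
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == self.current:
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self.buffer).split())
            if tag == "h1" and not self.title:
                self.title = text
            elif tag == "p" and text:
                self.paragraphs.append(text)
            self.current = None

    def handle_data(self, data):
        if self.current is not None and self.skip_depth == 0:
            self.buffer.append(data)


def extract_article(html):
    """Return (title, body_text) for one HTML page."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return parser.title, "\n".join(parser.paragraphs)
```

In a full pipeline, each URL from the Excel sheet would be fetched first and the response body passed to `extract_article`; real pages often need site-specific selectors rather than the generic tag rules assumed here.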

Text Analysis: Performs sentiment analysis and computes various text-based metrics such as readability and sentence complexity.
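The sketch below shows one way a few of these metrics could be computed. The README does not define the formulas, so everything here is an assumption: polarity is taken as (positive - negative) / (positive + negative + 0.000001), subjectivity as (positive + negative) / (word count + 0.000001), a complex word is one with more than two syllables, syllables are approximated by counting vowel groups, and the Gunning Fog Index uses the standard form 0.4 × (average sentence length + percentage of complex words). The `positive` and `negative` sets stand in for word lists such as those in MasterDictionary, stop-word removal is omitted, and all function names are hypothetical.

```python
import re

EPS = 1e-6  # keeps the ratios defined when a count is zero


def tokenize(text):
    """Lowercased alphabetic tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z]+", text.lower())


def sentiment_scores(words, positive, negative):
    """Dictionary-based sentiment counts and the two derived ratios."""
    pos = sum(1 for w in words if w in positive)
    neg = sum(1 for w in words if w in negative)
    return {
        "positive_score": pos,
        "negative_score": neg,
        "polarity_score": (pos - neg) / (pos + neg + EPS),
        "subjectivity_score": (pos + neg) / (len(words) + EPS),
    }


def count_syllables(word):
    """Approximate syllables as runs of vowels (heuristic, at least 1)."""
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)


def readability(text):
    """Sentence-length and complex-word metrics for one document."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = tokenize(text)
    if not sentences or not words:
        return {"word_count": 0, "complex_word_count": 0,
                "avg_sentence_length": 0.0, "pct_complex_words": 0.0,
                "fog_index": 0.0}
    complex_count = sum(1 for w in words if count_syllables(w) > 2)
    avg_len = len(words) / len(sentences)
    pct_complex = 100 * complex_count / len(words)
    return {
        "word_count": len(words),
        "complex_word_count": complex_count,
        "avg_sentence_length": avg_len,
        "pct_complex_words": pct_complex,
        "fog_index": 0.4 * (avg_len + pct_complex),
    }
```

A call might look like `sentiment_scores(tokenize(article_text), positive, negative)` followed by `readability(article_text)`, with the two dictionaries merged into one output row per article.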

Output: The results are saved in a structured format as per the Output Data Structure.xlsx.

Libraries Used

- beautifulsoup4==4.12.3
- blinker==1.8.2
- distlib==0.3.8
- httpx==0.26.0
- nltk==3.9.1
- openpyxl==3.1.5
- packaging==23.2
- pandas==2.2.1
- referencing==0.33.0
- requests==2.31.0

Folders and files

- Cleaned_Articles
- MasterDictionary
- StopWords
- extracted_articles
- Extraction.py
- Input.xlsx
- Output Data Solution.xlsx
- ReadMe.md
- analysis2.py
- cleaning.py
- requirements.txt
