Corpus harvester

This project grew out of the need to build text datasets.

This program scrapes search engines (currently only Google) for links based on queries, which are configurable in seeds.json.
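The layout of seeds.json is not described in this readme, so the following is only a hypothetical sketch. It assumes the file maps a folder name to a list of query strings and shows how such a file could be flattened into (folder, query) pairs. The function name load_seeds, the JSON layout, and the example queries are all assumptions, not the project's actual format.

```python
import json


def load_seeds(text):
    """Parse seeds JSON text into a list of (folder, query) pairs.

    Assumed layout (hypothetical): {"folder name": ["query 1", "query 2"]}.
    """
    seeds = json.loads(text)
    pairs = []
    for folder, queries in seeds.items():
        for query in queries:
            pairs.append((folder, query))
    return pairs


# Hypothetical seeds.json content; the real file may be structured differently.
EXAMPLE = (
    '{"medicine": ["clinical trial report", "drug interactions"],'
    ' "law": ["contract law basics"]}'
)

if __name__ == "__main__":
    for folder, query in load_seeds(EXAMPLE):
        print(folder, "->", query)
```

With a real file, the text would come from open("seeds.json", encoding="utf-8").read() rather than from the inline example.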

Then it accesses the links and extracts the main content from text/html pages using the Python library newspaper, or extracts text from files such as PDFs or DOCs using Textract.
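As a rough illustration of this step, the sketch below chooses an extractor from the URL and then calls the corresponding library. The extension list, the function names, and the choice to decide by file extension rather than by the response's Content-Type are assumptions; harvest.py may decide differently. The library imports are kept inside the functions so that the routing logic can be tried without either package installed.

```python
from urllib.parse import urlparse

# Assumption: these extensions are routed to textract; everything else is
# treated as a web page and routed to newspaper.
DOCUMENT_EXTENSIONS = (".pdf", ".doc", ".docx")


def pick_extractor(url):
    """Return 'textract' for document-like URLs, 'newspaper' otherwise."""
    path = urlparse(url).path.lower()
    if path.endswith(DOCUMENT_EXTENSIONS):
        return "textract"
    return "newspaper"


def extract_html(url):
    """Download a web page and return its main text using newspaper."""
    from newspaper import Article  # pip3 install newspaper3k

    article = Article(url)
    article.download()
    article.parse()
    return article.text


def extract_file(local_path):
    """Return the text of an already downloaded file using textract."""
    import textract  # pip3 install textract

    # textract.process returns bytes, so decode before further processing.
    return textract.process(local_path).decode("utf-8", errors="replace")
```

Note that extract_file expects a local path, so a document link would have to be downloaded first; that part is left out of the sketch.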

Finally, it cleans the text, leaving only words separated by single spaces.
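One possible way to do this kind of cleaning is a regular expression that keeps only runs of letters. This is a minimal sketch, not the project's actual cleaning code: it assumes that digits, punctuation, and underscores should all be dropped, and clean_text is a name chosen here for illustration.

```python
import re

# Matches runs of Unicode letters: word characters minus digits and underscore.
WORD_RE = re.compile(r"[^\W\d_]+")


def clean_text(raw):
    """Return only the words found in raw, separated by single spaces."""
    return " ".join(WORD_RE.findall(raw))


if __name__ == "__main__":
    print(clean_text("Hello, world!\n\n42 foo_bar -- café."))
    # prints: Hello world foo bar café
```

A side effect of this simple rule is that contractions are split at the apostrophe ("don't" becomes "don t"), which may or may not be acceptable for a given corpus.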

Each source is saved to a separate .txt file in its corresponding folder.
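The naming scheme for the output files is not specified here. The sketch below shows one way to get a separate .txt file per source, under the assumption that the file name is derived from a hash of the source URL and that folders are created on demand. The names output_path and save_text, and the hashing scheme, are hypothetical.

```python
import hashlib
import os


def output_path(base_dir, folder, url):
    """Build a stable .txt path for a source URL inside its folder."""
    # Hashing the URL avoids characters that are invalid in file names.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:16]
    return os.path.join(base_dir, folder, digest + ".txt")


def save_text(base_dir, folder, url, text):
    """Write text to the file for this source and return the path used."""
    path = output_path(base_dir, folder, url)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return path
```

Because the path depends only on the URL, harvesting the same link twice overwrites the earlier file instead of creating a duplicate.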

Dependencies

So far, this program has only been tested on Ubuntu with Python 3.5. With that environment in place, set it up as follows:

Install textract dependencies

apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig libpulse-dev
pip3 install textract

Install Newspaper to extract the main content from web pages (I also tested the Goose library, but Newspaper seems to perform better).

pip3 install newspaper3k

Using harvester

Install dependencies
Configure seeds.json
Run harvest.py

freesoul/corpus-harvester

Folders and files

Latest commit

History

Repository files navigation

Corpus harvester

Dependencies

Using harvester

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages