Monitor and downloader of RSS feeds

feed_monitor.py is a script that fetches a set of RSS feeds and downloads the HTML pages they link to. The script loads the list of feeds to monitor from a CSV file (see the format below), then reads each feed and downloads every linked HTML page that has not already been downloaded. Repeated downloads of the same linked page are avoided by SHA hashing.

CSV format

A row in the CSV file specifies a feed to be monitored, using the format:

"url of the rss feed","name of the feed","label 1","label 2","label 3",...

The first column is the URL of the RSS feed.
The second column is a name that identifies the source of the feed.
Each subsequent column assigns a label to the feed; at least one label is required.

Every page downloaded from a feed is saved once per label, in a path determined by the feed name and that label.

For example, given the row:

https://sports.yahoo.com/mlb/rss.xml,"yahoo","MLB","sports"

any downloaded page will be saved under the directories yahoo/MLB and yahoo/sports.
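The mapping from a CSV row to its output directories can be sketched as follows. This is only an illustration of the format described above, not the code used by feed_monitor.py; the function name target_dirs is hypothetical.

```python
import csv
import io
from pathlib import PurePosixPath


def target_dirs(row):
    """Return the name/label directories for one CSV row (hypothetical helper)."""
    if len(row) < 3:
        raise ValueError("a row needs a URL, a name, and at least one label")
    _url, name, *labels = row
    # One directory per label, each nested under the feed name.
    return [PurePosixPath(name) / label for label in labels]


sample = 'https://sports.yahoo.com/mlb/rss.xml,"yahoo","MLB","sports"\n'
for row in csv.reader(io.StringIO(sample)):
    print([str(d) for d in target_dirs(row)])
```

The csv module accepts both quoted and unquoted fields, so the URL can be written either way.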

The idea is that more than one feed can contribute to a label, e.g.:

https://sports.yahoo.com/nba/rss.xml,"yahoo","NBA","sports"

will also contribute its pages to yahoo/sports, in addition to saving them under the more specific label yahoo/NBA.

The full HTML of each downloaded page is saved together with a JSON record of the original RSS item. File names are determined by hashing the URL of the page.
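A minimal sketch of hash-based file naming and duplicate detection is shown below. It assumes SHA-256 and the .html/.json extensions, neither of which is confirmed here, and the function name save_page is hypothetical; the actual script may differ.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def save_page(directory, url, html, item):
    """Save html and the RSS item under a name derived from the URL hash.

    Returns False, without writing anything, if the page is already saved.
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    # Assumption: SHA-256 of the UTF-8 encoded URL is used as the file stem.
    stem = hashlib.sha256(url.encode("utf-8")).hexdigest()
    html_path = directory / f"{stem}.html"
    if html_path.exists():
        return False
    html_path.write_text(html, encoding="utf-8")
    (directory / f"{stem}.json").write_text(json.dumps(item), encoding="utf-8")
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "yahoo" / "sports"
    url = "https://example.com/story.html"  # placeholder URL
    item = {"title": "A story", "link": url}
    print(save_page(target, url, "<html></html>", item))  # first download
    print(save_page(target, url, "<html></html>", item))  # duplicate is skipped
```

Because the file name depends only on the URL, checking whether the file exists is enough to tell whether a page was already downloaded into that directory.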

Extracting text from the HTML page

The feed_extractor.py script uses a few hand-made rules to isolate the relevant text of a page from the rest of its content (menus, ads, links, headers and footers). The default heuristic keeps all text that is visible, is not part of a link, and is at least 25 characters long. A regex can be specified to clean the text further. This is, however, only an example of how to extract text: each feed may need dedicated processing to produce clean output, e.g., using the text_cleaner.py script.
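The default heuristic can be approximated with the standard library alone, as in the sketch below. It is not the code of feed_extractor.py: the names TextExtractor and extract_text, the set of skipped tags, and the choice to delete regex matches are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Assumption: text inside these tags counts as invisible or as link text.
SKIPPED_TAGS = {"script", "style", "noscript", "title", "a"}


class TextExtractor(HTMLParser):
    """Collect text runs that are outside skipped tags and long enough."""

    def __init__(self, min_length=25):
        super().__init__()
        self.min_length = min_length
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if self.skip_depth == 0 and len(text) >= self.min_length:
            self.chunks.append(text)


def extract_text(html, min_length=25, clean_regex=None):
    """Return the kept text runs, one per line, optionally cleaned by a regex."""
    parser = TextExtractor(min_length)
    parser.feed(html)
    parser.close()
    text = "\n".join(parser.chunks)
    if clean_regex is not None:
        # Assumption: matches of the regex are removed from the text.
        text = re.sub(clean_regex, "", text)
    return text


page = (
    "<html><head><title>A page title that is long enough to pass</title></head>"
    '<body><a href="/x">A navigation link with a long label here</a>'
    "<p>Short.</p>"
    "<p>This paragraph is long enough to be kept by the filter.</p>"
    "</body></html>"
)
print(extract_text(page))
```

Note that this sketch checks the length of each text run between tags separately, so a sentence broken up by inline markup may be dropped even if the whole paragraph is long.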

DISCLAIMER

The content you download from a feed is likely to be covered by copyright and/or other IP rights. Please check with the source of the feed what you are allowed to do with the content you download.

