Commit
WIP add processors/archive_webpages module
- fixes #36
Showing 7 changed files with 113 additions and 6 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,74 @@ | ||
"""archive webpages | ||
TODO | ||
# $ cat tests/.hecat.archive_webpages.yml | ||
steps: | ||
  - name: archive webpages | ||
    module: processors/archive_webpages | ||
    module_options: | ||
      data_file: tests/shaarli.yml # path to the YAML data file | ||
      only_tags: ['doc'] # only download items tagged with at least one of these tags | ||
      exclude_tags: ['nodl'] # (default []), don't download items tagged with any of these tags | ||
      output_directory: 'tests/webpages' # path to the output directory for archived webpages | ||
      skip_already_archived: True # (default True) skip items that already have an 'archive_path' key | ||
# $ hecat --config tests/.hecat.archive_webpages.yml | ||
Data file format (output of import_shaarli module): | ||
# shaarli.yml | ||
- id: 1667 # required, unique id | ||
  url: https://solar.lowtechmagazine.com/2016/10/pigeon-towers-a-low-tech-alternative-to-synthetic-fertilizers | ||
  tags: | ||
    - tag1 | ||
    - tag2 | ||
    - diy | ||
    - doc | ||
    - readlater | ||
  ... | ||
  archive_path: TODO | ||
Source directory structure: | ||
└── shaarli.yml | ||
Output directory structure: | ||
└── TODO | ||
""" | ||
|
||
import sys | ||
import os | ||
import logging | ||
import ruamel.yaml | ||
from ..utils import load_yaml_data | ||
|
||
yaml = ruamel.yaml.YAML() | ||
yaml.indent(sequence=2, offset=0) | ||
yaml.width = 99999 | ||
|
||
def wget(item): | ||
    """archive a webpage with wget""" | ||
|
||
|
||
def archive_webpages(step): | ||
    """archive webpages linked from each item's 'url', if their tags match at least one of step['module_options']['only_tags'], | ||
    write the path to the local archive to a new key 'archive_path' in the original data file for each downloaded item | ||
    """ | ||
    skipped_count = 0 | ||
    items = load_yaml_data(step['module_options']['data_file']) | ||
    for item in items: | ||
        # skip already archived items when skip_already_archived: True | ||
        if (('skip_already_archived' not in step['module_options'].keys() or | ||
             step['module_options']['skip_already_archived']) and 'archive_path' in item.keys()): | ||
            logging.debug('skipping %s (id %s): already archived', item['url'], item['id']) | ||
            skipped_count = skipped_count + 1 | ||
        # skip items matching exclude_tags | ||
        elif ('exclude_tags' in step['module_options'] and any(tag in item['tags'] for tag in step['module_options']['exclude_tags'])): | ||
            logging.debug('skipping %s (id %s): one or more tags are present in exclude_tags', item['url'], item['id']) | ||
            skipped_count = skipped_count + 1 | ||
        # archive items matching only_tags | ||
        elif list(set(step['module_options']['only_tags']) & set(item['tags'])): | ||
            logging.info('archiving %s (id %s)', item['url'], item['id']) | ||
            wget(item) | ||
        else: | ||
            logging.debug('skipping %s (id %s): no tags matching only_tags', item['url'], item['id']) | ||
            skipped_count = skipped_count + 1 | ||
    # sys.exit(1) |
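
The `wget(item)` function in this diff is still a stub with only a docstring. One possible way to fill it in is sketched below. This is not the author's implementation: the `build_wget_command` helper, the extra `output_directory` parameter, the per-item subdirectory named after `id`, and the choice of wget flags are all assumptions made for illustration (the docstring still lists the output layout as TODO).

```python
import os
import subprocess


def build_wget_command(item, output_directory):
    """Return the wget argument list for one item (hypothetical helper)."""
    # Assumption: each item is archived into its own subdirectory named after its id.
    target_dir = os.path.join(output_directory, str(item['id']))
    return [
        'wget',
        '--page-requisites',   # also fetch images, CSS and scripts needed to render the page
        '--convert-links',     # rewrite links so the local copy can be browsed offline
        '--adjust-extension',  # append .html/.css suffixes when the URL lacks them
        '--no-verbose',
        '--tries=3',
        '--timeout=30',
        '--directory-prefix=' + target_dir,
        item['url'],
    ]


def wget(item, output_directory):
    """Archive a webpage with wget; return the local directory, or None on failure."""
    command = build_wget_command(item, output_directory)
    # check=False: inspect the exit status ourselves instead of raising.
    result = subprocess.run(command, check=False)
    # Assumption: any non-zero exit status is treated as a failed archive.
    if result.returncode != 0:
        return None
    return os.path.join(output_directory, str(item['id']))
```

Keeping the command construction separate from the `subprocess.run` call makes the flags checkable without network access. The returned directory is a candidate value for the `archive_path` key described in the module docstring.
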
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
steps: | ||
  - name: archive webpages | ||
    module: processors/archive_webpages | ||
    module_options: | ||
      data_file: tests/shaarli.yml | ||
      only_tags: ['hecat'] | ||
      exclude_tags: ['nodl'] | ||
      output_directory: 'tests/webpages' | ||
|
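
To make the effect of this test configuration concrete, the sketch below restates the skip rules from `archive_webpages` as a standalone function and applies them to a few made-up items. The `should_archive` name, the sample items, and the use of `item.get('tags', [])` (the code in the diff indexes `item['tags']` directly) are assumptions for illustration, not part of the commit.

```python
def should_archive(item, module_options):
    """Return (decision, reason) for one item (hypothetical helper)."""
    # Assumption: an item without a 'tags' key is treated as having no tags.
    tags = set(item.get('tags', []))
    # skip_already_archived defaults to True when the option is absent
    if module_options.get('skip_already_archived', True) and 'archive_path' in item:
        return False, 'already archived'
    # any tag listed in exclude_tags rules the item out
    if tags & set(module_options.get('exclude_tags', [])):
        return False, 'excluded tag'
    # at least one tag from only_tags is enough to select the item
    if tags & set(module_options['only_tags']):
        return True, 'matches only_tags'
    return False, 'no tags matching only_tags'


# Options taken from the test configuration; the items are invented examples.
options = {'only_tags': ['hecat'], 'exclude_tags': ['nodl']}
items = [
    {'id': 1, 'url': 'https://example.org/a', 'tags': ['hecat']},
    {'id': 2, 'url': 'https://example.org/b', 'tags': ['hecat', 'nodl']},
    {'id': 3, 'url': 'https://example.org/c', 'tags': ['doc']},
    {'id': 4, 'url': 'https://example.org/d', 'tags': ['hecat'], 'archive_path': 'a/b'},
]
for item in items:
    print(item['id'], should_archive(item, options))
```

With these options only the first item is selected: the second carries an excluded tag, the third has no tag from `only_tags`, and the fourth already has an `archive_path`.
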