A simple HTML scraper that extracts data with XPath or CSS selectors.
pip install hodorlive
WARNING: By default this package does not verify SSL connections. Set the ssl_verify argument (listed below) to enable verification.
from hodor import Hodor
from dateutil.parser import parse


def date_convert(data):
    # Parse the scraped date string into a datetime object.
    return parse(data)


url = 'http://www.nasdaq.com/markets/stocks/symbol-change-history.aspx'

# Each key is a rule to parse; _groups and _paginate_by are extra parameters.
CONFIG = {
    'old_symbol': {
        'css': '#SymbolChangeList_table tr td:nth-child(1)',
        'many': True
    },
    'new_symbol': {
        'css': '#SymbolChangeList_table tr td:nth-child(2)',
        'many': True
    },
    'effective_date': {
        'css': '#SymbolChangeList_table tr td:nth-child(3)',
        'many': True,
        'transform': date_convert
    },
    '_groups': {
        'data': '__all__',
        'ticker_changes': ['old_symbol', 'new_symbol']
    },
    '_paginate_by': {
        'xpath': '//*[@id="two_column_main_content_lb_NextPage"]/@href',
        'many': False
    }
}

# Crawl at most 5 pages, then read the extracted data.
h = Hodor(url=url, config=CONFIG, pagination_max_limit=5)
h.data
{'data': [{'effective_date': datetime.datetime(2016, 11, 1, 0, 0),
           'new_symbol': 'ARNC',
           'old_symbol': 'AA'},
          {'effective_date': datetime.datetime(2016, 11, 1, 0, 0),
           'new_symbol': 'ARNC$',
           'old_symbol': 'AA$'},
          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
           'new_symbol': 'MALN8',
           'old_symbol': 'AHUSDN2018'},
          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
           'new_symbol': 'MALN9',
           'old_symbol': 'AHUSDN2019'},
          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
           'new_symbol': 'MALQ6',
           'old_symbol': 'AHUSDQ2016'},
          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
           'new_symbol': 'MALQ7',
           'old_symbol': 'AHUSDQ2017'},
          {'effective_date': datetime.datetime(2016, 8, 16, 0, 0),
           'new_symbol': 'MALQ8',
           'old_symbol': 'AHUSDQ2018'}]}
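
Since each group comes back as a list of plain dicts, the result is easy to post-process. The following is a minimal sketch, assuming the record shape shown in the output above; build_symbol_map and changes_since are hypothetical helpers, not part of Hodor, and the two sample records are copied from that output so the snippet runs without a network call.

```python
from datetime import datetime

# Sample records copied from the output above; in real use this
# would be h.data['data'].
records = [
    {'effective_date': datetime(2016, 11, 1, 0, 0),
     'new_symbol': 'ARNC',
     'old_symbol': 'AA'},
    {'effective_date': datetime(2016, 8, 16, 0, 0),
     'new_symbol': 'MALN8',
     'old_symbol': 'AHUSDN2018'},
]


def build_symbol_map(rows):
    """Map each old ticker symbol to its replacement (hypothetical helper)."""
    return {row['old_symbol']: row['new_symbol'] for row in rows}


def changes_since(rows, cutoff):
    """Keep only the changes effective on or after the cutoff datetime."""
    return [row for row in rows if row['effective_date'] >= cutoff]


symbol_map = build_symbol_map(records)
recent = changes_since(records, datetime(2016, 10, 1))
```

Because the transform on effective_date already yields datetime objects, the date filter needs no further parsing.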
The Hodor constructor also accepts these optional arguments:
- ua (User-Agent)
- proxies (see requesocks)
- auth
- crawl_delay (crawl delay in seconds across pagination - default: 3 seconds)
- pagination_max_limit (max number of pages to crawl - default: 100)
- ssl_verify (default: False)
- robots (if set, respects robots.txt - default: True)
- reppy_capacity (robots cache LRU capacity - default: 100)
- trim_values (if set, trims leading and trailing whitespace from output - default: True)
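
As a rough illustration of how these arguments might be combined, here is a sketch that turns SSL verification on and slows the crawl down. It assumes the arguments are passed to the constructor as keywords, the way pagination_max_limit is in the example above. The User-Agent string and the build_scraper helper are hypothetical.

```python
# Option names come from the argument list above; the values are illustrative.
HODOR_OPTIONS = {
    'ua': 'my-scraper/0.1 (+https://example.com/contact)',  # hypothetical User-Agent
    'crawl_delay': 5,             # wait 5 seconds between paginated requests
    'pagination_max_limit': 10,   # stop after 10 pages
    'ssl_verify': True,           # verification is off unless enabled explicitly
    'robots': True,               # keep respecting robots.txt
    'trim_values': True,          # strip leading and trailing whitespace
}


def build_scraper(url, config):
    """Create a Hodor instance with the options above (hypothetical helper)."""
    # Imported here so the options dict can be inspected without hodor installed.
    from hodor import Hodor
    # Assumption: every option is accepted as a keyword argument.
    return Hodor(url=url, config=config, **HODOR_OPTIONS)
```

Keeping the options in one dict makes it harder to forget ssl_verify when creating several scrapers.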
- By default, any key in the config is a rule to parse.
    - Each rule can be either an xpath or a css selector.
    - Each rule extracts many values by default unless many is explicitly set to False.
    - Each rule can transform the result with a function, if one is provided.
- Extra parameters include grouping (_groups) and pagination (_paginate_by); the pagination entry also follows the rule format.
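
To make the rule options and grouping concrete, here is a minimal pure-Python sketch of the behaviour they describe. It is not Hodor's implementation. apply_rule and group_records are hypothetical helpers, the raw date strings are made up for illustration, and the sketch assumes that transform is applied to each value and that a group zips the many-valued rules row by row.

```python
from datetime import datetime


def apply_rule(values, many=True, transform=None):
    """Apply the optional transform to each value; return one value if many is False."""
    if transform is not None:
        values = [transform(v) for v in values]
    if many:
        return values
    return values[0] if values else None


def group_records(extracted, groups):
    """Zip the per-rule value lists into one list of dicts per group."""
    result = {}
    for name, fields in groups.items():
        if fields == '__all__':
            fields = sorted(extracted)
        columns = [extracted[field] for field in fields]
        result[name] = [dict(zip(fields, row)) for row in zip(*columns)]
    return result


# Hypothetical raw values, as if already selected from three table columns.
extracted = {
    'old_symbol': apply_rule(['AA', 'AA$']),
    'new_symbol': apply_rule(['ARNC', 'ARNC$']),
    'effective_date': apply_rule(
        ['11/1/2016', '11/1/2016'],
        transform=lambda s: datetime.strptime(s, '%m/%d/%Y'),
    ),
}

grouped = group_records(extracted, {
    'data': '__all__',
    'ticker_changes': ['old_symbol', 'new_symbol'],
})
```

Under these assumptions, grouped['data'] has the same record shape as the example output, and grouped['ticker_changes'] carries only the two symbol fields.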