- extraction: add heuristics (#173)
- maintenance: explicitly support Python 3.13 (#172)
- tests: better coverage (#175)
- docs: update images and contributing (#180)
- maintenance: explicit re-export and code quality (#168)
- setup: remove pytest.ini (#167)
- update dependencies
- fix: more robust copyright parsing (#165)
- cleaning fix: safer element removal (2735620)
- focus on Python 3.8+, use pyproject.toml file and update setup (#150, #153, #160)
- revamp tests and evaluation (#151)
- simplify code parts (#152)
- docs: convert readme to markdown (#147)
- fix: more restrictive YYYYMM pattern to prevent ValueError with @b3n4kh (#145)
- maintenance: add pre-commit with checks with @nadasuhailAyesh12 (#142)
- change license to Apache 2.0 (#140)
- compile XPath expressions (#136)
- update docs with @EkaterineSheshelidze (#135)
- fix meta property updated vs. original behavior (#121)
- support for LXML version 5.0+ (#127)
- fix image links in Readme
- fix for MacOS: pin LXML dependency with @adamh-oai
- focus on precision, stricter extraction patterns (#103, #105, #106, #112)
- simplified code base (#108, #109)
- replaced lxml.html.Cleaner (#104)
- extended evaluation
- fix for missing months keys in custom extractor (#100)
- fix for None in `try_date_expr()` (#101)
- fix regression for fast extraction introduced in e8b3538 (#96)
- fix setup by making backports-datetime-fromisoformat optional (#95)
- slightly higher accuracy with revised heuristics
- simplified code structure for better performance
- setup: support for 3.12, fromisoformat backport if applicable
- HTML parsing fixes: more lenient parsing, pinned LXML version for MacOS
- maintenance release: upgrade `urllib3` dependency
- support min_date/max_date as datetimes or datetime strings with @kernc (#73)
- add date attributes to HTML extraction with @kernc (#74)
- fix for extraction of updated and original dates in time elements
- code refactoring and maintenance
- better coverage of relevant HTML attributes
- automatically define upper time bound at each function call (#70)
- reviewed and simplified extraction code
- cache validation for format diverging from `%Y-%m-%d`
- updated dependencies and removed real-world tests from package
- additional search of free text in whole document (#67)
- optional parameter for subdaily precision with @getorca (#66)
- fix for HTML doctype parsing (#44)
- cleaner code for multilingual month expressions
- extended expressions for extraction in HTML meta fields
- update of dependencies and evaluation
- technical release: explicit support for Python 3.11 and logo
- fix for use of `min_date` & `max_date` (#62)
- simplified code & updated setup
- entirely type-checked code base
- new function `clear_caches()` (#57)
- slightly more efficient code (about 5% faster)
- fix for memory leak (#56)
- docs updated
- slightly higher accuracy & faster extensive extraction
- maintenance: code base simplified, more tests
- bugs addressed: #51, #54
- docs: fix by @MSK1582
- speed and accuracy gains
- better extraction coverage, simpler code
- bug fixed (typo in variable)
- better performance
- remove unnecessary ciso8601 dependency
- temporary fix for scrapinghub/dateparser#1045 bug
- bugfix: input encoding
- improved extraction coverage (#47)
- better handling of file encodings
- slight increase in accuracy, more efficient code
- maintenance release, code base cleaned
- command-line interface: `--version` added
- file parsing reviewed
- faster and more accurate encoding detection
- simplified code base
- added support for Python 3.10 and dropped support for Python 3.5
- improved generic date parsing (thanks @RadhiFadlillah)
- specific support for French and Indonesian (thanks @RadhiFadlillah)
- additional evaluation for English news sites (kudos to @coreydockser & @rahulbot)
- bugs fixed
- improved exhaustive search
- simplified code
- bug fixes
- removed support for Python 3.4
- bugfixes
- `dateparser` and `regex` modules fully integrated
- patterns added for coverage
- smarter HTML doc loading
- dependencies updated and reduced: switch from `requests` to bare `urllib3`, make `chardet` standard and `cchardet` optional
- fixes: downloads, `OverflowError` in extraction
- compatibility with Python 3.9
- better speed and accuracy
- technical release: package requirements and docs wording
- code base and performance improved
- minimum date available as option
- support for Turkish patterns and CMS idiosyncrasies (thanks @evolutionoftheuniverse)
- more efficient code
- additional evaluation data
- performance and documentation improved
- code base restructured
- bugs fixed and further tests
- restored retro-compatibility with Python 3.4
- reduced number of package dependencies
- introduced and tested optional dependencies
- more detailed documentation on readthedocs
- tests on Windows
- compatibility and code linting
- tests on Linux & MacOS
- bugs removed
- manually set maximum date
- better precision
- temporarily dropped support for Python 3.4
- coverage extension
- small bugs and coverage issues removed
- streamlined utils
- documentation added
- bugs corrected and cleaner code
- more errors caught and better test coverage
- significant speed-up after code profiling
- better support of free text detection (DE/EN)
- fixed lxml dependency
- reordered XPath-expressions
- refined and combined XPath-expressions
- better extraction of dates in free text
- better coverage and consistency issues solved
- improved consistency and further tests
- improvements in markup analysis along with more tests
- higher resolution for free text detection (e.g. DD/MM/YY)
- download mode (serial on command-line)
- better code consistency
- tested for Python 2 and 3 with tox and coverage stats
- refined date comparisons
- debug and logging options
- more tests and test files
- extensive search can be disabled
- refined targeting of HTML structure
- better extraction logic for plain text cases
- further tests
- better extraction
- logging
- further tests
- settings
- test functions (tox and pytest)
- retro-compatibility (Python 2)
- minor improvements
- minimum viable package