Releases: icaropires/pdf2dataset

v0.5.3

13 Sep 04:37
b070d65

Changes

  • Fix problems with custom tasks (#21)
  • Add tests for custom tasks subclass
  • Improve tests organization

v0.5.2

08 Sep 20:18
bfde2d1

Changes

  • @features decorator is more flexible and no longer requires pyarrow_type for helper features #19
  • @features decorator now supports receiving arguments to be passed to pyarrow_type #20
  • Fixed the total number of tasks shown when resuming processing #18
  • Don't use any kind of pool when num_cpus=1
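
As an illustration of the last item, here is a minimal sketch of bypassing worker pools when only one CPU is requested. It is not pdf2dataset's actual code: `run_tasks` is a hypothetical helper, and the standard-library `multiprocessing.Pool` stands in for whatever pool the library uses. Only the `num_cpus` name comes from the release note.

```python
from multiprocessing import Pool


def run_tasks(func, items, num_cpus=1):
    """Apply func to every item, skipping the pool when num_cpus == 1.

    Hypothetical helper for illustration only; not part of pdf2dataset.
    """
    if num_cpus == 1:
        # Serial path: no worker processes are started, so nothing has to
        # be pickled and tracebacks point directly at the failing call.
        return [func(item) for item in items]

    # Parallel path: func must be picklable (e.g. a module-level function).
    with Pool(processes=num_cpus) as pool:
        return pool.map(func, items)
```

With `num_cpus=1` the call behaves like a plain loop, which also makes debugging with breakpoints simpler.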

Known issues

  • Problem with memoization when implementing a custom tasks class #21

v0.5.1

01 Sep 03:43
c3f3472

Changes

  • Add PIL.Image.DecompressionBombError to expected exceptions when extracting images #15
  • Progress bar now shows before the first results arrive #14
  • Progress bar now shows skipped files

Rework base structure

23 Aug 21:42
c3f3472

This release was renamed from v4.0.0, which was the wrong version number.

Changes

  • Refactor most of the code structure (including tests) so that it scales with the number of extracted features
  • Add support for specifying custom features through inheritance
  • Add image feature
  • Support multiple parameters to customize text and image extraction (image size, OCR image size, image format, etc.)
  • Update dependencies
  • General small fixes

Known issues

  • Saving progress had to be disabled for this release (#8). It will be fixed in the next one

v0.5.0

29 Aug 03:34

Changes

  • Rework "resuming progress" feature
  • Add support for receiving a list of files to be processed
  • Improve code quality

Fix performance

24 Aug 03:47

Changes

  • v0.4.0 caused some performance problems; this release fixes them

Small improvements

02 Aug 01:30

Changes

  • Raise exception for invalid input_dir
  • Add maximum chunksize default constraint
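
For readers unfamiliar with the chunksize constraint, the idea can be sketched as below. This is an assumption-laden illustration, not the library's code: `pick_chunksize` and the cap of 100 are hypothetical, chosen only to show how a default maximum bounds the number of tasks handed to a worker at once.

```python
import math


def pick_chunksize(num_tasks, num_cpus, max_chunksize=100):
    """Split tasks evenly across CPUs, but never exceed max_chunksize.

    Hypothetical helper; the default cap of 100 is an assumed value,
    not the one pdf2dataset uses.
    """
    if num_tasks <= 0:
        return 1

    # Even share of the work per CPU, rounded up so no task is left out.
    even_share = math.ceil(num_tasks / num_cpus)

    # The cap keeps each chunk small, which bounds how many results a
    # worker holds in memory before handing them back.
    return max(1, min(even_share, max_chunksize))
```

Without the cap, a very large input would produce very large chunks, so results would arrive rarely and in big batches.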

Bug Fix

29 Jul 02:08

Changes

  • Raise exceptions for invalid page numbers when specifying tasks

New features

29 Jul 01:28

Changes

  • Add ability to pass specific tasks to be calculated
  • Add ability to return a list instead of a pandas DataFrame

Fix High Memory Usage

27 Jul 22:26

Changes

  • Fix high memory usage caused by the last release #3
  • Chunksize was not actually being calculated