Releases: webrecorder/browsertrix-crawler

Browsertrix Crawler 0.5.0 Beta 8

23 Mar 01:08
Pre-release

This release includes the following fixes:

  • Improved capture of non-HTML pages, fixes #129
  • For scopeType: domain, if the specified URL starts with www., also include the non-www version in scope.
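
The www. handling for scopeType: domain can be pictured as building one scope pattern from the seed URL. The following is only a sketch of that idea in Python, not the crawler's actual code: the helper name `domain_scope_regex` is hypothetical, and matching subdomains as well as both http and https is an assumption about how a domain scope behaves.

```python
import re
from urllib.parse import urlsplit


def domain_scope_regex(seed_url: str) -> re.Pattern:
    """Build a regex for http/https URLs on the seed's domain (hypothetical helper)."""
    host = urlsplit(seed_url).hostname or ""
    # Drop a leading "www." so the bare (non-www) domain is in scope too.
    if host.startswith("www."):
        host = host[len("www."):]
    # Optional subdomain prefix, the escaped domain, then the end of the host part.
    return re.compile(r"^https?://([^/]+\.)?" + re.escape(host) + r"(?:[:/?#]|$)")


scope = domain_scope_regex("https://www.example.com/start")
print(bool(scope.match("https://example.com/page")))     # non-www version
print(bool(scope.match("https://www.example.com/page")))  # original www version
```

Anchoring the pattern at the end of the host keeps look-alike hosts such as example.com.evil.org out of scope.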

Browsertrix Crawler 0.5.0 Beta 7

18 Mar 18:50
Pre-release

This beta includes the following fixes:

  • Refactor Chrome args and disable LazyFrameLoading to avoid page.goto() never finishing.
  • Fix userAgent customization not working, #90
  • Fix possible Cloudflare wait, #110
  • Tweak profile creation, support running with pywb proxy
  • Update wacz dependency to 0.4.4

Browsertrix Crawler 0.5.0 Beta 6

14 Mar 21:45
12d96f2
Pre-release

Fixes include:

  • Fix a regression introduced in the previous release, where the check for ERR_ABORTED could cause a null exception.
  • Support for downloading profiles via a URL, e.g. --profile https://example.com/path/to/profile.tar.gz

Browsertrix Crawler 0.5.0 Beta 5

14 Mar 18:15
ab096cd
Pre-release
  • Support for saving state incrementally when saveState: always is set, saving every saveStateInterval seconds, keeping the last saveStateHistory states.
  • Make direct capture apply only to 200 responses; load all others (e.g. redirects) via the browser.
  • Print just the error message, not the stack trace, and ignore ERR_ABORTED caused by trying to load a PDF (the file cannot be loaded as a page but is still archived).
  • When writing pages, ensure previous page write is awaited.
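
The incremental saving described above (save every saveStateInterval seconds, keep the last saveStateHistory states) amounts to a timed, bounded history. A minimal in-memory sketch in Python follows. The class `StateSaver` and its parameters are hypothetical stand-ins for the two options, and the real crawler persists state to storage rather than keeping it in a list.

```python
import time
from collections import deque


class StateSaver:
    """Keep periodic snapshots of crawl state (illustrative sketch only)."""

    def __init__(self, interval_secs, history, clock=time.monotonic):
        self.interval_secs = interval_secs   # stands in for saveStateInterval
        self.saved = deque(maxlen=history)   # stands in for saveStateHistory
        self.clock = clock
        self.last_saved = None

    def maybe_save(self, state):
        """Snapshot the state if the interval has elapsed; return True if saved."""
        now = self.clock()
        if self.last_saved is not None and now - self.last_saved < self.interval_secs:
            return False
        # deque(maxlen=...) silently drops the oldest snapshot when full.
        self.saved.append((now, dict(state)))
        self.last_saved = now
        return True
```

Passing the clock in as a parameter is a design choice for this sketch: it lets the interval logic be exercised without actually waiting.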

Browsertrix Crawler 0.5.0 Beta 4

07 Mar 17:30
Pre-release
  • Update to py-wacz 0.4.3, which is more tolerant of pages with invalid full-text search data (skips such pages instead of failing WACZ creation)
  • Support for scopeType: domain; include both http and https pages in scope by default

Browsertrix Crawler 0.5.0 Beta 3

02 Mar 21:30
e160382
Pre-release

Various fixes, including:

  • Screencasting refactor, support screencast via redis, add new 'init' message
  • Support for retrying pending URLs after a limited amount of time
  • Redis: load queues gracefully to avoid large redis data load
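
Retrying pending URLs after a limited amount of time can be sketched as a queue that remembers when each URL was handed out and re-queues any that stay pending too long. The Python below only illustrates that idea: `PendingQueue` and `pending_timeout_secs` are hypothetical names, and the crawler's Redis-backed implementation is not shown in these notes.

```python
import time
from collections import deque


class PendingQueue:
    """URL queue that re-queues URLs stuck in the pending state (sketch only)."""

    def __init__(self, pending_timeout_secs, clock=time.monotonic):
        self.timeout = pending_timeout_secs
        self.clock = clock
        self.queue = deque()   # URLs waiting to be crawled
        self.pending = {}      # url -> time it was handed to a worker

    def add(self, url):
        self.queue.append(url)

    def next_url(self):
        """Return the next URL to crawl, or None if nothing is waiting."""
        self.requeue_stale()
        if not self.queue:
            return None
        url = self.queue.popleft()
        self.pending[url] = self.clock()
        return url

    def mark_done(self, url):
        self.pending.pop(url, None)

    def requeue_stale(self):
        """Move URLs pending for at least the timeout back onto the queue."""
        now = self.clock()
        stale = [u for u, started in self.pending.items() if now - started >= self.timeout]
        for url in stale:
            del self.pending[url]
            self.queue.append(url)
        return stale
```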

Browsertrix Crawler 0.5.0 Beta 2

27 Jan 01:32
66ce668
Pre-release

Add support for WACZ signing (experimental), enabled via WACZ_SIGN_URL and WACZ_SIGN_TOKEN env vars.

Browsertrix Crawler 0.5.0 Beta 1

23 Nov 21:01
9f541ab
Pre-release

Support for uploading WACZ to S3-compatible storage!

Browsertrix Crawler 0.5.0 Beta 0

25 Sep 17:10
Pre-release

Initial Build of 0.5.0 beta for testing!

Browsertrix Crawler 0.4.4

18 Aug 04:28

This release includes fixes to the block rules system and README improvements:

  • Page Block Rules Fix: Avoid 'request already handled' errors by not adding duplicate handlers to the same page.
  • Page Block Rules Fix: await all continue/abort() calls and catch errors.
  • Page Block Rules: Don't apply to top-level page, print warning and recommend scope rules instead.
  • Setup: Attempt to create the crawl working directory (cwd) specified via --cwd if it doesn't exist.
  • Scope Types: Rename 'none' -> 'page' (single page only) and 'page' -> 'page-spa' (page with hashtags).
  • README: Add more scope rule examples, clarify distinction between scope rules and block rules.
  • README: Update old type -> scopeType, list new scope types.
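
Because the old name 'page' is reused with a new meaning, the scope type rename is easiest to reason about as a single lookup rather than two sequential replacements. A small Python sketch follows. The helper `migrate_scope_type` is hypothetical and not part of the crawler, and it should be applied only once, to values written for releases before 0.4.4.

```python
# Old scope type name -> new name, as listed in the 0.4.4 notes.
RENAMED_SCOPE_TYPES = {
    "none": "page",      # single page only
    "page": "page-spa",  # page with hashtags
}


def migrate_scope_type(old_value: str) -> str:
    """Map a pre-0.4.4 scope type to its new name; other values pass through."""
    # A single dict lookup avoids turning 'none' into 'page' and then 'page-spa'.
    return RENAMED_SCOPE_TYPES.get(old_value, old_value)


print(migrate_scope_type("none"))  # page
print(migrate_scope_type("page"))  # page-spa
```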