Releases: webrecorder/browsertrix-crawler

Browsertrix Crawler 1.1.0 Beta 5

15 Apr 21:53
efebc33
Pre-release

What's Changed

  • Separate writing pages to pages.jsonl + extraPages.jsonl to use with new py-wacz by @ikreymer in #535
  • Adblock support by @ikreymer in #534
  • Remove no longer needed invalid Brave update URLs by @tw4l in #539
  • Better logging of all queue WARCWriter operations by @ikreymer in #536
  • qa: filter out non-html pages by @ikreymer in #541
  • Fix for --rolloverSize for individual WARCs in 1.x by @ikreymer in #542
  • Set mime type for html pages by @tw4l in #545
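
As a rough illustration of how the split page lists from #535 might be consumed downstream, here is a minimal sketch. It assumes both files are JSON Lines (one JSON object per line), that page entries carry a `url` field, and that any header line does not. The field names and sample records are assumptions made for illustration, not taken from this release.

```python
import json
from typing import Iterable


def read_page_records(lines: Iterable[str]) -> list[dict]:
    """Parse JSON Lines text, keeping only records that look like page entries."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Assumption: page entries have a "url" key; a header line does not.
        if isinstance(record, dict) and "url" in record:
            records.append(record)
    return records


# Hypothetical sample content standing in for pages.jsonl and extraPages.jsonl.
pages_text = '{"format": "json-pages-1.0", "id": "pages"}\n{"url": "https://example.com/"}\n'
extra_text = '{"url": "https://example.com/about"}\n'

seed_pages = read_page_records(pages_text.splitlines())
extra_pages = read_page_records(extra_text.splitlines())
all_urls = [p["url"] for p in seed_pages + extra_pages]
```

Reading the two files separately keeps the distinction between the two page lists available to the caller, which can still merge them as shown in the last line.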

Full Changelog: v1.1.0-beta.4...v1.1.0-beta.5

v1.1.0-beta.4

05 Apr 01:14
c247189
Pre-release

What's Changed

  • Gracefully handle non-absolute path for create-login-profile --filename by @tw4l in #521
  • refactor handling of max size for html/js/css by @ikreymer in #525
  • merge V1.0.4 change -> main: by @ikreymer in #527
  • ensure all warcwriter write operations go through a queue. by @ikreymer in #528
  • qa/replay crawl loading improvements by @ikreymer in #526
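
The idea behind routing every write through a queue (#528) can be sketched in general terms as follows. This is an illustrative Python `asyncio` example, not the crawler's TypeScript implementation; the `QueuedWriter` class and its method names are hypothetical. A single worker task drains the queue, so concurrent callers never interleave their writes on the underlying stream.

```python
import asyncio
import io


class QueuedWriter:
    """Serialize writes to one stream by funnelling them through a queue."""

    def __init__(self, stream: io.BytesIO) -> None:
        self._stream = stream
        self._queue: asyncio.Queue = asyncio.Queue()
        self._worker = None

    def start(self) -> None:
        # One worker task is the only code that ever touches the stream.
        self._worker = asyncio.create_task(self._drain())

    async def _drain(self) -> None:
        while True:
            chunk = await self._queue.get()
            try:
                if chunk is None:  # sentinel: stop draining
                    return
                self._stream.write(chunk)
            finally:
                self._queue.task_done()

    async def write(self, chunk: bytes) -> None:
        await self._queue.put(chunk)

    async def close(self) -> None:
        # Queue the sentinel behind all pending writes, then wait for the worker.
        await self._queue.put(None)
        await self._worker


async def demo() -> bytes:
    stream = io.BytesIO()
    writer = QueuedWriter(stream)
    writer.start()
    await asyncio.gather(*(writer.write(f"record-{i}\n".encode()) for i in range(3)))
    await writer.close()
    return stream.getvalue()


output = asyncio.run(demo())
```

Because `close()` enqueues its sentinel after all earlier writes, everything queued before it is flushed to the stream before the worker exits.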

Full Changelog: v1.1.0-beta.3...v1.1.0-beta.4

Browsertrix Crawler v1.0.4

03 Apr 22:23
a3f93ca

What's Changed

  • refactor handling of max size for html/js/css by @ikreymer in #525
    Fix for #522, issues loading pages with large streaming js/css

Full Changelog: v1.0.3...v1.0.4

Browsertrix Crawler 1.1.0 Beta 3 (QA Support)

29 Mar 00:21

What's Changed

  • Use RFC2606 invalid domain names by @vnznznz in #514
  • Fixes from 1.0.3 release -> main by @ikreymer in #517
  • Unify WARC writing + CDXJ indexing into single class by @ikreymer in #507
  • upgrade puppeteer-core to 22.6.1 by @ikreymer in #516
  • avoid cloudflare detection of puppeteer when using browser profiles: by @ikreymer in #518
  • add an extra --postLoadDelay param to specify how many seconds to wait after page-load by @ikreymer in #520
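
For context on #514: RFC 2606 reserves the top-level domains `.test`, `.example`, `.invalid` and `.localhost`, plus the second-level names `example.com`, `example.net` and `example.org`, so that they can never collide with real hosts. The sketch below shows one way to check whether a hostname falls in that reserved set. The function name is hypothetical and the code is not taken from the crawler.

```python
# Names reserved by RFC 2606 for testing and documentation.
RESERVED_TLDS = {"test", "example", "invalid", "localhost"}
RESERVED_SECOND_LEVEL = {"example.com", "example.net", "example.org"}


def is_rfc2606_reserved(hostname: str) -> bool:
    """Return True if the hostname is under an RFC 2606 reserved name."""
    # Normalize case and drop a trailing root dot before splitting into labels.
    labels = hostname.lower().rstrip(".").split(".")
    if labels[-1] in RESERVED_TLDS:
        return True
    return ".".join(labels[-2:]) in RESERVED_SECOND_LEVEL
```

For example, `is_rfc2606_reserved("updates.invalid")` is true, while a registrable name such as `webrecorder.net` is not reserved.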

Full Changelog: v1.1.0-beta.2...v1.1.0-beta.3

Browsertrix Crawler 1.0.3

26 Mar 21:11

What's Changed

  • fixes redirected seed (from #475) being counted against page limit: by @ikreymer in #509
  • sitemap improvements: gz support + application/xml + extraHops fix by @ikreymer in #511
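
To illustrate what gzip support for sitemaps (#511) involves in general, here is a minimal sketch that accepts either a plain or a gzip-compressed sitemap body and extracts its `<loc>` entries. It detects gzip by its two-byte magic number rather than by file extension or content type. The function name and sample sitemap are assumptions for illustration, not the crawler's implementation.

```python
import gzip
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
GZIP_MAGIC = b"\x1f\x8b"


def extract_sitemap_urls(body: bytes) -> list[str]:
    """Return all <loc> values from a sitemap, decompressing gzip if needed."""
    if body[:2] == GZIP_MAGIC:
        body = gzip.decompress(body)
    root = ET.fromstring(body)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


# Hypothetical two-entry sitemap used only for this example.
sample = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://example.com/</loc></url>"
    b"<url><loc>https://example.com/docs</loc></url>"
    b"</urlset>"
)

plain_urls = extract_sitemap_urls(sample)
gzipped_urls = extract_sitemap_urls(gzip.compress(sample))
```

Both calls yield the same list of URLs, since the compressed body is unpacked before parsing.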

Full Changelog: v1.0.2...v1.0.3

Browsertrix Crawler 1.1.0 Beta 2 (QA Crawl Support Beta)

23 Mar 05:11

What's Changed

  • Docs: Minor fixes to edit link & clarifications by @Shrinks99 in #501
  • Improved support for running as non-root by @ikreymer in #503
  • improvements to 'non-graceful' interrupt to ensure WARCs are still closed gracefully by @ikreymer in #504
  • service worker capture fix: disable by default for now by @ikreymer in #506
  • QA Crawl Support (Beta) by @ikreymer in #469

Full Changelog: v1.1.0-beta.1...v1.1.0-beta.2

Browsertrix Crawler 1.0.2

22 Mar 20:38
22a7351

What's Changed

  • service worker capture fix: disable service workers by default for now, add cli option by @ikreymer in #506

Full Changelog: v1.0.1...v1.0.2

Browsertrix Crawler 1.0.1

21 Mar 20:58
93c3894

What's Changed

  • Docs: Minor fixes to edit link & clarifications by @Shrinks99 in #501
  • Improved support for running as non-root by @ikreymer in #503
  • improvements to 'non-graceful' interrupt to ensure WARCs are still closed gracefully by @ikreymer in #504

Full Changelog: v1.0.0...v1.0.1

Browsertrix Crawler 1.1.0 Beta 1 (QA Support)

20 Mar 05:33

What's Changed

  • Merge Browsertrix Crawler 1.0.0 release!

Full Changelog: v1.1.0-beta.0...v1.1.0-beta.1

Browsertrix Crawler 1.0.0

19 Mar 17:58

Browsertrix Crawler 1.0.0 Release

  • New capture mechanism via Chrome Debug Protocol, instead of pywb
  • Updated mkdocs (hosted at: https://crawler.docs.browsertrix.com/)
  • Customizable WARC filenames
  • Improved log filtering
  • Conversion to TypeScript
  • Support for pageinfo: records per page.
  • Optimized Sitemap parsing

Full Changelog: v0.12.4...v1.0.0