Releases: webrecorder/browsertrix-crawler
Browsertrix Crawler 1.1.0 Beta 5
What's Changed
- Separate writing pages to pages.jsonl + extraPages.jsonl to use with new py-wacz by @ikreymer in #535
- Adblock support by @ikreymer in #534
- Remove no longer needed invalid Brave update URLs by @tw4l in #539
- Better logging of all queue WARCWriter operations by @ikreymer in #536
- qa: filter out non-html pages by @ikreymer in #541
- Fix for --rolloverSize for individual WARCs in 1.x by @ikreymer in #542
- Set mime type for html pages by @tw4l in #545
Full Changelog: v1.1.0-beta.4...v1.1.0-beta.5
v1.1.0-beta.4
What's Changed
- Gracefully handle non-absolute path for create-login-profile --filename by @tw4l in #521
- refactor handling of max size for html/js/css by @ikreymer in #525
- merge V1.0.4 change -> main: by @ikreymer in #527
- ensure all warcwriter write operations go through a queue. by @ikreymer in #528
- qa/replay crawl loading improvements by @ikreymer in #526
Full Changelog: v1.1.0-beta.3...v1.1.0-beta.4
Browsertrix Crawler v1.0.4
What's Changed
- refactor handling of max size for html/js/css by @ikreymer in #525
Fix for #522, issues loading pages with large streaming js/css
Full Changelog: v1.0.3...v1.0.4
Browsertrix Crawler 1.1.0 Beta 3 (QA Support)
What's Changed
- Use RFC2606 invalid domain names by @vnznznz in #514
- Fixes from 1.0.3 release -> main by @ikreymer in #517
- Unify WARC writing + CDXJ indexing into single class by @ikreymer in #507
- upgrade puppeteer-core to 22.6.1 by @ikreymer in #516
- avoid cloudflare detection of puppeteer when using browser profiles: by @ikreymer in #518
- add an extra --postLoadDelay param to specify how many seconds to wait after page-load by @ikreymer in #520
Full Changelog: v1.1.0-beta.2...v1.1.0-beta.3
Browsertrix Crawler 1.0.3
Browsertrix Crawler 1.1.0 Beta 2 (QA Crawl Support Beta)
What's Changed
- Docs: Minor fixes to edit link & clarifications by @Shrinks99 in #501
- Improved support for running as non-root by @ikreymer in #503
- improvements to 'non-graceful' interrupt to ensure WARCs are still closed gracefully by @ikreymer in #504
- service worker capture fix: disable by default for now by @ikreymer in #506
- QA Crawl Support (Beta) by @ikreymer in #469
New Contributors
- @Shrinks99 made their first contribution in #501
Full Changelog: v1.1.0-beta.1...v1.1.0-beta.2
Browsertrix Crawler 1.0.2
What's Changed
- service worker capture fix: disable service workers by default for now, add cli option by @ikreymer in #506
Full Changelog: v1.0.1...v1.0.2
Browsertrix Crawler 1.0.1
What's Changed
- Docs: Minor fixes to edit link & clarifications by @Shrinks99 in #501
- Improved support for running as non-root by @ikreymer in #503
- improvements to 'non-graceful' interrupt to ensure WARCs are still closed gracefully by @ikreymer in #504
New Contributors
- @Shrinks99 made their first contribution in #501
Full Changelog: v1.0.0...v1.0.1
Browsertrix Crawler 1.1.0 Beta 1 (QA Support)
What's Changed
- Merge Browsertrix Crawler 1.0.0 release!
Full Changelog: v1.1.0-beta.0...v1.1.0-beta.1
Browsertrix Crawler 1.0.0
Browsertrix Crawler 1.0.0 Release
- New capture mechanism via Chrome Debug Protocol, instead of pywb
- Updated mkdocs (hosted at: https://crawler.docs.browsertrix.com/)
- Customizable WARC filenames
- Improved log filtering
- Conversion to TypeScript
- Support for pageinfo: records per page
- Optimized Sitemap parsing
What's Changed
- Use new browser-based archiving mechanism instead of pywb proxy by @ikreymer in #424
- TypeScript Conversion by @ikreymer in #425
- Add Prettier to the repo, and format all the files! by @emma-sg in #428
- follow-up to #428: update ignore files by @ikreymer in #431
- Raise size limit for large HTML pages by @ikreymer in #430
- logging: don't log filtered out direct fetch attempt as error by @ikreymer in #432
- Fix potential for pending list never being processed by @ikreymer in #433
- more specific types additions by @ikreymer in #434
- Add types + validation for log context options by @ikreymer in #435
- WARC filename prefix + rollover size + improved 'livestream' / truncated response support. by @ikreymer in #440
- detect invalid custom behaviors on load: by @ikreymer in #450
- Merge 0.12.3 into 1.0.0 by @ikreymer in #455
- Generate urn:pageinfo: records by @ikreymer in #458
- skipping resources: ensure HEAD, OPTIONS, 204, 206, and 304 response/request pairs are not written to WARC by @ikreymer in #460
- Add arg to write pages to Redis by @tw4l in #464
- Page Resources: Include Cached Resources by @ikreymer in #465
- Update Browser Image by @ikreymer in #466
- Misc Page Resource/Recorder Fixes by @ikreymer in #467
- Include resource type + mime type in page resources list by @ikreymer in #468
- Set warc prefix via WARC_PREFIX env var by @ikreymer in #470
- pageinfo: add console errors to pageinfo record, tracking in 'counts' field by @ikreymer in #471
- warcwriter: better filehandle init on first use by @ikreymer in #474
- Include WARC prefix for screenshots and text WARCs by @ikreymer in #473
- new seed on redirect + error page check: by @ikreymer in #476
- store page statusCode if not 200 by @ikreymer in #477
- Ensure links added via behaviors also get processed by @ikreymer in #478
- Fail on status code option + requeue fix by @ikreymer in #480
- warc: add Network.resourceType (https://chromedevtools.github.io/devt… by @ikreymer in #481
- resourceType lowercase fix: by @ikreymer in #483
- Dev 1.0.0 -> Main by @ikreymer in #482
- Better tracking of failed requests + logging context exclude by @ikreymer in #485
- page state type fixes: by @ikreymer in #488
- Additional type fixes, follow-up to #488 by @ikreymer in #489
- Fix Save/Load State by @ikreymer in #495
- Add MKDocs documentation site for Browsertrix Crawler 1.0.0 by @tw4l in #494
- Temporarily disable tmp-cdx creation by @tw4l in #499
- profiles: handle terminate signals directly by @ikreymer in #500
- SAX-based sitemap parser by @ikreymer in #497
Full Changelog: v0.12.4...v1.0.0