Releases: apify/crawlee

v0.20.2

09 Mar 17:06
  • Fix an error where persistence of SessionPool would fail if a cookie included an invalid
    expires value.
  • Skipping one patch version (0.20.1) because of an error in publishing via CI.

v0.20.0

03 Mar 13:06
  • BREAKING: Apify.utils.requestAsBrowser() no longer aborts the request on status code 406
    or when a content type other than text/html is received. Use options.abortFunction if you
    want to retain this functionality.
  • BREAKING: Added useInsecureHttpParser option to Apify.utils.requestAsBrowser() which
    is true by default and forces the function to use an HTTP parser that is less strict than
    the default Node 12 parser, but also less secure. It is needed to bypass certain
    anti-scraping walls and fetch websites that do not comply with the HTTP spec.
  • BREAKING: RequestList now removes all the elements from the sources array on
    initialization. If you need to use the sources somewhere else, make a copy. This change
    was added as one of several measures to improve memory management of RequestList
    in scenarios with a very large number of Request instances.
  • DEPRECATED: RequestListOptions.persistSourcesKey is now deprecated. Please use
    RequestListOptions.persistRequestsKey.
  • RequestListOptions.sources can now be an array of string URLs as well.
  • Added sourcesFunction to RequestListOptions. It enables dynamic fetching of sources
    and will only be called if persisted Requests were not retrieved from key-value store.
    Use it to reduce memory spikes and also to make sure that your sources are not re-created
    on actor restarts.
  • Updated stealth hiding of webdriver to avoid recent detections.
  • Apify.utils.log now points to an updated logger instance which prints colored logs (in TTY)
    and supports overriding with custom loggers.
  • Improved Apify.launchPuppeteer() code to avoid triggering bugs in Puppeteer by no longer
    passing more options than required to puppeteer.launch().
  • Documented BasicCrawler.autoscaledPool property, and added CheerioCrawler.autoscaledPool
    and PuppeteerCrawler.autoscaledPool properties.
  • SessionPool now persists state on teardown. Before, it only persisted state every minute.
    This ensures that after a crawler finishes, the state is correctly persisted.
  • Added TypeScript typings and typedef documentation for all entities used throughout the SDK.
  • Upgraded proxy-chain NPM package from 0.2.7 to 0.4.1 and many other dependencies.
  • Removed all usage of the now deprecated request package.
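
The requestAsBrowser() and RequestList changes in this release could be handled roughly as in the sketch below. It is not taken from the SDK. It assumes that options.abortFunction receives the response object (with statusCode and headers) and returns true to abort, and that sourcesFunction may be an async function returning an array of URLs. The names abortOnNonHtml, drainSources and buildSources are hypothetical, and treating a missing Content-Type as non-HTML is a choice made here, not documented behavior.

```javascript
// 1) requestAsBrowser(): approximate the pre-0.20.0 abort rules.
// Assumption: the function gets the response and returns true to abort.
function abortOnNonHtml(response) {
    const { statusCode, headers = {} } = response;
    if (statusCode === 406) return true;
    // The header may carry parameters, e.g. "text/html; charset=utf-8".
    const contentType = String(headers['content-type'] || '');
    const mimeType = contentType.split(';')[0].trim().toLowerCase();
    // A missing Content-Type is treated as non-HTML here (a choice of this sketch).
    return mimeType !== 'text/html';
}

// 2) RequestList: keep a shallow copy, because the original array is emptied.
const sources = ['https://example.com/a', 'https://example.com/b'];
const sourcesCopy = [...sources];

// Hypothetical stand-in for what RequestList initialization does to the array.
function drainSources(list) {
    return list.splice(0, list.length);
}
drainSources(sources);

// 3) sourcesFunction: build the sources lazily instead of holding them upfront.
function buildSources(pageCount) {
    const urls = [];
    for (let page = 1; page <= pageCount; page++) {
        urls.push(`https://example.com/list?page=${page}`);
    }
    return urls;
}
// Assumed shape: an async function that resolves to an array of URL strings.
const sourcesFunction = async () => buildSources(3);
```

In real code, abortOnNonHtml would be passed as options.abortFunction and sourcesFunction as RequestListOptions.sourcesFunction; only the copy in sourcesCopy stays usable after the list has been initialized.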

v0.19.1

30 Jan 16:13
  • BREAKING (EXPERIMENTAL): session.checkStatus() was replaced by session.retireOnBlockedStatusCodes().
  • Session API is no longer considered experimental.
  • Updates documentation and introduces a few internal changes.

v0.19.0

20 Jan 12:01
342c727
  • BREAKING: APIFY_LOCAL_EMULATION_DIR env var is no longer supported (deprecated on 2018-09-11).
    Use APIFY_LOCAL_STORAGE_DIR instead.
  • SessionPool API updates and fixes. The API is no longer considered experimental.
  • Logging of system info moved from require time to Apify.main() invocation.
  • Use native RegExp instead of xregexp for unicode property escapes.
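
For readers unfamiliar with unicode property escapes, here is a minimal sketch of what native RegExp offers without xregexp. The pattern and the extractWords helper are illustrative only; the SDK's own patterns are not shown in these notes.

```javascript
// \p{L} matches a letter from any script; the `u` flag is required
// for property escapes to be recognized by the native engine.
const LETTERS = /\p{L}+/gu;

// Hypothetical helper: returns all runs of letters found in the text.
function extractWords(text) {
    return text.match(LETTERS) || [];
}

const words = extractWords('Praha Москва 東京 2020');
```

Digits and whitespace are not letters, so "2020" is left out of the result.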

v0.18.1

08 Jan 08:19
db460f5
  • Fix SessionPool not automatically working in CheerioCrawler.
  • Fix incorrect management of page count in PuppeteerPool.

v0.18.0

06 Jan 12:16
343366d
  • BREAKING: CheerioCrawler now ignores SSL errors by default (options.ignoreSslErrors: true).
  • Add SessionPool implementation to CheerioCrawler.
  • Add SessionPool implementation to PuppeteerPool and PuppeteerCrawler.
  • Fix Request constructor not making a copy of objects such as userData and headers.
  • Fix desc option not being applied in local dataset.getData().

v0.17.0

25 Nov 16:02
  • BREAKING: Node 8 and 9 are no longer supported. Please use Node 10.17.0 or higher.
  • DEPRECATED: Apify.callTask() body and contentType options are now deprecated.
    Use input instead. It must be of content-type: application/json.
  • Add default SessionPool implementation to BasicCrawler.
  • Add the ability to create ad-hoc webhooks via Apify.call() and Apify.callTask().
  • Add an example of form filling with Puppeteer.
  • Add country option to Apify.getApifyProxyUrl().
  • Add Apify.utils.puppeteer.saveSnapshot() helper to quickly save the HTML and a screenshot of a page.
  • Add the ability to pass options supported by got to requestOptions in CheerioCrawler,
    thus supporting things such as cookieJar again.
  • Switch Puppeteer to web socket again due to suspected pipe errors.
  • Fix an issue where some encodings were not correctly parsed in CheerioCrawler.
  • Fix parsing bad Content-Type headers for CheerioCrawler.
  • Fix custom headers not being correctly applied in Apify.utils.requestAsBrowser().
  • Fix dataset limits not being correctly applied.
  • Fix a race condition in RequestQueueLocal.
  • Fix RequestList persistence of downloaded sources in key-value store.
  • Fix Apify.utils.puppeteer.blockRequests() always including default patterns.
  • Fix inconsistent behavior of Apify.utils.puppeteer.infiniteScroll() on some websites.
  • Fix retry histogram statistics sometimes showing invalid counts.
  • Added regexps for YouTube videos (YOUTUBE_REGEX, YOUTUBE_REGEX_GLOBAL) to utils.social.
  • Added documentation for the json option in handlePageFunction of CheerioCrawler.

v0.16.1

31 Oct 10:34
  • Add useIncognitoPages option to PuppeteerPool to enable opening new pages in incognito
    browser contexts. This is useful to keep cookies and cache unique for each page.
  • Added options to load every content type in CheerioCrawler.
    There are new options body and contentType in handlePageFunction for this purpose.
  • DEPRECATED: CheerioCrawler html option in handlePageFunction was replaced with body option.

v0.16.0

30 Sep 09:51
  • Update @apify/http-request to version 1.1.2.
  • Update CheerioCrawler to use requestAsBrowser() to better disguise as a real browser.

v0.15.5

19 Aug 07:46
  • This release just updates some dependencies (not Puppeteer).