Releases: apify/crawlee

v1.1.0

19 Mar 17:46
088959f

In this minor release we focused on the SessionPool. Besides fixing a few bugs, we added one important feature: setting and getting sessions by ID.

// Now you can add specific sessions to the pool,
// instead of relying on random generation.
await sessionPool.addSession({
    id: 'my-session',
    // ... some config
});

// Later, you can retrieve the session. This is useful
// for example when you need a specific login session.
const session = await sessionPool.getSession('my-session');

Full list of changes:

  • Add sessionPool.addSession() to add a new session to the session pool, optionally with the provided options (e.g. a specific session ID).
  • Add an optional sessionId parameter to sessionPool.getSession() to retrieve a session with a specific ID from the session pool.
  • Fix SessionPool not working properly in both PuppeteerCrawler and PlaywrightCrawler.
  • Fix Apify.call() and Apify.callTask() output to make it backwards compatible with previous versions of the client.
  • Improve handling of browser executable paths when using the official SDK Docker images.
  • Update browser-pool to fix issues with failing hooks causing browsers to get stuck in limbo.
  • Remove the proxy-chain dependency, as it is now covered by browser-pool.

v1.0.2

05 Mar 13:36
  • Add the ability to override ProxyConfiguration status check URL with the APIFY_PROXY_STATUS_URL env var.
  • Fix inconsistencies in cookie handling when SessionPool was used.
  • Fix TS types in multiple places. TS is still not a first-class citizen, but this should improve the experience.

v1.0.1

03 Feb 19:23
  • Fix dataset.pushData() validation, which would not allow anything other than plain objects.
  • Fix PuppeteerLaunchContext.stealth throwing when used in PuppeteerCrawler.

v1.0.0

25 Jan 19:25

After 3.5 years of rapid development and a lot of breaking changes and deprecations, here comes the result: Apify SDK v1. There were two goals for this release: stability, and support for more browsers, namely Firefox and WebKit (Safari).

The SDK has grown quite popular over the years, powering thousands of web scraping and automation projects. We think our developers deserve a stable environment to work in, so by releasing SDK v1, we commit to making breaking changes only once a year, with a new major release.

We added support for more browsers by replacing PuppeteerPool with browser-pool, a new library that we created specifically for this purpose. It builds on the ideas behind PuppeteerPool and extends them to support Playwright. Playwright is a browser automation library similar to Puppeteer. It works with all well-known browsers and uses almost the same interface as Puppeteer, while adding useful features and simplifying common tasks. Don't worry, you can still use Puppeteer with the new BrowserPool.

A large breaking change is that neither puppeteer nor playwright is bundled with SDK v1. To make the choice of a library easier and installs faster, users have to install the selected modules and versions themselves, e.g. npm install apify playwright. This also allows us to add support for even more libraries in the future.

Thanks to the addition of Playwright, we now have a PlaywrightCrawler. It is very similar to PuppeteerCrawler, and you can pick whichever you prefer. This also meant some interface changes: the launchPuppeteerFunction option of PuppeteerCrawler is gone, launchPuppeteerOptions were replaced by launchContext, and we moved things around in the handlePageFunction arguments. See the migration guide for a more detailed explanation and migration examples.
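For illustration, a handlePageFunction is a plain async function that receives a single context object, so it can be defined standalone and shared between crawlers. A minimal sketch, where the { request, page } property names follow these notes and the rest is an assumption (check the migration guide for the exact shape):

```javascript
// Sketch of a handlePageFunction usable with either PuppeteerCrawler or
// PlaywrightCrawler. Only { request, page } are taken from the notes above;
// other details are illustrative assumptions.
const handlePageFunction = async ({ request, page }) => {
    const title = await page.title();
    return { url: request.url, title };
};

// It would then be plugged into a crawler, e.g.:
// const crawler = new Apify.PlaywrightCrawler({ requestList, handlePageFunction });
```

Because it is just a function, it can also be unit-tested with a stubbed page object, independently of any browser.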

What's in store for SDK v2? We want to split the SDK into smaller libraries, so that everyone can install only the things they need. We plan a TypeScript migration to make crawler development faster and safer. Finally, we will take a good look at the interface of the whole SDK and update it to improve the developer experience. Bug fixes and scraping features will of course keep landing in versions 1.X as well.

Full list of changes:

  • BREAKING: Removed puppeteer from dependencies. If you want to use Puppeteer, you must install it yourself.
  • BREAKING: Removed PuppeteerPool. Use browser-pool.
  • BREAKING: Removed PuppeteerCrawlerOptions.launchPuppeteerOptions. Use launchContext.
  • BREAKING: Removed PuppeteerCrawlerOptions.launchPuppeteerFunction. Use PuppeteerCrawlerOptions.preLaunchHooks and postLaunchHooks.
  • BREAKING: Removed args.autoscaledPool and args.puppeteerPool from handle(Page/Request)Function arguments. Use args.crawler.autoscaledPool and args.crawler.browserPool.
  • BREAKING: The useSessionPool and persistCookiesPerSession options of crawlers are now true by default. Explicitly set them to false to override the behavior.
  • BREAKING: Apify.launchPuppeteer() no longer accepts LaunchPuppeteerOptions. It now accepts PuppeteerLaunchContext.
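Pre-launch hooks, the replacement for launchPuppeteerFunction, are plain async functions that can modify the launch context before a browser starts. A minimal sketch, assuming a (pageId, launchContext) hook signature modeled on browser-pool (verify against its docs):

```javascript
// Hypothetical pre-launch hook: runs before each browser launch and may
// mutate the launch context. The (pageId, launchContext) signature is an
// assumption based on browser-pool's hook model.
const forceHeadless = async (pageId, launchContext) => {
    launchContext.launchOptions = {
        ...launchContext.launchOptions,
        headless: true,
    };
};

// It would be registered as:
// new Apify.PuppeteerCrawler({ preLaunchHooks: [forceHeadless], /* ... */ });
```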

New deprecations:

  • DEPRECATED: PuppeteerCrawlerOptions.gotoFunction. Use PuppeteerCrawlerOptions.preNavigationHooks and postNavigationHooks.
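Navigation hooks are likewise plain async functions. A sketch of a pre-navigation hook that tweaks navigation options, assuming a (crawlingContext, gotoOptions) signature implied by the gotoFunction replacement (an assumption; consult the SDK docs for the exact shape):

```javascript
// Hypothetical pre-navigation hook: runs before page.goto() and may mutate
// the goto options. The (crawlingContext, gotoOptions) signature is an
// assumption based on these notes.
const extendNavigationTimeout = async (crawlingContext, gotoOptions) => {
    gotoOptions.timeout = 60000; // milliseconds
};

// Registered as:
// new Apify.PuppeteerCrawler({ preNavigationHooks: [extendNavigationTimeout], /* ... */ });
```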

Removals of earlier deprecated functions:

  • BREAKING: Removed Apify.utils.puppeteer.enqueueLinks(). Deprecated in 01/2019. Use Apify.utils.enqueueLinks().
  • BREAKING: Removed autoscaledPool.(set|get)MaxConcurrency(). Deprecated in 2019. Use autoscaledPool.maxConcurrency.
  • BREAKING: Removed CheerioCrawlerOptions.requestOptions. Deprecated in 03/2020. Use CheerioCrawlerOptions.prepareRequestFunction.

New features:

  • Added Apify.PlaywrightCrawler which is almost identical to PuppeteerCrawler, but it crawls with the playwright library.
  • Added Apify.launchPlaywright(launchContext) helper function.
  • Added browserPoolOptions to PuppeteerCrawler to configure BrowserPool.
  • Added crawler to handle(Request/Page)Function arguments.
  • Added browserController to handlePageFunction arguments.
  • Added crawler.crawlingContexts Map which includes all running crawlingContexts.

v0.22.4

10 Jan 15:02
  • Fix issues with Apify.pushData() and keyValueStore.forEachKey() by updating @apify/storage-local to 1.0.2.

v0.22.2

22 Dec 13:24
  • Pinned cheerio to 1.0.0-rc.3 to avoid install problems in some builds.
  • Increased default maxEventLoopOverloadedRatio in SystemStatusOptions to 0.6.
  • Updated packages and improved docs.

v0.22.1

09 Dec 10:12

This is the last major release before SDK v1.0.0. We're committed to delivering v1 at the
end of 2020, so stay tuned. Besides Playwright integration via a new BrowserPool,
it will be the first release of the SDK that we'll support for an extended period of time.
We will not make any breaking changes until 2.0.0, which will come at the end of
2021. But enough about v1; let's see the changes in 0.22.0.

In this release we've changed a lot of code, but you may not even notice.
We've updated the underlying apify-client package, which powers all communication with
the Apify API, to version 1.0.0. This means a completely new API for all internal calls.
If you use Apify.client calls in your code, this will be a large breaking change for you.
Visit the client docs
to see what's new in the client, but also note that we removed the default client
available under Apify.client and replaced it with an Apify.newClient() function.
We think it's better to have separate clients for user code and internal use.

Until now, local emulation of Apify Storages has been a part of the SDK. We moved the logic
into a separate package, @apify/storage-local, which shares its interface with apify-client.
RequestQueue is now powered by SQLite3 instead of the file system, which improves
reliability and performance quite a bit. Dataset and KeyValueStore still use the file
system, for easy browsing of data. The structure of the apify_storage folder remains unchanged.

After collecting common developer mistakes, we've decided to make argument validation stricter.
You will no longer be able to pass extra arguments to functions and constructors. This is
to alleviate the frustration of mistakenly passing useChrome to PuppeteerPoolOptions
instead of LaunchPuppeteerOptions and not realizing it. Before this version, the SDK wouldn't
let you know and would silently continue with Chromium. Now, it will throw an error saying
that useChrome is not an allowed property of PuppeteerPoolOptions.
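The behavior can be sketched with a self-contained validator (this is not the SDK's actual implementation, which is schema-based; it only illustrates the idea):

```javascript
// Illustrative strict option validation: unknown properties throw instead of
// being silently ignored. Not the SDK's real validator.
function validateOptions(options, allowedKeys, name) {
    for (const key of Object.keys(options)) {
        if (!allowedKeys.has(key)) {
            throw new Error(`Property "${key}" is not allowed in ${name}.`);
        }
    }
}

// validateOptions(
//     { useChrome: true },
//     new Set(['maxOpenPagesPerInstance']),
//     'PuppeteerPoolOptions',
// ); // throws: Property "useChrome" is not allowed in PuppeteerPoolOptions.
```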

Based on developer feedback, we decided to remove --no-sandbox from the default Puppeteer
launch args. It will only be used on Apify Platform. This gives you the chance to use
your own sandboxing strategy.

LiveViewServer and puppeteerPoolOptions.useLiveView were never very user-friendly
or performant solutions, due to the inherent performance issues with rapidly taking many
screenshots in Puppeteer. We've decided to remove them. If you need similar functionality,
try the devtools-server NPM package, which utilizes the Chrome DevTools Frontend for
screen-casting a live view of the running browser.

Full list of changes:

  • BREAKING: Updated apify-client to 1.0.0 with a completely new interface.
    We also removed the Apify.client property and replaced it with an Apify.newClient()
    function that creates a new ApifyClient instance.

  • BREAKING: Removed --no-sandbox from default Puppeteer launch arguments.
    This will most likely be breaking for Linux and Docker users.

  • BREAKING: Function argument validation is now more strict and will not accept extra
    parameters which are not defined by the functions' signatures.

  • DEPRECATED: puppeteerPoolOptions.useLiveView is now deprecated.
    Use the devtools-server NPM package instead.

  • Added postResponseFunction to CheerioCrawlerOptions. It allows you to override
    properties on the HTTP response before processing by CheerioCrawler.

  • Added HTTP2 support to utils.requestAsBrowser(). Set useHttp2 to true
    in RequestAsBrowserOptions to enable it.

  • Fixed handling of XML content types in CheerioCrawler.

  • Fixed capitalization of headers when using utils.puppeteer.addInterceptRequestHandler.

  • Fixed utils.puppeteer.saveSnapshot() overwriting screenshots with HTML on local.

  • Updated puppeteer to version 5.4.1 with Chrom(ium) 87.

  • Removed RequestQueueLocal in favor of @apify/storage-local API emulator.

  • Removed KeyValueStoreLocal in favor of @apify/storage-local API emulator.

  • Removed DatasetLocal in favor of @apify/storage-local API emulator.

  • Removed the userData option from Apify.utils.enqueueLinks (deprecated in Jun 2019).
    Use transformRequestFunction instead.

  • Removed instanceKillerIntervalMillis and killInstanceAfterMillis (deprecated in Feb 2019).
    Use instanceKillerIntervalSecs and killInstanceAfterSecs instead.

  • Removed the memory option from Apify.call() options (deprecated in 2018).
    Use memoryMbytes instead.

  • Removed delete() methods from Dataset, KeyValueStore and RequestQueue (deprecated in Jul 2019).
    Use .drop().

  • Removed utils.puppeteer.hideWebDriver() (deprecated in May 2019).
    Use LaunchPuppeteerOptions.stealth.

  • Removed utils.puppeteer.enqueueRequestsFromClickableElements() (deprecated in 2018).
    Use utils.puppeteer.enqueueLinksByClickingElements.

  • Removed request.doNotRetry() (deprecated in June 2019).
    Use request.noRetry = true.

  • Removed RequestListOptions.persistSourcesKey (deprecated in Feb 2020).
    Use persistRequestsKey.

v0.21.10

07 Dec 19:33
  • Bump Puppeteer to 5.5.0 and Chrom(ium) 88.

v0.21.9

03 Nov 17:55
  • Fix various issues in stealth.
  • Fix SessionPool not retiring sessions immediately when they become unusable. It fixes a problem where PuppeteerPool would not retire browsers with bad sessions.

v0.21.8

08 Oct 09:14
  • Make PuppeteerCrawler safe against malformed Puppeteer responses.
  • Update default user agent to Chrome 86.
  • Bump Puppeteer to 5.3.1 with Chromium 86.