Releases: apify/crawlee
v1.1.0
In this minor release we focused on the SessionPool. Besides fixing a few bugs, we added one important feature: setting and getting of sessions by ID.
// Now you can add specific sessions to the pool,
// instead of relying on random generation.
await sessionPool.addSession({
    id: 'my-session',
    // ... some config
});
// Later, you can retrieve the session. This is useful
// for example when you need a specific login session.
const session = await sessionPool.getSession('my-session');
Full list of changes:
- Add sessionPool.addSession() function to add a new session to the session pool (possibly with the provided options, e.g. with specific session id).
- Add optional parameter sessionId to sessionPool.getSession() to be able to retrieve a session from the session pool with the specific session id.
- Fix SessionPool not working properly in both PuppeteerCrawler and PlaywrightCrawler.
- Fix Apify.call() and Apify.callTask() output - make it backwards compatible with previous versions of the client.
- Improve handling of browser executable paths when using the official SDK Docker images.
- Update browser-pool to fix issues with failing hooks causing browsers to get stuck in limbo.
- Removed proxy-chain dependency because now it's covered in browser-pool.
v1.0.2
- Add the ability to override ProxyConfiguration status check URL with the APIFY_PROXY_STATUS_URL env var.
- Fix inconsistencies in cookie handling when SessionPool was used.
- Fix TS types in multiple places. TS is still not a first class citizen, but this should improve the experience.
v1.0.1
v1.0.0
After 3.5 years of rapid development, and a lot of breaking changes and deprecations, here comes the result: Apify SDK v1. There were two goals for this release: stability, and support for more browsers, namely Firefox and WebKit (Safari).
The SDK has grown quite popular over the years, powering thousands of web scraping and automation projects. We think our developers deserve a stable environment to work in, and by releasing SDK v1 we commit to making breaking changes only once a year, with a new major release.
We added support for more browsers by replacing PuppeteerPool with browser-pool, a new library that we created specifically for this purpose. It builds on the ideas from PuppeteerPool and extends them to support Playwright. Playwright is a browser automation library similar to Puppeteer. It works with all well known browsers and uses almost the same interface as Puppeteer, while adding useful features and simplifying common tasks. Don't worry, you can still use Puppeteer with the new BrowserPool.
A large breaking change is that neither puppeteer nor playwright is bundled with the SDK v1. To make the choice of a library easier and installs faster, users will have to install the selected modules and versions themselves. This allows us to add support for even more libraries in the future.
Thanks to the addition of Playwright we now have a PlaywrightCrawler. It is very similar to PuppeteerCrawler and you can pick the one you prefer. It also means we needed to make some interface changes. The launchPuppeteerFunction option of PuppeteerCrawler is gone and launchPuppeteerOptions was replaced by launchContext. We also moved things around in the handlePageFunction arguments. See the migration guide for more detailed explanation and migration examples.
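To make the launchPuppeteerOptions to launchContext change concrete, here is a minimal sketch of the reshaping involved. The helper toLaunchContext and the SDK_LEVEL_KEYS list are hypothetical names introduced only for illustration, and the list reflects an assumption about which options stay at the top level of launchContext; the migration guide remains the authoritative reference.

```javascript
// Hypothetical helper, not part of the SDK: reshapes a pre-v1
// launchPuppeteerOptions object into the v1 launchContext layout.
// Assumption: the keys below are SDK-level options that stay at the top
// level, while everything else is forwarded to the browser launcher
// through launchContext.launchOptions.
const SDK_LEVEL_KEYS = ['useChrome', 'stealth', 'stealthOptions', 'proxyUrl', 'userAgent'];

function toLaunchContext(launchPuppeteerOptions = {}) {
    const launchContext = { launchOptions: {} };
    for (const [key, value] of Object.entries(launchPuppeteerOptions)) {
        if (SDK_LEVEL_KEYS.includes(key)) {
            launchContext[key] = value;
        } else {
            launchContext.launchOptions[key] = value;
        }
    }
    return launchContext;
}

// Old-style options mixing SDK settings with browser launch settings.
const launchContext = toLaunchContext({
    useChrome: true,
    headless: true,
    args: ['--window-size=1280,720'],
});

console.log(launchContext);
```

In real code you would most likely write the launchContext object by hand; the helper only shows which options end up where.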
What's in store for SDK v2? We want to split the SDK into smaller libraries, so that everyone can install only the things they need. We plan a TypeScript migration to make crawler development faster and safer. Finally, we will take a good look at the interface of the whole SDK and update it to improve the developer experience. Bug fixes and scraping features will of course keep landing in versions 1.X as well.
Full list of changes:
- BREAKING: Removed puppeteer from dependencies. If you want to use Puppeteer, you must install it yourself.
- BREAKING: Removed PuppeteerPool. Use browser-pool.
- BREAKING: Removed PuppeteerCrawlerOptions.launchPuppeteerOptions. Use launchContext.
- BREAKING: Removed PuppeteerCrawlerOptions.launchPuppeteerFunction. Use PuppeteerCrawlerOptions.preLaunchHooks and postLaunchHooks.
- BREAKING: Removed args.autoscaledPool and args.puppeteerPool from handle(Page/Request)Function arguments. Use args.crawler.autoscaledPool and args.crawler.browserPool.
- BREAKING: The useSessionPool and persistCookiesPerSession options of crawlers are now true by default. Explicitly set them to false to override the behavior.
- BREAKING: Apify.launchPuppeteer() no longer accepts LaunchPuppeteerOptions. It now accepts PuppeteerLaunchContext.
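The change to the handle(Page/Request)Function arguments can be sketched as follows. The handler reads both pools from args.crawler, as the list above describes. The fakeContext object is a stand-in invented for this example so that the snippet runs without the SDK or a browser; in a real crawler the SDK supplies the context.

```javascript
// v1-style handler: the pools are read from the crawler instance
// instead of being passed as separate arguments.
async function handlePageFunction({ request, crawler }) {
    const { autoscaledPool, browserPool } = crawler;
    // The return value is only for this demonstration.
    return {
        url: request.url,
        maxConcurrency: autoscaledPool.maxConcurrency,
        hasBrowserPool: browserPool !== undefined,
    };
}

// Stand-in objects (not real SDK instances) so the sketch runs on its own.
const fakeContext = {
    request: { url: 'https://example.com' },
    crawler: {
        autoscaledPool: { maxConcurrency: 10 },
        browserPool: {},
    },
};

handlePageFunction(fakeContext).then((summary) => console.log(summary));
```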
New deprecations:
- DEPRECATED: PuppeteerCrawlerOptions.gotoFunction. Use PuppeteerCrawlerOptions.preNavigationHooks and postNavigationHooks.
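As a rough illustration of replacing gotoFunction with hooks, the sketch below defines one hook of each kind. It assumes that pre-navigation hooks receive the crawling context plus a mutable gotoOptions object and that post-navigation hooks receive the crawling context; the option values are arbitrary examples.

```javascript
// Runs before navigation: adjust the options used for page.goto().
const preNavigationHooks = [
    async (crawlingContext, gotoOptions) => {
        gotoOptions.waitUntil = 'domcontentloaded';
        gotoOptions.timeout = 30000; // milliseconds
    },
];

// Runs after navigation: a place for checks or logging.
const postNavigationHooks = [
    async (crawlingContext) => {
        console.log(`Navigated to ${crawlingContext.request.url}`);
    },
];

// Both arrays would be passed to the crawler as the options
// preNavigationHooks and postNavigationHooks.
```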
Removals of earlier deprecated functions:
- BREAKING: Removed Apify.utils.puppeteer.enqueueLinks(). Deprecated in 01/2019. Use Apify.utils.enqueueLinks().
- BREAKING: Removed autoscaledPool.(set|get)MaxConcurrency(). Deprecated in 2019. Use autoscaledPool.maxConcurrency.
- BREAKING: Removed CheerioCrawlerOptions.requestOptions. Deprecated in 03/2020. Use CheerioCrawlerOptions.prepareRequestFunction.
- BREAKING: Removed Launch.requestOptions. Deprecated in 03/2020. Use CheerioCrawlerOptions.prepareRequestFunction.
New features:
- Added Apify.PlaywrightCrawler which is almost identical to PuppeteerCrawler, but it crawls with the playwright library.
- Added Apify.launchPlaywright(launchContext) helper function.
- Added browserPoolOptions to PuppeteerCrawler to configure BrowserPool.
- Added crawler to handle(Request/Page)Function arguments.
- Added browserController to handlePageFunction arguments.
- Added crawler.crawlingContexts Map which includes all running crawlingContexts.
v0.22.4
v0.22.2
v0.22.1
This is the last major release before SDK v1.0.0. We're committed to delivering v1 at the end of 2020, so stay tuned. Besides Playwright integration via a new BrowserPool, it will be the first release of SDK that we'll support for an extended period of time. We will not make any breaking changes until 2.0.0, which will come at the end of 2021. But enough about v1, let's see the changes in 0.22.0.
In this release we've changed a lot of code, but you may not even notice. We've updated the underlying apify-client package which powers all communication with the Apify API to version 1.0.0. This means a completely new API for all internal calls. If you use Apify.client calls in your code, this will be a large breaking change for you. Visit the client docs to see what's new in the client, but also note that we removed the default client available under Apify.client and replaced it with the Apify.newClient() function. We think it's better to have separate clients for users and internal use.
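A minimal sketch of what code using the new client might look like. The function readDatasetItems is a hypothetical helper, and the dataset(id).listItems() call is an assumption about the apify-client 1.0.0 interface that should be checked against the client docs. The Apify module is passed in as a parameter only so that the function carries no hard dependency.

```javascript
// Assumption: `Apify` is the module obtained via require('apify').
// Each call creates its own client instead of sharing Apify.client.
async function readDatasetItems(Apify, datasetId) {
    const client = Apify.newClient();
    // Assumed apify-client 1.0.0 call shape; verify against the client docs.
    const { items } = await client.dataset(datasetId).listItems();
    return items;
}

// Usage inside an actor (the dataset ID is a placeholder):
// const Apify = require('apify');
// const items = await readDatasetItems(Apify, 'my-dataset-id');
```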
Until now, local emulation of Apify Storages has been a part of the SDK. We moved the logic into a separate package, @apify/storage-local, which shares an interface with apify-client. RequestQueue is now powered by SQLite3 instead of the file system, which improves reliability and performance quite a bit. Dataset and KeyValueStore still use the file system, for easy browsing of data. The structure of the apify_storage folder remains unchanged.
After collecting common developer mistakes, we've decided to make argument validation stricter. You will no longer be able to pass extra arguments to functions and constructors. This is to alleviate frustration when you mistakenly pass useChrome to PuppeteerPoolOptions instead of LaunchPuppeteerOptions and don't realize it. Before this version, SDK wouldn't let you know and would silently continue with Chromium. Now, it will throw an error saying that useChrome is not an allowed property of PuppeteerPoolOptions.
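The idea behind stricter validation can be illustrated with a small sketch. This is not the SDK's actual validation code: the function name, the list of allowed keys (an illustrative subset) and the exact error wording are all assumptions made for the example.

```javascript
// Illustrative only: rejects any option key that is not explicitly allowed.
function assertOnlyAllowedKeys(optionsName, options, allowedKeys) {
    for (const key of Object.keys(options)) {
        if (!allowedKeys.includes(key)) {
            throw new Error(`${key} is not an allowed property of ${optionsName}`);
        }
    }
}

// Illustrative subset of pool options, chosen for the example.
const ALLOWED_POOL_KEYS = ['maxOpenPagesPerInstance', 'retireInstanceAfterRequestCount'];

try {
    // useChrome belongs to the launch options, not to the pool options.
    assertOnlyAllowedKeys(
        'PuppeteerPoolOptions',
        { useChrome: true, maxOpenPagesPerInstance: 50 },
        ALLOWED_POOL_KEYS,
    );
} catch (err) {
    console.log(err.message);
}
```

The misplaced option now fails loudly at construction time instead of being silently ignored.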
Based on developer feedback, we decided to remove --no-sandbox from the default Puppeteer launch args. It will only be used on Apify Platform. This gives you the chance to use your own sandboxing strategy.
LiveViewServer and puppeteerPoolOptions.useLiveView were never very user-friendly or performant solutions, due to the inherent performance issues with rapidly taking many screenshots in Puppeteer. We've decided to remove them. If you need similar functionality, try the devtools-server NPM package, which utilizes the Chrome DevTools Frontend for screen-casting live view of the running browser.
Full list of changes:
- BREAKING: Updated apify-client to 1.0.0 with a completely new interface. We also removed the Apify.client property and replaced it with an Apify.newClient() function that creates a new ApifyClient instance.
- BREAKING: Removed --no-sandbox from default Puppeteer launch arguments. This will most likely be breaking for Linux and Docker users.
- BREAKING: Function argument validation is now more strict and will not accept extra parameters which are not defined by the functions' signatures.
- DEPRECATED: puppeteerPoolOptions.useLiveView is now deprecated. Use the devtools-server NPM package instead.
- Added postResponseFunction to CheerioCrawlerOptions. It allows you to override properties on the HTTP response before processing by CheerioCrawler.
- Added HTTP2 support to utils.requestAsBrowser(). Set useHttp2 to true in RequestAsBrowserOptions to enable it.
- Fixed handling of XML content types in CheerioCrawler.
- Fixed capitalization of headers when using utils.puppeteer.addInterceptRequestHandler.
- Fixed utils.puppeteer.saveSnapshot() overwriting screenshots with HTML on local.
- Updated puppeteer to version 5.4.1 with Chrom(ium) 87.
- Removed RequestQueueLocal in favor of @apify/storage-local API emulator.
- Removed KeyValueStoreLocal in favor of @apify/storage-local API emulator.
- Removed DatasetLocal in favor of @apify/storage-local API emulator.
- Removed the userData option from Apify.utils.enqueueLinks (deprecated in Jun 2019). Use transformRequestFunction instead.
- Removed instanceKillerIntervalMillis and killInstanceAfterMillis (deprecated in Feb 2019). Use instanceKillerIntervalSecs and killInstanceAfterSecs instead.
- Removed the memory option from Apify.call options (deprecated in 2018). Use memoryMbytes instead.
- Removed delete() methods from Dataset, KeyValueStore and RequestQueue (deprecated in Jul 2019). Use .drop().
- Removed utils.puppeteer.hideWebDriver() (deprecated in May 2019). Use LaunchPuppeteerOptions.stealth.
- Removed utils.puppeteer.enqueueRequestsFromClickableElements() (deprecated in 2018). Use utils.puppeteer.enqueueLinksByClickingElements.
- Removed request.doNotRetry() (deprecated in June 2019). Use request.noRetry = true.
- Removed RequestListOptions.persistSourcesKey (deprecated in Feb 2020). Use persistRequestsKey.