Releases: apify/crawlee
v0.20.2
v0.20.0
- BREAKING: `Apify.utils.requestAsBrowser()` no longer aborts the request on status code 406 or when a type other than `text/html` is received. Use `options.abortFunction` if you want to retain this functionality.
- BREAKING: Added the `useInsecureHttpParser` option to `Apify.utils.requestAsBrowser()`, which is `true` by default and forces the function to use an HTTP parser that is less strict than the default Node 12 parser, but also less secure. It is needed to be able to bypass certain anti-scraping walls and fetch websites that do not comply with the HTTP spec.
- BREAKING: `RequestList` now removes all the elements from the `sources` array on initialization. If you need to use the sources somewhere else, make a copy. This change was added as one of several measures to improve memory management of `RequestList` in scenarios with a very large number of `Request` instances.
- DEPRECATED: `RequestListOptions.persistSourcesKey` is now deprecated. Please use `RequestListOptions.persistRequestsKey`.
- `RequestListOptions.sources` can now be an array of `string` URLs as well.
- Added `sourcesFunction` to `RequestListOptions`. It enables dynamic fetching of sources and will only be called if persisted `Requests` were not retrieved from the key-value store. Use it to reduce memory spikes and also to make sure that your sources are not re-created on actor restarts.
- Updated `stealth` hiding of `webdriver` to avoid recent detections.
- `Apify.utils.log` now points to an updated logger instance which prints colored logs (in TTY) and supports overriding with custom loggers.
- Improved `Apify.launchPuppeteer()` code to prevent triggering bugs in Puppeteer by passing more than required options to `puppeteer.launch()`.
- Documented `BasicCrawler.autoscaledPool` property, and added `CheerioCrawler.autoscaledPool` and `PuppeteerCrawler.autoscaledPool` properties.
- `SessionPool` now persists state on `teardown`. Before, it only persisted state every minute. This ensures that after a crawler finishes, the state is correctly persisted.
- Added TypeScript typings and typedef documentation for all entities used throughout the SDK.
- Upgraded the `proxy-chain` NPM package from 0.2.7 to 0.4.1, along with many other dependencies.
- Removed all usage of the now deprecated `request` package.
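The breaking `sources` change is easy to trip over, so here is a minimal sketch in plain JavaScript. `drainSources` is a hypothetical stand-in that only mimics the documented behavior (the array passed in ends up empty); it is not the SDK's implementation, and the URLs are placeholders.

```javascript
// Hypothetical stand-in for RequestList initialization: it takes every
// element out of the array it is given, leaving that array empty.
function drainSources(sources) {
    const taken = sources.splice(0); // removes all elements in place
    // Plain string URLs are accepted alongside objects with a `url` field.
    return taken.map((source) => (typeof source === 'string' ? { url: source } : source));
}

const sources = ['https://example.com/a', { url: 'https://example.com/b' }];

// Keep a shallow copy if the sources are needed elsewhere later.
const sourcesCopy = [...sources];
const requests = drainSources(sources);

console.log(sources.length);     // 0, the original array has been emptied
console.log(sourcesCopy.length); // 2, the copy still holds both entries
console.log(requests.map((r) => r.url));
```

Copying before initialization costs memory, so for very large lists it may be preferable to avoid the copy and rely on `sourcesFunction` instead.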
v0.19.1
v0.19.0
- BREAKING: `APIFY_LOCAL_EMULATION_DIR` env var is no longer supported (deprecated on 2018-09-11). Use `APIFY_LOCAL_STORAGE_DIR` instead.
- `SessionPool` API updates and fixes. The API is no longer considered experimental.
- Logging of system info moved from `require` time to `Apify.main()` invocation.
- Use native `RegExp` instead of `xregexp` for unicode property escapes.
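For readers unfamiliar with native unicode property escapes, the following sketch shows what the built-in `RegExp` supports with the `u` flag. The pattern and the `extractWords` helper are illustrative assumptions, not code taken from the SDK.

```javascript
// Native unicode property escapes need the `u` flag.
// \p{L} matches any letter in any script.
const LETTERS = /\p{L}+/gu;

// Hypothetical helper: returns all runs of letters found in the text.
function extractWords(text) {
    return text.match(LETTERS) || [];
}

console.log(extractWords('Привет, мир! 42')); // [ 'Привет', 'мир' ]
```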
v0.18.1
v0.18.0
- BREAKING: `CheerioCrawler` ignores SSL errors by default (`options.ignoreSslErrors: true`).
- Add `SessionPool` implementation to `CheerioCrawler`.
- Add `SessionPool` implementation to `PuppeteerPool` and `PuppeteerCrawler`.
- Fix `Request` constructor not making a copy of objects such as `userData` and `headers`.
- Fix `desc` option not being applied in local `dataset.getData()`.
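The `Request` constructor fix concerns shared object references. The sketch below uses a hypothetical, simplified `RequestSketch` class to show why copying matters; whether the SDK copies shallowly or deeply is not stated in these notes, so the shallow copy here is an assumption.

```javascript
// Hypothetical, simplified model of a request object; not the SDK class.
class RequestSketch {
    constructor({ url, userData = {}, headers = {} }) {
        this.url = url;
        // Copy the objects so later changes to the caller's objects
        // do not leak into this instance (shallow copies for brevity).
        this.userData = { ...userData };
        this.headers = { ...headers };
    }
}

const sharedUserData = { label: 'DETAIL' };
const first = new RequestSketch({ url: 'https://example.com/1', userData: sharedUserData });

// Reusing and mutating the same object for a second request is a common pattern.
sharedUserData.label = 'LIST';
const second = new RequestSketch({ url: 'https://example.com/2', userData: sharedUserData });

console.log(first.userData.label);  // DETAIL
console.log(second.userData.label); // LIST
```

Without the copy, both instances would point at the same `userData` object and the first request would silently report `LIST` as well.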
v0.17.0
- BREAKING: Node 8 and 9 are no longer supported. Please use Node 10.17.0 or higher.
- DEPRECATED: `Apify.callTask()` `body` and `contentType` options are now deprecated. Use `input` instead. It must be of `content-type: application/json`.
- Add default `SessionPool` implementation to `BasicCrawler`.
- Add the ability to create ad-hoc webhooks via `Apify.call()` and `Apify.callTask()`.
- Add an example of form filling with `Puppeteer`.
- Add `country` option to `Apify.getApifyProxyUrl()`.
- Add `Apify.utils.puppeteer.saveSnapshot()` helper to quickly save HTML and screenshot of a page.
- Add the ability to pass `got` supported options to `requestOptions` in `CheerioCrawler`, thus supporting things such as `cookieJar` again.
- Switch Puppeteer to web socket again due to suspected `pipe` errors.
- Fix an issue where some encodings were not correctly parsed in `CheerioCrawler`.
- Fix parsing bad Content-Type headers for `CheerioCrawler`.
- Fix custom headers not being correctly applied in `Apify.utils.requestAsBrowser()`.
- Fix dataset limits not being correctly applied.
- Fix a race condition in `RequestQueueLocal`.
- Fix `RequestList` persistence of downloaded sources in key-value store.
- Fix `Apify.utils.puppeteer.blockRequests()` always including default patterns.
- Fix inconsistent behavior of `Apify.utils.puppeteer.infiniteScroll()` on some websites.
- Fix retry histogram statistics sometimes showing invalid counts.
- Added regexps for YouTube videos (`YOUTUBE_REGEX`, `YOUTUBE_REGEX_GLOBAL`) to `utils.social`.
- Added documentation for option `json` in `handlePageFunction` of `CheerioCrawler`.
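To illustrate the difference between a single-match pattern and its global counterpart, here is a sketch with simplified stand-in expressions. It assumes the pair follows the common split between matching exactly one URL and finding every URL in a text; the patterns below are hypothetical and much narrower than whatever `utils.social` actually ships.

```javascript
// Simplified, hypothetical pattern for a YouTube watch URL.
const VIDEO_URL = /https?:\/\/(?:www\.)?youtube\.com\/watch\?v=[\w-]{11}/.source;

// Anchored variant: the whole string must be a single video URL.
const YOUTUBE_SKETCH = new RegExp(`^${VIDEO_URL}$`, 'i');
// Global variant: finds every video URL inside a larger text.
const YOUTUBE_SKETCH_GLOBAL = new RegExp(VIDEO_URL, 'gi');

const text = 'See https://www.youtube.com/watch?v=dQw4w9WgXcQ and https://youtube.com/watch?v=9bZkp7q19f0.';

console.log(YOUTUBE_SKETCH.test('https://www.youtube.com/watch?v=dQw4w9WgXcQ')); // true
console.log(YOUTUBE_SKETCH.test(text)); // false, the text is more than one URL
console.log(text.match(YOUTUBE_SKETCH_GLOBAL)); // both URLs, without the trailing period
```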
v0.16.1
- Add `useIncognitoPages` option to `PuppeteerPool` to enable opening new pages in incognito browser contexts. This is useful to keep cookies and cache unique for each page.
- Added options to load every content type in CheerioCrawler. There are new options `body` and `contentType` in `handlePageFunction` for this purpose.
- DEPRECATED: CheerioCrawler `html` option in `handlePageFunction` was replaced with `body` option.
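A handler that receives `body` and `contentType` can branch on the type to decide how to parse the payload. The sketch below is a standalone function, not a real `handlePageFunction`; it assumes `contentType` is an object with a `type` string and that `body` is a `Buffer` or string, both of which should be checked against the SDK documentation.

```javascript
// Hypothetical handler: picks a parsing strategy from the content type.
// The shape `{ type }` of `contentType` is an assumption of this sketch.
function handlePageSketch({ body, contentType }) {
    switch (contentType.type) {
        case 'application/json':
            return JSON.parse(body.toString());
        case 'text/html':
        case 'text/plain':
            return body.toString();
        default:
            return body; // leave other (e.g. binary) payloads untouched
    }
}

const jsonResult = handlePageSketch({
    body: Buffer.from('{"ok":true}'),
    contentType: { type: 'application/json' },
});
console.log(jsonResult.ok); // true
```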