Crawl with puppeteer or cheerio based on url #2007
-
You can use the puppeteer crawler, it has `skipNavigation` support: https://crawlee.dev/docs/examples/skip-navigation
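
A minimal sketch of that pattern, assuming Crawlee v3, where `skipNavigation` is set per request and the `sendRequest` context helper fetches the page via got-scraping (the URLs below are just placeholders):

```js
const { PuppeteerCrawler } = require('crawlee')

const crawler = new PuppeteerCrawler({
  async requestHandler ({ request, page, sendRequest }) {
    if (request.skipNavigation) {
      // The browser did not navigate; fetch the page with got-scraping instead.
      const { body } = await sendRequest()
      console.log(`HTTP fetch of ${request.url}: ${body.length} bytes`)
      return
    }
    // Regular browser-based crawl.
    console.log(`browser crawl of ${request.url}: ${await page.title()}`)
  }
})

crawler.run([
  { url: 'https://example.com/plain', skipNavigation: true },   // handled via got-scraping
  { url: 'https://example.com/dynamic', skipNavigation: false } // handled via puppeteer
])
```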
-
I was looking for a solution which supports both puppeteer and cheerio based crawling.
-
@B4nan I tried what you had mentioned:

```js
const { PuppeteerCrawler, sleep } = require('crawlee')
const puppeteerExtra = require('puppeteer-extra')

const crawler = new PuppeteerCrawler({
  maxRequestRetries: 2,
  launchContext: {
    launcher: puppeteerExtra,
    useIncognitoPages: true,
    launchOptions: {
      headless: false
    }
  },
  maxRequestsPerCrawl: 5,
  async requestHandler ({ request, page, sendRequest, response }) {
    try {
      await sleep(3000)
      console.log('in requestHandler')
      if (request.skipNavigation) {
        await sendRequest()
      } else {
        console.log('already crawled')
      }
    } catch (err) {
      console.error('in catch requestHandler')
    }
  },
  errorHandler ({ request }, err) {
    console.log('in errorHandler')
  },
  failedRequestHandler ({ request }, err) {
    try {
      console.log('in failedRequestHandler')
      console.log(`Request ${request.url} failed ${request.retryCount} times.`)
      console.error(err)
    } catch (err) {
      console.error('in failedRequestHandler catch')
    }
  },
  preNavigationHooks: [
    async (crawlingContext, gotoOptions) => {
      console.log('in preNavigationHooks')
    }
  ]
})

crawler.addRequests([
  { url: 'https://google.com', skipNavigation: true },
  { url: 'https://about.google/', skipNavigation: false }
])

crawler.run()
```

However, my use case is that I don't want to open a browser page for URLs that are meant to be crawled via got-scraping. In this case it opens up an empty Chrome tab for them.
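
One possible way to avoid the empty tab, sketched here on the assumption of Crawlee v3 (the URLs, the `mode` field, and the queue names are made up for illustration), is to split the URLs between a CheerioCrawler (got-scraping based) and a PuppeteerCrawler, each with its own named request queue:

```js
const { CheerioCrawler, PuppeteerCrawler, RequestQueue } = require('crawlee')

// Hypothetical input list; `mode` decides which crawler handles the URL.
const urls = [
  { url: 'https://example.com/static-page', mode: 'http' },
  { url: 'https://example.com/js-heavy-page', mode: 'browser' }
]

async function main () {
  // Separate named queues so the two crawlers do not share requests.
  const httpQueue = await RequestQueue.open('http-urls')
  const browserQueue = await RequestQueue.open('browser-urls')

  const cheerioCrawler = new CheerioCrawler({
    requestQueue: httpQueue,
    async requestHandler ({ request, $ }) {
      console.log(`got-scraping crawl of ${request.url}: ${$('title').text()}`)
    }
  })

  const puppeteerCrawler = new PuppeteerCrawler({
    requestQueue: browserQueue,
    async requestHandler ({ request, page }) {
      console.log(`browser crawl of ${request.url}: ${await page.title()}`)
    }
  })

  await httpQueue.addRequests(urls.filter(u => u.mode === 'http').map(u => ({ url: u.url })))
  await browserQueue.addRequests(urls.filter(u => u.mode === 'browser').map(u => ({ url: u.url })))

  // Runs one crawler after the other; no browser page is ever opened for the plain-HTTP URLs.
  await cheerioCrawler.run()
  await puppeteerCrawler.run()
}

main()
```

The trade-off compared to the `skipNavigation` approach is that state is split across two queues, so any deduplication between the two modes has to be handled by the caller.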
-
Hi,
I have a use case to run crawlee in both browser and curl-style crawling modes, i.e. puppeteer and got.
Each URL will hold a crawling-mode property, and based on that mode the crawler will do either a puppeteer or a got crawl.
I just wanted to know: is this possible in crawlee? If yes, is it possible to achieve with a single crawlee instance / a single request queue?
Thanks.