Crawl with puppeteer or cheerio based on url #2007
-
You can use the puppeteer crawler, it has `skipNavigation` support: https://crawlee.dev/docs/examples/skip-navigation
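
A minimal sketch of that pattern, assuming Crawlee v3, where `skipNavigation` is set per request and the `sendRequest` context helper fetches the page via got-scraping (the URLs below are just placeholders):

```js
const { PuppeteerCrawler } = require('crawlee')

const crawler = new PuppeteerCrawler({
  async requestHandler ({ request, page, sendRequest }) {
    if (request.skipNavigation) {
      // The browser did not navigate; fetch the page with got-scraping instead.
      const { body } = await sendRequest()
      console.log(`HTTP fetch of ${request.url}: ${body.length} bytes`)
      return
    }
    // Regular browser-based crawl.
    console.log(`browser crawl of ${request.url}: ${await page.title()}`)
  }
})

crawler.run([
  { url: 'https://example.com/plain', skipNavigation: true },   // handled via got-scraping
  { url: 'https://example.com/dynamic', skipNavigation: false } // handled via puppeteer
])
```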
-
I was looking for a solution which supports both puppeteer and cheerio based crawling.
-
@B4nan I tried what you had mentioned:

```js
const { PuppeteerCrawler, sleep } = require('crawlee')
const puppeteerExtra = require('puppeteer-extra')

const crawler = new PuppeteerCrawler({
  maxRequestRetries: 2,
  launchContext: {
    launcher: puppeteerExtra,
    useIncognitoPages: true,
    launchOptions: {
      headless: false
    }
  },
  maxRequestsPerCrawl: 5,
  async requestHandler ({ request, page, sendRequest, response }) {
    try {
      await sleep(3000)
      console.log('in requestHandler')
      if (request.skipNavigation) {
        await sendRequest()
      } else {
        console.log('already crawled')
      }
    } catch (err) {
      console.error('in catch requestHandler')
    }
  },
  errorHandler ({ request }, err) {
    console.log('in errorHandler')
  },
  failedRequestHandler ({ request }, err) {
    try {
      console.log('in failedRequestHandler')
      console.log(`Request ${request.url} failed ${request.retryCount} times.`)
      console.error(err)
    } catch (err) {
      console.error('in failedRequestHandler catch')
    }
  },
  preNavigationHooks: [
    async (crawlingContext, gotoOptions) => {
      console.log('in preNavigationHooks')
    }
  ]
})

crawler.addRequests([
  { url: 'https://google.com', skipNavigation: true },
  { url: 'https://about.google/', skipNavigation: false }
])

crawler.run()
```

However, my use case is that I don't want to open a browser page for URLs that are meant to be crawled via got-scraping. In this case it opens up an empty Chrome tab for them.
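
One possible way to avoid the empty tab, sketched here on the assumption of Crawlee v3 (the URLs, the `mode` field, and the queue names are made up for illustration), is to split the URLs between a CheerioCrawler (got-scraping based) and a PuppeteerCrawler, each with its own named request queue:

```js
const { CheerioCrawler, PuppeteerCrawler, RequestQueue } = require('crawlee')

// Hypothetical input list; `mode` decides which crawler handles the URL.
const urls = [
  { url: 'https://example.com/static-page', mode: 'http' },
  { url: 'https://example.com/js-heavy-page', mode: 'browser' }
]

async function main () {
  // Separate named queues so the two crawlers do not share requests.
  const httpQueue = await RequestQueue.open('http-urls')
  const browserQueue = await RequestQueue.open('browser-urls')

  const cheerioCrawler = new CheerioCrawler({
    requestQueue: httpQueue,
    async requestHandler ({ request, $ }) {
      console.log(`got-scraping crawl of ${request.url}: ${$('title').text()}`)
    }
  })

  const puppeteerCrawler = new PuppeteerCrawler({
    requestQueue: browserQueue,
    async requestHandler ({ request, page }) {
      console.log(`browser crawl of ${request.url}: ${await page.title()}`)
    }
  })

  await httpQueue.addRequests(urls.filter(u => u.mode === 'http').map(u => ({ url: u.url })))
  await browserQueue.addRequests(urls.filter(u => u.mode === 'browser').map(u => ({ url: u.url })))

  // Runs one crawler after the other; no browser page is ever opened for the plain-HTTP URLs.
  await cheerioCrawler.run()
  await puppeteerCrawler.run()
}

main()
```

The trade-off compared to the `skipNavigation` approach is that state is split across two queues, so any deduplication between the two modes has to be handled by the caller.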
-
Hi,
I have a use case to run crawlee in both browser and curl-style crawling modes, i.e. puppeteer and got.
Each URL will hold a crawling-mode property, and based on that mode the crawler will do either a puppeteer or a got crawl.
I just wanted to know: is this possible in crawlee? If yes, is it possible to achieve with a single crawlee instance / a single request queue?
Thanks.