Option to exclude certain resource types #1814
-
Which package is the feature request for? If unsure which one to select, leave blank
@crawlee/playwright (PlaywrightCrawler)

Feature
Hi,

Motivation
It would help with the speed of scraping, and lower the internet usage.

Ideal solution or implementation, and any additional constraints
In PlaywrightCrawler object constructor.

Alternative solutions or implementations
No response

Other context
No response
Replies: 2 comments 1 reply
-
Do you mean something like this? https://crawlee.dev/api/playwright-crawler/namespace/playwrightUtils#blockRequests
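For reference, here is a minimal sketch of how `blockRequests` could be wired into a crawler. `makeBlockRequestsHook` is a hypothetical helper name, not part of Crawlee, and the patterns are only examples. As far as I know, `blockRequests` matches on URL patterns rather than on resource types and only works in Chromium-based browsers, so it is close to what was asked but not identical.

```javascript
// Hypothetical helper (not part of Crawlee): builds a pre-navigation hook
// that calls the `blockRequests` helper found on the crawling context.
function makeBlockRequestsHook(extraUrlPatterns = []) {
    return async ({ blockRequests }) => {
        // `extraUrlPatterns` are added on top of Crawlee's default patterns;
        // pass `urlPatterns` instead to replace the defaults entirely.
        await blockRequests({ extraUrlPatterns });
    };
}

// Usage sketch (assumes `crawlee` and Playwright are installed):
//
// import { PlaywrightCrawler } from 'crawlee';
//
// const crawler = new PlaywrightCrawler({
//     preNavigationHooks: [makeBlockRequestsHook(['.mp4', 'adsbygoogle.js'])],
//     async requestHandler({ page, log }) {
//         log.info(`Title of ${page.url()} is '${await page.title()}'`);
//     },
// });
// await crawler.run(['https://crawlee.dev']);
```

The hook only forwards the patterns, so the same crawler setup can be reused with a different pattern list per site.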
-
Or see "Skipping navigations for certain requests". Or you can do it like this:

// https://crawlee.dev/docs/examples/playwright-crawler
import { PlaywrightCrawler, log } from 'crawlee';

// https://playwright.dev/docs/api/class-request#request-resource-type
const RESOURCE_EXCLUSIONS = ['image', 'stylesheet', 'media', 'font', 'other'];

/* More resource types
For a new site I try to block everything in BLOCKED_IMG_CSS_JS.
If the site does not work (pages are not rendered properly), I try BLOCKED_IMG_CSS.
If it still does not work, I try BLOCKED_IMG, and then no blocking at all.
const BLOCKED_IMG = ['image', 'imageset', 'object', 'object_subrequest', 'ping', 'web_manifest', 'xslt', 'media', 'font', 'other', 'beacon', 'csp_report', 'speculative', 'sub_frame', 'xbl', 'xml_dtd', 'texttrack', 'fetch', 'eventsource', 'manifest'];
const BLOCKED_IMG_CSS = ['stylesheet', 'image', 'imageset', 'object', 'object_subrequest', 'ping', 'web_manifest', 'xslt', 'media', 'font', 'other', 'beacon', 'csp_report', 'speculative', 'sub_frame', 'xbl', 'xml_dtd', 'texttrack', 'fetch', 'eventsource', 'manifest'];
const BLOCKED_IMG_CSS_JS = ['websocket', 'xhr', 'xmlhttprequest', 'script', 'stylesheet', 'image', 'imageset', 'object', 'object_subrequest', 'ping', 'web_manifest', 'xslt', 'media', 'font', 'other', 'beacon', 'csp_report', 'speculative', 'sub_frame', 'xbl', 'xml_dtd', 'texttrack', 'fetch', 'eventsource', 'manifest'];
*/

const crawler = new PlaywrightCrawler({
    headless: true,
    preNavigationHooks: [
        async ({ page }) => {
            // Intercept every request made by the page before navigation starts.
            await page.route('**/*', async (route) => {
                const type = route.request().resourceType();
                if (RESOURCE_EXCLUSIONS.includes(type)) {
                    // Drop requests for excluded resource types.
                    await route.abort();
                    return;
                }
                log.info(`Request: ${route.request().url()} with resource type: ${type}`);
                await route.continue();
            });
        },
    ],
    async requestHandler({ page, log }) {
        log.info(`Title of ${page.url()} is '${await page.title()}'`);
    },
});

await crawler.run(['https://amazon.com']);
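
The step-down approach described in the comment above (block as much as possible, then relax if pages stop rendering) could be sketched as a small route-handler factory. This is only an illustration: `BLOCK_LEVELS` and `makeResourceBlocker` are hypothetical names, and the levels are trimmed to types that Playwright's `resourceType()` documents, not the full lists from the comment.

```javascript
// Hypothetical blocking levels, from most to least aggressive.
// Only types listed in Playwright's resourceType() docs are used here.
const BLOCK_LEVELS = {
    imgCssJs: ['image', 'media', 'font', 'stylesheet', 'script'],
    imgCss: ['image', 'media', 'font', 'stylesheet'],
    img: ['image', 'media', 'font'],
    none: [],
};

// Hypothetical helper: returns a function with the shape of a Playwright
// route handler that aborts requests whose resource type is in the list.
function makeResourceBlocker(blockedTypes) {
    const blocked = new Set(blockedTypes);
    return (route) => {
        const type = route.request().resourceType();
        // Return the promise so Playwright can wait for the decision.
        return blocked.has(type) ? route.abort() : route.continue();
    };
}

// Usage sketch inside a preNavigationHook:
//
// async ({ page }) => {
//     await page.route('**/*', makeResourceBlocker(BLOCK_LEVELS.imgCss));
// }
```

Switching levels is then a one-word change per site: start with `BLOCK_LEVELS.imgCssJs` and move toward `BLOCK_LEVELS.none` if the page no longer renders properly.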