[BUG] session.setCookie (tough-cookie): URL incorrectly interpreted as Regex #2724

firecrauter · 2024-10-27T05:21:18Z

Which package is this bug report for? If unsure which one to select, leave blank

None

Issue description

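Steps to reproduce:
Create a project from the TypeScript CheerioCrawler template.
Use start URLs whose paths contain regex metacharacters (for example `+`).
Add a preNavigationHooks entry that sets a cookie for that URL.
Run npm install.
Run npm start.

Output: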

...
WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. SyntaxError: Invalid regular expression: /^/Antonov++Andrii/: Nothing to repeat
    at new RegExp (<anonymous>)
    at pathMatch (D:\Fire\Proyectos\my-crawler-borrar\node_modules\tough-cookie\dist\pathMatch.js:35:13)
    at matchRFC (D:\Fire\Proyectos\my-crawler-borrar\node_modules\tough-cookie\dist\memstore.js:68:51)
    at D:\Fire\Proyectos\my-crawler-borrar\node_modules\tough-cookie\dist\memstore.js:87:13
    at Array.forEach (<anonymous>)
    at MemoryCookieStore.findCookies (D:\Fire\Proyectos\my-crawler-borrar\node_modules\tough-cookie\dist\memstore.js:82:17)
    at CookieJar.getCookies (D:\Fire\Proyectos\my-crawler-borrar\node_modules\tough-cookie\dist\cookie\cookieJar.js:536:15)
    at CookieJar.getCookieString (D:\Fire\Proyectos\my-crawler-borrar\node_modules\tough-cookie\dist\cookie\cookieJar.js:597:14)
    at CookieJar.callSync (D:\Fire\Proyectos\my-crawler-borrar\node_modules\tough-cookie\dist\cookie\cookieJar.js:168:16)
    at CookieJar.getCookieStringSync (D:\Fire\Proyectos\my-crawler-borrar\node_modules\tough-cookie\dist\cookie\cookieJar.js:610:22) {"id":"ZTnkJJu5aEw0Obe","url":"https://www.google.com/Antonov++Andrii/","retryCount":3}
INFO  CheerioCrawler: Error analysis: {"totalErrors":3,"uniqueErrors":1,"mostCommonErrors":["3x: Invalid regular expression: _ Nothing to repeat (<anonymous>)"]}
INFO  CheerioCrawler: Finished! Total 4 requests: 1 succeeded, 3 failed. {"terminal":true}
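
The stack trace suggests that a URL path reaches `new RegExp` without being escaped. The sketch below reproduces that failure mode in isolation and shows two ways to avoid it. All helper names here are hypothetical, and this is not tough-cookie's actual source. It also covers only simple prefix matching, not the full RFC 6265 path-match rules.

```typescript
// Hypothetical helpers for illustration only; not tough-cookie's code.

// Escape every character that has special meaning in a regular expression.
function escapeRegExp(text: string): string {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Naive prefix check: the raw path is interpolated into the pattern,
// so a path containing "++" produces an invalid regular expression.
function naivePrefixMatch(requestPath: string, cookiePath: string): boolean {
    return new RegExp(`^${cookiePath}`).test(requestPath);
}

// Safer variant: escape the path before building the pattern.
function escapedPrefixMatch(requestPath: string, cookiePath: string): boolean {
    return new RegExp(`^${escapeRegExp(cookiePath)}`).test(requestPath);
}

// Simplest variant: no regular expression at all.
function plainPrefixMatch(requestPath: string, cookiePath: string): boolean {
    return requestPath.startsWith(cookiePath);
}

// Reports whether calling fn throws a SyntaxError.
function throwsSyntaxError(fn: () => unknown): boolean {
    try {
        fn();
        return false;
    } catch (error) {
        return error instanceof SyntaxError;
    }
}

const path = '/Antonov++Andrii/';
console.log(throwsSyntaxError(() => naivePrefixMatch(path, path)));
console.log(escapedPrefixMatch(path, path));
console.log(plainPrefixMatch(path, path));
```

The naive variant throws for the path from the log above, while the escaped and plain variants treat the same path as a literal string.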

Code sample

// For more information, see https://crawlee.dev/
import { CheerioCrawler } from 'crawlee';

// Example URLs whose paths contain regex metacharacters (they return 404, which does not matter here):
const startUrls = [
    "https://www.example.com/dev/Cibus+%7C+Pluxee/",
    "https://www.example.com/dev/Y+C++S+T+U+D+I+O/",
    "https://www.example.com/dev/Antonov++Andrii/",
    "https://www.example.com/dev/Mobile+Dialer+%28+HelloBDTel+-Ten+Card+Company+%29"
];

const crawler = new CheerioCrawler({
    // proxyConfiguration: new ProxyConfiguration({ proxyUrls: ['...'] }),
    requestHandler: async ({ request, $, log }) => {
        const title = $('title').text();
        log.info(`${title}`, { url: request.loadedUrl });
    },
    errorHandler: async ({ }, _: Error) => {
        //        console.log(request.url);
    },
    // Comment this option to scrape the full website.
    maxRequestsPerCrawl: 4,
    persistCookiesPerSession: false,
    preNavigationHooks: [
        (crawlingContext, _) => {
            // ...
            try {
                const { session, request } = crawlingContext;
                if (session) {
                    const cookieString = 'adlt=1;';

                    const urlWithoutPath = new URL(request.url);
                    urlWithoutPath.pathname = '/'; // Reset the path to just "/"
                    const targetUrl = urlWithoutPath.toString();

                    session.setCookie(cookieString, targetUrl);
                }
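            } catch (error) {
                // Errors from setCookie are ignored here. Per the stack trace, the
                // SyntaxError is thrown later, when cookies are read for the request.
            }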
        },
    ],
});

await crawler.run(startUrls);

Package version

3.11.4, 3.11.5

Node.js version

22.6.0

Operating system

Windows 11, and Ubuntu 24.04

Apify platform

Tick me if you encountered this issue on the Apify platform

I have tested this on the `next` release

No response

Other context

No response


firecrauter · 2024-11-03T07:29:07Z

Fixed in /tough-cookie/pull/465. Now I just have to wait for a tough-cookie release that includes the fix, after which Crawlee can update its dependency.
Sorry for all the cross-references and for repeatedly opening and closing this issue.

firecrauter added the bug Something isn't working. label Oct 27, 2024

github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Oct 27, 2024

firecrauter changed the title ~~[BUG] Error in session.setCookie function: URL incorrectly interpreted as Regex~~ [BUG] Error in session.setCookie (tough-cookie): URL incorrectly interpreted as Regex Oct 27, 2024

firecrauter changed the title ~~[BUG] Error in session.setCookie (tough-cookie): URL incorrectly interpreted as Regex~~ [BUG] session.setCookie (tough-cookie): URL incorrectly interpreted as Regex Oct 27, 2024

firecrauter mentioned this issue Oct 28, 2024

URL incorrectly interpreted as Regex? salesforce/tough-cookie#464

Closed

firecrauter closed this as completed Oct 29, 2024

firecrauter reopened this Oct 29, 2024

firecrauter closed this as completed Nov 3, 2024

firecrauter reopened this Nov 3, 2024

firecrauter closed this as completed Nov 3, 2024
