[BUG] session.setCookie (tough-cookie): URL incorrectly interpreted as Regex #2724

Closed
1 task
firecrauter opened this issue Oct 27, 2024 · 1 comment
Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.

firecrauter commented Oct 27, 2024

Which package is this bug report for? If unsure which one to select, leave blank

None

Issue description

  1. Create a project from the TypeScript CheerioCrawler template.
  2. Use start URLs whose paths contain regex metacharacters (such as "+").
  3. Add a preNavigationHooks entry that sets a cookie for that URL.
  4. npm install
  5. npm start

Output:

...
WARN  CheerioCrawler: Reclaiming failed request back to the list or queue. SyntaxError: Invalid regular expression: /^/Antonov++Andrii/: Nothing to repeat
    at new RegExp (<anonymous>)
    at pathMatch (D:\Fire\Proyectos\my-crawler-borrar\node_modules\tough-cookie\dist\pathMatch.js:35:13)
    at matchRFC (D:\Fire\Proyectos\my-crawler-borrar\node_modules\tough-cookie\dist\memstore.js:68:51)
    at D:\Fire\Proyectos\my-crawler-borrar\node_modules\tough-cookie\dist\memstore.js:87:13
    at Array.forEach (<anonymous>)
    at MemoryCookieStore.findCookies (D:\Fire\Proyectos\my-crawler-borrar\node_modules\tough-cookie\dist\memstore.js:82:17)
    at CookieJar.getCookies (D:\Fire\Proyectos\my-crawler-borrar\node_modules\tough-cookie\dist\cookie\cookieJar.js:536:15)
    at CookieJar.getCookieString (D:\Fire\Proyectos\my-crawler-borrar\node_modules\tough-cookie\dist\cookie\cookieJar.js:597:14)
    at CookieJar.callSync (D:\Fire\Proyectos\my-crawler-borrar\node_modules\tough-cookie\dist\cookie\cookieJar.js:168:16)
    at CookieJar.getCookieStringSync (D:\Fire\Proyectos\my-crawler-borrar\node_modules\tough-cookie\dist\cookie\cookieJar.js:610:22) {"id":"ZTnkJJu5aEw0Obe","url":"https://www.google.com/Antonov++Andrii/","retryCount":3}
INFO  CheerioCrawler: Error analysis: {"totalErrors":3,"uniqueErrors":1,"mostCommonErrors":["3x: Invalid regular expression: _ Nothing to repeat (<anonymous>)"]}
INFO  CheerioCrawler: Finished! Total 4 requests: 1 succeeded, 3 failed. {"terminal":true}
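
The "Nothing to repeat" message in the trace above is what JavaScript reports when a string containing "++" is compiled as a regular expression. The following is a minimal sketch of that failure in isolation. It is not the tough-cookie source: `escapeRegExp` and `tryBuild` are hypothetical helpers introduced only for illustration, and the pattern shape is taken from the error message in the log.

```typescript
// Hypothetical reduction of the failure; this is NOT the tough-cookie code.

// Escape every regex metacharacter so the text is matched literally.
function escapeRegExp(text: string): string {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Try to compile a pattern and report whether it is valid.
function tryBuild(pattern: string): string {
    try {
        new RegExp(pattern);
        return 'ok';
    } catch (error) {
        return error instanceof SyntaxError ? 'SyntaxError' : 'other';
    }
}

// Path taken from the failing request in the log above.
const path = '/Antonov++Andrii/';

// Interpolating the raw path leaves "++" as two quantifiers in a row.
const unescaped = tryBuild(`^${path}`);

// Escaping the path first turns each "+" into a literal character.
const escaped = tryBuild(`^${escapeRegExp(path)}`);

console.log({ unescaped, escaped });
```

The sketch only shows why such a path cannot be used as a regex pattern without escaping; which line of pathMatch.js builds the pattern is visible only through the stack trace.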

Code sample

// For more information, see https://crawlee.dev/
import { CheerioCrawler } from 'crawlee';

// Example URLs whose paths contain regex metacharacters (they return a 404, but still trigger the error):
const startUrls = [
    "https://www.example.com/dev/Cibus+%7C+Pluxee/",
    "https://www.example.com/dev/Y+C++S+T+U+D+I+O/",
    "https://www.example.com/dev/Antonov++Andrii/",
    'https://www.example.com/dev/Mobile+Dialer+%28+HelloBDTel+-Ten+Card+Company+%29'
];

const crawler = new CheerioCrawler({
    // proxyConfiguration: new ProxyConfiguration({ proxyUrls: ['...'] }),
    requestHandler: async ({ request, $, log }) => {
        const title = $('title').text();
        log.info(`${title}`, { url: request.loadedUrl });
    },
    errorHandler: async ({ }, _: Error) => {
        //        console.log(request.url);
    },
    // Comment this option to scrape the full website.
    maxRequestsPerCrawl: 4,
    persistCookiesPerSession: false,
    preNavigationHooks: [
        (crawlingContext, _) => {
            // ...
            try {
                const { session, request } = crawlingContext;
                if (session) {
                    const cookieString = 'adlt=1;';

                    const urlWithoutPath = new URL(request.url);
                    urlWithoutPath.pathname = '/'; // Reset the path to just "/"
                    const targetUrl = urlWithoutPath.toString();

                    session.setCookie(cookieString, targetUrl);
                }
            } catch (error) {
                // Ignore any error thrown while setting the cookie.
            }
        },
    ],
});

await crawler.run(startUrls);

Package version

3.11.4, 3.11.5

Node.js version

22.6.0

Operating system

Windows 11, and Ubuntu 24.04

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

@firecrauter firecrauter added the bug Something isn't working. label Oct 27, 2024
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Oct 27, 2024
@firecrauter firecrauter changed the title [BUG] Error in session.setCookie function: URL incorrectly interpreted as Regex [BUG] Error in session.setCookie (tough-cookie): URL incorrectly interpreted as Regex Oct 27, 2024
@firecrauter firecrauter changed the title [BUG] Error in session.setCookie (tough-cookie): URL incorrectly interpreted as Regex [BUG] session.setCookie (tough-cookie): URL incorrectly interpreted as Regex Oct 27, 2024
@firecrauter firecrauter reopened this Oct 29, 2024
@firecrauter firecrauter reopened this Nov 3, 2024

firecrauter commented Nov 3, 2024

Fixed in /tough-cookie/pull/465. I guess now I just have to wait for a tough-cookie release that includes the fix, and then Crawlee can update its dependency.
Sorry for all the cross-references and for opening and closing the issue so many times.
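
For readers wondering how cookie path matching can avoid regular expressions altogether, here is a minimal sketch of the path-match rule from RFC 6265, section 5.1.4, written with plain string comparisons. This is an assumption about one possible approach, not a description of what the linked pull request actually changed; the function name `pathMatch` simply mirrors the one in the stack trace.

```typescript
// Sketch of RFC 6265 section 5.1.4 path matching using string operations only.
// Assumption: illustrative code, not the implementation shipped in tough-cookie.
function pathMatch(requestPath: string, cookiePath: string): boolean {
    // Rule 1: the two paths are identical.
    if (requestPath === cookiePath) {
        return true;
    }
    // Rules 2 and 3 both require the cookie path to be a prefix.
    if (!requestPath.startsWith(cookiePath)) {
        return false;
    }
    // Rule 2: the cookie path ends with "/".
    if (cookiePath.endsWith('/')) {
        return true;
    }
    // Rule 3: the first request-path character after the prefix is "/".
    return requestPath.charAt(cookiePath.length) === '/';
}

// A path containing "++" is compared literally, so nothing can throw.
const matchesRoot = pathMatch('/Antonov++Andrii/', '/');
const matchesPrefix = pathMatch('/Antonov++Andrii/', '/Antonov++Andrii');
const matchesPartialSegment = pathMatch('/foobar', '/foo');

console.log({ matchesRoot, matchesPrefix, matchesPartialSegment });
```

Because no pattern is compiled, characters such as "+", "(" or "|" in the request path have no special meaning in this version.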
