Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add option to --no-check-certs use at own risk #89

Merged
merged 8 commits into from
Feb 3, 2024
Merged

Conversation

vsoch
Copy link
Collaborator

@vsoch vsoch commented Feb 1, 2024

This will address urlstechie/urlchecker-action#105. After it is tested by the person that opened the issue we will merge, release and update the action.

@vsoch
Copy link
Collaborator Author

vsoch commented Feb 1, 2024

@SuperKogito it looks like one of your previously working escape sequences is now considered invalid syntax:

urlchecker/core/urlmarker.py:53
  /home/runner/work/urlchecker-python/urlchecker-python/urlchecker/core/urlmarker.py:53: SyntaxWarning: invalid escape sequence '\['
    "[^\\s()<>\[\\]]+|\\([^\\s()]*?\\([^\\s()]+\\)[^\\s()]*?\\)",

And then it's not detecting the urls and some tests fail I think?

Signed-off-by: vsoch <[email protected]>
@@ -50,7 +50,7 @@
")",
"|[a-z0-9.\\-]+[.](?:%s)/)" % domain_extensions,
"(?:",
"[^\\s()<>\[\\]]+|\\([^\\s()]*?\\([^\\s()]+\\)[^\\s()]*?\\)",
r"[^\\s()<>\[\\]]+|\\([^\\s()]*?\\([^\\s()]+\\)[^\\s()]*?\\)",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything other than this looks good to me. I will check this in a couple of hours from now and see if I can come up with a quick fix.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! This was my effort to fix the warning - turning into a raw string (what I did above) helps sometimes. But the tests are still failing - it's not even detecting URLs on many cases, so we have a larger issue on our hands.

I appreciate your help @SuperKogito !

Copy link
Member

@SuperKogito SuperKogito Feb 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the good news, you had the right fix to the warning and we have some good tests.
The bad news, we got a lot of deprecation warnings and some tough URLs to check e.g. https://codepen.io/rootwork/ cannot be checked from my machine but the link is there.

Also check this, this is the same test, https://groups.drupal.org/node/278968 causes the fail at the first one and then works in the second. Zero code changed in between just timing I guess.
ucheck1
ucheck2

My suggestions are the following:

This is what worked for me:

URL_REGEX = r"".join(
    (
        r"(?i)\b(",
        r"(?:",
        r"https?:(?:/{1,3}|[a-z0-9%]",
        r")",
        r"|[a-z0-9.\-]+[.](?:%s)/)" % domain_extensions,
        r"(?:",
        r"[^\s()<>\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)",
        r"|\([^\\s]+?\)",
        r")",
        r"+",
        r"(?:",
        r"\([^\s()]*?\([^\s()]+\)[^\s()]*?\)",
        r"|\([^\\s]+?\)",
        r"|[^\s`!()\[\];:'\".,<>?«»“”‘’]",
        r")",
        r"|",
        r"(?:",
        r"(?<!@)[a-z0-9]",
        r"+(?:[.\-][a-z0-9]+)*[.]",
        r"(?:%s)\b/?(?!@)" % domain_extensions,
        r"))",
    )
)

*I did not make any direct changes to the branch since I don't have a better solution and your input on these matters is very important.

Copy link
Member

@SuperKogito SuperKogito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall lgtm.

@vsoch
Copy link
Collaborator Author

vsoch commented Feb 1, 2024

@SuperKogito could you please do a PR to the PR branch here (and then it can be also tested)?

@SuperKogito
Copy link
Member

#90 but I am having the same fail error :/

@vsoch
Copy link
Collaborator Author

vsoch commented Feb 3, 2024

okay doing a bit of work:

  • using HEAD instead of get, falling back to get if 405 method not allowed
  • added support to target single files (shown below)

image

Going to look at the checking issues next.

@vsoch
Copy link
Collaborator Author

vsoch commented Feb 3, 2024

okay I found the issues - nothing to do with our regex or the requests, it was an update to selenium webdriver that deprecated some of the logic we were using. As a result, the driver was failing, returning to be None, and since that is the primary means to get a lot of these URLs (e.g., the initial requests response is not allowed), a lot (actually many) were failing. This is becoming more common with websites, as is logical, they don't want people scraping. But they can't prevent a selenium webdriver from doing so.

image

I'm finishing up local tests now and will push the fixes shortly.

@vsoch
Copy link
Collaborator Author

vsoch commented Feb 3, 2024

Note for myself: we will need to update the driver in the Dockerfile as well, once we find the one that matches GH actions.

The current failures are a result of an update to selenium,
so the instantiation of our driver fails, returns as None,
and then all the requests are done with only requests. As
the web matures (and sites do not want scraping) it is less
likely this approach will work - we need the driver. This
change will update the selenium UI to ensure the driver works
and restore functionality. I will follow up with any tweaks
needed for the CI (working locally for me).

Signed-off-by: vsoch <[email protected]>
@vsoch
Copy link
Collaborator Author

vsoch commented Feb 3, 2024

That green is sure beautiful :) 🍏 https://github.com/urlstechie/urlchecker-python/actions/runs/7769722953/job/21189125531?pr=89

Just pushed the update for the container, and we should be able to merge and release soon and test with the action.

@vsoch vsoch merged commit d0e7560 into master Feb 3, 2024
4 checks passed
@vsoch vsoch deleted the add-skip-check-certs branch February 3, 2024 22:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants