
Add logic to verify URLs using HTML meta tag #16597

Open
wants to merge 5 commits into main

Conversation

facutuesca
Contributor

@facutuesca commented Aug 29, 2024

Part of #8635, this PR adds a function to verify arbitrary URLs by parsing their HTML and looking for a specific meta tag.

Concretely, a webpage with a meta tag in its <head> element like the following:

<meta content="package1 package2" namespace="pypi.org" rel="me" />

would pass validation for the package1 and package2 PyPI projects.
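
For illustration, here is a minimal sketch of how such a tag could be extracted with lxml (the helper name is hypothetical, not the PR's actual function):

from lxml import html


def projects_from_meta_tag(content: bytes) -> set[str]:
    # html.fromstring() recovers from truncated/partial HTML, so parsing only
    # the first N bytes of a page works as long as the tag appears in them.
    document = html.fromstring(content)
    values = document.xpath('//meta[@namespace="pypi.org" and @rel="me"]/@content')
    # The content attribute holds a space-separated list of project names.
    return {name for value in values for name in value.split()}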

This PR only adds the function and its tests. The function is not used anywhere yet.

This implementation takes into account the discussion in the issue linked above, starting with this comment: #8635 (comment).

Concretely (a rough sketch of these checks follows the list):

  • URLs must use https://
  • The hostname must be a regular domain name (i.e. domain.tld); it cannot be an IP address (e.g. https://100.100.100.100)
  • If a port is present, it must be 443 (we could also remove this, and require that no port is present)
  • Before getting the HTML, we resolve the URL to an IP address, and check that it's a global IP and not a private or shared IP
  • We limit the amount of content we download to 1024 bytes (this number was an arbitrary choice and is open to change)
  • HTML is parsed using lxml, which recovers from partial HTML, so reading only the first N bytes should be fine as long as they contain the tag we are looking for
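
For reference, a rough sketch of these pre-checks (function name and details are illustrative, not the PR's actual code):

import ipaddress
import socket
from urllib.parse import urlsplit


def url_passes_pre_checks(url: str) -> bool:
    parts = urlsplit(url)

    # Must use https://, and if a port is given at all it must be 443.
    try:
        port_ok = parts.port is None or parts.port == 443
    except ValueError:  # non-numeric port
        return False
    if parts.scheme != "https" or not port_ok or not parts.hostname:
        return False

    # The host must be a regular domain name, not a literal IP address.
    try:
        ipaddress.ip_address(parts.hostname)
        return False  # literal IP addresses are rejected
    except ValueError:
        pass

    # The name must resolve, and every resolved address must be a global IP
    # (not private, shared, loopback, ...).
    try:
        addr_info = socket.getaddrinfo(parts.hostname, 443)
    except socket.gaierror:
        return False
    return all(
        ipaddress.ip_address(sockaddr[0]).is_global for *_, sockaddr in addr_info
    )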

I'm opening this PR with only the verification logic since it's the part that requires the most review and discussion. Once it's done we can see how to integrate it with the current upload flow (probably as an asynchronous task).

cc @woodruffw @ewjoachim

@facutuesca requested a review from a team as a code owner August 29, 2024 17:28
@woodruffw
Member

  • If a port is present, it must be 443 (we could also remove this, and require that no port is present)

I'm +1 on removing this outright -- I think the volume of legitimate users who actually need to explicitly list a port is probably vanishingly small 🙂

)
r.raise_for_status()

content = next(r.iter_content(max_length_bytes))
Member

I think r.raw.read(max_length_bytes) will be slightly faster + more idiomatic here, since we're already setting stream=True 🙂
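
For context, a minimal illustration of that pattern (URL, timeout, and max_length_bytes are placeholders, not the PR's actual values):

import requests

# With stream=True the body is not downloaded eagerly; r.raw is the underlying
# urllib3 response, so we can read at most max_length_bytes of it.
max_length_bytes = 1024
r = requests.get("https://example.com/", stream=True, timeout=5)
r.raise_for_status()
content = r.raw.read(max_length_bytes)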

Contributor Author

Fixed!

r.raise_for_status()

content = next(r.iter_content(max_length_bytes))
return content
Member

Thinking out loud: should we explicitly r.close() before exiting this method? As-is, I think this will leave the connection open and dangling until the server hangs up, meaning that we could end up slowly exhausting the number of available outbound sockets.

Contributor Author

Good catch! This also applies to the new urllib3 implementation, so we now call

    r.drain_conn()
    r.release_conn()

before returning

Member

Hmm, is there a way we can avoid drain_conn? My understanding is that r.drain_conn(...) will block until the entire response is read (and dropped), meaning that this will end up effectively reading the whole response instead of truncating after the first X bytes.

Per: https://urllib3.readthedocs.io/en/stable/reference/urllib3.response.html#urllib3.response.HTTPResponse.drain_conn

(If I got that right, then I think we can avoid this by removing the call and doing nothing else, since we're already using a separate connection pool per request and there's no reason to release back to a pool that we don't reuse 🙂)

Contributor Author

I think that makes sense, but shouldn't we close the connection? The docs mention that if you don't care about returning the connection to the pool, you can call close() to close the connection:

You can call the close() to close the connection, but this call doesn’t return the connection to the pool, throws away the unread data on the wire, and leaves the connection in an undefined protocol state. This is desirable if you prefer not reading data from the socket to re-using the HTTP connection.

Contributor

Yeah I recommend calling HTTPResponse.close() instead of drain_conn() if you don't plan on reusing the connection, that will close out the socket.
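
Roughly, the resulting pattern could look like this (pool setup and sizes are assumed, not the PR's actual code):

import urllib3

# Read at most max_length_bytes of the body, then close the socket outright
# since the connection will not be reused.
max_length_bytes = 1024
pool = urllib3.HTTPSConnectionPool(host="example.com", port=443)
resp = pool.request("GET", "/", preload_content=False)
try:
    content = resp.read(max_length_bytes)
finally:
    resp.close()  # drops unread data; does not return the connection to the pool
pool.close()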

Contributor Author

fixed!

Comment on lines 98 to 116
# The domain name should not resolve to a private or shared IP address
try:
    address_tuples = socket.getaddrinfo(user_uri.host, user_uri.port)
    for family, _, _, _, sockaddr in address_tuples:
        ip_address: ipaddress.IPv4Address | ipaddress.IPv6Address | None = None
        if family == socket.AF_INET:
            ip_address = ipaddress.IPv4Address(sockaddr[0])
        elif family == socket.AF_INET6:
            ip_address = ipaddress.IPv6Address(sockaddr[0])
        if ip_address is None or not ip_address.is_global:
            return False
except (socket.gaierror, ipaddress.AddressValueError):
    return False

# We get the first 1024 bytes
try:
    content = _get_url_content(user_uri, max_length_bytes=1024)
except requests.exceptions.RequestException:
    return False
Member

Flagging: this has an unfortunate TOC/TOU weakness, where getaddrinfo might return a public IP, and the subsequent resolution in requests might return a private/internal one. In other words, an attacker could race the check here to get it to pass.

I think the correct way to do this is to either inspect the socket object underneath the requests response, or to use the resolved IP directly and employ the Host header + SNI to resolve the correct domain from the server. But both of these are annoying to do 🙂

For the latter, HostHeaderSSLAdapter from requests_toolbelt is possibly the most painless approach: https://toolbelt.readthedocs.io/en/latest/adapters.html#requests_toolbelt.adapters.host_header_ssl.HostHeaderSSLAdapter

(This may also be possible more easily via urllib3 instead -- maybe @sethmlarson knows?)

Contributor

This should be easier with urllib3, you can send your own Host and SNI values more directly.

Contributor

Just confirmed with Wireshark that this works for urllib3:

http = urllib3.HTTPSConnectionPool(
    host="93.184.215.14",
    port=443,
    headers={"Host": "example.com"},
    server_hostname="example.com",
    assert_hostname="example.com",
)
resp = http.request("GET", "/")

Sends SNI of example.com, asserts example.com on the cert, sends example.com in the Host header, but doesn't do any DNS resolution. I should probably add a super-strict integration test for this construction to the urllib3 test suite; the individual parts are tested, but having it all tested together is quite nice to double-check.

Member

Amazing, thanks @sethmlarson!

Contributor Author

Thanks! The implementation now uses urllib3
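
For illustration, a rough sketch of how the pinned-IP request could be combined with the address check above (hostname, sizes, and error handling are assumptions, not the PR's actual code; IPv4-only for brevity):

import ipaddress
import socket

import urllib3

# Resolve the name once, check the address, then connect to that exact IP
# while keeping Host/SNI/certificate verification on the original hostname,
# so a second DNS lookup cannot be raced to a private address.
hostname = "example.com"
addr = socket.getaddrinfo(hostname, 443, family=socket.AF_INET)[0][4][0]
if not ipaddress.ip_address(addr).is_global:
    raise ValueError("refusing to fetch a non-global address")

pool = urllib3.HTTPSConnectionPool(
    host=addr,
    port=443,
    headers={"Host": hostname},
    server_hostname=hostname,
    assert_hostname=hostname,
)
resp = pool.request("GET", "/", preload_content=False)
try:
    content = resp.read(1024)
finally:
    resp.close()
pool.close()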

@ewjoachim
Contributor

(I just opened a random blog: I typed "some blog" into Google, got a list of the best blogs per category, and in "education" the first link was https://blog.ed.ted.com/; its complete <head> tag is about 10k bytes. I think 100k bytes is probably much safer than 1024.)

@facutuesca
Contributor Author

(I just opened a random blog: I typed "some blog" into Google, got a list of the best blogs per category, and in "education" the first link was https://blog.ed.ted.com/; its complete <head> tag is about 10k bytes. I think 100k bytes is probably much safer than 1024.)

Changed to 100000 bytes
