
Add logic to verify URLs using HTML meta tag #16597

Open
wants to merge 5 commits into main

Conversation

facutuesca
Contributor

@facutuesca commented Aug 29, 2024

Part of #8635, this PR adds a function to verify arbitrary URLs by parsing their HTML and looking for a specific meta tag.

Concretely, a webpage with a meta tag in its <head> element like the following:

<meta content="package1 package2" namespace="pypi.org" rel="me" />

would pass validation for the package1 and package2 PyPI projects.
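
For illustration, here is a minimal sketch of how such a tag could be extracted with lxml (the helper name is hypothetical, not the PR's actual function):

from lxml import html


def projects_from_meta_tag(content: bytes) -> set[str]:
    # html.fromstring() recovers from truncated/partial HTML, so parsing only
    # the first N bytes of a page works as long as the tag appears in them.
    document = html.fromstring(content)
    values = document.xpath('//meta[@namespace="pypi.org" and @rel="me"]/@content')
    # The content attribute holds a space-separated list of project names.
    return {name for value in values for name in value.split()}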

This PR only adds the function and its tests. The function is not used anywhere yet.

This implementation takes into account the discussion in the issue linked above, starting with this comment: #8635 (comment).

Concretely (a rough sketch of these checks follows the list):

  • URLs must use https://
  • The hostname must be a regular domain name (i.e. domain.tld); it cannot be an IP address (e.g. https://100.100.100.100)
  • If a port is present, it must be 443 (we could also remove this, and require that no port is present)
  • Before getting the HTML, we resolve the URL to an IP address, and check that it's a global IP and not a private or shared IP
  • We limit the amount of content we download to 1024 bytes (this number was an arbitrary choice and is open to change)
  • HTML is parsed using lxml, which recovers from partial HTML, so reading only the first N bytes should be fine as long as they contain the tag we are looking for
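
For reference, a rough sketch of these pre-checks (function name and details are illustrative, not the PR's actual code):

import ipaddress
import socket
from urllib.parse import urlsplit


def url_passes_pre_checks(url: str) -> bool:
    parts = urlsplit(url)

    # Must use https://, and if a port is given at all it must be 443.
    try:
        port_ok = parts.port is None or parts.port == 443
    except ValueError:  # non-numeric port
        return False
    if parts.scheme != "https" or not port_ok or not parts.hostname:
        return False

    # The host must be a regular domain name, not a literal IP address.
    try:
        ipaddress.ip_address(parts.hostname)
        return False  # literal IP addresses are rejected
    except ValueError:
        pass

    # The name must resolve, and every resolved address must be a global IP
    # (not private, shared, loopback, ...).
    try:
        addr_info = socket.getaddrinfo(parts.hostname, 443)
    except socket.gaierror:
        return False
    return all(
        ipaddress.ip_address(sockaddr[0]).is_global for *_, sockaddr in addr_info
    )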

I'm opening this PR with only the verification logic since it's the part that requires the most review and discussion. Once it's done we can see how to integrate it with the current upload flow (probably as an asynchronous task).

cc @woodruffw @ewjoachim

@facutuesca requested a review from a team as a code owner August 29, 2024 17:28
@woodruffw
Member

  • If a port is present, it must be 443 (we could also remove this, and require that no port is present)

I'm +1 on removing this outright -- I think the volume of legitimate users who actually need to explicitly list a port is probably vanishingly small 🙂

)
r.raise_for_status()

content = next(r.iter_content(max_length_bytes))
Member

I think r.raw.read(max_length_bytes) will be slightly faster + more idiomatic here, since we're already setting stream=True 🙂
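
For context, a minimal illustration of that pattern (URL, timeout, and max_length_bytes are placeholders, not the PR's actual values):

import requests

# With stream=True the body is not downloaded eagerly; r.raw is the underlying
# urllib3 response, so we can read at most max_length_bytes of it.
max_length_bytes = 1024
r = requests.get("https://example.com/", stream=True, timeout=5)
r.raise_for_status()
content = r.raw.read(max_length_bytes)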

Contributor Author

Fixed!

r.raise_for_status()

content = next(r.iter_content(max_length_bytes))
return content
Member

Thinking out loud: should we explicitly r.close() before exiting this method? As-is, I think this will leave the connection open and dangling until the server hangs up, meaning that we could end up slowly exhausting the number of available outbound sockets.

Contributor Author

Good catch! This also applies to the new urllib3 implementation, so we now call

    r.drain_conn()
    r.release_conn()

before returning

Member

Hmm, is there a way we can avoid drain_conn? My understanding is that r.drain_conn(...) will block until the entire response is read (and dropped), meaning that this will end up effectively reading the whole response instead of truncating after the first X bytes.

Per: https://urllib3.readthedocs.io/en/stable/reference/urllib3.response.html#urllib3.response.HTTPResponse.drain_conn

(If I got that right, then I think we can avoid this by removing the call and doing nothing else, since we're already using a separate connection pool per request and there's no reason to release back to a pool that we don't reuse 🙂)

Contributor Author

I think that makes sense, but shouldn't we close the connection? The docs mention that if you don't care about returning the connection to the pool, you can call close() to close the connection:

You can call the close() to close the connection, but this call doesn’t return the connection to the pool, throws away the unread data on the wire, and leaves the connection in an undefined protocol state. This is desirable if you prefer not reading data from the socket to re-using the HTTP connection.

Contributor

Yeah I recommend calling HTTPResponse.close() instead of drain_conn() if you don't plan on reusing the connection, that will close out the socket.
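
Roughly, the resulting pattern could look like this (pool setup and sizes are assumed, not the PR's actual code):

import urllib3

# Read at most max_length_bytes of the body, then close the socket outright
# since the connection will not be reused.
max_length_bytes = 1024
pool = urllib3.HTTPSConnectionPool(host="example.com", port=443)
resp = pool.request("GET", "/", preload_content=False)
try:
    content = resp.read(max_length_bytes)
finally:
    resp.close()  # drops unread data; does not return the connection to the pool
pool.close()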

Contributor Author

fixed!

Comment on lines 98 to 116
# The domain name should not resolve to a private or shared IP address
try:
    address_tuples = socket.getaddrinfo(user_uri.host, user_uri.port)
    for family, _, _, _, sockaddr in address_tuples:
        ip_address: ipaddress.IPv4Address | ipaddress.IPv6Address | None = None
        if family == socket.AF_INET:
            ip_address = ipaddress.IPv4Address(sockaddr[0])
        elif family == socket.AF_INET6:
            ip_address = ipaddress.IPv6Address(sockaddr[0])
        if ip_address is None or not ip_address.is_global:
            return False
except (socket.gaierror, ipaddress.AddressValueError):
    return False

# We get the first 1024 bytes
try:
    content = _get_url_content(user_uri, max_length_bytes=1024)
except requests.exceptions.RequestException:
    return False
Member

Flagging: this has an unfortunate TOC/TOU weakness, where getaddrinfo might return a public IP, and the subsequent resolution in requests might return a private/internal one. In other words, an attacker could race the check here to get it to pass.

I think the correct way to do this is to either inspect the socket object underneath the requests response, or to use the resolved IP directly and employ the Host header + SNI to resolve the correct domain from the server. But both of these are annoying to do 🙂

For the latter, HostHeaderSSLAdapter from requests_toolbelt is possibly the most painless approach: https://toolbelt.readthedocs.io/en/latest/adapters.html#requests_toolbelt.adapters.host_header_ssl.HostHeaderSSLAdapter

(This may also be possible more easily via urllib3 instead -- maybe @sethmlarson knows?)

Contributor

This should be easier with urllib3, you can send your own Host and SNI values more directly.

Contributor

Just confirmed with Wireshark that this works for urllib3:

http = urllib3.HTTPSConnectionPool(
    host="93.184.215.14",
    port=443,
    headers={"Host": "example.com"},
    server_hostname="example.com",
    assert_hostname="example.com",
)
resp = http.request("GET", "/")

Sends SNI of example.com, asserts example.com on the cert, sends example.com in the Host header, but doesn't do any DNS resolution. I should probably add a super-strict integration test for this construction to the urllib3 test suite; the individual parts are tested, but having it all tested together is quite nice to double-check.

Member

Amazing, thanks @sethmlarson!

Contributor Author

Thanks! The implementation now uses urllib3
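
For illustration, a rough sketch of how the pinned-IP request could be combined with the address check above (hostname, sizes, and error handling are assumptions, not the PR's actual code; IPv4-only for brevity):

import ipaddress
import socket

import urllib3

# Resolve the name once, check the address, then connect to that exact IP
# while keeping Host/SNI/certificate verification on the original hostname,
# so a second DNS lookup cannot be raced to a private address.
hostname = "example.com"
addr = socket.getaddrinfo(hostname, 443, family=socket.AF_INET)[0][4][0]
if not ipaddress.ip_address(addr).is_global:
    raise ValueError("refusing to fetch a non-global address")

pool = urllib3.HTTPSConnectionPool(
    host=addr,
    port=443,
    headers={"Host": hostname},
    server_hostname=hostname,
    assert_hostname=hostname,
)
resp = pool.request("GET", "/", preload_content=False)
try:
    content = resp.read(1024)
finally:
    resp.close()
pool.close()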

@ewjoachim
Contributor

(I just opened a random blog: I typed "some blog" into Google, got a list of the best blogs per category, and in "education" the first link was https://blog.ed.ted.com/; its complete <head> tag is about 10k bytes. I think 100k bytes is probably much safer than 1024.)

@facutuesca
Contributor Author

(I just opened a random blog: I typed "some blog" into Google, got a list of the best blogs per category, and in "education" the first link was https://blog.ed.ted.com/; its complete <head> tag is about 10k bytes. I think 100k bytes is probably much safer than 1024.)

Changed to 100000 bytes
