[Feature] Resolve daily Tranco list on the probes #1604
Thanks for writing up this proposal. This does sound like an interesting idea to explore. I will leave here a few thoughts on things we should consider while designing an experiment to do this. The biggest concern for sure is user safety: the main thing we need to assess is what impact these measurements may have on users.
One way in which we could address this is to either make this measurement run only on certain classes of probes (for example only the ones running on servers or on iThena) or provide an additional opt-in mechanism to turn on these tests. Another approach would be to see if it's feasible to do some level of sanitisation of the top N Tranco domains and exclude things that are known to be bad, though this is likely to be quite challenging. In any case, the risk to end users is probably the first and most important thing that needs to be assessed.

Regarding the implementation, I think we would rather have this be implemented as a general-purpose DNS resolution test for which the provisioning of inputs is handled entirely on the backend side. This is similar to how we run our web_connectivity test, where we have the ability to return different sets of addresses depending on the probe that's requesting them, or even to disable running the test altogether.

Another aspect to consider is the potential impact of running such a test at scale on all our probes. Do you have a sense of how many domains you would like to have tested and with what frequency? For example, our backend already has support for performing URL list prioritization in order to maximize coverage (i.e., only run inputs that haven't been run on a given day in a particular network) and spread the load across multiple probes. The reason for asking is that we need to consider both how much additional network load a test like this would add to our probes (both for measurement execution and for upload to our collector) and what load it would add to our backend infrastructure, which would need to process this additional measurement data.
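The backend-side provisioning described above could be sketched roughly as follows. This is a minimal illustration, not OONI's actual check-in API: `provision_inputs`, `domain_lists`, the `"ZZ"` global fallback key, and `disabled_networks` are all hypothetical names.

```python
def provision_inputs(probe_cc, probe_asn, domain_lists, disabled_networks):
    """Hypothetical check-in handler: return the list of domains a probe
    should resolve, or an empty list to disable the test entirely,
    mirroring how web_connectivity serves per-probe inputs.

    domain_lists: dict mapping country code -> list of domains,
                  with "ZZ" as an assumed global fallback key.
    disabled_networks: set of (country_code, asn) pairs where the
                       test should not run at all.
    """
    if (probe_cc, probe_asn) in disabled_networks:
        return []  # disable the test for this network
    # Prefer a country-specific list, fall back to the global one.
    return domain_lists.get(probe_cc, domain_lists.get("ZZ", []))
```

Returning an empty list as the "off switch" keeps the probe-side logic trivial: the probe just resolves whatever it is handed.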
Thanks for the detailed reply.
Could you describe, or provide a link describing, what you mean by “our public IP scrubbing functionality”?
I'm not sure what a suspicious domain would be, and it might vary by region. From a quick inspection of today's list, multiple domains could be concerning: e.g., prohibited social media websites, newspapers, and adult content websites.
Resolving the top 1K (or up to 10K) might be interesting. One resolution per domain, per day, per probe (or per AS) would be an upper bound on the frequency: the Tranco list is updated daily, so checking more often does not bring more value. Since you suggest provisioning the domains to resolve from the backend, we could implement a smarter selection of domains among the top 1K, similar to the URL list prioritization you describe. For example, only resolve domains that have not yet been resolved in the last 72 hours in the AS where the probe is located.
Starting with only servers and iThena probes would already help a lot compared to the current situation, where we do not have the data at all. Providing an opt-in test to user-operated probes is an interesting next step. Moreover, running the tests on servers and on iThena probes is less likely to put users at risk, and it is less concerning to resolve hundreds of domains in such a setting.
From the description of the experiment, the output contains the data we are interested in: network location of the probe (AS, IP, CC, time) and the result of the DNS resolution for the domain.
Using the resolver configured on the probe is more interesting, but using a hard-coded public resolver may be sufficient if using the probe's resolver puts the user at risk somehow. I'd like to emphasize that the data collected with this proposal could be of broader use, not only in our case.
Context
The Tranco list (https://tranco-list.eu) is “A Research-Oriented Top Sites Ranking Hardened Against Manipulation”. It is often used by researchers to determine which domains were most popular at a given time. The list has been generated daily since 2019.
When doing historical network simulations (e.g., in the case of research around Tor) it is helpful to know what were the popular destinations at the simulated time. However, those domains need to be resolved to IP addresses to be meaningful in the simulation. DNS queries can yield different results depending on the geographic location of the client. DNS records can also be updated over time.
For those reasons, I feel there is room for a “DNS-resolved Tranco list” dataset, and OONI might be the right place to build and host this data. The goal of the dataset is to have a history of how DNS queries for popular domains were resolved over time and across countries.
Feature proposal
Create a new nettest to resolve the top N domains of today's Tranco list. For each domain, the probe would ideally contribute the following data back to OONI: the network location of the probe (AS, IP, CC, time) and the result of the DNS resolution for the domain.

The value of N needs to be set to a reasonable number: the Tranco list can be downloaded for the top 1M or in full (today's list contains ~4.5M domains). In my opinion, starting with a smaller value of N (like 1K or 2K) may already provide interesting data.
As the list is designed to be hardened against manipulation, one might expect the same set of domains to appear in the top of the list for multiple days in a row.
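As a rough illustration of what the core of such a nettest might do, here is a hedged Python sketch that resolves A and AAAA records through the system resolver via the standard library. The record fields are illustrative only and do not follow OONI's actual measurement schema.

```python
import socket
import time

def resolve_domain(domain):
    """Resolve A and AAAA records for one domain with the system resolver
    and return a measurement-style record (sketch; field names are
    illustrative, not OONI's schema)."""
    record = {"domain": domain, "t": time.time(), "answers": [], "failure": None}
    try:
        for family in (socket.AF_INET, socket.AF_INET6):
            try:
                infos = socket.getaddrinfo(domain, None, family,
                                           socket.SOCK_STREAM)
            except socket.gaierror:
                # No records of this family (or resolution failed for it).
                continue
            record["answers"] += sorted({info[4][0] for info in infos})
    except OSError as exc:
        record["failure"] = str(exc)
    return record
```

A real implementation would live in the probe's measurement engine and would also need to capture the resolver used, query timing, and the raw DNS responses; `getaddrinfo` is used here only to keep the sketch dependency-free.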
Alternatives
While writing this proposal, I also considered using RIPE Atlas (https://atlas.ripe.net) to run these measurements. However, their credit model seems restrictive: to resolve the top 1K domains (A and AAAA) once a day from 50 different probes, I would need to run 47 probes 24/7 to earn enough credits.
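For reference, the 47-probe figure can be reproduced with back-of-the-envelope arithmetic. The two constants are assumptions about RIPE Atlas pricing at the time of writing (roughly 10 credits per DNS result, and about 21,600 credits earned per day by a hosted probe) and may have changed.

```python
# Assumed RIPE Atlas figures; check current pricing before relying on them.
CREDITS_PER_DNS_RESULT = 10
CREDITS_EARNED_PER_PROBE_PER_DAY = 21_600

domains = 1_000       # top 1K of the Tranco list
query_types = 2       # A and AAAA
vantage_points = 50   # distinct measuring probes

daily_cost = domains * query_types * vantage_points * CREDITS_PER_DNS_RESULT
# Ceiling division: number of hosted probes needed to earn that much per day.
probes_needed = -(-daily_cost // CREDITS_EARNED_PER_PROBE_PER_DAY)
```

Under these assumptions the daily cost is 1,000,000 credits, which works out to 47 hosted probes running continuously.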
Contributing
If this proposal is accepted by OONI, I may take part in the development of the feature and submit PRs.