-
Notifications
You must be signed in to change notification settings - Fork 142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
android: properly handle NXDOMAIN errors #2029
Comments
It seems this problem is interesting enough to deserve some in-depth level of double checking. We need to figure out whether there are some specifics that make it work somewhere or whether it's all broken. Let's use as test case the So, the first objective is to figure our in which platforms does it happen.
So, this seems to be an Android-only problem. This begs two questions: (1) what is the root cause and (2) how to test it. |
So, debugging this issue is not immediately straightforward. I think it makes sense for me to explain how I did it. The first thing to do is to start an emulator using Android studio. Then, you need root access (IIUC this is mainly to be able to write the disk): adb root Then, you need to compile GOOS=android GOARCH=386 CGO_ENABLED=1 CC=$HOME/sdk/ooni-android/ndk/23.1.7779620/toolchains/llvm/prebuilt/linux-x86_64/bin/i686-linux-android30-clang LD=$HOME/sdk/ooni-android/ndk/23.1.7779620/toolchains/llvm/prebuilt/linux-x86_64/bin/ld go build -v ./internal/cmd/miniooni Then you need to copy the $HOME/sdk/ooni-android/platform-tools/adb push miniooni /cache Then you need to shell into Android: ~/sdk/ooni-android/platform-tools/adb shell At this point it's straightforward: cd /cache
chmod +x miniooni
./miniooni -i https://thefoobarbay.com web_connectivity And here's the measurement: https://explorer.ooni.org/measurement/20220221T135225Z_webconnectivity_IT_30722_n1_ersvWVUkwEUJf22g?input=https%3A%2F%2Fthefoobarbay.com So, the nice fact about this is that we can reproduce using |
To gather more information about what is happening, here's a simple test program: #include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <netdb.h>
int main() {
struct addrinfo *res;
struct addrinfo hints;
memset(&hints, 0, sizeof (hints));
hints.ai_flags |= AI_CANONNAME;
int r = getaddrinfo("thefoobarbay.com", NULL, &hints, &res);
if (r != 0) {
fprintf(stderr, "failure: %d %s\n", r, gai_strerror(r));
exit(1);
}
struct addrinfo *aip = res;
for (; aip != NULL; aip = aip->ai_next) {
fprintf(stderr, "- %p\n", aip);
}
freeaddrinfo(res);
return 0;
} We can compile using ~/sdk/ooni-android/ndk/23.1.7779620/toolchains/llvm/prebuilt/linux-x86_64/bin/i686-linux-android30-clang -Wall -Wextra getaddrinfo.c After copying and running on the device, as we did above, we obtain: ./a.out
failure: 7 No address associated with hostname Now, the interesting thing is that, when running the same small program locally on GNU/Linux, I get: ./a.out
failure: -2 Name or service not known If we read the /*
* Error return codes from getaddrinfo()
*/
#define EAI_ADDRFAMILY 1 /* address family for hostname not supported */
#define EAI_AGAIN 2 /* temporary failure in name resolution */
#define EAI_BADFLAGS 3 /* invalid value for ai_flags */
#define EAI_FAIL 4 /* non-recoverable failure in name resolution */
#define EAI_FAMILY 5 /* ai_family not supported */
#define EAI_MEMORY 6 /* memory allocation failure */
#define EAI_NODATA 7 /* no address associated with hostname */
#define EAI_NONAME 8 /* hostname nor servname provided, or not known */
#define EAI_SERVICE 9 /* servname not supported for ai_socktype */
#define EAI_SOCKTYPE 10 /* ai_socktype not supported */
#define EAI_SYSTEM 11 /* system error returned in errno */
#define EAI_BADHINTS 12 /* invalid value for hints */
#define EAI_PROTOCOL 13 /* resolved protocol is unknown */
#define EAI_OVERFLOW 14 /* argument buffer overflow */
#define EAI_MAX 15 whereas, if we read my GNU/Linux system's /* Error values for `getaddrinfo' function. */
# define EAI_BADFLAGS -1 /* Invalid value for `ai_flags' field. */
# define EAI_NONAME -2 /* NAME or SERVICE is unknown. */
# define EAI_AGAIN -3 /* Temporary failure in name resolution. */
# define EAI_FAIL -4 /* Non-recoverable failure in name res. */
# define EAI_FAMILY -6 /* `ai_family' not supported. */
# define EAI_SOCKTYPE -7 /* `ai_socktype' not supported. */
# define EAI_SERVICE -8 /* SERVICE not supported for `ai_socktype'. */
# define EAI_MEMORY -10 /* Memory allocation failure. */
# define EAI_SYSTEM -11 /* System error returned in `errno'. */
# define EAI_OVERFLOW -12 /* Argument buffer overflow. */
# ifdef __USE_GNU
# define EAI_NODATA -5 /* No address associated with NAME. */
# define EAI_ADDRFAMILY -9 /* Address family for NAME not supported. */
# define EAI_INPROGRESS -100 /* Processing request in progress. */
# define EAI_CANCELED -101 /* Request canceled. */
# define EAI_NOTCANCELED -102 /* Request not canceled. */
# define EAI_ALLDONE -103 /* All requests done. */
# define EAI_INTR -104 /* Interrupted by a signal. */
# define EAI_IDN_ENCODE -105 /* IDN encoding failed. */
So, we can conclude that on Android Now, I think this should clarify what is the actual problem. Consider how Go handles this result value: if gerrno != 0 {
isErrorNoSuchHost := false
isTemporary := false
switch gerrno {
case C.EAI_SYSTEM:
if err == nil {
// err should not be nil, but sometimes getaddrinfo returns
// gerrno == C.EAI_SYSTEM with err == nil on Linux.
// The report claims that it happens when we have too many
// open files, so use syscall.EMFILE (too many open files in system).
// Most system calls would return ENFILE (too many open files),
// so at the least EMFILE should be easy to recognize if this
// comes up again. golang.org/issue/6232.
err = syscall.EMFILE
}
case C.EAI_NONAME:
err = errNoSuchHost
isErrorNoSuchHost = true
default:
err = addrinfoErrno(gerrno)
isTemporary = addrinfoErrno(gerrno).Temporary()
}
return nil, "", &DNSError{Err: err.Error(), Name: name, IsNotFound: isErrorNoSuchHost, IsTemporary: isTemporary}
} and it should be clear why we don't get the |
So, the simplest fix for this issue seems to add a specific filtering rule for Android that treats the given string specifically. A more structurde fix seems that we call |
The issue at ooni/probe#2029 is fixed if we directly call getaddrinfo and correctly map its return code. While the main reason to propose this diff is to fix the above mentioned issue, we should note that this diff also paves the way for ooni/probe#1569. (Of course, regarding ooni/probe#1569, we don't have the Rcode when we're calling getaddrinfo, but the spirit of ooni/probe#1569 is that we should include the lowest-level error we have seen and, when we're calling getaddrinfo, such an error is getaddrinfo's retval.)
The problem is explained in ooni/probe#2029. I am also working on a more comprehensive fix for this issue in #698. This diff WILL NOT need forwardporting. It's just meant as an hotfix for the release/3.14 branch and it's not such that I'd be happy keeping it in release/3.15+.
Commit ooni/probe-cli@2b48dcf contains an hotfix that should be good enough for the |
We're not going to close this issue now, since we're still working out a solution for |
This cherry-picks 2b48dcf for the release/3.15 branch. Original commit message follows: - - - The problem is explained in ooni/probe#2029. I am also working on a more comprehensive fix for this issue in #698. This diff WILL NOT need forwardporting. It's just meant as an hotfix for the release/3.14 branch and it's not such that I'd be happy keeping it in release/3.15+.
For |
Here's another measurement with this issue: https://explorer.ooni.org/measurement/20220121T025535Z_webconnectivity_PK_17557_n1_s8CkxsxBttmZoFZ1?input=http%3A%2F%2Fgovernmentofbalochistan.blogspot.com%2F (cc: @gurshabad) |
This commit changes our system resolver to call getaddrinfo directly when CGO is enabled. This change allows us to: 1. obtain the CNAME easily 2. obtain the real getaddrinfo retval See ooni/probe#2029 See ooni/probe#2033
I am about to merge ooni/probe-cli#764, which provides a theory explaining the issue we were witnessing for Android devices. A theory does not mean that it's necessarily reality, but it's convincing. Here's the theory:
See https://github.com/ooni/probe-cli/pull/764/files#diff-3b2397df28603c7fc86efad2e2e2d7a77b2e2806e7cb74b0eaea110aa1238f25R92 for the complete discussion. |
This commit changes our system resolver to call getaddrinfo directly when CGO is enabled. This change allows us to: 1. obtain the CNAME easily 2. obtain the real getaddrinfo retval 3. handle platform specific oddities such as `EAI_NODATA` returned on Android devices See ooni/probe#2029 and ooni/probe#2029 (comment) in particular. See ooni/probe#2033 for documentation regarding the desire to see `getaddrinfo`'s retval. See ooni/probe#2118 for possible follow-up changes.
See ooni/probe-cli#765, where I identified I needed to sync up spec with probe-cli. ## Checklist - [x] I have read the [contribution guidelines](https://github.com/ooni/spec/blob/master/CONTRIBUTING.md) - [x] reference issue for this pull request: ooni/probe#2029 - [x] related ooni/probe-cli pull request: ooni/probe-cli#765 - [x] If I changed a spec, I also bumped its version number and/or date
After #764, the build for CGO_ENABLED=0 has been broken for miniooni: https://github.com/ooni/probe-cli/runs/6636995859?check_suite_focus=true Likewise, it's not possible to run tests with CGO_ENABLED=0. To make tests work with `CGO_ENABLED=0`, I needed to sacrifice some unit tests run for the CGO case. It is not fully clear to me what was happening here, but basically `getaddrinfo_cgo_test.go` was compiled with CGO being disabled, even though the ``//go:build cgo` flag was specified. Additionally, @hellais previously raised a valid point in the review of #698: > Another issue we should consider is that, if I understand how > this works correctly, depending on whether or not we have built > with CGO_ENABLED=0 on or not, we are going to be measuring > things in a different way (using our cgo inspired getaddrinfo > implementation or using netgo). This might present issues when > analyzing or interpreting the data. > > Do we perhaps want to add some field to the output data format that > gives us an indication of which DNS resolution code was used to > generate the the metric? This comment is relevant to the current commit because #698 is the previous iteration of #764. So, while fixing the build and test issues, let us also distinguish between the CGO_ENABLED=1 and CGO_ENABLED=0 cases. Before this commit, OONI used "system" to indicate the case where we were using net.DefaultResolver. This behavior dates back to the Measurement Kit days. While it is true that ooni/probe-engine and ooni/probe-cli could have been using netgo in the past when we said "system" as the resolver, it also seems reasonable to continue to use "system" top indicate getaddrinfo. So, the choice here is basically to use "netgo" from now on to indicate the cases in which we were built with CGO_ENABLED=0. This change will need to be documented into ooni/spec along with the introduction of the `android_dns_cache_no_data` error. ## Checklist - [x] I have read the [contribution guidelines](https://github.com/ooni/probe-cli/blob/master/CONTRIBUTING.md) - [x] reference issue for this pull request: ooni/probe#2029 - [x] if you changed anything related how experiments work and you need to reflect these changes in the ooni/spec repository, please link to the related ooni/spec pull request: ooni/spec#242
After ooni/probe-cli#765, there's one (hopefully last) residual issue: if I cross compile, say, |
In ooni/probe#2029 (comment), we explained why calling it "netgo" would be incorrect. In other words, we can get the platform's `getaddrinfo` as long as we're not cross compiling. We do cross compile `ooniprobe`, actually it's not even possible to cross compile it. For increased accuracy, we should stop cross compiling `miniooni` as well, so it would also directly use `getaddrinfo`. This diff fixes at the same time ooni/probe-cli and ooni/spec and we'll open two pull requests in parallel.
In ooni/probe#2029 (comment), we explained why calling it "netgo" would be incorrect. In other words, we can get the platform's `getaddrinfo` as long as we're not cross compiling. We do cross compile `ooniprobe`, actually it's not even possible to cross compile it. For increased accuracy, we should stop cross compiling `miniooni` as well, so it would also directly use `getaddrinfo`. This diff fixes at the same time ooni/probe-cli and ooni/spec and we'll open two pull requests in parallel.
In ooni/probe#2029 (comment), we explained why calling it "netgo" would be incorrect. In other words, we can get the platform's `getaddrinfo` as long as we're not cross compiling. We do cross compile `ooniprobe`, actually it's not even possible to cross compile it. For increased accuracy, we should stop cross compiling `miniooni` as well, so it would also directly use `getaddrinfo`. This diff fixes at the same time ooni/probe-cli and ooni/spec and we'll open two pull requests in parallel.
I created #2120 to track the issue that we cannot consider the result of |
Most work required by this issue has been done. It remains to merge ooni/probe-cli#698. |
Ah, wait, another piece of work is to implement the |
(This issue looked like medium priority when we triaged it, but it's actually high priority now.) |
As of 3.17.0-alpha.1, we correctly detect this issue for Android by using the |
This diff backports ad952ac. See ooni/probe#2029 (comment)
Part of ooni/probe#2029. The general idea is to modify v0.4 in a subsequent PR to make it WAI for this test case.
Part of ooni/probe#2029. The general idea is to modify v0.4 in a subsequent PR to make it WAI for this test case. Currently v0.4 does not correctly support this case, which is why we're doing this work. (See also ooni/probe#2499).
This diff changes the handling of android_dns_cache_no_data to mark it and any other DNS inconsistenct as DNS anomaly. The previous implementation was not handling any DNS inconsistency as an anomaly, therefore I needed to adjust a bunch of summary_test.go tests. Part of ooni/probe#2499 and ooni/probe#2029. I bumped the version number because this is a significant change in the analysis algorithm implemented by the probe.
## Checklist - [x] I have read the [contribution guidelines](https://github.com/ooni/probe-cli/blob/master/CONTRIBUTING.md) - [x] reference issue for this pull request: ooni/probe#2499 - [x] if you changed anything related to how experiments work and you need to reflect these changes in the ooni/spec repository, please link to the related ooni/spec pull request: N/A - [x] if you changed code inside an experiment, make sure you bump its version number ## Summary This diff changes the `summary.go` algorithm of Web Connectivity v0.4 to handle `android_dns_cache_no_data` as an anomaly when the DNS is inconsistent. We continue handling `dns_nxdomain_error` as an anomaly when the DNS is inconsistent, as demonstrated by the fact that the corresponding netem test is still passing. This diff also bumps the version number to v0.4.3. Version v0.4.2 did not handle this case, which caused measurements to be marked as failed as documented by ooni/probe#2499. This diff is also related to ooni/probe#2029, in the sense that it is slightly improving our analysis results when the is an NXDOMAIN error (even if it's masked by Android's DNS cache behavior). While there, add empty lines to improve the code readability.
…#1210) Part of ooni/probe#2029. The general idea is to modify v0.4 in a subsequent PR to make it WAI for this test case. Currently v0.4 does not correctly support this case, which is why we're doing this work. (See also ooni/probe#2499).
…i#1211) ## Checklist - [x] I have read the [contribution guidelines](https://github.com/ooni/probe-cli/blob/master/CONTRIBUTING.md) - [x] reference issue for this pull request: ooni/probe#2499 - [x] if you changed anything related to how experiments work and you need to reflect these changes in the ooni/spec repository, please link to the related ooni/spec pull request: N/A - [x] if you changed code inside an experiment, make sure you bump its version number ## Summary This diff changes the `summary.go` algorithm of Web Connectivity v0.4 to handle `android_dns_cache_no_data` as an anomaly when the DNS is inconsistent. We continue handling `dns_nxdomain_error` as an anomaly when the DNS is inconsistent, as demonstrated by the fact that the corresponding netem test is still passing. This diff also bumps the version number to v0.4.3. Version v0.4.2 did not handle this case, which caused measurements to be marked as failed as documented by ooni/probe#2499. This diff is also related to ooni/probe#2029, in the sense that it is slightly improving our analysis results when the is an NXDOMAIN error (even if it's masked by Android's DNS cache behavior). While there, add empty lines to improve the code readability.
This issue is linked from the spec and it's not something which is directly actionable atm. Closing. |
We noticed that certain NXDOMAIN errors are not properly handled by the engine.
Here is a sample measurement that results in an unknown_failure: https://explorer.ooni.org/measurement/20220215T025431Z_webconnectivity_HK_9231_n1_oU1CBKrQbHFoou5h?input=http%3A%2F%2Fwww.hongkongwatch.org
The text was updated successfully, but these errors were encountered: