
Speed up make tests #16295

Merged: 4 commits, Aug 9, 2024
Conversation

@twm (Contributor) commented Jul 17, 2024

  • Prevent notdatadog DNS requests from escaping to the wider world while running the tests (see 9da9f2a for context)
  • Disable Postgres fsync for a ~6% speedup (sketched below)

I also tried:

  • Setting METRICS_BACKEND=warehouse.metrics.NullMetrics on the tests container in docker-compose.yml
  • pytest --dist=worksteal, which I've seen make a big difference elsewhere.

Neither had a measurable effect on runtime.
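
For reference, disabling fsync in a Docker-based Postgres is a one-flag change. A minimal sketch — the service name and image tag here are assumptions, not Warehouse's actual docker-compose.yml — and only appropriate because a test database is disposable:

    # docker-compose.yml (hypothetical excerpt)
    services:
      db:
        image: postgres:15
        # fsync=off skips flushing WAL to disk; losing data on a crash
        # is acceptable for a throwaway test database.
        command: postgres -c fsync=off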

@twm twm requested a review from a team as a code owner July 17, 2024 21:01
@woodruffw woodruffw added testing Test infrastructure and individual tests developer experience Anything that improves the experience for Warehouse devs labels Jul 18, 2024
@woodruffw (Member)

LGTM, thank you @twm! I don't get any local performance boost from these on macOS, but I tried on my Linux machine last night and got roughly 6% as well (I didn't seem to hit the notdatadog issue, but maybe that's because of my resolver setup).

@twm (Contributor, Author) commented Jul 25, 2024

Awesome, glad this helps! Should I keep rebasing this (and my other PRs) while they await review/merge?

@miketheman (Member)

  • Prevent notdatadog DNS requests from escaping to the wider world while running the tests (see 9da9f2a for context)

Thanks for providing the context and details. I've been trying to understand how this happens, and can't seem to dig anything up.

Tests that emit metrics are typically configured to use the metrics fixture, either directly or via the pyramid_services fixture, and the only cases I have found are a subset of tests in tests/unit/metrics/test_services.py::TestDataDogMetrics.

Could you retry the test suite without the modification that starts the notdatadog container, but exclude this set of tests, to confirm/refute that the requests come only from those tests?
Should be: make tests TESTARGS="-k 'not TestDataDogMetrics'"

Since we would prefer to keep the number of containers started for tests low, I'd prefer not to autostart another one if possible.

  • Disable Postgres fsync for a ~6% speedup

Would gladly accept this on its own, nice speedup!

@twm (Contributor, Author) commented Aug 2, 2024

  • Prevent notdatadog DNS requests from escaping to the wider world while running the tests (see 9da9f2a for context)

Thanks for providing the context and details. I've been trying to understand how this happens, and can't seem to dig anything up.

Tests that emit metrics are typically configured to use the metrics fixture, either directly or via the pyramid_services fixture, and the only cases I have found are a subset of tests in tests/unit/metrics/test_services.py::TestDataDogMetrics.

Could you retry the test suite without the modification that starts the notdatadog container, but exclude this set of tests, to confirm/refute that the requests come only from those tests? Should be: make tests TESTARGS="-k 'not TestDataDogMetrics'"

Since we would prefer to keep the number of containers started for tests low, I'd prefer not to autostart another one if possible.

I totally appreciate that, and am happy to dig in more. This was just the hack that let me get #16260 into reviewable shape.

I'm currently working to reproduce. It would slightly surprise me if those tests were the source — there are only about a dozen of them — but it's possible. I recall from a past life that the DataDog client is aggressive about doing DNS resolution up front.

  • Disable Postgres fsync for a ~6% speedup

Would gladly accept this on its own, nice speedup!

Awesome, I've extracted it to #16384.

@twm (Contributor, Author) commented Aug 2, 2024

On main I don't see the same slowdown I did two weeks ago, though packet capture indicates there are notdatadog resolution requests escaping.

Given that, I think the extreme slowdown I observed was related to link congestion on my end. I identified some bad hardware on my network that was causing packet loss a few days after filing this. As DNS is pretty sensitive to loss, I think that may explain it.

Would it still be helpful to track down the notdatadog leaks? The perf impact is clearly not as large as I represented above, but fixing them seems like good hygiene nonetheless.

@miketheman (Member)

Glad that the slowdown is no longer apparent.

Would it still be helpful to track down the notdatadog leaks? The perf impact is clearly not as large as I represented above, but fixing them seems like good hygiene nonetheless.

I think so. Calls to notdatadog are explicitly allowed via pytest_socket configuration here:

warehouse/pyproject.toml, lines 90 to 91 in a1f916b:

    "--disable-socket",
    "--allow-hosts=localhost,::1,notdatadog,stripe",

but removing that doesn't cause any failures, so that's probably not it either. You could try removing the allowance from the config and running the tests on Linux; maybe there are some sneaky differences in how container networking bridges work these days (there definitely are).

If you have the time and desire to hunt further, please do!
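
One way to run that hunt: capture DNS traffic while the suite runs and watch for notdatadog lookups. A minimal sketch, assuming a Linux host where tcpdump is available (any packet capture tool works):

    # Print every DNS query/response crossing any interface;
    # escaping "notdatadog" lookups will show up here.
    sudo tcpdump -ni any udp port 53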

@woodruffw (Member)

FWIW, my colleagues @DarkaMaul and @facutuesca were able to reproduce the notdatadog slowdown in their local testing: both had Tailscale installed and enabled. Not sure if that's what affected @twm, but disabling it prevented the requests from escaping and inducing timeouts 🙂

@twm (Contributor, Author) commented Aug 8, 2024

I did spend some time on this a few days ago and found that the resolutions are definitely not solely from TestDataDogMetrics — I still saw them when skipping that suite.

I have two theories to look into:

  1. The call is coming from inside the house: when pytest-socket normalizes the allowed hosts, it does so per-test (see the sketch after this comment).
  2. DataDogMetrics is (somehow) getting created and a metric sent, which would cause DogStatsd to call getaddrinfo(), which isn't blocked by pytest-socket.

The first theory seems strongest to me. I don't understand why notdatadog is in the pytest-socket allow list in the first place — surely it should be allowed iff the container is present? Why allow it if the tests don't need it to pass?
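
A simplified sketch of the per-test normalization in theory 1 — this is a reconstruction of the behavior described in this thread, not pytest-socket's actual code:

    import socket

    def normalize_allowed_hosts(hosts):
        """Resolve each allowed hostname to its IP addresses.

        If this runs once per test, an unresolvable name like
        "notdatadog" costs a fresh (slow, possibly timing-out) DNS
        lookup for every single test in the suite.
        """
        resolved = set()
        for host in hosts:
            try:
                for info in socket.getaddrinfo(host, None):
                    resolved.add(info[4][0])  # sockaddr -> IP string
            except socket.gaierror:
                # Unresolvable hosts are skipped -- but only after the
                # resolver has given up, which is the expensive part.
                pass
        return resolved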

@twm twm force-pushed the faster-tests branch 2 times, most recently from 3e7f14c to aadac72 August 8, 2024 23:20
twm added 3 commits August 8, 2024 16:21
I found the test suite took ~30 minutes to run on my system, with CPU
and I/O curiously idle for the bulk of that period.

It turns out that they were furiously attempting to resolve the hostname
"notdatadog". As I'm working in a VM behind NAT they were getting
rate-limited by dnsmasq, which logged messages like this:

    Jul 16 23:16:24 vmhost dnsmasq[2230]: Maximum number of concurrent DNS queries reached (max: 150)

The quick fix is to run the notdatadog stub container alongside the
tests, which gives a runtime of ~1:30 on my system.
pytest-socket resolves the hostnames in the --allow-hosts setting for
each test, so removing notdatadog saves ~4k DNS resolution attempts.

Note that the tests run with a stripe container present. Resolving the
stripe hostname doesn't result in a DNS resolution because getaddrinfo
resolves the name by reading the /etc/hosts file populated by Docker
(same story for localhost).
@twm (Contributor, Author) commented Aug 9, 2024

Some experimentation with packet capture indicates that yes, the culprit is pytest-socket's hostname normalization. The immediate fix here is to remove notdatadog from the allowed hosts. A better fix would be to add some sort of caching in pytest-socket — perhaps a session-scoped fixture could own a name resolution cache? I'm happy to pitch in there if you like, @miketheman.

This is ready for review.
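
One possible shape for that cache — a minimal sketch, assuming resolution can be funneled through a single helper; resolve_once is a hypothetical name, not pytest-socket API, and not the patch that eventually landed:

    import functools
    import socket

    @functools.lru_cache(maxsize=None)
    def resolve_once(host):
        """Resolve a hostname at most once per process.

        With this in front of the normalization step, ~4k per-test
        lookups collapse into one per unique allowed host.
        """
        try:
            return tuple(info[4][0] for info in socket.getaddrinfo(host, None))
        except socket.gaierror:
            # Cache the failure too, so an unresolvable host costs
            # only one slow lookup instead of one per test.
            return ()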

@DarkaMaul (Contributor) commented Aug 9, 2024

Just a bit late to the party here, but I can confirm from my tests that the culprit is indeed the pytest-socket hostname normalization because it calls getaddrinfo.

The problem stems from getaddrinfo. For instance, with a Tailscale connection enabled:

    python3 -c 'import socket; socket.getaddrinfo("not-existing", None)'  0,03s user 0,01s system 0% cpu 5,071 total

And without:

    python3 -c 'import socket; socket.getaddrinfo("not-existing", None)'  0,03s user 0,01s system 24% cpu 0,156 total

Adding notdatadog to the list of running containers or removing it from the list of allowed hosts both work, because getaddrinfo then either manages to resolve the address or doesn't need to resolve it at all.

@miketheman (Member)

I don't understand why notdatadog is in the pytest-socket allow list in the first place — surely it should be allowed iff the container is present? Why allow it if the tests don't need it to pass?

I put that there back in February, with a note that without it, running tests with debug log level spat out lots of errors. I can't seem to reproduce that now, so it's probably not necessary any more.

A better fix would be to add some sort of caching in pytest-socket — perhaps a session-scoped fixture could own a name resolution cache? I'm happy to pitch in there if you like, @miketheman.

Indeed, that could be great!

Just a bit late to the party here, but I can confirm from my tests that the culprit is indeed the pytest-socket hostname normalization because it calls getaddrinfo.

...

Adding notdatadog to the list of running containers or removing it from the list of allowed hosts works because getaddrinfo either doesn't need to resolve the address or manage to resolve it.

Thanks for boiling down the reproduction issue.

I'm going to go with these changes, thanks for figuring out the complex interaction with Tailscale!

Review comment on the diff:

    -COMMAND_ARGS="$@"
    +COMMAND_ARGS=( "$@" )

nice improvement
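
Why the array form matters — a quick bash demonstration (variable names are illustrative, not from the script under review):

    set -- foo "bar baz"        # simulate two command-line arguments
    joined="$@"                 # scalar assignment flattens to 'foo bar baz'
    args=( "$@" )               # array assignment keeps both elements intact
    printf '%s\n' "${args[@]}"  # prints 'foo', then 'bar baz'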

@miketheman miketheman enabled auto-merge (squash) August 9, 2024 14:19
@miketheman miketheman merged commit fb1b567 into pypi:main Aug 9, 2024
17 checks passed
@twm (Contributor, Author) commented Aug 20, 2024

A better fix would be to add some sort of caching in pytest-socket — perhaps a session-scoped fixture could own a name resolution cache? I'm happy to pitch in there if you like, @miketheman.

Indeed, that could be great!

PR for this here: miketheman/pytest-socket#369
