Integration tests are too slow!!! help please! #3382

Open
6 of 17 tasks
tlimoncelli opened this issue Jan 16, 2025 · 13 comments

Comments

@tlimoncelli
Contributor

Hey folks!

I have a big "ask" for y'all, especially the maintainers of Cloudflare, Mythic Beasts, TransIP, Azure, CNR, and HEDNS. I don't need this solved today, but please consider taking time to look into this in the next month or so.

I'm asking everyone to please please please investigate the source of slowness and try to improve the run time.

Slow integration tests == slow development. A lot of my dev time is spent waiting for the last 3-4 providers to complete their tests. If all the tests ran in under 5 minutes my productivity would go way up! I'd have more time for important improvements.

The time it takes to run all the integration tests is slowly creeping up. There are now 5 providers that take more than 10 minutes to complete the tests.

Here's the current ranking (slowest to fastest): (source)

  • CLOUDFLAREAPI 15m 21s
  • MYTHICBEASTS 12m 2s
  • TRANSIP 10m 36s
  • AZURE_DNS 10m 12s
  • CNR 10m 10s
  • HEDNS 9m 57s
  • GCLOUD 9m 50s
  • GANDI_V5 7m 13s
  • NAMEDOTCOM 5m 24s
  • DIGITALOCEAN 5m 9s
  • NS1 5m 2s
  • SAKURACLOUD 4m 36s
  • HEXONET 4m 28s
  • ROUTE53 4m 28s
  • AXFRDDNS_DNSSEC 2m 58s
  • AXFRDDNS 2m 2s
  • BIND 2m 1s

I've already checked off the ones that run in less than 5 minutes. Obviously the 3-4 slowest should get the most attention.

Here are some things that have worked in the past:

  1. Open a support ticket with the provider. Sometimes they can increase the rate limit for a single account. I find it best to open a ticket saying that our integration tests involving their API are slow and ask them to investigate whether we are hitting any rate limits, or to suggest ways to improve the situation. (That is, explain it in terms of a problem, "our integration tests are slow", and let them suggest potential solutions. If you start with "please increase our rate limit", it turns into a battle over "why?".)
  2. Disable the slowest integration tests. For example, pager101, pager601, and pager1201 don't need to run all the time.
  3. Enable batch updates. Some APIs provide a way to update many records at once, or to upload an entire zone instead of making individual updates (see the sketch below).
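
To make item 3 concrete, here is a minimal, self-contained Go sketch of the batching idea. The zoneAPI type and its ReplaceZone method are invented for illustration (they are not DNSControl or any provider's real API); the point is simply that uploading the whole desired zone in one request replaces N per-record round trips with one.

    package main

    import "fmt"

    // Record is a simplified DNS record used only for this illustration.
    type Record struct {
        Name, Type, Value string
    }

    // zoneAPI stands in for a provider client. ReplaceZone represents a
    // hypothetical bulk endpoint that uploads the entire zone in one request.
    type zoneAPI struct{ calls int }

    func (a *zoneAPI) ReplaceZone(zone string, recs []Record) error {
        a.calls++ // one round trip, regardless of how many records changed
        return nil
    }

    func main() {
        api := &zoneAPI{}
        desired := []Record{
            {Name: "www", Type: "A", Value: "192.0.2.1"},
            {Name: "mail", Type: "MX", Value: "10 mx.example.com."},
        }
        // Push the whole desired state at once instead of one call per record.
        _ = api.ReplaceZone("example.com", desired)
        fmt.Println("API calls made:", api.calls) // prints 1, not len(desired)
    }

Providers whose APIs only offer per-record endpoints obviously can't do this, but where a bulk or whole-zone call exists it is usually the single biggest win.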

Some changes that would help all providers:

  • Any GHA experts out there? I need a hero to volunteer to look at the current setup and make general suggestions.
    • Are we caching the right things? Are there caches that are doing more setup work than benefit?
    • Maybe providers that are very slow can be split to do half the tests on one domain and the other half on another. Anyone want to take a stab at that?
    • Is it possible to run the tests on a faster VM?
  • Dashboard gurus? I'd like a dashboard that shows the run times, perhaps highlighting any major slow-downs.
  • Any Go test experts? Any suggestions on how to improve the tests? Maybe we should reconfigure the integration tests to have a "fast version" that skips some of the more unusual tests, which we could run periodically (see the sketch after this list).
  • Any puzzle experts? Here's a puzzle: which integration tests are equivalent and therefore can be removed as duplicates? They start here: link
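
On the "fast version" idea from the list above: one low-tech way to do it would be an opt-in environment variable that the test harness checks before running the rarely-failing groups. The variable name (INTEGRATION_FAST) and the helper below are made up for illustration; nothing like them exists in the repo today.

    package integration

    import (
        "os"
        "testing"
    )

    // skipInFastMode is a hypothetical helper: when the (made-up)
    // INTEGRATION_FAST environment variable is set, skip test groups
    // flagged as "unusual" so a quick run still covers the common path.
    func skipInFastMode(t *testing.T, unusual bool) {
        t.Helper()
        if unusual && os.Getenv("INTEGRATION_FAST") != "" {
            t.Skip("skipping unusual test group in fast mode")
        }
    }

    func TestPager1201(t *testing.T) {
        skipInFastMode(t, true) // pager tests are slow and rarely regress
        // ... the normal pager1201 test body would run here ...
    }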

Any help you can provide will be greatly appreciated! Thanks in advance!

Best,
Tom

CC: @tresni @tomfitzhenry @blackshadev @vatsalyagoel @matthewmgamble @KaiSchwarz-cnic @rblenkinsopp @riyadhalnur @TomOnTime @tlimoncelli @hnrgrgr @tresni

@tlimoncelli
Contributor Author

GCLOUD: #3383 and #3384 shave about 2 minutes off the run time (from 9 minutes to 7 minutes).

@blackshadev
Contributor

I will look at SAKURACLOUD. I know it also uses zone-level diffing, so I expect TransIP to perform roughly similarly. The provider doesn't do much anymore, so it should be easy enough to pull some performance out of it. There is always the chance that the TransIP API is simply as slow as it can be, but that should be easy enough for me to investigate.

@KaiSchwarz-cnic
Contributor

KaiSchwarz-cnic commented Jan 17, 2025

@tlimoncelli I am working on a first improvement which could yield a ~30% improvement for CNR. I added another idea to our backlog, but that will take a while.

@rblenkinsopp
Contributor

I will look at disabling the pager tests for HEDNS, which should shave some time off. I'm not sure there is much else I'll be able to do, though, as it's not an official API, and when I tried to contact them about adding one in the past, I received no response.

@tlimoncelli
Contributor Author

> @tlimoncelli I am working on a first improvement which could yield a ~30% improvement for CNR. I added another idea to our backlog, but that will take a while.

Sounds great! Thank you!

@KaiSchwarz-cnic
Contributor

CNR: the first step is covered by this PR: #3391

@blackshadev
Contributor

I tried to draw some inspiration from SAKURACLOUD, but we do roughly the same thing, so I am unable to make it faster based on that. I do not do much in my code, so I checked how much time I spend waiting for TransIP's API:

My integration test suite ran in: 519 seconds
Of which I was waiting on TransIP for: 518 seconds

At this point I do not think I can make fewer API calls. I already "get the whole zone at once" and "replace the whole zone at once".

I can send an email to them, but maybe @cafferata has better internal contacts? Otherwise I will need to scour LinkedIn to find someone or email their general support address.
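
(For anyone who wants to reproduce that kind of split for their own provider: the total-vs-API-wait numbers above can be produced by wrapping every API call in a timer and comparing the accumulated total against the wall-clock time of the run. The snippet below is an illustrative, self-contained Go sketch, not the actual TransIP provider code.)

    package main

    import (
        "fmt"
        "time"
    )

    // apiTime accumulates how long the test run spends blocked on the
    // provider's API, to compare against the total wall-clock time.
    var apiTime time.Duration

    // timedCall wraps a single API call and records its duration.
    func timedCall(call func() error) error {
        start := time.Now()
        err := call()
        apiTime += time.Since(start)
        return err
    }

    func main() {
        runStart := time.Now()

        // Stand-in for the real test suite: every API call goes through timedCall.
        for i := 0; i < 3; i++ {
            _ = timedCall(func() error {
                time.Sleep(50 * time.Millisecond) // simulate a slow remote API
                return nil
            })
        }

        fmt.Printf("total: %v, waiting on API: %v\n", time.Since(runStart), apiTime)
    }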

@jauderho
Contributor

@tlimoncelli If you are looking for more details around telemetry, you could look into using Honeycomb to track and trace GHA buildevents: https://github.com/honeycombio/gha-buildevents

cc: @lizthegrey

@KaiSchwarz-cnic
Contributor

@tlimoncelli we checked further possibilities for CentralNic Reseller (CNR) and rolled out some improvements around HTTP communication in our software dependency. We tested this update in our DNSControl fork with no success, while tests outside of DNSControl looked crazy fast.

100 commands requested sequentially were 4 times faster. I haven't checked how the integration tests work. Maybe you're recreating the provider instance per test case? If so, that might be a reason. Maybe you can point me to the relevant code / share your insights.
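
(One plausible explanation for a 4x gap between standalone tests and the integration suite is connection handling: if a new HTTP client or transport is created per request or per test, every call pays a fresh TCP/TLS handshake. The Go sketch below shows the "construct one client, reuse it everywhere" pattern with invented names; it is not the real CNR client code.)

    package main

    import (
        "net/http"
        "time"
    )

    // cnrClient is an illustrative stand-in for a provider client, not the
    // real CNR code. The key point: the http.Client (and its Transport) is
    // created once and reused, so TCP/TLS connections stay alive across the
    // hundreds of requests an integration run makes.
    type cnrClient struct {
        httpClient *http.Client
    }

    func newCNRClient() *cnrClient {
        return &cnrClient{
            httpClient: &http.Client{
                Timeout: 30 * time.Second,
                Transport: &http.Transport{
                    MaxIdleConnsPerHost: 10,
                    IdleConnTimeout:     90 * time.Second,
                },
            },
        }
    }

    // do sends a request through the shared client. Creating a new
    // http.Client (or Transport) per request defeats keep-alive and forces
    // a new handshake every time.
    func (c *cnrClient) do(req *http.Request) (*http.Response, error) {
        return c.httpClient.Do(req)
    }

    func main() {
        c := newCNRClient()
        req, _ := http.NewRequest("GET", "https://example.com/", nil)
        if resp, err := c.do(req); err == nil {
            resp.Body.Close()
        }
    }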

@tlimoncelli
Contributor Author

> @tlimoncelli we checked further possibilities for CentralNic Reseller (CNR) and rolled out some improvements around HTTP communication in our software dependency. We tested this update in our DNSControl fork with no success, while tests outside of DNSControl looked crazy fast.
>
> 100 commands requested sequentially were 4 times faster. I haven't checked how the integration tests work. Maybe you're recreating the provider instance per test case? If so, that might be a reason. Maybe you can point me to the relevant code / share your insights.

Interesting data point!

Here's a bunch of random thoughts and observations:

My first hunch was that the cnr provider is re-authenticating for each call. I don't think that's the problem, since with CNR_DEBUGMODE=2 I see exactly one HTTP POST per test.

My next hunch was that cnr is being reinitialized for each test (what you suggested above). However, I reviewed the code and integration_test.go only calls getProvider() once; I verified this by adding a Printf to getProvider(). (getProvider() initializes the provider's client.)

My next hunch is that GitHub Actions is simply rate-limiting us, or is running on very slow VMs. However, ROUTE53 runs in about 2 minutes with the same number of API calls. Could something else be doing the rate limiting?

When I run CNR's integration tests from my desktop (with all VPNs and other things disabled), it runs in 4m43s, which is much faster than GHA's 6m30s. However it isn't 4 times faster, as your test observed.

Are the 100 commands you sent alternating DNSZoneRRList and ModifyDNSZone? I notice that DNSZoneRRList executes very fast, while there is a full 1-second pause after each ModifyDNSZone. (This is from manually observing CNR_DEBUGMODE=2 output.)

I really appreciate all the effort you are putting into this. My hope is that we fix a performance issue that improves performance for all your customers (I'm ever the optimist!). That said, I'm happy that we're down to 6.5 minutes. Maybe we've done enough for now.

Tom

@jauderho
Contributor

jauderho commented Jan 22, 2025

@tlimoncelli

I did some more digging and found an action from Catchpoint: https://github.com/catchpoint/workflow-telemetry-action

I did some quick testing on an action and it looks like it might be what you are looking for. Based on limited testing, it does not look like the results are publicly available, though.

The nice part is that it is pretty easy to add. All you need is:

      - name: Collect Workflow Telemetry
        uses: catchpoint/workflow-telemetry-action@v2

@tlimoncelli
Contributor Author

> @tlimoncelli
>
> I did some more digging and found an action from Catchpoint: https://github.com/catchpoint/workflow-telemetry-action
>
> I did some quick testing on an action and it looks like it might be what you are looking for. Based on limited testing, it does not look like the results are publicly available, though.
>
> The nice part is that it is pretty easy to add. All you need is:
>
>       - name: Collect Workflow Telemetry
>         uses: catchpoint/workflow-telemetry-action@v2

Wow! That is super easy!

I gave it a try in #3401

I tried it with and without the actions/cache and didn't see much of a difference. (You'll see many graphs posted as comments; the last one disables actions/cache and actions/upload-artifact.) TBH, I'm not sure exactly what is being cached.

@KaiSchwarz-cnic
Contributor

> @tlimoncelli we checked further possibilities for CentralNic Reseller (CNR) and rolled out some improvements around HTTP communication in our software dependency. We tested this update in our DNSControl fork with no success, while tests outside of DNSControl looked crazy fast.
> 100 commands requested sequentially were 4 times faster. I haven't checked how the integration tests work. Maybe you're recreating the provider instance per test case? If so, that might be a reason. Maybe you can point me to the relevant code / share your insights.
>
> Interesting data point!
>
> Here's a bunch of random thoughts and observations:
>
> My first hunch was that the cnr provider is re-authenticating for each call. I don't think that's the problem, since with CNR_DEBUGMODE=2 I see exactly one HTTP POST per test.
>
> My next hunch was that cnr is being reinitialized for each test (what you suggested above). However, I reviewed the code and integration_test.go only calls getProvider() once; I verified this by adding a Printf to getProvider(). (getProvider() initializes the provider's client.)
>
> My next hunch is that GitHub Actions is simply rate-limiting us, or is running on very slow VMs. However, ROUTE53 runs in about 2 minutes with the same number of API calls. Could something else be doing the rate limiting?
>
> When I run CNR's integration tests from my desktop (with all VPNs and other things disabled), it runs in 4m43s, which is much faster than GHA's 6m30s. However it isn't 4 times faster, as your test observed.
>
> Are the 100 commands you sent alternating DNSZoneRRList and ModifyDNSZone? I notice that DNSZoneRRList executes very fast, while there is a full 1-second pause after each ModifyDNSZone. (This is from manually observing CNR_DEBUGMODE=2 output.)
>
> I really appreciate all the effort you are putting into this. My hope is that we fix a performance issue that improves performance for all your customers (I'm ever the optimist!). That said, I'm happy that we're down to 6.5 minutes. Maybe we've done enough for now.
>
> Tom

Thanks so much, we'll dig deeper whenever we have some spare time.
