Add terraform.ProviderScheduler #178

Merged
merged 4 commits into from
Mar 27, 2023

Collaborator

@ulucinar ulucinar commented Mar 22, 2023

Description of your changes

Related to: crossplane-contrib/provider-upjet-aws#325

This PR adds the terraform.ProviderScheduler interface and three implementations of it:

  • terraform.NoOpProviderScheduler, a no-op implementation.
  • terraform.SharedProviderScheduler, which shares Terraform provider processes among multiple reconciliation loops with a configured TTL.
  • terraform.WorkspaceProviderScheduler, which shares a Terraform provider process between the CLI invocations done in the context of a single reconciliation loop.

Providers may opt to re-enable the shared gRPC server runtime based on these schedulers, which properly isolate the forked Terraform providers to prevent some of the external resource leakage issues we had observed in the past with the shared server runtime. We have also performed a set of external resource leakage tests with this runtime, which are discussed below.

The max ttl configuration for the terraform.SharedProviderScheduler puts a limit on the lifetime of a forked native plugin process. The TTL of a forked plugin process is incremented each time the Terraform CLI is about to be invoked against it, so the TTL of a plugin process effectively counts the number of times the process has been used to handle requests from a Terraform client. After a plugin process expires, the scheduler attempts to replace it. The scheduler also keeps track of whether a plugin process is actively in use, and a replacement is not allowed while the process is in use. If the scheduler finds that an expired plugin process is in use, it allows new reuse requests for a grace period (measured as a percentage of the max ttl). If the plugin process still has not been replaced after this grace period, any new reuse attempts are denied.
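
To make the expiry and grace-period rules above concrete, here is a minimal sketch of the decision logic. It is not the code in this PR: `pluginProcess`, `schedule`, `release`, `maxTTL` and `ttlMargin` are hypothetical names, replacing a process is reduced to resetting its counters, and rounding the grace period to a whole number of uses is an assumption.

```go
package main

import (
	"errors"
	"math"
)

// errBudgetExceeded is a hypothetical error for denied reuse requests.
var errBudgetExceeded = errors.New("native provider reuse budget has been exceeded")

// pluginProcess is a hypothetical record for one forked plugin process.
type pluginProcess struct {
	invocationCount int // the "TTL" counter: bumped on every scheduled use
	inUse           int // number of outstanding reservations
}

// schedule decides whether p may serve one more Terraform CLI invocation.
// It reports replaced=true when an expired, idle process was swapped out.
func schedule(p *pluginProcess, maxTTL int, ttlMargin float64) (replaced bool, err error) {
	// Total uses tolerated while an expired process is still busy (assumed rounding).
	limit := maxTTL + int(math.Round(float64(maxTTL)*ttlMargin))
	switch {
	case p.invocationCount < maxTTL:
		// Not expired yet: plain reuse.
	case p.inUse == 0:
		// Expired and idle: stands in for killing and re-forking the process.
		*p = pluginProcess{}
		replaced = true
	case p.invocationCount < limit:
		// Expired but busy: reuse is still tolerated during the grace period.
	default:
		// Grace period used up and the process is still busy: deny.
		return false, errBudgetExceeded
	}
	p.invocationCount++
	p.inUse++
	return replaced, nil
}

// release gives back one reservation once a CLI invocation has finished.
func release(p *pluginProcess) {
	if p.inUse > 0 {
		p.inUse--
	}
}
```

Under these assumptions, a process that is never released accepts reuse requests up to the max ttl plus the grace period and then starts denying them, while an expired process that has become idle is replaced on the next request.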

The terraform.WorkspaceProviderScheduler provides a higher isolation level: it shares a plugin process among the Terraform CLI invocations, and the gRPC requests made by those invocations, only for the duration of a single managed resource lifecycle event such as an observe, create, update or delete. It's meant to be used with certain provider configurations where using the shared scheduler would cause race conditions in the native provider.

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable to ensure this PR is ready for review.
  • Added backport release-x.y labels to auto-backport this PR if necessary.

How has this code been tested

We are still evaluating the performance characteristics of this runtime. We are also collecting feedback from the community members who gave the provider packages consuming this runtime a try. Relevant issues are here:

We have also performed two long-running tests (one 2d 16h long, the other 4d 16h long) with a modified https://github.com/upbound/platform-ref-aws Crossplane configuration package (modified to depend on an upbound/provider-aws package that consumes this runtime). No leaks were observed. We also could not observe any external resource leaks during a 19h experiment with this runtime involving 210 cognitoidp.UserPool resources spanning 7 AWS regions, with 30 MRs per region. In this experiment, we also configured the poll interval to 1m to increase the likelihood of a race condition. We also ran another set of experiments involving 2 AWS regions at the Terraform layer (not involving a Crossplane provider, upjet, or the schedulers being discussed here) to stress-test the isolation principle we've implemented. After 12h, no external resource leaks were observed.

Member

@sergenyalcin sergenyalcin left a comment

Thanks @ulucinar, I left a few comments to make sure that I understand some parts of the PR.

// ApplyAsync makes a terraform apply call without blocking and calls the given
// function once that apply call finishes.
func (w *Workspace) ApplyAsync(callback CallbackFn) error {
	if !w.LastOperation.MarkStart("apply") {
		return errors.Errorf("%s operation that started at %s is still running", w.LastOperation.Type, w.LastOperation.StartTime().String())
	}
	ctx, cancel := context.WithDeadline(context.TODO(), w.LastOperation.StartTime().Add(defaultAsyncTimeout))
	w.providerInUse.Increment()
Member

As far as I can see, for async ops we call providerInUse.Increment() in the ...Async func body, and for sync ops we call Increment in runTF. I want to make sure that I am not missing anything: what is the reason for this difference?

Collaborator Author

In the async execution cases, we increment the in-use counter in the reconciler goroutine to make sure that the reservations are actually made by the reconciler goroutine. Workspace.runTF might be executed by the reconciler goroutine (either if the execution mode for an MR is sync, or if we always execute a given Terraform CLI command in sync mode, e.g., terraform apply -refresh-only invocations), or it might be executed asynchronously by a worker goroutine. In these async cases, we would like to make sure that the goroutine that drives the scheduling decisions (i.e., the goroutine that drives the scheduler implementation) is also the goroutine responsible for making the reservations. Incrementing the in-use count makes a reservation on the shared plugin process: the scheduler is not allowed to replace the process as long as it's actively in use.
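
As a rough illustration of that ordering, the sketch below makes the reservation on the calling goroutine before the worker goroutine is started, and releases it from the worker once the work is done. `inUseCounter` and `applyAsync` are hypothetical stand-ins, not the types in this PR, and the Terraform CLI invocation is reduced to a `work` function.

```go
package main

import "sync"

// inUseCounter is a hypothetical reservation counter for a shared plugin process.
type inUseCounter struct {
	mu sync.Mutex
	n  int
}

// Increment makes one reservation.
func (c *inUseCounter) Increment() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n++
}

// Decrement releases one reservation.
func (c *inUseCounter) Decrement() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.n > 0 {
		c.n--
	}
}

// InUse reports whether any reservation is outstanding.
func (c *inUseCounter) InUse() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n > 0
}

// applyAsync reserves the process on the caller's goroutine (the reconciler
// in the discussion above) and only then hands the work to a worker goroutine.
func applyAsync(c *inUseCounter, work func(), callback func()) {
	c.Increment() // reservation is visible before applyAsync returns
	go func() {
		defer callback()    // runs last: notify the caller
		defer c.Decrement() // runs first: release the reservation
		work()
	}()
}
```

Because the increment happens before the goroutine is spawned, a scheduler consulted right after `applyAsync` returns already sees the process as in use, regardless of when the worker actually gets scheduled.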

@@ -57,6 +55,48 @@ type ProviderRequirement struct {
// ProviderConfiguration holds the setup configuration body
type ProviderConfiguration map[string]any

// ToProviderHandle converts a provider configuration to a handle
// for the provider scheduler.
func (pc ProviderConfiguration) ToProviderHandle() (ProviderHandle, error) {
Member

I think we are using the ProviderConfig hash to decide whether we need a new tf-provider process, right? This way, we will resolve the resource leak issues that we observed before.

Collaborator Author

Yes, correct. ProviderHandle is basically a hash of the Terraform provider configuration block. Clients pass a ProviderHandle to the scheduler while making reservation requests to identify their requests.
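
For illustration only, one way to derive such a handle is to hash a canonical serialization of the configuration. The sketch below is an assumption about the approach, not the ToProviderHandle implementation in this PR: `toHandle` is a hypothetical helper, and the choice of JSON plus SHA-256 is mine.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// toHandle is a hypothetical helper that turns a provider configuration
// into a stable, hex-encoded handle.
func toHandle(cfg map[string]any) (string, error) {
	// encoding/json writes map keys in sorted order, so two equal
	// configurations serialize to the same bytes.
	b, err := json.Marshal(cfg)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}
```

Two reconciliation loops that use identical provider configurations would then end up with the same handle and could share a plugin process, while any difference in the configuration yields a different handle and therefore a separate process.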

Member

sergenyalcin commented Mar 27, 2023

@ulucinar We have a ttl value of 100, and the ttlBudget is 0.1. As I understand it, this means that the native provider plugin process will be restarted after every 110 uses. For clusters with a low number of MRs, this means that the process will survive for a long time. Considering the memory leak problem on the Terraform side, how about setting a timeout for the relevant process? We can cover this in other iterations as well.

Member

sergenyalcin commented Mar 27, 2023

nit: What about using ttlMargin instead of ttlBudget to avoid ambiguity? We may also consider changing this in the next iteration.

@ulucinar
Collaborator Author

> @ulucinar We have a ttl value of 100, and the ttlBudget is 0.1. As I understand it, this means that the native provider plugin process will be restarted after every 110 uses. For clusters with a low number of MRs, this means that the process will survive for a long time. Considering the memory leak problem on the Terraform side, how about setting a timeout for the relevant process? We can cover this in other iterations as well.

Thanks for the suggestions. The assumption in the current implementation is that the native plugin process leaks memory as it responds to client requests, and that memory leakage (if any) will be minimal while it's idle, i.e., not replying to client requests. So we currently base the replacement decisions on the ttl.

In further iterations, we can discuss setting timeouts, making replacement decisions based on the memory consumption and probably on some other criteria. We will also need to define a cap on the number of native plugin processes the shared scheduler forks. For instance, what happens if there are a thousand AWS accounts that are actively used in a cluster? The implementation will attempt to fork a process without any limits. We will be addressing similar issues in the next iterations.

@ulucinar
Collaborator Author

> nit: What about using ttlMargin instead of ttlBudget to avoid ambiguity? We may also consider changing this in the next iteration.

Done.

Member

@sergenyalcin sergenyalcin left a comment

Thanks @ulucinar LGTM! This is an important milestone in the context of addressing the performance issues.

@przysiadZeSztanga

Testing this with 500 MRs and settings like --max-reconcile-rate=5, CPU usage is significantly lower than on provider versions 0.31 and 0.30, but I see tons of errors like:

observe failed: cannot schedule a native provider during observe: xxxxxxxxxxxxxxx: cannot schedule native Terraform provider process: native provider reuse budget has been exceeded: invocationCount: 113, ttl: 100'

Also, the max reconcile rate does not seem to affect the process; I see lots of terraform apply processes running in parallel:

18948 1 2000 S 817m 5% 7 0% terraform apply -auto-approve -input=false -lock=false -json
19247 1 2000 S 817m 5% 6 0% terraform apply -auto-approve -input=false -lock=false -json
18718 1 2000 S 817m 5% 1 0% terraform apply -auto-approve -input=false -lock=false -json
18757 1 2000 S 817m 5% 5 0% terraform apply -auto-approve -input=false -lock=false -json
19409 1 2000 S 817m 5% 2 0% terraform apply -auto-approve -input=false -lock=false -json
19001 1 2000 S 817m 5% 4 0% terraform apply -auto-approve -input=false -lock=false -json
19076 1 2000 S 884m 6% 2 0% terraform apply -auto-approve -input=false -lock=false -json
18887 1 2000 S 818m 5% 7 0% terraform apply -auto-approve -input=false -lock=false -json
18926 1 2000 S 817m 5% 0 0% terraform apply -auto-approve -input=false -lock=false -json
18797 1 2000 S 817m 5% 2 0% terraform apply -auto-approve -input=false -lock=false -json
19300 1 2000 S 817m 5% 6 0% terraform apply -auto-approve -input=false -lock=false -json
19655 1 2000 S 817m 5% 4 0% terraform apply -auto-approve -input=false -lock=false -json
19105 1 2000 S 817m 5% 1 0% terraform apply -auto-approve -input=false -lock=false -json
19699 1 2000 S 817m 5% 4 0% terraform apply -auto-approve -input=false -lock=false -json
18978 1 2000 S 817m 5% 2 0% terraform apply -auto-approve -input=false -lock=false -json
19210 1 2000 S 817m 5% 6 0% terraform apply -auto-approve -input=false -lock=false -json
19592 1 2000 S 817m 5% 6 0% terraform apply -auto-approve -input=false -lock=false -json
19170 1 2000 S 817m 5% 1 0% terraform apply -auto-approve -input=false -lock=false -json
19552 1 2000 S 817m 5% 1 0% terraform apply -auto-approve -input=false -lock=false -json
19759 1 2000 S 817m 5% 1 0% terraform apply -auto-approve -input=false -lock=false -json
19726 1 2000 S 817m 5% 4 0% terraform apply -auto-approve -input=false -lock=false -json
18836 1 2000 S 817m 5% 1 0% terraform apply -auto-approve -input=false -lock=false -json
19683 1 2000 S 817m 5% 0 0% terraform apply -auto-approve -input=false -lock=false -json
19044 1 2000 S 817m 5% 4 0% terraform apply -auto-approve -input=false -lock=false -json
18689 1 2000 S 817m 5% 6 0% terraform apply -auto-approve -input=false -lock=false -json
19103 1 2000 S 817m 5% 5 0% terraform apply -auto-approve -input=false -lock=false -json
18628 1 2000 S 817m 5% 2 0% terraform apply -auto-approve -input=false -lock=false -json


przysiadZeSztanga commented Mar 30, 2023

Some CPU and memory observations from the same test scenario:
0->500 MR (route53 records) in 1 claim
node dedicated for aws provider:
c6i.2xlarge with 8 cores and 16 GB memory

provider version: 0.32.0-rc.1

Limit/request settings:
Limits: cpu: 6 memory: 12Gi
Requests: cpu: 6 memory: 12Gi
provider configuration params:

 extraArgs:
  - --max-reconcile-rate=5
  - --poll=30min

0.31.0 provider:
image
0.32.0-rc.1 provider
image

  • MR creation takes much longer
  • the fix brought some improvement, but it looks more like a fuse that keeps the provider from being killed during large bulk creations/updates
  • CPU usage is still high (full) for terraform operations on ready/synced/created MRs
