-
Notifications
You must be signed in to change notification settings - Fork 91
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add terraform.ProviderScheduler #178
Conversation
…erRunners Signed-off-by: Alper Rifat Ulucinar <[email protected]>
9426eae
to
7b5ae1c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ulucinar I left a few comments to be sure that I understand the some parts of PR.
// ApplyAsync makes a terraform apply call without blocking and calls the given | ||
// function once that apply call finishes. | ||
func (w *Workspace) ApplyAsync(callback CallbackFn) error { | ||
if !w.LastOperation.MarkStart("apply") { | ||
return errors.Errorf("%s operation that started at %s is still running", w.LastOperation.Type, w.LastOperation.StartTime().String()) | ||
} | ||
ctx, cancel := context.WithDeadline(context.TODO(), w.LastOperation.StartTime().Add(defaultAsyncTimeout)) | ||
w.providerInUse.Increment() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I see, for async ops, we call the providerInUse.Increment()
function in the ...Async func body. And for sync ops, we call the Increment
function in the runTF. I want to ask to be sure that I do not miss anything. What is the reason of this difference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the async execution cases, we increment the in-use counter via the reconciler goroutine to make sure that the reservations are actually performed by the reconciler goroutine. The Workspace.runTF
might be executed by the reconciler goroutine (either if the execution mode for an MR is sync or if we always execute a Terraform CLI always in sync mode, e.g., terraform apply -refresh-only
invocations), or it might be executed by a worker goroutine asynchronously. In these async cases, we would like to make sure that the goroutine who drives the scheduling decisions (i.e., the goroutine which drives the scheduler implementation) is actually the goroutine responsible for making the reservations (incrementing the in-use count makes a reservation on the shared plugin process, the scheduler is not allowed to replace it as long as it's actively in use).
@@ -57,6 +55,48 @@ type ProviderRequirement struct { | |||
// ProviderConfiguration holds the setup configuration body | |||
type ProviderConfiguration map[string]any | |||
|
|||
// ToProviderHandle converts a provider configuration to a handle | |||
// for the provider scheduler. | |||
func (pc ProviderConfiguration) ToProviderHandle() (ProviderHandle, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are using the ProviderConfig hash, for deciding whether we need a new tf-provider process to, right? By this way, we will resolve the resource leak issues that we observed before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, correct. ProviderHandle
is basically a hash of the Terraform provider configuration block. Clients pass a ProviderHandle
to the scheduler while making reservation requests to identify their requests.
@ulucinar We have a ttl value as 100. And also the ttlBudget is 0.1. As I understand, this means that, the native provider plugin process will be restarted after every 110 usage. For clusters with a low number of MRs, this means that the process will survive for a long time. Considering the memory leak problem on the Terraform side, how would it be to set a timeout for the relevant process? We can cover this in other iterations as well. |
nit: What about using |
Thanks for the suggestions. The assumption in the current implementation is that the native plugin process leaks memory as it responds to client requests and memory leakage (if any) will be minimal if it's idle, i.e., not replying client requests. So we currently base the replacement decisions on the ttl. In further iterations, we can discuss setting timeouts, making replacement decisions based on the memory consumption and probably on some other criteria. We will also need to define a cap on the number of native plugin processes the shared scheduler forks. For instance, what happens if there are a thousand AWS accounts that are actively used in a cluster? The implementation will attempt to fork a process without any limits. We will be addressing similar issues in the next iterations. |
Some test results that are done via this PR: crossplane-contrib/provider-upjet-aws#576 (comment) |
Done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ulucinar LGTM! This is an important milestone in the context of addressing the performance issues.
Signed-off-by: Alper Rifat Ulucinar <[email protected]>
Signed-off-by: Alper Rifat Ulucinar <[email protected]>
Signed-off-by: Alper Rifat Ulucinar <[email protected]>
Testing that with 500MR and settings like (--max-reconcile-rate=5)
Also max rate seems not be affecting process, I see lots of terraform apply running in parallel:
|
Description of your changes
Related to: crossplane-contrib/provider-upjet-aws#325
This PR adds
terraform.ProviderScheduler
interface and three implementations of it (terraform.NoOpProviderScheduler
, a noop implementation,terraform.SharedProviderScheduler
that shares Terraform provider processes among multiple reconciliation loops with a configured TTL, and aterraform.WorkspaceProviderScheduler
that shares a Terraform provider process between the CLI invocations done in the context of a single reconciliation loop). Providers may opt to reenable the shared gRPC server runtime based on these schedulers, which properly isolate the forked Terraform providers to prevent some external resource leakage issues we had observed in the past with the shared server runtime. We have also performed a set of external resource leakage tests with this runtime, which will be discussed below.The max ttl configuration for the
terraform.SharedProviderScheduler
puts a limit on the lifetime of a forked native plugin process, the TTL of a forked plugin process is incremented each time the Terraform CLI is to be invoked against it. So basically the TTL of a plugin process holds the number of times the process has been used to handle requests from Terraform client. After the a plugin process expires, the scheduler attempts to replace it. The scheduler also keeps track of whether a plugin process is actively in use or not and replacements are not allowed if the plugin process is in use. If the scheduler finds that an expired plugin process is in use, it allows new reuse requests for a grace period (measured as a percentage of the max ttl). After this grace period if the plugin process has not been replaced yet, then any new reuse attempts will be denied.The
terraform.WorkspaceProviderScheduler
has a higher isolation level and shares a plugin process between the multiple Terraform CLI invocations and the multiple gRPC requests made by those CLI invocations only during a certain managed resource lifecycle event such as an observe, create, update or delete. It's meant to be used with certain provider configurations where using the shared scheduler will cause race conditions in the native provider.I have:
make reviewable
to ensure this PR is ready for review.backport release-x.y
labels to auto-backport this PR if necessary.How has this code been tested
We are still evaluating the performance characteristics of this runtime. We are also collecting feedback from the community members who gave the provider packages consuming this runtime a try. Relevant issues are here:
We have also performed two long-running tests (one 2d 16h long, the other 4d 16h long) with a modified https://github.com/upbound/platform-ref-aws Crossplane configuration package (modified to depend on a
upbound/provider-aws
package that consumes this runtime). No leaks were observed. We also could not observe external resource leaks during an experiment with this runtime with 210cognitoidp.UserPool
resources spanning over 7 AWS regions, with 30 MRs per region. In this experiment, we also configured the poll interval to 1m to increase the likelihood of a race condition. This experiment lasted for 19h. Through another set of experiments involving 2 AWS regions conducted at the Terraform layer (not involving a Crossplane provider or upjet or the schedulers being discussed here) to stress test the isolation principle we've implemented. After 12h, no external resource leaks were observed.