Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Defer PreferencesService to lazy/later and/or worker thread #149

Open
breedx-splk opened this issue Nov 16, 2023 · 6 comments
Open

Defer PreferencesService to lazy/later and/or worker thread #149

breedx-splk opened this issue Nov 16, 2023 · 6 comments

Comments

@breedx-splk
Copy link
Contributor

As part of the first disk buffering PR #145, it was raised by @marandaneto that the PreferencesService is loading shared preferences from the context, which does perform IO and will result in a warning when StrictMode is used. This suggests that this work should either be deferred until needed or moved to a worker thread.

See:
#145 (comment)

@LikeTheSalad
Copy link
Contributor

I think the solution should be to run all the initialization in a background thread and provide callbacks for the user to know when OTel is ready to be used.

The reason I mention this is because PreferencesService isn't the only blocking work needed during initialization. The calls for checking available cache disk space should also be done out of the main thread, and it'll be needed to create signals exporters for initializing the OpenTelemetry instance. Plus (probably won't be needed over here) though at least in the Elastic agent, we make a request to an NTP server during initialization to get the real time for signals' timestamps, since mobile devices' clock isn't too reliable for accuracy. There might be other solutions that wouldn't involve an NTP server though, so that's why I'm not sure if it'll be added here, however, my point is that initializing OTel in the main thread seems like it could cause troubles in the long run if we need to add features that take some time to set up.

@marandaneto
Copy link
Member

I think the solution should be to run all the initialization in a background thread and provide callbacks for the user to know when OTel is ready to be used.

In theory, it's ok but in practice, this has downsides, for example:

  • Some features to be installed require to be in the main thread, which is fine, we can always post an execution to the main thread.
  • For tracing instrumentation, the SDK must be init. as soon as possible, and we can only guarantee that if the SDK is init in the main thread, executing all the code sync.
  • For crash reporting, it's also important that the SDK is init. as soon as possible, for example, to capture errors during app start.

Maybe an option is to init. the "core" parts of the SDK in the main thread and defer everything else to a worker.

Another option is just to defer the blocking bits to a worker (preferred IMO), and await the worker whenever needed even if that blocks for a tiny bit, an example of that is the storage init., it's gonna be init. in a worker, but if a trace or error has to be cached in storage and the worker is still ongoing, it's ok to await until it's finished.

For the NTP bits, I don't know any other solution than an NTP sync, there are libs that do that async instead, that doesn't need to be a blocking call, the very first few events before the NTP sync is finished might be off, but that's fine if all the following events are accurate.

Another option is doing a clock drift on the server side, for example, event A has 12:00 as a time when calling the API to send the event, you also send the current timestamp (request time), let's say, 12:10, the server compares with its current timestamp and do a clock drift, if the server is 12:15 when it receives the request, and the request was done at 12:10 device time, the server knows that the device is 5 minutes late, so the event that was at 12:00 is 12:05.
This solution would force vendors to implement clock drift, so probably not ideal.

@LikeTheSalad
Copy link
Contributor

  • For crash reporting, it's also important that the SDK is init. as soon as possible, for example, to capture errors during app start.

I agree, although it seems like having a lazy init for when disk caching is enabled could also mean losing errors during app start, since they won't be cached in disk before the lazy disk set-up work is done. I guess we could have some exporter wrapper that's init-aware and sends the data straight to the server if the disk set-up is not done yet... Though I'm wondering also what would be a nice amount of "opinionated" features we should add by default to this lib before someone complains about unexpected behaviors.

Maybe an option is to init. the "core" parts of the SDK in the main thread and defer everything else to a worker.

I think this is the ideal, balanced choice 👍 although the signal exporters, which are the ones in charge of caching data in disk, are unfortunately part of that "core", so it wouldn't be that easy, there's probably something we could do such as what I mentioned about having a wrapper exporter that's init-aware and knows where to delegate the data depending on the init state, though that's only an assumption of mine, I'll need to check how feasible it is in code.

Another option is just to defer the blocking bits to a worker (preferred IMO), and await the worker whenever needed even if that blocks for a tiny bit

If I understood correctly, this part suggests blocking the main thread for a tiny bit, not long enough to cause any harm to the user experience, while ensuring that no data gets lost even if some signals happen during the app init. This is what we're currently doing in the Elastic agent to avoid losing signals, also because, in my experience, the SharedPref's init part (done in PreferencesService's creation) as well as reading from it hasn't caused any user issues in practice (unless the amount of data in the prefs file is huge, which is not the case here and I don't see it happening in the future either), it's only the writing part that could block the thread in a harmful way, which is not currently done as part of OTel's initialization, so I think this is an interesting trade-off to consider.

For the NTP bits, I don't know any other solution than an NTP sync, there are libs that do that async instead, that doesn't need to be a blocking call, the very first few events before the NTP sync is finished might be off, but that's fine if all the following events are accurate.

I agree, and I'm glad to hear this as this is the way we're currently doing so in the Elastic agent so it's a nice feedback 👍 The downside is, as you mentioned, that the first couple of signals will be off, but I think it's still worth it, besides, we're making TrueTime to use SharedPrefs as its cache system to avoid unnecessary extra network calls.

To summarize, I think all options will have pros and cons, I don't see a "perfect" one, but generally speaking, I like the idea of having OTel ready in the main thread to make sure no data gets lost during the app's initialization, and in order to achieve that, I think we should be able to do certain disk operations (reading ones, not writing) in the main thread without causing issues to the end user, and leave all the writing and network related operations to be initialized in a worker one. I think that'd be the most straightforward approach at least based on my own experience with disk-reading operations, though I'm open to any opinions on why reading from sharedprefs could cause important issues to the app's initialization if done in the main thread for this use case, and if it's really an issue, then I guess the alternative would be to try and add special behaviors to the agent, such as bypassing the disk buffering exporter if the disk-related setup is not ready before some data gets created.

@bidetofevil
Copy link
Contributor

bidetofevil commented Dec 12, 2023

Another option is to offer async init as an option (but not the default) with the known issues that that may bring. Having the SDK ready and instrumenting after the init method is invoked is powerful and gives us some guarantees about when telemetry is expected to roll in, but not everyone requires that level of guarantee, so it may make sense to offer that as an option for those who don't want disk reads of simply don't want to the app to be blocked on startup.

And FWIW, having certain services' init be deferred or done async is an attractive compromise to improve the init time and do more expensive work off the startup blocking path, but there's a level of complexity and overhead that it will create to make it just a bit annoying to maintain. It's less so if it's instrumentation that collects data passively, as you just have to make clear that the data collection lifecycle for that instrumentation is decoupled from the core SDK. But anything that activates abilities that affect manual telemetry recording by an app will have to come with API that reports back whether that feature is ready, which is cumbersome to use. So unless there a really really good reason, it's best if that's avoided.

@LikeTheSalad
Copy link
Contributor

Thanks @bidetofevil - I agree, it might be useful to provide that as an option just in case some projects would prefer losing some data over making reading disk operations or adding any kind of overhead during app launch 👍

So to summarize, based on what's been discussed in this thread, it seems like this would be the overall conclusion:

  • By default, we should keep the agent's initialization work in the main thread, except for the bits related to making network requests and/or writing disk operations (which there aren't any atm afaik, but I think we should still mention it for clarification purposes), in other words, doing a best effort to keep the initialization in the main thread while optimizing it whenever possible to avoid affecting the UI performance.
  • We might want to provide an async option in the future for doing the whole initialization in a worker thread, which will provide a callback to let users know when the initialization is finished so they can start sending telemetry afterward. We should mention somewhere in the docs that, by using the async init option, there's a risk of losing some app launch related data.

@Ejdangerfield
Copy link

Ejdangerfield commented Oct 16, 2024

Just wanting to bump this up to see if there will be a future release where some of the initialization work will be moved off main. We are seeing a few ANRs on lower end devices in relation to the OpenTelemetryRumBuilder.build() call. The trace is as follows:

android.os.BinderProxy.transact (BinderProxy.java:624)
android.os.storage.IStorageManager$Stub$Proxy.getAllocatableBytes (IStorageManager.java:2819)
android.os.storage.StorageManager.getAllocatableBytes (StorageManager.java:2298)
android.os.storage.StorageManager.getAllocatableBytes (StorageManager.java:2288)
io.opentelemetry.android.internal.services.CacheStorage.getAvailableSpace (CacheStorage.java:66)
io.opentelemetry.android.internal.services.CacheStorage.ensureCacheSpaceAvailable (CacheStorage.java:50)
io.opentelemetry.android.internal.features.persistence.DiskManager.getMaxFolderSize (DiskManager.kt:58)
io.opentelemetry.android.OpenTelemetryRumBuilder.createStorageConfiguration (OpenTelemetryRumBuilder.java:338)
io.opentelemetry.android.OpenTelemetryRumBuilder.build (OpenTelemetryRumBuilder.java:286)
io.opentelemetry.android.OpenTelemetryRumBuilder.build (OpenTelemetryRumBuilder.java:271)

In the documentation for StorageManager.allocateBytes it specifically calls out that it should only be called from a worker thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants