[BUG] Tracing index is not re-created in opensearch. Dataprepper needs restart? #4951

Open
AdaptiveStep opened this issue Sep 17, 2024 · 3 comments
Labels: bug, help wanted

@AdaptiveStep
AdaptiveStep commented Sep 17, 2024

Describe the bug
When events are sent to opensearch, the index is usually created if it doesn't exist. This happens for all data except when dataprepper receives traces. When dataprepper starts up, it creates the necessary tracing indexes for spans and service maps once, but never again unless it is restarted.

If the index is removed while dataprepper is running, an error saying "index is missing" shows up extremely often, possibly filling up the buffer and eventually causing dropped events.

To Reproduce
Send traces to dataprepper as usual. You will then see the trace index on the index management page in the OpenSearch Dashboards GUI.

However, if you delete the index, it never gets recreated again! Even if new traces are being sent to dataprepper! Only re-starting dataprepper seems to "recreate" the index again. This can probably be easily fixed so that indexes are recreated if they don't exist in opensearch.

Expected behavior
1: The span index needs to be re-created if it doesn't exist when new events arrive at dataprepper and are sent to opensearch.

2: The service map index needs to be re-created if it doesn't exist.
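
The expected behavior could be sketched roughly as follows. This is only an illustration, not Data Prepper's actual sink code: `IndexClient`, `write_with_recreate`, `IndexMissingError`, and the index name `example-index` are all hypothetical, and the in-memory `FakeClient` merely stands in for a real OpenSearch client so that the snippet runs on its own.

```python
from typing import Protocol


class IndexMissingError(Exception):
    """Raised by the (hypothetical) client when the write target does not exist."""


class IndexClient(Protocol):
    def exists(self, name: str) -> bool: ...
    def create(self, name: str) -> None: ...
    def write(self, name: str, doc: dict) -> None: ...


def write_with_recreate(client: IndexClient, name: str, doc: dict) -> None:
    """Write a document; if the target is missing, recreate it once and retry."""
    try:
        client.write(name, doc)
    except IndexMissingError:
        if not client.exists(name):
            client.create(name)
        client.write(name, doc)


class FakeClient:
    """In-memory stand-in for a cluster, for illustration only."""

    def __init__(self) -> None:
        self.indices: dict[str, list[dict]] = {}

    def exists(self, name: str) -> bool:
        return name in self.indices

    def create(self, name: str) -> None:
        self.indices.setdefault(name, [])

    def write(self, name: str, doc: dict) -> None:
        if name not in self.indices:
            raise IndexMissingError(name)
        self.indices[name].append(doc)


client = FakeClient()
client.create("example-index")       # created once at startup
client.indices.pop("example-index")  # simulate deleting the index at runtime
write_with_recreate(client, "example-index", {"serviceName": "demo"})
```

Retrying only once keeps a persistent failure from looping forever; a real implementation would also need to decide how to handle the alias-backed span index, which this sketch ignores.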

Environment (please complete the following information):
I tried this in dataprepper on kubernetes using the dataprepper helmchart.

Additional context
I tried this using the otel demo apps. It seems pretty consistent with all their traces. If the index is removed, it never gets re-created again unless dataprepper is restarted. Neither the "service-map index" nor the "span index" get recreated.

@KarstenSchnitter
Collaborator

This might be related to #3342 and maybe #3506. The index setup used for spans is a little complicated. It usually uses a write alias that points to a concrete span index.

@AdaptiveStep can you elaborate on your setup? Do you use the default index configuration or do you provide a custom config? When you delete the current index, do you keep the write alias, if you have one? Can you provide the Data Prepper error log that contains the "index missing" message?

@dlvenable added the help wanted label and removed the untriaged label on Sep 17, 2024

@AdaptiveStep
Author

About Alias:
I didn't touch the alias. Only removed index.

Log message:
I think it said that the index is missing. I'll reproduce the bug again later when I have time and paste the exact log message here.

My config:

  • KIND cluster (0.23.0)
  • Opensearch started with the operator. (v2.16) (latest)
  • Dataprepper started with the helmchart. (simple deployment, 1 replica). (helmchart v. 0.1.0) (latest)
  • Dataprepper configured according to documentation. (otel_metrics_source + otel_traces_source + otel_logs_source). Basic and vanilla OTEL config.

gRPC is sent from the "OpenTelemetry collector pod" -> to -> "the dataprepper pod".

Just normal basic otel stuff. Basically everything is default, latest version as we speak.
And Everything works. (Except that one thing.)

Everything works and if you go into the Opensearch GUI you will see the "otel-v1-apm-span-000001" index. Delete this index and it will never be recreated again. Only by restarting dataprepper will it be recreated.

The service map index seems buggy too if the "otel-v1-apm-span-000001" index gets removed. If both are removed, neither of them comes back. This might explain why the rollover for that other person didn't work.

If you remove the metrics index they get recreated.
If you remove the logs index, it gets recreated.

My investigation so far:
How come it can re-create the logs and metrics indexes but not the span index? It makes no sense. (Jaeger and Prometheus successfully received the same spans, so the traces are good!) Also, I completely failed at sending trace data directly from the "OtelCollector" -> to -> "Opensearch", which was strange too. Maybe the errors are at the opensearch level? It cannot be the OtelCollector, because it's cooperating well with other apps! Has anyone managed to use the OpenTelemetry collector with opensearch directly? A final OTEL irritation is that sometimes the service_maps are sent via metrics (this is an industry standard within the Grafana stack using Tempo). This service_map problem is, however, a separate issue.
The "tracing-index-recreation-failure" is a serious risk for severe long-term data loss if someone accidentally removes this single index. It might even be a severe security issue if XDR and other agents rely on opensearch data! Alerts and anomaly detections that depend on this index will then not be triggered unless dataprepper is restarted!! An attacker then only has to remove this index to disable the entire security pipeline and hope nobody restarts dataprepper. I have not tested the dataprepper OTEL features with higher dataprepper replica counts.

Summary:
Steps to reproduce bug:
Just send traces to opensearch and try removing the span index via the gui. Indexes never get re-created.

@JannikBrand
Contributor

JannikBrand commented Oct 14, 2024

The difference between OTEL logs/metrics and traces comes from the index setup as mentioned by @KarstenSchnitter.

  • Logs/metrics: Data Prepper ingests into an index. If the index is not there the index will simply be created (due to the behavior of the OpenSearch bulk API).
  • Trace spans: Initially, Data Prepper creates the otel-v1-apm-span-000001 index and maps it to the index alias otel-v1-apm-span. When data is ingested, Data Prepper writes to the index alias, which points to the underlying index. If all otel-v1-apm-span-* indices (or maybe just the current write index) get deleted, the alias can no longer be resolved to an index when data is ingested.
    I assume that custom logic would be needed to recreate the current write index, since its name depends on which otel-v1-apm-span-* indices already exist in the cluster.
  • Trace service map: Since this is only a single index (without alias), it would probably be easy to achieve recreation during runtime.
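
To make the "custom logic" point above a bit more concrete, here is a minimal sketch of how the name of a recreated write index might be derived from the indices that still exist. It is not taken from Data Prepper: `next_write_index` is a hypothetical helper, and it assumes the `<alias>-NNNNNN` naming with a six-digit suffix seen in this issue. Whether a recreated index should continue the numbering or start over is a design decision this sketch does not settle.

```python
import re


def next_write_index(existing: list[str], alias: str = "otel-v1-apm-span") -> str:
    """Return the name of the next numbered index behind `alias`.

    Only names of the form `<alias>-NNNNNN` are considered; anything else
    in `existing` is ignored.
    """
    pattern = re.compile(rf"^{re.escape(alias)}-(\d{{6}})$")
    numbers = [int(m.group(1)) for name in existing if (m := pattern.match(name))]
    # With no matching index left, numbering starts again at 000001.
    next_number = max(numbers, default=0) + 1
    return f"{alias}-{next_number:06d}"


after_partial_delete = next_write_index(
    ["otel-v1-apm-span-000001", "otel-v1-apm-span-000003", "unrelated-index"]
)
after_full_delete = next_write_index([])
```

The new index would still have to be created and attached to the alias as its write index; that part depends on the cluster setup and is left out here.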

The question is why you are deleting the current write index (otel-v1-apm-span-XXXXXX)?
If you want to delete the data in this index, perform a manual index rollover first; after that you can safely delete the original index.
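
The rollover-then-delete order could look roughly like this. It is a sketch under assumptions: `rollover_then_delete` and the injected `send` callable are hypothetical, the fake transport below only imitates a successful response, and the paths follow the OpenSearch REST API (`POST /<alias>/_rollover`, `DELETE /<index>`). Authentication, error handling, and any ISM policy on the alias are ignored.

```python
from typing import Callable

# send(method, path) -> parsed JSON response; supplied by the caller.
Send = Callable[[str, str], dict]


def rollover_then_delete(send: Send, alias: str = "otel-v1-apm-span") -> str:
    """Roll the alias over to a new write index, then delete the old index.

    Returns the name of the deleted index. Raises if the rollover did not
    happen, so the current write index is never deleted.
    """
    result = send("POST", f"/{alias}/_rollover")
    if not result.get("rolled_over"):
        raise RuntimeError(f"rollover of {alias!r} did not take place: {result}")
    old_index = result["old_index"]
    send("DELETE", f"/{old_index}")
    return old_index


# Fake transport that records calls and answers like a successful rollover.
calls: list[tuple[str, str]] = []


def fake_send(method: str, path: str) -> dict:
    calls.append((method, path))
    if path.endswith("/_rollover"):
        return {
            "rolled_over": True,
            "old_index": "otel-v1-apm-span-000001",
            "new_index": "otel-v1-apm-span-000002",
        }
    return {"acknowledged": True}


deleted = rollover_then_delete(fake_send)
```

Because the alias already points at the new index before the delete is issued, writes through the alias keep resolving throughout.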

There is ongoing work to move towards the index alias/rollover approach for logs/metrics as well with #3929.
