Significant metric loss when using Vector's log_to_metric vs Datadog for nginx logs #22309

law · 2025-01-27T23:35:59Z

I'm experiencing a significant discrepancy between metrics generated via Vector's log_to_metric transform and Datadog when processing the same nginx logs. The Vector pipeline consistently shows approximately 1/10th of the traffic volume compared to Datadog's measurements.

Setup details:

- Processing custom-formatted nginx logs
- Vector pipeline: nginx logs → log_to_metric transform → Mimir
- Metrics viewed through Grafana
- Same time periods and filtering criteria used in both Grafana and Datadog for comparison

Current observations:

- Mimir distributor logs show no errors
- Vector logs are mostly clean, with occasional parsing failures (a few per hour) for specific cases like "client request-body too large" messages
- All components appear to be functioning, just with unexpectedly low data volume

Configurations:

- Mimir config: https://pastebin.com/ixUGPwNP
- Vector config: https://pastebin.com/2RfCGUK8

Questions:

- What debugging approaches would you recommend to trace where metrics might be getting dropped?
- Are there specific Vector or Mimir metrics I should monitor to identify potential bottlenecks?
- Would logs from any other components be helpful for diagnosis?

Any guidance on troubleshooting techniques or configuration issues would be greatly appreciated.

pront (Maintainer) · 2025-01-28T16:18:25Z

Hi @law, here is a list of tools and techniques I use when debugging complex pipelines:

- vector top to see component metrics (docs here)
  - For example, you can inspect how many events are produced by sources vs. how many events reach the sinks
  - You can also add an internal_metrics source and inspect the metrics
- vector tap to see the inputs and/or outputs of your components (guide here)
  - Caveat: sinks don't have outputs, so you will need to check the downstream systems for actual payloads
  - Alternative: for each source or transform that I want to test, I connect it to a console sink (this doesn't scale to many components, but it is a very quick sanity check)

For your case specifically, I would investigate whether events are being dropped or getting stuck in a retry loop. For example, is the component_discarded_events_total metric greater than 0 for any component?
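The internal_metrics source and console sink mentioned above could be wired up roughly as follows. This is a minimal sketch rather than a tested config: the component names vector_internal, debug_internal_metrics, and debug_generated_metrics are hypothetical, and generate_nginx_metrics is assumed to be the log_to_metric transform from the config shared in this thread.

```yaml
# Sketch only: merge these entries into the existing Vector config.
sources:
  vector_internal:               # hypothetical name
    type: internal_metrics       # exposes Vector's own metrics, e.g. component_discarded_events_total

sinks:
  debug_internal_metrics:        # hypothetical name
    type: console
    inputs:
      - vector_internal
    encoding:
      codec: json                # one JSON object per metric on stdout

  debug_generated_metrics:       # hypothetical name
    type: console
    inputs:
      - generate_nginx_metrics   # assumed name of the log_to_metric transform
    encoding:
      codec: json
```

The second console sink prints every generated metric, so it is probably only practical as a short-lived check on a single pod.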

1 reply

law Feb 4, 2025
Author

I enabled internal_metrics, but the only component I see adding to component_discarded_events_total is the filter_monolith_nginx transform, which is kind of expected. That step is there to drop all the non-traffic-relevant nginx logs ("upstream disconnect", "invalid header", and so on) so that the actual 'generate_metrics_from_logs' step can succeed.

jszwedko (Maintainer) · 2025-01-28T17:09:13Z

One potential thing to check here, since you mentioned it was off by a factor of 10, is whether the metric type and interval are being set correctly in Datadog. We've seen issues in the past where the interval or type was wrong, causing things to appear to be off by a factor of 10. Also, what version of Vector are you running?

6 replies

jszwedko Feb 4, 2025
Maintainer

🤔 can you check the metric type and interval of the one that Vector is publishing?

law Feb 4, 2025
Author

In Grafana, or in Vector?

Vector is sending the metrics as:

    generate_nginx_metrics:
      type: log_to_metric
      inputs:
        - parse_nginx
      metrics:
        - type: counter
          name: http_requests_total
          namespace: nginx
          field: .nginx.status
          timestamp: .timestamp
          tags:
            path_root: "{{ \"{{\" }} nginx.path_root {{ \"}}\" }}"
            status: "{{ \"{{\" }} nginx.status {{ \"}}\" }}"
            method: "{{ \"{{\" }} nginx.method {{ \"}}\" }}"
            app: "{{ \"{{\" }} app {{ \"}}\" }}"
            eks_clustername: "{{ \"{{\" }} eks_clustername {{ \"}}\" }}"
            env: "{{ \"{{\" }} env {{ \"}}\" }}"
            pod_name: "{{ \"{{\" }} kubernetes_pod_name {{ \"}}\" }}"
        - type: histogram
          name: http_response_time_seconds
          namespace: nginx
          field: .nginx.response_time
          tags:
            path_root: "{{ \"{{\" }} nginx.path_root {{ \"}}\" }}"
            status: "{{ \"{{\" }} nginx.status {{ \"}}\" }}"
            method: "{{ \"{{\" }} nginx.method {{ \"}}\" }}"
            app: "{{ \"{{\" }} app {{ \"}}\" }}"
            eks_clustername: "{{ \"{{\" }} eks_clustername {{ \"}}\" }}"
            env: "{{ \"{{\" }} env {{ \"}}\" }}"
            pod_name: "{{ \"{{\" }} kubernetes_pod_name {{ \"}}\" }}"

Grafana is graphing it with:

    sum by($group_by) (rate(nginx_http_requests_total{env="$env", eks_clustername="$eks_clustername"}[$__rate_interval]))

with additional options set as follows:

- Legend: auto
- Min Step: 1
- Format: time series
- Type: Range

jszwedko Feb 4, 2025
Maintainer

Ah, sorry, I meant in Datadog since my hypothesis is that it is something about the metric type or interval in Datadog that is causing you to see a 10x difference. Similar issue from a while ago: #7806

Also have you tried sending the metric to Prometheus/Grafana instead? I'm curious if it shows up correctly there to help narrow down the issue.

law Feb 4, 2025
Author

Vector isn't publishing to Datadog, it's only sending to Mimir. We've got the Datadog agent in these clusters also shipping logs (to the DD endpoint), and we're processing/'enriching'/graphing that data in the Datadog interface via a 'Datadog Log Pipeline'. Would it be helpful to share the Datadog Pipeline that's acting on the raw monolith-nginx logs?

I'm not quite sure what you're asking with:

> Also have you tried sending the metric to Prometheus/Grafana instead?

We've got Vector sending to Mimir, and Grafana is querying Mimir. We don't have Prometheus running in these clusters.

jszwedko Feb 4, 2025
Maintainer

Ah, I see, I was thinking you were sending the metric to Datadog too, but it sounds like you are sending to Mimir (and just comparing to the log count in Datadog). My hypothesis about the Datadog metric type isn't relevant then (though it is still oddly similar in that the observed value is 1/10).

One other shot in the dark: is it possible that logs have the same timestamp? I'm not sure how Mimir works, but some storage systems require unique points (metric name, metric tags, timestamp). If Mimir is one of those, then points may be getting overwritten. To validate that, you could try adding an aggregate transform after log_to_metric (example).

Unfortunately nothing else is immediately jumping out looking at your configuration 🤔
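A minimal sketch of the aggregate suggestion, assuming the log_to_metric transform is named generate_nginx_metrics as in the config posted above. The name aggregate_nginx_metrics is hypothetical, and the interval is only an illustrative value:

```yaml
# Sketch only: merges metrics that share a name and tag set within each
# flush interval, so fewer points with colliding timestamps reach the backend.
transforms:
  aggregate_nginx_metrics:       # hypothetical name
    type: aggregate
    inputs:
      - generate_nginx_metrics   # assumed name of the log_to_metric transform
    interval_ms: 10000           # flush once every 10 seconds (illustrative)

# The metrics sink's `inputs` would then need to list aggregate_nginx_metrics
# in place of generate_nginx_metrics.
```

If the totals in Grafana change noticeably with this in place, that would point toward overwritten points rather than dropped events.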

law (Author) · 2025-02-04T05:46:33Z

Thank you so much for getting back to me, and my apologies for the long reply-time - I didn't get an email for some reason, and just checked the thread today on a whim. I've run vector top on an instance where there's an nginx node, and came up with the following output:

    ID                    Output  Kind     Type             Events In       Bytes In           Events Out       Bytes Out         Errors  Memory Used
    filter_monolith       --      transfo  filter           3.55 M (110/s)  N/A                1.27 M (73/s)    N/A               --      --
    generate_nginx        --      transfo  log_to_metric    1.27 M (73/s)   N/A                2.54 M (146/s)   N/A               --      --
    kubernetes            --      source   kubernetes_logs  3.55 M (110/s)  1.73 GiB (57.45 K  3.55 M (109/s)   N/A               24      --
    kubernetes_transform  --      transfo  remap            3.55 M (110/s)  N/A                3.55 M (109/s)   N/A               --      --
    loki                  --      sink     loki             1.27 M (73/s)   N/A                1.27 M (--/s)    95.87 MiB (--/s)  --      --
    parse_nginx           --      transfo  remap            1.27 M (73/s)   N/A                1.27 M (73/s)    N/A               --      --
    prometheus            --      sink     prometheus_remo  2.54 M (146/s)  N/A                134.48 k (--/s)  24.00 MiB (--/s)  1       --
    s3_alb_logs           --      source   aws_s3           N/A             N/A                N/A              5.40 MiB (--/s)   --      --

This is on a pod where Vector (and nginx) have been running for about 4 hours. What stands out to me is the 'prometheus' sink: it is receiving 2.54M events (146/s) but outputting only 134.48k events. There appears to be significant event reduction here, but I'm not sure why. There are no errors in the Vector logs. Do I need to increase the log level to get a better understanding of what's going on here?
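For reference, the ratios implied by the rounded figures in that table work out roughly as follows. This is a back-of-the-envelope sketch; it assumes the sink's "Events In" and "Events Out" columns count comparable units, which has not been confirmed in this thread.

```latex
\begin{align*}
\frac{\text{log\_to\_metric events out}}{\text{log\_to\_metric events in}}
  &= \frac{2.54\,\mathrm{M}}{1.27\,\mathrm{M}} = 2 \\[4pt]
\frac{\text{sink events out}}{\text{sink events in}}
  &= \frac{134.48\,\mathrm{k}}{2.54\,\mathrm{M}} \approx 0.053 \approx \frac{1}{19}
\end{align*}
```

The factor of 2 is consistent with the two metrics (one counter, one histogram) defined in the log_to_metric config, so each parsed log line yields two metric events before the sink.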

2 replies

pront Feb 6, 2025
Maintainer

> There appears to be significant event reduction here

Hm, I am not familiar with the implementation details, but it's quite likely that the sink is reducing and batching events.

law Feb 7, 2025
Author

How might I look into that a bit more? If the sink is doing the reduction/batching, how is that making its way back to vector and showing up in the 'top' output?

law (Author) · 2025-02-04T23:32:22Z

More grist for the mill:

I captured all the vector tap output from node "ip-192-168-240-222.us-west-2.compute.internal" for 5 minutes, from 10:32pm to 10:37pm local time, and dumped it to a text file.

Using grep, I pulled all the lines matching the string 'monolith-nginx' out of that file and put them in their own text file.

wc -l nginx-output.txt shows that file has 36,997 log lines in it. Further analysis shows that the only pod name in that 5-minute snippet is 'monolith-nginx-57f8c68bc8-d4jr6'.

I went over to the Datadog log explorer, set my time range to 10:32pm-10:37pm, and told it to only find logs for the pod name 'monolith-nginx-57f8c68bc8-d4jr6'.

Datadog log count? 14,181. I don't perzactly know what that means, but I'm even MORE stumped now.

0 replies
