Queue full #69

Open
razumv opened this issue Dec 10, 2019 · 6 comments

@razumv

razumv commented Dec 10, 2019

Hello, I have a problem with the queue in Prometheus + graphite-remote-adapter:
level=warn ts=2019-12-10T08:31:54.018127762Z caller=queue_manager.go:230 component=remote queue="0:http://***/write?graphite.default-prefix=kube_poly_ " msg="Remote storage queue full, discarding sample. Multiple subsequent messages of this kind may be suppressed."

Prometheus & adapter config is default
only 10% of metrics from 70 computers reach
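
For reference, these are the Prometheus remote_write queue knobs that decide when samples start getting discarded; the values below are purely illustrative (not the defaults of any particular Prometheus version), and the URL is a placeholder for the adapter endpoint:

remote_write:
  - url: "http://graphite-remote-adapter:9201/write?graphite.default-prefix=kube_poly_"
    queue_config:
      capacity: 10000              # samples buffered per shard; when full, new samples are dropped
      min_shards: 1
      max_shards: 200              # upper bound for the automatic resharding
      max_samples_per_send: 500    # batch size of each request to the adapter
      batch_send_deadline: 5s      # flush a partial batch after this long
      min_backoff: 30ms
      max_backoff: 100ms

Raising capacity and max_samples_per_send only buys headroom; if the adapter or Graphite cannot keep up with the overall rate, the queue will eventually fill again.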

@InformatiQ
Contributor

When graphite-remote-adapter is unable to send metrics to Graphite for any reason, its internal queue fills up and it stops accepting new metrics. Prometheus then starts dropping samples so that its own queue does not fill up.
You might want to add more graphite-remote-adapter instances to support the load.
What we do is run many graphite-remote-adapters behind an LB, which makes it easy to scale as needed (see the sketch below).
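
A minimal sketch of that layout with docker-compose, assuming HAProxy as the LB (the service names, the adapter image tag, and haproxy.cfg are all hypothetical; 9201 is the adapter port seen in the logs later in this thread):

services:
  gra-1:
    image: graphite-remote-adapter:latest   # hypothetical image name; adapter flags/config omitted
  gra-2:
    image: graphite-remote-adapter:latest
  gra-lb:
    image: haproxy:2.0
    ports:
      - "9201:9201"                          # Prometheus remote_write points at this address
    volumes:
      - ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro   # hypothetical cfg round-robining :9201 to gra-1 and gra-2

Prometheus then needs only one remote_write URL (http://gra-lb:9201/write?...), and more gra-N services can be added behind the LB as the load grows.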

@razumv
Author

razumv commented Dec 12, 2019

For example, right now:
remote_adapter_sent_batch_duration_seconds_sum{} 127918.57609338008
remote_adapter_sent_batch_duration_seconds_count{} 152966
remote_adapter_sent_samples_total{} 14458568
Does that mean I only got 152k out of ~14.5M data points?
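
Note: if remote_adapter_sent_batch_duration_seconds_count counts batch sends rather than individual samples, which is what the metric name suggests, the two numbers are measuring different things:

14458568 samples / 152966 batches ≈ 94.5 samples per batch
127918.58 s of send time / 152966 batches ≈ 0.84 s per batch

so roughly 14.5M samples were sent in ~153k batches, at about 0.84 s per batch.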

I currently use one remote address for all my adapters (5 of them).

OK, this is my config:
#Prometheus:
remote_write:

My Prometheus log:
level=info ts=2019-12-12T06:53:29.923422835Z caller=queue_manager.go:340 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Remote storage resharding" from=1 to=17
level=info ts=2019-12-12T06:53:39.923304302Z caller=queue_manager.go:340 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Remote storage resharding" from=17 to=35
level=info ts=2019-12-12T06:53:49.923441611Z caller=queue_manager.go:343 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Currently resharding, skipping."
level=info ts=2019-12-12T06:53:59.923354973Z caller=queue_manager.go:340 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Remote storage resharding" from=35 to=98
level=info ts=2019-12-12T06:54:19.923450376Z caller=queue_manager.go:340 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Remote storage resharding" from=98 to=153
level=info ts=2019-12-12T06:54:29.92329724Z caller=queue_manager.go:343 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Currently resharding, skipping."
level=info ts=2019-12-12T06:54:39.923358133Z caller=queue_manager.go:340 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Remote storage resharding" from=153 to=274
level=info ts=2019-12-12T06:57:29.923270195Z caller=queue_manager.go:340 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Remote storage resharding" from=274 to=165
level=info ts=2019-12-12T07:00:19.923636921Z caller=queue_manager.go:340 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Remote storage resharding" from=165 to=106
level=info ts=2019-12-12T07:00:39.92329415Z caller=queue_manager.go:340 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Remote storage resharding" from=106 to=152
level=info ts=2019-12-12T07:03:19.923396969Z caller=queue_manager.go:340 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Remote storage resharding" from=152 to=98

This appears periodically in the Prometheus log:
level=warn ts=2019-12-12T08:17:47.125876374Z caller=queue_manager.go:531 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Error sending samples to remote storage" count=100 err="context deadline exceeded"
level=warn ts=2019-12-12T08:17:47.3388834Z caller=queue_manager.go:531 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Error sending samples to remote storage" count=100 err="context deadline exceeded"
level=warn ts=2019-12-12T08:17:48.171610956Z caller=queue_manager.go:531 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Error sending samples to remote storage" count=100 err="context deadline exceeded"
level=warn ts=2019-12-12T08:17:48.181891084Z caller=queue_manager.go:531 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Error sending samples to remote storage" count=100 err="context deadline exceeded"
In the adapter log:
{"caller":"write.go:167","component":"web","err":"request context cancelled","level":"warn","msg":"Error sending samples to remote storage","num_samples":100,"storage":"graphite","ts":"2019-12-12T08:17:48.434Z"}
{"caller":"write.go:167","component":"web","err":"request context cancelled","level":"warn","msg":"Error sending samples to remote storage","num_samples":100,"storage":"graphite","ts":"2019-12-12T08:17:48.532Z"}
{"caller":"write.go:167","component":"web","err":"request context cancelled","level":"warn","msg":"Error sending samples to remote storage","num_samples":100,"storage":"graphite","ts":"2019-12-12T08:17:48.534Z"}

@InformatiQ
Contributor

Are you sure Graphite is not having any issues? It could be slow at ingesting the samples.
What is the CPU/memory usage like on the GRA instances?

@razumv
Author

razumv commented Dec 12, 2019

The adapter instances are currently not resource-limited. There are 5 of them, each consuming about 0.1 CPU cores and 700 MB of RAM. Graphite was deployed through docker-compose.

@InformatiQ
Contributor

What does the resource usage of Graphite look like? Are there any errors on the Graphite side?

@razumv
Author

razumv commented Dec 16, 2019

Graphite is about 40% loaded; today we'll deploy it as a cluster and try writing to that.
But could it be that one batch of metrics doesn't finish sending before the next one starts?
