Queue full #69

Open
razumv opened this issue Dec 10, 2019 · 6 comments

@razumv

razumv commented Dec 10, 2019

Hello, I have a problem with the queue in Prometheus + graphite-remote-adapter:
level=warn ts=2019-12-10T08:31:54.018127762Z caller=queue_manager.go:230 component=remote queue="0:http://***/write?graphite.default-prefix=kube_poly_ " msg="Remote storage queue full, discarding sample. Multiple subsequent messages of this kind may be suppressed."

Prometheus & adapter config is default
only 10% of metrics from 70 computers reach
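
For reference, these are the Prometheus remote_write queue knobs that decide when samples start getting discarded; the values below are purely illustrative (not the defaults of any particular Prometheus version), and the URL is a placeholder for the adapter endpoint:

remote_write:
  - url: "http://graphite-remote-adapter:9201/write?graphite.default-prefix=kube_poly_"
    queue_config:
      capacity: 10000              # samples buffered per shard; when full, new samples are dropped
      min_shards: 1
      max_shards: 200              # upper bound for the automatic resharding
      max_samples_per_send: 500    # batch size of each request to the adapter
      batch_send_deadline: 5s      # flush a partial batch after this long
      min_backoff: 30ms
      max_backoff: 100ms

Raising capacity and max_samples_per_send only buys headroom; if the adapter or Graphite cannot keep up with the overall rate, the queue will eventually fill again.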

@InformatiQ
Contributor

When graphite-remote-adapter is unable to send metrics to Graphite for any reason, its internal queue fills up and it stops accepting new metrics. Prometheus then starts dropping samples so that its own queue does not fill up.
You might want to add more graphite-remote-adapter instances to support the load.
What we do is run many graphite-remote-adapters behind an LB, which makes it easy to scale as needed (see the sketch below).
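
A minimal sketch of that layout with docker-compose, assuming HAProxy as the LB (the service names, the adapter image tag, and haproxy.cfg are all hypothetical; 9201 is the adapter port seen in the logs later in this thread):

services:
  gra-1:
    image: graphite-remote-adapter:latest   # hypothetical image name; adapter flags/config omitted
  gra-2:
    image: graphite-remote-adapter:latest
  gra-lb:
    image: haproxy:2.0
    ports:
      - "9201:9201"                          # Prometheus remote_write points at this address
    volumes:
      - ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro   # hypothetical cfg round-robining :9201 to gra-1 and gra-2

Prometheus then needs only one remote_write URL (http://gra-lb:9201/write?...), and more gra-N services can be added behind the LB as the load grows.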

@razumv
Author

razumv commented Dec 12, 2019

For example, right now:
remote_adapter_sent_batch_duration_seconds_sum{} 127918.57609338008
remote_adapter_sent_batch_duration_seconds_count{} 152966
remote_adapter_sent_samples_total{} 14458568
Does that mean I only got 152k out of ~14.5M data points?
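
Note: if remote_adapter_sent_batch_duration_seconds_count counts batch sends rather than individual samples, which is what the metric name suggests, the two numbers are measuring different things:

14458568 samples / 152966 batches ≈ 94.5 samples per batch
127918.58 s of send time / 152966 batches ≈ 0.84 s per batch

so roughly 14.5M samples were sent in ~153k batches, at about 0.84 s per batch.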

I currently use one remote address for all my adapters (5 of them).

OK, this is my config:
#Prometheus:
remote_write:

My Prometheus log:
level=info ts=2019-12-12T06:53:29.923422835Z caller=queue_manager.go:340 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Remote storage resharding" from=1 to=17
level=info ts=2019-12-12T06:53:39.923304302Z caller=queue_manager.go:340 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Remote storage resharding" from=17 to=35
level=info ts=2019-12-12T06:53:49.923441611Z caller=queue_manager.go:343 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Currently resharding, skipping."
level=info ts=2019-12-12T06:53:59.923354973Z caller=queue_manager.go:340 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Remote storage resharding" from=35 to=98
level=info ts=2019-12-12T06:54:19.923450376Z caller=queue_manager.go:340 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Remote storage resharding" from=98 to=153
level=info ts=2019-12-12T06:54:29.92329724Z caller=queue_manager.go:343 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Currently resharding, skipping."
level=info ts=2019-12-12T06:54:39.923358133Z caller=queue_manager.go:340 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Remote storage resharding" from=153 to=274
level=info ts=2019-12-12T06:57:29.923270195Z caller=queue_manager.go:340 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Remote storage resharding" from=274 to=165
level=info ts=2019-12-12T07:00:19.923636921Z caller=queue_manager.go:340 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Remote storage resharding" from=165 to=106
level=info ts=2019-12-12T07:00:39.92329415Z caller=queue_manager.go:340 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Remote storage resharding" from=106 to=152
level=info ts=2019-12-12T07:03:19.923396969Z caller=queue_manager.go:340 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Remote storage resharding" from=152 to=98

This appears periodically in the Prometheus log:
level=warn ts=2019-12-12T08:17:47.125876374Z caller=queue_manager.go:531 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Error sending samples to remote storage" count=100 err="context deadline exceeded"
level=warn ts=2019-12-12T08:17:47.3388834Z caller=queue_manager.go:531 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Error sending samples to remote storage" count=100 err="context deadline exceeded"
level=warn ts=2019-12-12T08:17:48.171610956Z caller=queue_manager.go:531 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Error sending samples to remote storage" count=100 err="context deadline exceeded"
level=warn ts=2019-12-12T08:17:48.181891084Z caller=queue_manager.go:531 component=remote queue="0:http://GRA:9201/write?graphite.default-prefix=___ " msg="Error sending samples to remote storage" count=100 err="context deadline exceeded"
In the adapter log:
{"caller":"write.go:167","component":"web","err":"request context cancelled","level":"warn","msg":"Error sending samples to remote storage","num_samples":100,"storage":"graphite","ts":"2019-12-12T08:17:48.434Z"}
{"caller":"write.go:167","component":"web","err":"request context cancelled","level":"warn","msg":"Error sending samples to remote storage","num_samples":100,"storage":"graphite","ts":"2019-12-12T08:17:48.532Z"}
{"caller":"write.go:167","component":"web","err":"request context cancelled","level":"warn","msg":"Error sending samples to remote storage","num_samples":100,"storage":"graphite","ts":"2019-12-12T08:17:48.534Z"}

@InformatiQ
Contributor

Are you sure Graphite is not having any issues? It could be slow at ingesting the samples.
What is the CPU/memory usage like on the GRA instances?

@razumv
Author

razumv commented Dec 12, 2019

The adapter instances are currently not resource-limited. There are 5 of them, each consuming about 0.1 CPU cores and 700 MB of RAM. Graphite was deployed through docker-compose.

@InformatiQ
Contributor

What does the resource usage of Graphite look like? Are there any errors on the Graphite side?

@razumv
Author

razumv commented Dec 16, 2019

Graphite is about 40% loaded; today we'll deploy it as a cluster and try writing to that.
But could it be that one batch of metrics doesn't finish sending before the next one starts?
