Dynamic Zero Config Prometheus setup #4

Open
chrisbecke opened this issue Sep 30, 2022 · 3 comments

Comments

@chrisbecke

Description

As a consumer of a Swarm, I want to deploy a stack that contains its own Prometheus instance. This Prometheus instance already knows how to scrape all the services in its stack. However, all the metrics need to be scraped by the swarm's main Prometheus instance so that they arrive in the central Grafana dashboard.

Proposal

The main Prometheus instance can contain a federation job. Something like this:

  - job_name: federate-prometheus
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".*"}'
    dns_sd_configs:
    - names: [ tasks.scrape.target ]
      type: 'A'
      port: 9090

Two additional requirements follow: a common prometheus network and a convention-based naming approach. Each child Prometheus instance needs to attach itself to the common prometheus network and declare an alias there that allows its discovery by the main Prometheus instance.

Stack-local Prometheus instances could use this minimal declaration to become eligible for scraping.

networks:
  prometheus:
    external: true
    
services:
   prometheus:
     networks:
       default:
       prometheus: 
         aliases: ["scrape.target"]

Result
By probing scrape.target via dns_sd_configs, the main instance gets a dynamic list of IPs of all active stack-local Prometheus instances and pulls all their metrics via the /federate endpoint.
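
For the DNS-based discovery above to work, the main Prometheus instance also has to be attached to the shared network. A minimal sketch of what that could look like, assuming the shared network is an attachable overlay named prometheus and that the federation job above is part of the mounted config (the image tag and config path are placeholders, not part of this proposal):

networks:
  prometheus:
    external: true

services:
  prometheus:
    image: prom/prometheus:latest
    networks:
      - prometheus
    volumes:
      # prometheus.yml contains the federate-prometheus job shown above
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro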

@s4ke
Member

s4ke commented Sep 30, 2022

Thanks for this proposal. Here are some thoughts:

  1. This would actually decouple some things, which would make this a simple extension point.
  2. Right now the stack "requires" Traefik to be configured as it is, but your proposal would invert the control here, which I really like.
  3. But: this would mean a lot of extra Prometheus instances and higher storage usage if the federated instances need storage.

@chrisbecke
Author

In terms of storage, each Prometheus instance manages its own storage, so once federated, the stack-local instances only need enough retention for their own private rules, if any.
If the main Prometheus instance (serving a main Grafana instance for visualization and Grafana-based alerting) is the only important data store, you can give the stack-local instances very short retention periods and not mount the db for persistence at all, so it is pruned if/when the stack-local Prometheus is restarted.
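
For example (a sketch only: the 2h retention value and image tag are illustrative, and the flags are the Prometheus 2.x flag names), a stack-local instance could be declared with a short retention window and no volume for the TSDB:

networks:
  prometheus:
    external: true

services:
  prometheus:
    image: prom/prometheus:latest
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      # keep only a short window locally; the federated copy lives on the main instance
      - '--storage.tsdb.retention.time=2h'
    networks:
      default:
      prometheus:
        aliases: ["scrape.target"]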

The params: 'match[]': setting specifies which series the main Prometheus instance scrapes, so in a bigger setup it might be necessary to come up with a convention for filtering which metrics the main Prometheus pulls. If the stack-local Prometheus instances scrape their own node-exporter metrics etc., there could be a massive bloom of metrics, in which case filtering for metrics that are explicitly labelled for scraping would be necessary (see the sketch below).
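
As an illustration of such a convention (a sketch only; the federate="true" label is hypothetical and not part of this proposal), the federation job on the main instance could restrict its match[] selector to series that explicitly opt in:

  - job_name: federate-prometheus
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # only pull series that carry the opt-in label
        - '{federate="true"}'
    dns_sd_configs:
    - names: [ tasks.scrape.target ]
      type: 'A'
      port: 9090

The stack-local instances would then have to attach that label to whatever they want federated, e.g. via external_labels or metric relabelling.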

@s4ke
Member

s4ke commented Nov 17, 2022

Note: If we do this, we should do the implementation over at https://github.com/neuroforgede/swarmsible-stacks
