Dynamic Zero Config Prometheus setup #4

Open
chrisbecke opened this issue Sep 30, 2022 · 3 comments

Comments

@chrisbecke

Description

As a consumer of a Swarm, I want to deploy a stack that contains its own Prometheus instance. This Prometheus instance already knows how to scrape all the services in its stack. However, all the metrics need to be scraped by the swarm's main Prometheus instance so that they arrive in the central Grafana dashboard.

Proposal

The main Prometheus instance can contain a federation job. Something like this:

  - job_name: federate-prometheus
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".*"}'
    dns_sd_configs:
    - names: [ tasks.scrape.target ]
      type: 'A'
      port: 9090

Two additional requirements follow: a common prometheus network and a convention-based naming approach. Each child Prometheus instance needs to attach itself to the common prometheus network and declare an alias there that allows its discovery by the main Prometheus instance.

Stack-local Prometheus instances could use this minimal declaration to become eligible for scraping.

networks:
  prometheus:
    external: true
    
services:
   prometheus:
     networks:
       default:
       prometheus: 
         aliases: ["scrape.target"]

Result
By probing scrape.target via dns_sd_configs, the main instance gets a dynamic list of IPs of all active stack-local Prometheus instances and pulls all their metrics via the /federate endpoint.
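
For the DNS-based discovery above to work, the main Prometheus instance also has to be attached to the shared network. A minimal sketch of what that could look like, assuming the shared network is an attachable overlay named prometheus and that the federation job above is part of the mounted config (the image tag and config path are placeholders, not part of this proposal):

networks:
  prometheus:
    external: true

services:
  prometheus:
    image: prom/prometheus:latest
    networks:
      - prometheus
    volumes:
      # prometheus.yml contains the federate-prometheus job shown above
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro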

@s4ke
Member

s4ke commented Sep 30, 2022

Thanks for this proposal. Here are some thoughts:

  1. This would actually decouple some things, which would make this a simple extension point.
  2. Right now the stack "requires" Traefik to be configured as it is, but your proposal would invert the control here, which I really like.
  3. But: this would mean a lot of extra Prometheus instances and higher storage usage if the federated instances need storage.

@chrisbecke
Author

In terms of storage, each Prometheus instance manages its own storage, so once federated, the stack-local instances only need enough retention for their own private rules, if any.
If the main Prometheus instance (serving a main Grafana instance for visualization and Grafana-based alerting) is the only important data store, you can give the stack-local instances very short retention periods and not mount the db for persistence at all, so it is pruned if/when the stack-local Prometheus is restarted.
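
For example (a sketch only: the 2h retention value and image tag are illustrative, and the flags are the Prometheus 2.x flag names), a stack-local instance could be declared with a short retention window and no volume for the TSDB:

networks:
  prometheus:
    external: true

services:
  prometheus:
    image: prom/prometheus:latest
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      # keep only a short window locally; the federated copy lives on the main instance
      - '--storage.tsdb.retention.time=2h'
    networks:
      default:
      prometheus:
        aliases: ["scrape.target"]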

The params: 'match[]': setting specifies which series the main Prometheus instance scrapes, so in a bigger setup it might be necessary to come up with a convention for filtering which metrics the main Prometheus pulls. If the stack-local Prometheus instances scrape their own node-exporter metrics etc., there could be a massive bloom of metrics, in which case filtering for metrics that are explicitly labelled for scraping would be necessary (see the sketch below).
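
As an illustration of such a convention (a sketch only; the federate="true" label is hypothetical and not part of this proposal), the federation job on the main instance could restrict its match[] selector to series that explicitly opt in:

  - job_name: federate-prometheus
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # only pull series that carry the opt-in label
        - '{federate="true"}'
    dns_sd_configs:
    - names: [ tasks.scrape.target ]
      type: 'A'
      port: 9090

The stack-local instances would then have to attach that label to whatever they want federated, e.g. via external_labels or metric relabelling.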

@s4ke
Member

s4ke commented Nov 17, 2022

Note: If we do this, we should do the implementation over at https://github.com/neuroforgede/swarmsible-stacks
