Releases: m-lab/prometheus-support

Weekly release: 2018-06-04 to 2018-06-11

11 Jun 18:35

Upgrade of github-receiver from v0.2 to v0.3.

  • Enable alert routing between ops-tracker and dev-tracker on a per-alert basis. All alerts now include a "repo" label.
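As a rough illustration of per-alert routing, the sketch below picks a tracker repo from an alert's "repo" label. It is not the github-receiver implementation: the function name, the fallback repo, and the pairing of alert names with repos are all assumptions made for the example.

```python
# Hypothetical sketch: choose a GitHub tracker repo from an alert's "repo" label.
# The fallback and the set of accepted repos are assumptions for illustration.
DEFAULT_REPO = "dev-tracker"
KNOWN_REPOS = {"ops-tracker", "dev-tracker"}


def repo_for_alert(labels):
    """Return the repo named by the "repo" label, or the fallback if absent or unknown."""
    repo = labels.get("repo", DEFAULT_REPO)
    return repo if repo in KNOWN_REPOS else DEFAULT_REPO


# Example alerts; which alert belongs to which repo is made up here.
alerts = [
    {"alertname": "SnmpScrapingDownAtSite", "repo": "ops-tracker"},
    {"alertname": "ParserDailyVolumeTooLow", "repo": "dev-tracker"},
]
routed = {a["alertname"]: repo_for_alert(a) for a in alerts}
```

Keeping the decision in a label means the alert rule itself states where its issue should land, so routing changes do not require changes to the receiver.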

Multiple updates and one new dashboard:

  • NDT_GlobalTestRate.json -- includes a three week overlay to visually contrast past performance.
  • Ops_PlatformOverview.json -- restricts some queries to only platform instances of the node exporter.
  • Ops_SwitchOverview.json -- updates the selection query that identifies sites, so all sites are available even if they are offline currently.
  • Pipeline_Embargo.json -- a new dashboard for visually comparing scraper output to embargo input and embargo output to etl input.

New alerts:

  • SnmpScrapingDownAtSite -- corrects an earlier alert that would only fire if all SNMP metrics from all nodes were missing. This new alert fires when a single site stops collecting SNMP metrics.
  • VdlimitMetricsMissingForNode -- an integrity check to guarantee that all metrics used by mlab-ns are available.
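The difference between the earlier SNMP alert and SnmpScrapingDownAtSite can be sketched in plain Python. This models only the logic described above, not the actual Prometheus rule expressions; the function names and site names are hypothetical.

```python
# Hypothetical model of the two alert conditions. Site names are made up.

def old_alert_fires(expected_sites, reporting_sites):
    """Earlier behavior: fire only when no expected site reports SNMP metrics."""
    expected = set(expected_sites)
    return bool(expected) and not (expected & set(reporting_sites))


def sites_down(expected_sites, reporting_sites):
    """New behavior: one result per expected site that has stopped reporting."""
    return sorted(set(expected_sites) - set(reporting_sites))


expected = {"abc01", "abc02", "xyz01"}
reporting = {"abc01", "xyz01"}  # abc02 has stopped collecting SNMP metrics

fires_old = old_alert_fires(expected, reporting)  # False: other sites still report
down_new = sites_down(expected, reporting)        # ["abc02"]
```

With one site offline, the earlier condition stays silent because other sites still report, while the per-site condition identifies the affected site.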

Weekly release: add github issue search links to slack channel notifications

04 Jun 21:13
Merge pull request #235 from m-lab/sandbox-soltesz

Add a link to Slack messages that points to open GitHub alert issues

Weekly release: move github-receiver alerts to dev-tracker

04 Jun 16:37

This release moves github issues created by the github-receiver to the dev-tracker repo (instead of scraper).

This release also changes LameDuckMetricMissingForNode so alerts can identify which instance is firing.

New alerts for collectd-mlab metrics.

15 May 17:05

We recently added a metric for whether collectd-mlab is healthy on nodes. This release simply adds two new alerts for when collectd-mlab is either down or missing.

Monitor blackbox_exporter instances (correctly)

07 May 17:44

A previous release attempted to implement monitoring of our blackbox_exporter instances, but did so incorrectly. This release fixes that implementation. Additionally, it turns off service auto-discovery for the BBE instance running in the prometheus-federation k8s cluster in favor of manually specifying a target rule in the Prom configs. Doing this makes the configurations for the IPv4 and IPv6 BBE instances more or less the same, since auto-discovery won't work for the IPv6 instance running on a Linode VM.
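To show why static targets keep the two configurations alike, here is a minimal sketch that builds scrape job entries as Python dictionaries. The `job_name` and `static_configs` keys follow the Prometheus scrape config format; the helper name, job names, and target addresses are assumptions and do not come from the repository.

```python
# Hypothetical sketch: both BBE instances described by the same static shape.
# Job names and target addresses below are placeholders, not the real ones.

def static_scrape_job(job_name, targets):
    """Build a scrape job entry that lists its targets explicitly."""
    return {
        "job_name": job_name,
        "static_configs": [{"targets": list(targets)}],
    }


ipv4_job = static_scrape_job("blackbox-exporter-ipv4", ["bbe-ipv4.example.com:9115"])
ipv6_job = static_scrape_job("blackbox-exporter-ipv6", ["bbe-ipv6.example.com:9115"])

# Apart from the name and the address, the two entries have identical structure.
same_shape = ipv4_job.keys() == ipv6_job.keys()
```

Only the job name and target address differ between the two entries, which is the property auto-discovery could not provide for an instance outside the cluster.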

Alerts for experiment metrics, alerts for BBE, bugfix in 1 dashboard

01 May 21:58

The bulk of this release is new Prometheus alerts. We now have alerting for:

  • blackbox_exporter metrics that mlab-ns now relies on.
  • the blackbox_exporter instances themselves.
  • a new instance of node_exporter that is running on

Additionally this release contains:

  • A new Prometheus scrape job which will scrape a node_exporter instance on EB.
  • A bugfix to the Ops_PlatformOverview Grafana dashboard.

Weekly release: Retire status.* URLs for mlab-oti

16 Apr 17:16

This release completes the turndown of the legacy status.* URLs for the Prometheus, Grafana, and Alertmanager stack.

All projects and all services should now be accessed via their TLS names.

Weekly release: Add Ops Overview dashboard & Alertmanager with Basic Auth

09 Apr 15:53

This release includes:

Grafana Updates / Fixes:

  • Add dashboard -- Ops: Platform Overview
  • Increase the nginx ingress's proxy-connect-timeout to exceed the Prometheus query timeout. Should fix "Gateway Timeout" errors.

Alerts Changes

  • Alertmanager links sent to slack will have basic auth credentials embedded. So, clicking on those links should "just work" without prompting for a username / password.
  • Adds new alert case for NagiosExporterUnavailable
  • Updates ParserDailyVolumeTooLow to only count rows that use status="ok".
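Embedding basic auth credentials in a link, as described for the Alertmanager links above, could look roughly like the following. This is a sketch built on Python's standard library rather than the mechanism the release uses; the function name, URL, and credentials are placeholders.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def with_basic_auth(url, username, password):
    """Return url with percent-encoded user:password@ placed before the host.

    Sketch only: bracketed IPv6 hosts are not handled.
    """
    parts = urlsplit(url)
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}"
    netloc = f"{userinfo}@{parts.hostname}"
    if parts.port is not None:
        netloc += f":{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# Placeholder host and credentials, for illustration only.
link = with_basic_auth("https://alertmanager.example.com/#/alerts", "user", "p@ss")
```

Percent-encoding the credentials matters because characters such as "@" or ":" in a password would otherwise change how the URL is parsed.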

Weekly release: TLS & Basic Auth

02 Apr 19:35
  • Add TLS & Basic Auth to Grafana, Prometheus, and Alertmanager
  • Add monitoring for data-processing-cluster and minimal alerts on etl-gardener
  • Fix table names for bigquery-exporter queries
  • Update nagios exporter alerts to cover both deployments of nagios
  • Adds additional IPv6 targets to prometheus

IPv6 monitoring

26 Mar 17:08

The principal change in this release is the addition of IPv6 monitoring. Since GCP doesn't currently support IPv6 for most applications, monitoring is enabled via a remote (Linode in this case) VM running several Docker instances (one for each GCP project) of the Prometheus blackbox_exporter.