Releases: m-lab/prometheus-support
Weekly release: 2018-06-04 to 2018-06-11
Upgrade of github-receiver from v0.2 to v0.3.
- Enable alert routing between ops-tracker and dev-tracker on a per-alert basis. All alerts now include a "repo" label.
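The per-alert routing described above can be sketched roughly as follows. This is an illustrative sketch only, not the github-receiver's actual code: the function name `target_repo`, the fallback to ops-tracker, and the allow-list are all assumptions made for the example.

```python
# Hypothetical sketch: choose an issue tracker from an alert's "repo" label.
# DEFAULT_REPO and KNOWN_REPOS are assumptions, not documented behaviour.
DEFAULT_REPO = "ops-tracker"
KNOWN_REPOS = {"ops-tracker", "dev-tracker"}


def target_repo(alert_labels):
    """Return the tracker repo named by the alert's "repo" label.

    Falls back to DEFAULT_REPO when the label is absent or names a repo
    that is not in KNOWN_REPOS.
    """
    repo = alert_labels.get("repo", DEFAULT_REPO)
    return repo if repo in KNOWN_REPOS else DEFAULT_REPO
```

With this sketch, an alert labelled `repo="dev-tracker"` would be filed in dev-tracker, and an alert with no usable "repo" label would fall back to ops-tracker.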
Multiple updates and one new dashboard:
- NDT_GlobalTestRate.json -- includes a three week overlay to visually contrast past performance.
- Ops_PlatformOverview.json -- restricts some queries to only platform instances of the node exporter.
- Ops_SwitchOverview.json -- updates the selection query that identifies sites, so all sites are available even if they are offline currently.
- Pipeline_Embargo.json -- a new dashboard for visually comparing scraper output to embargo input and embargo output to etl input.
New alerts:
- SnmpScrapingDownAtSite -- corrects an earlier alert that would only fire if all SNMP metrics from all nodes were missing. This new alert fires when a single site stops collecting SNMP metrics.
- VdlimitMetricsMissingForNode -- an integrity check to guarantee that all metrics used by mlab-ns are available.
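The difference between the earlier SNMP alert (fires only when every SNMP metric is gone) and SnmpScrapingDownAtSite (fires when any one site goes quiet) can be illustrated with a small sketch. The real alerts are Prometheus rules; the Python below only mirrors the logic, and the site names, metric names, and function names are hypothetical.

```python
# Hypothetical sketch contrasting a global "everything is missing" check
# with a per-site check. Samples are (site, metric_name) pairs.

def all_snmp_missing(samples):
    """Old behaviour: true only when no SNMP samples exist at all."""
    return len(samples) == 0


def sites_missing_snmp(expected_sites, samples):
    """New behaviour: list every expected site with no SNMP samples."""
    reporting = {site for site, _metric in samples}
    return sorted(set(expected_sites) - reporting)
```

For example, if only one of two expected sites is reporting, `all_snmp_missing` stays false while `sites_missing_snmp` names the silent site.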
Weekly release: add github issue search links to slack channel notifications
Merge pull request #235 from m-lab/sandbox-soltesz Add a link to slack messages with a link to open github alerts
Weekly release: move github-receiver alerts to dev-tracker
This release moves github issues created by the github-receiver to the dev-tracker repo (instead of scraper).
This release also changes LameDuckMetricMissingForNode so that alerts identify which instance is firing.
New alerts for collectd-mlab metrics.
We recently added a metric for whether collectd-mlab is healthy on nodes. This release simply adds two new alerts for when collectd-mlab is either down or missing.
Monitor blackbox_exporter instances (correctly)
A previous release attempted to implement monitoring of our blackbox_exporter instances, but did so incorrectly. This release fixes that implementation. Additionally, it turns off service auto-discovery for the BBE instance running in the prometheus-federation k8s cluster in favor of manually specifying a target rule in the Prom configs. Doing this makes the configurations for the IPv4 and IPv6 BBE instances more or less the same, since auto-discovery won't work for the IPv6 instance running on a Linode VM.
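As a rough illustration of why manually specified targets keep the two BBE instances symmetric: a blackbox_exporter is queried through its `/probe` endpoint with `target` and `module` parameters, so the same request shape works no matter where the exporter runs. The sketch below builds such URLs; the exporter addresses, the probed target, and the module name are placeholders, not values from the actual configs.

```python
from urllib.parse import urlencode


def probe_url(exporter_address, target, module):
    """Build a blackbox_exporter /probe URL for one target and module."""
    query = urlencode({"target": target, "module": module})
    return f"http://{exporter_address}/probe?{query}"


# Hypothetical exporter addresses: one in-cluster (IPv4), one on a remote VM (IPv6).
EXPORTERS = ["bbe-ipv4.example.net:9115", "bbe-ipv6.example.net:9115"]

urls = [probe_url(addr, "example.org:443", "tcp_connect") for addr in EXPORTERS]
```

Only the exporter address differs between the two URLs, which is the property the manual target rule relies on.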
Alerts for experiment metrics, alerts for BBE, bugfix in 1 dashboard
The bulk of this release is new Prometheus alerts. We now have alerting for:
- blackbox_exporter metrics that mlab-ns now relies on.
- the blackbox_exporter instances themselves.
- a new instance of node_exporter that is running on eb.measurementlab.net
Additionally this release contains:
- A new Prometheus scrape job which will scrape a node_exporter instance on EB.
- A bugfix to the Ops_PlatformOverview Grafana dashboard.
Weekly release: Retire status.* URLs for mlab-oti
This release completes the turndown of the legacy status.* URLs for the Prometheus, Grafana, and Alertmanager stack.
All projects and all services should now be accessed via their TLS names.
Weekly release: Add Ops Overview dashboard & Alertmanager with Basic Auth
This release includes:
Grafana Updates / Fixes:
- Add dashboard -- Ops: Platform Overview
- Increase the nginx ingress's proxy-connect-timeout to exceed the Prometheus query timeout. Should fix "Gateway Timeout" errors.
Alerts Changes
- Alertmanager links sent to slack will have basic auth credentials embedded. So, clicking on those links should "just work" without prompting for a username / password.
- Adds new alert case for NagiosExporterUnavailable
- Updates ParserDailyVolumeTooLow to only count rows that use status="ok".
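Embedding basic auth credentials in a link means placing `user:password@` ahead of the host in the URL. A minimal sketch of that transformation is shown below; the function name, hostname, and credentials are made up for illustration, and this is not how the Alertmanager templates themselves are written. The sketch assumes a plain hostname (no IPv6 literal).

```python
from urllib.parse import quote, urlsplit, urlunsplit


def with_basic_auth(url, username, password):
    """Return url with percent-encoded basic auth credentials embedded."""
    parts = urlsplit(url)
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}"
    netloc = f"{userinfo}@{parts.hostname}"
    if parts.port:
        netloc += f":{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

For instance, `with_basic_auth("https://alertmanager.example.com/#/alerts", "user", "p@ss")` yields `https://user:p%40ss@alertmanager.example.com/#/alerts`; the credentials are percent-encoded so that characters such as `@` do not break the URL.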
Weekly release: TLS & Basic Auth
- Add TLS & Basic Auth to Grafana, Prometheus, and Alertmanager
- Add monitoring for data-processing-cluster and minimal alerts on etl-gardener
- Fix table names for bigquery-exporter queries
- Update nagios exporter alerts to cover both deployments of nagios
- Adds additional IPv6 targets to prometheus
IPv6 monitoring
The principal change in this release is the addition of IPv6 monitoring. Since GCP doesn't currently support IPv6 for most applications, monitoring is enabled via a remote (Linode in this case) VM running several Docker instances (one for each GCP project) of the Prometheus blackbox_exporter.