Unable to fetch labels from Loki (Failed to call resource)... in Grafana when TLS is enabled in Loki #344

Closed
przemeklal opened this issue Feb 7, 2024 · 8 comments

@przemeklal
Member

przemeklal commented Feb 7, 2024

Bug Description

Grafana can't query Loki when TLS is enabled.

The following error is thrown on the Loki Data Source page.
image

In grafana.log I can see this whenever I click the "Test" button:

logger=context userId=1 orgId=1 uname=admin t=2024-02-07T12:48:58.800571751Z level=error msg="Failed to call resource" error="Get \"https://10.red.red.red/cos-loki-0/loki/api/v1/labels?start=1707309535023000000&end=1707310135023000000\": x509: cannot validate certificate for 10.red.red.red because it doesn't contain any IP SANs" traceID=

10.red.red.red is Traefik's IP; the Loki URL appears as https://10.red.red.red/cos-loki-0 in Grafana.

Loki itself seems to be okay: the unit is active/idle, and I can run these commands just fine (please note the -k parameter for curl, which skips certificate verification):

$ curl -k https://10.red.red.red/cos-loki-0/services
memberlist-kv => Running
ingester-querier => Running
distributor => Running
query-frontend => Running
ingester => Running
server => Running
analytics => Running
query-scheduler-ring => Running
query-frontend-tripperware => Running
ruler => Running
cache-generation-loader => Running
ring => Running
query-scheduler => Running
store => Running
querier => Running
$ curl -k https://10.red.red.red/cos-loki-0/ready
ready

I tried toggling this switch without any luck:
image

The same errors appear in grafana logs.

To Reproduce

Deploy COS Lite with the TLS overlay and go to Grafana. Versions are listed below.

Environment

Versions:

App                              Version  Status  Scale  Charm                         Channel        Rev
alertmanager                     0.25.0   active      1  alertmanager-k8s              latest/stable   96
ca                                        active      1  self-signed-certificates      latest/edge     60
catalogue                                 active      1  catalogue-k8s                 latest/stable   33
external-ca                               active      1  self-signed-certificates      latest/edge     60
grafana                          9.2.1    active      1  grafana-k8s                   latest/stable   93
loki                             2.9.4    active      1  loki-k8s                      latest/edge    117
prometheus                       2.47.2   active      1  prometheus-k8s                latest/stable  159
scrape-interval-config-metrics   n/a      active      1  prometheus-scrape-config-k8s  latest/edge     45
scrape-interval-config-monitors  n/a      active      1  prometheus-scrape-config-k8s  latest/edge     45
traefik                          2.10.4   active      1  traefik-k8s                   latest/stable  166


@Abuelodelanada
Contributor

Grafana is trying to access Loki through an external URL provided by Traefik. That URL uses HTTPS because we are using the TLS overlay.
The certificate presented by Traefik is signed by the external-ca charm, and Grafana does not trust that CA.

We need to add one more relation in tls-overlay.yaml for Grafana to trust that cert:

Integration provider                Requirer                     Interface              Type     Message
external-ca:send-ca-cert            grafana:receive-ca-cert      certificate_transfer   regular  

@przemeklal
Member Author

@Abuelodelanada While the relation you mentioned might be needed as well, please note that it's not CA cert verification that's failing. The error is:

x509: cannot validate certificate for 10.red.red.red because it doesn't contain any IP SANs

I checked the cert presented at the Loki endpoint shown in Grafana (well, presented by Traefik), https://10.red.red.red/cos-loki-0/, and indeed there aren't any SANs for this IP. The subject looks like this:

Subject: CN = <redacted>-cos-catalog.<redacted>.com, x500UniqueIdentifier = <redacted-uuid>

So trying to reach Loki on the Traefik IP instead of that FQDN will always fail as long as the Traefik IP is not included in SANs in that cert.
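
The SAN check described above can be reproduced from a shell. The error text comes from Go's certificate verifier (Grafana is written in Go), which compares an IP-literal host only against IP-type SAN entries and never against the subject CN. The following is a minimal sketch, assuming OpenSSL 1.1.1 or newer for the `-ext` option; the helper name `has_ip_san` and the address 10.0.0.1 are hypothetical and do not come from this deployment.

```shell
#!/usr/bin/env bash
# has_ip_san IP
# Reads, on stdin, the text printed by
#   openssl x509 -noout -ext subjectAltName
# and succeeds only if IP appears as an "IP Address" SAN entry.
# (Hypothetical helper, for illustration only.)
has_ip_san() {
    local ip="$1"
    # Put each SAN entry on its own line, trim surrounding whitespace,
    # then require an exact, whole-line match.
    tr ',' '\n' \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
        | grep -Fxq "IP Address:${ip}"
}

# Example against a live endpoint (hypothetical address, not executed here):
#   openssl s_client -connect 10.0.0.1:443 </dev/null 2>/dev/null \
#     | openssl x509 -noout -ext subjectAltName \
#     | has_ip_san 10.0.0.1 && echo "IP is in the SANs" || echo "IP is missing from the SANs"
```

If the helper fails for the Traefik IP, any client that verifies certificates will reject that URL, whether or not it trusts the issuing CA.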

Also, I tried the suggested relation and it didn't make any difference: Grafana is still failing to query Loki with the same error message.

@przemeklal
Member Author

Update: I tried to add a new Loki data source: https://<redacted>-cos-catalog.<redacted>.com/cos-loki-0

The error message has changed to x509: certificate signed by unknown authority (I went back to grafana latest/stable for this exercise).

After switching "Skip TLS verify" to True in the UI:
image
(This one could also be fixed with the external-ca:send-ca-cert - grafana:receive-ca-cert relation you suggested.)

It has finally turned green:
image

So, in my opinion, we have a few possible solutions here:

  • Make sure that the Traefik IP is included in the SANs of the presented certificate.
  • Make Loki (?) present Traefik's full FQDN instead of the IP address.
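
The two x509 errors seen so far in this thread point at different problems, so it helps to tell them apart before choosing a fix. The following is a small sketch that labels a Grafana log line by its likely cause; the function name and the labels it prints are made up for illustration.

```shell
#!/usr/bin/env bash
# classify_x509_error
# Reads one log line on stdin and prints a short label for the likely cause.
# (Hypothetical helper; the labels are illustrative, not Grafana output.)
classify_x509_error() {
    local line
    IFS= read -r line
    case "$line" in
        *"doesn't contain any IP SANs"*)
            # The URL host is an IP that the certificate does not list.
            echo "ip-not-in-san" ;;
        *"certificate signed by unknown authority"*)
            # The name matched, but the issuing CA is not trusted.
            echo "untrusted-ca" ;;
        *)
            echo "other" ;;
    esac
}
```

In this thread, `ip-not-in-san` corresponds to the data source URL using the bare Traefik IP, while `untrusted-ca` corresponds to Grafana lacking the CA certificate, which the receive-ca-cert relation is meant to supply.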

@Abuelodelanada
Contributor

Hi @przemeklal!

You are right, I was only testing with IP, not domain name.

Let's try both scenarios:

Testing with Traefik using public IP

Let's deploy cos-lite this way in the model ip:

╭─ubuntu@charm-dev-juju-31 ~/repos [microk8s:cos]
╰─$ juju add-model ip
microk8s:admin/cos -> microk8s:admin/ip

╭─ubuntu@charm-dev-juju-31 ~/repos [microk8s:ip]
╰─$ juju deploy cos-lite --channel=edge --trust --overlay ./tls-overlay.yaml --overlay ./offers-overlay.yaml

Let's install curl in grafana/0 container:

╭─ubuntu@charm-dev-juju-31 ~/repos [microk8s:ip]
╰─$ juju ssh --container grafana grafana/0 "apt update && apt install -y curl"

Let's try to curl the Loki endpoint using the Traefik external IP from inside the grafana workload container:

╭─ubuntu@charm-dev-juju-31 ~/repos [microk8s:ip]
╰─$ juju ssh --container grafana grafana/0 "curl https://192.168.1.251/ip-loki-0/ready"
curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here: https://curl.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
ERROR command terminated with exit code 60

Now, let's add the relation:

╭─ubuntu@charm-dev-juju-31 ~/repos [microk8s:ip]
╰─$ juju relate grafana:receive-ca-cert external-ca

Let's try again to curl the Loki endpoint using the Traefik external IP from inside the grafana workload container:

╭─ubuntu@charm-dev-juju-31 ~/repos [microk8s:ip]
╰─$ juju ssh --container grafana grafana/0 "curl https://192.168.1.251/ip-loki-0/ready"
ready

Testing with Traefik using a domain name

Let's deploy cos-lite this way in the model domain:

╭─ubuntu@charm-dev-juju-31 ~/repos [microk8s:ip]
╰─$ juju add-model domain
microk8s:admin/ip -> microk8s:admin/domain

╭─ubuntu@charm-dev-juju-31 ~/repos [microk8s:domain]
╰─$ juju deploy cos-lite --channel=edge --trust --overlay ./tls-overlay.yaml --overlay ./offers-overlay.yaml

Set external hostname for traefik:

╭─ubuntu@charm-dev-juju-31 ~/repos [microk8s:domain]
╰─$ juju config traefik external_hostname=foobar.com

Let's install curl in grafana/0 container:

╭─ubuntu@charm-dev-juju-31 ~/repos [microk8s:domain]
╰─$ juju ssh --container grafana grafana/0 "apt update && apt install -y curl"

Let's point the foobar.com domain at our Traefik external IP in /etc/hosts inside the grafana/0 workload container:

╭─ubuntu@charm-dev-juju-31 ~/repos [microk8s:domain]
╰─$ juju ssh --container grafana grafana/0 "echo 192.168.1.250 foobar.com >> /etc/hosts"

Now, let's try to curl the Loki endpoint using the Traefik external domain from inside the grafana workload container:

╭─ubuntu@charm-dev-juju-31 ~/repos [microk8s:domain]
╰─$ juju ssh --container grafana grafana/0 "curl https://foobar.com/domain-loki-0/ready"
curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here: https://curl.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
ERROR command terminated with exit code 60

Now, let's add the relation:

╭─ubuntu@charm-dev-juju-31 ~/repos [microk8s:domain]
╰─$ juju relate grafana:receive-ca-cert external-ca

Let's try again to curl the Loki endpoint using the Traefik external domain from inside the grafana workload container:

╭─ubuntu@charm-dev-juju-31 ~/repos [microk8s:domain]
╰─$ juju ssh --container grafana grafana/0 "curl https://foobar.com/domain-loki-0/ready"
ready

@przemeklal
Member Author

Great that it works for you, but my issue is that the Traefik endpoint cert contains only the FQDN, while in Grafana the Loki data source URL uses an IP. This should be consistent, in my opinion. Also, the alertmanager and prometheus data sources don't use bare IP addresses but .cluster.local FQDNs, which adds even more confusion.

╭─ubuntu@charm-dev-juju-31 ~/repos [microk8s:domain]
╰─$ juju ssh --container grafana grafana/0 "echo 192.168.1.250 foobar.com >> /etc/hosts"

This isn't going to be persistent, is it? Also, is this foobar.com domain configured anywhere in the bundle and does it show in any relation data? I have to say I don't fully understand this test scenario.

@przemeklal
Member Author

przemeklal commented Feb 16, 2024

@dstathis @simskij Issues that we discussed on the call:

  1. In tls-overlay.yaml, traefik is related to external-ca and not ca.
  2. The cert presented by traefik uses the DNS name specified in MAAS but not the traefik IP.
  3. The Alertmanager data source uses an IP address instead of the external_hostname set in the traefik juju config.

tl;dr of how we got it to work in the end:

  • used the MAAS DNS name assigned to the traefik IP:
    juju config traefik external_hostname="<redacted>-cos-catalog.<redacted>.com"
    
    • somehow the self-signed-certificates charm was smart enough to do a reverse DNS lookup and generate the cert only for the A record it received from the MAAS DNS server, despite this FQDN not being present anywhere in the bundles
    • the above step fixed the issue with the IP not being present in the SANs (since we're not using the bare IP anymore)
  • removed the external-ca charm and its relation with traefik
  • related the ca charm to traefik to replace the old external-ca relation
  • removed and recreated the grafana-source relation between grafana and alertmanager (loki managed to update it on its own)
  • manually removed the catalogue-0 pod and let juju recreate it

All data sources in Grafana work now, although Prometheus still uses the .cluster.local URL.
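
The steps in the tl;dr above could be sketched roughly as the script below. This is not the exact sequence run on the call: the endpoint names (`grafana-source`), the use of `kubectl` with a namespace equal to the model name, and the function and variable names are all assumptions. The script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Dry run by default: each command is printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# apply_tls_workaround FQDN MODEL  (hypothetical helper)
#   FQDN:  DNS name that resolves to the Traefik IP
#   MODEL: Juju model name, assumed to match the Kubernetes namespace
apply_tls_workaround() {
    local fqdn="$1"
    local model="$2"

    # Serve Loki under a name that the certificate can cover.
    run juju config traefik "external_hostname=${fqdn}"

    # Replace external-ca with ca as Traefik's certificate provider.
    run juju remove-relation traefik external-ca
    run juju remove-application external-ca
    run juju relate traefik ca

    # Recreate the Alertmanager data source so it picks up the new URL
    # (endpoint names are assumed).
    run juju remove-relation grafana:grafana-source alertmanager:grafana-source
    run juju relate grafana:grafana-source alertmanager:grafana-source

    # Let Juju recreate the catalogue pod.
    run kubectl delete pod catalogue-0 --namespace "$model"
}
```

Calling `apply_tls_workaround cos.example.com cos` with the default settings prints the seven commands in order. In a real run it would probably be necessary to wait for the model to settle between steps.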

@Abuelodelanada
Contributor

> ╭─ubuntu@charm-dev-juju-31 ~/repos [microk8s:domain]
> ╰─$ juju ssh --container grafana grafana/0 "echo 192.168.1.250 foobar.com >> /etc/hosts"
>
> This isn't going to be persistent, is it? Also, is this foobar.com domain configured anywhere in the bundle and does it show in any relation data? I have to say I don't fully understand this test scenario.

Nope, I did that only to try to reproduce your situation.

@Abuelodelanada
Contributor

Hi @przemeklal

As far as I know, you were able to solve this in a call with Simme and Dylan. Closing.
If that is not the case, please reopen it!
