Store cache with Redis / Valkey #66

Open
seang96 opened this issue Oct 5, 2024 · 17 comments
Labels: enhancement (New feature or request)

Comments

seang96 commented Oct 5, 2024

Saw this mentioned in another DNS service and thought it would be a nice addition. It may run counter to the goal of keeping things simple, but keeping the existing in-memory cache support would hopefully eliminate that as a concern.

cottand (Owner) commented Oct 7, 2024

What problem would the cache solve that the current in-memory cache does not address?

Is the use-case sharing the cache between several Leng instances?

cottand added the enhancement (New feature or request) label on Oct 7, 2024
seang96 (Author) commented Oct 7, 2024

Yes, that would be the use case. I imagine most records have long enough TTLs for a shared cache to be beneficial in reducing lookups to the upstream DNS server.

cottand (Owner) commented Oct 7, 2024

I am not super convinced of how beneficial this would be, considering the complexity it adds. Assuming the shared cache is on a different node:

  1. Is the built-in cache hit-rate within a single Leng instance so low that a shared cache would make a big difference?
  2. Would a query over the network to a Redis cache really be preferable to one to an upstream DNS server?

I don't think we can answer (1) without knowing the built-in cache hit-rate. I think I will add a metric for this in the next release; it is something I am now curious about!

As for (2), that very much depends on your setup. Even if your Redis cache is in the same datacenter as Leng and you are going to 1.1.1.1 for DNS, I'd expect the difference to be under 10ms. To give you an idea, pinging 1.1.1.1 averages 1.4ms from a Contabo server and 4.2ms from a Hetzner server for me.

So I think I can only see this being very useful if your machines are wired up together in your house, but the internet is far away?

At any rate, I will build a new metric for the cache hit-rate and we can see how that does. Let me know what you think!

seang96 (Author) commented Oct 7, 2024

That would be a good metric to add!

Another point that was raised is increased privacy from the upstream DNS server: it wouldn't be able to see as much of your traffic or track your habits.

That being said, I'm also not sure the advantage is big enough to justify it. It wouldn't be complex for me to configure, and I'd be happy to test, but added code complexity just to save a couple of milliseconds may not be worth it.

Edit: also jealous of your ping times. I get 20+ ms on my coax connection and 50+ ms on my backup 5G WAN!

cottand (Owner) commented Nov 2, 2024

Hi @seang96, did you manage to use the metric to estimate your cache hit-rate?

On my end, since you opened this issue I have spotted 0xERR0R/blocky#945, which reflects the added complexity Redis can bring to leng.

seang96 (Author) commented Nov 2, 2024

I am using v1.6.1 and I still don't see the metric. Looking at the dates of the commits and releases, it looks like v1.6.1 was cut before the metric was added?

cottand (Owner) commented Nov 2, 2024

Ah, sorry about that - it's been in master for a while, just not as part of a release. You can run it with the tag sha-20f09ef (assuming you are using containers here).
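
For example, if you are pulling from GHCR that would be the image ghcr.io/cottand/leng:sha-20f09ef.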

seang96 (Author) commented Nov 2, 2024

I updated and see it now. GitHub's mobile app sucks and doesn't show packages, or at least I can't find them. I'll let it collect data for a little while and let you know how it goes.

Thanks!

seang96 (Author) commented Nov 8, 2024

Statistics-wise, so far I get around 12-14% combined cache hits. One instance is generally at 14-20% and the others are at 8-12%.

I did have another idea: use DoH, which I can cache with the nginx proxy itself on the server side and via the Cache-Control header on the client side. Unfortunately I get timeouts with DoH; nginx occasionally reports a 499 response code with a 5-second timeout on the backend. It looks like you can adjust the DoH timeout, but even set to 10s it still times out at 5s. It looks like the timeout for the DNS server itself is 5 seconds and not configurable, which would cap the effective DoH timeout at 5 seconds? Could the DNS server timeout be made configurable? I think it's hard-coded here: main.go#L74-L75

cottand (Owner) commented Nov 10, 2024

Thanks for reporting back!

The timeout you linked is that of the DNS servers (TCP, UDP), not the DoH server. The timeout for answering DoH requests is here and, like you said, is configurable.

The timeout for making upstream DNS queries (including DoH) is defined with timeout = <seconds> at the top level of the settings (it is not well documented, sorry - I will look to improve that).
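
Roughly, here is where each of those timeouts lives (a sketch only, with illustrative values):

    # timeout for making upstream DNS queries (including DoH), in seconds
    timeout = 10

    [Upstream]
        # query timeout for upstream DoH lookups, in seconds
        timeout_s = 10

    [DnsOverHttpServer]
        # timeout for answering incoming DoH requests, in milliseconds
        timeoutMs = 10000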

In the meantime, I also just merged #71 which I think will be useful to measure how long the upstreams are taking.

Is your DNS upstream really taking 5s? Or is leng itself being that slow? If so, I would much rather look into fixing that than add a shared Redis cache. Hopefully you won't need a shared cache as much if leng is faster and you get more cache hits. Your hit-rate is surprisingly low as well - maybe look into growing the cache size a little? That might improve it.

seang96 (Author) commented Nov 10, 2024

I don't think the cache size is the issue; it's that the DNS request load is split among 3 instances.

I'd say the timeouts are more important as well; maybe they should be split out into a new issue? Anyway, it's the /dns-query endpoint (not metrics) that's timing out, and it always times out at exactly 5 seconds even with the DoH timeout at 10000 and the upstream DoH timeout_s set to 10. I did some additional routing to send upstream traffic over the coax ISP instead of my 5G ISP, which helped decrease the timeouts, but they still occur through DoH.

Is the DNS service still being used for lookups when going through the DoH service? I see config.DnsOverHttpServer.Bind being passed into the DoH service, so I assumed it might be the culprit.

cottand (Owner) commented Nov 10, 2024

Sorry if I was not straightforward in my message, but did you try setting timeout in the config at the top-level?

Is the DNS service being used for the lookups still with the DoH service?

All lookups are resolved the same way: first it tries DoH, then it tries the protocol of the request (defaulting back to TCP if the request itself came in over DoH).

I don't think the cache size is the issue its the DNS requests load is split among 3 instances.

It's true that each server should only receive a third of the requests in this case, but with a big enough cache and a big enough TTL (about 3x bigger on average, for your example) the load-balanced servers should converge to the same hit-rate that a single non-load-balanced one would have. This becomes impractical if the TTL is too big (as some DNS records would go stale), but I was curious nonetheless.
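
As a rough back-of-the-envelope sketch (illustrative numbers, not a measurement): if a record is queried q times per second in total and cached for T seconds, a single instance answers roughly q*T queries per upstream fetch, giving a hit-rate of about 1 - 1/(q*T). Split the same traffic across 3 instances and each one only sees q*T/3 queries per fetch, so its hit-rate drops to about 1 - 3/(q*T); tripling the TTL (or effective cache lifetime) brings it back up, which is where the 3x figure above comes from.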

seang96 (Author) commented Nov 10, 2024

I am using ghcr.io/cottand/leng:sha-db020fc and don't see the metric for timeouts yet. Do I need highCardinalityEnabled true for this metric?

Also, side note: my ISP sucks and has high enough latency that online gaming disconnects very frequently, so I wouldn't say it's too bizarre to see sudden spikes. I've had them come out 7 times and gotten nothing out of it; I think it's an upstream issue they won't fix. Fiber is actively being installed in my town, so hopefully I won't have this issue within the next year or so.

Here is my config; I didn't see any other timeouts in the docs, so let me know if I missed any.

    # log configuration
    # format: comma separated list of options, where options is one of
    #   file:<filename>@<loglevel>
    #   stderr>@<loglevel>
    #   syslog@<loglevel>
    # loglevel: 0 = errors and important operations, 1 = dns queries, 2 = debug
    # e.g. logconfig = "file:leng.log@2,syslog@1,stderr@2"
    logconfig = "stderr@0"

    # apidebug enables the debug mode of the http api library
    apidebug = false

    # address to bind to for the DNS server
    bind = "0.0.0.0:53"

    # address to bind to for the API server
    api = "0.0.0.0:8080"

    # concurrency interval for lookups in milliseconds
    interval = 100

    # question cache capacity, 0 for infinite but not recommended (this is used for storing logs)
    questioncachecap = 5000

    # manual custom dns entries - comments for reference
    customdnsrecords = [
        # "example.mywebsite.tld      IN A       10.0.0.1",
        # "example.other.tld          IN CNAME   wikipedia.org"
    ]

    [Blocking]
        # respond to blocked queries with NXDOMAIN
        nxdomain = false
        # ipv4 address to forward blocked queries to
        nullroute = "0.0.0.0"
        # ipv6 address to forward blocked queries to
        nullroutev6 = "0:0:0:0:0:0:0:0"
        # manual blocklist entries
        blocklist = [
        ]
        # list of sources to pull blocklists from, stores them in ./sources
        sources = [
            "https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts",
            "https://v.firebog.net/hosts/AdguardDNS.txt",
            "https://osint.digitalside.it/Threat-Intel/lists/latestdomains.txt",
            "https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/SmartTV.txt",
            "https://raw.githubusercontent.com/Perflyst/PiHoleBlocklist/master/android-tracking.txt",
            "https://raw.githubusercontent.com/crazy-max/WindowsSpyBlocker/master/data/hosts/spy.txt",
            "https://mirror1.malwaredomains.com/files/justdomains",
            "https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts",
            "https://sysctl.org/cameleon/hosts",
            "https://s3.amazonaws.com/lists.disconnect.me/simple_tracking.txt",
            "https://s3.amazonaws.com/lists.disconnect.me/simple_ad.txt",
            "https://gitlab.com/quidsup/notrack-blocklists/raw/master/notrack-blocklist.txt"
        ]
        # list of locations to recursively read blocklists from (warning, every file found is assumed to be a hosts-file or domain list)
        sourcedirs = ["./sources"]
        sourcesStore = "./sources"
        # manual whitelist entries - comments for reference
        whitelist = [
          "docker.io",
          "*.docker.io"
        ]



    [Upstream]
        # Dns over HTTPS provider to use.
        DoH = "https://cloudflare-dns.com/dns-query"
        # nameservers to forward queries to
        nameservers = ["1.1.1.1:53", "1.0.0.1:53"]
        # query timeout for dns lookups in seconds
        timeout_s = 10
        # cache entry lifespan in seconds
        expire = 600
        # cache capacity, 0 for infinite
        maxcount = 0

    # Prometheus metrics - enable
    [Metrics]
        enabled = true
        highCardinalityEnabled = false
        path = "/metrics"

    [DnsOverHttpServer]
        enabled = true
        bind = "0.0.0.0:80"
        timeoutMs = 10000

cottand (Owner) commented Nov 11, 2024

ah I hope fiber solves your troubles!

Try this in the config:

    timeout = 10

    [Metrics]
        histogramsEnabled = true

Again, sorry for the lack of docs for these - I will make sure to work on that before the next release.

seang96 (Author) commented Nov 11, 2024

No worries! I hope so too haha

I saw the timeout in the code afterwards and added it, but I still got the 5-second timeout with DoH. I've enabled the metric setting now, so I'll see how that goes.

seang96 (Author) commented Nov 16, 2024

I tried out blocky and have yet to experience any timeouts with my setup. The only change is literally swapping in the blocky image/config. I am using multiple upstreams with blocky, though I imagine Cloudflare by itself as the upstream DNS wouldn't cause timeouts either.

cottand (Owner) commented Nov 17, 2024

😢 sad that it was just leng timing out.

If blocky is working better for you, you should stick with that, but I would still love to fix this in leng. Could I ask you to try blocky with a single upstream and see?

I also saw you forked leng to change the hardcoded 5s upstream lookup to 10s. Did this not help?
