
Connection closures running piko as StatefulSet #147

Closed
yquansah opened this issue Jul 29, 2024 · 18 comments · Fixed by #155

@yquansah
Contributor

yquansah commented Jul 29, 2024

It seems as though when I run piko as a StatefulSet using tcp, I am running into random connection closures (from the agent?).

These are the logs I see from the agent:

{"level":"info","ts":"2024-07-29T18:18:19.116Z","subsystem":"proxy.tcp.access","msg":"connection opened","endpoint-id":"my-redis-endpoint"}
{"level":"debug","ts":"2024-07-29T18:18:49.116Z","subsystem":"proxy.tcp","msg":"copy to conn closed","endpoint-id":"my-redis-endpoint","error":"writeto tcp [::1]:41612->[::1]:6379: read tcp [::1]:41612->[::1]:6379: use of closed network connection"}
{"level":"info","ts":"2024-07-29T18:18:49.116Z","subsystem":"proxy.tcp.access","msg":"connection closed","endpoint-id":"my-redis-endpoint"}

One interesting thing I noticed is that the connection closures seem to happen at a regular interval (30s) from the time a connection is opened. Interestingly, this does not happen when I run just 1 replica of the server, only when I run more than 1 using the gossip protocol.

Here is a repro config, running the workload on Kubernetes (minikube cluster):

apiVersion: v1
kind: Service
metadata:
  name: piko-forward-redis
  labels:
    app: piko-forward-redis
spec:
  type: NodePort
  ports:
    - port: 6001
      protocol: TCP
      name: forward
  selector:
    app: piko-forward-redis
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: piko-forward-redis
spec:
  selector:
    matchLabels:
      app: piko-forward-redis
  template:
    metadata:
      labels:
        app: piko-forward-redis
    spec:
      containers:
        - name: piko-forward-redis
          image: piko-hyperbolic:a587abc
          imagePullPolicy: IfNotPresent
          args:
            - forward
            - tcp
            - --connect.url
            - http://piko.default.svc:8000
            - "0.0.0.0:6001"
            - my-redis-endpoint
          ports:
            - containerPort: 6001
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-server
  namespace: default
spec:
  selector:
    matchLabels:
      app: redis-server
  template:
    metadata:
      labels:
        app: redis-server
    spec:
      containers:
        - name: piko-agent
          image: piko-hyperbolic:a587abc
          imagePullPolicy: IfNotPresent
          args:
            - agent
            - tcp
            - --log.level
            - debug
            - --server.bind-addr
            - 0.0.0.0:5000
            - --connect.url
            - http://piko.default.svc:7000
            - my-redis-endpoint
            - "6379"
          ports:
            - containerPort: 5000
        - name: redis-server
          image: redis:latest
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 6379
---
apiVersion: v1
kind: Service
metadata:
  name: piko
  labels:
    app: piko
spec:
  ports:
    - port: 8000
      name: proxy
    - port: 8002
      name: admin
    - port: 8003
      name: gossip
    - port: 7000
      name: upstream
      targetPort: 7000
  selector:
    app: piko
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: server-config
data:
  server.yaml: |
    cluster:
      node_id_prefix: ${POD_NAME}-
      join:
        - piko.default.svc
    upstream:
      bind_addr: 0.0.0.0:7000
    log:
      level: debug
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: piko
spec:
  selector:
    matchLabels:
      app: piko
  serviceName: "piko"
  replicas: 3
  template:
    metadata:
      labels:
        app: piko
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: piko
          image: piko-hyperbolic:a587abc
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8000
              name: proxy
            - containerPort: 7000
              name: upstream
            - containerPort: 8002
              name: admin
            - containerPort: 8003
              name: gossip
          args:
            - server
            - --config.path
            - /config/server.yaml
            - --config.expand-env
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          volumeMounts:
            - name: config
              mountPath: "/config"
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: server-config
            items:
              - key: "server.yaml"
                path: "server.yaml"

I will try and dig into the gossip protocol tomorrow, but just wanted to raise this issue in case there were any quick hints from your end @andydunstall.

Note

The image names would need to be changed from the above config. The images there were just from my local Docker build.

@andydunstall
Owner

andydunstall commented Jul 30, 2024

Hey @yquansah, thanks for raising this

I can't think of anything off the top of my head, but I'll play around with this later today to try and reproduce.

I will try and dig into the gossip protocol tomorrow

I'd be surprised if it was caused by gossip, which only gossips around what upstreams are connected, but won't (or at least shouldn't) close connections

@yquansah
Contributor Author

@andydunstall Yeah nice. I am also going to play around with it later this afternoon. Thanks!

@andydunstall
Owner

andydunstall commented Jul 30, 2024

Hey @yquansah, thanks for the above example, I'm able to reproduce

So when you run with multiple nodes, if you connect to node N1 but the upstream is connected to N2, then N1 forwards all traffic to N2

Since Piko forward connects to the server using WebSockets (to wrap the underlying TCP connection and work with an HTTP load balancer), if N1 receives the WebSocket connection that it needs to forward to N2, it falls back to using the HTTP proxy to forward the WebSocket to N2.

However, by default the HTTP proxy has a timeout of 30 seconds (--proxy.timeout), meaning N1 closes the connection to N2 after 30 seconds, so the end-to-end connection is dropped. The quick fix is to set --proxy.timeout 0 to disable the timeout, and I'll think about a better way to handle this tomorrow (the timeout should only apply to HTTP requests, not hijacked WebSocket connections, as the same issue applies to using the HTTP proxy as normal with a WebSocket client)
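Applied to the repro manifest above, the workaround amounts to appending the flag to the server container's args in the StatefulSet (a sketch, not an exact diff; --proxy.timeout 0 disables the proxy timeout):

```yaml
# Sketch: server container args from the StatefulSet in the repro above,
# with the suggested workaround flag appended.
args:
  - server
  - --config.path
  - /config/server.yaml
  - --config.expand-env
  - --proxy.timeout
  - "0"
```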

@yquansah
Contributor Author

@andydunstall Really great information here, I appreciate the sleuthing, and it definitely makes sense. We'll go with the proxy timeout for now as you suggested for a quick fix.

@condaatje

hey, we gave this a whirl and are getting this error when the container starts:

config: proxy: missing timeout

here is the yaml for the deployment:

template:
  metadata:
    annotations:
      kubectl.kubernetes.io/restartedAt: "2024-07-30T13:05:09-07:00"
    creationTimestamp: null
    labels:
      app: piko
  spec:
    containers:
    - args:
      - server
      - --config.path
      - /config/server.yaml
      - --config.expand-env
      - --proxy.timeout
      - "0"

perhaps I'm doing something wrong here?

@andydunstall
Owner

andydunstall commented Aug 14, 2024

config: proxy: missing timeout

Ah sorry, I forgot it's required, I'll add a patch now

A quick workaround is to just set a very large timeout (such as --proxy.timeout=24h)

(Sorry I haven't had much time to fix the underlying issue, I'll try and get to it soon)

@andydunstall
Owner

andydunstall commented Aug 14, 2024

I've merged #151 to allow a timeout of 0, so if you re-build main it should be ok

@yquansah
Contributor Author

@andydunstall Quick one. Thank you!

@condaatje

awesome! could I trouble you for a tagged image @andydunstall? right now we are pulling piko:v0.6.0, but would normally not want to pull :latest in our architecture.

@andydunstall
Owner

andydunstall commented Aug 14, 2024

yep sure, will tag ~~v0.7.0~~ v0.6.1

@andydunstall
Owner

andydunstall commented Aug 14, 2024

@condaatje Thats done: ghcr.io/andydunstall/piko:v0.6.1

@condaatje

awesome thanks!

@condaatje

by the way - if you'd like to join our alpha (launching today) we'd be honored to have you! especially because we've absolutely loved working with piko so far: https://x.com/hyperbolic_labs/status/1823779096650015026

@andydunstall
Owner

Thanks! I don't have anything I could use a GPU for, but the product looks great!

@andydunstall
Owner

andydunstall commented Aug 15, 2024

I've merged #155, which only applies --proxy.timeout to non-websocket (and non-TCP) connections, so you shouldn't need to set --proxy.timeout=0 anymore (sorry it took a while for me to get to)

So I'll close this for now. Let me know if there's anything else I can help with!

@condaatje

appreciated!

and this is great - could I trouble you for a v0.6.2 tag?

@andydunstall
Owner

@condaatje sure tagged

@condaatje

thanks!
