
Connection closures running piko as StatefulSet #147

Closed
yquansah opened this issue Jul 29, 2024 · 18 comments · Fixed by #155

@yquansah
Contributor

yquansah commented Jul 29, 2024

It seems as though when I run piko as a StatefulSet using tcp, I am running into random connection closures (from the agent?).

These are the logs I see from the agent:

{"level":"info","ts":"2024-07-29T18:18:19.116Z","subsystem":"proxy.tcp.access","msg":"connection opened","endpoint-id":"my-redis-endpoint"}
{"level":"debug","ts":"2024-07-29T18:18:49.116Z","subsystem":"proxy.tcp","msg":"copy to conn closed","endpoint-id":"my-redis-endpoint","error":"writeto tcp [::1]:41612->[::1]:6379: read tcp [::1]:41612->[::1]:6379: use of closed network connection"}
{"level":"info","ts":"2024-07-29T18:18:49.116Z","subsystem":"proxy.tcp.access","msg":"connection closed","endpoint-id":"my-redis-endpoint"}

One interesting thing I noticed is that the connection closures seem to happen at a regular interval (30s) from the time a connection is opened. Interestingly, this does not happen when I run just 1 replica of the server, only when I run more than 1 using the gossip protocol.

Here is a repro config, running the workload on Kubernetes (minikube cluster):

apiVersion: v1
kind: Service
metadata:
  name: piko-forward-redis
  labels:
    app: piko-forward-redis
spec:
  type: NodePort
  ports:
    - port: 6001
      protocol: TCP
      name: forward
  selector:
    app: piko-forward-redis
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: piko-forward-redis
spec:
  selector:
    matchLabels:
      app: piko-forward-redis
  template:
    metadata:
      labels:
        app: piko-forward-redis
    spec:
      containers:
        - name: piko-forward-redis
          image: piko-hyperbolic:a587abc
          imagePullPolicy: IfNotPresent
          args:
            - forward
            - tcp
            - --connect.url
            - http://piko.default.svc:8000
            - "0.0.0.0:6001"
            - my-redis-endpoint
          ports:
            - containerPort: 6001
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-server
  namespace: default
spec:
  selector:
    matchLabels:
      app: redis-server
  template:
    metadata:
      labels:
        app: redis-server
    spec:
      containers:
        - name: piko-agent
          image: piko-hyperbolic:a587abc
          imagePullPolicy: IfNotPresent
          args:
            - agent
            - tcp
            - --log.level
            - debug
            - --server.bind-addr
            - 0.0.0.0:5000
            - --connect.url
            - http://piko.default.svc:7000
            - my-redis-endpoint
            - "6379"
          ports:
            - containerPort: 5000
        - name: redis-server
          image: redis:latest
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 6379
---
apiVersion: v1
kind: Service
metadata:
  name: piko
  labels:
    app: piko
spec:
  ports:
    - port: 8000
      name: proxy
    - port: 8002
      name: admin
    - port: 8003
      name: gossip
    - port: 7000
      name: upstream
      targetPort: 7000
  selector:
    app: piko
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: server-config
data:
  server.yaml: |
    cluster:
      node_id_prefix: ${POD_NAME}-
      join:
        - piko.default.svc
    upstream:
      bind_addr: 0.0.0.0:7000
    log:
      level: debug
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: piko
spec:
  selector:
    matchLabels:
      app: piko
  serviceName: "piko"
  replicas: 3
  template:
    metadata:
      labels:
        app: piko
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: piko
          image: piko-hyperbolic:a587abc
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8000
              name: proxy
            - containerPort: 7000
              name: upstream
            - containerPort: 8002
              name: admin
            - containerPort: 8003
              name: gossip
          args:
            - server
            - --config.path
            - /config/server.yaml
            - --config.expand-env
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          volumeMounts:
            - name: config
              mountPath: "/config"
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: server-config
            items:
              - key: "server.yaml"
                path: "server.yaml"

I will try and dig into the gossip protocol tomorrow, but just wanted to raise this issue in case there were any quick hints from your end @andydunstall.

Note

The image names would need to be changed from the above config. The images there were just from my local Docker build.

@andydunstall
Owner

andydunstall commented Jul 30, 2024

Hey @yquansah, thanks for raising this

I can't think of anything off the top of my head, but I'll play around with this later today to try and reproduce.

I will try and dig into the gossip protocol tomorrow

I'd be surprised if it was caused by gossip, which only gossips around what upstreams are connected, but won't (or at least shouldn't) close connections

@yquansah
Contributor Author

@andydunstall Yeah nice. I am also going to play around with it later this afternoon. Thanks!

@andydunstall
Owner

andydunstall commented Jul 30, 2024

Hey @yquansah, thanks for the above example, I'm able to reproduce

So when you run with multiple nodes, if you connect to node N1 but the upstream is connected to N2, then N1 forwards all traffic to N2

Since Piko forward connects to the server using WebSockets (to wrap the underlying TCP connection and work with an HTTP load balancer), if N1 receives the WebSocket connection that it needs to forward to N2, it falls back to using the HTTP proxy to forward the WebSocket to N2.

However, by default the HTTP proxy has a timeout of 30 seconds (--proxy.timeout), meaning N1 closes the connection to N2 after 30 seconds, so the end-to-end connection is dropped. The quick fix is to set --proxy.timeout 0 to disable the timeout, and I'll think about a better way to handle this tomorrow (the timeout should only apply to HTTP requests, not hijacked WebSocket connections, as the same issue applies to using the HTTP proxy as normal with a WebSocket client)
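Applied to the repro manifest above, the workaround amounts to appending the flag to the server container's args in the StatefulSet (a sketch, not an exact diff; --proxy.timeout 0 disables the proxy timeout):

```yaml
# Sketch: server container args from the StatefulSet in the repro above,
# with the suggested workaround flag appended.
args:
  - server
  - --config.path
  - /config/server.yaml
  - --config.expand-env
  - --proxy.timeout
  - "0"
```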

@yquansah
Contributor Author

@andydunstall Really great information here, I appreciate the sleuthing, and it definitely makes sense. We'll go with the proxy timeout for now as you suggested for a quick fix.

@condaatje

hey, we gave this a whirl and are getting this error when the container starts:

config: proxy: missing timeout

here is the yaml for the deployment:

template:
  metadata:
    annotations:
      kubectl.kubernetes.io/restartedAt: "2024-07-30T13:05:09-07:00"
    creationTimestamp: null
    labels:
      app: piko
  spec:
    containers:
    - args:
      - server
      - --config.path
      - /config/server.yaml
      - --config.expand-env
      - --proxy.timeout
      - "0"

perhaps I'm doing something wrong here?

@andydunstall
Owner

andydunstall commented Aug 14, 2024

config: proxy: missing timeout

Ah sorry, I forgot it's required, I'll add a patch now

A quick workaround is to just set a very large timeout (such as --proxy.timeout=24h)

(Sorry I haven't had much time to fix the underlying issue, I'll try and get to it soon)

@andydunstall
Owner

andydunstall commented Aug 14, 2024

I've merged #151 to allow a timeout of 0, so if you re-build main it should be ok

@yquansah
Contributor Author

@andydunstall Quick one. Thank you!

@condaatje

awesome! could I trouble you for a tagged image @andydunstall? right now we are pulling piko:v0.6.0, but would normally not want to pull :latest in our architecture.

@andydunstall
Owner

andydunstall commented Aug 14, 2024

yep sure, will tag ~~v0.7.0~~ v0.6.1

@andydunstall
Owner

andydunstall commented Aug 14, 2024

@condaatje Thats done: ghcr.io/andydunstall/piko:v0.6.1

@condaatje

awesome thanks!

@condaatje

by the way - if you'd like to join our alpha (launching today) we'd be honored to have you! especially because we've absolutely loved working with piko so far: https://x.com/hyperbolic_labs/status/1823779096650015026

@andydunstall
Owner

Thanks! I don't have anything I could use a GPU for, but the product looks great!

@andydunstall
Owner

andydunstall commented Aug 15, 2024

I've merged #155, which only applies --proxy.timeout to non-websocket (and non-TCP) connections, so you shouldn't need to set --proxy.timeout=0 anymore (sorry it took a while for me to get to)

So I'll close this for now. Let me know if there's anything else I can help with!

@condaatje

appreciated!

and this is great - could I trouble you for a v0.6.2 tag?

@andydunstall
Owner

@condaatje sure tagged

@condaatje

thanks!
