Cannot run benchmarks in k8s due to excessive spilling & OOM #44

Open
andygrove opened this issue Nov 14, 2024 · 0 comments

andygrove commented Nov 14, 2024

I cannot get benchmarks running in k8s. I suspect that too many tasks are being scheduled in parallel.

I added resource constraints in the code:

@ray.remote(num_cpus=1)
def execute_query_stage(

...

@ray.remote(num_cpus=1)
def execute_query_partition(

I am running the benchmark with

RAY_ADDRESS='http://localhost:8265' ray job submit --working-dir `pwd` -- python3 tpcbench.py --benchmark tpch --queries /home/ray/datafusion-benchmarks/tpch/queries/ --data /mnt/bigdata/tpch/sf100  --concurrency 4

My cluster definition is:

apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: datafusion-ray-cluster
spec:
  headGroupSpec:
    rayStartParams:
      num-cpus: "0"
    template:
      spec:
        containers:
          - name: ray-head
            image: andygrove/datafusion-ray-tpch:latest
            imagePullPolicy: Always
            resources:
              limits:
                cpu: 2
                memory: 8Gi
              requests:
                cpu: 2
                memory: 8Gi
            volumeMounts:
              - mountPath: /mnt/bigdata  # Mount path inside the container
                name: ray-storage
        volumes:
          - name: ray-storage
            persistentVolumeClaim:
              claimName: ray-pvc  # Reference the PVC name here
  workerGroupSpecs:
    - replicas: 2
      groupName: "datafusion-ray"
      rayStartParams:
        num-cpus: "4"
      template:
        spec:
          containers:
            - name: ray-worker
              image: andygrove/datafusion-ray-tpch:latest
              imagePullPolicy: Always
              resources:
                limits:
                  cpu: 5
                  memory: 64Gi
                requests:
                  cpu: 5
                  memory: 64Gi
              volumeMounts:
                - mountPath: /mnt/bigdata
                  name: ray-storage
          volumes:
            - name: ray-storage
              persistentVolumeClaim:
                claimName: ray-pvc
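
As a rough sanity check on the "too many tasks in parallel" suspicion, the capacity implied by this cluster definition can be worked out by hand. The sketch below is only arithmetic on the values above (2 worker replicas, `num-cpus: "4"`, 64Gi per worker, `num_cpus=1` per task); the helper name `task_capacity` is hypothetical, and the calculation ignores Ray's object store and any other per-container overhead, so the real memory available to each task will be lower.

```python
# Back-of-the-envelope capacity check for the cluster definition above.
# All inputs are copied from the issue; task_capacity is a hypothetical helper.

GIB = 1024 ** 3


def task_capacity(replicas, ray_num_cpus, memory_gib, cpus_per_task=1):
    """Return (concurrent task slots, bytes of container memory per slot).

    Assumes Ray schedules purely on the logical CPU count given in
    rayStartParams, and that the container memory is shared evenly
    between the tasks running on one worker.
    """
    slots_per_worker = ray_num_cpus // cpus_per_task
    total_slots = replicas * slots_per_worker
    bytes_per_slot = (memory_gib * GIB) // slots_per_worker
    return total_slots, bytes_per_slot


# The head node is started with num-cpus "0", so only the workers count.
slots, per_slot = task_capacity(replicas=2, ray_num_cpus=4, memory_gib=64)
print(f"{slots} concurrent tasks, about {per_slot / GIB:.0f} GiB each")
```

Under these assumptions, up to 8 tasks can run at once with roughly 16 GiB each before overhead. Ray's `@ray.remote` also accepts a `memory` argument (in bytes) as a scheduling hint alongside `num_cpus`; whether setting it would help with the spilling seen here has not been tested.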

I build my image with this Dockerfile, which extends the datafusion-ray image built from the repo.

FROM andygrove/datafusion-ray

RUN sudo apt update && \
    sudo apt install -y git

RUN git clone https://github.com/apache/datafusion-benchmarks.git