
Add a variant of the ray data processing job with GCSFuse CSI driver #2401

Merged
1 commit merged into ray-project:master on Oct 1, 2024

Conversation

@saikat-royc (Contributor) commented on Sep 26, 2024

This is a variant of the ray data processing job (`ray_data_image_resize.py`) that leverages the GKE GCSFuse CSI driver
to process images from the Landsat public dataset.

In the template code, the bucket name has been replaced with a sample name `ray-images`, and the prefix for the data sample has been replaced with `prefix`.
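
For context, a minimal sketch of what the GCSFuse CSI mount can look like in the worker pod template of such a job; the image tag, mount path, and resource names below are illustrative placeholders, not the exact manifest added in this PR:

```yaml
# Sketch only: pod template fragment showing a GCSFuse CSI ephemeral volume.
metadata:
  annotations:
    gke-gcsfuse/volumes: "true"          # enables the GCSFuse sidecar for this pod
spec:
  containers:
    - name: ray-worker
      image: rayproject/ray:2.9.0        # placeholder image
      volumeMounts:
        - name: gcs-fuse-csi-ephemeral
          mountPath: /data               # the job reads images under /data/<prefix>
  volumes:
    - name: gcs-fuse-csi-ephemeral
      csi:
        driver: gcsfuse.csi.storage.gke.io
        volumeAttributes:
          bucketName: ray-images         # sample bucket name from the template
          mountOptions: "implicit-dirs"
```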

Why are these changes needed?

This Ray job demonstrates how to mount a GCS bucket using the GKE GCSFuse CSI driver and access the contents of the bucket in a Ray job.
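
For illustration only, here is a minimal sketch of that access pattern, assuming the bucket is mounted at /data; the prefix, file extensions, and resize size are placeholders and this is not the actual `ray_data_image_resize.py` code:

```python
# Sketch only: enumerate images on the GCSFuse mount and resize them with Ray Data.
import os

import numpy as np
import ray
from PIL import Image

PREFIX = "/data/LC08/01/001/002"  # GCS bucket mounted at /data by the CSI driver


def list_image_files(prefix: str) -> list:
    """List image files under the mounted prefix as if it were a local directory."""
    print("Enumerate files in prefix ", prefix)
    files = [
        os.path.join(prefix, name)
        for name in os.listdir(prefix)
        if name.lower().endswith((".tif", ".jpg", ".png"))
    ]
    print("For prefix ", prefix, " number of image_files", len(files))
    return files


def resize_image(row: dict) -> dict:
    """Read one image from the mount and resize it to a fixed size."""
    with Image.open(row["path"]) as img:
        resized = img.convert("RGB").resize((256, 256))
    return {"path": row["path"], "image": np.asarray(resized)}


if __name__ == "__main__":
    ray.init()
    ds = ray.data.from_items([{"path": p} for p in list_image_files(PREFIX)])
    ds = ds.map(resize_image)
    ds.materialize()  # trigger execution of the pipeline
    print("done")
```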

Logs from a successful run on GKE nodes with the n2-standard-32 machine type:

2024-09-26 14:43:39,774	INFO cli.py:36 -- Job submission server address: http://landsat-image-resize-raycluster-x5jl9-head-svc.default.svc.cluster.local:8265
2024-09-26 14:43:40,807	SUCC cli.py:60 -- -------------------------------------------------------
2024-09-26 14:43:40,808	SUCC cli.py:61 -- Job 'landsat-image-resize-mbmjm' submitted successfully
2024-09-26 14:43:40,808	SUCC cli.py:62 -- -------------------------------------------------------
2024-09-26 14:43:40,808	INFO cli.py:285 -- Next steps
2024-09-26 14:43:40,808	INFO cli.py:286 -- Query the logs of the job:
2024-09-26 14:43:40,808	INFO cli.py:288 -- ray job logs landsat-image-resize-mbmjm
2024-09-26 14:43:40,808	INFO cli.py:290 -- Query the status of the job:
2024-09-26 14:43:40,808	INFO cli.py:292 -- ray job status landsat-image-resize-mbmjm
2024-09-26 14:43:40,808	INFO cli.py:294 -- Request the job to be stopped:
2024-09-26 14:43:40,808	INFO cli.py:296 -- ray job stop landsat-image-resize-mbmjm
2024-09-26 14:43:40,824	INFO cli.py:303 -- Tailing logs until the job exits (disable with --no-wait):
2024-09-26 14:45:02,818	INFO worker.py:1405 -- Using address 10.24.210.20:6379 set in the environment variable RAY_ADDRESS
2024-09-26 14:45:02,818	INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 10.24.210.20:6379...
2024-09-26 14:45:02,825	INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at http://10.24.210.20:8265 
Enumerate files in prefix  /data/LC08/01/001/002
For prefix  /data/LC08/01/001/002  number of image_files 36
2024-09-26 14:45:07,436	INFO streaming_executor.py:112 -- Executing DAG InputDataBuffer[Input] -> ActorPoolMapOperator[FlatMap(<lambda>)->MapBatches(ReadImageFiles)] -> ActorPoolMapOperator[MapBatches(TransformImages)]
2024-09-26 14:45:07,437	INFO streaming_executor.py:113 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=0, gpu=0, object_store_memory=0), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2024-09-26 14:45:07,437	INFO streaming_executor.py:115 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`
2024-09-26 14:45:07,458	INFO actor_pool_map_operator.py:114 -- FlatMap(<lambda>)->MapBatches(ReadImageFiles): Waiting for 2 pool actors to start...
2024-09-26 14:46:24,593	INFO actor_pool_map_operator.py:114 -- MapBatches(TransformImages): Waiting for 2 pool actors to start...

Running 0:   0%|          | 0/36 [00:00<?, ?it/s]
Running: 4.0/128.0 CPU, 0.0/0.0 GPU, 0.0 MiB/37.69 GiB object_store_memory:   0%|          | 0/36 [00:00<?, ?it/s]
Running: 3.0/128.0 CPU, 0.0/0.0 GPU, 0.0 MiB/37.69 GiB object_store_memory:   0%|          | 0/36 [00:30<?, ?it/s]
                                                                                                                  
(MapWorker(MapBatches(TransformImages)) pid=3590) /tmp/ray/session_2024-09-26_14-43-09_565501_8/runtime_resources/pip/607a1c7d9282880648cbd3fcd028fed8bc8611bb/virtualenv/lib/python3.8/site-packages/torchvision/transforms/functional.py:154: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.)

Running: 3.0/128.0 CPU, 0.0/0.0 GPU, 0.0 MiB/37.69 GiB object_store_memory:   0%|          | 0/36 [00:37<?, ?it/s]
                                                                                                                  
(MapWorker(MapBatches(TransformImages)) pid=3590)   img = torch.from_numpy(pic.transpose((2, 0, 1))).contiguous()

Running: 3.0/128.0 CPU, 0.0/0.0 GPU, 0.0 MiB/37.69 GiB object_store_memory:   0%|          | 0/36 [00:37<?, ?it/s]
Running: 3.0/128.0 CPU, 0.0/0.0 GPU, 4.0 MiB/37.69 GiB object_store_memory:   0%|          | 0/36 [00:44<?, ?it/s]
Running: 3.0/128.0 CPU, 0.0/0.0 GPU, 4.0 MiB/37.69 GiB object_store_memory:   0%|          | 0/3 [00:44<?, ?it/s] 
Running: 3.0/128.0 CPU, 0.0/0.0 GPU, 4.0 MiB/37.69 GiB object_store_memory:  33%|███▎      | 1/3 [00:44<01:28, 44.26s/it]
Running: 3.0/128.0 CPU, 0.0/0.0 GPU, 0.0 MiB/37.69 GiB object_store_memory:  33%|███▎      | 1/3 [00:44<01:28, 44.26s/it]
Running: 3.0/128.0 CPU, 0.0/0.0 GPU, 4.0 MiB/37.69 GiB object_store_memory:  33%|███▎      | 1/3 [00:44<01:28, 44.26s/it]
Running: 3.0/128.0 CPU, 0.0/0.0 GPU, 4.0 MiB/37.69 GiB object_store_memory:  67%|██████▋   | 2/3 [00:44<00:18, 18.40s/it]
Running: 3.0/128.0 CPU, 0.0/0.0 GPU, 0.0 MiB/37.69 GiB object_store_memory:  67%|██████▋   | 2/3 [00:44<00:18, 18.40s/it]
Running: 2.0/128.0 CPU, 0.0/0.0 GPU, 0.0 MiB/37.69 GiB object_store_memory:  67%|██████▋   | 2/3 [00:48<00:18, 18.40s/it]
Running: 1.0/128.0 CPU, 0.0/0.0 GPU, 0.0 MiB/37.69 GiB object_store_memory:  67%|██████▋   | 2/3 [00:48<00:18, 18.40s/it]
Running: 0.0/128.0 CPU, 0.0/0.0 GPU, 1.0 MiB/37.69 GiB object_store_memory:  67%|██████▋   | 2/3 [00:54<00:18, 18.40s/it]
Running: 0.0/128.0 CPU, 0.0/0.0 GPU, 1.0 MiB/37.69 GiB object_store_memory: 100%|██████████| 3/3 [00:54<00:00, 14.69s/it]
                                                                                                                         
done
(MapWorker(MapBatches(TransformImages)) pid=911, ip=10.24.65.26) /tmp/ray/session_2024-09-26_14-43-09_565501_8/runtime_resources/pip/607a1c7d9282880648cbd3fcd028fed8bc8611bb/virtualenv/lib/python3.8/site-packages/torchvision/transforms/functional.py:154: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:206.)
(MapWorker(MapBatches(TransformImages)) pid=911, ip=10.24.65.26)   img = torch.from_numpy(pic.transpose((2, 0, 1))).contiguous()
2024-09-26 14:47:26,737	SUCC cli.py:60 -- ------------------------------------------
2024-09-26 14:47:26,738	SUCC cli.py:61 -- Job 'landsat-image-resize-mbmjm' succeeded
2024-09-26 14:47:26,738	SUCC cli.py:62 -- ------------------------------------------

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • [x] Manual tests - The job was manually tested on a GKE cluster.
    • This PR is not tested :(

@saikat-royc (Contributor, Author) commented:

/cc @andrewsykim

@andrewsykim merged commit 3e20a9d into ray-project:master on Oct 1, 2024. 27 checks passed.
@kevin85421 (Member) commented:

Do we plan to add any docs for these YAMLs to the Ray website?
