
input data to spark job is not passed to job #3362

Open · robert4os opened this issue Aug 28, 2024 · 1 comment
Operating System

Linux

Version Information

az --version
ml 2.28.0

Steps to reproduce

  1. I am trying to submit a spark job as shown in https://github.com/Azure/azureml-examples/tree/main/cli/jobs/spark

  2. The data has been uploaded as described here: https://github.com/Azure/azureml-examples/blob/main/cli/jobs/spark/data/README.md

  3. This is the yml definition (partially shown):

$schema: https://azuremlschemas.azureedge.net/latest/sparkJob.schema.json
resources:
  instance_type: standard_e4s_v3
  runtime_version: "3.3"
type: spark
conf:
  spark.driver.cores: 1
  spark.driver.memory: 2g
  spark.executor.cores: 2
  spark.executor.memory: 2g
  spark.executor.instances: 2
inputs:
  input_data_step1:
    type: uri_file
    path: azureml://datastores/workspaceblobstorex/paths/data/titanic.csv
    mode: direct
args: >-
  --input_data_step1 ${{inputs.input_data_step1}}
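For context, a minimal sketch of what the job's entry script would look like on the receiving end of that `args` line. Only the argument name `--input_data_step1` comes from the YAML above; the script itself, the function names, and the CSV options are illustrative assumptions, not the actual script from the repo.

```python
# Hypothetical entry script for the Spark job defined above.
# AzureML expands ${{inputs.input_data_step1}} before launch, so the
# script only ever sees the already-resolved string.
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_data_step1", type=str, required=True)
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    # With mode: direct the resolved value is expected to be a storage
    # URI (e.g. abfss://...) that Spark can read directly -- not the
    # local /mnt/var/hadoop/... path reported in this issue.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.option("header", "true").csv(args.input_data_step1)
    print(df.count())
```

The point of the sketch: the script never resolves the input itself, so if the expanded argument is a nonexistent local path, `spark.read` is handed a dead path with no earlier failure signal.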

  4. The overview in the AML Studio shows the correct input data and I can navigate to it.

  5. Now the issue: inside my job, which otherwise runs without problems, the input data argument is expanded to something like:
    /mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1724844698277_0001/container_1724844698277_0001_01_000001/azureml:/subscriptions/xxxxxx/resourcegroups/weg-aml-v2/workspaces/weg-aml-v2/datastores/workspaceblobstore/paths/data/titanic.csv

However, this path does not exist, and there is no indication in the logs of what failed.

Expected behavior

The input data should be passed to the Spark application in a form that the application can actually access.

Actual behavior

The input data is not passed to the Spark application in a usable form.

Inside the Spark application, which otherwise runs without problems, the input data argument is expanded to something like:
/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1724844698277_0001/container_1724844698277_0001_01_000001/azureml:/subscriptions/xxxxxx/resourcegroups/weg-aml-v2/workspaces/weg-aml-v2/datastores/workspaceblobstore/paths/data/titanic.csv

However, this path does not exist, and there is no indication in the logs of what failed.

Additional information

No response

robert4os added the bug label Aug 28, 2024

robert4os (Author) commented:
Please note that in the yml above I am pointing to 'workspaceblobstorex' (note the 'x' at the end!).

This datastore does not even exist, yet there are no complaints and the job still runs through.

With the correct, existing datastore 'workspaceblobstore' it behaves the same as reported above.
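As a sanity check one would expect the nonexistent datastore to be catchable up front. A hedged sketch, assuming the resource group and workspace names from the path in this issue (`weg-aml-v2`), of how to verify the datastore before submitting; a typo like 'workspaceblobstorex' should fail here rather than silently at job runtime:

```shell
# Placeholder names taken from the paths in this report; substitute your own.
# This should error out for 'workspaceblobstorex' but succeed for
# 'workspaceblobstore'.
az ml datastore show \
  --name workspaceblobstore \
  --resource-group weg-aml-v2 \
  --workspace-name weg-aml-v2
```

This does not explain why job submission itself accepts the bad datastore name, which seems to be part of the bug being reported.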
