
input data to spark job is not passed to job #3362

Open · robert4os opened this issue Aug 28, 2024 · 1 comment
Operating System

Linux

Version Information

az --version
ml 2.28.0

Steps to reproduce

  1. I am trying to submit a spark job as shown in https://github.com/Azure/azureml-examples/tree/main/cli/jobs/spark

  2. The data has been uploaded as described here: https://github.com/Azure/azureml-examples/blob/main/cli/jobs/spark/data/README.md

  3. This is the yml definition (partially shown):

$schema: https://azuremlschemas.azureedge.net/latest/sparkJob.schema.json
resources:
  instance_type: standard_e4s_v3
  runtime_version: "3.3"
type: spark
conf:
  spark.driver.cores: 1
  spark.driver.memory: 2g
  spark.executor.cores: 2
  spark.executor.memory: 2g
  spark.executor.instances: 2
inputs:
  input_data_step1:
    type: uri_file
    path: azureml://datastores/workspaceblobstorex/paths/data/titanic.csv
    mode: direct
args: >-
  --input_data_step1 ${{inputs.input_data_step1}}
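For context, a minimal sketch of what the job's entry script would look like on the receiving end of that `args` line. Only the argument name `--input_data_step1` comes from the YAML above; the script itself, the function names, and the CSV options are illustrative assumptions, not the actual script from the repo.

```python
# Hypothetical entry script for the Spark job defined above.
# AzureML expands ${{inputs.input_data_step1}} before launch, so the
# script only ever sees the already-resolved string.
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_data_step1", type=str, required=True)
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    # With mode: direct the resolved value is expected to be a storage
    # URI (e.g. abfss://...) that Spark can read directly -- not the
    # local /mnt/var/hadoop/... path reported in this issue.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.option("header", "true").csv(args.input_data_step1)
    print(df.count())
```

The point of the sketch: the script never resolves the input itself, so if the expanded argument is a nonexistent local path, `spark.read` is handed a dead path with no earlier failure signal.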

  4. The overview in the AML Studio shows the correct input data and I can navigate to it.

  5. Now the issue: inside my job, which otherwise runs without problems, the input data argument is expanded to something like:
    /mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1724844698277_0001/container_1724844698277_0001_01_000001/azureml:/subscriptions/xxxxxx/resourcegroups/weg-aml-v2/workspaces/weg-aml-v2/datastores/workspaceblobstore/paths/data/titanic.csv

However, this path does not exist, and there is no indication in the logs of what failed.

Expected behavior

The input data should be passed to the Spark application in a form that the application can actually access.

Actual behavior

The input data is not passed to the Spark application in a usable form.

Inside the Spark application, which otherwise runs without problems, the input data argument is expanded to something like:
/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1724844698277_0001/container_1724844698277_0001_01_000001/azureml:/subscriptions/xxxxxx/resourcegroups/weg-aml-v2/workspaces/weg-aml-v2/datastores/workspaceblobstore/paths/data/titanic.csv

However, this path does not exist, and there is no indication in the logs of what failed.

Additional information

No response

robert4os added the bug label Aug 28, 2024

robert4os (Author) commented:
Please note that in the yml above I am pointing to 'workspaceblobstorex' (note the 'x' at the end!).

This datastore does not even exist, yet there are no complaints and the job still runs through.

With the correct, existing datastore 'workspaceblobstore' it behaves the same as reported above.
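As a sanity check one would expect the nonexistent datastore to be catchable up front. A hedged sketch, assuming the resource group and workspace names from the path in this issue (`weg-aml-v2`), of how to verify the datastore before submitting; a typo like 'workspaceblobstorex' should fail here rather than silently at job runtime:

```shell
# Placeholder names taken from the paths in this report; substitute your own.
# This should error out for 'workspaceblobstorex' but succeed for
# 'workspaceblobstore'.
az ml datastore show \
  --name workspaceblobstore \
  --resource-group weg-aml-v2 \
  --workspace-name weg-aml-v2
```

This does not explain why job submission itself accepts the bad datastore name, which seems to be part of the bug being reported.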
