Data Asset creation for randomly split MLTables gives error in AzureML SDK v2 #3223

Open
sherySSH opened this issue Jun 7, 2024 · 0 comments
sherySSH commented Jun 7, 2024

Operating System

Windows

Version Information

Python 3.10

attrs 23.2.0
azure-ai-ml 1.16.1
azure-common 1.1.28
azure-core 1.30.1
azure-identity 1.16.0
azure-mgmt-core 1.4.0
azure-storage-blob 12.20.0
azure-storage-file-datalake 12.15.0
azure-storage-file-share 12.16.0
azureml-dataprep 5.1.6
azureml-dataprep-native 41.0.0
azureml-dataprep-rslex 2.22.2
cachetools 5.3.3
certifi 2024.6.2
cffi 1.16.0
charset-normalizer 3.3.2
cloudpickle 2.2.1
colorama 0.4.6
cryptography 42.0.8
google-api-core 2.19.0
google-auth 2.29.0
googleapis-common-protos 1.63.1
idna 3.7
isodate 0.6.1
jsonschema 4.22.0
jsonschema-specifications 2023.12.1
marshmallow 3.21.3
mltable 1.6.1
msal 1.28.0
msal-extensions 1.1.0
msrest 0.7.1
numpy 1.26.4
oauthlib 3.2.2
opencensus 0.11.4
opencensus-context 0.1.3
opencensus-ext-azure 1.1.13
opencensus-ext-logging 0.1.1
packaging 24.0
pandas 2.2.2
pip 24.0
portalocker 2.8.2
proto-plus 1.23.0
protobuf 4.25.3
psutil 5.9.8
pyarrow 16.1.0
pyasn1 0.6.0
pyasn1_modules 0.4.0
pycparser 2.22
pydash 8.0.1
PyJWT 2.8.0
python-dateutil 2.9.0.post0
pytz 2024.1
pywin32 306
PyYAML 6.0.1
referencing 0.35.1
requests 2.32.3
requests-oauthlib 2.0.0
rpds-py 0.18.1
rsa 4.9
setuptools 69.5.1
six 1.16.0
strictyaml 1.7.3
tqdm 4.66.4
typing_extensions 4.12.1
tzdata 2024.1
urllib3 2.2.1
wheel 0.43.0

Steps to reproduce

import time
import mltable
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

subscription_id = "<>"
resource_group = "<>"
workspace = "<>"

VERSION = time.strftime("%Y.%m.%d.%H%M%S", time.gmtime())

ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

data_asset = ml_client.data.get("test-sdkv2", version="2024.06.06.105241")
paths = [{ "folder" : data_asset.path }]

data_mltable = mltable.from_delimited_files(paths, header='all_files_same_headers', delimiter=',', encoding='utf8', empty_as_string=False)
train_table, test_table = data_mltable.random_split(0.8, seed=223)

train_table.save("./train")
test_table.save("./test")

train_data_asset = Data(
    path="./train",
    type=AssetTypes.MLTABLE,
    description="Training Dataset",
    name="SDK2TrainingDataset",
    version=VERSION,
)

test_data_asset = Data(
    path="./test",
    type=AssetTypes.MLTABLE,
    description="Testing Dataset",
    name="SDK2TestingDataset",
    version=VERSION,
)

ml_client.data.create_or_update(train_data_asset)
ml_client.data.create_or_update(test_data_asset)
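
Since the failure occurs while the SDK validates the saved MLTable metadata, it may help to look at what `save()` actually wrote. A small sketch, assuming each saved folder contains a metadata file named `MLTable`; the helper name `read_mltable_file` is illustrative and not part of any SDK:

```python
from pathlib import Path


def read_mltable_file(folder):
    """Return the raw text of the MLTable metadata file inside `folder`.

    Assumes the file is named "MLTable", as written by the save() calls above.
    """
    return (Path(folder) / "MLTable").read_text(encoding="utf-8")


# Example (run after train_table.save("./train")):
#     print(read_mltable_file("./train"))
```

Comparing the printed transformations for `./train` with those of an MLTable that registers successfully could narrow down which entry the validator rejects.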

Expected behavior

Two Data Assets should be created:

  1. SDK2TrainingDataset
  2. SDK2TestingDataset

Actual behavior

Traceback (most recent call last):
  File "D:\AMLSDK2\run.py", line 77, in <module>
    ml_client.data.create_or_update(train_data_asset)
  File "C:\Users\sshaharyaar\AppData\Local\miniconda3\envs\conda_azureml_sdkv2\lib\site-packages\azure\ai\ml\_telemetry\activity.py", line 292, in wrapper
    return f(*args, **kwargs)
  File "C:\Users\sshaharyaar\AppData\Local\miniconda3\envs\conda_azureml_sdkv2\lib\site-packages\azure\ai\ml\operations\_data_operations.py", line 425, in create_or_update
    raise ex
  File "C:\Users\sshaharyaar\AppData\Local\miniconda3\envs\conda_azureml_sdkv2\lib\site-packages\azure\ai\ml\operations\_data_operations.py", line 363, in create_or_update
    referenced_uris = self._validate(data)
  File "C:\Users\sshaharyaar\AppData\Local\miniconda3\envs\conda_azureml_sdkv2\lib\site-packages\azure\ai\ml\_telemetry\activity.py", line 292, in wrapper
    return f(*args, **kwargs)
  File "C:\Users\sshaharyaar\AppData\Local\miniconda3\envs\conda_azureml_sdkv2\lib\site-packages\azure\ai\ml\operations\_data_operations.py", line 559, in _validate
    validate_mltable_metadata(
  File "C:\Users\sshaharyaar\AppData\Local\miniconda3\envs\conda_azureml_sdkv2\lib\site-packages\azure\ai\ml\_utils\_data_utils.py", line 68, in validate_mltable_metadata
    err_path = ".".join(error.path)
TypeError: sequence item 1: expected str instance, int found
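
The final frame suggests why the underlying validation message never surfaces: `".".join(...)` accepts only strings, and an error path that steps through a list carries an integer index. Below is a minimal sketch of that failure and one possible guard. The path contents are made up for illustration (the key names are hypothetical, not taken from the SDK or from this MLTable), and the guard is an assumption about a possible fix, not the SDK's actual code:

```python
from collections import deque

# Hypothetical error path: mapping keys are str, list indices are int.
error_path = deque(["transformations", 1, "some_key"])


def join_path_unsafe(path):
    # Mirrors the failing line in the traceback: str.join rejects the int.
    return ".".join(path)


def join_path_safe(path):
    # Convert every element to str before joining.
    return ".".join(str(part) for part in path)


try:
    join_path_unsafe(error_path)
    failure = None
except TypeError as exc:
    failure = str(exc)

safe = join_path_safe(error_path)
```

If this reading is right, the `TypeError` is a secondary bug that hides the real schema validation error for the MLTable file produced after `random_split`.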

Additional information

I am using the Credit Card Clients dataset, which is available in the Azure SDK v2 documentation. I also found no example in the SDK v2 documentation where data is split into train and test sets and then uploaded as data assets.

sherySSH added the bug label Jun 7, 2024