Azure : RuntimeError: Decompression 'SNAPPY' not available. Options: ['GZIP', 'LZ4', 'UNCOMPRESSED'] #177

Open
arnabbiswas1 opened this issue Nov 19, 2020 · 3 comments
Labels
provider/azure/azureml Cluster provider for AzureML

Comments

@arnabbiswas1
Contributor

arnabbiswas1 commented Nov 19, 2020

Steps to reproduce:

I created a Dask cluster inside an AzureML environment using the following code:

amlcluster = AzureMLCluster(ws,
                            vm_size="STANDARD_D1",
                            environment_definition=ws.environments['AzureML-Dask-CPU'], 
                            initial_node_count=0, 
                            scheduler_idle_timeout=10800,
                            vnet='vnet',
                            subnet='subnet',
                            vnet_resource_group='resourcegroup',
                            ct_name="biswasdask",
)

Next, open JupyterLab using the link returned by amlcluster.jupyter_link.

As I understand it, I am now on the scheduler node of the cluster.

In the Jupyter notebook, run the following code (from the azureml-examples repository):

import dask.dataframe as dd
from adlfs import AzureBlobFileSystem

container_name = "isdweatherdatacontainer"
storage_options = {"account_name": "azureopendatastorage"}

fs = AzureBlobFileSystem(**storage_options)
files = fs.glob(f"{container_name}/ISDWeather/year=2020/month=2/part-00003-tid-695161346761253622-368439cf-81e6-43f1-be5d-49ba29e282c0-2567-2.c000.snappy.parquet")
ddf = dd.read_parquet(files, storage_options=storage_options, chunksize="20MB")

ddf.head()

It returns the following error:

RuntimeError: Decompression 'SNAPPY' not available. Options: ['GZIP', 'LZ4', 'UNCOMPRESSED']

This seems to be an old issue. However, since I did not create this environment manually, I don't know what the problem is.
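
One way to narrow this down is to check which Snappy-capable packages are importable in the environment that performs the read. The sketch below is not taken from azureml-examples. It assumes the error is raised by a Parquet engine that cannot find a Snappy codec, and that the codec would normally come from the python-snappy package (imported as `snappy`). The helper name `snappy_status` is hypothetical.

```python
import importlib.util


def snappy_status():
    """Report which Parquet/Snappy-related packages are importable here.

    The package list is an assumption about what the environment might use;
    it is not confirmed by the AzureML-Dask-CPU environment definition.
    """
    candidates = ["snappy", "fastparquet", "pyarrow"]
    # find_spec returns None for a top-level module that is not installed.
    return {name: importlib.util.find_spec(name) is not None for name in candidates}


print(snappy_status())
```

Because the workers read the file, the same check would also need to run on them, for example with `client.run(snappy_status)` on a connected `distributed.Client` (assumed here to be named `client`). If `pyarrow` turns out to be importable, passing `engine="pyarrow"` to `dd.read_parquet` is one possible workaround, though nothing in this thread confirms it for this environment.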

@arnabbiswas1
Contributor Author

arnabbiswas1 commented Nov 19, 2020

This is an open source project, so I really can't complain. But while trying to work with dask-cloudprovider (for Azure), I am encountering issue after issue at different steps. That makes me worry about the basic sanity and stability of the product.

In addition, I see this commit to the azureml-examples repository:

"remove dask-cloudprovider givne instability and lack of support"

Given this, I am not sure whether I should keep trying to use dask_cloudprovider within an Azure ML pipeline (as part of my day job).

I would appreciate it if someone from the dask-cloudprovider team could briefly describe the current status of the project.

@jacobtomlinson
Member

jacobtomlinson commented Nov 19, 2020

Thanks for taking the time to raise these issues @arnabbiswas1.

Dask Cloudprovider contains cluster managers for a variety of different cloud platforms. Currently the AzureMLCluster is maintained by the AzureML team.

We are working to add a new cluster manager for Azure in #175, which will use Azure VMs directly instead of the AzureML API. The AzureML folks have indicated that they want to remove the AzureMLCluster in favour of the new, more generic AzureVMCluster.

@arnabbiswas1
Contributor Author

Thanks for your quick and detailed reply. That helps me prioritize my work.

I will wait for the new cluster manager for Azure and then pick this back up. I am eagerly looking forward to it.

Thanks for all the great work you are doing. 🤟

@jacobtomlinson jacobtomlinson added the provider/azure/azureml Cluster provider for AzureML label Nov 20, 2020