ddf._meta_nonempty doesn't instantiate correctly when calling from_dask_dataframe #286

Open
taneugene opened this issue Mar 14, 2024 · 1 comment

taneugene commented Mar 14, 2024

When I load a CSV into dask first and then convert it to a dask-geopandas dataframe using .from_dask_dataframe, ._meta_nonempty does not exist, which causes downstream problems in analysis (e.g. with spatial_shuffle). My hackish workaround below takes the head, uses from_geopandas to get the meta, and then replaces the meta in the original. It would be nice if this just worked directly! I'm not sure whether it reproduces for other people.

import dask.dataframe as dd
import dask_geopandas as dg
import geopandas as gpd
import shapely.wkt

# Load a csv file (fname is the path to the csv, njobs is the number of workers I have)
df = dd.read_csv(
    fname,
    dtype={"longitude": float, "latitude": float, "geometry": "object"},
).repartition(npartitions=njobs)

# Translate the WKT strings to geometries using shapely
df["geometry"] = df.geometry.map(shapely.wkt.loads, meta=("geometry", "object"))

# Create a tmp dataframe from the head using a GeoDataFrame and from_geopandas
tmp = dg.from_geopandas(
    gpd.GeoDataFrame(df.head(), geometry="geometry", crs="EPSG:4326"),
    npartitions=1,
)

# Now create the dask_geopandas df
df = dg.from_dask_dataframe(df)

# Need to set metadata here, otherwise spatial_shuffle won't run.
df._meta = tmp.compute()
df = df.spatial_shuffle()
@TomAugspurger
Contributor

Thanks for the report. Can you share a fully reproducible example so that I can look into it?
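
For anyone putting such an example together, a minimal sketch might look like the following. It swaps the local file for a small generated CSV so it does not depend on anyone's data. The helper names write_sample_csv and reproduce, the coordinates, and the default npartitions are made up for illustration; the dask and dask_geopandas calls are the ones from the report above. The third-party imports are deferred into reproduce so the CSV helper can be used on its own. Whether this actually triggers the missing ._meta_nonempty has not been confirmed in this thread.

```python
import csv
import os
import tempfile


def write_sample_csv(path, n_rows=20):
    """Write a tiny CSV with longitude, latitude and a WKT geometry column."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["longitude", "latitude", "geometry"])
        for i in range(n_rows):
            # Arbitrary illustrative coordinates, not taken from the report.
            lon = round(100.0 + i * 0.1, 4)
            lat = round(1.0 + i * 0.05, 4)
            writer.writerow([lon, lat, f"POINT ({lon} {lat})"])
    return path


def reproduce(path, npartitions=2):
    """Repeat the steps from the report, without the _meta workaround."""
    # Deferred imports: only needed when the reproduction itself is run.
    import dask.dataframe as dd
    import dask_geopandas as dg
    import shapely.wkt

    df = dd.read_csv(
        path,
        dtype={"longitude": float, "latitude": float, "geometry": "object"},
    ).repartition(npartitions=npartitions)

    # Parse the WKT strings into shapely geometries, as in the report.
    df["geometry"] = df["geometry"].map(
        shapely.wkt.loads, meta=("geometry", "object")
    )

    gdf = dg.from_dask_dataframe(df)

    # Report whether the attribute exists rather than assuming either outcome.
    return gdf, hasattr(gdf, "_meta_nonempty")
```

Calling reproduce(write_sample_csv(os.path.join(tempfile.mkdtemp(), "points.csv"))) and then gdf.spatial_shuffle() on the result, together with the installed dask, dask_geopandas and geopandas versions, would give maintainers something they can run as-is.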
