Describe the bug
When I try to write a dataframe using the wr.s3.to_parquet() method, it often throws the following error for some columns:
UndetectedType Impossible to infer the equivalent Athena data type for the billingaddress column. It is completely empty (only null values) and has a too generic data type (object). Please, cast this columns with a more deterministic data type (e.g. df['billingaddress'] = df['billingaddress'].astype('string')) or pass the column schema as argument(e.g. dtype={'billingaddress': 'string'}
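For reference, the explicit schema that the error message suggests can be passed to the writer directly; a minimal sketch of that (assuming a Glue-catalogued dataset write, with placeholder bucket, database and table names) looks like this:

import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "billingaddress": [None, None, None]})

# Pass the Athena type explicitly, as the error message suggests
# (bucket, database and table names below are placeholders)
wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/my-prefix/",
    dataset=True,
    database="my_database",
    table="my_table",
    dtype={"billingaddress": "string"},
)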
How to Reproduce
First I tried to pre-validate the dataframe before passing it to the method, but this only partially helped:
dtype = {}
for column in data.columns:
    if data[column].isnull().all():
        dtype[column] = "string"
data = data.astype(dtype)
I then discovered that a column the method reported as “completely empty” actually contains a few values, say 3 values per 500 dataframe records.
It seems that the method only analyzes a limited sample of rows when deciding whether a column is completely empty, so if a column has only a few non-null values, there is a high probability that it will be treated as completely empty.
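Roughly, the shape of data that triggers this for me looks like the following sketch (the S3 path is a placeholder):

import awswrangler as wr
import pandas as pd

# 500 rows, only 3 of which have a value in the object column
values = [None] * 500
values[10] = "10 Main St"
values[200] = "42 Oak Ave"
values[450] = "7 Elm Rd"
df = pd.DataFrame({"id": range(500), "billingaddress": values})

# With so few non-null values the column may still be reported as
# "completely empty" (the S3 path below is a placeholder)
wr.s3.to_parquet(df=df, path="s3://my-bucket/my-prefix/", dataset=True)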
The following code is my current working workaround, with the downside that the data type is rigidly forced to string.
def _validate_dtypes(self, data: pd.DataFrame) -> pd.DataFrame:
    # awswrangler can not infer the data type from an entirely (or mostly) null
    # column, thus we need to manually set the data type for such columns
    dtype = {}
    for column in data.columns:
        non_null_count = data[column].notnull().sum()
        total_count = len(data[column])
        if non_null_count == 0:
            dtype[column] = "string"
        # awswrangler also treats a column that has only a few values as empty;
        # this is the "less than 10% non-null values" workaround
        elif non_null_count / total_count < 0.1:
            dtype[column] = "string"
    data = data.astype(dtype)
    return data
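I call it right before writing, roughly like this (the S3 path is again a placeholder):

data = self._validate_dtypes(data)
wr.s3.to_parquet(df=data, path="s3://my-bucket/my-prefix/", dataset=True)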
I would appreciate a solution to this problem. Thank you for the great tool!
Expected behavior
No response
Your project
No response
Screenshots
No response
OS
Linux
Python version
3.9
AWS SDK for pandas version
awswrangler-3.9.1-py3-none-any.whl
Additional context
No response
Hey Denis, sounds like a plan. I encountered a similar problem and was wondering if this logic could be incorporated into the _df_to_table function here.
However, it might not be necessary to define dtype = {}. Instead, it could handle only columns whose dtype is 'object', to improve efficiency.
for column in df.select_dtypes(include=['object']).columns:
    # Check if less than 10% of the column values are non-null; treat it as empty
    # (this also covers the all-null case, so a separate check is not needed)
    non_null_ratio = df[column].notnull().sum() / max(len(df), 1)
    if non_null_ratio < 0.1:
        df[column] = df[column].astype('string')  # Default to 'string' if mostly or entirely null