Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wr.athena.to_iceberg not using temp_path #2978

Closed
lautarortega opened this issue Sep 30, 2024 · 4 comments · Fixed by #2982
Closed

wr.athena.to_iceberg not using temp_path #2978

lautarortega opened this issue Sep 30, 2024 · 4 comments · Fixed by #2982
Assignees
Labels
bug Something isn't working

Comments

@lautarortega
Copy link

Describe the bug

I created an iceberg table in Athena through AWS Wrangler. I rename a column through an Athena query. When I want to write more rows to the table with the new col in the df, I get this error

QueryFailed: TYPE_MISMATCH: Insert query has mismatched column types: Table: [varchar, integer, varchar], Query: [varchar, integer, varchar, integer]. If a data manifest file was generated at 's3://aws-athena-query-results-494340620388-eu-west-1/68cb1983-852e-4f12-9d17-7af88520b02a-manifest.csv', you may need to manually clean the data from locations specified in the manifest. Athena will not delete data in your account.

The error mentioning this S3 path makes me believe it is not using the temp path I passed as an argument.

How to Reproduce

data = {'first_name': ['John'],
        'age': [52],
        'city': ['Nashville']
}


# Create a DataFrame from the dictionary
df = pd.DataFrame(data)
df

dtype = {'first_name': 'string', 'age': 'int', 'city': 'string'}
wr.athena.to_iceberg(
        df=df,
        database=DATABASE,
        table=TABLE_NAME,
        table_location=f"s3://{BUCKET}/{TABLE_NAME}",
        temp_path=f"s3://{BUCKET}/temp/{TABLE_NAME}",
        schema_evolution=True,
        keep_files=False,
        dtype=dtype,
)

Expected behavior

No response

Your project

No response

Screenshots

No response

OS

Mac

Python version

3.10

AWS SDK for pandas version

3.9.1

Additional context

No response

@lautarortega lautarortega added the bug Something isn't working label Sep 30, 2024
@kukushking kukushking self-assigned this Sep 30, 2024
@kukushking
Copy link
Contributor

Hi @lautarortega looks like the table has not been updated correctly. Can you refer the query you used to update the table?

@lautarortega
Copy link
Author

Hi @kukushking! Yes, sure. This is the command I used.

ALTER TABLE test_table
CHANGE COLUMN new_age age int

@kukushking
Copy link
Contributor

kukushking commented Oct 1, 2024

Thanks, I am able to reproduce this with the following snippet:

import awswrangler as wr
import pandas as pd

DATABASE = "default"
BUCKET = "<REDACTED>"
TABLE_NAME = "iceberg1"

data = {'first_name': ['John'],
        'city': ['Nashville']
}
df = pd.DataFrame(data)

wr.athena.to_iceberg(
        df=df,
        database=DATABASE,
        table=TABLE_NAME,
        table_location=f"s3://{BUCKET}/{TABLE_NAME}",
        temp_path=f"s3://{BUCKET}/temp/{TABLE_NAME}",
        schema_evolution=True,
        keep_files=False,
)

wr.athena.start_query_execution(f"ALTER TABLE {TABLE_NAME} CHANGE COLUMN first_name new_first_name string", database=DATABASE)

data = {'new_first_name': ['Lily'],
        'city': ['Ontario']
}
df = pd.DataFrame(data)

wr.athena.to_iceberg(
        df=df,
        database=DATABASE,
        table=TABLE_NAME,
        table_location=f"s3://{BUCKET}/{TABLE_NAME}",
        temp_path=f"s3://{BUCKET}/temp/{TABLE_NAME}",
        schema_evolution=True,
        keep_files=False,
)
Traceback (most recent call last):
  ...
  
    raise exceptions.QueryFailed(response["Status"].get("StateChangeReason"))
awswrangler.exceptions.QueryFailed: COLUMN_NOT_FOUND: Insert column name does not exist in target table: first_name. If a data manifest file was generated at 's3://<REDACTED>/f6a7953b-dbe7-4699-a5ce-2f13168f0253-manifest.csv', you may need to manually clean the data from locations specified in the manifest. Athena will not delete data in your account.

Looking into the fix.

@kukushking
Copy link
Contributor

This relates to apache/iceberg#7584 in which Glue still displays old columns as if they were present in the schema, while subsequent INSERT statements include the columns that are no longer considered "current" by Iceberg.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants