wr.athena.to_iceberg not using temp_path #2978

lautarortega · 2024-09-30T13:39:12Z

Describe the bug

I created an Iceberg table in Athena through AWS Wrangler, then renamed a column with an Athena query. When I try to write more rows to the table with the new column in the DataFrame, I get this error:

QueryFailed: TYPE_MISMATCH: Insert query has mismatched column types: Table: [varchar, integer, varchar], Query: [varchar, integer, varchar, integer]. If a data manifest file was generated at 's3://aws-athena-query-results-494340620388-eu-west-1/68cb1983-852e-4f12-9d17-7af88520b02a-manifest.csv', you may need to manually clean the data from locations specified in the manifest. Athena will not delete data in your account.

The error mentions this S3 path, which makes me believe it is not using the temp path I passed as an argument.

How to Reproduce

import awswrangler as wr
import pandas as pd

# DATABASE, BUCKET and TABLE_NAME are defined elsewhere (values not shown)
data = {
    'first_name': ['John'],
    'age': [52],
    'city': ['Nashville'],
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)

# Explicit Athena types for each column
dtype = {'first_name': 'string', 'age': 'int', 'city': 'string'}
wr.athena.to_iceberg(
    df=df,
    database=DATABASE,
    table=TABLE_NAME,
    table_location=f"s3://{BUCKET}/{TABLE_NAME}",
    temp_path=f"s3://{BUCKET}/temp/{TABLE_NAME}",
    schema_evolution=True,
    keep_files=False,
    dtype=dtype,
)

Expected behavior

No response

Your project

No response

Screenshots

No response

OS

Mac

Python version

3.10

AWS SDK for pandas version

3.9.1

Additional context

No response

kukushking · 2024-09-30T14:27:08Z

Hi @lautarortega, it looks like the table has not been updated correctly. Can you share the query you used to update the table?

lautarortega · 2024-09-30T14:48:07Z

Hi @kukushking! Yes, sure. This is the command I used.

ALTER TABLE test_table
CHANGE COLUMN new_age age int

kukushking · 2024-10-01T17:12:08Z

Thanks, I am able to reproduce this with the following snippet:

import awswrangler as wr
import pandas as pd

DATABASE = "default"
BUCKET = "<REDACTED>"
TABLE_NAME = "iceberg1"

data = {'first_name': ['John'],
        'city': ['Nashville']
}
df = pd.DataFrame(data)

wr.athena.to_iceberg(
        df=df,
        database=DATABASE,
        table=TABLE_NAME,
        table_location=f"s3://{BUCKET}/{TABLE_NAME}",
        temp_path=f"s3://{BUCKET}/temp/{TABLE_NAME}",
        schema_evolution=True,
        keep_files=False,
)

wr.athena.start_query_execution(f"ALTER TABLE {TABLE_NAME} CHANGE COLUMN first_name new_first_name string", database=DATABASE)

data = {'new_first_name': ['Lily'],
        'city': ['Ontario']
}
df = pd.DataFrame(data)

wr.athena.to_iceberg(
        df=df,
        database=DATABASE,
        table=TABLE_NAME,
        table_location=f"s3://{BUCKET}/{TABLE_NAME}",
        temp_path=f"s3://{BUCKET}/temp/{TABLE_NAME}",
        schema_evolution=True,
        keep_files=False,
)

Traceback (most recent call last):
  ...
  
    raise exceptions.QueryFailed(response["Status"].get("StateChangeReason"))
awswrangler.exceptions.QueryFailed: COLUMN_NOT_FOUND: Insert column name does not exist in target table: first_name. If a data manifest file was generated at 's3://<REDACTED>/f6a7953b-dbe7-4699-a5ce-2f13168f0253-manifest.csv', you may need to manually clean the data from locations specified in the manifest. Athena will not delete data in your account.

Looking into the fix.

kukushking · 2024-10-01T18:30:01Z

This relates to apache/iceberg#7584: after a column is renamed, Glue still lists the old column as if it were present in the schema, so subsequent INSERT statements include columns that Iceberg no longer considers "current".
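
To illustrate the idea of returning only "current" columns, here is a minimal sketch. It is not the code from #2982. It assumes that Glue attaches an `iceberg.field.current` entry to each column's `Parameters` and sets it to `"false"` for columns that were renamed or dropped. The helper name `current_iceberg_columns` and the `sample_table` dictionary are hypothetical; the dictionary is a hand-written stand-in for the `Table` part of a Glue `get_table` response, not real output.

```python
from typing import Any


def current_iceberg_columns(glue_table: dict[str, Any]) -> dict[str, str]:
    """Return {column name: type} for columns not flagged as non-current."""
    columns = glue_table.get("StorageDescriptor", {}).get("Columns", [])
    current: dict[str, str] = {}
    for col in columns:
        params = col.get("Parameters") or {}
        # Assumption: stale Iceberg columns carry "iceberg.field.current" = "false";
        # columns without the flag are treated as current.
        if str(params.get("iceberg.field.current", "true")).lower() == "true":
            current[col["Name"]] = col["Type"]
    return current


# Hypothetical table metadata after renaming first_name to new_first_name
sample_table = {
    "StorageDescriptor": {
        "Columns": [
            {"Name": "first_name", "Type": "string",
             "Parameters": {"iceberg.field.current": "false"}},
            {"Name": "new_first_name", "Type": "string",
             "Parameters": {"iceberg.field.current": "true"}},
            {"Name": "city", "Type": "string",
             "Parameters": {"iceberg.field.current": "true"}},
        ]
    }
}

print(current_iceberg_columns(sample_table))
# {'new_first_name': 'string', 'city': 'string'}
```

With the stale `first_name` entry filtered out, the column list used to build the INSERT matches what Iceberg treats as the current schema, which is the mismatch behind the COLUMN_NOT_FOUND error above.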

lautarortega added the bug label Sep 30, 2024

kukushking self-assigned this Sep 30, 2024

kukushking mentioned this issue Oct 1, 2024

fix: return only "current" iceberg columns #2982

Merged

kukushking linked a pull request Oct 1, 2024 that will close this issue

fix: return only "current" iceberg columns #2982

Merged

kukushking closed this as completed in #2982 Oct 2, 2024
