Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wr.athena.to_iceberg - Insert query has mismatched column types #2678

Closed
Mroq93 opened this issue Feb 16, 2024 · 4 comments · Fixed by #2715
Closed

wr.athena.to_iceberg - Insert query has mismatched column types #2678

Mroq93 opened this issue Feb 16, 2024 · 4 comments · Fixed by #2715
Labels
bug Something isn't working

Comments

@Mroq93
Copy link

Mroq93 commented Feb 16, 2024

Describe the bug

I try to save several Data Frames to Iceberg table using wr.athena.to_iceberg.
A few incremental savings go without any issues, but after some iteration I am getting error:
TYPE_MISMATCH: Insert query has mismatched column types: Table: [varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, timestamp(6), varchar, varchar, varchar, varchar, varchar, varchar, varchar], Query: [varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, varchar, timestamp(3), varchar, varchar, varchar, varchar, varchar, varchar]. If a data manifest file was generated at 's3://bucket-temp/athena/results/c0c18807-2773-4afc-b95c-580034d960ed-manifest.csv', you may need to manually clean the data from locations specified in the manifest. Athena will not delete data in your account.

I see 2 differences between schemas:

  • In the table, there is timestamp(6) but the next saving to iceberg recognizes column as timestamp(3)
  • the number of types is different by one

Before saving, I cast Dataframe timestamp columns to the same format to be sure that every timestamp is aligned.

        dataframe[name] = pd.to_datetime(
            dataframe[name], format='ISO8601'
        ) 

When I print this timestamp, column for every Dataframe column format is the same.
image

In dtype in wr.athena.to_iceberg , for timestamp column I provide type as timestamp, I can not provide precision - it is not supported.

I am not sure if the matter of different number of columns should be an issue. I guess it was resolved here:
#2616

PS.
The order of columns does matter?

How to Reproduce

                wr.athena.to_iceberg(
                    df=df,
                    database=database,
                    table=tbl_name,
                    table_location=s3_target_path,
                    temp_path=f"{s3_target_path}temp/{ini_time}",
                    partition_cols={"bookmark_date_str"},
                    workgroup=work_group_name,
                    schema_evolution=True,
                    dtype=columns_for_iceberg,
                )

Expected behavior

No issues with timestamp precision
No issues when saving DataFrames with different schema(missing or additional columns)

Your project

No response

Screenshots

No response

OS

AWS

Python version

3.9

AWS SDK for pandas version

3.5.2

Additional context

No response

@Mroq93 Mroq93 added the bug Something isn't working label Feb 16, 2024
@kukushking
Copy link
Contributor

Hi @Mroq93 thanks for opening this - looking into it.

@GalVishi
Copy link
Contributor

Hello @kukushking, do you have any news or updates regarding the bug that we discussed earlier?

@GalVishi
Copy link
Contributor

GalVishi commented Mar 10, 2024

Just need to change this line:

sql_statement = f'INSERT INTO "{database}"."{table}" SELECT * FROM "{database}"."{temp_table}"'

sql_statement = f'INSERT INTO "{database}"."{table}" SELECT {', '.join([f'"{x}"' for x in df.columns])} FROM "{database}"."{temp_table}"'

@Mroq93
Copy link
Author

Mroq93 commented Mar 14, 2024

Hi @jaidisido @GalvFionic ,
I tried 3.7.1 awswrangler version, but still I am facing the issue.
Is the fix available in the latest version?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants