Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: add back rowcounts and byte scanned to dbt cli and run_results.json #375

Merged
merged 9 commits into from
Aug 22, 2023

Conversation

nicor88
Copy link
Contributor

@nicor88 nicor88 commented Aug 17, 2023

Description

Fix #374

This approach doesn't cover cases where we do batch inserts due to partition limitations.

Models used to test - Optional

TODO

  • Test with table materialization (hive and iceberg)
  • Test with incremental append materialization (hive and iceberg)
  • Test with incremental merge materialization (iceberg)
  • Test with incremental partition overwrite materialization (hive)
  • Test when partition limits are reached (iceberg/hive) with table
  • Test when partition limits are reached (iceberg/hive) with incremental loads

Checklist

  • You followed contributing section
  • You kept your Pull Request small and focused on a single feature or bug fix.
  • You added unit testing when necessary
  • You added functional testing when necessary

@nicor88
Copy link
Contributor Author

nicor88 commented Aug 17, 2023

@svdimchenko when you have time could you have a look at this? - somehow it works for me, I want to test intensely against specific edge cases.

@nicor88 nicor88 removed the request for review from thenaturalist August 18, 2023 11:35
svdimchenko
svdimchenko previously approved these changes Aug 22, 2023
Copy link
Contributor

@svdimchenko svdimchenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome!

@nicor88 nicor88 merged commit 6ece494 into main Aug 22, 2023
8 checks passed
@nicor88 nicor88 deleted the fix/tweak_cli_records branch August 22, 2023 09:52
@SoumayaMauthoorMOJ
Copy link
Contributor

I compared data_scanned_in_bytes with the "Data Scanned" from the Athena recent queries tab and the value is consistently 4.8% greater. @nicor88 @svdimchenko do you have any ideas why ?

@nicor88
Copy link
Contributor Author

nicor88 commented Sep 18, 2023

As you can see from the implementation we purely pass the result of the cursor to result.

Few ideas:

  • try to run a simple query without dbt, using PyAthena and see if you get the same difference
  • try out if this 4.8% difference exists in the previous version of the adapter e.g. 1.5.1
  • did you try to use boto3 athena client and see what you get for the same execution id?

@SoumayaMauthoorMOJ
Copy link
Contributor

@nicor88 I get the same difference with PyAthena:

from pyathena import connect

cursor = connect(s3_staging_dir="s3://sandbox-ireland-athena-query-dump/",
                 region_name="eu-west-1").cursor()
cursor.execute("SELECT * from tpcds_1.catalog_returns")
print(data_scanned_in_bytes)

# 17738113

I also get the same difference with aws athena get-query-execution --query-execution-id 0b75297f-b0af-4181-8595-9c66ea12e656:

        "Statistics": {
            "EngineExecutionTimeInMillis": 1889,
            "DataScannedInBytes": 17738113,
            "TotalExecutionTimeInMillis": 2034,
            "QueryQueueTimeInMillis": 125,
            "QueryPlanningTimeInMillis": 67,
            "ServiceProcessingTimeInMillis": 20,
            "ResultReuseInformation": {
                "ReusedPreviousResult": false
            }
        },

However Athena recent queries shows something different:

image

I'll raise a support request with AWS to clarify.

@SoumayaMauthoorMOJ
Copy link
Contributor

SoumayaMauthoorMOJ commented Sep 19, 2023

Oh silly me, it's because of binary conversion: https://www.gbmb.org/bytes-to-mb

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enable-functional-tests Label to trigger functional testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix query statistics returned by the CLI and in run_results.json
3 participants