Allow users to pass a client_request_token for wr.athena.read_sql_query #2473

Closed
mmalahe opened this issue Sep 26, 2023 · 1 comment
Labels
enhancement New feature or request

mmalahe commented Sep 26, 2023

Is your idea related to a problem? Please describe.
I'm running a batch computing job in an account that runs thousands of such jobs a day. Each job reads in data via an Athena query, does some processing, and then writes out JSON outputs. For an arbitrary subset of these jobs, an identical query may already have been run, even though each job performs different operations on the data and produces different results.

To save cost and time, I'd like to ensure that queries aren't rerun in those duplicate cases. I've looked into the athena_cache_settings option for read_sql_query, but that caching mechanism lacks two properties I need:

  1. I'd like there to be no limit on the lookback distance. The queries we're reusing may be tens or hundreds of thousands of executions back.
  2. I'd prefer not to make many calls to the Athena API to do the linear search for a matching query.

Describe the solution you'd like
I think the easiest way to meet these requirements is to use the ClientRequestToken parameter of StartQueryExecution (https://docs.aws.amazon.com/athena/latest/APIReference/API_StartQueryExecution.html). Callers of read_sql_query would get the option to pass their own token through. If the token matches an earlier request, Athena does not start a new query, and the caller is simply served the existing results.

Example usage:

import hashlib
import awswrangler as wr

def get_query_hash(query):
    # Hash the query text so identical queries always map to the same token.
    return hashlib.sha1(query.encode("utf-8")).hexdigest()

query = "select * from table limit 10"
# Readable prefix plus the 40-character SHA-1 hex digest.
client_request_token = "select_limit_10" + get_query_hash(query)
df = wr.athena.read_sql_query(
    sql=query,
    client_request_token=client_request_token,  # proposed new parameter
    ...
)
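
For reference, the same idea can be sketched directly against the Athena API, which is roughly what read_sql_query would need to do under the hood. This is a minimal sketch under assumptions, not the library's implementation: the helper names make_client_request_token and start_query_idempotently are hypothetical, the token is derived from the database name plus the query text, and athena_client is assumed to be a boto3 Athena client created by the caller. The API reference gives a 32 to 128 character range for ClientRequestToken; how long Athena honors a previously seen token should be checked there too.

```python
import hashlib


def make_client_request_token(query: str, database: str, prefix: str = "") -> str:
    """Derive a deterministic token from the database name and query text.

    Hypothetical helper: identical (database, query) pairs always yield the
    same token, so a repeated request can be recognized by Athena.
    """
    digest = hashlib.sha256(f"{database}\n{query}".encode("utf-8")).hexdigest()
    token = f"{prefix}{digest}"
    # The StartQueryExecution reference documents a 32-128 character range.
    if not 32 <= len(token) <= 128:
        raise ValueError(f"token length {len(token)} is outside 32-128")
    return token


def start_query_idempotently(athena_client, query: str, database: str,
                             output_location: str) -> str:
    """Start a query with a deterministic ClientRequestToken.

    athena_client is assumed to behave like boto3.client("athena").
    Returns the QueryExecutionId reported by the service.
    """
    response = athena_client.start_query_execution(
        QueryString=query,
        ClientRequestToken=make_client_request_token(query, database),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

With a real client this would be called as start_query_idempotently(boto3.client("athena"), query, "my_database", "s3://my-bucket/athena-results/"), where the database and bucket names are placeholders. Including the database in the hash is a design choice for this sketch: per the API reference, reusing a token while changing a parameter such as QueryString returns an error, so the token should change whenever the request does.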
@mmalahe mmalahe added the enhancement New feature or request label Sep 26, 2023
@kukushking kukushking self-assigned this Sep 26, 2023

mmalahe commented Sep 28, 2023

Thanks for the quick turnaround on this!


3 participants