Improve docs around the :output_location option #46

Merged
merged 1 commit into from
Feb 12, 2025
10 changes: 7 additions & 3 deletions README.md
@@ -5,8 +5,10 @@

[Req](https://github.com/wojtekmach/req) plugin for [AWS Athena](https://docs.aws.amazon.com/athena/latest/APIReference/Welcome.html).

ReqAthena makes it easy to make Athena queries. Query results are decoded into the `ReqAthena.Result` struct.
The struct implements the `Table.Reader` protocol and thus can be efficiently traversed by rows or columns.
ReqAthena makes it easy to make Athena queries and save the results into S3 buckets.

By default, `ReqAthena` will query results and use the default output format,
which is CSV. To change that, you can use the `:format` option documented below.

## Usage

@@ -21,7 +23,9 @@ opts = [
secret_access_key: System.fetch_env!("AWS_SECRET_ACCESS_KEY"),
region: System.fetch_env!("AWS_REGION"),
database: "default",
output_location: "s3://my-bucket"
# This may need to be a new directory for every query using the `:json` or `:explorer` formats.
# See the docs for details: https://hexdocs.pm/req_athena/ReqAthena.html#new/1
output_location: "s3://my-bucket/my-location"
]

req = ReqAthena.new(opts)
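The new comment on `:output_location` suggests using a fresh directory for every query with the `:json` or `:explorer` formats. One way to do that is sketched below. `UniqueOutputLocation` is a hypothetical helper, not part of ReqAthena, and the timestamp-plus-random-suffix naming scheme is an assumption rather than anything the library prescribes.

```elixir
defmodule UniqueOutputLocation do
  @moduledoc """
  Hypothetical helper (not part of ReqAthena): derives a fresh S3 prefix
  for each query from a fixed base location.
  """

  # Appends a UTC timestamp and a short random hex suffix to the base S3 URL,
  # so that two calls are very unlikely to produce the same directory.
  def build(base) when is_binary(base) do
    stamp = Calendar.strftime(DateTime.utc_now(), "%Y%m%dT%H%M%S")
    suffix = Base.encode16(:crypto.strong_rand_bytes(4), case: :lower)

    String.trim_trailing(base, "/") <> "/" <> stamp <> "-" <> suffix
  end
end

# Build the options with a new output location for this particular query.
opts = [
  database: "default",
  output_location: UniqueOutputLocation.build("s3://my-bucket/my-location")
]

IO.inspect(opts[:output_location])
```

The random suffix makes collisions unlikely but not impossible; a UUID or an application-level query id would serve the same purpose.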
23 changes: 13 additions & 10 deletions lib/req_athena.ex
@@ -46,7 +46,15 @@ defmodule ReqAthena do
* `:database` - Required. The AWS Athena database name.

* `:output_location` - Optional. The S3 URL location to output AWS Athena query results.
Results will be saved as Parquet and loaded with Explorer only if this option is given.

When using `:json` or `:explorer` as the `:format` option (see below), this option is required.
You may also need to specify a new output location for every new query when using these
formats due to a limitation of the `UNLOAD` command that `ReqAthena` uses underneath.
Since Athena expects the directory used by `UNLOAD` to be empty, we append a "`results`"
directory to the path of the `:output_location` to ensure it's empty.

See the [`UNLOAD` command docs](https://docs.aws.amazon.com/athena/latest/ug/unload.html#unload-considerations-and-limitations)
for more details.

* `:workgroup` - Conditional. The AWS Athena workgroup.

@@ -64,23 +72,18 @@ defmodule ReqAthena do
and to prevent it from doing so, set `decode_body: false`.

* `:explorer` - return contents in parquet format, lazy loaded into Explorer data frame.
It means that the content is saved in the `:output_location` using parquet files.

To use this option you first need to install `:explorer` as a dependency.

There are some limitations when using the `:json` and `:explorer` format.
First, you need to install Explorer in order to use the `:explorer` format.
Second, when using these formats, you always need to provide a different output location.
See the [`UNLOAD` command docs](https://docs.aws.amazon.com/athena/latest/ug/unload.html#unload-considerations-and-limitations)
for more details.
When using `:json` or `:explorer` format, you may need to pass a different output location
for every query. See `:output_location` for details.

* `:output_compression` - Optional. Sets the Parquet compression format and level
for the output when using the Explorer output format. This can be a string, like `"gzip"`,
or a tuple with `{format, level}`, like: `{"ZSTD", 4}`. By default this is `nil`,
which means that for Parquet (the format that Explorer uses) this is going to be `"gzip"`.

There is a limitation of Athena that requires the `:output_location` to be present
for every query that outputs to a format other than "CSV". So we append "results"
to the `:output_location` to make the partition files be saved there.

Conditional fields must always be defined, and can be one of the fields or both.
"""
@spec new(keyword()) :: Req.Request.t()
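The updated `:output_location` docs state two rules: the option is required for the `:json` and `:explorer` formats, and a "results" directory is appended to the given path. The sketch below only illustrates those two rules as described in the docstring. `OutputLocationSketch` and `resolve/2` are hypothetical names, and this is not the library's actual implementation.

```elixir
defmodule OutputLocationSketch do
  # Formats that, per the docs above, rely on UNLOAD and need :output_location.
  @unload_formats [:json, :explorer]

  # The option is missing for a format that requires it.
  def resolve(format, nil) when format in @unload_formats do
    {:error, ":output_location is required for format #{inspect(format)}"}
  end

  # Append a "results" directory, mirroring the behavior the docs describe.
  def resolve(format, location) when format in @unload_formats and is_binary(location) do
    {:ok, String.trim_trailing(location, "/") <> "/results"}
  end

  # Other formats (for example the default CSV) pass the value through unchanged.
  def resolve(_format, location), do: {:ok, location}
end

IO.inspect(OutputLocationSketch.resolve(:json, "s3://my-bucket/my-location"))
IO.inspect(OutputLocationSketch.resolve(:explorer, nil))
IO.inspect(OutputLocationSketch.resolve(:csv, nil))
```

Since `UNLOAD` expects an empty target directory, reusing the same `:output_location` for a second `:json` or `:explorer` query may still fail even with the "results" suffix, which is why the docs recommend a new location per query.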