Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Time Series QA: Make Dask notebook self-contained and testable #352

Merged
merged 1 commit into from
Mar 26, 2024

Conversation

amotl
Copy link
Member

@amotl amotl commented Mar 7, 2024

What's Inside

Make it more convenient to go data shopping using the CrateDB/Dask notebook.

Remarks

There are concerns about downloading and processing 200 MB worth of data again and again, so this patch may need a resource saver.

References

@amotl amotl force-pushed the amo/timeseries-ci branch 2 times, most recently from dba15fe to 2b593ad Compare March 7, 2024 17:26
@amotl amotl marked this pull request as ready for review March 7, 2024 18:44
Copy link
Contributor

@ckurze ckurze left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We worked in parallel on a similar approach :) Good to have the kaggle import in CrateDB Toolkit now!

A few suggestions:

  • I think we should explain how to obtain the kaggle.json file (API key) and where to place it.
  • The schema should remain the default "doc", you switched to "notebook" - can lead to confusion, as we are working in doc in other notebooks as well.

Great work, thank you!

Comment on lines 523 to 546
"CREATE TABLE IF NOT EXISTS \"doc\".\"weather_data\" (\n",
"CREATE TABLE IF NOT EXISTS \"notebook\".\"weather_data\" (\n",
Copy link
Member Author

@amotl amotl Mar 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The schema should remain the default "doc", you switched to "notebook" - can lead to confusion, as we are working in doc in other notebooks as well.

Maybe let's remove this explicit reference about the schema, so that software tests can use a different schema. The other notebooks also don't include it, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Erledigt.

Base automatically changed from amo/timeseries-ci to main March 19, 2024 19:53
@amotl amotl force-pushed the amo/timeseries-ci-dask branch 4 times, most recently from 665d389 to fb04467 Compare March 21, 2024 17:59
@amotl amotl requested a review from ckurze March 21, 2024 18:03
Comment on lines +14 to +20
# Skip Kaggle tests when having no authentication information.
kaggle_auth_exists = Path("~/.kaggle/kaggle.json").exists() or (
"KAGGLE_USERNAME" in os.environ and "KAGGLE_KEY" in os.environ
)
if not kaggle_auth_exists:
raise pytest.skip(f"Kaggle dataset can not be tested "
f"without authentication: {notebook.name}")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should explain how to obtain the kaggle.json file (API key) and where to place it.

Right, this information should also go to the docs / the notebook itself.

Copy link
Member Author

@amotl amotl Mar 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Erledigt. See #352 (comment).

Comment on lines 523 to 546
"CREATE TABLE IF NOT EXISTS \"doc\".\"weather_data\" (\n",
"CREATE TABLE IF NOT EXISTS \"notebook\".\"weather_data\" (\n",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Erledigt.

Comment on lines +77 to +90
"The subsequent code cell acquires the dataset directly from kaggle.com.\n",
"To properly configure the notebook to use corresponding credentials\n",
"after signing up on Kaggle, define the `KAGGLE_USERNAME` and\n",
"`KAGGLE_KEY` environment variables. Alternatively, put them into the\n",
"file `~/.kaggle/kaggle.json` in your home folder, like this:\n",
"```json\n",
"{\n",
" \"username\": \"acme\",\n",
" \"key\": \"2b1dac2af55caaf1f34df76236fada4a\"\n",
"}\n",
"```\n",
"Another variant is to acquire the dataset files manually, and extract\n",
"them into a folder called `DOWNLOAD`. In this case, you can deactivate\n",
"those two lines of code, in order to skip automatic dataset acquisition."
Copy link
Member Author

@amotl amotl Mar 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the relevant documentation paragraph, now within the notebook itself, which educates readers about Kaggle.

Copy link
Contributor

@ckurze ckurze left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you very much!

@amotl amotl merged commit af3cf3a into main Mar 26, 2024
1 check passed
@amotl amotl deleted the amo/timeseries-ci-dask branch March 26, 2024 18:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants