From 23a2aa9ddd870f44ac100908e395721ae8a24f40 Mon Sep 17 00:00:00 2001 From: Hugh Evans Date: Tue, 3 Sep 2024 14:36:45 +0100 Subject: [PATCH 01/12] Added spell checking, basic linting, and checks for broken links in markdown --- .pre-commit-config.yaml | 17 +++++++++++++++++ ignore-spelling-words.txt | 8 ++++++++ requirements.txt | 4 ++++ 3 files changed, 29 insertions(+) create mode 100644 .pre-commit-config.yaml create mode 100644 ignore-spelling-words.txt create mode 100644 requirements.txt diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml new file mode 100644 index 00000000..428e67f8 --- /dev/null +++ b/.pre-commit-config.yaml @@ -0,0 +1,17 @@ +repos: +- repo: https://github.com/pre-commit/pre-commit-hooks + rev: v2.4.0 + hooks: + - id: trailing-whitespace + - id: end-of-file-fixer + - id: check-yaml + - id: check-json +- repo: https://github.com/codespell-project/codespell + rev: v2.3.0 + hooks: + - id: codespell + name: codespell + description: Checks for common misspellings in text files. + entry: codespell --ignore-words=ignore-spelling-words.txt + language: python + types: [text] diff --git a/ignore-spelling-words.txt b/ignore-spelling-words.txt new file mode 100644 index 00000000..3104459b --- /dev/null +++ b/ignore-spelling-words.txt @@ -0,0 +1,8 @@ +EGE +MKE +MOT +SAV +AGS + +Rouge +Nome diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 00000000..c16e82f6 --- /dev/null +++ b/requirements.txt @@ -0,0 +1,4 @@ +pre-commit +black +flake8 +codespell From b0a0d890ab7aa6286d0a59cd2296bed3bceab196 Mon Sep 17 00:00:00 2001 From: Hugh Evans Date: Tue, 3 Sep 2024 14:40:06 +0100 Subject: [PATCH 02/12] Added pre-commit instructions to README.md --- README.md | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 54a3d543..d120c791 100644 --- a/README.md +++ b/README.md @@ -29,7 +29,7 @@ It contains: Suggestions or comments? 
Call into the [discussions](https://github.com/implydata/learn-druid/discussions). Found a problem or want to request a notebook? Raise an [issue](https://github.com/implydata/learn-druid/issues). Want to contribute? Raise a [PR](https://github.com/implydata/learn-druid/pulls). [Contributions](contributing.md) to this community resource are welcome! Contribute your own notebook on a topic that's not listed here, and check out the [issue](https://github.com/implydata/learn-druid/issues) list, where you'll find bugs and enhancement requests. - + Come meet your friendly Apache Druid [community](https://druid.apache.org/community) if you have any questions about the functionality you see here. ## Pre-requisites @@ -48,7 +48,7 @@ To use the "Learn Druid" Docker Compose, you need: To get started quickly: 1. Clone the repository: - + ```bash git clone https://github.com/implydata/learn-druid ``` @@ -130,6 +130,15 @@ The Learn Druid environment includes the following services: [**Apache Druid**](https://druid.apache.org/): The currently released version of Apache Druid by default. +## Contributing + +Please install and run the [pre-commit](https://pre-commit.com/) before raising PRs. + +```bash +pip install pre-commit +pre-commit install +``` + --- **This repository is not affiliated with, endorsed by, or otherwise associated with the Apache Software Foundation (ASF) or any of its projects. 
Apache, Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of ASF in the USA and other countries.** From 846601a98c1fd36a37b3207b49790add242be670 Mon Sep 17 00:00:00 2001 From: Hugh Evans Date: Thu, 12 Sep 2024 11:51:58 +0100 Subject: [PATCH 03/12] Moved contributing note to correct location --- README.md | 9 --------- contributing.md | 15 +++++++++++---- 2 files changed, 11 insertions(+), 13 deletions(-) diff --git a/README.md b/README.md index d120c791..23ba18c0 100644 --- a/README.md +++ b/README.md @@ -130,15 +130,6 @@ The Learn Druid environment includes the following services: [**Apache Druid**](https://druid.apache.org/): The currently released version of Apache Druid by default. -## Contributing - -Please install and run the [pre-commit](https://pre-commit.com/) before raising PRs. - -```bash -pip install pre-commit -pre-commit install -``` - --- **This repository is not affiliated with, endorsed by, or otherwise associated with the Apache Software Foundation (ASF) or any of its projects. Apache, Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of ASF in the USA and other countries.** diff --git a/contributing.md b/contributing.md index cd85378b..2c1e8014 100644 --- a/contributing.md +++ b/contributing.md @@ -11,14 +11,21 @@ Here are some general guidelines on making a notebook. The [contributing](https://github.com/implydata/learn-druid/tree/main/notebooks/99-contributing) folder contains a notebook template as a starting point. You'll find boilerplate elements including: * Setting the connection to Druid, Kafka, and the data generator. -* Starter elements for ingesting from example datas sets or the data generator. +* Starter elements for ingesting from example data sets or the data generator. * Clean-up elements, like dropping tables, stopping streaming ingestion, and halting data generator jobs. -* Re-usable code elements that other contributors have found useful. 
+* Reusable code elements that other contributors have found useful. And don't forget that the template itself is open to contribution! ### Raise a PR +Please install and run the [pre-commit](https://pre-commit.com/) before raising PRs. + +```bash +pip install pre-commit +pre-commit install +``` + When you have a notebook and you're ready for feedback, it's a good idea to raise a draft PR first. Feel free to use the comments section to ask for initial feedback, or drop into the docs channel in the official Apache Druid Slack channel. And when it's ready to go, finalize your PR. Add reviewers, get formal feedback, make any necessary changes, etc. in the usual way. @@ -39,12 +46,12 @@ Use the same route to run a locally-built Docker image by using the appropriate ### Run automated tests on notebooks -Make sure that docker compose is down and all volumes have been deleted, then start tests with: +Make sure that docker compose is down and all volumes have been deleted, then start tests with: ```shell cd tests ./test-notebooks.sh - ``` + ``` To test single notebook: From 93fdc20850117fed438692af1f438fece5036bf2 Mon Sep 17 00:00:00 2001 From: Peter Marshall Date: Thu, 12 Sep 2024 08:50:43 +0100 Subject: [PATCH 04/12] Update README.md Added a link to the notebooks index. --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 54a3d543..016b4f3e 100644 --- a/README.md +++ b/README.md @@ -23,7 +23,7 @@ The "Learn Druid" repository contains all manner of resources to help you learn It contains: -* Jupyter Notebooks that guide you through query, ingestion, and data management with Apache Druid. +* [Jupyter Notebooks](notebooks) that guide you through query, ingestion, and data management with Apache Druid. * A Docker Compose file to get you up and running with a learning lab. Suggestions or comments? Call into the [discussions](https://github.com/implydata/learn-druid/discussions). 
Found a problem or want to request a notebook? Raise an [issue](https://github.com/implydata/learn-druid/issues). Want to contribute? Raise a [PR](https://github.com/implydata/learn-druid/pulls). From ade82f3bc3b537138b31d645024a7469b81c32d1 Mon Sep 17 00:00:00 2001 From: Hugh Evans Date: Tue, 3 Sep 2024 14:36:45 +0100 Subject: [PATCH 05/12] Added spell checking, basic linting, and checks for broken links in markdown --- .pre-commit-config.yaml | 17 +++++++++++++++++ ignore-spelling-words.txt | 8 ++++++++ requirements.txt | 4 ++++ 3 files changed, 29 insertions(+) create mode 100644 .pre-commit-config.yaml create mode 100644 ignore-spelling-words.txt create mode 100644 requirements.txt diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml new file mode 100644 index 00000000..428e67f8 --- /dev/null +++ b/.pre-commit-config.yaml @@ -0,0 +1,17 @@ +repos: +- repo: https://github.com/pre-commit/pre-commit-hooks + rev: v2.4.0 + hooks: + - id: trailing-whitespace + - id: end-of-file-fixer + - id: check-yaml + - id: check-json +- repo: https://github.com/codespell-project/codespell + rev: v2.3.0 + hooks: + - id: codespell + name: codespell + description: Checks for common misspellings in text files. 
+ entry: codespell --ignore-words=ignore-spelling-words.txt + language: python + types: [text] diff --git a/ignore-spelling-words.txt b/ignore-spelling-words.txt new file mode 100644 index 00000000..3104459b --- /dev/null +++ b/ignore-spelling-words.txt @@ -0,0 +1,8 @@ +EGE +MKE +MOT +SAV +AGS + +Rouge +Nome diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 00000000..c16e82f6 --- /dev/null +++ b/requirements.txt @@ -0,0 +1,4 @@ +pre-commit +black +flake +codespell From 4860ed45ca80888e8e09084f6be56a1c15e87375 Mon Sep 17 00:00:00 2001 From: Hugh Evans Date: Tue, 3 Sep 2024 14:40:06 +0100 Subject: [PATCH 06/12] Added pre-commit instructions to README.md --- README.md | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 016b4f3e..af85b1f9 100644 --- a/README.md +++ b/README.md @@ -29,7 +29,7 @@ It contains: Suggestions or comments? Call into the [discussions](https://github.com/implydata/learn-druid/discussions). Found a problem or want to request a notebook? Raise an [issue](https://github.com/implydata/learn-druid/issues). Want to contribute? Raise a [PR](https://github.com/implydata/learn-druid/pulls). [Contributions](contributing.md) to this community resource are welcome! Contribute your own notebook on a topic that's not listed here, and check out the [issue](https://github.com/implydata/learn-druid/issues) list, where you'll find bugs and enhancement requests. - + Come meet your friendly Apache Druid [community](https://druid.apache.org/community) if you have any questions about the functionality you see here. ## Pre-requisites @@ -48,7 +48,7 @@ To use the "Learn Druid" Docker Compose, you need: To get started quickly: 1. 
Clone the repository: - + ```bash git clone https://github.com/implydata/learn-druid ``` @@ -130,6 +130,15 @@ The Learn Druid environment includes the following services: [**Apache Druid**](https://druid.apache.org/): The currently released version of Apache Druid by default. +## Contributing + +Please install and run the [pre-commit](https://pre-commit.com/) before raising PRs. + +```bash +pip install pre-commit +pre-commit install +``` + --- **This repository is not affiliated with, endorsed by, or otherwise associated with the Apache Software Foundation (ASF) or any of its projects. Apache, Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of ASF in the USA and other countries.** From 179f5f1abd0b566036fce4906d20b495b6a95828 Mon Sep 17 00:00:00 2001 From: Hugh Evans Date: Thu, 12 Sep 2024 11:51:58 +0100 Subject: [PATCH 07/12] Moved contributing note to correct location --- README.md | 9 --------- contributing.md | 15 +++++++++++---- 2 files changed, 11 insertions(+), 13 deletions(-) diff --git a/README.md b/README.md index af85b1f9..e3a3e84b 100644 --- a/README.md +++ b/README.md @@ -130,15 +130,6 @@ The Learn Druid environment includes the following services: [**Apache Druid**](https://druid.apache.org/): The currently released version of Apache Druid by default. -## Contributing - -Please install and run the [pre-commit](https://pre-commit.com/) before raising PRs. - -```bash -pip install pre-commit -pre-commit install -``` - --- **This repository is not affiliated with, endorsed by, or otherwise associated with the Apache Software Foundation (ASF) or any of its projects. Apache, Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of ASF in the USA and other countries.** diff --git a/contributing.md b/contributing.md index cd85378b..2c1e8014 100644 --- a/contributing.md +++ b/contributing.md @@ -11,14 +11,21 @@ Here are some general guidelines on making a notebook. 
The [contributing](https://github.com/implydata/learn-druid/tree/main/notebooks/99-contributing) folder contains a notebook template as a starting point. You'll find boilerplate elements including: * Setting the connection to Druid, Kafka, and the data generator. -* Starter elements for ingesting from example datas sets or the data generator. +* Starter elements for ingesting from example data sets or the data generator. * Clean-up elements, like dropping tables, stopping streaming ingestion, and halting data generator jobs. -* Re-usable code elements that other contributors have found useful. +* Reusable code elements that other contributors have found useful. And don't forget that the template itself is open to contribution! ### Raise a PR +Please install and run the [pre-commit](https://pre-commit.com/) before raising PRs. + +```bash +pip install pre-commit +pre-commit install +``` + When you have a notebook and you're ready for feedback, it's a good idea to raise a draft PR first. Feel free to use the comments section to ask for initial feedback, or drop into the docs channel in the official Apache Druid Slack channel. And when it's ready to go, finalize your PR. Add reviewers, get formal feedback, make any necessary changes, etc. in the usual way. 
@@ -39,12 +46,12 @@ Use the same route to run a locally-built Docker image by using the appropriate ### Run automated tests on notebooks -Make sure that docker compose is down and all volumes have been deleted, then start tests with: +Make sure that docker compose is down and all volumes have been deleted, then start tests with: ```shell cd tests ./test-notebooks.sh - ``` + ``` To test single notebook: From b12cff0c218d0634fb13d5efc473ba69b8541a0c Mon Sep 17 00:00:00 2001 From: Hugh Evans Date: Thu, 19 Sep 2024 10:10:54 +0100 Subject: [PATCH 08/12] Minor spelling and word choice fixes --- notebooks/03-query/20-tiering-historicals.ipynb | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/notebooks/03-query/20-tiering-historicals.ipynb b/notebooks/03-query/20-tiering-historicals.ipynb index 41a41b91..d8813656 100644 --- a/notebooks/03-query/20-tiering-historicals.ipynb +++ b/notebooks/03-query/20-tiering-historicals.ipynb @@ -25,7 +25,7 @@ " ~ under the License.\n", " -->\n", "\n", - "[Service tiering](https://druid.apache.org/docs/latest/operations/mixed-workloads#service-tiering) provides administrators the ability to provide cluster resources suited to different performance and storage requirements, such as isolating heavy queries involving complex subqueries or large result from high priority, interactive queries.\n", + "[Service tiering](https://druid.apache.org/docs/latest/operations/mixed-workloads#service-tiering) provides administrators the ability to provision cluster resources suited to different performance and storage requirements, such as isolating heavy queries involving complex subqueries or large results from high priority, interactive queries.\n", "\n", "This tutorial demonstrates how to work with [historical tiering](https://druid.apache.org/docs/latest/operations/mixed-workloads#historical-tiering) to load particular ages of data onto different processes. 
In turn, this causes queries to execute on different services depending on the period of time covered by a query." ] @@ -56,6 +56,12 @@ "The following cells set up the notebook and learning environment ready for use." ] }, + { + "cell_type": "markdown", + "id": "83fc3ca9", + "metadata": {}, + "source": [] + }, { "cell_type": "markdown", "id": "0b769122-c5a4-404e-9ef8-9c0ebd97695a", @@ -526,7 +532,7 @@ "\n", "Run the cell below to ingest some data for the current year.\n", "\n", - "The TIME_EXTRACT function is used to fake data for this year by calculating the shift between the source timetamp and today's date." + "The TIME_EXTRACT function is used to fake data for this year by calculating the shift between the source timestamp and today's date." ] }, { From a256a1fca5437a9efffdb41dba6c8056c7d2f225 Mon Sep 17 00:00:00 2001 From: Hugh Evans Date: Thu, 19 Sep 2024 10:14:37 +0100 Subject: [PATCH 09/12] Charles' suggestion for paragraph 1 --- notebooks/03-query/20-tiering-historicals.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/notebooks/03-query/20-tiering-historicals.ipynb b/notebooks/03-query/20-tiering-historicals.ipynb index d8813656..4281c043 100644 --- a/notebooks/03-query/20-tiering-historicals.ipynb +++ b/notebooks/03-query/20-tiering-historicals.ipynb @@ -25,7 +25,7 @@ " ~ under the License.\n", " -->\n", "\n", - "[Service tiering](https://druid.apache.org/docs/latest/operations/mixed-workloads#service-tiering) provides administrators the ability to provision cluster resources suited to different performance and storage requirements, such as isolating heavy queries involving complex subqueries or large results from high priority, interactive queries.\n", + "As an Apache Druid cluster administrator, you can use [Service tiering](https://druid.apache.org/docs/latest/operations/mixed-workloads#service-tiering) to provision cluster resources according to various performance and storage requirements. 
For example, you can isolate resource-intensive queries with complex subqueries and large result sets from high priority, interactive queries.\\n\"\n", "\n", "This tutorial demonstrates how to work with [historical tiering](https://druid.apache.org/docs/latest/operations/mixed-workloads#historical-tiering) to load particular ages of data onto different processes. In turn, this causes queries to execute on different services depending on the period of time covered by a query." ] From b461620de9c0ddc51765af9cf51330a0f75c63da Mon Sep 17 00:00:00 2001 From: Hugh Evans <101668034+hevansDev@users.noreply.github.com> Date: Fri, 20 Sep 2024 12:01:36 +0100 Subject: [PATCH 10/12] Update notebooks/03-query/20-tiering-historicals.ipynb Co-authored-by: Charles Smith Update notebooks/03-query/20-tiering-historicals.ipynb Co-authored-by: Charles Smith Update notebooks/03-query/20-tiering-historicals.ipynb Co-authored-by: Charles Smith Update notebooks/03-query/20-tiering-historicals.ipynb Co-authored-by: Charles Smith Update notebooks/03-query/20-tiering-historicals.ipynb Co-authored-by: Charles Smith Update notebooks/03-query/20-tiering-historicals.ipynb Co-authored-by: Charles Smith Update notebooks/03-query/20-tiering-historicals.ipynb Co-authored-by: Charles Smith Update notebooks/03-query/20-tiering-historicals.ipynb Co-authored-by: Charles Smith Update notebooks/03-query/20-tiering-historicals.ipynb Co-authored-by: Charles Smith Update notebooks/03-query/20-tiering-historicals.ipynb Co-authored-by: Charles Smith Update notebooks/03-query/20-tiering-historicals.ipynb Co-authored-by: Charles Smith Update notebooks/03-query/20-tiering-historicals.ipynb Co-authored-by: Charles Smith Update notebooks/03-query/20-tiering-historicals.ipynb Co-authored-by: Charles Smith Update notebooks/03-query/20-tiering-historicals.ipynb Co-authored-by: Charles Smith Update notebooks/03-query/20-tiering-historicals.ipynb Co-authored-by: Charles Smith Update 
notebooks/03-query/20-tiering-historicals.ipynb Co-authored-by: Charles Smith Update notebooks/03-query/20-tiering-historicals.ipynb Co-authored-by: Charles Smith Update notebooks/03-query/20-tiering-historicals.ipynb Co-authored-by: Charles Smith Update notebooks/03-query/20-tiering-historicals.ipynb Co-authored-by: Charles Smith --- .../03-query/20-tiering-historicals.ipynb | 50 +++++++++---------- 1 file changed, 25 insertions(+), 25 deletions(-) diff --git a/notebooks/03-query/20-tiering-historicals.ipynb b/notebooks/03-query/20-tiering-historicals.ipynb index 4281c043..92abdeda 100644 --- a/notebooks/03-query/20-tiering-historicals.ipynb +++ b/notebooks/03-query/20-tiering-historicals.ipynb @@ -27,7 +27,7 @@ "\n", "As an Apache Druid cluster administrator, you can use [Service tiering](https://druid.apache.org/docs/latest/operations/mixed-workloads#service-tiering) to provision cluster resources according to various performance and storage requirements. For example, you can isolate resource-intensive queries with complex subqueries and large result sets from high priority, interactive queries.\\n\"\n", "\n", - "This tutorial demonstrates how to work with [historical tiering](https://druid.apache.org/docs/latest/operations/mixed-workloads#historical-tiering) to load particular ages of data onto different processes. In turn, this causes queries to execute on different services depending on the period of time covered by a query." + "This tutorial demonstrates how to create different [historical tiers](https://druid.apache.org/docs/latest/operations/mixed-workloads#historical-tiering) to host different data based upon data age. Then, when you run a query, Druid uses the services that host the data corresponding to the time period within the query."
] }, { @@ -39,7 +39,7 @@ "\n", "This tutorial works with Druid 30.0.0 or later.\n", "\n", - "This tutorial requires a deployment of Druid with multiple historicals, and presumes that the additional tier is called \"slow\".\n", + "This tutorial requires a deployment of Druid with multiple historical services, with one of the historical tiers named \"slow\".\n", "\n", "Launch this tutorial and all prerequisites using the `druid-jupyter-tiered-hist` profile of the Docker Compose file for Jupyter-based Druid tutorials to create a cluster with an additional historical.\n", "\n", @@ -163,7 +163,7 @@ "source": [ "## Create a table using batch ingestion\n", "\n", - "In this section, you will create a table that contains data spanning a few years using batch ingestion, and then look at where this data has been cached for query." + "In this section, you use batch ingestion to create a table with data spanning a few years. Then identify where the data has been cached for querying." ] }, { @@ -173,9 +173,9 @@ "source": [ "### Ingest example data\n", "\n", - "Run the next cell to bring in the initial data. Only a subset of the columns that are available in the example dataset will be ingested.\n", + "Run the next cell to ingest the initial data. The ingestion query only loads a subset of all columns available in the example dataset.\n", "\n", - "When completed, you'll see a description of the final table." + "When completed, the cell prints a description of the table." ] }, { @@ -221,7 +221,7 @@ "source": [ "### Inspect the servers and current configuration\n", "\n", - "Use a query against the servers system table to see what historicals are available, and the tiers that they are assigned to." + "Query the servers system table to see available historicals and their assigned tiers."
] }, { @@ -273,7 +273,7 @@ "source": [ "All segments for the table, totalling around 20,000 rows, are loaded onto historicals in the `default_tier` tier.\n", "\n", - "To understand why, run the following cell to use the co-ordinator API to inspect the current retention load rules." + "To understand the default configuration, run the following cell to inspect the current retention load rules with the Coordinator API." ] }, { @@ -291,7 +291,7 @@ "id": "3e597661-e6a8-4a3b-98cc-c1fc7e3bafa4", "metadata": {}, "source": [ - "On creation, tables have no set of rules of their own. Instead, the server's default set of rules, `_default`, are applied to the table.\n", + "After creation, tables do not have any load rules of their own. Instead, they adhere to Druid's default set of rules, `_default`.\n", "\n", "By default, the `_default` rules set contains only one rule - a [load forever](https://druid.apache.org/docs/latest/operations/rule-configuration#forever-load-rule) rule (`loadForever`) with a replication factor (`tieredReplicants`) of 2 across servers in the `_default_tier`.\n", "\n", @@ -305,7 +305,7 @@ "source": [ "## Cache data on different tiers\n", "\n", - "In this section, you'll create a load rule that also loads data onto the `slow` tier. You will use a mixture of forever, period, and interval [load rules](https://druid.apache.org/docs/latest/operations/rule-configuration#load-rules)." + "In this section, you create an additional load rule to load data onto the `slow` tier using a combination of different types of [load rules](https://druid.apache.org/docs/latest/operations/rule-configuration#load-rules): forever, period, and interval."
] }, { @@ -317,10 +317,10 @@ "\n", "Run the next cell to create a JSON object for us to store a retention rule, send it to the Coordinator API, and then print out the current full rule set on the database.\n", "\n", - "- Historicals in the `slow` tier have been added to the replication rules (`tieredReplicants`).\n", - "- The `slow` tier will receive one replica of the data.\n", - "- The `_default_tier` tier will receive one replica of the data.\n", - "- The API call is made to the `rules` endpoint for the table (using the `table_name` variable)." + "- Add Historicals in the `slow` tier to the replication rules (`tieredReplicants`).\n", + "- Load one replica of the data onto the `slow` tier.\n", + "- Load one replica of the data onto the `_default_tier` tier.\n", + "- Call the `rules` endpoint (Retention rules API) and pass the `table_name` to apply the load rule in Druid." ] }, { @@ -349,9 +349,9 @@ "id": "19819547-f069-4bfc-be2a-4f28509a0bf1", "metadata": {}, "source": [ - "In addition to the `_default` rule set, there is now a new rule set specific to the table you have created.\n", + "In addition to the `_default` rule set, there is a new rule set specific to the `example-wikipedia-tiering` table.\n", "\n", - "Run the next cell to see where the data has been cached." + "Run the next cell to see the location of the cached data." ] }, { @@ -383,8 +383,8 @@ "\n", "Run the next cell to create some \"fake\" data in the table that is a year older.\n", "\n", - "* INSERT is used to append data instead of REPLACE INTO.\n", - "* The [TIME_PARSE](https://druid.apache.org/docs/latest/querying/sql-scalar#date-and-time-functions) function has been used to shift the timestamp back by a year. " + "* INSERT appends data to the table instead of REPLACE INTO, which would overwrite it.\n", + "* The [TIME_PARSE](https://druid.apache.org/docs/latest/querying/sql-scalar#date-and-time-functions) function shifts the timestamp back by a year. 
" ] }, { @@ -467,7 +467,7 @@ "id": "f6c686c9-93c7-4477-a9ee-902b2ac8c209", "metadata": {}, "source": [ - "Run the next cell to see where the data has been cached." + "Run the next cell to see where the location of the cached data." ] }, { @@ -490,7 +490,7 @@ "* Data for 2015 (`loadByInterval`) is only cached on the `slow` tier.\n", "* All other data (`loadForever`) is only available on the `_default_tier` tier.\n", "\n", - "(Re-run the cell above if you do not see this immediately.)\n", + "Re-run the cell above if you do not see this immediately.\n", "\n", "Run the following query on the table." ] @@ -532,7 +532,7 @@ "\n", "Run the cell below to ingest some data for the current year.\n", "\n", - "The TIME_EXTRACT function is used to fake data for this year by calculating the shift between the source timestamp and today's date." + "The TIME_EXTRACT function calculates the shift between the source timestamp and today's date to fake data for this year." ] }, { @@ -574,10 +574,10 @@ "id": "f976d910-d4e9-43d6-ad8c-4f03bc4dd762", "metadata": {}, "source": [ - "Review the following retention load rules configuration, then run the cell to apply it to the table.\n", + "The following retention load rules configuration creates load rules as follows:\n", "\n", - "- `loadByPeriod`, giving a [period load rule](https://druid.apache.org/docs/latest/operations/rule-configuration/#period-load-rule) covering data newer than one year old, requesting one cached replica on both the `_default_tier` and `slow` tiers.\n", - "- `loadForever`, requesting all data to be cached on the `slow` tier." + "- `loadByPeriod`, creates a [period load rule](https://druid.apache.org/docs/latest/operations/rule-configuration/#period-load-rule) for one cached replica of data newer than one year on both the `_default_tier` and `slow` tiers.\n", + "- `loadForever`, creates a rule to cache all remaining data on the `slow` tier." ] }, { @@ -713,8 +713,8 @@ "\n", "The order of the rules means:\n", "\n", - "1. 
Data younger than a year is loaded to `_default_tier` historicals.\n", - "2. Data covering 2015 is cached on `slow`-tier historicals.\n", + "1. Druid loads data younger than a year onto `_default_tier` historicals.\n", + "2. Druid loads data covering 2015 onto `slow`-tier historicals.\n", "3. No other data is loaded." ] }, From ea83d7ab0eb2a0dfc0244a8adb8dbefbfc769161 Mon Sep 17 00:00:00 2001 From: Hugh Evans Date: Fri, 20 Sep 2024 12:12:55 +0100 Subject: [PATCH 11/12] Renamed profile for tiered druid examples --- docker-compose.yaml | 32 +++++++++---------- .../03-query/20-tiering-historicals.ipynb | 2 +- notebooks/README.md | 4 +-- 3 files changed, 19 insertions(+), 19 deletions(-) diff --git a/docker-compose.yaml b/docker-compose.yaml index 83b6f116..564e71cd 100644 --- a/docker-compose.yaml +++ b/docker-compose.yaml @@ -34,7 +34,7 @@ services: postgres: image: postgres:alpine3.20 container_name: postgres - profiles: ["druid-jupyter", "all-services", "druid-jupyter-tiered-hist"] + profiles: ["druid-jupyter", "all-services", "tiered-druid-jupyter"] volumes: - metadata_data:/var/lib/postgresql/data environment: @@ -46,11 +46,11 @@ services: zookeeper: image: zookeeper:3.9.2 container_name: zookeeper - profiles: ["druid-jupyter", "kafka-jupyter", "all-services", "druid-jupyter-tiered-hist"] + profiles: ["druid-jupyter", "kafka-jupyter", "all-services", "tiered-druid-jupyter"] environment: - ZOO_MY_ID=1 - ALLOW_ANONYMOUS_LOGIN=yes - + kafka: image: bitnami/kafka:3.6.2 container_name: kafka-broker @@ -74,11 +74,11 @@ services: coordinator: image: imply/druid:${DRUID_VERSION:-30.0.0} container_name: coordinator - profiles: ["druid-jupyter", "all-services", "druid-jupyter-tiered-hist"] + profiles: ["druid-jupyter", "all-services", "tiered-druid-jupyter"] volumes: - druid_shared:/opt/shared - coordinator_var:/opt/druid/var - depends_on: + depends_on: - zookeeper - postgres ports: @@ -91,11 +91,11 @@ services: broker: image: imply/druid:${DRUID_VERSION:-30.0.0} 
container_name: broker - profiles: ["druid-jupyter", "all-services", "druid-jupyter-tiered-hist"] + profiles: ["druid-jupyter", "all-services", "tiered-druid-jupyter"] volumes: - broker_var:/opt/druid/var - druid_shared:/opt/shared - depends_on: + depends_on: - zookeeper - postgres - coordinator @@ -109,11 +109,11 @@ services: historical: image: imply/druid:${DRUID_VERSION:-30.0.0} container_name: historical - profiles: ["druid-jupyter", "all-services", "druid-jupyter-tiered-hist"] + profiles: ["druid-jupyter", "all-services", "tiered-druid-jupyter"] volumes: - druid_shared:/opt/shared - historical_var:/opt/druid/var - depends_on: + depends_on: - zookeeper - postgres - coordinator @@ -127,11 +127,11 @@ services: historical_slow: image: imply/druid:${DRUID_VERSION:-30.0.0} container_name: historical_slow - profiles: ["druid-jupyter-tiered-hist"] + profiles: ["tiered-druid-jupyter"] volumes: - druid_shared:/opt/shared - historical_2_var:/opt/druid/var - depends_on: + depends_on: - zookeeper - postgres - coordinator @@ -150,11 +150,11 @@ services: middlemanager: image: imply/druid:${DRUID_VERSION:-30.0.0} container_name: middlemanager - profiles: ["druid-jupyter", "all-services", "druid-jupyter-tiered-hist"] + profiles: ["druid-jupyter", "all-services", "tiered-druid-jupyter"] volumes: - druid_shared:/opt/shared - middle_var:/opt/druid/var - depends_on: + depends_on: - zookeeper - postgres - coordinator @@ -169,7 +169,7 @@ services: router: image: imply/druid:${DRUID_VERSION:-30.0.0} container_name: router - profiles: ["druid-jupyter", "all-services", "druid-jupyter-tiered-hist"] + profiles: ["druid-jupyter", "all-services", "tiered-druid-jupyter"] volumes: - router_var:/opt/druid/var depends_on: @@ -186,7 +186,7 @@ services: jupyter: image: imply/druid-notebook:latest container_name: jupyter - profiles: ["jupyter", "kafka-jupyter", "druid-jupyter", "all-services", "druid-jupyter-tiered-hist"] + profiles: ["jupyter", "kafka-jupyter", "druid-jupyter", "all-services", 
"tiered-druid-jupyter"] environment: JUPYTER_ENABLE_LAB: "yes" JUPYTER_TOKEN: "docker" @@ -203,7 +203,7 @@ services: datagen: image: imply/datagen:latest container_name: datagen - profiles: ["jupyter", "kafka-jupyter", "druid-jupyter", "all-services", "druid-jupyter-tiered-hist"] + profiles: ["jupyter", "kafka-jupyter", "druid-jupyter", "all-services", "tiered-druid-jupyter"] ports: - "${DATAGEN_PORT:-9999}:9999" volumes: diff --git a/notebooks/03-query/20-tiering-historicals.ipynb b/notebooks/03-query/20-tiering-historicals.ipynb index 92abdeda..5ecabb42 100644 --- a/notebooks/03-query/20-tiering-historicals.ipynb +++ b/notebooks/03-query/20-tiering-historicals.ipynb @@ -41,7 +41,7 @@ "\n", "This tutorial requires a deployment of Druid with multiple historical services, with one of the historical tiers named \"slow\".\n", "\n", - "Launch this tutorial and all prerequisites using the `druid-jupyter-tiered-hist` profile of the Docker Compose file for Jupyter-based Druid tutorials to create a cluster with an additional historical.\n", + "Launch this tutorial and all prerequisites using the `tiered-druid-jupyter` profile of the Docker Compose file for Jupyter-based Druid tutorials to create a cluster with an additional historical.\n", "\n", "For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid)." 
] diff --git a/notebooks/README.md b/notebooks/README.md index a049932c..e8b0f710 100644 --- a/notebooks/README.md +++ b/notebooks/README.md @@ -53,7 +53,7 @@ For tutorials focused on effective use of all manner of `SELECT` statements in A |---|---|---| |[Learn the basics of Druid SQL](./03-query/00-using-sql-with-druidapi.ipynb)|An introduction to the unique aspects of Druid SQL.|`druid-jupyter`| |[GROUP BY](./03-query/01-groupby.ipynb)|Use GROUP BY in various forms to aggregate your data.|`druid-jupyter`| -|[COUNT DISTINCT](./03-query/03-approx-count-distinct.ipynb)|Work through approximate and accurate ways of counting unique occurences of data.|`druid-jupyter`| +|[COUNT DISTINCT](./03-query/03-approx-count-distinct.ipynb)|Work through approximate and accurate ways of counting unique occurrences of data.|`druid-jupyter`| |[SQL API](./03-query/12-query-api.ipynb)|See examples of getting results from the Druid SQL API directly.|`druid-jupyter`| |[TopN approximation](./03-query/02-approx-ranking.ipynb)|Understand the approximation used for GROUP BY queries with ORDER BY and LIMIT.|`druid-jupyter`| |[Analyzing data distributions](./03-query/04-approx-distribution.ipynb)|Use approximation to estimate quantiles, ranks, and histograms.|`druid-jupyter`| @@ -70,7 +70,7 @@ For tutorials focused on effective use of all manner of `SELECT` statements in A |[PIVOT and UNPIVOT](./03-query/15-pivot-unpivot.ipynb)|Use PIVOT to convert row values into columns. 
Use UNPIVOT to convert column values into rows.|`druid-jupyter`|
 |[Query from Deep Storage](./03-query/14-sync-async-queries.ipynb)|Use asynchronous queries to access data directly from deep storage.|`all-services`|
 |[Exporting data (experimental)](./03-query/16-export-msq-query.ipynb)|Walk through using INSERT INTO EXTERN to export query results.|`druid-jupyter`|
-|[Retention load rules](./03-query/20-tiering-historicals.ipynb)|Using load rules to prescribe up how much data is cached on historicals, including when used with multiple tiers.|`druid-jupyter-tiered-hist`|
+|[Retention load rules](./03-query/20-tiering-historicals.ipynb)|Using load rules to prescribe how much data is cached on historicals, including when used with multiple tiers.|`tiered-druid-jupyter`|
 
 ## Operations
 
From 25775c79e61c911f5443e154460567a0d0635e2b Mon Sep 17 00:00:00 2001
From: Hugh Evans
Date: Fri, 20 Sep 2024 15:49:41 +0100
Subject: [PATCH 12/12] Jill's edit

---
 .../03-query/20-tiering-historicals.ipynb     | 48 +++++++++----------
 1 file changed, 24 insertions(+), 24 deletions(-)

diff --git a/notebooks/03-query/20-tiering-historicals.ipynb b/notebooks/03-query/20-tiering-historicals.ipynb
index 5ecabb42..20fd8405 100644
--- a/notebooks/03-query/20-tiering-historicals.ipynb
+++ b/notebooks/03-query/20-tiering-historicals.ipynb
@@ -27,7 +27,7 @@
 "\n",
 "As an Apache Druid cluster administrator, you can use [Service tiering](https://druid.apache.org/docs/latest/operations/mixed-workloads#service-tiering) to provision cluster resources according to various performance and storage requirements. For example, you can isolate resource-intensive queries with complex subqueries and large result sets from high priority, interactive queries.\\n\"\n",
 "\n",
- "This tutorial demonstrates how to create different [historical tiers (https://druid.apache.org/docs/latest/operations/mixed-workloads#historical-tiering) to host different data based upon data age. 
Then, when you run a query, Druid uses the service that host the data corresponding to the time period within the query."
+ "This tutorial demonstrates how to create [historical tiers](https://druid.apache.org/docs/latest/operations/mixed-workloads#historical-tiering) to host different data based upon data age. Then, when you run a query, Druid uses the service that hosts the data corresponding to the time period within the query."
 ]
 },
 {
@@ -128,7 +128,7 @@
 "source": [
 "### Create some helper functions\n",
 "\n",
- "Run the next cell to set up a standard piece of SQL that you will use in this notebook. It uses the `server_segments`, `servers`, and `segments` tables to produce a list of segments and where they have been cached."
+ "Run the next cell to set up a standard piece of SQL that you will use in this notebook. It uses the `server_segments`, `servers`, and `segments` tables to produce a list of segments and where they're cached."
 ]
 },
 {
@@ -163,7 +163,7 @@
 "source": [
 "## Create a table using batch ingestion\n",
 "\n",
- "In this section, you use batch ingestion to create a table with data spanning a few years. Then identify where the data has been cached for querying."
+ "In this section, you use batch ingestion to create a table with data spanning a few years, then identify where the data is cached for querying."
 ]
 },
 {
@@ -221,7 +221,7 @@
 "source": [
 "### Inspect the servers and current configuration\n",
 "\n",
- "Query the servers system table to see available historicals their assigned tiers."
+ "Query the `servers` system table to see available Historicals and their assigned tiers."
]
 },
 {
@@ -247,10 +247,10 @@
 "source": [
 "You will see that there are multiple historical servers.\n",
 "\n",
- "- One historical belongs to the default tier of `_default_tier`.\n",
- "- One historical belongs to the `slow` tier.\n",
+ "- One belongs to the default tier `_default_tier`.\n",
+ "- One belongs to the `slow` tier.\n",
 "\n",
- "> If you do not see multiple servers on multiple tiers, stop now.\n",
+ "> If you don't see multiple servers on multiple tiers, stop now.\n",
 "> See [pre-requisites](#prerequisites) for more information.\n",
 "\n",
 "Run the next cell to inspect the current distribution of data using the sys tables."
 ]
@@ -271,7 +271,7 @@
 "id": "c9320078-86ab-40b7-82eb-7785e36dbfe6",
 "metadata": {},
 "source": [
- "All segments for the table, totalling around 20,000 rows, are loaded onto historicals in the `default_tier` tier.\n",
+ "Druid loads all segments for the table, totalling around 20,000 rows, onto Historicals in the `_default_tier` tier.\n",
 "\n",
 "To understand the default configuration, run the following cell to inspect the current retention load rules with the Coordinator API."
 ]
@@ -291,11 +291,11 @@
 "id": "3e597661-e6a8-4a3b-98cc-c1fc7e3bafa4",
 "metadata": {},
 "source": [
- "After creation, tables do not have any load rules of their own. Instead, they adhere to Druid's default set of rules, `_default`.\n",
+ "After creation, tables don't have any load rules of their own. Instead, they adhere to Druid's default set of rules, `_default`.\n",
 "\n",
 "By default, the `_default` rules set contains only one rule - a [load forever](https://druid.apache.org/docs/latest/operations/rule-configuration#forever-load-rule) rule (`loadForever`) with a replication factor (`tieredReplicants`) of 2 across servers in the `_default_tier`.\n",
 "\n",
- "The entire timelime of data for your table is cached on historicals in the `_default_tier`, and queries will execute there."
+ "Druid caches the entire timeline of data for your table on Historicals in the `_default_tier` and executes queries there."
 ]
 },
 {
@@ -351,7 +351,7 @@
 "source": [
 "In addition to the `_default` rule set, there is a new rule set specific to the `example-wikipedia-tiering` table.\n",
 "\n",
- "Run the next cell to see where the location of the cached data."
+ "Run the next cell to see the location of the cached data."
 ]
 },
 {
@@ -371,7 +371,7 @@
 "source": [
 "By setting up a table-specific rule set, where `tieredReplicants` includes both tiers, both the `slow` and `_default_tier` tiers have been loaded with all the segments of your table.\n",
 "\n",
- "(Run the cell above again if you do not see this immediately.)"
+ "(Run the cell above again if you don't see this immediately.)"
 ]
 },
 {
@@ -431,7 +431,7 @@
 "- An [interval load rule](https://druid.apache.org/docs/latest/operations/rule-configuration#interval-load-rule) (`loadByInterval`) covers 10 years of data before 1st January 2016, and requests one cached replica on both the `_default_tier` and `slow` tiers.\n",
 "- `loadForever` requests that all data be cached on the `slow` tier.\n",
 "\n",
- "Each segment is checked against the rules in order when the decision is made as to where it must be cached.\n",
+ "Druid checks each segment against the rules in order to decide where it must be cached.\n",
 "\n",
 "Run the cell to commit it to the database."
 ]
@@ -490,7 +490,7 @@
 "* Data for 2015 (`loadByInterval`) is only cached on the `slow` tier.\n",
 "* All other data (`loadForever`) is only available on the `_default_tier` tier.\n",
 "\n",
- "Re-run the cell above if you do not see this immediately.\n",
+ "Re-run the cell above if you don't see this immediately.\n",
 "\n",
 "Run the following query on the table."
]
@@ -520,7 +520,7 @@
 "id": "ea52acbb-826c-4022-ab8a-5d79fc2eda0a",
 "metadata": {},
 "source": [
- "Consider that, because of [time partitioning](https://druid.apache.org/docs/latest/multi-stage-query/concepts#partitioning-by-time), some parts of this query were calculated on the `slow` historical tier, and some were calculated on historicals in the `_default_tier` tier."
+ "Consider that, because of [time partitioning](https://druid.apache.org/docs/latest/multi-stage-query/concepts#partitioning-by-time), some parts of this query were calculated on the `slow` historical tier, and some were calculated on Historicals in the `_default_tier` tier."
 ]
 },
 {
@@ -647,9 +647,9 @@
 "\n",
 "Review the rules below, and then run the cell to apply it to the table. This uses a mixture of all the retention load rules you have seen so far.\n",
 "\n",
- "* A period load rule, which fires first, catches all data younger than a year, and loads it on `_default_tier` historicals.\n",
- "* An interval load rule, catcheing data for 2015 and loading this to historicals in the `slow` tier only.\n",
- "* A final forever load rule, that, since it has no `tieredReplicants` and `useDefaultTierForNull` is `false`, ensures none of the remaining data is cached on historicals.\n",
+ "* A period load rule, which fires first, catches all data newer than a year, and loads it on `_default_tier` Historicals.\n",
+ "* An interval load rule, catching data for 2015 and loading it onto Historicals in the `slow` tier only.\n",
+ "* A final forever load rule that, since it has no `tieredReplicants` and `useDefaultTierForNull` is `false`, ensures none of the remaining data is cached on Historicals.\n",
 "\n",
 "What do you predict will happen to data for 2016?"
]
@@ -709,13 +709,13 @@
 "id": "1f8ac22c-9fd5-4209-86c5-41fd7b4431cb",
 "metadata": {},
 "source": [
- "Notice that, depending on the period of time they cover, some table segments are loaded onto historicals and some are not.\n",
+ "Notice that, depending on the period of time they cover, Druid loads some table segments onto Historicals but not others.\n",
 "\n",
 "The order of the rules means:\n",
 "\n",
- "1. Druid loads data younger than a year onto `_default_tier` historicals.\n",
- "2. Druid loads data covering 2015 onto `slow`-tier historicals.\n",
- "3. No other data is loaded."
+ "1. Druid loads data newer than a year onto `_default_tier` Historicals.\n",
+ "2. Druid loads data covering 2015 onto `slow`-tier Historicals.\n",
+ "3. Druid loads no other data."
 ]
 },
 {
@@ -748,12 +748,12 @@
 "## Summary\n",
 "\n",
 "* All historical servers belong to a tier.\n",
- "* The default tier for all historicals is `_default`.\n",
+ "* The default tier for all Historicals is `_default_tier`.\n",
 "* Default retention rules apply to all tables.\n",
 "* Out of the box, the default retention rule set has only one rule, loading all data on to the `_default` tier.\n",
 "* There is an API endpoint for amending load rules.\n",
 "* Rule sets can be made up of a mixture of age, interval, and \"forever\" rules.\n",
- "* Rules are applied in order.\n",
+ "* Druid applies rules in the order they are listed in the retention rules configuration.\n",
 "\n",
 "## Learn more\n",
 "\n",