Merge pull request #129 from implydata/202412-clickstream
202412 clickstream guide
petermarshallio authored Dec 12, 2024
2 parents f5be7f4 + 47cade1 commit 952fdab
Showing 13 changed files with 374 additions and 4 deletions.
20 changes: 20 additions & 0 deletions notebooks/01-introduction/01-clickstream/01-connect.md
@@ -0,0 +1,20 @@
# Connect to clickstream data

Event hubs allow clickstream data to be collected and transported to Druid quickly, and this is a very common implementation pattern.

* Learn about streaming ingestion and try the [notebook](../../02-ingestion/01-streaming-from-kafka.ipynb).
* Learn about [multi-topic Apache Kafka streaming ingestion](https://druid.apache.org/docs/latest/ingestion/kafka-ingestion#ingest-from-multiple-topics) and try the [notebook](../../02-ingestion/11-stream-from-multiple-topics.ipynb).

Very busy websites create very large volumes of data. When Druid’s ingestion throughput is too low, pressure builds up on the ingestion side, resulting in visitor actions not being available for query quickly enough.

* Monitor [ingestion lag metrics](https://druid.apache.org/docs/latest/operations/metrics/#ingestion-metrics).
* Partition the stream and balance it across enough [ingestion tasks](https://druid.apache.org/docs/latest/ingestion/supervisor#io-configuration) using `ioConfig/taskCount`. Ideally, the number of event hub partitions should be a multiple of the task count - see the sketch after this list.
* Check the balance of each incoming stream partition and, if necessary, force balance upstream.
* Understand the effect of [resetting supervisors](https://druid.apache.org/docs/latest/api-reference/supervisor-api#reset-a-supervisor).
* [Read about](https://druid.apache.org/docs/latest/ingestion/supervisor#tuning-configuration) automatic supervisor resets using `tuningConfig/resetOffsetAutomatically`.
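
A minimal sketch, assuming a Kafka event hub, of how `ioConfig/taskCount` relates to stream partitions; the topic name, broker address, and partition count are hypothetical.

```python
# Fragment of a Kafka supervisor spec's ioConfig (hypothetical topic and broker).
# With 8 stream partitions, a taskCount of 4 gives each ingestion task exactly
# two partitions to read, keeping the tasks evenly loaded.
io_config = {
    "type": "kafka",
    "topic": "clickstream",                # hypothetical topic name
    "consumerProperties": {
        "bootstrap.servers": "kafka:9092"  # hypothetical broker address
    },
    "taskCount": 4,                        # 8 partitions / 4 tasks = 2 partitions per task
    "taskDuration": "PT1H"
}
```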

---

[index](README.md) [next](02-transform.md)

---
36 changes: 36 additions & 0 deletions notebooks/01-introduction/01-clickstream/02-transform.md
@@ -0,0 +1,36 @@
# Transforming clickstream data

Clickstream data often contains numeric identifiers. These are often the basis of filtering and grouping queries.

* Consider transforming numeric identifiers into strings with bitmap indexes in Druid by using `type` in your list of [dimension objects](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dimension-objects).
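
As a hedged illustration, a `dimensionsSpec` fragment like the one below declares hypothetical identifier columns as strings so that Druid builds bitmap indexes for them.

```python
# Hypothetical dimensionsSpec fragment: numeric identifiers from the source are
# declared as string dimensions so Druid stores them with bitmap indexes,
# speeding up filtering and grouping on those columns.
dimensions_spec = {
    "dimensions": [
        {"type": "string", "name": "visitor_id"},
        {"type": "string", "name": "product_id"},
        {"type": "long", "name": "response_bytes"}  # left numeric: summed, not filtered on
    ]
}
```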

Clickstream events often contain data that technical users and machines understand but business users do not.

* Read about [key-value lookup tables](https://druid.apache.org/docs/latest/querying/datasource#lookup).
* Learn about the [query-time](https://druid.apache.org/docs/latest/querying/math-expr#string-functions) `LOOKUP` SQL function - an example follows this list.
* Read about real-time [Apache Kafka key-value lookup tables](https://druid.apache.org/docs/latest/querying/kafka-extraction-namespace) and try the notebook.
* Consider front-loading lookups using the native `lookup` [string function](https://druid.apache.org/docs/latest/querying/math-expr#string-functions) at ingestion time.
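
A hedged example of the query-time `LOOKUP` function; the `country_names` lookup and the table and column names are hypothetical.

```python
# Replace a technical country code with a business-friendly name at query time
# using a registered key-value lookup (here called "country_names").
sql = """
SELECT
  LOOKUP(country_code, 'country_names') AS country,
  COUNT(*) AS page_views
FROM "clickstream"
GROUP BY 1
ORDER BY page_views DESC
"""
```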

Avoid ingesting low-value rows from the source system. When a number of sites all run on the same server, for example, you may only need actions on a specific host. Or perhaps you are building a UI for a security operations team that only needs 500 errors. Carefully consider whether a table needs to contain _all_ action events, or only specific actions, such as conversions.

* Use filters in the `transformSpec` to front-load `WHERE` operations, reducing storage cost and increasing query performance.
* Consider applying filters upstream, reducing Druid ingestion infrastructure requirements, especially when filters begin to throw away more than 25% of rows.
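
A minimal sketch of a `transformSpec` filter that front-loads a `WHERE`; the host value is hypothetical.

```python
# Only rows for one host of interest are ingested; everything else is dropped
# before it ever reaches storage.
transform_spec = {
    "filter": {
        "type": "selector",
        "dimension": "host",
        "value": "shop.example.com"   # hypothetical host to keep
    }
}
```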

Clickstream events often include entity dimensions that need to be parsed to make them useful to end users. While this can be done at query time, the most performant work is the work you don’t do at all.

* Learn about [native ingestion-time functions](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#transformspec) available through the `transformSpec` and try out the [notebook](../../02-ingestion/13-native-transforms.ipynb) - consider whether some functions destined for query-time could be applied at ingestion-time, either to new columns or replacing existing ones.
* Consider which of the functions are valuable to downstream systems other than Druid, and could be applied upstream.
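
A hedged sketch of ingestion-time transforms; the source field names are hypothetical.

```python
# Two expression transforms applied as rows arrive: extract the path portion of
# a URL into a new column, and normalize a channel code into upper case.
transform_spec = {
    "transforms": [
        {
            "type": "expression",
            "name": "url_path",
            "expression": "regexp_extract(\"url\", '^[^?]+')"
        },
        {
            "type": "expression",
            "name": "channel",
            "expression": "upper(\"channel_code\")"
        }
    ]
}
```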

Clickstream data can be particularly granular and noisy. Optimize your storage and processing speeds by establishing a minimum required time granularity for your UX, and consider emitting aggregated versions of dimension values rather than raw. You might do this for your main table, or as an additional table.

* Read about ingestion-time [rollup](https://druid.apache.org/docs/latest/ingestion/rollup/) and, if you're using a streaming source, try the [notebook](../../02-ingestion/16-native-groupby-rollup.ipynb) on rollup with streaming.
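
A minimal `granularitySpec` sketch: raw events roll up to one row per minute per dimension combination, on the assumption that the UX never needs finer resolution than a minute.

```python
granularity_spec = {
    "type": "uniform",
    "segmentGranularity": "DAY",   # one segment time chunk per day
    "queryGranularity": "MINUTE",  # truncate timestamps to the minute during rollup
    "rollup": True
}
```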

COUNT DISTINCT operations are very common in clickstream analytics. Provide a speed boost to Druid's approximate query results by using Druid's support for embedded Apache DataSketches created during rollup.

* Try the [notebook](../../02-ingestion/03-generating-sketches.ipynb) on generating sketches.
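
A hedged `metricsSpec` sketch, assuming the druid-datasketches extension is loaded and a hypothetical `visitor_id` column: a Theta sketch built during rollup supports fast approximate COUNT DISTINCT and set operations at query time.

```python
metrics_spec = [
    {"type": "count", "name": "events"},
    {"type": "thetaSketch", "fieldName": "visitor_id", "name": "visitor_sketch"}
]
```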

---

[back](01-connect.md) [index](README.md) [next](03-optimize.md)

---
19 changes: 19 additions & 0 deletions notebooks/01-introduction/01-clickstream/03-optimize.md
@@ -0,0 +1,19 @@
# Optimizing clickstream data

Clickstream data very often has a fluid schema. One of the objectives of clickstream analytics is to improve the experience upstream, so new attributes and measures can – and will – be added all the time to help analyze that experience.

* Plan change control for ingestion specifications.
* Plan ahead for data governance.
* Read about [schema evolution](https://druid.apache.org/docs/latest/data-management/schema-changes).
* Learn about [automatic schema detection](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#dimensionsspec), dimension inclusions and exclusions, and try out the [notebook](../../02-ingestion/15-native-dimensions.ipynb).
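
A sketch of a `dimensionsSpec` that combines automatic schema discovery with a few explicitly typed columns (the column name is hypothetical).

```python
dimensions_spec = {
    "useSchemaDiscovery": True,                    # detect new attributes as they appear
    "dimensions": [
        {"type": "string", "name": "visitor_id"}   # explicitly typed, always present
    ]
}
```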

Applying a secondary partitioning scheme can speed up queries by providing an additional layer of pruning before queries execute. Clickstream deployments can be multitenant, serving queries about multiple websites from a single table, for example, meaning that partitioning on a website identifier can be beneficial. Or 90% of queries might concern a particular visitor attribute.

* Read about [secondary partitions](https://druid.apache.org/docs/latest/ingestion/partitioning#secondary-partitioning) and try the notebook on [applying secondary partitioning using compaction](../../05-operations/04-compaction-partitioning.ipynb).
* Read about how Druid [sorts data](https://druid.apache.org/docs/latest/ingestion/partitioning#sorting) inside segments and test whether an explicit order in your `dimensions` list improves query performance.
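
A hedged sketch of a range `partitionsSpec`, as might be used in compaction or batch ingestion, assuming a hypothetical `site` dimension that most queries filter on.

```python
partitions_spec = {
    "type": "range",
    "partitionDimensions": ["site"],   # secondary partition on the website identifier
    "targetRowsPerSegment": 5000000
}
```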

---

[back](02-transform.md) [index](README.md) [next](04-query.md)

---
42 changes: 42 additions & 0 deletions notebooks/01-introduction/01-clickstream/04-query.md
@@ -0,0 +1,42 @@
# Querying clickstream data

Clickstream user experiences often include [calendars](https://datavizcatalogue.com/methods/calendar.html), [timetables](https://datavizcatalogue.com/methods/timetable.html), and [line graphs](https://datavizcatalogue.com/methods/line_graph.html). These require GROUP BY on a time field, often scanning many millions of rows to create the resulting datasets.

* Use Druid's `__time` functions to intelligently return fewer rows than can fit onto a user's screen. Read about [SQL time functions](https://druid.apache.org/docs/latest/querying/sql-scalar#date-and-time-functions) and try the [notebook](../../03-query/07-functions-datetime.ipynb).
* Always use a time filter, giving careful consideration to the number of segments that will be scanned (influenced by `segmentGranularity`) and thus the number of threads required to complete the computation.
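
A hedged example combining both points: `TIME_FLOOR` returns one row per hour instead of one per event, and the time filter limits the scan to the segments covering the last day. Table and column names are hypothetical.

```python
sql = """
SELECT
  TIME_FLOOR(__time, 'PT1H') AS "hour",
  COUNT(*) AS page_views
FROM "clickstream"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY 1
ORDER BY 1
"""
```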

Clickstream data often requires calculation of COUNT DISTINCT, for example on visitor identifiers.

* Learn about how Druid uses Apache Datasketches for automatic approximation by reading the [documentation](https://druid.apache.org/docs/latest/querying/sql-translation#approximations).
* Try the notebooks on [ranking](../../03-query/02-approx-ranking.ipynb), [count distinct](../../03-query/03-approx-count-distinct.ipynb), and [distribution](../../03-query/04-approx-distribution.ipynb).
* Watch this [video on approximation](https://youtu.be/fSWwJs1gCvQ?list=PLDZysOZKycN7MZvNxQk_6RbwSJqjSrsNR) in Druid.
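
A hedged example of approximate distinct counting over a hypothetical `visitor_id` column; the same function also accepts a Theta sketch column built at ingestion time.

```python
# Requires the druid-datasketches extension to be loaded.
sql = """
SELECT
  TIME_FLOOR(__time, 'P1D') AS "day",
  APPROX_COUNT_DISTINCT_DS_THETA(visitor_id) AS unique_visitors
FROM "clickstream"
GROUP BY 1
"""
```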

Set analysis is very common in clickstream analytics, sometimes using high-cardinality data.

* Learn about the limitations of the interactive API's execution model on handling multiple subqueries for resolving sets in [this video](https://youtu.be/chnZmngXMsQ?list=PLDZysOZKycN7MZvNxQk_6RbwSJqjSrsNR). Play to the strengths of this API by using Theta sketches to perform approximate set analysis.
* Use the MSQ API, adequately resourced, for asynchronous queries that perform complex set operations and do not need to be interactive.
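
A hedged sketch of approximate set intersection with Theta sketches, answering "how many distinct visitors saw both the landing page and the checkout page"; table, column, and page values are hypothetical.

```python
# Requires the druid-datasketches extension to be loaded.
sql = """
SELECT THETA_SKETCH_ESTIMATE(
  THETA_SKETCH_INTERSECT(
    DS_THETA(visitor_id) FILTER (WHERE page = '/landing'),
    DS_THETA(visitor_id) FILTER (WHERE page = '/checkout')
  )
) AS visitors_who_saw_both
FROM "clickstream"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
"""
```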

The Druid database is tuned for immutable event data where all dimension values are considered true at the point in time given in the primary timestamp. Sessions, on the other hand, are long-lasting, often vaguely defined entities where not all values are known until the session ends - for example, the session length.

While it is possible to craft SQL to calculate session entity data, such queries tend to become complex, with multiple sub-queries. This demands that the correct processing engine be used.

* Avoid queries with multiple JOIN and sub-query elements when using the interactive API.
* Consider the standard practice of using a stream processor, for example Apache Flink, to emit a "sessions" event stream containing event-compliant, finished sessions. Druid can consume from this quickly, and users can perform analysis without wastefully engaging the database repeatedly.
* If real-time session data is desired, test applying event architecture practices by emitting deltas to a "session_deltas" topic. Druid can then consume from this topic, reconstituting the very latest session state over a period of time through a GROUP BY operation.
* Use the asynchronous MSQ API to calculate data offline, retrieving it directly or using it to populate another table.
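
A hedged sketch of the delta approach: reconstituting the latest state of each open session from a hypothetical `session_deltas` table with a single GROUP BY.

```python
sql = """
SELECT
  session_id,
  LATEST(device_type, 1024) AS device_type,   -- most recent value wins
  SUM(pages_added)          AS pages_so_far,  -- running total across deltas
  MAX(__time)               AS last_seen
FROM "session_deltas"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '6' HOUR
GROUP BY session_id
"""
```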

Session journeys are often broken down into stages, beginning with when a visitor arrives and ending when they have achieved a goal. These stages are represented as funnels. To solve this problem, a database must count the number of visitors who passed through each stage of the funnel - and the SQL behind this can grow to be extremely gnarly and computationally expensive.

Sometimes it's imperative that the UX receives sub-second query results - perhaps visitor demographic filters will be applied to the funnel interactively by end users, or an interactive time comparison is required. In this case, the interactive API needs to be used respectfully, meaning that JOIN and sub-query operations must be avoided.

When funnel stage membership is _implicitly_ in _time order_, here are some approaches to investigate:

* Use standard set analytics through [Theta sketches](https://druid.apache.org/docs/latest/querying/sql-scalar#theta-sketch-functions), replacing intersection, difference, and union subqueries with approximate operations. Try out the associated [notebook](../../03-query/03-approx-count-distinct.ipynb).
* Add dimensions, one for each funnel stage, containing a 1 or a 0 to indicate whether that funnel stage / goal was achieved, for example "impression", "click", and "conversion". To calculate the funnel, execute a SUM over each of these columns, as sketched after this list. Consider, though, the impact that changing the funnel stages might have.
* Emit the final funnel stage achieved in your session data. A COUNT can then be taken, GROUP BY the last funnel stage, to give the data required to create the funnel. This is easier to adapt than fixed dimensions.
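
A hedged sketch of the flag-per-stage approach: each row carries hypothetical 0/1 columns for the stages a visitor reached, and the funnel becomes a single pass of SUMs with no joins or subqueries.

```python
sql = """
SELECT
  SUM(impression) AS impressions,
  SUM("click")    AS clicks,
  SUM(conversion) AS conversions
FROM "clickstream"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
"""
```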

---

[back](03-optimize.md) [index](README.md) [next](05-manage.md)

---
12 changes: 12 additions & 0 deletions notebooks/01-introduction/01-clickstream/05-manage.md
@@ -0,0 +1,12 @@
# Managing clickstream data

With many thousands of producers all generating data across multiple networks comes an increased likelihood that action events will arrive late or out of order.

* Read about [segment optimization](https://druid.apache.org/docs/latest/operations/segment-optimization) and use this to inform the selection of both your [primary timestamp](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#timestampspec) in `timestampSpec` and the [primary partitioning period](https://druid.apache.org/docs/latest/ingestion/ingestion-spec#granularityspec) using `granularitySpec/segmentGranularity`.
* Learn about [compaction](https://druid.apache.org/docs/latest/data-management/compaction) and try the [notebook](../../05-operations/04-compaction-partitioning.ipynb).
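
A hedged sketch of an automatic compaction configuration: late and out-of-order arrivals tend to leave small, fragmented segments behind, and compaction rewrites older periods into well-sized segments. The datasource name is hypothetical.

```python
compaction_config = {
    "dataSource": "clickstream",
    "skipOffsetFromLatest": "P1D",   # leave the newest day alone while it is still changing
    "tuningConfig": {
        "partitionsSpec": {
            "type": "dynamic",
            "maxRowsPerSegment": 5000000
        }
    }
}
```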

---

[back](04-query.md) [index](README.md)

---
102 changes: 102 additions & 0 deletions notebooks/01-introduction/01-clickstream/README.md
@@ -0,0 +1,102 @@
# Clickstream guide

The Apache Druid database can be used to gain real-time insight into activity taking place on a web application or website.

* [Introduction to clickstream](#introduction)
* [Connecting to clickstream](01-connect.md)
* [Transforming clickstream](02-transform.md)
* [Optimizing clickstream](03-optimize.md)
* [Querying clickstream](04-query.md)
* [Managing clickstream data](05-manage.md)

## Introduction

As website visitors, we are all being coaxed toward achieving a particular goal that the owning organization has in mind for us, whether that's placing an order for a product, signing up for a service, or passing on contact information for sales leads.

Clickstream analytics concerns questions like:

* What types of people visit our site most?
* Which visitors achieve high value goals?
* What brought someone to our website?
* What things take our visitors longest to achieve?

These are important questions for a number of organizations, including auction websites, news publishers, and video streaming services.

> “Clickstream analytics puts us closer to our users, and if you know what your users want, you are better able to serve them.”

For news publishers, the front page remains the most important real estate on a website. Clickstream analytics helps advertising campaign managers understand whether the right demographic is reaching this front page as a campaign runs - and whether the campaign has resulted in longer sessions on the site overall.

* What kind of people are landing on our product page?
* Where are most of our visitors from?
* What are people's usage habits?
* Do we have return visitors?
* How long does it take for someone to buy our services?
* What questions are people asking about our products when they arrive?

### Challenges

Collecting and analyzing clickstream data is difficult because:

* It's hard to collect the data.
  * Data can come in from multiple sources, so a database needs to be able to draw from multiple real-time and batch streams - both at ingestion and query time.
  * Data volume can be very large, so a database must not only be able to scale with the ingestion velocity, but index and compress the data efficiently both for long term storage and speedy computation.
* It's hard to query the data.
  * The websites, mobile apps, and other channels change over time. A database needs to be adaptable to changing integrations and schemas.
  * Filtering and statistical needs of users are often unpredictable, or lead to a large number of reports that have to be maintained. A database needs to be able to cope with a range of query profiles that can be executed in a number of different ways depending on needs.
  * A broad number of clickstream queries concern distinct counts, especially of visitors. Databases need to have ways to calculate distinct counts quickly.
  * Statistics very often concern intersections, unions, and differences. For example, visitors who used a channel today but not in the last 30 days. A database needs to be able to carry out set operations quickly enough for the answers to be relevant.

The index for this guide directs you to functionality in Apache Druid that many adopters are using to solve these challenges.

### Entities and events

Clickstream describes a person's journey through a website. As a person goes through the site, each interaction causes an event to be emitted. Events can concern a number of different entities:

| Entity | Definition | Value |
| --- | --- | --- |
| Visitor | The user themselves. | Driving loyalty. Informing advertising and promotion strategies. Testing churn-reducing tactics. |
| Page | A place in a channel where a visitor can perform actions. | Changing the products or services shown depending on their journey so far. |
| Action | Things that the visitor did on a page, also known as a click or a hit. | Building up a picture of common journeys. Measuring system responsiveness. |
| Session | The user's journey. It might include data about what happened before the journey started (like the referring site), and summarize information about the number of pages visited. | Personalizing "grazing and hunting" experiences. Improving navigation. Attempting to build "first visit, first buy" to prevent attrition. |

During a visit, a _conversion_ might occur. This is when a visitor took some action that amounts to them achieving a goal on the site. Each conversion is assigned a _conversion value_ depending on how important achieving that goal is to the site owners.

Clickstream analytics concerns the entire customer lifecycle: how the attention of the _visitor_ was grabbed, what _user actions_ took place to achieve _conversion_ through a _session_ - no matter the channel - and how that _visitor_ was made loyal.

### Pipeline

Website-hosting technologies like [Apache HTTP Server](https://httpd.apache.org/) and [Microsoft Internet Information Services](https://www.iis.net) emit server logs, like [W3C Extended Log Files](https://en.wikipedia.org/wiki/Extended_Log_Format) and [NCSA Common Format Log Files](https://en.wikipedia.org/wiki/Common_Log_Format). Logs are often the starting point for clickstream analytics and contain information like:

* The date and time of a web request.
* The IP address of the server itself.
* The HTTP method being requested (`POST`, `GET`, and so on).
* The address of the content.
* The IP address of the requesting client.
* What the client is (the "user agent").
* The number of bytes returned.

Network logs might also be used, whether passively or proactively using packet sniffing.

Log data like this is very often used to meet OLAs and SLAs, and to prevent and investigate security incidents.

Additional clickstream data comes from code embedded on the client (visitor) side. Client-side event generation is richer, extending analytics into the business arena.

* JavaScript code.
* Pixels.
* Embedded components.

Events are typically added to, and read from, an event hub such as Apache Kafka, Amazon Kinesis, or Azure Event Hub. Druid can ingest this data directly, making it queryable on arrival.

In an omnichannel business, data might also come from telephone interactions, mobile app stats, social media interactions, and even physical store transactions.

Clickstream event data often needs additional processing to enrich or even create the entities entirely.

* Clickstream data is stateless: it's often very easy to know when a _session_ started, but not so easy to know when a session ended.
* Clickstream data is often anonymous, so _visitor_ data needs to be enriched by joining to an internal database and / or to online enrichment services.

Druid ingestion-time transformations allow for row-wise functions to be applied to data as it arrives. More complex enrichment and processing is possible in batch using MSQ. Examples are given in the main portion of the guide. More often than not, Druid's transformations are combined with another technology, like Apache Flink, for more complex operations.

* For _session_ analysis, enough time must elapse for the session to end before certain analysis can be done, such as calculating average session length by a particular _visitor_ demographic.
* For _action_ data, the data must be made available quickly enough for decisions to be taken in a timely manner. For example, A/B testing of a new navigation structure, or determining the effectiveness of a campaign for a flash sale.

The results of this processing may be posted into data lake technologies, or posted directly into stream event hubs for immediate analysis.
