Apache Druid 32.0.0 contains over 220 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 52 contributors.
See the complete set of changes for additional details, including bug fixes.
Review the upgrade notes and incompatible changes before you upgrade to Druid {{DRUIDVERSION}}.
If you are upgrading across multiple versions, see the Upgrade notes page, which lists upgrade notes for the most recent Druid versions.
# Important features
This section contains important information about new and existing features.
# New Overlord APIs
APIs for marking segments as used or unused have been moved from the Coordinator to the Overlord service:
Mark all (non-overshadowed) segments of a datasource as used:
POST /druid/indexer/v1/datasources/{dataSourceName}
Mark all segments of a datasource as unused:
DELETE /druid/indexer/v1/datasources/{dataSourceName}
Mark multiple (non-overshadowed) segments as used:
POST /druid/indexer/v1/datasources/{dataSourceName}/markUsed
Mark multiple segments as unused:
POST /druid/indexer/v1/datasources/{dataSourceName}/markUnused
Mark a single segment as used:
POST /druid/indexer/v1/datasources/{dataSourceName}/segments/{segmentId}
Mark a single segment as unused:
DELETE /druid/indexer/v1/datasources/{dataSourceName}/segments/{segmentId}
As part of this change, the corresponding Coordinator APIs have been deprecated and will be removed in a future release:
POST /druid/coordinator/v1/datasources/{dataSourceName}
POST /druid/coordinator/v1/datasources/{dataSourceName}/markUsed
POST /druid/coordinator/v1/datasources/{dataSourceName}/markUnused
POST /druid/coordinator/v1/datasources/{dataSourceName}/segments/{segmentId}
DELETE /druid/coordinator/v1/datasources/{dataSourceName}/segments/{segmentId}
DELETE /druid/coordinator/v1/datasources/{dataSourceName}
The Coordinator now calls the Overlord to serve these requests.
#17545
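As an illustration, the single-segment endpoints map to requests like the following. This is a sketch, not from the release notes: it assumes an Overlord listening at localhost:8090 and a hypothetical datasource named wikipedia; substitute a real segment ID for the placeholder.

```shell
# Mark one segment of the "wikipedia" datasource as used (hypothetical host and datasource)
curl -X POST "http://localhost:8090/druid/indexer/v1/datasources/wikipedia/segments/{segmentId}"

# Mark the same segment as unused
curl -X DELETE "http://localhost:8090/druid/indexer/v1/datasources/wikipedia/segments/{segmentId}"
```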
# Realtime query processing for multi-value strings
Realtime query processing no longer considers all strings as multi-value strings during expression processing, fixing a number of bugs and unexpected failures. This should also improve realtime query performance of expressions on string columns.
This change impacts topN queries for realtime segments where rows of data are implicitly null, such as from a property missing from a JSON object.
Before this change, these were handled as [] instead of null, leading to inconsistency between processing realtime segments and published segments. When processing realtime segments, the value was treated as [], which topN ignores. After publishing, the value became null, which topN does not ignore. As a result, the same query could return different results before and after segments were persisted.
After this change, the topN engine now treats [] as null when processing realtime segments, which is consistent with published segments.
This change doesn't impact actual multi-value string columns, regardless of whether they're realtime.
#17386
# Join hints in MSQ task engine queries
Druid now supports hints for SQL JOIN queries that use the MSQ task engine. This lets queries specify the JOIN type to use at a per-join level. Join hints recursively affect subqueries.
#17541
# Changes and deprecations
# ANSI-SQL compatibility and query results
Support for the configs that let you maintain older behavior that wasn't ANSI-SQL compliant has been removed:
druid.generic.useDefaultValueForNull=true
druid.expressions.useStrictBooleans=false
druid.generic.useThreeValueLogicForNativeFilters=false
They no longer affect your query results. Only SQL-compliant non-legacy behavior is supported now.
If the configs are set to the legacy behavior, Druid services will fail to start.
If you want to continue to get the same results, you must update your queries; otherwise, your results will be incorrect after you upgrade.
For more information about how to update your queries, see the migration guide.
#17568 #17609
# Java support
Java support in Druid has been updated. We recommend that you upgrade to Java 17.
#17466
# Hadoop-based ingestion
Hadoop-based ingestion is now deprecated. We recommend that you migrate to SQL-based ingestion.
# Join hints in MSQ task engine queries
Druid now supports hints for SQL JOIN queries that use the MSQ task engine. This lets queries specify the JOIN type to use at a per-join level. Join hints recursively affect subqueries.
(#17406)
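The release notes include the following sample query, which requests a sort_merge join at the outer level and a broadcast join for the subquery (the hyphenated datasource names are quoted):

```sql
select /*+ sort_merge */ w1.cityName, w2.countryName
from
(
  select /*+ broadcast */ w3.cityName AS cityName, w4.countryName AS countryName
  from wikipedia w3 LEFT JOIN "wikipedia-set2" w4 ON w3.regionName = w4.regionName
) w1
JOIN "wikipedia-set1" w2 ON w1.cityName = w2.cityName
where w1.cityName = 'New York';
```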
# Functional area and related changes
This section contains detailed release notes separated by areas.
# Web console
# Explore view (experimental)
Several improvements have been made to the Explore view in the web console:
The time chart visualization now supports zooming and dragging, and is smarter about granularity detection
Filters have been improved with helper tables and additional context
Tiles can now be shown side by side
#17627
# Segment timeline view
The segment timeline is now more interactive and no longer forces day granularity.
#17521
# Other web console improvements
Added the expectedLoadTimeMillis segment loading metric to the web console #17359
# Ingestion
# Numbers for CSV and TSV input formats
Use the new optional config tryParseNumbers for CSV and TSV input formats to control how numbers are treated. If enabled, any numbers present in the input are parsed as follows: integer values as the long data type, and floating-point values as double.
By default, this configuration is set to false, so numeric strings are treated as strings.
#17082
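For example, a CSV inputFormat that opts in to number parsing might look like the following sketch; the surrounding ingestion spec is omitted:

```json
{
  "type": "csv",
  "findColumnsFromHeader": true,
  "tryParseNumbers": true
}
```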
# Other ingestion improvements
# SQL-based ingestion
# Streaming ingestion
# Control how many segments get merged for publishing
You can now use the maxColumnsToMerge property in your supervisor spec to specify the number of segments to merge in a single phase when merging segments for publishing. This limit affects the total number of columns present in a set of segments to merge. If the limit is exceeded, segment merging occurs in multiple phases. Druid merges at least 2 segments in each phase, regardless of this setting.
#17030
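A minimal sketch of where the property goes, assuming a Kafka supervisor spec (the value is illustrative, and required fields such as dataSchema and ioConfig are omitted):

```json
{
  "type": "kafka",
  "tuningConfig": {
    "type": "kafka",
    "maxColumnsToMerge": 1000
  }
}
```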
# Other streaming ingestion improvements
Druid now fully supports early/late rejection periods when stopTasksCount is configured and streaming tasks run longer than the configured task duration #17442
Improved segment publishing when resubmitting supervisors or when task publishing takes a long time #17509
# Querying
# Window queries
The following fields are deprecated for window queries that use the MSQ task engine: maxRowsMaterializedInWindow and partitionColumnNames. They will be removed in a future release.
#17433
# Other querying improvements
Added automatic query prioritization based on the period of the segments scanned in a query. You can set the duration threshold in ISO format using druid.query.scheduler.prioritization.segmentRangeThreshold #17009
Improved error handling for incomplete queries: a trailer header now indicates an error #16672
Improved scan queries to account for column types in more situations #17463
Improved lookups so that they can now iterate over fetched data #17212
Improved projections so that they can contain only aggregators and no grouping columns #17484
Removed microseconds as a supported unit for EXTRACT #17247
# Cluster management
# Reduce metadata IO
The Overlord runtime property druid.indexer.tasklock.batchAllocationReduceMetadataIO can help reduce IO during batch segment allocation. Setting this flag to true (the default) allows the Overlord to fetch only necessary segment payloads during segment allocation.
#17496
# Other cluster management improvements
Druid can now run non-G1 Garbage Collectors with JAVA_OPTS #17078
You no longer have to configure a temporary storage directory on the Middle Manager for durable storage or exports. If it isn't configured, Druid uses the task directory #17015 #17335
Improved autoscaling on supervisors so that tasks don't get published needlessly #17335
Improved recovery time for Overlord leadership after ZooKeeper nodes are bounced #17535
Improved Druid's resilience to service leadership changes caused by ZooKeeper outages #17546
Removed the following unused Coordinator dynamic configs: mergeBytesLimit and mergeSegmentsLimit #17384
# Data management
# Sorting columns for compaction with the MSQ task engine (experimental)
Compaction that uses the MSQ task engine now supports sorting segments with non-time columns. If forceSegmentSortByTime is set in the compaction config or the inferred schema, Druid does the following:
Skips adding __time explicitly as the first column of the dimension schema, since it already comes as part of the schema
Ensures that column mappings propagate __time in the order specified by the schema
Sets forceSegmentSortByTime in the MSQ query context
# Other data management improvements
# Metrics and monitoring
# New metrics for GroupByStatsMonitor
Druid now emits the following metrics for GroupBy queries:
mergeBuffer/used: Number of merge buffers used
mergeBuffer/acquisitionTimeNs: Total time required to acquire merge buffers
mergeBuffer/acquisition: Number of queries that acquired a batch of merge buffers
groupBy/spilledQueries: Number of queries that spilled onto disk
groupBy/spilledBytes: Number of bytes spilled to disk
groupBy/mergeDictionarySize: Size of the merging dictionary
#17360
# CgroupV2 monitors (experimental)
The following monitors for cgroupv2 are now available:
org.apache.druid.java.util.metrics.CgroupV2CpuMonitor
org.apache.druid.java.util.metrics.CgroupV2DiskMonitor
org.apache.druid.java.util.metrics.CgroupV2MemoryMonitor
#16905
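To enable one of these monitors, add it to the monitor list in your runtime properties. A sketch (choose the monitors relevant to your deployment):

```properties
druid.monitoring.monitors=["org.apache.druid.java.util.metrics.CgroupV2CpuMonitor", "org.apache.druid.java.util.metrics.CgroupV2MemoryMonitor"]
```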
# Other metrics and monitoring improvements
Added the ingest/notices/queueSize, ingest/notices/time, and ingest/pause/time metrics to the statsd emitter #17487 #17468
Added duty group as a dimension for the coordinator.global.time metric for the statsd emitter #17320
The service/heartbeat metric now reports the status on the Peon #17488
Changed real-time segment metrics so that they are emitted for each Sink instead of for each FireHydrant. This is a return to the emission behavior prior to the real-time query performance improvements made in 30.0.0 #17170
Changed query stats to appear before intervals in getNativeQueryLine logging so that the stats are retained if the query object gets truncated #17326
# Extensions
# Delta Lake
The Delta Lake input source now supports the decimal data type, which is handled as a double. If a value cannot fit within a double, it is ingested as a string #17376
You can now filter by snapshot version even if the version is 0 #17367
# gRPC queries
A new contributor extension enables a gRPC API for SQL and native queries, so gRPC-based clients can use the extension to issue queries. Use this extension for simple queries.
For more information, see gRPC query extension for Druid.
#15982
# Kubernetes
Improvements related to the getMaximumCapacity field #17107
# Iceberg
The Iceberg extension now supports the AWS Glue Iceberg catalog.
#17392
# Documentation improvements
# Upgrade notes and incompatible changes
# Upgrade notes
# Front-coded dictionaries
In Druid 32.0.0, you can configure Druid to use the front-coded dictionaries feature during segment creation. Front-coded dictionaries reduce storage and improve performance by optimizing for strings that share similar prefixes.
Once this feature is on, you cannot easily downgrade to an earlier version that does not support the feature.
For more information, see Migration guide: front-coded dictionaries.
If you're already using this feature, you don't need to take any action.
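If you choose to enable the feature, front coding is configured through the segment indexSpec. A minimal sketch, assuming the stringDictionaryEncoding option as documented for earlier Druid releases (the bucketSize value is illustrative):

```json
{
  "tuningConfig": {
    "indexSpec": {
      "stringDictionaryEncoding": {
        "type": "frontCoded",
        "bucketSize": 4
      }
    }
  }
}
```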
# Incompatible changes
# Developer notes
Improved dependency support between extensions. When an extension depends on another extension, it now tries to use the dependency's class loader to find required classes #16973
# Dependency updates
The following dependencies have had their versions bumped:
javax.xml.bind is no longer used. Druid now uses jakarta.xml.bind #17370
Removed the file-loader dependency for the web console #17346
Updated git-commit-id-maven-plugin from 4.9.10 to 9.0.1 #17571
Updated from 4.1.108.Final to 4.1.116.Final
# Credits
@317brian
@a2l007
@abhishekagarwal87
@abhishekrb19
@adarshsanjeev
@adithyachakilam
@aho135
@Akshat-Jain
@anny-imply
@ashibhardwaj
@ashwintumma23
@benhopp
@bsmithgall
@cecemei
@clintropolis
@cryptoe
@edgar2020
@ektravel
@findingrish
@gargvishesh
@georgew5656
@gianm
@GWphua
@hardikbajaj
@jgoz
@jtuglu-netflix
@kfaraz
@kgyrtkirk
@kirangadhave-imply
@LakshSingla
@lkm
@naanagon
@nozjkoitop
@Pankaj260100
@petermarshallio
@pranavbhole
@rohangarg
@Shekharrajak
@shigarg1
@sreemanamala
@suneet-s
@suraj-goel
@techdocsmith
@TessaIO
@TSFenwick
@vivek807
@vogievetsky
@vtlim
@weishiuntsai
@writer-jill
@zachjsh
@Zeyu-Chen-SFDC