-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add api for Retrieving unused segments #15415
Conversation
* fix checkstyle * refactor * remove deadcode
server/src/test/java/org/apache/druid/server/http/DataSourcesResourceTest.java
Fixed
Show fixed
Hide fixed
server/src/test/java/org/apache/druid/server/http/DataSourcesResourceTest.java
Fixed
Show fixed
Hide fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did an initial pass on API design
server/src/main/java/org/apache/druid/metadata/SegmentsMetadataManager.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/apache/druid/server/http/DataSourcesResource.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/apache/druid/server/http/DataSourcesResource.java
Outdated
Show resolved
Hide resolved
public Response getIntervalsWithUnusedSegmentsOrAllUnusedSegmentsPerIntervals( | ||
@PathParam("dataSourceName") String dataSourceName, | ||
@QueryParam("simple") String simple, | ||
@QueryParam("full") String full, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we include the API in the public-facing docs https://github.com/apache/druid/blob/master/docs/api-reference/legacy-metadata-api.md?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
server/src/main/java/org/apache/druid/server/http/DataSourcesResource.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/apache/druid/server/http/DataSourcesResource.java
Outdated
Show resolved
Hide resolved
@Path("/{dataSourceName}/unused/intervals") | ||
@Produces(MediaType.APPLICATION_JSON) | ||
@ResourceFilters(DatasourceResourceFilter.class) | ||
public Response getIntervalsWithUnusedSegmentsOrAllUnusedSegmentsPerIntervals( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see we're following the same API patterns in this file written over a decade ago, but I'm wondering if it'd be better to move away from the pattern where an API does multiple things and returns different shapes of responses depending on the parameters specified. For example, in this case, IMO it'd be cleaner to have two different sets of API endpoints each serving only one functionality:
getIntervalsWithUnusedSegments()
getUnusedSegmentsInInterval()
where the interval is optional similar to other parameters
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated, now there are 3 apis
server/src/test/java/org/apache/druid/server/http/DataSourcesResourceTest.java
Fixed
Show fixed
Hide fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall. Left a few suggestions and test cases to consider adding. Thanks, @zachjsh!
server/src/main/java/org/apache/druid/metadata/SQLMetadataConnector.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/apache/druid/metadata/storage/derby/DerbyConnector.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/apache/druid/metadata/SqlSegmentsMetadataQuery.java
Outdated
Show resolved
Hide resolved
@@ -3050,19 +3106,20 @@ private List<DataSegment> createAndGetUsedYearSegments(final int startYear, fina | |||
|
|||
private ImmutableList<DataSegment> retrieveUnusedSegmentsUsingMultipleIntervalsAndLimit( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this method be renamed to include offset in its name, or perhaps could use a simpler name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
server/src/test/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinatorTest.java
Outdated
Show resolved
Hide resolved
int offset = 10; | ||
List<DataSegment> expectedSegments = segments.stream().sorted((Comparator<DataSegment>) (o1, o2) -> { | ||
long o1Start = o1.getInterval().getStartMillis(); | ||
long o2Start = o2.getInterval().getStartMillis(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we actually need this custom comparator? createAndGetUsedYearSegments()
already returns a list of segments sorted by start date. So I think this should suffice:
final List<DataSegment> expectedSegments = segments.stream().skip(offset).collect(Collectors.toList());
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
Assert.assertEquals(expectedSegments.size(), actualUnusedSegments.size()); | ||
Assert.assertTrue(expectedSegments.containsAll(actualUnusedSegments)); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For test coverage, can we also please include new or extend existing tests for the following scenarios:
- non-empty interval, non-null limit and non-null offset
- multiple intervals, non-null limit and non-null offset
- multiple intervals exceeding
MAX_INTERVALS_PER_BATCH
, non-null limit and non-null offset (the offset in this case would be ignored)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
server/src/main/java/org/apache/druid/server/http/MetadataResource.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/apache/druid/server/http/MetadataResourceTest.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/apache/druid/server/http/MetadataResourceTest.java
Outdated
Show resolved
Hide resolved
Co-authored-by: Abhishek Radhakrishnan <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Should we also add a release note entry for this new API?
Iterable<DataSegment> iterateAllUnusedSegmentsForDatasource( | ||
String datasource, | ||
@Nullable Interval interval, | ||
@Nullable Integer limit, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might be a good idea to go ahead and add a sort order now too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Good suggestion. Added
* @param interval The intervals to search over | ||
* @param limit The limit of segments to return | ||
* @param offset The offset to use when retrieving matching segments. | ||
* @param orderByStartEnd Specifies the order with which to return the matching segments by start time, end time. A |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An enum type would be a better choice for the direction of order by? Similar to OrderByColumnSpec#Direction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Added new enum
For performance-reasons, I think it'd be best if we don't use Instead of an SELECT "start", "end", "id"
FROM druid_segments
WHERE dataSource = 'foo' AND "start" >= '2023-01-01' AND "end" < '2024-01-01'
ORDER BY 1, 2
LIMIT 100
OFFSET 10000 We can do a cursor-based query: SELECT "start", "end", "id"
FROM druid_segments
WHERE dataSource = 'foo' AND "start" >= '2023-01-01' AND "end" < '2024-01-01' AND id > 'foo_2023-02-24T07:00:00.000Z_2023-02-24T08:00:00.000Z_2023-11-20T23:04:26.856Z'
ORDER BY 1, 2, 3
LIMIT 100 The execution plans for these two queries are both |
Good suggestion, updated. |
@@ -248,6 +248,12 @@ Returns full segment metadata for a specific segment in the cluster. | |||
|
|||
Return the tiers that a datasource exists in. | |||
|
|||
`GET /druid/coordinator/v1/datasources/{dataSourceName}/unusedSegments?interval={interval}&limit={limit}&lastSegmentId={lastSegmentId}&sortOrder={sortOrder}` | |||
|
|||
Returns a list of unused segments for a datasource in the cluster contained within an optionally specified interval. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make the API behavior clear in the absence of all the optional parameters, maybe specify what the defaults for the no interval, no limit, etc. Also, it'd be good to include an example for these params, so a user knows what to specify for params like sortOrder
(wish we had an open API spec in Druid :)).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
/** | ||
* @return the string offset clause | ||
*/ | ||
public String getOffsetClause(int offset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused offset method, so remove it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
* iterates the returned iterable only once rather than several times. | ||
* | ||
* @param datasource the name of the datasource. | ||
* @param interval the interval to search over. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* @param interval the interval to search over. | |
* @param interval an optional interval to search over. If none is specified, {@link org.apache.druid.java.util.common.Intervals#ETERNITY} | |
* should be used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
* | ||
* @param datasource the name of the datasource. | ||
* @param interval the interval to search over. | ||
* @param limit the maximum number of results to return. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is an interface method, could you specify what the default behavior should be for these other optional parameters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
|
||
if (compareAsString) { | ||
appendConditionForIntervalsAndMatchMode(sb, intervals, matchMode, connector); | ||
} | ||
|
||
if (sortOrder != null) { | ||
sb.append(StringUtils.format(" ORDER BY start %2$s, %1$send%1$s %2$s", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To guarantee stable pagination, we should also include id
here in the ORDER BY
clause if it's specified; otherwise, the result set can be garbled between multiple pages
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, fixed
} | ||
|
||
@Test | ||
public void test_invalid() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public void test_invalid() | |
public void testInvalid() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
import java.util.stream.Collectors; | ||
|
||
/** | ||
* specifies the sort order when doing metadata store queries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* specifies the sort order when doing metadata store queries. | |
* Specifies the sort order when doing metadata store queries. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
) | ||
{ | ||
// Check if the intervals all support comparing as strings. If so, bake them into the SQL. | ||
final boolean compareAsString = intervals.stream().allMatch(Intervals::canCompareEndpointsAsStrings); | ||
|
||
final StringBuilder sb = new StringBuilder(); | ||
sb.append("SELECT payload FROM %s WHERE used = :used AND dataSource = :dataSource"); | ||
sb.append("SELECT payload FROM %s WHERE used = :used AND dataSource = :dataSource %s"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the trailing %s
? If we add the necessary sql clauses conditionally we don't need this, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
server/src/main/java/org/apache/druid/metadata/SqlSegmentsMetadataQuery.java
Show resolved
Hide resolved
.createQuery(StringUtils.format( | ||
sb.toString(), | ||
dbTables.getSegmentsTable(), | ||
lastSegmentId == null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see my comment about moving this block above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me after the last batch of comments are addressed. Thanks, @zachjsh!
@@ -3048,21 +3159,23 @@ private List<DataSegment> createAndGetUsedYearSegments(final int startYear, fina | |||
return segments; | |||
} | |||
|
|||
private ImmutableList<DataSegment> retrieveUnusedSegmentsUsingMultipleIntervalsAndLimit( | |||
private ImmutableList<DataSegment> retrieveUnusedSegmentsUsingMultipleIntervalsLimitAndOffset( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no offset anymore; maybe we can simply call this method retrieveUnusedSegments
?
} | ||
|
||
@Test | ||
public void testRetrieveUnusedSegmentsUsingNoIntervalsNoLimitNoOffset() throws IOException |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adjust the test method names since there's no offset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
null | ||
); | ||
Assert.assertEquals(segments.size(), actualUnusedSegments.size()); | ||
Assert.assertTrue(segments.containsAll(actualUnusedSegments)); | ||
} | ||
|
||
@Test | ||
public void testRetrieveUnusedSegmentsUsingNoIntervalsAndNoLimitAndOffset() throws IOException |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Offset -> last segment id here and below
public void testRetrieveUnusedSegmentsUsingNoIntervalsAndNoLimitAndOffset() throws IOException | |
public void testRetrieveUnusedSegmentsUsingNoIntervalsAndNoLimitAndNoLastSegmentId() throws IOException |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
* This call does not return any information about realtime segments. | ||
* | ||
* @param datasource The name of the datasource | ||
* @param interval The intervals to search over |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* @param interval The intervals to search over | |
* @param interval an optional interval to search over. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
* | ||
* @param datasource The name of the datasource | ||
* @param interval The intervals to search over | ||
* @param limit The limit of segments to return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* @param limit The limit of segments to return | |
* @param limit an optional maximum number of results to return. If none is specified, the results are not limited. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
final Iterable<DataSegment> authorizedSegments = | ||
AuthorizationUtils.filterAuthorizedResources(req, unusedSegments, raGenerator, authorizerMapper); | ||
|
||
// sort by earliest start interval first, then end interval. DataSegment are sorted in this same order due to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this comment here since the sorting actually happens inside iterateAllUnusedSegmentsForDatasource
? Also, this is missing last segment id in the sort by criteria
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
} | ||
|
||
SortOrder theSortOrder = sortOrder == null ? null : SortOrder.fromValue(sortOrder); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we also add a validation check for lastSegmentId
if it's non-nul? We could use org.apache.druid.timeline.SegmentId#tryParse
utility to validate the id and throw an invalid input exception if the result of the tryParse
is null (the id failed to parse)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
// test valid datasource with interval filter limit and offset - returns unused segments for that datasource within | ||
// interval upto limit starting at offset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// test valid datasource with interval filter limit and offset - returns unused segments for that datasource within | |
// interval upto limit starting at offset | |
// test valid datasource with interval filter limit and last segment id - returns unused segments for that datasource within | |
// interval upto limit starting at last segment id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
Description
This pr adds an api for retrieving unused segments for a particular datasource. The api supports pagination by the addition of
limit
andlastSegmentId
parameters. The resulting unused segments are returned with optionalsortOrder
,ASC
orDESC
with respect to the matching segmentsid
,start time
, andend time
, or not returned in any guarenteed order ifsortOrder
is not specifiedGET /druid/coordinator/v1/datasources/{dataSourceName}/unusedSegments?interval={interval}&limit={limit}&lastSegmentId={lastSegmentId}&sortOrder={sortOrder}
Returns a list of unused segments for a datasource in the cluster contained within an optionally specified interval.
Optional parameters for limit and lastSegmentId can be given as well, to limit results and enable paginated results.
The results may be sorted in either ASC, or DESC order depending on specifying the sortOrder parameter.
dataSourceName
: The name of the datasourceinterval
: the specific interval to search for unused segments for.limit
: the maximum number of unused segments to return information about. This property helps tosupport pagination
lastSegmentId
: the last segment id from which to search for results. All segments returned are > this segmentlexigraphically if sortOrder is null or ASC, or < this segment lexigraphically if sortOrder is DESC.
sortOrder
: Specifies the order with which to return the matching segments by start time, end time. A nullvalue indicates that order does not matter.
Release note
Release note entry removed as this API will be removed. Please see #15721.
This PR has: