Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add api for Retrieving unused segments #15415

Merged
merged 21 commits into from
Dec 11, 2023

Conversation

zachjsh
Copy link
Contributor

@zachjsh zachjsh commented Nov 22, 2023

Description

This pr adds an api for retrieving unused segments for a particular datasource. The api supports pagination by the addition of limit and lastSegmentId parameters. The resulting unused segments are returned with optional sortOrder, ASC or DESC with respect to the matching segments id, start time, and end time, or not returned in any guarenteed order if sortOrder is not specified

GET /druid/coordinator/v1/datasources/{dataSourceName}/unusedSegments?interval={interval}&limit={limit}&lastSegmentId={lastSegmentId}&sortOrder={sortOrder}

Returns a list of unused segments for a datasource in the cluster contained within an optionally specified interval.
Optional parameters for limit and lastSegmentId can be given as well, to limit results and enable paginated results.
The results may be sorted in either ASC, or DESC order depending on specifying the sortOrder parameter.

dataSourceName: The name of the datasource
interval: the specific interval to search for unused segments for.
limit: the maximum number of unused segments to return information about. This property helps to
support pagination
lastSegmentId: the last segment id from which to search for results. All segments returned are > this segment
lexigraphically if sortOrder is null or ASC, or < this segment lexigraphically if sortOrder is DESC.
sortOrder: Specifies the order with which to return the matching segments by start time, end time. A null
value indicates that order does not matter.

Release note

Release note entry removed as this API will be removed. Please see #15721.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@zachjsh zachjsh changed the title Retreive unused segments Add api for Retreiving unused segments Nov 22, 2023
@zachjsh zachjsh changed the title Add api for Retreiving unused segments Add api for Retrieving unused segments Nov 22, 2023
@zachjsh zachjsh marked this pull request as ready for review November 22, 2023 22:33
Copy link
Contributor

@abhishekrb19 abhishekrb19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did an initial pass on API design

public Response getIntervalsWithUnusedSegmentsOrAllUnusedSegmentsPerIntervals(
@PathParam("dataSourceName") String dataSourceName,
@QueryParam("simple") String simple,
@QueryParam("full") String full,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

@Path("/{dataSourceName}/unused/intervals")
@Produces(MediaType.APPLICATION_JSON)
@ResourceFilters(DatasourceResourceFilter.class)
public Response getIntervalsWithUnusedSegmentsOrAllUnusedSegmentsPerIntervals(
Copy link
Contributor

@abhishekrb19 abhishekrb19 Nov 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see we're following the same API patterns in this file written over a decade ago, but I'm wondering if it'd be better to move away from the pattern where an API does multiple things and returns different shapes of responses depending on the parameters specified. For example, in this case, IMO it'd be cleaner to have two different sets of API endpoints each serving only one functionality:

  1. getIntervalsWithUnusedSegments()
  2. getUnusedSegmentsInInterval() where the interval is optional similar to other parameters

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated, now there are 3 apis

Copy link
Contributor

@abhishekrb19 abhishekrb19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall. Left a few suggestions and test cases to consider adding. Thanks, @zachjsh!

@@ -3050,19 +3106,20 @@ private List<DataSegment> createAndGetUsedYearSegments(final int startYear, fina

private ImmutableList<DataSegment> retrieveUnusedSegmentsUsingMultipleIntervalsAndLimit(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this method be renamed to include offset in its name, or perhaps could use a simpler name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

int offset = 10;
List<DataSegment> expectedSegments = segments.stream().sorted((Comparator<DataSegment>) (o1, o2) -> {
long o1Start = o1.getInterval().getStartMillis();
long o2Start = o2.getInterval().getStartMillis();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually need this custom comparator? createAndGetUsedYearSegments() already returns a list of segments sorted by start date. So I think this should suffice:

final List<DataSegment> expectedSegments = segments.stream().skip(offset).collect(Collectors.toList());

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Assert.assertEquals(expectedSegments.size(), actualUnusedSegments.size());
Assert.assertTrue(expectedSegments.containsAll(actualUnusedSegments));
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For test coverage, can we also please include new or extend existing tests for the following scenarios:

  1. non-empty interval, non-null limit and non-null offset
  2. multiple intervals, non-null limit and non-null offset
  3. multiple intervals exceeding MAX_INTERVALS_PER_BATCH, non-null limit and non-null offset (the offset in this case would be ignored)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

@zachjsh zachjsh requested a review from abhishekrb19 December 5, 2023 23:12
Copy link
Contributor

@abhishekrb19 abhishekrb19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Should we also add a release note entry for this new API?

Iterable<DataSegment> iterateAllUnusedSegmentsForDatasource(
String datasource,
@Nullable Interval interval,
@Nullable Integer limit,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be a good idea to go ahead and add a sort order now too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Good suggestion. Added

* @param interval The intervals to search over
* @param limit The limit of segments to return
* @param offset The offset to use when retrieving matching segments.
* @param orderByStartEnd Specifies the order with which to return the matching segments by start time, end time. A
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An enum type would be a better choice for the direction of order by? Similar to OrderByColumnSpec#Direction

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Added new enum

@abhishekrb19
Copy link
Contributor

abhishekrb19 commented Dec 8, 2023

For performance-reasons, I think it'd be best if we don't use OFFSET on the segments table. Using OFFSET to page through results in a cluster with large number of segments will become a bottleneck. An alternative idea is to leverage the indices id, start and end built for the segments table along with the uniqueness of segment id via the primary key constraint, we can guarantee a stable sort and page through batched results from a client.

Instead of an OFFSET based paginated query:

SELECT "start", "end", "id"
FROM druid_segments
WHERE dataSource = 'foo' AND "start" >= '2023-01-01' AND "end" < '2024-01-01'
ORDER BY 1, 2
LIMIT 100
OFFSET 10000

We can do a cursor-based query:

SELECT "start", "end", "id"
FROM druid_segments
WHERE dataSource = 'foo' AND "start" >= '2023-01-01' AND "end" < '2024-01-01' AND id > 'foo_2023-02-24T07:00:00.000Z_2023-02-24T08:00:00.000Z_2023-11-20T23:04:26.856Z'
ORDER BY 1, 2, 3
LIMIT 100

The execution plans for these two queries are both simple from EXPLAIN PLAN, but the first query has to read, count and then discard results because of OFFSET, so it'll take longer especially for larger values of OFFSET. With the second query, an API caller should pass in the last segment id from a previous batch (if any); they can also narrow down the search interval as they batch through results as another optimization.

@zachjsh
Copy link
Contributor Author

zachjsh commented Dec 9, 2023

For performance-reasons, I think it'd be best if we don't use OFFSET on the segments table. Using OFFSET to page through results in a cluster with large number of segments will become a bottleneck. An alternative idea is to leverage the indices id, start and end built for the segments table along with the uniqueness of segment id via the primary key constraint, we can guarantee a stable sort and page through batched results from a client.

Instead of an OFFSET based paginated query:

SELECT "start", "end", "id"
FROM druid_segments
WHERE dataSource = 'foo' AND "start" >= '2023-01-01' AND "end" < '2024-01-01'
ORDER BY 1, 2
LIMIT 100
OFFSET 10000

We can do a cursor-based query:

SELECT "start", "end", "id"
FROM druid_segments
WHERE dataSource = 'foo' AND "start" >= '2023-01-01' AND "end" < '2024-01-01' AND id > 'foo_2023-02-24T07:00:00.000Z_2023-02-24T08:00:00.000Z_2023-11-20T23:04:26.856Z'
ORDER BY 1, 2, 3
LIMIT 100

The execution plans for these two queries are both simple from EXPLAIN PLAN, but the first query has to read, count and then discard results because of OFFSET, so it'll take longer especially for larger values of OFFSET. With the second query, an API caller should pass in the last segment id from a previous batch (if any); they can also narrow down the search interval as they batch through results as another optimization.

Good suggestion, updated.

@@ -248,6 +248,12 @@ Returns full segment metadata for a specific segment in the cluster.

Return the tiers that a datasource exists in.

`GET /druid/coordinator/v1/datasources/{dataSourceName}/unusedSegments?interval={interval}&limit={limit}&lastSegmentId={lastSegmentId}&sortOrder={sortOrder}`

Returns a list of unused segments for a datasource in the cluster contained within an optionally specified interval.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make the API behavior clear in the absence of all the optional parameters, maybe specify what the defaults for the no interval, no limit, etc. Also, it'd be good to include an example for these params, so a user knows what to specify for params like sortOrder (wish we had an open API spec in Druid :)).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

/**
* @return the string offset clause
*/
public String getOffsetClause(int offset)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused offset method, so remove it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

* iterates the returned iterable only once rather than several times.
*
* @param datasource the name of the datasource.
* @param interval the interval to search over.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @param interval the interval to search over.
* @param interval an optional interval to search over. If none is specified, {@link org.apache.druid.java.util.common.Intervals#ETERNITY}
* should be used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

*
* @param datasource the name of the datasource.
* @param interval the interval to search over.
* @param limit the maximum number of results to return.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is an interface method, could you specify what the default behavior should be for these other optional parameters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added


if (compareAsString) {
appendConditionForIntervalsAndMatchMode(sb, intervals, matchMode, connector);
}

if (sortOrder != null) {
sb.append(StringUtils.format(" ORDER BY start %2$s, %1$send%1$s %2$s",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To guarantee stable pagination, we should also include id here in the ORDER BY clause if it's specified; otherwise, the result set can be garbled between multiple pages

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, fixed

}

@Test
public void test_invalid()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public void test_invalid()
public void testInvalid()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

import java.util.stream.Collectors;

/**
* specifies the sort order when doing metadata store queries.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* specifies the sort order when doing metadata store queries.
* Specifies the sort order when doing metadata store queries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

)
{
// Check if the intervals all support comparing as strings. If so, bake them into the SQL.
final boolean compareAsString = intervals.stream().allMatch(Intervals::canCompareEndpointsAsStrings);

final StringBuilder sb = new StringBuilder();
sb.append("SELECT payload FROM %s WHERE used = :used AND dataSource = :dataSource");
sb.append("SELECT payload FROM %s WHERE used = :used AND dataSource = :dataSource %s");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the trailing %s? If we add the necessary sql clauses conditionally we don't need this, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

.createQuery(StringUtils.format(
sb.toString(),
dbTables.getSegmentsTable(),
lastSegmentId == null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see my comment about moving this block above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Contributor

@abhishekrb19 abhishekrb19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me after the last batch of comments are addressed. Thanks, @zachjsh!

@@ -3048,21 +3159,23 @@ private List<DataSegment> createAndGetUsedYearSegments(final int startYear, fina
return segments;
}

private ImmutableList<DataSegment> retrieveUnusedSegmentsUsingMultipleIntervalsAndLimit(
private ImmutableList<DataSegment> retrieveUnusedSegmentsUsingMultipleIntervalsLimitAndOffset(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no offset anymore; maybe we can simply call this method retrieveUnusedSegments?

}

@Test
public void testRetrieveUnusedSegmentsUsingNoIntervalsNoLimitNoOffset() throws IOException
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjust the test method names since there's no offset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

null
);
Assert.assertEquals(segments.size(), actualUnusedSegments.size());
Assert.assertTrue(segments.containsAll(actualUnusedSegments));
}

@Test
public void testRetrieveUnusedSegmentsUsingNoIntervalsAndNoLimitAndOffset() throws IOException
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Offset -> last segment id here and below

Suggested change
public void testRetrieveUnusedSegmentsUsingNoIntervalsAndNoLimitAndOffset() throws IOException
public void testRetrieveUnusedSegmentsUsingNoIntervalsAndNoLimitAndNoLastSegmentId() throws IOException

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

* This call does not return any information about realtime segments.
*
* @param datasource The name of the datasource
* @param interval The intervals to search over
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @param interval The intervals to search over
* @param interval an optional interval to search over.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

*
* @param datasource The name of the datasource
* @param interval The intervals to search over
* @param limit The limit of segments to return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @param limit The limit of segments to return
* @param limit an optional maximum number of results to return. If none is specified, the results are not limited.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

final Iterable<DataSegment> authorizedSegments =
AuthorizationUtils.filterAuthorizedResources(req, unusedSegments, raGenerator, authorizerMapper);

// sort by earliest start interval first, then end interval. DataSegment are sorted in this same order due to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this comment here since the sorting actually happens inside iterateAllUnusedSegmentsForDatasource? Also, this is missing last segment id in the sort by criteria

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

}

SortOrder theSortOrder = sortOrder == null ? null : SortOrder.fromValue(sortOrder);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also add a validation check for lastSegmentId if it's non-nul? We could use org.apache.druid.timeline.SegmentId#tryParse utility to validate the id and throw an invalid input exception if the result of the tryParse is null (the id failed to parse)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines +320 to +321
// test valid datasource with interval filter limit and offset - returns unused segments for that datasource within
// interval upto limit starting at offset
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// test valid datasource with interval filter limit and offset - returns unused segments for that datasource within
// interval upto limit starting at offset
// test valid datasource with interval filter limit and last segment id - returns unused segments for that datasource within
// interval upto limit starting at last segment id

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@zachjsh zachjsh merged commit ab7d9bc into apache:master Dec 11, 2023
83 checks passed
@zachjsh zachjsh deleted the retreive-unused-segments branch December 11, 2023 21:32
@abhishekrb19 abhishekrb19 added the Needs web console change Backend API changes that would benefit from frontend support in the web console label Jan 18, 2024
@abhishekrb19 abhishekrb19 removed Release Notes Needs web console change Backend API changes that would benefit from frontend support in the web console labels Jan 18, 2024
@LakshSingla LakshSingla added this to the 29.0.0 milestone Jan 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants