-
Notifications
You must be signed in to change notification settings - Fork 313
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Term indexer improvements #303
base: master
Are you sure you want to change the base?
Conversation
|
||
void S2RegionTermIndexer::GetQueryTerms(const S2Point& point, | ||
string_view prefix, | ||
vector<string>* terms) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please split changing the type signature into a separate PR. We can discuss it there. Does this have a measurable and meaningful performance difference for you? I would have guessed that computing the covering takes way longer than reallocating vectors.
https://google.github.io/eng-practices/review/developer/small-cls.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want it, it's conflicting changes. It's not convenient to make separate PRs and then resolve conflict in one to another
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this have a measurable and meaningful performance difference for you?
For points it's obliviously better 0 allocations (only first time) instead of log(max-min/level) for each call
For covering it depends, so I add such overload only for function which already have covering
I would have guessed that computing the covering takes way longer than reallocating vectors.
In my overloads we don't really compute it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, it's better, but is it 10% better or 0.001% better? We need to weigh the complexity of the change vs. the performance improvements. How much does this improve 1. time to build the whole index (or insert an object) 2. query time? (or some other user-observable action)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to weigh the complexity of the change vs. the performance improvements.
But it's not affect users who doesn't want to use it
Why is it issue?
In general same api exists for S2RegionCoverer, simple and advanced
Why don't you want same as for S2RegionCoverer?
https://github.com/google/s2geometry/blob/master/src/s2/s2region_coverer.h#L189
Could you explain please?
Why for coverer it's ok, but for indexer it's too difficult? They usage are pretty same
Sure, it's better, but is it 10% better or 0.001% better?
I don't measure this part, but it depends on what we do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You want to make a performance change. Explain the performance impact in a time unit.
Already done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part of the change is good
I mean we have index with points only. Every allocation cost at least 10ns (on practice more), so if we make 1 instead of N it decrease overall time at least on (N-1)*10ns
Query with points only it's other part of PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How many terms do your queries have? According to your benchmarks, you save 3µs for 16 query terms.
This sounds not very significant to me, so could you explain either why this is significant, or what N you care about?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Every allocation cost at least 10ns (on practice more), so if we make 1 instead of N it decrease overall time at least on (N-1)*10ns
It's significant for me not because it's big but because it costs nothing to me, I just should use other API
Even small improvements still good, if don't make code significantly more difficult
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concrete for me for query it will be same.
I can only see the difference for indexing, but I thought it's better to have a symmetrical API.
Maybe as a compromise I can keep the new indexing API but remove the new querying API (if you think that would be better)?
|
||
// If the index only contains points, there are no covering terms. | ||
if (options_.index_contains_points_only()) continue; | ||
|
||
// If we are optimizing for index space rather than query time, cells are | ||
// also queried as covering terms (except for true_max_level() cells, | ||
// which are indexed and queried as ancestor cells only). | ||
if (options_.optimize_for_space() && level < true_max_level) { | ||
terms.push_back(GetTerm(TermType::COVERING, id, prefix)); | ||
if (options_.optimize_for_space() && level != true_max_level) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only change what you need to. More changes just makes it harder to see what's really changing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just want to make it same for different function
src/s2/s2region_term_indexer_test.cc
Outdated
@@ -131,6 +134,8 @@ TEST(S2RegionTermIndexer, IndexRegionsQueryPointsOptimizeTime) { | |||
options.set_max_level(16); | |||
options.set_max_cells(20); | |||
TestRandomCaps(options, QueryType::POINT); | |||
options.set_query_contains_points_only(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can remove QueryType
and do it based only on query_contains_points_only
. This is similar to how it works for index_contains_points_only
.
@smcallis what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can remove QueryType and do it based only on query_contains_points_only. This is similar to how it works for index_contains_points_only.
No. We want to test which query points with and without points only optimization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? Points are also regions, and this property is already used here: https://github.com/google/s2geometry/pull/303/files/0f9ef7a0ebe8d52ebf73cfb4c441cc1e3d553a05#diff-c9f79443ff2537a812dd6c149afd00960afcc88655f0514de8a1cc4741e78e79L66
I agree it would be more test coverage, but I'm fine making this same assumption we're already making.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general exist such scenarios matrix:
index contains only points, only points with enabled optimization, anything
x
same about query
x
time/space bool option
So exists at least 3*3*2 = 18
completely different configuration
Was 3*2*2 = 12
Do you agree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I understood what do you mean, you are incorrect.
this same assumption we're already making.
Now in master tested such cases:
index contains points: true/false
query points/query regions
index contains points optimization affect query terms!
So it has two different query for same index (points only)
So for opposite options I need to make same query (points only) with two different index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyway I wrote better coverage with automatic case combinations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
16 (2 * 2 * 2 * 2) - 4 (2 * 2 * 1 * 1) = 12
all combinations - skip invalid combinations
as expected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm you right about in my code wasn't checked query points but index contains caps.
Fixed.
(2 * 2 * 2 * 2 * 2) - ((1 * 2 * 2 * 1 * 2) + (2 * 1 * 2 * 2 * 1) - 2) = 18
all combinations - (skip first if + skip second if - their intersection (it counts twice))
cases now
If you don't like skip I can use GTEST_SUCCEED()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmr Do you ok with current tests?
src/s2/s2region_term_indexer.cc
Outdated
: TermType::COVERING, | ||
id, prefix)); | ||
|
||
// If query only contains points, there are no need other terms. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is obvious from the code and doesn't need a comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as for index_only optimization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want I remove them both, or only my comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
|
||
void S2RegionTermIndexer::GetQueryTerms(const S2Point& point, | ||
string_view prefix, | ||
vector<string>* terms) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How many terms do your queries have? According to your benchmarks, you save 3µs for 16 query terms.
This sounds not very significant to me, so could you explain either why this is significant, or what N you care about?
src/s2/s2region_term_indexer_test.cc
Outdated
S2RegionTermIndexer::Options options; | ||
options.set_optimize_for_space(false); // Optimize for time. | ||
options.set_min_level(0); // Use face cells. | ||
TEST_P(S2RegionTermIndexerTest, UseFaceCells2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use something more descriptive than "2" to differentiate it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use something more descriptive than "2" to differentiate it.
What? No good name was in old code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can use UseFaceCellsLevelMode2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some more descriptive
src/s2/s2region_term_indexer_test.cc
Outdated
@@ -44,10 +45,13 @@ S2_DEFINE_int32(iters, 400, "number of iterations for testing"); | |||
|
|||
namespace { | |||
|
|||
enum QueryType { POINT, CAP }; | |||
enum DataType { | |||
POINT = 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leave it like before. Explicit = 0 not required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
* Add ability to reuse term buffer * Add option for optimize index size if query only points
@jmr Any problem here? |
@@ -289,6 +303,21 @@ class S2RegionTermIndexer { | |||
std::vector<std::string> GetQueryTermsForCanonicalCovering( | |||
const S2CellUnion& covering, absl::string_view prefix); | |||
|
|||
// Same as above but allows to reuse same buffer for different points or use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think its fine to have these since the old ones call the new variants. Since the terms vector isn't optional let's just make it a non-const reference though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a small inefficiency we should forget about 97% of the time. @MBkkt even said the few microseconds it saved were irrelevant to him.
Why do we want to complicate the interface with this? Now we have to document what happens with the vectors. Are they cleared? appended to?
The formatting is also inconsistent (const T&
vs const T &
) with a few lines below. A lot of time is spent dealing with this for epsilon difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes if the performance gain isn't worth it then we should probably just elide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are they cleared? appended to?
appended to, I think function is pretty small to see it from code.
The formatting is also inconsistent (const T& vs const T &) with a few lines below. A lot of time is spent dealing with this for epsilon difference.
Sorry you don't have clang-format in repo, so my ide reformat it randomly, I tried to make it like was in other function, maybe in some places I missed it
This seems like a small inefficiency we should forget about 97% of the time. @MBkkt even said the few microseconds it saved were irrelevant to him.
I already wrote, now it's ok for me don't have such api for query, but for indexing, calling other api make my code faster (a little but measurable), so I want other api.
Offtop:
Btw today we discussed how speedup geo terms search: now we need to make disjunction of all query terms (it's from s2 documentation and it works now for us).
But it's possible to make conjuntion of disjunction which is a lot of better for inverted index (don't need to read a lot of postings lists)
Like query covering:
cellid-0: min_level
cellid-1: min_level + 1
cellid-2: min_level + 1
Like query terms:
cellid-0-term-min-level
cellid-1-term-min-level
cellid-1-term-min-level+1
// cellid-2-term-min-level, for an example same as cellid-1-term-min-level, so deduplicated
cellid-2-term-min-level+1
So we can make query like:
(cellid-0-term-min-level || cellid-1-term-min-level) && (cellid-1-term-min-level+1 || cellid-2-term-min-level+1)
instead of
cellid-0-term-min-level || cellid-1-term-min-level || cellid-1-term-min-level+1 || cellid-2-term-min-level+1
It has few other tricks, like we should store in index for pair: term, document in posting list: is it last or not
But in general it's a lot of better.
Why do I write about this?
Because for use such api we need to know term level.
It can be computed from terms, but it's not really nice, because we already know they from covering.
So will be nice to have overload in future like:
std::vector<std::pair<int/*level*/,std::string/*term*/>>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think its fine to have these since the old ones call the new variants. Since the terms vector isn't optional let's just make it a non-const reference though.
reference looks better for me, but in all other places in s2 mutable reference arguments passed like pointer
// + If the query will contain only points (rather than general regions), then | ||
// we can skip all the ancestor terms mentioned above (except last cell see | ||
// `GetIndexTerms(const S2Point& point...` for details) because there will | ||
// never be any document regions larger than the index region. This can |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think saying "never be any document regions smaller than the query region" makes the comment have the same sense as 1. above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I already wrote, now it's ok for me don't have such api for query, but for indexing, calling other api make my code faster (a little but measurable), so I want other api.
Do you have an indexing benchmark, or are you referring to the microbenchmarks you wrote in #304? What are the results?
Offtop:
Btw today we discussed how speedup geo terms search: now we need to make disjunction of all query terms (it's from s2 documentation and it works now for us).
As you say, off-topic for this PR. Please open a discussion or an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jmr Today another day when I thought that will be good to have such API
But because now s2 don't have it I need to write code like this:
geoTerms = _indexer.GetIndexTerms(*_shape.region(), {});
if (!_shape.contains(centroid)) {
auto terms = _indexer.GetIndexTerms(centroid, {});
geoTerms.insert(geoTerms.end(), std::make_move_iterator(terms.begin()),
std::make_move_iterator(terms.end()));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have an indexing benchmark, or are you referring to the microbenchmarks you wrote in #304? What are the results?
We already computed, it saves at least N*10 ns, where N number of inserts.
Why do you against this API so much?
First is caps
Second is points
Third is points with query only points
As we see index contains a lot of smaller count of terms, which is very good
Of course this optimization possible only if you query contains only points
Tests results