-
Notifications
You must be signed in to change notification settings - Fork 115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable comparing main facets module agains sandbox facets implementation #325
base: main
Are you sure you want to change the base?
Conversation
There is BrowseRandomLabelTaxoFacets regression because We can think about implementing dense counting for the sandbox facet module, but it looks like a rare use case to me - not only we need a field with large number of unique values, but it also needs to be a MatchAllDocsQuery. In any case, seems like the result is expected. Looking into PKLookup regression now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SearchTask#getFacetResultsMsec now includes time to run #search
Will that change results we were already reporting? Just good to know.
src/main/perf/SearchTask.java
Outdated
import org.apache.lucene.facet.Facets; | ||
import org.apache.lucene.facet.FacetsCollector; | ||
import org.apache.lucene.facet.FacetsCollectorManager; | ||
import org.apache.lucene.facet.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want to avoid the wildcard import.
(Same below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I think my Intellij folded it, I'll change
src/main/perf/SearchTask.java
Outdated
// TODO: support sort, filter too!! | ||
// TODO: support other facet methods | ||
List<TaskParser.TaskBuilder.FacetTask> classicFacetRequests = new ArrayList<>(); | ||
List<TaskParser.TaskBuilder.FacetTask> sandboxFacetRequests = new ArrayList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're starting to spread this temporary name. It would have been great to converge on apache/lucene#13965 before, but I don't know that we want to wait now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I say @epotyom and @Shradha26 should simply pick a name, now, and run with it, here :) Surprise us!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way I'm thinking about it is that new classes can be moved to the existing facets
module once we believe they are mature enough. I don't think these classes belong in an independent module, they depend on a lot of things from the facets
module, e.g. indexing time functionality or drill sideways. I promise to clean up and get rid of all mentions of "sandbox" once we do it, or, if decided otherwise, once we create a separate module for it. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good point that the new classes may become part of the facets module.
In this PR though, we're using "sanbox" as a contrast to "classic". Can we name it something generic, such as "matchTime", or are we not that detached from the particular implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have a point. How about postCollectionFacets
for classic VS directCollectionFacets
or directFacets
for sandbox? Or docIdSetFacets
VS directFacets
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like postCollectionFacets
. Can we pair it with duringCollectionFacets
, withCollectionFacets
, or collectionFacets
?
src/main/perf/TestContext.java
Outdated
public enum FacetMode { | ||
UNDEFINED, | ||
CLASSIC, | ||
SANDBOX, // TODO: better names? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes!
@@ -1193,7 +1195,9 @@ def runSimpleSearchBench(self, iter, id, c, | |||
command += [f'-XX:StartFlightRecording=dumponexit=true,maxsize=250M,settings={constants.BENCH_BASE_DIR}/src/python/profiling.jfc' + | |||
f',filename={constants.LOGS_DIR}/bench-search-{id}-{c.name}-{iter}.jfr', | |||
'-XX:+UnlockDiagnosticVMOptions', | |||
'-XX:+DebugNonSafepoints'] | |||
'-XX:+DebugNonSafepoints', | |||
# '-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=localhost:7891' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you leave this in accidentally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was intentional, it makes it easy to enable remote debug - just uncomment this line. Maybe I should add a comment. Do you think it's better to remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I don't mind leaving it in if you found it helpful for quick debugging.
import competition | ||
import os | ||
|
||
# Script to compare performance of sandbox and main facets modules |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we mention this file somewhere visible, maybe in the README?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, will do in the next revision.
@@ -0,0 +1,40 @@ | |||
package perf; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs copyright header?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
+1 -- good catch! |
Thank you for reviewing @stefanvodita !
Yeah, but TBH I can't find where we report I believe that QPS metrics reported by localrun.py are not going to change. |
Command to run: python src/python/localrunFacets.py -source facetsWikimediumAll Auxiliary changes: - Enable attaching context to a Task, to be able to use different implementations for the same task and Lucene code in baseline/candidate - Fix facets result overlap check: in python SearchTask's equals and hash methods to compare facets request, not results - Added facetsWikimediumAll config that contains all taxonomy facets tasks from wikimediumall
c446471
to
3c46276
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few more small comments. Thank you for making the changes, great to see this change coming along!
|
||
There are currently two facets implementations - one that first collects document IDs and then computes facets in a separate phase, and a new implementation that computes facets during collection. | ||
|
||
To compare performance for the two implementations run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add details about what exactly gets compared? For example, you had a comment about only taxonomy facets being compared I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, added.
src/main/perf/SearchTask.java
Outdated
// for MatchAllDocsQuery to make collection for all docs in the index faster? | ||
Map<String, CountFacetRecorder> indexFieldToRecorder = new HashMap<>(); | ||
List<CollectorManager<? extends Collector, ?>> collectorManagers = new ArrayList<>(); | ||
// First collector manager in the list is to collect hits, but not for if MatchAllDocsQuery |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we state this more clearly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reworded!
src/main/perf/SearchTask.java
Outdated
if (request.dimension().startsWith("range:")) { | ||
throw new AssertionError("fix me!"); | ||
} else if (request.dimension().endsWith(".taxonomy")) { | ||
//if (true) throw new RuntimeException("fix me! " + request.dimension() + "; " + state.facetsConfig.getDimConfig(request.dimension()).indexFieldName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgotten commented code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed, sorry for the mess!
Auxiliary changes:
SearchTask#getFacetResultsMsec
now includes time to run#search
, because in the sandbox module we can't measure search and facet compute times separately. But it might be the right thing to do anyway, as otherwise we don't account for time spent to build doc ID sets in FacetsCollector?facetsWikimediumAll
config that contains all taxonomy facets tasks from wikimediumallCommand to run:
Results on my laptop
I'll look into why there is regression for BrowseRandomLabelTaxoFacets and PKLookup