Deterministic, documented order of the controller started, finished and aborted callbacks. #2
While you can put these kinds of guarantees in something like GCP where we know all the duplicates are going to be created in one go up front, I think it would be a mistake to mandate them as part of the Alternatively, it may be that what you're trying to achieve here would be better modelled as an
This is mainly about processing patterns that need to work over the whole corpus AND should work properly with GCP or other tools that use duplication and concurrent processing, such as the CorpuStats plugin (similar to Termraider, but able to run with GCP) or the LearningFramework (where training in most cases happens after all documents have been processed). For those, it would be good to have some guarantee, or at least a convention, about what can or cannot run in parallel when the controller callbacks are invoked, and maybe also some predictable behaviour about when those callbacks are invoked.
But I can see your point about having this going on in a web/REST server with dynamic allocation of controllers. How is this done in the current code for services?
The reason I thought some guarantees or conventions would be good is that it is not easy to come up with a generic pattern that works correctly under any kind of concurrency and any order of the callbacks / execute calls. But maybe it is unavoidable.
BTW, the output-handler approach is, I think, not usable for most of the situations above, because they all process, and hence need access to, data that was collected by the PR (collectively by all duplicates, using some shared data structure), while often ignoring the documents themselves.
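To make the "shared data structure" pattern concrete, here is a minimal sketch of one way duplicates could collect data together and run the whole-corpus step exactly once, whatever order the finished callbacks arrive in. It is plain Java with no GATE dependency: `SharedCorpusState`, `onStarted`, `record` and `onFinished` are hypothetical names invented for illustration, not part of GATE or GCP. It also relies on an assumption: every duplicate's started callback has run before any finished callback runs. With dynamically allocated controllers that assumption may not hold, and the counter could reach zero too early.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical shared state: one instance is shared by the template PR and all its duplicates.
class SharedCorpusState {
  private final Map<String, LongAdder> termCounts = new ConcurrentHashMap<>();
  private final AtomicInteger running = new AtomicInteger();
  private final AtomicInteger wholeCorpusRuns = new AtomicInteger();

  // Call from each duplicate's "started" callback.
  // Assumption: all of these calls happen before any onFinished() call.
  void onStarted() {
    running.incrementAndGet();
  }

  // Call while processing a document; safe to call concurrently from several duplicates.
  void record(String term) {
    termCounts.computeIfAbsent(term, k -> new LongAdder()).increment();
  }

  // Call from each duplicate's "finished" callback, in any order.
  // Returns true only for the caller that finishes last.
  boolean onFinished() {
    if (running.decrementAndGet() == 0) {
      // The whole-corpus step (e.g. training or writing statistics) would go here.
      wholeCorpusRuns.incrementAndGet();
      return true;
    }
    return false;
  }

  long count(String term) {
    LongAdder adder = termCounts.get(term);
    return adder == null ? 0L : adder.sum();
  }

  int wholeCorpusRuns() {
    return wholeCorpusRuns.get();
  }
}

class SharedCorpusStateDemo {
  public static void main(String[] args) {
    SharedCorpusState state = new SharedCorpusState();
    for (int i = 0; i < 3; i++) {
      state.onStarted(); // template plus two duplicates
    }
    state.record("gate");
    state.record("gate");
    state.record("corpus");
    boolean first = state.onFinished();
    boolean second = state.onFinished();
    boolean last = state.onFinished();
    System.out.println(first + " " + second + " " + last + " gate=" + state.count("gate"));
  }
}
```

The counter removes the dependence on the order of the finished callbacks, but not the dependence on "all started before any finished", which is exactly the kind of guarantee or convention being asked for here.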
I guess if guarantees are needed by some PR at all, then it would be these:
In the case of a web/REST service these should not be too hard to follow; in the case of GCP or similar programs it would be trivial.
For reference, the current behaviour (not particularly planned this way, but that's how it works out) does guarantee that the started callback will go to the template before the duplicates, but the order of the finished callbacks is not deterministic, as it depends on the order in which the controllers were last given back to the pool after processing their final documents.
Currently the controllerExecutionStarted callback is invoked first on the original, "template" controller, then in order on all the duplicates that were added to the pool, in a single thread, one after the other.
However, the controllerExecutionFinished/Aborted callbacks can occur in any order because the iteration happens over the queue as it stands at the time of termination. It may be useful to invoke these callbacks either in the same order or, maybe even better, in reverse order, so that the one for the original template controller gets invoked last. This should be easy to implement since there is a list which contains the controllers in the order in which they were created.
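The proposed ordering can be sketched as follows. This is an illustration only, not the GCP implementation: `ExecutionCallbacks`, `CallbackOrdering` and `RecordingController` are hypothetical stand-ins, and the sketch assumes a list that holds the template at index 0 followed by the duplicates in creation order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the started/finished callbacks; not the GATE interface itself.
interface ExecutionCallbacks {
  void started();
  void finished();
}

// Hypothetical holder that remembers controllers in creation order:
// index 0 is the template, the rest are the duplicates.
class CallbackOrdering {
  private final List<ExecutionCallbacks> creationOrder = new ArrayList<>();

  void add(ExecutionCallbacks controller) {
    creationOrder.add(controller);
  }

  // Started: template first, then the duplicates in creation order, in a single thread.
  void fireStarted() {
    for (ExecutionCallbacks controller : creationOrder) {
      controller.started();
    }
  }

  // Finished: reverse creation order, so the template is notified last.
  void fireFinished() {
    for (int i = creationOrder.size() - 1; i >= 0; i--) {
      creationOrder.get(i).finished();
    }
  }
}

// Records the order in which callbacks arrive, for demonstration purposes.
class RecordingController implements ExecutionCallbacks {
  private final String name;
  private final List<String> log;

  RecordingController(String name, List<String> log) {
    this.name = name;
    this.log = log;
  }

  @Override
  public void started() {
    log.add("started:" + name);
  }

  @Override
  public void finished() {
    log.add("finished:" + name);
  }
}

class CallbackOrderingDemo {
  public static void main(String[] args) {
    List<String> log = new ArrayList<>();
    CallbackOrdering ordering = new CallbackOrdering();
    ordering.add(new RecordingController("template", log));
    ordering.add(new RecordingController("dup1", log));
    ordering.add(new RecordingController("dup2", log));
    ordering.fireStarted();
    ordering.fireFinished();
    System.out.println(log);
  }
}
```

Iterating over the creation-order list instead of the pool's queue is what makes the finished order independent of which controller happened to be returned to the pool last.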
The details and guarantees related to the controllerExecutionXXX callbacks should get documented here and also as requirements for other GCP-like tools in the ControllerAwarePR interface.