Inconsistent number of references if update job is run for multiple tags at once #292
Comments
The main reason I can see is this timeline:
For refs in version N, we take the tokenizer output, and tokens that match a def in the database are considered refs. Depending on timing, we might catch some defs from version N+1. Fixing this would require having the list of defs in version N available when parsing refs of version N, that is, all defs that come from blobs in version N. This can be expensive to compute. Note that it is not the list of new blobs in version N (those that we just parsed); it can include blobs that were parsed long ago and are still part of version N. I plan on addressing this as part of #289. This however means the outputs of the old |
I think we could just wait for all defs jobs to finish before starting other jobs. I made a simple modification that waits for the defs threads to finish before starting everything else, and it seems to give consistent results (three attempts on musl with three tags; I tried the default version too, and it gave different results each time). It's also likely that we have a similar problem with comps docs: |
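As a rough illustration of the "wait for all defs jobs first" idea, here is a minimal sketch using plain Python threads. The names `run_in_two_phases`, `index_defs` and `index_refs` are hypothetical placeholders, not functions from the project; the actual modification may be structured differently.

```python
import threading


def run_in_two_phases(tags, index_defs, index_refs):
    """Run every defs job to completion, then start the refs jobs.

    index_defs and index_refs are hypothetical callables taking one tag.
    """
    defs_threads = [threading.Thread(target=index_defs, args=(tag,)) for tag in tags]
    for thread in defs_threads:
        thread.start()
    for thread in defs_threads:
        # Barrier: no refs job starts before all defs jobs have finished.
        thread.join()

    refs_threads = [threading.Thread(target=index_refs, args=(tag,)) for tag in tags]
    for thread in refs_threads:
        thread.start()
    for thread in refs_threads:
        thread.join()
```

With this ordering every refs job sees the same defs for a given set of tags, which is why repeated runs on the same tags agree with each other.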
It will make it consistent if you index the same tags. It will not make it consistent if you index different tags, including if you do incremental indexing. The trick I used in my experiment is to build the list of defs for version N by starting from the defs of version N-1. On top of that, you (1) remove defs for blobs not present in version N and (2) add defs for new blobs in version N. Using sets it is pretty quick, much faster than re-iterating the whole database to grab all defs for a list of blobs. Yes, the same problem is probably present for |
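The set-based update described above could look roughly like this. It is only a sketch: the per-blob dictionary layout and the names `next_version_defs`, `known_identifiers` and `defs_in_blob` are assumptions made for illustration, not the experiment's actual code. Defs are kept per blob so that an identifier defined in several blobs survives when only one of those blobs disappears.

```python
def next_version_defs(prev_defs, cur_blobs, defs_in_blob):
    """Derive the per-blob defs of version N from those of version N-1.

    prev_defs:    dict mapping blob id -> set of identifiers defined there (N-1)
    cur_blobs:    set of blob ids that make up version N
    defs_in_blob: hypothetical callable returning the defs of a newly parsed blob
    """
    prev_blobs = set(prev_defs)
    removed = prev_blobs - cur_blobs  # (1) blobs no longer present in version N
    added = cur_blobs - prev_blobs    # (2) blobs that are new in version N

    cur_defs = {blob: names for blob, names in prev_defs.items() if blob not in removed}
    for blob in added:
        cur_defs[blob] = defs_in_blob(blob)
    return cur_defs


def known_identifiers(defs_by_blob):
    """Flatten the per-blob defs into the set of identifiers valid in a version."""
    return set().union(*defs_by_blob.values())
```

Only the blobs that differ between the two versions are touched, so the cost depends on the size of the diff rather than on the size of the database.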
I don't understand. You still would have to reindex all versions from the start for the references database to be consistent, no? |
No, if you just computed defs for version N-1, you reuse that for version N. That is an optimization when indexing many versions in a row. If you don't have easy access to version N-1 defs, you grab them from the database. That's slow (a few seconds, complete database read), but required. Then, once you have the exact defs valid for version N, your refs will be consistent no matter what. |
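To make the last point concrete, here is a small sketch of the reuse-or-reload step followed by ref filtering. Everything in it is hypothetical (`defs_for`, `refs_for_version`, `load_defs_from_db`, and the `(line, token)` tuple layout); it only illustrates that refs computed against an exact per-version def set do not depend on what else is in the database at that moment.

```python
def defs_for(version, cache, load_defs_from_db):
    """Reuse defs already computed for a version, else do the slow full read."""
    if version not in cache:
        # Assumed fallback: a complete (slow) database read for this version.
        cache[version] = load_defs_from_db(version)
    return cache[version]


def refs_for_version(tokens_by_blob, version_defs):
    """Keep only tokens that name a def valid in this exact version.

    tokens_by_blob: dict mapping blob id -> list of (line, token) pairs
    version_defs:   set of identifiers defined by blobs of this version
    """
    refs = {}
    for blob, tokens in tokens_by_blob.items():
        for line, token in tokens:
            if token in version_defs:
                refs.setdefault(token, []).append((blob, line))
    return refs
```

A token that only becomes a def in version N+1 is never counted as a ref in version N here, regardless of whether the N+1 defs job has already written to the database.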
To reproduce: