Inconsistent number of references if update job is run for multiple tags at once #292

Open
fstachura opened this issue Jun 25, 2024 · 5 comments

@fstachura
Collaborator

To reproduce:

  1. Pull a git repository with more than one tag
  2. Run an update job twice, on an empty data directory each time
  3. Run this script on references.db from both data directories and compare the results between the two databases. The number of references for some identifiers should differ.
  4. Run this script on both databases with one of the identifiers whose count differs and compare the results (sort before diffing). Entries for some files should be missing from one of the databases.

@tleb
Member

tleb commented Jun 25, 2024

The main reason I can see is this timeline:

  • Start indexing of defs in version N;
  • Done indexing of defs in version N;
  • Start indexing of defs in version N+1;
  • Start indexing of refs in version N.

For refs in version N, we take the tokenizer output, and tokens that match a def in the database are considered to be refs. Depending on timing, we might be catching some defs from version N+1.
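To make the timing dependence concrete, here is a minimal sketch, not the actual update.py logic. The function `refs_for_version` and the sample identifiers are hypothetical; the only thing taken from the description above is that a token counts as a ref when it matches a def already in the database.

```python
def refs_for_version(tokens, defs_in_db):
    """Treat every token that matches a known def as a reference."""
    return [tok for tok in tokens if tok in defs_in_db]


# Hypothetical data: "new_helper" appears as a plain token in version N
# but only becomes a definition in version N+1.
tokens_n = ["foo", "bar", "new_helper"]
defs_n = {"foo", "bar"}
defs_n_plus_1 = {"new_helper"}

# Refs of version N parsed before the defs of N+1 reach the database.
early = refs_for_version(tokens_n, defs_n)

# Refs of version N parsed after the defs of N+1 were already written.
late = refs_for_version(tokens_n, defs_n | defs_n_plus_1)

print(early)
print(late)
```

The two runs disagree on `new_helper`, which is the kind of count difference the reproduction steps would show.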

Fixing this would require having a list of the defs in version N available when parsing the refs of version N. Those are all the defs that come from blobs in version N. This can be expensive to compute. Note that it is not a list of the new blobs in version N (those that we just parsed); it can include blobs that were parsed long ago and are still part of version N.

I plan on addressing this as part of #289. This however means the outputs of the old update.py and the new one won't be exactly identical. This was a property I attempted to keep for easy testing.

@fstachura
Collaborator Author

> Fixing this would require having a list of the defs in version N available when parsing the refs of version N. Those are all the defs that come from blobs in version N. This can be expensive to compute. Note that it is not a list of the new blobs in version N (those that we just parsed); it can include blobs that were parsed long ago and are still part of version N.

I think we could just wait for all defs jobs to finish before starting the other jobs. I made a simple modification that waits for the defs threads to finish before starting everything else, and it seems to give consistent results (three attempts on musl with three tags; I tried the default version too and it got different results each time).
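Roughly, the modification amounts to a two-phase run like the sketch below. This is an illustration, not the actual patch: `run_in_two_phases`, `fake_defs`, and `fake_refs` are hypothetical names, and the real jobs do far more than append to a list.

```python
import threading


def run_in_two_phases(tags, index_defs, index_refs):
    """Run all defs jobs to completion, then start the refs jobs."""
    defs_threads = [threading.Thread(target=index_defs, args=(tag,)) for tag in tags]
    for thread in defs_threads:
        thread.start()
    for thread in defs_threads:
        thread.join()  # barrier: no refs job starts before every defs job is done

    refs_threads = [threading.Thread(target=index_refs, args=(tag,)) for tag in tags]
    for thread in refs_threads:
        thread.start()
    for thread in refs_threads:
        thread.join()


# Stand-in jobs that only record the order in which they ran.
log = []
log_lock = threading.Lock()


def fake_defs(tag):
    with log_lock:
        log.append(("defs", tag))


def fake_refs(tag):
    with log_lock:
        log.append(("refs", tag))


run_in_two_phases(["v1", "v2", "v3"], fake_defs, fake_refs)
kinds = [kind for kind, _ in log]
print(kinds)
```

With the barrier in place, every refs job sees the same set of defs for a given set of tags, whatever the thread scheduling.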

It's also likely that we have a similar problem with comps docs:
https://github.com/bootlin/elixir/blob/master/update.py#L520

@tleb
Member

tleb commented Jul 9, 2024

It will make it consistent if you index the same tags. It will not make it consistent if you index different tags, including if you do incremental indexing.

The trick I used in my experiment is to build the list of defs for version N by starting from the defs of version N-1. On top of that, you (1) remove the defs for blobs not present in version N and (2) add the defs for new blobs in version N. Using sets, it is pretty quick, much faster than re-iterating over the whole database to grab all defs for a list of blobs.
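A minimal sketch of that set-based update, under stated assumptions: defs are tracked per blob in a plain dict, and `next_version_defs`, `all_idents`, and `defs_of_blob` are hypothetical names rather than functions from the experiment.

```python
def next_version_defs(prev_defs_by_blob, blobs_in_version, defs_of_blob):
    """Derive the defs of version N from those of version N-1.

    prev_defs_by_blob: dict mapping blob id -> set of defined identifiers (version N-1)
    blobs_in_version:  set of blob ids present in version N
    defs_of_blob:      callable returning the defs of a newly parsed blob
    """
    # (1) Drop the defs of blobs that are no longer part of version N.
    current = {
        blob: defs
        for blob, defs in prev_defs_by_blob.items()
        if blob in blobs_in_version
    }
    # (2) Add the defs of blobs that are new in version N.
    for blob in blobs_in_version.difference(prev_defs_by_blob):
        current[blob] = defs_of_blob(blob)
    return current


def all_idents(defs_by_blob):
    """Flatten the per-blob sets into one set of identifiers valid for the version."""
    return set().union(*defs_by_blob.values())


# Hypothetical example: blob "b1" disappears in version N, blob "b3" is new.
prev = {"b1": {"foo"}, "b2": {"bar"}}
current = next_version_defs(prev, {"b2", "b3"}, lambda blob: {"baz"})
idents = all_idents(current)
print(sorted(idents))
```

Only the blobs that changed between the two versions are touched, so the cost follows the size of the diff rather than the size of the database.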

Yes, the same problem is probably present for comps.

@fstachura
Collaborator Author

I don't understand. You still would have to reindex all versions from the start for the references database to be consistent, no?

@tleb
Member

tleb commented Jul 9, 2024

No, if you just computed defs for version N-1, you reuse that for version N. That is an optimization when indexing many versions in a row.

If you don't have easy access to the defs of version N-1, you grab them from the database. That's slow (a few seconds, complete database read), but required.

Then, once you have the exact defs valid for version N, your refs will be consistent no matter what.
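The reuse-or-reload choice could look something like the sketch below. It is only an illustration: the `PreviousDefs` class and the `fake_load` stand-in for the full database read are hypothetical.

```python
class PreviousDefs:
    """Hypothetical helper that remembers the defs of the last indexed version."""

    def __init__(self, load_from_db):
        self._load_from_db = load_from_db  # slow path: complete database read
        self._version = None
        self._defs = None

    def remember(self, version, defs):
        """Keep the defs just computed so the next version can start from them."""
        self._version, self._defs = version, defs

    def get(self, version):
        """Return the defs of `version`, reusing the in-memory copy when possible."""
        if version == self._version:
            return self._defs
        return self._load_from_db(version)


# Stand-in for the database read; it records how often it is called.
db_reads = []


def fake_load(version):
    db_reads.append(version)
    return {"blob-a": {"foo"}}


prev = PreviousDefs(fake_load)
first = prev.get("v1")    # nothing remembered yet: falls back to the database
prev.remember("v1", first)
second = prev.get("v1")   # served from memory, no second database read
print(db_reads)
```

When many versions are indexed in a row, only the first one pays for the database read.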
