Inconsistent number of references if update job is run for multiple tags at once #292

fstachura · 2024-06-25T13:26:49Z

To reproduce:

1. Pull a git repository with more than one tag.
2. Run an update job twice, on an empty data directory each time.
3. Run this script on references.db from both data directories and compare the results between the databases. The number of references for some identifiers should differ.
4. Run this script on both databases with one of the identifiers that has a count difference and compare the results (sort before diff). Entries for some files should be missing from one of the databases.

tleb · 2024-06-25T16:06:42Z

The main reason I can see is this timeline:

Start indexing of defs in version N;
Done indexing of defs in version N;
Start indexing of defs in version N+1;
Start indexing of refs in version N.

For refs in version N, we take the tokenizer output, and tokens that match a def in the database are considered to be refs. We might be catching some defs from version N+1; this depends on timing.
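A minimal sketch of how the timing changes the result. It assumes the def database behaves like one shared set of identifiers; `find_refs` and the sample data are made up for illustration and are not taken from update.py.

```python
def find_refs(tokens, known_defs):
    """Return the tokens that match a known definition, in input order."""
    return [tok for tok in tokens if tok in known_defs]


# Hypothetical data, not from a real repository.
defs_version_n = {"foo", "bar"}       # defs indexed for version N
defs_version_n1_new = {"baz"}         # defs that first appear in version N+1
tokens_version_n = ["foo", "baz", "qux", "bar"]  # tokenizer output for a blob in N

# Case A: refs for N are computed before any def of N+1 is written.
shared_db = set(defs_version_n)
refs_early = find_refs(tokens_version_n, shared_db)

# Case B: the defs job for N+1 has already written to the shared database.
shared_db |= defs_version_n1_new
refs_late = find_refs(tokens_version_n, shared_db)

print(refs_early)
print(refs_late)
```

The same blob yields a different list of refs depending only on whether the defs of N+1 were already in the database.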

Fixing this would require having a list of defs in version N available when parsing refs of version N. Those are all the defs that come from blobs in version N. This can be expensive to compute. Note that it is not a list of new blobs in version N (those that we just parsed); it can include blobs that were parsed long ago and are still part of version N.

I plan on addressing this as part of #289. This however means the outputs of the old update.py and the new one won't be exactly identical. This was a property I attempted to keep for easy testing.

fstachura · 2024-07-09T13:12:33Z

> Fixing this would require having a list of defs in version N available when parsing refs of version N. Those are all the defs that come from blobs in version N. This can be expensive to compute. Note that it is not a list of new blobs in version N (those that we just parsed); it can include blobs that were parsed long ago and are still part of version N.

I think we could just wait for all defs jobs to finish before starting other jobs. I did a simple modification that waits for the defs threads to finish before starting everything else, and it seems to give consistent results (three attempts on musl with three tags; I tried the default version too and it gave different results each time).
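A rough sketch of that kind of barrier, assuming one thread per tag and per job type. `run_phased`, `index_defs` and `index_refs` are placeholder names; the actual thread layout in update.py is not shown in this thread.

```python
import threading


def run_phased(tags, index_defs, index_refs):
    """Run every defs job to completion before starting any refs job."""
    def_threads = [threading.Thread(target=index_defs, args=(tag,)) for tag in tags]
    for thread in def_threads:
        thread.start()
    for thread in def_threads:
        thread.join()  # barrier: no refs job starts before this loop ends

    ref_threads = [threading.Thread(target=index_refs, args=(tag,)) for tag in tags]
    for thread in ref_threads:
        thread.start()
    for thread in ref_threads:
        thread.join()


# Usage with stand-in jobs that only record the order in which they ran.
events = []
events_lock = threading.Lock()


def fake_defs(tag):
    with events_lock:
        events.append(("defs", tag))


def fake_refs(tag):
    with events_lock:
        events.append(("refs", tag))


run_phased(["v1.0", "v1.1", "v1.2"], fake_defs, fake_refs)
```

With this ordering, every refs job sees the defs of all tags in the batch, so the result no longer depends on thread timing within one run.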

It's also likely that we have a similar problem with comps docs:
https://github.com/bootlin/elixir/blob/master/update.py#L520

tleb · 2024-07-09T13:45:23Z

It will make it consistent if you index the same tags. It will not make it consistent if you index different tags, including if you do incremental indexing.

The trick I used in my experiment is to build the list of defs for version N by starting from the defs of version N-1. On top of that, you (1) remove defs for blobs not present in version N and (2) add defs for new blobs in version N. Using sets it is pretty quick, much faster than re-iterating the whole database to grab all defs for a list of blobs.
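One possible shape for that incremental step, as a sketch only. It assumes a mapping from blob to the identifiers it defines (`defs_by_blob`, hypothetical). It also uses a `Counter` rather than a plain set, which is a choice made here and not something stated above, so that an identifier defined in several blobs survives when only one of those blobs disappears.

```python
from collections import Counter


def advance_defs(prev_counts, prev_blobs, next_blobs, defs_by_blob):
    """Derive per-identifier def counts for the next version from the previous one."""
    counts = Counter(prev_counts)  # work on a copy
    for blob in prev_blobs - next_blobs:   # (1) blobs no longer present
        counts.subtract(defs_by_blob[blob])
    for blob in next_blobs - prev_blobs:   # (2) blobs new in this version
        counts.update(defs_by_blob[blob])
    return +counts  # unary plus drops identifiers whose count fell to zero


# Hypothetical data: which identifiers each blob defines.
defs_by_blob = {
    "b1": {"foo", "bar"},
    "b2": {"foo"},
    "b3": {"baz"},
}
blobs_v1 = {"b1", "b2"}
blobs_v2 = {"b2", "b3"}

counts_v1 = advance_defs(Counter(), set(), blobs_v1, defs_by_blob)
counts_v2 = advance_defs(counts_v1, blobs_v1, blobs_v2, defs_by_blob)

defs_v1 = set(counts_v1)
defs_v2 = set(counts_v2)
print(sorted(defs_v1), sorted(defs_v2))
```

Only the blobs that differ between the two versions are visited, which is where the speed-up over a full database scan would come from.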

Yes, the same problem is probably present for comps.

fstachura · 2024-07-09T14:10:37Z

I don't understand. You still would have to reindex all versions from the start for the references database to be consistent, no?

tleb · 2024-07-09T14:23:34Z

No, if you just computed defs for version N-1, you reuse that for version N. That is an optimization when indexing many versions in a row.

If you don't have easy access to version N-1 defs, you grab them from the database. That's slow (a few seconds, complete database read), but required.

Then, once you have the exact defs valid for version N, your refs will be consistent no matter what.
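To illustrate why that holds, here is a sketch of the slow path under assumed data structures: `blobs_of` maps a version to its blobs and `defs_by_blob` maps a blob to the identifiers it defines. Both names are invented for this example. Refs are matched only against defs from blobs that belong to the version, so defs written for other versions have no effect.

```python
def exact_defs(version, blobs_of, defs_by_blob):
    """Slow path: collect defs from every blob that belongs to the version."""
    defs = set()
    for blob in blobs_of[version]:
        defs |= defs_by_blob[blob]
    return defs


def refs_for_version(version, tokens, blobs_of, defs_by_blob):
    """Keep only the tokens that match a def valid for this version."""
    defs = exact_defs(version, blobs_of, defs_by_blob)
    return [tok for tok in tokens if tok in defs]


# Hypothetical data.
blobs_of = {"v1": {"b1", "b2"}, "v2": {"b2", "b3"}}
tokens_v1 = ["foo", "baz", "qux", "bar"]

# Database state when only v1 defs have been indexed...
db_only_v1 = {"b1": {"foo", "bar"}, "b2": {"foo"}}
# ...and when v2 defs have been indexed as well.
db_v1_and_v2 = {"b1": {"foo", "bar"}, "b2": {"foo"}, "b3": {"baz"}}

refs_a = refs_for_version("v1", tokens_v1, blobs_of, db_only_v1)
refs_b = refs_for_version("v1", tokens_v1, blobs_of, db_v1_and_v2)
print(refs_a, refs_b)
```

Both calls return the same refs for v1, whether or not v2 had already been indexed.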

fstachura added the bug label Jun 25, 2024

fstachura mentioned this issue Jun 25, 2024

Remove net from keyword blacklist #286

Open

tleb added the indexing label and removed the bug label Feb 25, 2025

tleb mentioned this issue Feb 25, 2025

update: Rewrite update script #372

Open
