Optimise speed of training u #2539

Open
RobinL opened this issue Dec 5, 2024 · 1 comment

RobinL (Member) commented Dec 5, 2024

It would be faster and more memory-efficient to train u probabilities comparison by comparison rather than doing them 'all in one', because:

  • Doing them 'all in one' uses more memory, and spill to disk is a significant source of slowdown. See here, but I also noticed big nonlinearities due to spill to disk when I did performance testing and ran out of memory due to large embedding vectors.
  • Some u probabilities require a much larger value of max_pairs than others. For instance, you don't need to generate a million comparisons to get u values for, say, a 'gender' field, but you might need 100 million for a very high-cardinality field like social security number.

With the latter, we could make max_pairs more dynamic using the techniques described here.

But an initial PR may just do this calculation comparison by comparison.

This also has the advantage that we could output timings showing the relative computational intensity of each comparison.
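
A minimal sketch of what the comparison-by-comparison loop could look like, assuming a hypothetical `estimate_u_for_comparison` helper that samples random pairs and returns u probabilities for a single comparison (none of these names are existing Splink API):

```python
# Hypothetical sketch only - not the current Splink API.
# Estimate u for each comparison separately, with a per-comparison max_pairs
# and simple timings, so memory is bounded by one comparison's pairs at a time.
import time


def estimate_u_comparison_by_comparison(comparisons, estimate_u_for_comparison):
    results = {}
    timings = {}
    for name in comparisons:
        # Low-cardinality fields (e.g. gender) need far fewer pairs than
        # high-cardinality ones (e.g. social security number).
        max_pairs = 1_000_000 if name == "gender" else 100_000_000
        start = time.time()
        results[name] = estimate_u_for_comparison(name, max_pairs=max_pairs)
        timings[name] = time.time() - start
        print(f"{name}: {timings[name]:.1f}s with max_pairs={max_pairs:,}")
    return results, timings
```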

zmbc (Contributor) commented Dec 6, 2024

I think there are further optimizations possible beyond even what you suggest here.

You could count the number of occurrences of each value, then do weighted sampling of pairs from that aggregate table, weighting the calculation by the commonness of each pair. That way you get more precise estimates with less computational work, and much less work for low-cardinality fields (where you could easily sample 100% of unique pairs). I believe this is similar to some of the optimizations present in fastLink.
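
A rough sketch of the aggregate-table idea, assuming a pandas DataFrame `df` and a `comparison_fn(a, b)` that returns the comparison level for a pair of values (a toy illustration, not fastLink's or Splink's implementation):

```python
import pandas as pd
from itertools import product


def u_from_value_counts(df, col, comparison_fn):
    counts = df[col].dropna().value_counts()
    values = list(counts.index)
    p = counts.to_numpy() / counts.sum()  # proportion of records holding each value
    u = {}
    # For a low-cardinality field the cross product of unique values is tiny,
    # so we can cover 100% of unique value pairs and weight each one by its
    # commonness p_i * p_j among random record pairs.
    for (i, a), (j, b) in product(enumerate(values), repeat=2):
        level = comparison_fn(a, b)
        u[level] = u.get(level, 0.0) + p[i] * p[j]
    return u  # share of random pairs falling in each comparison level
```

For a high-cardinality field you would instead sample value pairs with probability proportional to p_i * p_j rather than enumerating them all.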

For exact match levels that are the first level in their comparison (admittedly not the most interesting case, but these are common), you can calculate the u probabilities exactly and analytically from that aggregate table: simply divide the counts by the total number of non-null values, then take the sum of the squares of those proportions.
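
For example (a toy illustration with a made-up Series, assuming pandas):

```python
import pandas as pd

s = pd.Series(["m", "f", "f", "m", "f", None])
p = s.dropna().value_counts(normalize=True)  # proportions of non-null values
u_exact = (p**2).sum()  # probability two random records share the same value
print(u_exact)  # 0.6**2 + 0.4**2 = 0.52
```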
