Optimise speed of training u #2539

Open
RobinL opened this issue Dec 5, 2024 · 1 comment

RobinL (Member) commented Dec 5, 2024

It would be faster and more memory-efficient to train u probabilities comparison by comparison rather than doing them 'all in one', because:

  • Doing them 'all in one' uses more memory, and spill to disk is a significant source of slowdown. See here, but I also noticed big nonlinearities due to spill to disk when I did performance testing and ran out of memory due to large embedding vectors.
  • Some u probabilities require a much larger value of max_pairs than others. For instance, you don't need to generate a million comparisons to get u values for, say, a 'gender' field, but you might need 100 million for a very high-cardinality field like social security number.

With the latter, we could make max_pairs more dynamic using the techniques described here.

But an initial PR may just do this calculation comparison by comparison.

This also has the advantage that we could output timings showing the relative computational intensity of each comparison.
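
A minimal sketch of what the comparison-by-comparison loop could look like, assuming a hypothetical `estimate_u_for_comparison` helper that samples random pairs and returns u probabilities for a single comparison (none of these names are existing Splink API):

```python
# Hypothetical sketch only - not the current Splink API.
# Estimate u for each comparison separately, with a per-comparison max_pairs
# and simple timings, so memory is bounded by one comparison's pairs at a time.
import time


def estimate_u_comparison_by_comparison(comparisons, estimate_u_for_comparison):
    results = {}
    timings = {}
    for name in comparisons:
        # Low-cardinality fields (e.g. gender) need far fewer pairs than
        # high-cardinality ones (e.g. social security number).
        max_pairs = 1_000_000 if name == "gender" else 100_000_000
        start = time.time()
        results[name] = estimate_u_for_comparison(name, max_pairs=max_pairs)
        timings[name] = time.time() - start
        print(f"{name}: {timings[name]:.1f}s with max_pairs={max_pairs:,}")
    return results, timings
```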

zmbc (Contributor) commented Dec 6, 2024

I think there are further optimizations possible beyond even what you suggest here.

You could count the number of occurrences of each value, then do weighted sampling of pairs from that aggregate table, weighting the calculation by the commonness of each pair. That way you get more precise estimates with less computational work, and much less work for low-cardinality fields (where you could easily sample 100% of unique pairs). I believe this is similar to some of the optimizations present in fastLink.
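
A rough sketch of the aggregate-table idea, assuming a pandas DataFrame `df` and a `comparison_fn(a, b)` that returns the comparison level for a pair of values (a toy illustration, not fastLink's or Splink's implementation):

```python
import pandas as pd
from itertools import product


def u_from_value_counts(df, col, comparison_fn):
    counts = df[col].dropna().value_counts()
    values = list(counts.index)
    p = counts.to_numpy() / counts.sum()  # proportion of records holding each value
    u = {}
    # For a low-cardinality field the cross product of unique values is tiny,
    # so we can cover 100% of unique value pairs and weight each one by its
    # commonness p_i * p_j among random record pairs.
    for (i, a), (j, b) in product(enumerate(values), repeat=2):
        level = comparison_fn(a, b)
        u[level] = u.get(level, 0.0) + p[i] * p[j]
    return u  # share of random pairs falling in each comparison level
```

For a high-cardinality field you would instead sample value pairs with probability proportional to p_i * p_j rather than enumerating them all.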

For exact match levels that are the first level in their comparison (admittedly not the most interesting case, but these are common), you can calculate the u probabilities exactly and analytically from that aggregate table: simply divide the counts by the total number of non-null values, then take the sum of the squares of those proportions.
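
For example (a toy illustration with a made-up Series, assuming pandas):

```python
import pandas as pd

s = pd.Series(["m", "f", "f", "m", "f", None])
p = s.dropna().value_counts(normalize=True)  # proportions of non-null values
u_exact = (p**2).sum()  # probability two random records share the same value
print(u_exact)  # 0.6**2 + 0.4**2 = 0.52
```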
