It would be faster and more memory efficient to train u probabilities comparison-by-comparison rather than doing them 'all in one' because:
Doing them 'all in one' uses more memory, and spill to disk is a significant source of slowdown. See here, but I also noticed big nonlinearities due to spill to disk when I did performance testing and ran out of memory due to large embedding vectors
Some u probabilities require a much larger value of max_pairs than others. For instance, you don't need to generate a million comparisons to get u values for, say, a 'gender' field. But you might need 100 million for a very high cardinality field like social security number.
With the latter, we could make 'max pairs' more dynamic using the techniques described here
But an initial PR may just do this calculation comparison by comparison
This also has the advantage that we could output timings showing the relative computational intensity of each comparison
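A minimal sketch of what comparison-by-comparison estimation could look like, written in plain pandas/NumPy rather than against splink's actual API; the column names, file path and max_pairs values are illustrative assumptions, and the timing output is just to show how per-comparison timings could be surfaced:

```python
import time

import numpy as np
import pandas as pd


def estimate_u_for_column(df: pd.DataFrame, col: str, max_pairs: int, seed: int = 0) -> float:
    """Estimate u for an exact-match level: the proportion of randomly
    sampled record pairs that agree on `col`."""
    rng = np.random.default_rng(seed)
    values = df[col].dropna().to_numpy()
    # In practice you would sample in chunks to keep memory bounded.
    left = values[rng.integers(0, len(values), size=max_pairs)]
    right = values[rng.integers(0, len(values), size=max_pairs)]
    return float(np.mean(left == right))


# Illustrative: low-cardinality fields need far fewer pairs than
# high-cardinality ones like social security number.
max_pairs_by_column = {"gender": 1_000_000, "social_security_number": 100_000_000}

df = pd.read_parquet("records.parquet")  # hypothetical input table

for col, n_pairs in max_pairs_by_column.items():
    start = time.perf_counter()
    u = estimate_u_for_column(df, col, n_pairs)
    elapsed = time.perf_counter() - start
    print(f"{col}: u = {u:.6g} ({elapsed:.1f}s, {n_pairs:,} pairs)")
```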
I think there are further optimizations to make than even what you suggest here.
You could count the number of occurrences of each value, then do weighted sampling of pairs from that aggregate table, and weight the calculation by the commonness of each pair. That way you get more precise estimates with less computational work, and much less for fields with low cardinality (where you could easily sample 100% of unique pairs). I believe this is similar to some optimizations that are present in fastLink.
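A rough illustration of that idea, assuming a pandas value-frequency table as the aggregate and a placeholder `comparison_fn` standing in for whichever comparison level is being trained; for a low-cardinality field the distinct value pairs could simply be enumerated exactly instead of sampled:

```python
import numpy as np
import pandas as pd


def estimate_u_weighted(
    df: pd.DataFrame, col: str, comparison_fn, n_samples: int, seed: int = 0
) -> float:
    """Sample distinct values in proportion to how often they occur, then
    evaluate the comparison on the sampled value pairs."""
    counts = df[col].dropna().value_counts()      # aggregate table: value -> count
    values = counts.index.to_numpy()
    probs = counts.to_numpy() / counts.sum()      # sampling weights = commonness
    rng = np.random.default_rng(seed)
    left = rng.choice(values, size=n_samples, p=probs)
    right = rng.choice(values, size=n_samples, p=probs)
    matches = [comparison_fn(a, b) for a, b in zip(left, right)]
    return float(np.mean(matches))
```

The work here scales with the number of distinct values and samples rather than with the number of underlying record pairs, which is where the saving comes from for low-cardinality fields.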
For exact match levels that are the first level in their comparison (admittedly not the most interesting, but these are common), you can calculate the u probabilities exactly analytically from that aggregate table. Simply divide the counts by the total number of non-null values, then take the sum of the squares of those proportions.
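For concreteness, that calculation in pandas (column name purely illustrative):

```python
import pandas as pd


def exact_match_u(df: pd.DataFrame, col: str) -> float:
    """Probability that two randomly chosen non-null values of `col` agree."""
    counts = df[col].dropna().value_counts()
    proportions = counts / counts.sum()      # divide counts by total non-null values
    return float((proportions ** 2).sum())   # sum of squares of the proportions
```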