Not matching 2 obvious records.. #212

cage77 · 2024-12-09T10:42:11Z

Hi,
I am using dedupliFHIR to find duplicate records in a generated CSV:

[[email protected] cli]$ cat /tmp/dedup.csv
"unique_id","family_name","given_name","gender","birth_date"
"23","morrison","elizabeth","F","12/05/1953"
"24","morrison","elizabeth","F","12/05/1953"

I run:
[[email protected] cli]$ python3.9 ecqm_dedupe.py dedupe-data --fmt CSV /tmp/dedup.csv /tmp/ddd.csv
(I've cut some names here)
Stats for nerds:
   blocking_rule                                              row_count  cumulative_rows  cartesian  match_key  start
0  l."birth_date" = r."birth_date"                                  187              187     125751          0      0
1  (l."ssn" = r."ssn") AND (l."birth_date" = r."birth_date")          0              187     125751          1    187
2  l."phone" = r."phone"                                          48079            48266     125751          2    187
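
For context on the size of the input: assuming the `cartesian` column counts every unordered pair of input records (n * (n - 1) / 2, which is an assumption about how this table is computed, not something the log states), the value 125751 can be inverted to get the record count. The helper name `records_from_pair_count` is made up for this sketch.

```python
import math


def records_from_pair_count(pairs: int) -> int:
    """Invert pairs = n * (n - 1) / 2 for a whole number of records n."""
    # Solve n^2 - n - 2 * pairs = 0 with integer arithmetic.
    n = (1 + math.isqrt(1 + 8 * pairs)) // 2
    if n * (n - 1) // 2 != pairs:
        raise ValueError(f"{pairs} is not of the form n * (n - 1) / 2")
    return n


# Value taken from the `cartesian` column above.
n_records = records_from_pair_count(125751)
print(n_records)  # 502
```

Under that assumption the run compared 502 records, which fits the fact that the CSV listing above was cut down to the two rows in question.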
----- Estimating u probabilities using random sampling -----
u probability not trained for postal_code0 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
u probability not trained for postal_code0 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
u probability not trained for postal_code0 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
u probability not trained for postal_code1 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
u probability not trained for postal_code1 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
u probability not trained for postal_code1 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.

Estimated u probabilities using random sampling

Your model is not yet fully trained. Missing estimates for:
- street_address0 (no m values are trained).
- postal_code0 (some u values are not trained, no m values are trained).
- street_address1 (no m values are trained).
- postal_code1 (some u values are not trained, no m values are trained).
- phone (no m values are trained).
- given_name (no m values are trained).
- family_name (no m values are trained).
- birth_date (no m values are trained).
/home/cage/public_html/mdinteractive-00/scripts/dedupliFHIR/cli

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."ssn" = r."ssn"

Parameter estimates will be made for the following comparison(s):
- street_address0
- postal_code0
- street_address1
- postal_code1
- phone
- given_name
- family_name
- birth_date

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:

WARNING:
Level Exact match on street_address0 on comparison street_address0 not observed in dataset, unable to train m value

WARNING:
Level Exact match on sector on comparison postal_code0 not observed in dataset, unable to train m value

WARNING:
Level Exact match on district on comparison postal_code0 not observed in dataset, unable to train m value

WARNING:
Level Exact match on area on comparison postal_code0 not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison postal_code0 not observed in dataset, unable to train m value

WARNING:
Level Exact match on street_address1 on comparison street_address1 not observed in dataset, unable to train m value

WARNING:
Level Exact match on sector on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level Exact match on district on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level Exact match on area on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level Jaro-Winkler distance of given_name >= 0.7 on comparison given_name not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison given_name not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison family_name not observed in dataset, unable to train m value

WARNING:
Level DamerauLevenshtein distance <= 1 on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level Abs date difference <= 1 month on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level Abs date difference <= 1 year on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level Abs date difference <= 10 year on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison birth_date not observed in dataset, unable to train m value

Iteration 1: Largest change in params was 0.996 in probability_two_random_records_match
Iteration 2: Largest change in params was 0.00392 in probability_two_random_records_match

EM converged after 2 iterations
m probability not trained for street_address0 - Exact match on street_address0 (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code0 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code0 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code0 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code0 - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for street_address1 - Exact match on street_address1 (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for given_name - Jaro-Winkler distance of given_name >= 0.7 (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for given_name - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for family_name - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - DamerauLevenshtein distance <= 1 (comparison vector value: 4). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - Abs date difference <= 1 month (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - Abs date difference <= 1 year (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - Abs date difference <= 10 year (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.

Your model is not yet fully trained. Missing estimates for:
- street_address0 (some m values are not trained).
- postal_code0 (some u values are not trained, some m values are not trained).
- street_address1 (some m values are not trained).
- postal_code1 (some u values are not trained, some m values are not trained).
- given_name (some m values are not trained).
- family_name (some m values are not trained).
- birth_date (some m values are not trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
l."birth_date" = r."birth_date"

Parameter estimates will be made for the following comparison(s):
- street_address0
- postal_code0
- street_address1
- postal_code1
- phone
- given_name
- family_name

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
- birth_date

WARNING:
Level Exact match on sector on comparison postal_code0 not observed in dataset, unable to train m value

WARNING:
Level Exact match on district on comparison postal_code0 not observed in dataset, unable to train m value

WARNING:
Level Exact match on area on comparison postal_code0 not observed in dataset, unable to train m value

WARNING:
Level Exact match on sector on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level Exact match on district on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level Exact match on area on comparison postal_code1 not observed in dataset, unable to train m value

Iteration 1: Largest change in params was -0.922 in the m_probability of street_address1, level Exact match on street_address1
Iteration 2: Largest change in params was 0.00042 in probability_two_random_records_match

EM converged after 2 iterations
m probability not trained for postal_code0 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code0 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code0 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.

Your model is not yet fully trained. Missing estimates for:
- postal_code0 (some u values are not trained, some m values are not trained).
- postal_code1 (some u values are not trained, some m values are not trained).
- birth_date (some m values are not trained).

----- Starting EM training session -----

Estimating the m probabilities of the model by blocking on:
(l."street_address0" = r."street_address0") AND (l."postal_code0" = r."postal_code0")

Parameter estimates will be made for the following comparison(s):
- street_address1
- postal_code1
- phone
- given_name
- family_name
- birth_date

Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules:
- street_address0
- postal_code0

WARNING:
Level Exact match on street_address1 on comparison street_address1 not observed in dataset, unable to train m value

WARNING:
Level Exact match on sector on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level Exact match on district on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level Exact match on area on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison postal_code1 not observed in dataset, unable to train m value

WARNING:
Level Exact match on given_name on comparison given_name not observed in dataset, unable to train m value

WARNING:
Level Jaro-Winkler distance of given_name >= 0.88 on comparison given_name not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison given_name not observed in dataset, unable to train m value

WARNING:
Level Exact match on family_name on comparison family_name not observed in dataset, unable to train m value

WARNING:
Level Jaro-Winkler distance of family_name >= 0.88 on comparison family_name not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison family_name not observed in dataset, unable to train m value

WARNING:
Level DamerauLevenshtein distance <= 1 on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level Abs date difference <= 1 month on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level Abs date difference <= 1 year on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level Abs date difference <= 10 year on comparison birth_date not observed in dataset, unable to train m value

WARNING:
Level All other comparisons on comparison birth_date not observed in dataset, unable to train m value

Iteration 1: Largest change in params was 0.187 in the m_probability of phone, level Exact match on phone
Iteration 2: Largest change in params was 7.74e-10 in probability_two_random_records_match

EM converged after 2 iterations
m probability not trained for street_address1 - Exact match on street_address1 (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on sector (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on district (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - Exact match on area (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for postal_code1 - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for given_name - Exact match on given_name (comparison vector value: 4). This usually means the comparison level was never observed in the training data.
m probability not trained for given_name - Jaro-Winkler distance of given_name >= 0.88 (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for given_name - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for family_name - Exact match on family_name (comparison vector value: 4). This usually means the comparison level was never observed in the training data.
m probability not trained for family_name - Jaro-Winkler distance of family_name >= 0.88 (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for family_name - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - DamerauLevenshtein distance <= 1 (comparison vector value: 4). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - Abs date difference <= 1 month (comparison vector value: 3). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - Abs date difference <= 1 year (comparison vector value: 2). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - Abs date difference <= 10 year (comparison vector value: 1). This usually means the comparison level was never observed in the training data.
m probability not trained for birth_date - All other comparisons (comparison vector value: 0). This usually means the comparison level was never observed in the training data.

Your model is not yet fully trained. Missing estimates for:
- postal_code0 (some u values are not trained, some m values are not trained).
- postal_code1 (some u values are not trained, some m values are not trained).
- birth_date (some m values are not trained).
Blocking time: 0.03 seconds
Predict time: 0.49 seconds

-- WARNING --
You have called predict(), but there are some parameter estimates which have neither been estimated or specified in your settings dictionary. To produce predictions the following untrained trained parameters will use default values.
Comparison: 'postal_code0':
m values not fully trained
Comparison: 'postal_code0':
u values not fully trained
Comparison: 'postal_code1':
m values not fully trained
Comparison: 'postal_code1':
u values not fully trained
Comparison: 'birth_date':
m values not fully trained
The 'probability_two_random_records_match' setting has been set to the default value (0.0001).
If this is not the desired behaviour, either:

assign a value for probability_two_random_records_match in your settings dictionary, or
estimate with the linker.estimate_probability_two_random_records_match function.
Completed iteration 1, num representatives needing updating: 0

The result is:
[[email protected] cli]$ cat /tmp/ddd.csv
,cluster_id,unique_id,path,family_name,given_name,gender,birth_date,id,truth_value,phone,street_address0,city0,state0,postal_code0,street_address1,city1,state1,postal_code1,ssn
394,23,23,,morrison,elizabeth,F,1953-12-05,,,,,,,,,,,,
458,24,24,,morrison,elizabeth,F,1953-12-05,,,,,,,,,,,,
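
A quick way to confirm this from the output file is to group `unique_id` values by `cluster_id`. This is a minimal sketch using only the standard library, with the two output rows above pasted in as a string; the function name `clusters` is just for illustration.

```python
import csv
import io
from collections import defaultdict

# The header and the two rows from /tmp/ddd.csv, copied from above.
OUTPUT = """\
,cluster_id,unique_id,path,family_name,given_name,gender,birth_date,id,truth_value,phone,street_address0,city0,state0,postal_code0,street_address1,city1,state1,postal_code1,ssn
394,23,23,,morrison,elizabeth,F,1953-12-05,,,,,,,,,,,,
458,24,24,,morrison,elizabeth,F,1953-12-05,,,,,,,,,,,,
"""


def clusters(text: str) -> dict:
    """Map each cluster_id to the list of unique_id values assigned to it."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        groups[row["cluster_id"]].append(row["unique_id"])
    return dict(groups)


result = clusters(OUTPUT)
# Keep only clusters that contain more than one record.
duplicates = {cid: ids for cid, ids in result.items() if len(ids) > 1}

print(result)      # {'23': ['23'], '24': ['24']}
print(duplicates)  # {}
```

Each record ends up alone in its own cluster, so no duplicate group is reported for this pair.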

It doesn't seem to put these 2 records in the same cluster as duplicates. Other records seem to be matched correctly, but this pair is strangely not detected as a duplicate, and I'm not sure why.
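
One thing from the log that may be relevant (I have not confirmed it is the cause): `probability_two_random_records_match` fell back to the default of 0.0001. Assuming the usual Fellegi-Sunter scoring, where the final match weight is log2 of the prior odds plus the log2 Bayes factors of each comparison, the sketch below shows how much evidence a pair needs to overcome that prior. The thresholds 0.5 and 0.95 are illustrative only; I don't know which threshold the tool actually uses for clustering.

```python
import math


def prob_to_weight(p: float) -> float:
    """Convert a probability into a match weight (log2 odds)."""
    return math.log2(p / (1 - p))


def weight_to_prob(w: float) -> float:
    """Convert a match weight (log2 odds) back into a probability."""
    return 2**w / (1 + 2**w)


# Default prior reported in the predict() warning above.
prior = 0.0001
prior_weight = prob_to_weight(prior)
print(f"prior match weight: {prior_weight:.2f}")  # about -13.29

# Illustrative thresholds (assumptions, not taken from the tool).
for threshold in (0.5, 0.95):
    needed = prob_to_weight(threshold) - prior_weight
    print(f"evidence needed to reach {threshold}: {needed:.2f} bits")
```

With most columns empty for these two records, only given_name, family_name and birth_date can contribute evidence, and the log says some birth_date m values were left at defaults, so it is possible the pair simply does not accumulate enough weight.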
