This issue seems similar to #2, but I'm not encountering problems during pair generation.
Hello @djvanderlaan, thanks for the package, it's really easy to use, but I'm having an issue when trying to link two subdatasets.
In my case, I'm using a dataset of disease reports where some reports are for returning patients. I want to link the return patients to their original visit, but there is no unique ID, hence probabilistic linkage. Below is a minimal example of the data:
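| id | sex | age | return_visit | date   |
|----|-----|-----|--------------|--------|
| 1  | M   | 25  | TRUE         | Aug 01 |
| 2  | F   | 19  | TRUE         | Sep 29 |
| 3  | M   | 25  | FALSE        | Sep 15 |
| 4  | F   | 19  | FALSE        | Jul 19 |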
I have extra variables to "identify" my patients, but the basic idea is that 3 and 4 are the same people as 1 and 2. So I created two subdatasets based on the return_visit variable and used just simple blocking and some filtering to reduce the number of pairs.
So far so good.
I then created a dummy variable with value TRUE so it would capture all pairs and used a selection of date variables to select for possible cases, leaving me with just 83,000 pairs. The problem arises when I use `link()`: it returns 3.3 million records.
Surely my pairs are in there, but the result looks like a `full_join` of both datasets, and I cannot for the life of me understand why `link()` doesn't respect my selection variable or why it includes every record. Is it an issue with subdatasets? Have I made a mistake somewhere between using `select_greedy()` and `link()`? Is it an issue with using datasets that have the same number of variables, all with the same names?
Unfortunately, for confidentiality reasons, I cannot provide a reprex, but if you can point me in the right direction I'll do my own research. Thanks.
| Stage           | Number of records     |
|-----------------|-----------------------|
| return visits   | 550 thousand patients |
| first visits    | 2.8 million patients  |
| after blocking  | 1.07 million pairs    |
| after filtering | 83 thousand pairs     |
| after linking   | 3.3 million records   |
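As a rough sanity check on the full-join suspicion, the arithmetic below compares the row counts that different join types would produce from the rounded counts listed above. It is plain Python, not reclin code. The helper `joined_rows` is hypothetical, and it assumes the selection is one-to-one, meaning each record appears in at most one selected pair.

```python
# Rounded counts taken from the issue text.
n_return = 550_000     # return visits
n_first = 2_800_000    # first visits
n_selected = 83_000    # pairs left after filtering


def joined_rows(n_x, n_y, n_pairs, keep_all_x, keep_all_y):
    """Rows in a join of x and y on n_pairs matched pairs.

    Assumes a one-to-one selection: every record is used in at most
    one pair, so each pair removes one unmatched record per side.
    """
    rows = n_pairs
    if keep_all_x:
        rows += n_x - n_pairs  # unmatched x records are kept
    if keep_all_y:
        rows += n_y - n_pairs  # unmatched y records are kept
    return rows


inner = joined_rows(n_return, n_first, n_selected, False, False)
full = joined_rows(n_return, n_first, n_selected, True, True)

print(inner)  # 83000
print(full)   # 3267000
```

Under these assumptions a full outer join gives about 3.27 million rows, which is close to the 3.3 million records I get back, while an inner join on the selected pairs would give only 83 thousand.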