Improve spaCy performance #154
Comments
Thanks for digging into this so thoroughly. It sounds like there are advantages and disadvantages to each, and performance may depend heavily on the data being run through it. Fortunately, IsIdentifiable can run multiple versions of NER without recompilation, and we can even run several at once if helpful (either in parallel or with ConsensusRule). But I'd like to start by documenting how to set up each of those options. If the testing script is problematic, we can always adjust it to just expect 1+ classifications, for example. I'll start by updating my docs PR based on your feedback. I see that @jas has added the
The docs should be written from the perspective of a new user who just wants to run on some CSVs / DICOMs etc. I've been telling our data analysts about how this would simplify validation (currently one analyst is having to review 100 datasets extracted as part of GOFUSION / GODARTS and is eyeballing the data manually 😮).
Q: How does this compare to Stanford NER in the Java nerd we use at present? Is there a pressing need to switch to/add spaCy?
This is fairly low priority at the moment, seeing as we're not actively using spaCy in production, so let's leave this on the backlog.
As seen in #151 (comment), the NER component of spaCy has changed from v2 to v3. With the sample test string "We are taking John to Queen Margaret Hospital today." you would expect two or three entities: John, Queen Margaret Hospital, and today. Alternatively, just Queen Margaret or Margaret Hospital would suffice. However, v3 only found "today".
It turns out that the particular test string chosen was only correct in v2 due to good luck, and only wrong in v3 due to bad luck.
A test program has been written to try "We are taking X to Y today." for various combinations of X and Y, in both spaCy v2 and v3, and with several language models. The results show that the output is very sensitive to the particular text of X or Y (actually, of X and Y together). This is not intuitive, as we might have expected any X or Y, even nonsense names, to make grammatical sense.
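For anyone wanting to reproduce this kind of check, here is a minimal sketch of how such a comparison could be structured. It is not the actual test program: the function names (build_sentences, spacy_extractor, overlaps, evaluate) are hypothetical, and it assumes the chosen spaCy model has already been downloaded. It only uses spacy.load and doc.ents, which are available in both v2 and v3.

```python
from itertools import product

TEMPLATE = "We are taking {x} to {y} today."


def build_sentences(names, places):
    """Return (x, y, sentence) for every combination of name and place."""
    return [(x, y, TEMPLATE.format(x=x, y=y)) for x, y in product(names, places)]


def spacy_extractor(model_name):
    """Load a spaCy model and return a function mapping text to entity strings."""
    import spacy  # imported lazily so the rest of the sketch runs without spaCy

    nlp = spacy.load(model_name)  # assumes the model is installed
    return lambda text: [ent.text for ent in nlp(text).ents]


def overlaps(expected, found):
    """True if any found entity contains, or is contained in, the expected text.

    Partial matches count, so "Queen Margaret" is accepted for
    "Queen Margaret Hospital".
    """
    return any(bool(ent) and (expected in ent or ent in expected) for ent in found)


def evaluate(extract, names, places):
    """Run the extractor over every X/Y combination and record what was found."""
    rows = []
    for x, y, sentence in build_sentences(names, places):
        found = extract(sentence)
        rows.append({
            "sentence": sentence,
            "entities": found,
            "found_x": overlaps(x, found),
            "found_y": overlaps(y, found),
            # A single entity spanning both X and Y, e.g. "John to Y Hospital".
            "merged": any(x in ent and y in ent for ent in found),
        })
    return rows
```

With spaCy installed, something like evaluate(spacy_extractor("en_core_web_lg"), ["John"], ["Queen Margaret Hospital"]) could then be run once per spaCy version and model, and the rows compared side by side.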
The best performance now comes from using spaCy v3 with the en_core_web_trf language model, although as this is transformer-based it requires additional Python modules, including PyTorch, and so would undoubtedly work much faster with a GPU. spaCy v2 with the en_core_web_lg model is also quite good, but has some unusual behaviour, such as producing the single entity "John to Y Hospital" instead of two separate entities.
We should consider whether we change to v2 and the lg model, or v3 and the trf model. This would require an upgrade to the live pipeline in the safe haven.