
Improve spaCy performance #154

Open
howff opened this issue Aug 10, 2022 · 3 comments

Comments

@howff

howff commented Aug 10, 2022

As seen in #151 (comment), the NER component of spaCy has changed from v2 to v3. With the sample test string "We are taking John to Queen Margaret Hospital today." you would expect two or three entities: John, Queen Margaret Hospital, and today (alternatively, just Queen Margaret or Margaret Hospital would suffice). However, v3 only found "today".

It turns out that the particular test string chosen was handled correctly in v2 only by good luck, and incorrectly in v3 only by bad luck.

A test program has been written to try "We are taking X to Y today." for various combinations of X and Y, in both spaCy v2 and v3, and with several language models. The results show that recognition is very sensitive to the particular text of X or Y (in fact, of X and Y in combination), which is not intuitive: we might have expected any X or Y, even nonsense names, to make grammatical sense.
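A minimal sketch of such a probe in Python (the helper names `make_sentence` and `probe`, and the example X/Y lists, are illustrative, not taken from the actual test program; the NER step needs spaCy and a downloaded model, e.g. `python -m spacy download en_core_web_sm`):

```python
def make_sentence(x, y):
    """Build the probe sentence for a given name X and destination Y."""
    return f"We are taking {x} to {y} today."


def probe(model_name, names, places):
    """Run NER over every X/Y combination and print the entities found."""
    import spacy  # imported here so make_sentence is usable without spaCy installed

    nlp = spacy.load(model_name)
    for x in names:
        for y in places:
            doc = nlp(make_sentence(x, y))
            ents = [(ent.text, ent.label_) for ent in doc.ents]
            print(f"{x!r} | {y!r} -> {ents}")


# Example invocation; swap in "en_core_web_lg" or "en_core_web_trf" to compare:
# probe("en_core_web_sm",
#       names=["John", "Alice"],
#       places=["Queen Margaret Hospital", "Ninewells"])
```

Running the same grid under several models makes the sensitivity to the exact X/Y text directly visible.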

The best performance now comes from spaCy v3 with the en_core_web_trf language model, although as this model is transformer-based it requires additional Python modules, including pytorch, and runs much faster with a GPU. spaCy v2 with the en_core_web_lg model is also quite good, but has some unusual behaviour, such as producing the single entity "John to Y Hospital" instead of two separate entities.
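For reference, a typical install of the two options looks roughly like this (package names are spaCy's standard ones; the version pins are illustrative, not from this repo's requirements):

```shell
# spaCy v3 with transformer support (pulls in pytorch via spacy-transformers)
pip install "spacy>=3.0" spacy-transformers
python -m spacy download en_core_web_trf

# or the v2 alternative discussed above
# pip install "spacy<3"
# python -m spacy download en_core_web_lg
```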

We should consider whether to change to v2 with the lg model, or v3 with the trf model. Either choice would require an upgrade to the live pipeline in the safe haven.

@tznind
Contributor

tznind commented Aug 10, 2022

Thanks for digging into this so thoroughly. It sounds like there are advantages and disadvantages to each, and performance may be heavily dependent on the data being run through it.

Fortunately IsIdentifiable can run multiple versions of NER without recompilation. And we can even run multiple at once if helpful (either in parallel or with ConsensusRule).

But I'd like to start by documenting how to set up each of those options. If the testing script is problematic, we can always adjust it to just expect one or more classifications, for example.
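That relaxed check could be as simple as the following (`passes` is a hypothetical helper name, not part of the existing script):

```python
def passes(entities):
    """Relaxed acceptance: the model found at least one entity of any kind,
    rather than requiring the exact expected spans."""
    return len(entities) >= 1


# A result with any entity passes; an empty result fails.
assert passes([("Queen Margaret Hospital", "ORG")])
assert not passes([])
```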

I'll start by updating my docs PR based on your feedback.

I see that @jas has added the -d option for running with a specific language file (see #153). Maybe we can beef up the script so the daemon can more easily be run with one language file or another. But we need to make sure all dependencies are clearly listed in our docs for new users of the tool.

The docs should be written from the perspective of a new user who just wants to run the tool on some CSVs / DICOMs etc. I've been telling our data analysts about how this would simplify validation (currently one analyst is having to review 100 datasets extracted as part of GOFUSION / GODARTS, and is eyeballing the data manually 😮).

@jas88
Member

jas88 commented Aug 14, 2022

Q: How does this compare to Stanford NER in the Java nerd we use at present? Is there a pressing need to switch to/add spaCy?

@rkm
Member

rkm commented Aug 15, 2022

This is fairly low priority at the moment, seeing as we're not actively using spaCy in production, so let's leave this on the backlog.
