diff --git a/.gitignore b/.gitignore index 2937fcf8..24212224 100644 --- a/.gitignore +++ b/.gitignore @@ -1,4 +1,8 @@ # General file exclusion +deduce/data/lookup/cache/* + +# Top level scripts ignored by default +/*.py # Exclude the following filetypes *.sav diff --git a/CHANGELOG.md b/CHANGELOG.md index c157f3e0..46176dbe 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -5,15 +5,47 @@ All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). +## 3.0.0 (2023-12-20) + +### Added +- speed optimizations, ~250% faster +- pseudo-annotating eponymous diseases (e.g. Creutzfeldt-Jakob) +- `PatientNameAnnotator`, which replaces `deduce.pattern` +- a structured way of loading and building lookup structures (lists and tries), including caching +- `pre_match_words` for some regexp annotators, speeding up annotation +- option to present a user config as a dict (using the `config` keyword) + +### Changed +- speedup for `TokenPatternAnnotator` +- some internals of `ContextPatternAnnotator` +- initials now detected by a lookup list, rather than a pattern +- redactor open and close chars from `<` `>` to `[` `]`, as previous chars caused issues in HTML (so deidentified text now shows `[PATIENT]`, `[LOCATIE]`, etc.) +- names of lookup structures to singular (`prefix`, rather than `prefixes`) +- `INSTELLING` tag to `ZIEKENHUIS` and `ZORGINSTELLING` +- refactored and simplified annotator loading, specifically the `annotator_type` config keyword now accepts references to classes (e.g. `deduce.annotator.TokenPatternAnnotator`) +- renamed `interfix_with_capital` annotator to `interfix_with_name` + +### Deprecated +- the `config_file` keyword, now replaced by `config` which accepts both filenames and dicts +- old lookup list names, e.g. `prefixes`, now replaced by `prefix` +- annotator types 'custom', 'regexp', 'token_pattern', 'dd_token_pattern' and 'annotation_context', all replaced by setting the class directly as `annotator_type` + +### Removed +- automated coverage reporting on coveralls.io +- options `lowercase_lookup`, `lowercase_neg_lookup` for token patterns +- everything in `deduce.pattern`, patient patterns now replaced by `PatientNameAnnotator` +- `utils.any_in_text` + +### Fixed +- some small additions/removals for specific lookup lists +- smaller bugs related to overlapping matches + ## 2.5.0 (2023-11-28) ### Added - the `RegexpPseudoAnnotator` component for filtering regexp matches based on preceding/following words - a `prefix_with_interfix` pattern for names, detecting e.g. `Dr. 
van Loon` -### Fixed -- a bug with `BsnAnnotator` with non-digit characters in regexp - ### Changed - the age detection component, with improved logic and pseudo patterns - annotations are no longer counted adjacent when separated by a comma @@ -22,6 +54,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - extended the postbus pattern for `xx.xxx` format (old notation) - some smaller optimizations and exceptions for institution, hospital, placename, residence, medical term, first name, and last name lookup lists +### Fixed +- a bug in `BsnAnnotator` with non-digit characters in regexp + ## 2.4.2 (2023-11-22) ### Changed @@ -98,15 +133,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - detects year-month-day format in addition to (day-month-year) - loading a custom config now only replaces the config options that are explicitly set, using defaults for those not included in the custom config -### Fixed -- annotations can no longer be counted as adjacent when separated by newline or tab (and will thus not be merged) +### Deprecated +- backwards compatibility, which was temporarily added to transition from v1 to v2 ### Removed - a separate patient identifier tag, now superseded by a generic tag - detection of day/month combinations for dates, as this caused many false positives (e.g. lab values, numeric scores) -### Deprecated -- backwards compatibility, which was temporary added to transition from v1 to v2 +### Fixed +- annotations can no longer be counted as adjacent when separated by newline or tab (and will thus not be merged) ## 2.0.3 (2023-04-06) @@ -125,6 +160,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## 2.0.0 (2022-12-05) +### Added +- introduced new interface for deidentification, using `Deduce()` class +- a separate documentation page, with tutorial and migration guide +- support for python 3.10 and 3.11 + ### Changed - major refactor that touches pretty much every line of code - use `docdeid` package for logic @@ -134,12 +174,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - refactor annotators into separate classes, using structured annotations - guidelines for contributing -### Added -- introduced new interface for deidentification, using `Deduce()` class -- a separate documentation page, with tutorial and migration guide -- support for python 3.10 and 3.11 - - ### Removed - the `annotate_text` and `deidentify_annotations` functions - all in-text annotation (under the hood) and associated functions @@ -152,14 +186,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## 1.0.8 (2021-11-29) +### Added +- warn if there are any structured annotations whose annotated text does not match the original text in the span denoted by the structured annotation + ### Fixed - various modifications related to adding or subtracting spaces in annotated texts - remove the lowercasing of institutions' names - therefore, all structured annotations have texts matching the original text in the same span -### Added -- warn if there are any structured annotations whose annotated text does not match the original text in the span denoted by the structured annotation - ## 1.0.7 (2021-11-03) ### Changed diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 8c458052..2bcfc365 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -13,23 +13,29 @@ Before starting, some things to consider: * This project uses poetry for package 
management. Install it with ```pip install poetry``` * Setting up the environment is easy, just use ```poetry install``` * The makefile contains some useful commands when developing: - * `make test` runs the tests (including coverage) * `make format` formats the package code * `make lint` runs the linters (check the output) * `make clean` removes build/test artifacts, etc. * And for docs: * `make build-docs` builds the docs - * `make clean-docs` removes docs build + +## Running the tests + +```bash +pytest . +``` ## PR checklist * Verify that tests are passing * Verify that tests are updated/added according to changes * Run the formatters (`make format`) -* Run the linters (`make lint`) and check the output for anything preventable +* Run the linters (`make lint`) * Add a section to the changelog * Add a description to your PR +Following all of the steps above helps ensure a quick review and release of your contribution. + ## Releasing * Readthedocs has a webhook connected to pushes on the main branch. It will trigger and update automatically. * Create a [release on github](https://github.com/vmenger/docdeid/releases/new), create a tag with the right version, manually copy and paste from the changelog diff --git a/LICENSE.md b/LICENSE.md index fae84003..6dd967c1 100644 --- a/LICENSE.md +++ b/LICENSE.md @@ -1,166 +1,675 @@ # License - GNU LESSER GENERAL PUBLIC LICENSE + GNU GENERAL PUBLIC LICENSE Version 3, 29 June 2007 - Copyright (C) 2007 Free Software Foundation, Inc. + Copyright (C) 2007 Free Software Foundation, Inc. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. - - This version of the GNU Lesser General Public License incorporates -the terms and conditions of version 3 of the GNU General Public -License, supplemented by the additional permissions listed below. - - 0. Additional Definitions. - - As used herein, "this License" refers to version 3 of the GNU Lesser -General Public License, and the "GNU GPL" refers to version 3 of the GNU -General Public License. - - "The Library" refers to a covered work governed by this License, -other than an Application or a Combined Work as defined below. - - An "Application" is any work that makes use of an interface provided -by the Library, but which is not otherwise based on the Library. -Defining a subclass of a class defined by the Library is deemed a mode -of using an interface provided by the Library. - - A "Combined Work" is a work produced by combining or linking an -Application with the Library. The particular version of the Library -with which the Combined Work was made is also called the "Linked -Version". - - The "Minimal Corresponding Source" for a Combined Work means the -Corresponding Source for the Combined Work, excluding any source code -for portions of the Combined Work that, considered in isolation, are -based on the Application, and not on the Linked Version. - - The "Corresponding Application Code" for a Combined Work means the -object code and/or source code for the Application, including any data -and utility programs needed for reproducing the Combined Work from the -Application, but excluding the System Libraries of the Combined Work. - - 1. Exception to Section 3 of the GNU GPL. - - You may convey a covered work under sections 3 and 4 of this License -without being bound by section 3 of the GNU GPL. - - 2. Conveying Modified Versions. 
- - If you modify a copy of the Library, and, in your modifications, a -facility refers to a function or data to be supplied by an Application -that uses the facility (other than as an argument passed when the -facility is invoked), then you may convey a copy of the modified -version: - - a) under this License, provided that you make a good faith effort to - ensure that, in the event an Application does not supply the - function or data, the facility still operates, and performs - whatever part of its purpose remains meaningful, or - - b) under the GNU GPL, with none of the additional permissions of - this License applicable to that copy. - - 3. Object Code Incorporating Material from Library Header Files. - - The object code form of an Application may incorporate material from -a header file that is part of the Library. You may convey such object -code under terms of your choice, provided that, if the incorporated -material is not limited to numerical parameters, data structure -layouts and accessors, or small macros, inline functions and templates -(ten or fewer lines in length), you do both of the following: - - a) Give prominent notice with each copy of the object code that the - Library is used in it and that the Library and its use are - covered by this License. - - b) Accompany the object code with a copy of the GNU GPL and this license - document. - - 4. Combined Works. - - You may convey a Combined Work under terms of your choice that, -taken together, effectively do not restrict modification of the -portions of the Library contained in the Combined Work and reverse -engineering for debugging such modifications, if you also do each of -the following: - - a) Give prominent notice with each copy of the Combined Work that - the Library is used in it and that the Library and its use are - covered by this License. - - b) Accompany the Combined Work with a copy of the GNU GPL and this license - document. - - c) For a Combined Work that displays copyright notices during - execution, include the copyright notice for the Library among - these notices, as well as a reference directing the user to the - copies of the GNU GPL and this license document. - - d) Do one of the following: - - 0) Convey the Minimal Corresponding Source under the terms of this - License, and the Corresponding Application Code in a form - suitable for, and under terms that permit, the user to - recombine or relink the Application with a modified version of - the Linked Version to produce a modified Combined Work, in the - manner specified by section 6 of the GNU GPL for conveying - Corresponding Source. - - 1) Use a suitable shared library mechanism for linking with the - Library. A suitable mechanism is one that (a) uses at run time - a copy of the Library already present on the user's computer - system, and (b) will operate properly with a modified version - of the Library that is interface-compatible with the Linked - Version. - - e) Provide Installation Information, but only if you would otherwise - be required to provide such information under section 6 of the - GNU GPL, and only to the extent that such information is - necessary to install and execute a modified version of the - Combined Work produced by recombining or relinking the - Application with a modified version of the Linked Version. (If - you use option 4d0, the Installation Information must accompany - the Minimal Corresponding Source and Corresponding Application - Code. 
If you use option 4d1, you must provide the Installation - Information in the manner specified by section 6 of the GNU GPL - for conveying Corresponding Source.) - - 5. Combined Libraries. - - You may place library facilities that are a work based on the -Library side by side in a single library together with other library -facilities that are not Applications and are not covered by this -License, and convey such a combined library under terms of your -choice, if you do both of the following: - - a) Accompany the combined library with a copy of the same work based - on the Library, uncombined with any other library facilities, - conveyed under the terms of this License. - - b) Give prominent notice with the combined library that part of it - is a work based on the Library, and explaining where to find the - accompanying uncombined form of the same work. - - 6. Revised Versions of the GNU Lesser General Public License. - - The Free Software Foundation may publish revised and/or new versions -of the GNU Lesser General Public License from time to time. Such new -versions will be similar in spirit to the present version, but may -differ in detail to address new problems or concerns. - - Each version is given a distinguishing version number. If the -Library as you received it specifies that a certain numbered version -of the GNU Lesser General Public License "or any later version" -applies to it, you have the option of following the terms and -conditions either of that published version or of any later version -published by the Free Software Foundation. If the Library as you -received it does not specify a version number of the GNU Lesser -General Public License, you may choose any version of the GNU Lesser -General Public License ever published by the Free Software Foundation. - - If the Library as you received it specifies that a proxy can decide -whether future versions of the GNU Lesser General Public License shall -apply, that proxy's public statement of acceptance of any version is -permanent authorization for you to choose that version for the -Library. \ No newline at end of file + Preamble + + The GNU General Public License is a free, copyleft license for +software and other kinds of works. + + The licenses for most software and other practical works are designed +to take away your freedom to share and change the works. By contrast, +the GNU General Public License is intended to guarantee your freedom to +share and change all versions of a program--to make sure it remains free +software for all its users. We, the Free Software Foundation, use the +GNU General Public License for most of our software; it applies also to +any other work released this way by its authors. You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +them if you wish), that you receive source code or can get it if you +want it, that you can change the software or use pieces of it in new +free programs, and that you know you can do these things. + + To protect your rights, we need to prevent others from denying you +these rights or asking you to surrender the rights. Therefore, you have +certain responsibilities if you distribute copies of the software, or if +you modify it: responsibilities to respect the freedom of others. 
+ + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must pass on to the recipients the same +freedoms that you received. You must make sure that they, too, receive +or can get the source code. And you must show them these terms so they +know their rights. + + Developers that use the GNU GPL protect your rights with two steps: +(1) assert copyright on the software, and (2) offer you this License +giving you legal permission to copy, distribute and/or modify it. + + For the developers' and authors' protection, the GPL clearly explains +that there is no warranty for this free software. For both users' and +authors' sake, the GPL requires that modified versions be marked as +changed, so that their problems will not be attributed erroneously to +authors of previous versions. + + Some devices are designed to deny users access to install or run +modified versions of the software inside them, although the manufacturer +can do so. This is fundamentally incompatible with the aim of +protecting users' freedom to change the software. The systematic +pattern of such abuse occurs in the area of products for individuals to +use, which is precisely where it is most unacceptable. Therefore, we +have designed this version of the GPL to prohibit the practice for those +products. If such problems arise substantially in other domains, we +stand ready to extend this provision to those domains in future versions +of the GPL, as needed to protect the freedom of users. + + Finally, every program is threatened constantly by software patents. +States should not allow patents to restrict development and use of +software on general-purpose computers, but in those that do, we wish to +avoid the special danger that patents applied to a free program could +make it effectively proprietary. To prevent this, the GPL assures that +patents cannot be used to render the program non-free. + + The precise terms and conditions for copying, distribution and +modification follow. + + TERMS AND CONDITIONS + + 0. Definitions. + + "This License" refers to version 3 of the GNU General Public License. + + "Copyright" also means copyright-like laws that apply to other kinds of +works, such as semiconductor masks. + + "The Program" refers to any copyrightable work licensed under this +License. Each licensee is addressed as "you". "Licensees" and +"recipients" may be individuals or organizations. + + To "modify" a work means to copy from or adapt all or part of the work +in a fashion requiring copyright permission, other than the making of an +exact copy. The resulting work is called a "modified version" of the +earlier work or a work "based on" the earlier work. + + A "covered work" means either the unmodified Program or a work based +on the Program. + + To "propagate" a work means to do anything with it that, without +permission, would make you directly or secondarily liable for +infringement under applicable copyright law, except executing it on a +computer or modifying a private copy. Propagation includes copying, +distribution (with or without modification), making available to the +public, and in some countries other activities as well. + + To "convey" a work means any kind of propagation that enables other +parties to make or receive copies. Mere interaction with a user through +a computer network, with no transfer of a copy, is not conveying. 
+ + An interactive user interface displays "Appropriate Legal Notices" +to the extent that it includes a convenient and prominently visible +feature that (1) displays an appropriate copyright notice, and (2) +tells the user that there is no warranty for the work (except to the +extent that warranties are provided), that licensees may convey the +work under this License, and how to view a copy of this License. If +the interface presents a list of user commands or options, such as a +menu, a prominent item in the list meets this criterion. + + 1. Source Code. + + The "source code" for a work means the preferred form of the work +for making modifications to it. "Object code" means any non-source +form of a work. + + A "Standard Interface" means an interface that either is an official +standard defined by a recognized standards body, or, in the case of +interfaces specified for a particular programming language, one that +is widely used among developers working in that language. + + The "System Libraries" of an executable work include anything, other +than the work as a whole, that (a) is included in the normal form of +packaging a Major Component, but which is not part of that Major +Component, and (b) serves only to enable use of the work with that +Major Component, or to implement a Standard Interface for which an +implementation is available to the public in source code form. A +"Major Component", in this context, means a major essential component +(kernel, window system, and so on) of the specific operating system +(if any) on which the executable work runs, or a compiler used to +produce the work, or an object code interpreter used to run it. + + The "Corresponding Source" for a work in object code form means all +the source code needed to generate, install, and (for an executable +work) run the object code and to modify the work, including scripts to +control those activities. However, it does not include the work's +System Libraries, or general-purpose tools or generally available free +programs which are used unmodified in performing those activities but +which are not part of the work. For example, Corresponding Source +includes interface definition files associated with source files for +the work, and the source code for shared libraries and dynamically +linked subprograms that the work is specifically designed to require, +such as by intimate data communication or control flow between those +subprograms and other parts of the work. + + The Corresponding Source need not include anything that users +can regenerate automatically from other parts of the Corresponding +Source. + + The Corresponding Source for a work in source code form is that +same work. + + 2. Basic Permissions. + + All rights granted under this License are granted for the term of +copyright on the Program, and are irrevocable provided the stated +conditions are met. This License explicitly affirms your unlimited +permission to run the unmodified Program. The output from running a +covered work is covered by this License only if the output, given its +content, constitutes a covered work. This License acknowledges your +rights of fair use or other equivalent, as provided by copyright law. + + You may make, run and propagate covered works that you do not +convey, without conditions so long as your license otherwise remains +in force. 
You may convey covered works to others for the sole purpose +of having them make modifications exclusively for you, or provide you +with facilities for running those works, provided that you comply with +the terms of this License in conveying all material for which you do +not control copyright. Those thus making or running the covered works +for you must do so exclusively on your behalf, under your direction +and control, on terms that prohibit them from making any copies of +your copyrighted material outside their relationship with you. + + Conveying under any other circumstances is permitted solely under +the conditions stated below. Sublicensing is not allowed; section 10 +makes it unnecessary. + + 3. Protecting Users' Legal Rights From Anti-Circumvention Law. + + No covered work shall be deemed part of an effective technological +measure under any applicable law fulfilling obligations under article +11 of the WIPO copyright treaty adopted on 20 December 1996, or +similar laws prohibiting or restricting circumvention of such +measures. + + When you convey a covered work, you waive any legal power to forbid +circumvention of technological measures to the extent such circumvention +is effected by exercising rights under this License with respect to +the covered work, and you disclaim any intention to limit operation or +modification of the work as a means of enforcing, against the work's +users, your or third parties' legal rights to forbid circumvention of +technological measures. + + 4. Conveying Verbatim Copies. + + You may convey verbatim copies of the Program's source code as you +receive it, in any medium, provided that you conspicuously and +appropriately publish on each copy an appropriate copyright notice; +keep intact all notices stating that this License and any +non-permissive terms added in accord with section 7 apply to the code; +keep intact all notices of the absence of any warranty; and give all +recipients a copy of this License along with the Program. + + You may charge any price or no price for each copy that you convey, +and you may offer support or warranty protection for a fee. + + 5. Conveying Modified Source Versions. + + You may convey a work based on the Program, or the modifications to +produce it from the Program, in the form of source code under the +terms of section 4, provided that you also meet all of these conditions: + + a) The work must carry prominent notices stating that you modified + it, and giving a relevant date. + + b) The work must carry prominent notices stating that it is + released under this License and any conditions added under section + 7. This requirement modifies the requirement in section 4 to + "keep intact all notices". + + c) You must license the entire work, as a whole, under this + License to anyone who comes into possession of a copy. This + License will therefore apply, along with any applicable section 7 + additional terms, to the whole of the work, and all its parts, + regardless of how they are packaged. This License gives no + permission to license the work in any other way, but it does not + invalidate such permission if you have separately received it. + + d) If the work has interactive user interfaces, each must display + Appropriate Legal Notices; however, if the Program has interactive + interfaces that do not display Appropriate Legal Notices, your + work need not make them do so. 
+ + A compilation of a covered work with other separate and independent +works, which are not by their nature extensions of the covered work, +and which are not combined with it such as to form a larger program, +in or on a volume of a storage or distribution medium, is called an +"aggregate" if the compilation and its resulting copyright are not +used to limit the access or legal rights of the compilation's users +beyond what the individual works permit. Inclusion of a covered work +in an aggregate does not cause this License to apply to the other +parts of the aggregate. + + 6. Conveying Non-Source Forms. + + You may convey a covered work in object code form under the terms +of sections 4 and 5, provided that you also convey the +machine-readable Corresponding Source under the terms of this License, +in one of these ways: + + a) Convey the object code in, or embodied in, a physical product + (including a physical distribution medium), accompanied by the + Corresponding Source fixed on a durable physical medium + customarily used for software interchange. + + b) Convey the object code in, or embodied in, a physical product + (including a physical distribution medium), accompanied by a + written offer, valid for at least three years and valid for as + long as you offer spare parts or customer support for that product + model, to give anyone who possesses the object code either (1) a + copy of the Corresponding Source for all the software in the + product that is covered by this License, on a durable physical + medium customarily used for software interchange, for a price no + more than your reasonable cost of physically performing this + conveying of source, or (2) access to copy the + Corresponding Source from a network server at no charge. + + c) Convey individual copies of the object code with a copy of the + written offer to provide the Corresponding Source. This + alternative is allowed only occasionally and noncommercially, and + only if you received the object code with such an offer, in accord + with subsection 6b. + + d) Convey the object code by offering access from a designated + place (gratis or for a charge), and offer equivalent access to the + Corresponding Source in the same way through the same place at no + further charge. You need not require recipients to copy the + Corresponding Source along with the object code. If the place to + copy the object code is a network server, the Corresponding Source + may be on a different server (operated by you or a third party) + that supports equivalent copying facilities, provided you maintain + clear directions next to the object code saying where to find the + Corresponding Source. Regardless of what server hosts the + Corresponding Source, you remain obligated to ensure that it is + available for as long as needed to satisfy these requirements. + + e) Convey the object code using peer-to-peer transmission, provided + you inform other peers where the object code and Corresponding + Source of the work are being offered to the general public at no + charge under subsection 6d. + + A separable portion of the object code, whose source code is excluded +from the Corresponding Source as a System Library, need not be +included in conveying the object code work. + + A "User Product" is either (1) a "consumer product", which means any +tangible personal property which is normally used for personal, family, +or household purposes, or (2) anything designed or sold for incorporation +into a dwelling. 
In determining whether a product is a consumer product, +doubtful cases shall be resolved in favor of coverage. For a particular +product received by a particular user, "normally used" refers to a +typical or common use of that class of product, regardless of the status +of the particular user or of the way in which the particular user +actually uses, or expects or is expected to use, the product. A product +is a consumer product regardless of whether the product has substantial +commercial, industrial or non-consumer uses, unless such uses represent +the only significant mode of use of the product. + + "Installation Information" for a User Product means any methods, +procedures, authorization keys, or other information required to install +and execute modified versions of a covered work in that User Product from +a modified version of its Corresponding Source. The information must +suffice to ensure that the continued functioning of the modified object +code is in no case prevented or interfered with solely because +modification has been made. + + If you convey an object code work under this section in, or with, or +specifically for use in, a User Product, and the conveying occurs as +part of a transaction in which the right of possession and use of the +User Product is transferred to the recipient in perpetuity or for a +fixed term (regardless of how the transaction is characterized), the +Corresponding Source conveyed under this section must be accompanied +by the Installation Information. But this requirement does not apply +if neither you nor any third party retains the ability to install +modified object code on the User Product (for example, the work has +been installed in ROM). + + The requirement to provide Installation Information does not include a +requirement to continue to provide support service, warranty, or updates +for a work that has been modified or installed by the recipient, or for +the User Product in which it has been modified or installed. Access to a +network may be denied when the modification itself materially and +adversely affects the operation of the network or violates the rules and +protocols for communication across the network. + + Corresponding Source conveyed, and Installation Information provided, +in accord with this section must be in a format that is publicly +documented (and with an implementation available to the public in +source code form), and must require no special password or key for +unpacking, reading or copying. + + 7. Additional Terms. + + "Additional permissions" are terms that supplement the terms of this +License by making exceptions from one or more of its conditions. +Additional permissions that are applicable to the entire Program shall +be treated as though they were included in this License, to the extent +that they are valid under applicable law. If additional permissions +apply only to part of the Program, that part may be used separately +under those permissions, but the entire Program remains governed by +this License without regard to the additional permissions. + + When you convey a copy of a covered work, you may at your option +remove any additional permissions from that copy, or from any part of +it. (Additional permissions may be written to require their own +removal in certain cases when you modify the work.) You may place +additional permissions on material, added by you to a covered work, +for which you have or can give appropriate copyright permission. 
+ + Notwithstanding any other provision of this License, for material you +add to a covered work, you may (if authorized by the copyright holders of +that material) supplement the terms of this License with terms: + + a) Disclaiming warranty or limiting liability differently from the + terms of sections 15 and 16 of this License; or + + b) Requiring preservation of specified reasonable legal notices or + author attributions in that material or in the Appropriate Legal + Notices displayed by works containing it; or + + c) Prohibiting misrepresentation of the origin of that material, or + requiring that modified versions of such material be marked in + reasonable ways as different from the original version; or + + d) Limiting the use for publicity purposes of names of licensors or + authors of the material; or + + e) Declining to grant rights under trademark law for use of some + trade names, trademarks, or service marks; or + + f) Requiring indemnification of licensors and authors of that + material by anyone who conveys the material (or modified versions of + it) with contractual assumptions of liability to the recipient, for + any liability that these contractual assumptions directly impose on + those licensors and authors. + + All other non-permissive additional terms are considered "further +restrictions" within the meaning of section 10. If the Program as you +received it, or any part of it, contains a notice stating that it is +governed by this License along with a term that is a further +restriction, you may remove that term. If a license document contains +a further restriction but permits relicensing or conveying under this +License, you may add to a covered work material governed by the terms +of that license document, provided that the further restriction does +not survive such relicensing or conveying. + + If you add terms to a covered work in accord with this section, you +must place, in the relevant source files, a statement of the +additional terms that apply to those files, or a notice indicating +where to find the applicable terms. + + Additional terms, permissive or non-permissive, may be stated in the +form of a separately written license, or stated as exceptions; +the above requirements apply either way. + + 8. Termination. + + You may not propagate or modify a covered work except as expressly +provided under this License. Any attempt otherwise to propagate or +modify it is void, and will automatically terminate your rights under +this License (including any patent licenses granted under the third +paragraph of section 11). + + However, if you cease all violation of this License, then your +license from a particular copyright holder is reinstated (a) +provisionally, unless and until the copyright holder explicitly and +finally terminates your license, and (b) permanently, if the copyright +holder fails to notify you of the violation by some reasonable means +prior to 60 days after the cessation. + + Moreover, your license from a particular copyright holder is +reinstated permanently if the copyright holder notifies you of the +violation by some reasonable means, this is the first time you have +received notice of violation of this License (for any work) from that +copyright holder, and you cure the violation prior to 30 days after +your receipt of the notice. + + Termination of your rights under this section does not terminate the +licenses of parties who have received copies or rights from you under +this License. 
If your rights have been terminated and not permanently +reinstated, you do not qualify to receive new licenses for the same +material under section 10. + + 9. Acceptance Not Required for Having Copies. + + You are not required to accept this License in order to receive or +run a copy of the Program. Ancillary propagation of a covered work +occurring solely as a consequence of using peer-to-peer transmission +to receive a copy likewise does not require acceptance. However, +nothing other than this License grants you permission to propagate or +modify any covered work. These actions infringe copyright if you do +not accept this License. Therefore, by modifying or propagating a +covered work, you indicate your acceptance of this License to do so. + + 10. Automatic Licensing of Downstream Recipients. + + Each time you convey a covered work, the recipient automatically +receives a license from the original licensors, to run, modify and +propagate that work, subject to this License. You are not responsible +for enforcing compliance by third parties with this License. + + An "entity transaction" is a transaction transferring control of an +organization, or substantially all assets of one, or subdividing an +organization, or merging organizations. If propagation of a covered +work results from an entity transaction, each party to that +transaction who receives a copy of the work also receives whatever +licenses to the work the party's predecessor in interest had or could +give under the previous paragraph, plus a right to possession of the +Corresponding Source of the work from the predecessor in interest, if +the predecessor has it or can get it with reasonable efforts. + + You may not impose any further restrictions on the exercise of the +rights granted or affirmed under this License. For example, you may +not impose a license fee, royalty, or other charge for exercise of +rights granted under this License, and you may not initiate litigation +(including a cross-claim or counterclaim in a lawsuit) alleging that +any patent claim is infringed by making, using, selling, offering for +sale, or importing the Program or any portion of it. + + 11. Patents. + + A "contributor" is a copyright holder who authorizes use under this +License of the Program or a work on which the Program is based. The +work thus licensed is called the contributor's "contributor version". + + A contributor's "essential patent claims" are all patent claims +owned or controlled by the contributor, whether already acquired or +hereafter acquired, that would be infringed by some manner, permitted +by this License, of making, using, or selling its contributor version, +but do not include claims that would be infringed only as a +consequence of further modification of the contributor version. For +purposes of this definition, "control" includes the right to grant +patent sublicenses in a manner consistent with the requirements of +this License. + + Each contributor grants you a non-exclusive, worldwide, royalty-free +patent license under the contributor's essential patent claims, to +make, use, sell, offer for sale, import and otherwise run, modify and +propagate the contents of its contributor version. + + In the following three paragraphs, a "patent license" is any express +agreement or commitment, however denominated, not to enforce a patent +(such as an express permission to practice a patent or covenant not to +sue for patent infringement). 
To "grant" such a patent license to a +party means to make such an agreement or commitment not to enforce a +patent against the party. + + If you convey a covered work, knowingly relying on a patent license, +and the Corresponding Source of the work is not available for anyone +to copy, free of charge and under the terms of this License, through a +publicly available network server or other readily accessible means, +then you must either (1) cause the Corresponding Source to be so +available, or (2) arrange to deprive yourself of the benefit of the +patent license for this particular work, or (3) arrange, in a manner +consistent with the requirements of this License, to extend the patent +license to downstream recipients. "Knowingly relying" means you have +actual knowledge that, but for the patent license, your conveying the +covered work in a country, or your recipient's use of the covered work +in a country, would infringe one or more identifiable patents in that +country that you have reason to believe are valid. + + If, pursuant to or in connection with a single transaction or +arrangement, you convey, or propagate by procuring conveyance of, a +covered work, and grant a patent license to some of the parties +receiving the covered work authorizing them to use, propagate, modify +or convey a specific copy of the covered work, then the patent license +you grant is automatically extended to all recipients of the covered +work and works based on it. + + A patent license is "discriminatory" if it does not include within +the scope of its coverage, prohibits the exercise of, or is +conditioned on the non-exercise of one or more of the rights that are +specifically granted under this License. You may not convey a covered +work if you are a party to an arrangement with a third party that is +in the business of distributing software, under which you make payment +to the third party based on the extent of your activity of conveying +the work, and under which the third party grants, to any of the +parties who would receive the covered work from you, a discriminatory +patent license (a) in connection with copies of the covered work +conveyed by you (or copies made from those copies), or (b) primarily +for and in connection with specific products or compilations that +contain the covered work, unless you entered into that arrangement, +or that patent license was granted, prior to 28 March 2007. + + Nothing in this License shall be construed as excluding or limiting +any implied license or other defenses to infringement that may +otherwise be available to you under applicable patent law. + + 12. No Surrender of Others' Freedom. + + If conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot convey a +covered work so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you may +not convey it at all. For example, if you agree to terms that obligate you +to collect a royalty for further conveying from those to whom you convey +the Program, the only way you could satisfy both those terms and this +License would be to refrain entirely from conveying the Program. + + 13. Use with the GNU Affero General Public License. 
+ + Notwithstanding any other provision of this License, you have +permission to link or combine any covered work with a work licensed +under version 3 of the GNU Affero General Public License into a single +combined work, and to convey the resulting work. The terms of this +License will continue to apply to the part which is the covered work, +but the special requirements of the GNU Affero General Public License, +section 13, concerning interaction through a network will apply to the +combination as such. + + 14. Revised Versions of this License. + + The Free Software Foundation may publish revised and/or new versions of +the GNU General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + + Each version is given a distinguishing version number. If the +Program specifies that a certain numbered version of the GNU General +Public License "or any later version" applies to it, you have the +option of following the terms and conditions either of that numbered +version or of any later version published by the Free Software +Foundation. If the Program does not specify a version number of the +GNU General Public License, you may choose any version ever published +by the Free Software Foundation. + + If the Program specifies that a proxy can decide which future +versions of the GNU General Public License can be used, that proxy's +public statement of acceptance of a version permanently authorizes you +to choose that version for the Program. + + Later license versions may give you additional or different +permissions. However, no additional obligations are imposed on any +author or copyright holder as a result of your choosing to follow a +later version. + + 15. Disclaimer of Warranty. + + THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY +APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT +HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY +OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, +THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM +IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF +ALL NECESSARY SERVICING, REPAIR OR CORRECTION. + + 16. Limitation of Liability. + + IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS +THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY +GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE +USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF +DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD +PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), +EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF +SUCH DAMAGES. + + 17. Interpretation of Sections 15 and 16. + + If the disclaimer of warranty and limitation of liability provided +above cannot be given local legal effect according to their terms, +reviewing courts shall apply local law that most closely approximates +an absolute waiver of all civil liability in connection with the +Program, unless a warranty or assumption of liability accompanies a +copy of the Program in return for a fee. 
+ + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +state the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. + + + Copyright (C) + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 3 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see . + +Also add information on how to contact you by electronic and paper mail. + + If the program does terminal interaction, make it output a short +notice like this when it starts in an interactive mode: + + Copyright (C) + This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'. + This is free software, and you are welcome to redistribute it + under certain conditions; type `show c' for details. + +The hypothetical commands `show w' and `show c' should show the appropriate +parts of the General Public License. Of course, your program's commands +might be different; for a GUI interface, you would use an "about box". + + You should also get your employer (if you work as a programmer) or school, +if any, to sign a "copyright disclaimer" for the program, if necessary. +For more information on this, and how to apply and follow the GNU GPL, see +. + + The GNU General Public License does not permit incorporating your program +into proprietary programs. If your program is a subroutine library, you +may consider it more useful to permit linking proprietary applications with +the library. If this is what you want to do, use the GNU Lesser General +Public License instead of this License. But first, please read +. 
diff --git a/Makefile b/Makefile index f5e2f2b5..d45f63e6 100644 --- a/Makefile +++ b/Makefile @@ -10,6 +10,7 @@ lint: build-docs: sphinx-apidoc --module-first --force --templatedir=docs/templates -o docs/source/api deduce sphinx-build docs/source docs/_build/html -c docs/ + python docs/emojize.py docs/_build/html clean: rm -rf .coverage diff --git a/README.md b/README.md index a76ecfa8..160502e8 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,4 @@ -# deduce - [![tests](https://github.com/vmenger/deduce/actions/workflows/test.yml/badge.svg)](https://github.com/vmenger/deduce/actions/workflows/test.yml) -[![coverage](https://coveralls.io/repos/github/vmenger/deduce/badge.svg)](https://coveralls.io/github/vmenger/deduce?branch=master) [![build](https://github.com/vmenger/deduce/actions/workflows/build.yml/badge.svg)](https://github.com/vmenger/deduce/actions/workflows/build.yml) [![documentation](https://readthedocs.org/projects/deduce/badge/?version=latest)](https://deduce.readthedocs.io/en/latest/?badge=latest) ![pypi version](https://img.shields.io/pypi/v/deduce) @@ -10,50 +7,48 @@ ![license](https://img.shields.io/github/license/vmenger/deduce) [![black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) -[Installation](#installation) - [Versions](#versions) - [Getting Started](#getting-started) - [Documentation](#documentation) - [Contributiong](#contributing) - [Authors](#authors) - [License](#license) +# deduce + +> Deduce 3.0.0 is out! It is way more accurate, and faster too. It's fully backward compatible, but some functionality is scheduled for removal; read more about it here: [docs/migrating-to-v3](https://deduce.readthedocs.io/en/latest/migrating.html) -> Deduce 2.0.0 has been released! It includes a 10x speedup, and way more features for customizing and tailoring. Some small changes are needed to keep going from version 1, read more about it here: [docs/migrating-to-v2](https://deduce.readthedocs.io/en/latest/migrating.html) +* :sparkles: Remove sensitive information from clinical text written in Dutch +* :mag: Rule-based logic for detecting e.g. names, locations, institutions, identifiers, phone numbers +* :triangular_ruler: Useful out of the box, but customization highly recommended +* :seedling: Originally validated in [Menger et al. (2017)](http://www.sciencedirect.com/science/article/pii/S0736585316307365), but further optimized since + +> :exclamation: Deduce is useful out of the box, but please validate and customize on your own data before using it in a critical environment. Remember that de-identification is almost never perfect, and that clinical text often contains other specific details that can link it to a specific person. Be aware that de-identification should primarily be viewed as a way to mitigate the risk of identification, rather than a way to obtain anonymous data. -De-identify clinial text written in Dutch using `deduce`, a rule-based de-identification method for Dutch clinical text. +Currently, `deduce` can remove the following types of Protected Health Information (PHI): -The development, principles and validation of `deduce` were initially described in [Menger et al. (2017)](http://www.sciencedirect.com/science/article/pii/S0736585316307365). De-identification of clinical text is needed for using text data for analysis, to comply with legal requirements and to protect the privacy of patients. 
By default, our rule-based method removes Protected Health Information (PHI) in the following categories: +* :bust_in_silhouette: person names, including prefixes and initials +* :earth_americas: geographical locations smaller than a country +* :hospital: names of hospitals and healthcare institutions +* :calendar: dates (combinations of day, month and year) +* :birthday: ages +* :1234: BSN numbers +* :1234: identifiers (7+ digits without a specific format, e.g. patient identifiers, AGB, BIG) +* :phone: phone numbers +* :e-mail: e-mail addresses +* :link: URLs -* Person names, including initials -* Geographical locations smaller than a country -* Names of institutions that are related to patient treatment -* Dates (combinations of day, month and year) -* Ages -* BSN numbers -* Identifiers (7+ digits without a specific format, e.g. patient identifiers, AGB, BIG) -* Telephone numbers -* E-mail addresses -* URLs +## Citing If you use `deduce`, please cite the following paper: [Menger, V.J., Scheepers, F., van Wijk, L.M., Spruit, M. (2017). DEDUCE: A pattern matching method for automatic de-identification of Dutch medical text, Telematics and Informatics, 2017, ISSN 0736-5853](http://www.sciencedirect.com/science/article/pii/S0736585316307365) + + + + ## Installation ``` python pip install deduce ``` -## Versions - -For most cases the latest version is suitable, but some specific milestones are: - -* `2.0.0` - Major refactor, with speedups, many new options for customizing, functionally very similar to original -* `1.0.8` - Small bugfixes compared to original release -* `1.0.1` - Original release with [Menger et al. (2017)](http://www.sciencedirect.com/science/article/pii/S0736585316307365) - -Detailed versioning information is accessible in the [changelog](CHANGELOG.md). - - - - ## Getting started The basic way to use `deduce`, is to pass text to the `deidentify` method of a `Deduce` object: @@ -95,12 +90,12 @@ AnnotationSet({ print(doc.deidentified_text) -"""betreft: , bsn , patnr . De is jaar oud en woonachtig in -. Hij werd op door arts ontslagen van de kliniek van het . -Voor nazorg kan hij worden bereikt via of .""" +"""betreft: [PERSOON-1], bsn [BSN-1], patnr [ID-1]. De [PERSOON-1] is [LEEFTIJD-1] jaar oud en woonachtig in +[LOCATIE-1]. Hij werd op [DATUM-1] door arts [PERSOON-2] ontslagen van de kliniek van het [INSTELLING-1]. +Voor nazorg kan hij worden bereikt via [EMAIL-1] of [TELEFOONNUMMER-1].""" ``` -Aditionally, if the names of the patient are known, they may be added as `metadata`, where they will be picked up by `deduce`: +Additionally, if the names of the patient are known, they may be added as `metadata`, where they will be picked up by `deduce`: ```python from deduce.person import Person @@ -110,20 +105,29 @@ doc = deduce.deidentify(text, metadata={'patient': patient}) print (doc.deidentified_text) -"""betreft: , bsn , patnr . De is jaar oud en woonachtig in -. Hij werd op door arts ontslagen van de kliniek van het . -Voor nazorg kan hij worden bereikt via of .""" +"""betreft: [PATIENT], bsn [BSN-1], patnr [ID-1]. De [PATIENT] is [LEEFTIJD-1] jaar oud en woonachtig in +[LOCATIE-1]. Hij werd op [DATUM-1] door arts [PERSOON-2] ontslagen van de kliniek van het [INSTELLING-1]. +Voor nazorg kan hij worden bereikt via [EMAIL-1] of [TELEFOONNUMMER-1].""" ``` -As you can see, adding known names keeps references to `` in text. It also increases recall, as not all known names are contained in the lookup lists. 
+As you can see, adding known names keeps references to `[PATIENT]` in text. It also increases recall, as not all known names are contained in the lookup lists. -## Documentation +## Versions + +For most cases the latest version is suitable, but some specific milestones are: + +* `3.0.0` - Many optimizations in accuracy, smaller refactors, further speedups +* `2.0.0` - Major refactor, with speedups, many new options for customizing, functionally very similar to original +* `1.0.8` - Small bugfixes compared to original release +* `1.0.1` - Original release with [Menger et al. (2017)](http://www.sciencedirect.com/science/article/pii/S0736585316307365) + +Detailed versioning information is accessible in the [changelog](CHANGELOG.md). -A more extensive tutorial on using, configuring and modifying `deduce` is available at: [docs/tutorial](https://deduce.readthedocs.io/en/latest/tutorial.html) +## Documentation -Basic documentation and API are available at: [docs](https://deduce.readthedocs.io/en/latest/) +All documentation, including a more extensive tutorial on using, configuring and modifying `deduce`, and its API, is available at: [docs/tutorial](https://deduce.readthedocs.io/en/latest/) ## Contributing @@ -137,4 +141,4 @@ For setting up the dev environment and contributing guidelines, see: [docs/contr ## License -This project is licensed under the GNU LGPLv3 license - see the [LICENSE.md](LICENSE.md) file for details \ No newline at end of file +This project is licensed under the GNU General Public License v3.0 - see the [LICENSE.md](LICENSE.md) file for details \ No newline at end of file diff --git a/config.json b/base_config.json similarity index 65% rename from config.json rename to base_config.json index f3ca3bd8..d0f37871 100644 --- a/config.json +++ b/base_config.json @@ -10,23 +10,23 @@ false ] }, - "redactor_open_char": "<", - "redactor_close_char": ">", + "redactor_open_char": "[", + "redactor_close_char": "]", "annotators": { "prefix_with_initial": { - "annotator_type": "token_pattern", + "annotator_type": "deduce.annotator.TokenPatternAnnotator", "group": "names", "args": { "tag": "prefix+initiaal", "skip": ["."], "pattern": [ { - "lookup": "prefixes" + "lookup": "prefix" }, { "or": [ { - "is_initial": true + "lookup": "initial" }, { "is_initials": true @@ -37,17 +37,17 @@ } }, "prefix_with_interfix": { - "annotator_type": "token_pattern", + "annotator_type": "deduce.annotator.TokenPatternAnnotator", "group": "names", "args": { "tag": "prefix+interfix+naam", "skip": ["."], "pattern": [ { - "lookup": "prefixes" + "lookup": "prefix" }, { - "lookup": "interfixes" + "lookup": "interfix" }, { "like_name": true @@ -56,14 +56,14 @@ } }, "prefix_with_name": { - "annotator_type": "token_pattern", + "annotator_type": "deduce.annotator.TokenPatternAnnotator", "group": "names", "args": { "tag": "prefix+naam", "skip": ["."], "pattern": [ { - "lookup": "prefixes" + "lookup": "prefix" }, { "and": [ @@ -79,19 +79,19 @@ } }, "interfix_with_name": { - "annotator_type": "token_pattern", + "annotator_type": "deduce.annotator.TokenPatternAnnotator", "group": "names", "args": { "tag": "interfix+achternaam", "skip": [], "pattern": [ { - "lookup": "interfixes" + "lookup": "interfix" }, { "and": [ { - "lookup": "interfix_surnames" + "lookup": "interfix_surname" }, { "neg_lookup": "whitelist" @@ -101,15 +101,15 @@ ] } }, - "initial_with_capital": { - "annotator_type": "token_pattern", + "initial_with_name": { + "annotator_type": "deduce.annotator.TokenPatternAnnotator", "group": "names", "args": { "tag": 
"initiaal+naam", "skip": ["."], "pattern": [ { - "is_initial": true + "lookup": "initial" }, { "and": [ @@ -120,7 +120,7 @@ "neg_lookup": "whitelist" }, { - "neg_lookup": "prefixes" + "neg_lookup": "prefix" } ] } @@ -128,17 +128,17 @@ } }, "initial_interfix": { - "annotator_type": "token_pattern", + "annotator_type": "deduce.annotator.TokenPatternAnnotator", "group": "names", "args": { "tag": "initiaal+interfix+naam", "skip": ["."], "pattern": [ { - "is_initial": true + "lookup": "initial" }, { - "lookup": "interfixes" + "lookup": "interfix" }, { "like_name": true @@ -147,67 +147,32 @@ } }, "first_name_lookup": { - "annotator_type": "multi_token", + "annotator_type": "docdeid.process.MultiTokenLookupAnnotator", "group": "names", "args": { "tag": "voornaam", - "lookup_values": "first_names" + "overlapping": true, + "lookup_values": "first_name" } }, "surname_lookup": { - "annotator_type": "multi_token", + "annotator_type": "docdeid.process.MultiTokenLookupAnnotator", "group": "names", "args": { "tag": "achternaam", - "lookup_values": "surnames" + "overlapping": true, + "lookup_values": "surname" } }, - "person_first_name": { - "annotator_type": "dd_token_pattern", + "patient_name": { + "annotator_type": "deduce.annotator.PatientNameAnnotator", "group": "names", "args": { - "pattern": { - "module": "deduce.pattern", - "class": "PersonFirstNamePattern", - "tag": "voornaam_patient" - } - } - }, - "person_initial_from_name": { - "annotator_type": "dd_token_pattern", - "group": "names", - "args": { - "pattern": { - "module": "deduce.pattern", - "class": "PersonInitialFromNamePattern", - "tag": "initiaal_patient" - } - } - }, - "person_initials": { - "annotator_type": "dd_token_pattern", - "group": "names", - "args": { - "pattern": { - "module": "deduce.pattern", - "class": "PersonInitialsPattern", - "tag": "achternaam_patient" - } - } - }, - "person_surname": { - "annotator_type": "dd_token_pattern", - "group": "names", - "args": { - "pattern": { - "module": "deduce.pattern", - "class": "PersonSurnamePattern", - "tag": "achternaam_patient" - } + "tag": "_" } }, "name_context": { - "annotator_type": "annotation_context", + "annotator_type": "deduce.annotator.ContextAnnotator", "group": "names", "args": { "iterative": true, @@ -217,13 +182,17 @@ "direction": "right", "pre_tag": [ "initiaal", - "naam" + "naam", + "voornaam", + "achternaam", + "voornaam_patient", + "achternaam_patient" ], "tag": "{tag}+interfix+achternaam", "skip": [".", "-"], "pattern": [ { - "lookup": "interfixes" + "lookup": "interfix" }, { "like_name": true @@ -236,13 +205,17 @@ "pre_tag": [ "initiaal", "naam", + "voornaam", + "achternaam", + "voornaam_patient", + "achternaam_patient", "interfix" ], "tag": "initiaal+{tag}", "skip": ["."], "pattern": [ { - "is_initial": true + "lookup": "initial" } ] }, @@ -250,7 +223,11 @@ "name": "naam_left", "direction": "left", "pre_tag": [ - "naam" + "naam", + "voornaam", + "achternaam", + "voornaam_patient", + "achternaam_patient" ], "tag": "naam+{tag}", "skip": ["-"], @@ -264,7 +241,7 @@ "neg_lookup": "whitelist" }, { - "neg_lookup": "prefixes" + "neg_lookup": "prefix" } ] } @@ -276,8 +253,12 @@ "pre_tag": [ "prefix", "initiaal", - "interfix", - "naam" + "naam", + "voornaam", + "achternaam", + "voornaam_patient", + "achternaam_patient", + "interfix" ], "tag": "{tag}+naam", "skip": ["-"], @@ -291,7 +272,7 @@ "neg_lookup": "whitelist" }, { - "neg_lookup": "prefixes" + "neg_lookup": "prefix" } ] } @@ -303,8 +284,12 @@ "pre_tag": [ "prefix", "initiaal", - "interfix", - "naam" + "naam", + 
"voornaam", + "achternaam", + "voornaam_patient", + "achternaam_patient", + "interfix" ], "tag": "prefix+{tag}", "skip": ["."], @@ -312,7 +297,7 @@ { "and": [ { - "lookup": "prefixes" + "lookup": "prefix" } ] } @@ -321,16 +306,26 @@ ] } }, + "eponymous_disease": { + "annotator_type": "docdeid.process.MultiTokenLookupAnnotator", + "group": "names", + "args": { + "lookup_values": "eponymous_disease", + "tag": "pseudo_name", + "overlapping": true + } + }, "placename": { - "annotator_type": "multi_token", + "annotator_type": "docdeid.process.MultiTokenLookupAnnotator", "group": "locations", "args": { - "lookup_values": "placenames", + "lookup_values": "placename", + "overlapping": true, "tag": "locatie" } }, "street_pattern": { - "annotator_type": "token_pattern", + "annotator_type": "deduce.annotator.TokenPatternAnnotator", "group": "locations", "args": { "pattern": [ @@ -343,16 +338,17 @@ } }, "street_lookup": { - "annotator_type": "multi_token", + "annotator_type": "docdeid.process.MultiTokenLookupAnnotator", "group": "locations", "args": { - "lookup_values": "streets", + "lookup_values": "street", + "overlapping": true, "tag": "straat", "priority": 1 } }, "housenumber": { - "annotator_type": "annotation_context", + "annotator_type": "deduce.annotator.ContextAnnotator", "group": "locations", "args": { "iterative": true, @@ -403,7 +399,7 @@ } }, "postal_code": { - "annotator_type": "regexp", + "annotator_type": "docdeid.process.RegexpAnnotator", "group": "locations", "args": { "regexp_pattern": "(\\d{4}([A-Za-z]{2}| [A-Z]{2}))(?[-/\\. ])([1-9]|0[1-9]|1[012])(?P=sep)((19|20|\\'|`)?\\d{2}))(?!\\d)", @@ -445,16 +444,17 @@ } }, "date_dmy_2": { - "annotator_type": "regexp", + "annotator_type": "docdeid.process.RegexpAnnotator", "group": "dates", "args": { "regexp_pattern": "(?i)(?[-/\\. ])([1-9]|0[1-9]|1[012])(?P=sep)([1-9]|0[1-9]|[12][0-9]|3[01]))(\\D|$)", @@ -463,33 +463,31 @@ } }, "date_ymd_2": { - "annotator_type": "regexp", + "annotator_type": "docdeid.process.RegexpAnnotator", "group": "dates", "args": { "regexp_pattern": "(?i)(? bool: """ @@ -59,18 +59,27 @@ class PersonAnnotationConverter(dd.process.AnnotationProcessor): Responsible for processing the annotations produced by all name annotators (regular and context-based). - Resolves overlap between them, and then maps the tags to either "patient" or - "persoon", based on whether "patient" is in the tag (e.g. voornaam_patient => - patient, achternaam_onbekend => persoon). + Any overlap with annotations that are contain "pseudo" in their tag are removed, as + are those annotations. Then resolves overlap between remaining annotations, and maps + the tags to either "patient" or "persoon", based on whether "patient" is in the tag + (e.g. voornaam_patient => patient, achternaam_onbekend => persoon). 
""" def __init__(self) -> None: + def map_tag_to_prio(tag: str) -> int: + if "pseudo" in tag: + return 0 + if "patient" in tag: + return 1 + + return 2 + self._overlap_resolver = dd.process.OverlapResolver( - sort_by=["tag", "length"], - sort_by_callbacks={ - "tag": lambda x: "patient" not in x, - "length": lambda x: -x, - }, + sort_by=("tag", "length"), + sort_by_callbacks=frozendict( + tag=map_tag_to_prio, + length=lambda x: -x, + ), ) def process_annotations( @@ -88,6 +97,7 @@ def process_annotations( tag="patient" if "patient" in annotation.tag else "persoon", ) for annotation in new_annotations + if ("pseudo" not in annotation.tag and len(annotation.text.strip()) != 0) ) diff --git a/deduce/annotator.py b/deduce/annotator.py index 651c5429..9fdcf9a3 100644 --- a/deduce/annotator.py +++ b/deduce/annotator.py @@ -1,11 +1,16 @@ +"""Contains components for annotating.""" + import re +import warnings from typing import Literal, Optional import docdeid as dd -from docdeid import Annotation, Document +from docdeid import Annotation, Document, Tokenizer from docdeid.process import RegexpAnnotator -import deduce.utils +from deduce.utils import str_match + +warnings.simplefilter(action="default") _DIRECTION_MAP = { "left": { @@ -50,6 +55,13 @@ def match(cls, pattern_position: dict, **kwargs) -> bool: # pylint: disable=R09 if func == "re_match": return re.match(value, kwargs.get("token").text) is not None if func == "is_initial": + + warnings.warn( + "is_initial matcher pattern is deprecated and will be removed " + "in a future version", + DeprecationWarning, + ) + return ( ( len(kwargs.get("token").text) == 1 @@ -72,10 +84,6 @@ def match(cls, pattern_position: dict, **kwargs) -> bool: # pylint: disable=R09 return kwargs.get("token").text in kwargs.get("ds")[value] if func == "neg_lookup": return kwargs.get("token").text not in kwargs.get("ds")[value] - if func == "lowercase_lookup": - return kwargs.get("token").text.lower() in kwargs.get("ds")[value] - if func == "lowercase_neg_lookup": - return kwargs.get("token").text.lower() not in kwargs.get("ds")[value] if func == "and": return all( _PatternPositionMatcher.match(pattern_position=x, **kwargs) @@ -115,6 +123,27 @@ def __init__( self.ds = ds self.skip = set(skip or []) + self._start_words = None + self._matching_pipeline = None + + if len(self.pattern) > 0 and "lookup" in self.pattern[0]: + + if self.ds is None: + raise RuntimeError( + "Created pattern with lookup in TokenPatternAnnotator, but " + "no lookup structures provided." + ) + + lookup_list = self.ds[self.pattern[0]["lookup"]] + + if not isinstance(lookup_list, dd.ds.LookupSet): + raise ValueError( + f"Expected a LookupSet, but got a " f"{type(lookup_list)}." + ) + + self._start_words = lookup_list.items() + self._matching_pipeline = lookup_list.matching_pipeline + super().__init__(*args, **kwargs) @staticmethod @@ -131,9 +160,9 @@ def _get_chained_token( def _match_sequence( # pylint: disable=R0913 self, - doc: Document, + text: str, pattern: list[dict], - start_token: dd.tokenize.Token, + start_token: dd.tokenizer.Token, direction: Literal["left", "right"] = "right", skip: Optional[set[str]] = None, ) -> Optional[dd.Annotation]: @@ -141,7 +170,7 @@ def _match_sequence( # pylint: disable=R0913 Sequentially match a pattern against a specified start_token. Args: - doc: The document that is being processed. + text: The original document text. pattern: The pattern to match. start_token: The start token to match. direction: The direction to match, choice of "left" or "right". 
@@ -173,7 +202,7 @@ def _match_sequence( # pylint: disable=R0913 ) return dd.Annotation( - text=doc.text[start_token.start_char : end_token.end_char], + text=text[start_token.start_char : end_token.end_char], start_char=start_token.start_char, end_char=end_token.end_char, tag=self.tag, @@ -193,16 +222,26 @@ def annotate(self, doc: dd.Document) -> list[dd.Annotation]: A list of Annotation. """ - return [ - annotation - for token in doc.get_tokens() - if ( - annotation := self._match_sequence( - doc, self.pattern, token, direction="right", skip=self.skip - ) + annotations = [] + + tokens = doc.get_tokens() + + if self._start_words is not None: + tokens = tokens.token_lookup( + lookup_values=self._start_words, + matching_pipeline=self._matching_pipeline, + ) + + for token in tokens: + + annotation = self._match_sequence( + doc.text, self.pattern, token, direction="right", skip=self.skip ) - is not None - ] + + if annotation is not None: + annotations.append(annotation) + + return annotations class ContextAnnotator(TokenPatternAnnotator): @@ -210,28 +249,35 @@ class ContextAnnotator(TokenPatternAnnotator): Extends existing annotations to the left or right, based on specified patterns. Args: - iterative: Whether the extension process should recurse, or stop after one + ds: Any datastructures, that can be used for lookup or other logic + iterative: Whether the extension process should repeat, or stop after one iteration. """ - def __init__(self, *args, iterative: bool = True, **kwargs) -> None: + def __init__( + self, + *args, + ds: Optional[dd.ds.DsCollection] = None, + iterative: bool = True, + **kwargs, + ) -> None: self.iterative = iterative - super().__init__(*args, **kwargs, tag="_") + super().__init__(*args, **kwargs, ds=ds, tag="_") def _apply_context_pattern( - self, doc: dd.Document, annotations: dd.AnnotationSet, context_pattern: dict + self, text: str, annotations: dd.AnnotationSet, context_pattern: dict ) -> dd.AnnotationSet: - new_annotations = dd.AnnotationSet() + direction = context_pattern["direction"] + skip = set(context_pattern.get("skip", [])) + + for annotation in annotations.copy(): - for annotation in annotations: tag = list(_DIRECTION_MAP[direction]["order"](annotation.tag.split("+")))[ -1 ] - skip = set(context_pattern.get("skip", [])) - if not deduce.utils.any_in_text(context_pattern["pre_tag"], tag): - new_annotations.add(annotation) + if tag not in context_pattern["pre_tag"]: continue attr = _DIRECTION_MAP[direction]["attr"] @@ -239,7 +285,7 @@ def _apply_context_pattern( _DIRECTION_MAP[direction]["start_token"](annotation), attr, skip ) new_annotation = self._match_sequence( - doc, + text, context_pattern["pattern"], start_token, direction=direction, @@ -251,32 +297,51 @@ def _apply_context_pattern( (annotation, new_annotation) ) - new_annotations.add( - dd.Annotation( - text=doc.text[left_ann.start_char : right_ann.end_char], - start_char=left_ann.start_char, - end_char=right_ann.end_char, - start_token=left_ann.start_token, - end_token=right_ann.end_token, - tag=context_pattern["tag"].format(tag=annotation.tag), - priority=annotation.priority, - ) + merged_annotation = dd.Annotation( + text=text[left_ann.start_char : right_ann.end_char], + start_char=left_ann.start_char, + end_char=right_ann.end_char, + start_token=left_ann.start_token, + end_token=right_ann.end_token, + tag=context_pattern["tag"].format(tag=annotation.tag), + priority=annotation.priority, ) - else: - new_annotations.add(annotation) - return new_annotations + annotations.remove(annotation) + 
annotations.add(merged_annotation) + + return annotations + + def _annotate(self, text: str, annotations: dd.AnnotationSet) -> dd.AnnotationSet: + """ + Does the annotation, by calling _apply_context_pattern, and then optionally + recursing. Also keeps track of the (un)changed annotations, so they are not + repeatedly processed. + + Args: + text: The input text. + annotations: The input annotations. + + Returns: + An extended set of annotations, based on the patterns provided. + """ - def _annotate( - self, doc: dd.Document, annotations: list[dd.Annotation] - ) -> list[dd.Annotation]: - original_annotations = annotations + original_annotations = annotations.copy() for context_pattern in self.pattern: - annotations = self._apply_context_pattern(doc, annotations, context_pattern) + annotations = self._apply_context_pattern( + text, annotations, context_pattern + ) + + if self.iterative: + + changed = dd.AnnotationSet(annotations.difference(original_annotations)) + annotations = dd.AnnotationSet( + annotations.intersection(original_annotations) + ) - if self.iterative and (annotations != original_annotations): - annotations = self._annotate(doc, annotations) + if changed: + annotations.update(self._annotate(text, changed)) return annotations @@ -291,18 +356,173 @@ def annotate(self, doc: dd.Document) -> list[dd.Annotation]: An empty list, as annotations are modified and not added. """ - doc.annotations = self._annotate(doc, list(doc.annotations)) + doc.annotations = self._annotate(doc.text, doc.annotations) return [] +class PatientNameAnnotator(dd.process.Annotator): + """ + Annotates patient names, based on information present in document metadata. This + class implements logic for detecting first name(s), initials and surnames. + + Args: + tokenizer: A tokenizer, that is used for breaking up the patient surname + into multiple tokens. 
+ """ + + def __init__(self, tokenizer: Tokenizer, *args, **kwargs) -> None: + + self.tokenizer = tokenizer + self.skip = [".", "-", " "] + + super().__init__(*args, **kwargs) + + @staticmethod + def _match_first_names( + doc: dd.Document, token: dd.Token + ) -> Optional[tuple[dd.Token, dd.Token]]: + + for first_name in doc.metadata["patient"].first_names: + + if str_match(token.text, first_name) or ( + len(token.text) > 3 + and str_match(token.text, first_name, max_edit_distance=1) + ): + return token, token + + return None + + @staticmethod + def _match_initial_from_name( + doc: dd.Document, token: dd.Token + ) -> Optional[tuple[dd.Token, dd.Token]]: + + for _, first_name in enumerate(doc.metadata["patient"].first_names): + if str_match(token.text, first_name[0]): + next_token = token.next() + + if (next_token is not None) and str_match(next_token.text, "."): + return token, next_token + + return token, token + + return None + + @staticmethod + def _match_initials( + doc: dd.Document, token: dd.Token + ) -> Optional[tuple[dd.Token, dd.Token]]: + + if str_match(token.text, doc.metadata["patient"].initials): + return token, token + + return None + + def next_with_skip(self, token: dd.Token) -> Optional[dd.Token]: + """Find the next token, while skipping certain punctuation.""" + + while True: + token = token.next() + + if (token is None) or (token not in self.skip): + break + + return token + + def _match_surname( + self, doc: dd.Document, token: dd.Token + ) -> Optional[tuple[dd.Token, dd.Token]]: + + if doc.metadata["surname_pattern"] is None: + doc.metadata["surname_pattern"] = self.tokenizer.tokenize( + doc.metadata["patient"].surname + ) + + surname_pattern = doc.metadata["surname_pattern"] + + surname_token = surname_pattern[0] + start_token = token + + while True: + if not str_match(surname_token.text, token.text, max_edit_distance=1): + return None + + match_end_token = token + + surname_token = self.next_with_skip(surname_token) + token = self.next_with_skip(token) + + if surname_token is None: + return start_token, match_end_token # end of pattern + + if token is None: + return None # end of tokens + + def annotate(self, doc: Document) -> list[Annotation]: + """ + Annotates the document, based on the patient metadata. + + Args: + doc: The input document. + + Returns: A document with any relevant Annotations added. + """ + + if doc.metadata is None or doc.metadata["patient"] is None: + return [] + + matcher_to_attr = { + self._match_first_names: ("first_names", "voornaam_patient"), + self._match_initial_from_name: ("first_names", "initiaal_patient"), + self._match_initials: ("initials", "initiaal_patient"), + self._match_surname: ("surname", "achternaam_patient"), + } + + matchers = [] + patient_metadata = doc.metadata["patient"] + + for matcher, (attr, tag) in matcher_to_attr.items(): + if getattr(patient_metadata, attr) is not None: + matchers.append((matcher, tag)) + + annotations = [] + + for token in doc.get_tokens(): + + for matcher, tag in matchers: + + match = matcher(doc, token) + + if match is None: + continue + + start_token, end_token = match + + annotations.append( + dd.Annotation( + text=doc.text[start_token.start_char : end_token.end_char], + start_char=start_token.start_char, + end_char=end_token.end_char, + tag=tag, + priority=self.priority, + start_token=start_token, + end_token=end_token, + ) + ) + + return annotations + + class RegexpPseudoAnnotator(RegexpAnnotator): """ Regexp annotator that filters out matches preceded or followed by certain terms. 
- Currently matches on sequential alhpa characters preceding or following the match. + Currently matches on sequential alpha characters preceding or following the match. + This annotator does not depend on any tokenizer. - pre_pseudo: A list of strings that invalidate a match when preceding it - post_pseudo: A list of strings that invalidate a match when following it - lowercase: Whether to match lowercase + Args: + pre_pseudo: A list of strings that invalidate a match when preceding it + post_pseudo: A list of strings that invalidate a match when following it + lowercase: Whether to match lowercase """ def __init__( @@ -406,7 +626,17 @@ def _validate_match(self, match: re.Match, doc: Document) -> bool: class BsnAnnotator(dd.process.Annotator): - """Annotates BSN nummers.""" + """ + Annotates Burgerservicenummer (BSN), according to the elfproef logic. + See also: https://nl.wikipedia.org/wiki/Burgerservicenummer + + Args: + bsn_regexp: A regexp to match potential BSN numbers. The simplest form could be + 9-digit numbers, but matches with periods or other punctuation can also be + accepted. Any non-digit characters are removed from the match before + the elfproef is applied. + capture_group: The regexp capture group to consider. + """ def __init__( self, bsn_regexp: str, *args, capture_group: int = 0, **kwargs @@ -454,7 +684,15 @@ def annotate(self, doc: Document) -> list[Annotation]: class PhoneNumberAnnotator(dd.process.Annotator): - """Annotates phone numbers.""" + """ + Annotates phone numbers, based on a regexp and min and max number of digits. + Additionally employs some logic like detecting parentheses and hyphens. + + Args: + phone_regexp: The regexp to detect phone numbers. + min_digits: The minimum number of digits that need to be present. + max_digits: The maximum number of digits that need to be present. 
+ """ def __init__( self, diff --git a/deduce/data/__init__.py b/deduce/data/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/deduce/data/lookup/cache/__init__.py b/deduce/data/lookup/cache/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/deduce/data/lookup/src/__init__.py b/deduce/data/lookup/src/__init__.py new file mode 100644 index 00000000..ae186018 --- /dev/null +++ b/deduce/data/lookup/src/__init__.py @@ -0,0 +1,17 @@ +all_lists = [ + "institutions/lst_healthcare_institution", + "institutions/lst_hospital", + "institutions/lst_hospital_abbr", + "locations/lst_placename", + "locations/lst_street", + "names/lst_first_name", + "names/lst_initial", + "names/lst_interfix", + "names/lst_interfix_surname", + "names/lst_prefix", + "names/lst_surname", + "whitelist/lst_common_word", + "whitelist/lst_eponymous_disease", + "whitelist/lst_medical_term", + "whitelist/lst_stop_word", +] diff --git a/deduce-data/lookup_lists/institutions/healthcare_institution_exceptions.txt b/deduce/data/lookup/src/institutions/lst_healthcare_institution/exceptions.txt similarity index 100% rename from deduce-data/lookup_lists/institutions/healthcare_institution_exceptions.txt rename to deduce/data/lookup/src/institutions/lst_healthcare_institution/exceptions.txt diff --git a/deduce-data/lookup_lists/institutions/healthcare_institutions.txt b/deduce/data/lookup/src/institutions/lst_healthcare_institution/items.txt similarity index 100% rename from deduce-data/lookup_lists/institutions/healthcare_institutions.txt rename to deduce/data/lookup/src/institutions/lst_healthcare_institution/items.txt diff --git a/deduce/data/lookup/src/institutions/lst_healthcare_institution/transform.json b/deduce/data/lookup/src/institutions/lst_healthcare_institution/transform.json new file mode 100644 index 00000000..a56ed996 --- /dev/null +++ b/deduce/data/lookup/src/institutions/lst_healthcare_institution/transform.json @@ -0,0 +1,51 @@ +{ + "transforms": { + "instelling": { + "Huisartsenpraktijk": [ + "Huisartsenpraktijk", + "huisartsenpraktijk", + "Huisartspraktijk", + "huisartspraktijk" + ] + }, + "prefix": { + "\\bDe\\b": [ + "De", + "de" + ] + }, + "punct": { + "\\.": [ + ".", + "" + ], + "-": [ + "-", + "", + " " + ], + " & ": [ + " & ", + " en " + ] + }, + "spell": { + "y": [ + "y", + "ij" + ], + "Y": [ + "Y", + "IJ" + ], + "ij": [ + "ij", + "y" + ], + "IJ": [ + "IJ", + "Y" + ] + } + } +} \ No newline at end of file diff --git a/deduce-data/lookup_lists/institutions/hospitals.txt b/deduce/data/lookup/src/institutions/lst_hospital/items.txt similarity index 99% rename from deduce-data/lookup_lists/institutions/hospitals.txt rename to deduce/data/lookup/src/institutions/lst_hospital/items.txt index 41338f1e..e5dc7978 100644 --- a/deduce-data/lookup_lists/institutions/hospitals.txt +++ b/deduce/data/lookup/src/institutions/lst_hospital/items.txt @@ -156,7 +156,6 @@ Lichtenberg Ziekenhuis Liduina Ziekenhuis Liefdehuis Lievensberg -Lievensberg Lievensberg Ziekenhuis Lorentz Lorentz Ziekenhuis @@ -366,4 +365,4 @@ Zuyderland Medisch Centrum het Catharina Gasthuis het Lange Land het Lange Land Ziekenhuis -mProve +mProve \ No newline at end of file diff --git a/deduce/data/lookup/src/institutions/lst_hospital/transform.json b/deduce/data/lookup/src/institutions/lst_hospital/transform.json new file mode 100644 index 00000000..dca99a2b --- /dev/null +++ b/deduce/data/lookup/src/institutions/lst_hospital/transform.json @@ -0,0 +1,101 @@ +{ + "transforms": { + "zkh": { + " 
(Ziekenhuis|Gasthuis|Kliniek)": [ + " Ziekenhuis", + " Ziekenhuizen", + " Zkh", + " Zkh.", + " Gasthuis", + " Kliniek", + " Klinieken", + " ziekenhuis", + " ziekenhuizen", + " zkh", + " zkh.", + " gasthuis", + " kliniek", + " klinieken", + "ziekenhuis", + "ziekenhuizen", + "zkh", + "zkh.", + "gasthuis", + "kliniek", + "klinieken" + ], + "^(Ziekenhuis|Gasthuis|Kliniek)": [ + "Ziekenhuis", + "Zkh", + "Zkh.", + "Gasthuis", + "Kliniek", + "ziekenhuis", + "zkh", + "zkh.", + "gasthuis", + "kliniek" + ], + "Medisch Centrum": [ + "Medisch Centrum", + "MC" + ] + }, + "zkh_2": { + "Universitair Medisch Centrum": [ + "Universitair Medisch Centrum", + "UMC" + ] + }, + "prefix": { + "\\bhet\\b": [ + "Het", + "het", + "'T", + "'t", + "`T", + "`t", + "T", + "t", + "" + ], + "\\bSint\\b": [ + "Sint", + "sint", + "St.", + "st.", + "st", + "" + ] + }, + "punct": { + "\\.": [ + ".", + "" + ], + "-": [ + "-", + "", + " " + ] + }, + "spelling": { + "y": [ + "y", + "ij" + ], + "Y": [ + "Y", + "IJ" + ], + "ij": [ + "ij", + "y" + ], + "IJ": [ + "IJ", + "Y" + ] + } + } +} \ No newline at end of file diff --git a/deduce-data/lookup_lists/institutions/hospital_abbr.txt b/deduce/data/lookup/src/institutions/lst_hospital_abbr/items.txt similarity index 100% rename from deduce-data/lookup_lists/institutions/hospital_abbr.txt rename to deduce/data/lookup/src/institutions/lst_hospital_abbr/items.txt diff --git a/deduce-data/lookup_lists/locations/municipalities.txt b/deduce/data/lookup/src/locations/lst_placename/lst_municipality/items.txt similarity index 100% rename from deduce-data/lookup_lists/locations/municipalities.txt rename to deduce/data/lookup/src/locations/lst_placename/lst_municipality/items.txt diff --git a/deduce-data/lookup_lists/locations/provinces.txt b/deduce/data/lookup/src/locations/lst_placename/lst_province/items.txt similarity index 100% rename from deduce-data/lookup_lists/locations/provinces.txt rename to deduce/data/lookup/src/locations/lst_placename/lst_province/items.txt diff --git a/deduce-data/lookup_lists/locations/regions.txt b/deduce/data/lookup/src/locations/lst_placename/lst_region/items.txt similarity index 100% rename from deduce-data/lookup_lists/locations/regions.txt rename to deduce/data/lookup/src/locations/lst_placename/lst_region/items.txt diff --git a/deduce-data/lookup_lists/locations/residence_exceptions.txt b/deduce/data/lookup/src/locations/lst_placename/lst_residence/exceptions.txt similarity index 98% rename from deduce-data/lookup_lists/locations/residence_exceptions.txt rename to deduce/data/lookup/src/locations/lst_placename/lst_residence/exceptions.txt index f893c64d..971cf20f 100644 --- a/deduce-data/lookup_lists/locations/residence_exceptions.txt +++ b/deduce/data/lookup/src/locations/lst_placename/lst_residence/exceptions.txt @@ -18,6 +18,7 @@ Gooi Groet Hall Heenweg +Holt Hub Ie Juist diff --git a/deduce-data/lookup_lists/locations/residences.txt b/deduce/data/lookup/src/locations/lst_placename/lst_residence/items.txt similarity index 100% rename from deduce-data/lookup_lists/locations/residences.txt rename to deduce/data/lookup/src/locations/lst_placename/lst_residence/items.txt diff --git a/deduce/data/lookup/src/locations/lst_placename/transform.json b/deduce/data/lookup/src/locations/lst_placename/transform.json new file mode 100644 index 00000000..70239c9d --- /dev/null +++ b/deduce/data/lookup/src/locations/lst_placename/transform.json @@ -0,0 +1,189 @@ +{ + "transforms": { + "prefix": { + "\\bhet\\b": [ + "Het", + "het", + "'T", + "'t", + "`T", + "`t", + 
"T", + "t" + ], + "\\bSint\\b": [ + "Sint", + "sint", + "St.", + "st." + ], + "\\bit\\b": [ + "It", + "it", + "Het", + "het", + "'T", + "'t", + "`T", + "`t", + "T", + "t" + ] + }, + "prop": { + "(\\b|^)Aan\\b": [ + "Aan", + "aan" + ], + "(\\b|^)Bij\\b": [ + "Bij", + "bij" + ], + "(\\b|^)De\\b": [ + "De", + "de" + ], + "(\\b|^)Den\\b": [ + "Den", + "den" + ], + "(\\b|^)En\\b": [ + "En", + "en" + ], + "(\\b|^)Het\\b": [ + "Het", + "het", + "'T", + "'t", + "`T", + "`t", + "T", + "t" + ], + "(\\b|^)In\\b": [ + "In", + "in" + ], + "(\\b|^)Oan\\b": [ + "Oan", + "oan" + ], + "(\\b|^)Of\\b": [ + "Of", + "of" + ], + "(\\b|^)Op\\b": [ + "Op", + "op" + ], + "(\\b|^)Over\\b": [ + "Over", + "over" + ], + "(\\b|^)'S\\b": [ + "'S", + "'s" + ], + "(\\b|^)Ter\\b": [ + "Ter", + "ter" + ], + "(\\b|^)Van\\b": [ + "Van", + "van", + "v.", + "V." + ] + }, + "province": { + "(?<=\\()Fr(?=\\))": [ + "Fr", + "FR", + "Frl", + "FRL", + "F" + ], + "(?<=\\()Gr(?=\\))": [ + "Gr", + "GR", + "Gn", + "GN", + "G" + ], + "(?<=\\()Dr(?=\\))": [ + "Dr", + "DR", + "Dn", + "DN", + "D" + ], + "(?<=\\()Ov(?=\\))": [ + "Ov", + "OV", + "O" + ], + "(?<=\\()Nh(?=\\))": [ + "Nh", + "NH" + ], + "(?<=\\()Ut(?=\\))": [ + "Ut", + "UT", + "U" + ], + "(?<=\\()Gld(?=\\))": [ + "Gld", + "GLD", + "G" + ], + "(?<=\\()Li(?=\\))": [ + "Li", + "LI", + "L" + ], + "(?<=\\()Nb(?=\\))": [ + "Nb", + "NB" + ], + "(?<=\\()Zh(?=\\))": [ + "Zh", + "ZH" + ], + "(?<=\\()Ze(?=\\))": [ + "Ze", + "ZE", + "Z" + ] + }, + "punct": { + "\\.": [ + ".", + "" + ], + "-": [ + "-", + "", + " " + ] + }, + "spell": { + "y": [ + "y", + "ij" + ], + "Y": [ + "Y", + "IJ" + ], + "ij": [ + "ij", + "y" + ], + "IJ": [ + "IJ", + "Y" + ] + } + } +} \ No newline at end of file diff --git a/deduce-data/lookup_lists/locations/streets/street_exceptions.txt b/deduce/data/lookup/src/locations/lst_street/exceptions.txt similarity index 76% rename from deduce-data/lookup_lists/locations/streets/street_exceptions.txt rename to deduce/data/lookup/src/locations/lst_street/exceptions.txt index c2821122..16bcf647 100644 --- a/deduce-data/lookup_lists/locations/streets/street_exceptions.txt +++ b/deduce/data/lookup/src/locations/lst_street/exceptions.txt @@ -1,4 +1,3 @@ -April Augustus Balie Berg @@ -6,35 +5,26 @@ Binnen Bosch Broek Centrum -December Eind Elementen Euro -Februari Gemini Generaal Groep -Januari -Juli -Juni Kamer Kent Laat Maar -Maart Maat Mark -Mei Middag Morgen Noord November -Oktober Oost Plaats Postbus Segment -September Standaard Start Volume diff --git a/deduce-data/lookup_lists/locations/streets/streets_manual.txt b/deduce/data/lookup/src/locations/lst_street/items.txt similarity index 99% rename from deduce-data/lookup_lists/locations/streets/streets_manual.txt rename to deduce/data/lookup/src/locations/lst_street/items.txt index 2e9053b4..d3cafbd3 100644 --- a/deduce-data/lookup_lists/locations/streets/streets_manual.txt +++ b/deduce/data/lookup/src/locations/lst_street/items.txt @@ -104,7 +104,6 @@ 11e Laan 12 Septemberstraat 1213-Laan -1213-Laan 126e Wingweg 12e Laan 12e Septemberlaan @@ -125,7 +124,6 @@ 18e Wyk 18e Laan 1940-45-Laan -1940-45-Laan 19e Wyk 1e Achterholtsweg 1e Achterstraat @@ -354,7 +352,6 @@ 24 Bunder 24 Oktoberplein 24e Laan -24e Laan 25 Juni Straat 25 Junistraat 28 Oktoberstraat @@ -61345,7 +61342,6 @@ Karel Simmelinkstraat Karel Slabbaertstraat Karel V Laan Karel V Straat -Karel V Laan Karel V-Straat Karel V-Laan Karel Wiersmastraat @@ -66702,7 +66698,6 @@ Koning Willem Alexanderstraat Koning Willem I Straat Koning Willem I-Laan Koning Willem I-Park 
-Koning Willem I-Laan Koning Willem II Plein Koning Willem II Straat Koning Willem II-Straat @@ -135732,10 +135727,8 @@ Willem II-Laan Willem II-passage Willem III Laan Willem III Straat -Willem III Laan Willem III-Laan Willem III-Straat -Willem III-Laan Willem Idenburglaan Willem Idenburgpad Willem J. Kolfflaan diff --git a/deduce-data/lookup_lists/locations/streets/streets_bag.txt b/deduce/data/lookup/src/locations/lst_street/streets_bag.txt similarity index 100% rename from deduce-data/lookup_lists/locations/streets/streets_bag.txt rename to deduce/data/lookup/src/locations/lst_street/streets_bag.txt diff --git a/deduce/data/lookup/src/locations/lst_street/transform.json b/deduce/data/lookup/src/locations/lst_street/transform.json new file mode 100644 index 00000000..44cca14c --- /dev/null +++ b/deduce/data/lookup/src/locations/lst_street/transform.json @@ -0,0 +1,712 @@ +{ + "transforms": { + "prefix": { + "\\bAbraham\\b": [ + "Abraham", + "Abr.", + "abr." + ], + "\\bAdmiraal\\b": [ + "Admiraal", + "Adm.", + "adm." + ], + "\\bAlbert\\b": [ + "Albert", + "Alb.", + "alb." + ], + "\\bBurgemeester\\b": [ + "Burgemeester", + "Burg.", + "burg." + ], + "\\bChris\\b": [ + "Chris", + "Chr.", + "chr." + ], + "\\bCommissaris\\b": [ + "Commissaris", + "Comm.", + "comm." + ], + "\\bDominee\\b": [ + "Dominee", + "Ds.", + "ds." + ], + "\\bDoctor\\b": [ + "Doctor", + "Dr.", + "dr." + ], + "\\bDokter\\b": [ + "Dokter", + "Dr.", + "dr." + ], + "\\bDoctorandus\\b": [ + "Doctorandus", + "Drs.", + "drs." + ], + "\\bFamilie\\b": [ + "Familie", + "Fam.", + "fam." + ], + "\\bGebroeders\\b": [ + "Gebroeders", + "Gebr.", + "gebr.", + "Gebrs.", + "gebrs." + ], + "\\bGeneraal\\b": [ + "Generaal", + "Gen.", + "gen." + ], + "\\bHertog\\b": [ + "Hertog", + "Hert.", + "hert." + ], + "\\bIngenieur\\b": [ + "Ingenieur", + "Ir.", + "ir.", + "Ing.", + "ing." + ], + "\\bJacobus\\b": [ + "Jacobus", + "Jac.", + "jac." + ], + "\\bJacob\\b": [ + "Jacobus", + "Jac.", + "jac." + ], + "\\bJacqueline\\b": [ + "Jacqueline", + "Jacq.", + "jacq." + ], + "\\bJonkhkeer\\b": [ + "Jonkhkeer", + "Jhr.", + "jhr." + ], + "\\bJonkvrouw\\b": [ + "Jonkvrouw", + "Jkvr.", + "jkvr." + ], + "\\bJohan\\b": [ + "Johan", + "Joh.", + "joh." + ], + "\\bKardinaal\\b": [ + "Kardinaal", + "Kard.", + "kard." + ], + "\\bKolonel\\b": [ + "Kolonel", + "Kol.", + "kol." + ], + "\\bKoningin\\b": [ + "Koningin", + "Kon.", + "kon." + ], + "\\bKoning\\b": [ + "Koning", + "Kon.", + "kon." + ], + "\\bMajoor\\b": [ + "Majoor", + "Maj.", + "maj." + ], + "\\bMevrouw\\b": [ + "Mevrouw", + "Mevr.", + "mevr." + ], + "\\bMinister\\b": [ + "Minister", + "Min.", + "min." + ], + "\\bMeester\\b": [ + "Meester", + "Mr.", + "mr." + ], + "\\bMonseigneur\\b": [ + "Monseigneur", + "Mgr.", + "mgr." + ], + "\\bPrinses\\b": [ + "Prinses", + "Pr.", + "pr." + ], + "\\bProfessor\\b": [ + "Professor", + "Prof.", + "prof." + ], + "\\bRector\\b": [ + "Rector", + "Rect.", + "rect." + ], + "\\bSecretaris\\b": [ + "Secretaris", + "Secr.", + "secr." + ], + "\\bSenior\\b": [ + "Senior", + "Sr.", + "sr." + ], + "\\bSint\\b": [ + "Sint", + "sint", + "St.", + "st." + ], + "\\bTheo\\b": [ + "Theo", + "Th.", + "th." + ], + "\\bVeldmaarschalk\\b": [ + "Veldmaarschalk", + "Veldm.", + "Veldm" + ], + "\\bVicaris\\b": [ + "Vicaris", + "Vic.", + "vic." + ], + "\\bZuster\\b": [ + "Zuster", + "Zr.", + "zr." 
+ ] + }, + "prop": { + "\\baan\\b": [ + "Aan", + "aan" + ], + "\\bachter\\b": [ + "Achter", + "achter" + ], + "\\band\\b": [ + "And", + "and" + ], + "\\bbie\\b": [ + "Bie", + "bie" + ], + "\\bbij\\b": [ + "Bij", + "bij" + ], + "\\bbinnenzijde\\b": [ + "Binnenzijde", + "binnenzijde", + "BZ", + "Bz", + "bz" + ], + "\\bbuitenzijde\\b": [ + "Buitenzijde", + "buitenzijde", + "BZ", + "Bz", + "bz" + ], + "\\bda\\b": [ + "Da", + "da" + ], + "\\bde\\b": [ + "De", + "de" + ], + "\\bdel\\b": [ + "Del", + "del" + ], + "\\bden\\b": [ + "Den", + "den" + ], + "\\bder\\b": [ + "Der", + "der" + ], + "\\bdes\\b": [ + "Des", + "des" + ], + "\\bdi\\b": [ + "Di", + "di" + ], + "\\bdie\\b": [ + "Die", + "die" + ], + "\\bdoor\\b": [ + "Door", + "door" + ], + "\\bdu\\b": [ + "Du", + "du" + ], + "\\bein\\b": [ + "Ein", + "ein" + ], + "\\ben\\b": [ + "En", + "en" + ], + "\\bfan\\b": [ + "Fan", + "fan" + ], + "\\bge\\b": [ + "Ge", + "ge" + ], + "\\bgen\\b": [ + "Gen", + "gen" + ], + "\\bhet\\b": [ + "Het", + "het", + "'T", + "'t", + "`T", + "`t", + "T", + "t" + ], + "\\bin\\b": [ + "In", + "in" + ], + "\\bis\\b": [ + "Is", + "is" + ], + "\\bit\\b": [ + "It", + "it", + "Het", + "het", + "'T", + "'t", + "`T", + "`t", + "T", + "t" + ], + "\\bla\\b": [ + "La", + "la" + ], + "\\blangs\\b": [ + "Langs", + "langs" + ], + "\\ble\\b": [ + "Le", + "le" + ], + "\\bnaar\\b": [ + "Naar", + "naar" + ], + "\\bnabij\\b": [ + "Nabij", + "nabij" + ], + "\\boan\\b": [ + "Oan", + "oan" + ], + "\\bof\\b": [ + "Of", + "of" + ], + "\\bom\\b": [ + "Om", + "om" + ], + "\\bonder\\b": [ + "Onder", + "onder" + ], + "\\bop\\b": [ + "Op", + "op" + ], + "\\bover\\b": [ + "Over", + "over" + ], + "\\bsur\\b": [ + "Sur", + "sur" + ], + "\\bte\\b": [ + "Te", + "te" + ], + "\\bten\\b": [ + "Ten", + "ten" + ], + "\\bter\\b": [ + "Ter", + "ter" + ], + "\\btot\\b": [ + "Tot", + "tot" + ], + "\\btusschen\\b": [ + "Tusschen", + "tusschen" + ], + "\\btussen\\b": [ + "Tussen", + "tussen" + ], + "\\but\\b": [ + "Ut", + "ut" + ], + "\\buten\\b": [ + "Uten", + "uten" + ], + "\\bvan\\b": [ + "Van", + "van", + "v.", + "V." 
+ ], + "\\bvon\\b": [ + "Von", + "von" + ], + "\\bvoor\\b": [ + "Voor", + "voor" + ] + }, + "windrichting": { + "\\bNoord$": [ + "Noord", + "noord", + "N" + ], + "\\bOost$": [ + "Oost", + "oost", + "O" + ], + "\\bZuid$": [ + "Zuid", + "zuid", + "Z" + ], + "\\bWest$": [ + "West", + "west", + "W" + ], + "NZ$": [ + "N.Z.", + "N.z.", + "n.z.", + "Noordzijde", + "noordzijde", + "" + ], + "OZ$": [ + "O.Z.", + "O.z.", + "o.z.", + "Oostzijde", + "oostzijde", + "" + ], + "ZZ$": [ + "Z.Z.", + "Z.z.", + "z.z.", + "Zuidzijde", + "zuidzijde", + "" + ], + "WZ$": [ + "W.Z.", + "W.z.", + "w.z.", + "Westzijde", + "westzijde", + "" + ], + "NO$": [ + "N.O.", + "N.o.", + "n.o.", + "" + ], + "NW$": [ + "N.W.", + "N.w.", + "n.w.", + "" + ], + "ZO$": [ + "Z.O.", + "Z.o.", + "z.o.", + "" + ], + "ZW$": [ + "Z.W.", + "Z.w.", + "z.w.", + "" + ] + }, + "suffix": { + "dreef$": [ + "dreef", + "drf" + ], + "gracht$": [ + "gracht", + "gr" + ], + "hof$": [ + "hof", + "hf" + ], + "laan$": [ + "laan", + "ln" + ], + "markt$": [ + "markt", + "mrkt" + ], + "pad$": [ + "pad", + "pd" + ], + "park$": [ + "park", + "prk" + ], + "plantsoen$": [ + "plantsoen", + "plnts", + "pltsn" + ], + "plein$": [ + "plein", + "pln" + ], + "singel$": [ + "singel", + "sngl" + ], + "steeg$": [ + "steeg", + "stg", + "st" + ], + "straat$": [ + "straat", + "str" + ], + "weg$": [ + "weg", + "wg" + ] + }, + "loc": { + "\\bAcker\\b": [ + "Acker", + "acker" + ], + "\\bAkker\\b": [ + "Akker", + "akker" + ], + "\\bBoulevard\\b": [ + "Boulevard", + "boulevard" + ], + "\\bDijk\\b": [ + "Dijk", + "dijk" + ], + "\\bDreef\\b": [ + "Dreef", + "dreef" + ], + "\\bDwarsweg\\b": [ + "Dwarsweg", + "dwarsweg" + ], + "\\bDyk\\b": [ + "Dyk", + "dyk" + ], + "\\bErf\\b": [ + "Erf", + "erf" + ], + "\\bHeide\\b": [ + "Heide", + "heide" + ], + "\\bHof\\b": [ + "Hof", + "hof" + ], + "\\bKade\\b": [ + "Kade", + "kade" + ], + "\\bKanaal\\b": [ + "Kanaal", + "kanaal" + ], + "\\bLaan\\b": [ + "Laan", + "laan" + ], + "\\bPad\\b": [ + "Pad", + "pad" + ], + "\\bPark\\b": [ + "Park", + "park" + ], + "\\bPlantsoen\\b": [ + "Plantsoen", + "plantsoen" + ], + "\\bPlein\\b": [ + "Plein", + "plein" + ], + "\\bReed\\b": [ + "Reed", + "reed" + ], + "\\bRotonde\\b": [ + "Rotonde", + "rotonde" + ], + "\\bSloot\\b": [ + "Sloot", + "sloot" + ], + "\\bSluis\\b": [ + "Sluis", + "sluis" + ], + "\\bSteeg\\b": [ + "Steeg", + "steeg" + ], + "\\bStraat\\b": [ + "Straat", + "straat" + ], + "\\bTunnel\\b": [ + "Tunnel", + "tunnel" + ], + "\\bWal\\b": [ + "Wal", + "wal" + ], + "\\bWeg\\b": [ + "Weg", + "weg" + ], + "\\bWei\\b": [ + "Wei", + "wei" + ], + "\\bWijk\\b": [ + "Wijk", + "wijk" + ], + "\\bVen\\b": [ + "Ven", + "ven" + ] + }, + "punct": { + "\\.": [ + ".", + "" + ], + "-": [ + "-", + "", + " " + ] + }, + "spelling": { + "y": [ + "y", + "ij" + ], + "Y": [ + "Y", + "IJ" + ], + "ij": [ + "ij", + "y" + ], + "IJ": [ + "IJ", + "Y" + ] + } + } +} \ No newline at end of file diff --git a/deduce-data/lookup_lists/names/first_name_exceptions.txt b/deduce/data/lookup/src/names/lst_first_name/exceptions.txt similarity index 100% rename from deduce-data/lookup_lists/names/first_name_exceptions.txt rename to deduce/data/lookup/src/names/lst_first_name/exceptions.txt diff --git a/deduce-data/lookup_lists/names/first_names.txt b/deduce/data/lookup/src/names/lst_first_name/items.txt similarity index 100% rename from deduce-data/lookup_lists/names/first_names.txt rename to deduce/data/lookup/src/names/lst_first_name/items.txt diff --git a/deduce/data/lookup/src/names/lst_initial/items.txt 
b/deduce/data/lookup/src/names/lst_initial/items.txt new file mode 100644 index 00000000..95a181a1 --- /dev/null +++ b/deduce/data/lookup/src/names/lst_initial/items.txt @@ -0,0 +1,54 @@ +A +B +C +Ch +Chr +D +E +F +G +H +I +J +K +L +M +N +O +P +Ph +Q +R +S +T +Th +U +V +W +X +Y +Z +À +Á +Â +Ã +Ä +Å +Ç +È +É +Ê +Ë +Ì +Í +Î +Ï +Ñ +Ó +Ô +Õ +Ö +Ø +Ù +Ü +Š \ No newline at end of file diff --git a/deduce-data/lookup_lists/names/interfixes.txt b/deduce/data/lookup/src/names/lst_interfix/items.txt similarity index 100% rename from deduce-data/lookup_lists/names/interfixes.txt rename to deduce/data/lookup/src/names/lst_interfix/items.txt diff --git a/deduce-data/lookup_lists/names/interfix_surname_exceptions.txt b/deduce/data/lookup/src/names/lst_interfix_surname/exceptions.txt similarity index 100% rename from deduce-data/lookup_lists/names/interfix_surname_exceptions.txt rename to deduce/data/lookup/src/names/lst_interfix_surname/exceptions.txt diff --git a/deduce-data/lookup_lists/names/interfix_surnames.txt b/deduce/data/lookup/src/names/lst_interfix_surname/items.txt similarity index 100% rename from deduce-data/lookup_lists/names/interfix_surnames.txt rename to deduce/data/lookup/src/names/lst_interfix_surname/items.txt diff --git a/deduce-data/lookup_lists/names/prefixes.txt b/deduce/data/lookup/src/names/lst_prefix/items.txt similarity index 100% rename from deduce-data/lookup_lists/names/prefixes.txt rename to deduce/data/lookup/src/names/lst_prefix/items.txt diff --git a/deduce-data/lookup_lists/names/surname_exceptions.txt b/deduce/data/lookup/src/names/lst_surname/exceptions.txt similarity index 100% rename from deduce-data/lookup_lists/names/surname_exceptions.txt rename to deduce/data/lookup/src/names/lst_surname/exceptions.txt diff --git a/deduce-data/lookup_lists/names/surnames.txt b/deduce/data/lookup/src/names/lst_surname/items.txt similarity index 100% rename from deduce-data/lookup_lists/names/surnames.txt rename to deduce/data/lookup/src/names/lst_surname/items.txt diff --git a/deduce/data/lookup/src/whitelist/lst_common_word/exceptions.txt b/deduce/data/lookup/src/whitelist/lst_common_word/exceptions.txt new file mode 100644 index 00000000..ac5cef46 --- /dev/null +++ b/deduce/data/lookup/src/whitelist/lst_common_word/exceptions.txt @@ -0,0 +1,31 @@ +bel +best +boos +broer +brood +buren +dag +dik +donker +fijn +goed +groot +helder +hemel +huis +jong +jongen +kaas +klein +komen +kort +los +min +oost +papa +piek +snel +vader +vrijdag +wit +zondag \ No newline at end of file diff --git a/deduce-data/lookup_lists/top_1000_terms.txt b/deduce/data/lookup/src/whitelist/lst_common_word/items.txt similarity index 99% rename from deduce-data/lookup_lists/top_1000_terms.txt rename to deduce/data/lookup/src/whitelist/lst_common_word/items.txt index bec09390..2cc6a83e 100644 --- a/deduce-data/lookup_lists/top_1000_terms.txt +++ b/deduce/data/lookup/src/whitelist/lst_common_word/items.txt @@ -185,7 +185,6 @@ drinken drogen dromen droog -droog druk dubbel duits @@ -271,7 +270,6 @@ gevaar gevaarlijk gevangenis geven -geven gevolg gewicht gewoon @@ -308,7 +306,6 @@ hal halen half hallo -hallo hamer hand hard @@ -500,7 +497,6 @@ maken makkelijk mama man -man mand manier map @@ -522,7 +518,6 @@ mes met meubel mevrouw -mevrouw middel midden mij @@ -627,7 +622,6 @@ opnemen oranje orde oud -oud ouder over overeenkomen @@ -948,7 +942,6 @@ vrijheid vroeg vroeger vrouw -vrouw vrouwe vullen vuur diff --git a/deduce/data/lookup/src/whitelist/lst_eponymous_disease/items.txt 
b/deduce/data/lookup/src/whitelist/lst_eponymous_disease/items.txt new file mode 100644 index 00000000..f8c8724a --- /dev/null +++ b/deduce/data/lookup/src/whitelist/lst_eponymous_disease/items.txt @@ -0,0 +1,226 @@ +Aarskog-Scott +Aase-Smith +Abdallat-Davis-Farrage +Abderhalden-Kaufmann-Lignac +Abderhalden-Lignac-Kaufmann +Achard-Thiers +Adams-Oliver +Adams-Stokes +Adson-Caffey +Ahumada-Del Castillo +Aicardi-Goutières +Albers-Schönberg +Albright-Butler-Bloomberg +Albright-Hadorn +Alibert-Bazin +Alice in Wonderland +Alpers-Huttenlocher +Andersen-Tawil +Anderson-Fabry +Anton-Babinski +Aran-Duchenne +Arnold-Chiari +Ayerza-Arrillaga +Babinski-Froment +Babinski-Fröhlich +Babinski-Nageotte +Baller-Gerold +Bamberger-Marie +Bamforth-Lazarus +Bannayan-Riley-Ruvalcaba +Bannayan-Zonana +Bardet-Biedl +Barraquer-Simons +Barré-Liéou +Bart-Pumphrey +Bassen-Kornzweig +Beckwith-Wiedemann +Berardinelli-Seip +Bernard-Horner +Bernard-Soulier +Bernhardt-Roth +Besnier-Boeck-Schaumann +Bing-Horton +Bing-Neel +Birt-Hogg-Dubé +Bland-White-Garland +Bloch-Sulzberger +Bonnevie-Ullrich +Bourneville-Pringle +Brachman-de Lange +Brailsford-Morquio +Brill-Symmers +Brill-Zinsser +Brissaud-Sicard +Brown-Séquard +Bruck-de Lange +Bruns-Garland +Bruton-Gitlin +Budd-Chiari +Bürger-Grütz +Caffey-Silverman +Camurati-Engelmann +Carney-Stratakis +Charcot-Marie-Tooth +Charles Bonnet +Chiari-Frommel +Christ-Siemens-Touraine +Christensen-Krabbe +Churg-Strauss +Chédiak-Higashi +Claude Bernard-Horner +Clerambault-Kandinsky +Coffin-Lowry +Coffin-Siris +Collet-Sicard +Cornelia de Lange +Creutzfeldt-Jakob +Crigler-Najjar +Crocq-Cassirer +Cronkhite-Canada +Cruveilhier-Baumgarten +Curschmann-Batten-Steinert +Curschmann-Steinert +Céstan-Chenais +Danbolt-Closs +Dandy-Walker +Dejerine-Sottas +Dennie-Marfan +Denys-Drash +Diamond-Blackfan +Doege-Potter +Donnai-Barrow +Dubin-Johnson +Duchenne-Aran +Ehlers-Danlos +Emery-Dreifuss +Erb-Duchenne +Erdheim-Chester +Fitz-Hugh-Curtis +Flajani-Basedow +Foix-Alajouanine +Foix-Chavany-Marie +Forbes-Albright +Fritsch-Asherman +Gerbec-Morgagni-Adams-Stokes +Gerbezius-Morgagni-Adams-Stokes +Gilles de la Tourette +Gorlin-Goltz +Graves-Basedow +Guillain-Barré +Guillain-Barré-Strohl +Hailey-Hailey +Hallervorden-Spatz +Hand-Schüller-Christian +Hecht-Scott +Henoch-Schönlein +Holt-Oram +Hurler-Scheie +Hutchinson-Gilford +Irvine-Gass +Jakob-Creutzfeldt +Jarvi-Nasu-Hakola +Johanson-Blizzard +Jones-Smith +Kasabach-Merritt +Kashin-Beck +Kearns-Sayre +Kelly-Patterson +Kenny-Caffey +Kimmelstiel-Wilson +King-Kopetzky +Klüver-Bucy +Kugelberg-Welander +Laband-Zimmermann +Laurence-Moon +Laurence-Moon-Bardet-Biedl +Laurence-Moon-Biedl +Laurence-Moon-Biedl-Bardet +Legg-Calvé-Perthes +Lennox-Gastaut +Lesch-Nyhan +Letterer-Siwe +Lewandowsky-Lutz +Li-Fraumeni +Libman-Sacks +Loeys-Dietz +Lou Gehrig +Lujan-Fryns +Machado-Joseph +Mallory-Weiss +Marie-Foix-Alajouanine +Marshall-Smith +Marshall-Smith-Weaver +Martin-Albright +May-Hegglin +Mayer-Rokitansky-Küster-Hauser +McCune-Albight +McCune-Albright +Meckel-Gruber +Miss Havisham +Mowat-Wilson +Mucha-Habermann +Mulvihill-Smith +Munchausen by proxy +Myhre-Riley-Smith +Nasu-Hakola +Non-Hodgkin +Opitz-Kaveggia +Osgood-Schlatter +Osler-Weber-Rendu +Paget-Schroetter +Paget-Von Schrötter +Paterson-Brown-Kelly +Pelizaeus-Merzbacher +Peutz-Jeghers +Pfaundler-Hurler +Pierre-Robin +Plummer-Vinson +Potocki-Lupski +Potocki-Shaffer +Prader-Willi +Ramsay Hunts +Raymond Céstan +Riley-Day +Riley-Smith +Rubinstein-Taybi +Russell-Silver +Ruvalcaba-Myhre +Ruvalcaba-Myhre-Smith +Ruzicka-Goerz-Anton 
+Sanjad-Sakati +Sanjad-Sakati-Richardson-Kirk +Schinzel-Giedion +Seaver Cassidy +Shwachman-Bodian-Diamond +Silver-Russell +Sjögren-Larsson +Smith-Lemli-Opitz +Steele-Richardson-Olszewski +Stevens-Johnson +Stokes-Adams +Sturge-Weber +Tatton-Brown-Rahman +Tay-Sachs +Temple-Baraitser +Treacher Collins +Unverricht-Lundborg +Verner Morrison +Vogt-Koyanagi-Harada +Von Hippel-Lindau +Waldenstrom-Kjellberg +Waterhouse-Friderichsen +Weber-Christian +Werdnig-Hoffmann +Wernicke-Korsakoff +Westerhof-Beemer-Cormane +Willis-Ekbom +Wiskott-Aldrich +Wittmaack-Ekbom +Wohlfart-Kugelberg-Welander +Wolff-Parkinson-White +Zimmermann-Laband +Zollinger-Ellison +Zondek-Bromberg-Rozin +Zuelzer-Kaplan +Zuelzer-Ogden \ No newline at end of file diff --git a/deduce/data/lookup/src/whitelist/lst_eponymous_disease/lst_eponymous_single/items.txt b/deduce/data/lookup/src/whitelist/lst_eponymous_disease/lst_eponymous_single/items.txt new file mode 100644 index 00000000..9a1957b6 --- /dev/null +++ b/deduce/data/lookup/src/whitelist/lst_eponymous_disease/lst_eponymous_single/items.txt @@ -0,0 +1,760 @@ +Aarskog ziekte +Aase ziekte +Abercrombie ziekte +Ackerman ziekte +Addison ziekte +Aicardi ziekte +Alagille ziekte +Albright ziekte +Alexander ziekte +Alice in Wonderland ziekte +Alpers ziekte +Alport ziekte +Alström ziekte +Alvarez ziekte +Alzheimer ziekte +Anders ziekte +Andersen ziekte +Angelman ziekte +Angelucci ziekte +Anton ziekte +Apert ziekte +Asherman ziekte +Asperger ziekte +Avellis ziekte +Ayerza ziekte +Baastrup ziekte +Babesiosis ziekte +Babington ziekte +Baker ziekte +Balo ziekte +Bamberger ziekte +Bancroft ziekte +Bang ziekte +Bankart ziekte +Banti ziekte +Barlow ziekte +Barrett ziekte +Barth ziekte +Bartholin ziekte +Bartter ziekte +Basedow ziekte +Batten ziekte +Bazin ziekte +Bechterew ziekte +Becker ziekte +Begbie ziekte +Behçet ziekte +Bekhterev ziekte +Bell ziekte +Benedikt ziekte +Benjamin ziekte +Berdon ziekte +Berger ziekte +Bergeron ziekte +Bernard ziekte +Bernheim ziekte +Besnier ziekte +Bickerstaff ziekte +Biermer ziekte +Bietti ziekte +Bilharzia ziekte +Binder ziekte +Binswanger ziekte +Bloom ziekte +Blount ziekte +Boerhaave ziekte +Bogorad ziekte +Bourneville ziekte +Bowen ziekte +Brandt ziekte +Brenner ziekte +Brewer ziekte +Bright ziekte +Briquet ziekte +Brissaud ziekte +Broadbent ziekte +Brock ziekte +Brodie ziekte +Brooke ziekte +Brucellosis ziekte +Brugada ziekte +Bruns ziekte +Buerger ziekte +Bumke ziekte +Burkitt ziekte +Burnett ziekte +Bywaters ziekte +Bárány ziekte +Calvé ziekte +Canavan ziekte +Cannon ziekte +Cantú ziekte +Capgras ziekte +Caplan ziekte +Carney ziekte +Caroli ziekte +Carrion ziekte +Castleman ziekte +Chagas ziekte +Charcot ziekte +Cheadle ziekte +Chiari ziekte +Chilaiditi ziekte +Christmas ziekte +Claude ziekte +Clerambault ziekte +Coats ziekte +Cock ziekte +Cockayne ziekte +Cogan ziekte +Cohen ziekte +Concato ziekte +Conn ziekte +Cooley ziekte +Cori ziekte +Costello ziekte +Costen ziekte +Cotard ziekte +Cowden ziekte +Crohn ziekte +Crosby ziekte +Crouzon ziekte +Cruz ziekte +Cryer ziekte +Csillag ziekte +Curling ziekte +Cushing ziekte +Da Costa ziekte +Dalrymple ziekte +De Clérambault ziekte +De Quervain ziekte +Dent ziekte +Dercum ziekte +Devic ziekte +Di Guglielmo ziekte +DiGeorge ziekte +Diogenes ziekte +Donovanosis ziekte +Down ziekte +Dravet ziekte +Dressler ziekte +Duane ziekte +Duchenne ziekte +Dukes ziekte +Duncan ziekte +Dupuytren ziekte +Duroziez ziekte +Eales ziekte +Ebstein ziekte +Edwards ziekte +Ehrlichiosis ziekte +Eisenmenger ziekte +Ekbom ziekte +Emanuel 
ziekte +Engelmann ziekte +Erb ziekte +Evans ziekte +Fabry ziekte +Fanconi ziekte +Farber ziekte +Felty ziekte +Flajan ziekte +Forbes ziekte +Forestier ziekte +Fournier ziekte +Fregoli ziekte +Frey ziekte +Friedreich ziekte +Fritsch ziekte +Fryns ziekte +Fuchs ziekte +Ganser ziekte +Gaucher ziekte +Ghon ziekte +Gilbert ziekte +Gitelman ziekte +Glanzmann ziekte +Goldenhar ziekte +Goodpasture ziekte +Gouverneur ziekte +Graves ziekte +Grawitz ziekte +Greig ziekte +Grinker ziekte +Gruber ziekte +Gunther ziekte +Hallopeau ziekte +Hansen ziekte +Hardikar ziekte +Hartnup ziekte +Hashimoto ziekte +Havisham ziekte +Heyde ziekte +Hirschsprung ziekte +Hodgkin ziekte +Horner ziekte +Horton ziekte +Huntington ziekte +Hurler ziekte +Illig ziekte +Jaeken ziekte +Jalili ziekte +Joseph ziekte +Kahler ziekte +Kallmann ziekte +Kanner ziekte +Kaposi ziekte +Kartagener ziekte +Kawasaki ziekte +Kennedy ziekte +Kienbock ziekte +Kikuchi ziekte +Kimura ziekte +Kinsbourne ziekte +Kjer ziekte +Klatskin ziekte +Klinefelter ziekte +Korsakoff ziekte +Kounis ziekte +Krabbe ziekte +Krukenberg ziekte +Kuttner ziekte +Köhler ziekte +Laband ziekte +Lafora ziekte +Laron ziekte +Leigh ziekte +Leiner ziekte +Leishmaniasis ziekte +Lejeune ziekte +Lemierre ziekte +Lennox ziekte +Lenègre ziekte +Lev ziekte +Liddle ziekte +Lisfranc ziekte +Listeriosis ziekte +Lobomycosis ziekte +Lowe ziekte +Ludwig ziekte +Lyme ziekte +Lynch ziekte +Löffler ziekte +Löfgren ziekte +Machado ziekte +Maladie ziekte +Mansonelliasis ziekte +Marburg ziekte +Marfan ziekte +Marsh ziekte +Marshall ziekte +Maydl ziekte +Mazzotti ziekte +McArdle ziekte +Meckel ziekte +Meigs ziekte +Menkes ziekte +Middleton ziekte +Mikulicz ziekte +Mirizzi ziekte +Miss Havisham ziekte +Mondor ziekte +Monge ziekte +Morbus ziekte +Mortimer ziekte +Morton ziekte +Moschcowitz ziekte +Ménière ziekte +Ménétrier ziekte +Möbius ziekte +Münchausen ziekte +Noonan ziekte +Ormond ziekte +Othello ziekte +Paget ziekte +Parkinson ziekte +Patau ziekte +Pearson ziekte +Pendred ziekte +Perthes ziekte +Peyronie ziekte +Pfeiffer ziekte +Pick ziekte +Pickardt ziekte +Plummer ziekte +Plyushkin ziekte +Poland ziekte +Pompe ziekte +Pott ziekte +Potter ziekte +Prasad ziekte +Primrose ziekte +Prinzmetal ziekte +Purtilo ziekte +Quarelli ziekte +Quervain ziekte +Ranke ziekte +Raynaud ziekte +Refsum ziekte +Reiter ziekte +Rett ziekte +Reye ziekte +Rickettsiosis ziekte +Riddoch ziekte +Riedel ziekte +Riggs ziekte +Ritter ziekte +Robles ziekte +Roger ziekte +Rolandic ziekte +Rotor ziekte +Saint ziekte +Sandhoff ziekte +Sandifer ziekte +Sanfilippo ziekte +Schamberg ziekte +Scheie ziekte +Scheuermann ziekte +Schilder ziekte +Schnitzler ziekte +Seligmann ziekte +Sever ziekte +Shabbir ziekte +Sheehan ziekte +Shprintzen ziekte +Simmonds ziekte +Sipple ziekte +Sjögren ziekte +Skumin ziekte +Stargardt ziekte +Still ziekte +Strümpell ziekte +Susac ziekte +Sutton ziekte +Takayasu ziekte +Theileriosis ziekte +Thomsen ziekte +Tietz ziekte +Tietze ziekte +Todd ziekte +Tourette ziekte +Turcot ziekte +Turner ziekte +Usher ziekte +Valentino ziekte +Vincent ziekte +Virchow ziekte +Von Gierke ziekte +Von Recklinghausen ziekte +Von Willebrand ziekte +Von Zumbusch ziekte +Waardenburg ziekte +Waldenstrom ziekte +Waldenström ziekte +Warkany ziekte +Warthin ziekte +Watson ziekte +Wegener ziekte +Weil ziekte +Welander ziekte +Wells ziekte +Wermer ziekte +Werner ziekte +Wernicke ziekte +Westerhof ziekte +Whipple ziekte +Williams ziekte +Wilms ziekte +Wilson ziekte +Wolman ziekte +Yesudian ziekte +Zahorsky ziekte +Zellweger ziekte 
+Zenker ziekte +Zieve ziekte +Zondek ziekte +Zoon ziekte +Zuelzer ziekte +Zumbusch ziekte +de Quervain ziekte +ziekte van Aarskog +ziekte van Aase +ziekte van Abercrombie +ziekte van Ackerman +ziekte van Addison +ziekte van Aicardi +ziekte van Alagille +ziekte van Albright +ziekte van Alexander +ziekte van Alice in Wonderland +ziekte van Alpers +ziekte van Alport +ziekte van Alström +ziekte van Alvarez +ziekte van Alzheimer +ziekte van Anders +ziekte van Andersen +ziekte van Angelman +ziekte van Angelucci +ziekte van Anton +ziekte van Apert +ziekte van Asherman +ziekte van Asperger +ziekte van Avellis +ziekte van Ayerza +ziekte van Baastrup +ziekte van Babesiosis +ziekte van Babington +ziekte van Baker +ziekte van Balo +ziekte van Bamberger +ziekte van Bancroft +ziekte van Bang +ziekte van Bankart +ziekte van Banti +ziekte van Barlow +ziekte van Barrett +ziekte van Barth +ziekte van Bartholin +ziekte van Bartter +ziekte van Basedow +ziekte van Batten +ziekte van Bazin +ziekte van Bechterew +ziekte van Becker +ziekte van Begbie +ziekte van Behçet +ziekte van Bekhterev +ziekte van Bell +ziekte van Benedikt +ziekte van Benjamin +ziekte van Berdon +ziekte van Berger +ziekte van Bergeron +ziekte van Bernard +ziekte van Bernheim +ziekte van Besnier +ziekte van Bickerstaff +ziekte van Biermer +ziekte van Bietti +ziekte van Bilharzia +ziekte van Binder +ziekte van Binswanger +ziekte van Bloom +ziekte van Blount +ziekte van Boerhaave +ziekte van Bogorad +ziekte van Bourneville +ziekte van Bowen +ziekte van Brandt +ziekte van Brenner +ziekte van Brewer +ziekte van Bright +ziekte van Briquet +ziekte van Brissaud +ziekte van Broadbent +ziekte van Brock +ziekte van Brodie +ziekte van Brooke +ziekte van Brucellosis +ziekte van Brugada +ziekte van Bruns +ziekte van Buerger +ziekte van Bumke +ziekte van Burkitt +ziekte van Burnett +ziekte van Bywaters +ziekte van Bárány +ziekte van Calvé +ziekte van Canavan +ziekte van Cannon +ziekte van Cantú +ziekte van Capgras +ziekte van Caplan +ziekte van Carney +ziekte van Caroli +ziekte van Carrion +ziekte van Castleman +ziekte van Chagas +ziekte van Charcot +ziekte van Cheadle +ziekte van Chiari +ziekte van Chilaiditi +ziekte van Christmas +ziekte van Claude +ziekte van Clerambault +ziekte van Coats +ziekte van Cock +ziekte van Cockayne +ziekte van Cogan +ziekte van Cohen +ziekte van Concato +ziekte van Conn +ziekte van Cooley +ziekte van Cori +ziekte van Costello +ziekte van Costen +ziekte van Cotard +ziekte van Cowden +ziekte van Crohn +ziekte van Crosby +ziekte van Crouzon +ziekte van Cruz +ziekte van Cryer +ziekte van Csillag +ziekte van Curling +ziekte van Cushing +ziekte van Da Costa +ziekte van Dalrymple +ziekte van De Clérambault +ziekte van De Quervain +ziekte van Dent +ziekte van Dercum +ziekte van Devic +ziekte van Di Guglielmo +ziekte van DiGeorge +ziekte van Diogenes +ziekte van Donovanosis +ziekte van Down +ziekte van Dravet +ziekte van Dressler +ziekte van Duane +ziekte van Duchenne +ziekte van Dukes +ziekte van Duncan +ziekte van Dupuytren +ziekte van Duroziez +ziekte van Eales +ziekte van Ebstein +ziekte van Edwards +ziekte van Ehrlichiosis +ziekte van Eisenmenger +ziekte van Ekbom +ziekte van Emanuel +ziekte van Engelmann +ziekte van Erb +ziekte van Evans +ziekte van Fabry +ziekte van Fanconi +ziekte van Farber +ziekte van Felty +ziekte van Flajan +ziekte van Forbes +ziekte van Forestier +ziekte van Fournier +ziekte van Fregoli +ziekte van Frey +ziekte van Friedreich +ziekte van Fritsch +ziekte van Fryns +ziekte van Fuchs +ziekte van Ganser 
+ziekte van Gaucher +ziekte van Ghon +ziekte van Gilbert +ziekte van Gitelman +ziekte van Glanzmann +ziekte van Goldenhar +ziekte van Goodpasture +ziekte van Gouverneur +ziekte van Graves +ziekte van Grawitz +ziekte van Greig +ziekte van Grinker +ziekte van Gruber +ziekte van Gunther +ziekte van Hallopeau +ziekte van Hansen +ziekte van Hardikar +ziekte van Hartnup +ziekte van Hashimoto +ziekte van Havisham +ziekte van Heyde +ziekte van Hirschsprung +ziekte van Hodgkin +ziekte van Horner +ziekte van Horton +ziekte van Huntington +ziekte van Hurler +ziekte van Illig +ziekte van Jaeken +ziekte van Jalili +ziekte van Joseph +ziekte van Kahler +ziekte van Kallmann +ziekte van Kanner +ziekte van Kaposi +ziekte van Kartagener +ziekte van Kawasaki +ziekte van Kennedy +ziekte van Kienbock +ziekte van Kikuchi +ziekte van Kimura +ziekte van Kinsbourne +ziekte van Kjer +ziekte van Klatskin +ziekte van Klinefelter +ziekte van Korsakoff +ziekte van Kounis +ziekte van Krabbe +ziekte van Krukenberg +ziekte van Kuttner +ziekte van Köhler +ziekte van Laband +ziekte van Lafora +ziekte van Laron +ziekte van Leigh +ziekte van Leiner +ziekte van Leishmaniasis +ziekte van Lejeune +ziekte van Lemierre +ziekte van Lennox +ziekte van Lenègre +ziekte van Lev +ziekte van Liddle +ziekte van Lisfranc +ziekte van Listeriosis +ziekte van Lobomycosis +ziekte van Lowe +ziekte van Ludwig +ziekte van Lyme +ziekte van Lynch +ziekte van Löffler +ziekte van Löfgren +ziekte van Machado +ziekte van Maladie +ziekte van Mansonelliasis +ziekte van Marburg +ziekte van Marfan +ziekte van Marsh +ziekte van Marshall +ziekte van Maydl +ziekte van Mazzotti +ziekte van McArdle +ziekte van Meckel +ziekte van Meigs +ziekte van Menkes +ziekte van Middleton +ziekte van Mikulicz +ziekte van Mirizzi +ziekte van Miss Havisham +ziekte van Mondor +ziekte van Monge +ziekte van Morbus +ziekte van Mortimer +ziekte van Morton +ziekte van Moschcowitz +ziekte van Ménière +ziekte van Ménétrier +ziekte van Möbius +ziekte van Münchausen +ziekte van Noonan +ziekte van Ormond +ziekte van Othello +ziekte van Paget +ziekte van Parkinson +ziekte van Patau +ziekte van Pearson +ziekte van Pendred +ziekte van Perthes +ziekte van Peyronie +ziekte van Pfeiffer +ziekte van Pick +ziekte van Pickardt +ziekte van Plummer +ziekte van Plyushkin +ziekte van Poland +ziekte van Pompe +ziekte van Pott +ziekte van Potter +ziekte van Prasad +ziekte van Primrose +ziekte van Prinzmetal +ziekte van Purtilo +ziekte van Quarelli +ziekte van Quervain +ziekte van Ranke +ziekte van Raynaud +ziekte van Refsum +ziekte van Reiter +ziekte van Rett +ziekte van Reye +ziekte van Rickettsiosis +ziekte van Riddoch +ziekte van Riedel +ziekte van Riggs +ziekte van Ritter +ziekte van Robles +ziekte van Roger +ziekte van Rolandic +ziekte van Rotor +ziekte van Saint +ziekte van Sandhoff +ziekte van Sandifer +ziekte van Sanfilippo +ziekte van Schamberg +ziekte van Scheie +ziekte van Scheuermann +ziekte van Schilder +ziekte van Schnitzler +ziekte van Seligmann +ziekte van Sever +ziekte van Shabbir +ziekte van Sheehan +ziekte van Shprintzen +ziekte van Simmonds +ziekte van Sipple +ziekte van Sjögren +ziekte van Skumin +ziekte van Stargardt +ziekte van Still +ziekte van Strümpell +ziekte van Susac +ziekte van Sutton +ziekte van Takayasu +ziekte van Theileriosis +ziekte van Thomsen +ziekte van Tietz +ziekte van Tietze +ziekte van Todd +ziekte van Tourette +ziekte van Turcot +ziekte van Turner +ziekte van Usher +ziekte van Valentino +ziekte van Vincent +ziekte van Virchow +ziekte van Von Gierke +ziekte van 
Von Recklinghausen +ziekte van Von Willebrand +ziekte van Von Zumbusch +ziekte van Waardenburg +ziekte van Waldenstrom +ziekte van Waldenström +ziekte van Warkany +ziekte van Warthin +ziekte van Watson +ziekte van Wegener +ziekte van Weil +ziekte van Welander +ziekte van Wells +ziekte van Wermer +ziekte van Werner +ziekte van Wernicke +ziekte van Westerhof +ziekte van Whipple +ziekte van Williams +ziekte van Wilms +ziekte van Wilson +ziekte van Wolman +ziekte van Yesudian +ziekte van Zahorsky +ziekte van Zellweger +ziekte van Zenker +ziekte van Zieve +ziekte van Zondek +ziekte van Zoon +ziekte van Zuelzer +ziekte van Zumbusch +ziekte van de Quervain \ No newline at end of file diff --git a/deduce/data/lookup/src/whitelist/lst_eponymous_disease/lst_eponymous_single/transform.json b/deduce/data/lookup/src/whitelist/lst_eponymous_disease/lst_eponymous_single/transform.json new file mode 100644 index 00000000..8f0e933a --- /dev/null +++ b/deduce/data/lookup/src/whitelist/lst_eponymous_disease/lst_eponymous_single/transform.json @@ -0,0 +1,22 @@ +{ + "transforms": { + "ziekte_1": { + " ziekte$": [ + " ziekte", + "' ziekte", + "'s ziekte" + ] + }, + "ziekte_2": { + "ziekte": [ + "ziekte", + "syndroom", + "afwijking", + "tumor", + "reactie", + "complex", + "aandoening" + ] + } + } +} \ No newline at end of file diff --git a/deduce/data/lookup/src/whitelist/lst_eponymous_disease/transform.json b/deduce/data/lookup/src/whitelist/lst_eponymous_disease/transform.json new file mode 100644 index 00000000..1975e8bf --- /dev/null +++ b/deduce/data/lookup/src/whitelist/lst_eponymous_disease/transform.json @@ -0,0 +1,39 @@ +{ + "transforms": { + "ziekte_1": { + " ziekte$": [ + " ziekte", + "' ziekte", + "'s ziekte" + ] + }, + "ziekte_2": { + "ziekte": [ + "ziekte", + "syndroom", + "afwijking", + "tumor", + "reactie", + "complex", + "aandoening" + ] + }, + "prop": { + "\\bVon": [ + "Von", + "von" + ] + }, + "punct": { + "\\.": [ + ".", + "" + ], + "-": [ + "-", + "", + " " + ] + } + } +} \ No newline at end of file diff --git a/deduce-data/lookup_lists/medical_terms.txt b/deduce/data/lookup/src/whitelist/lst_medical_term/items.txt similarity index 100% rename from deduce-data/lookup_lists/medical_terms.txt rename to deduce/data/lookup/src/whitelist/lst_medical_term/items.txt diff --git a/deduce-data/lookup_lists/stop_words.txt b/deduce/data/lookup/src/whitelist/lst_stop_word/items.txt similarity index 100% rename from deduce-data/lookup_lists/stop_words.txt rename to deduce/data/lookup/src/whitelist/lst_stop_word/items.txt diff --git a/deduce/deduce.py b/deduce/deduce.py index 20f619d4..a0d935c2 100644 --- a/deduce/deduce.py +++ b/deduce/deduce.py @@ -1,118 +1,360 @@ +"""Loads Deduce and all its components.""" + +import importlib.metadata +import itertools import json +import logging import os -import re +import sys +import warnings from pathlib import Path -from typing import Optional +from typing import Any, Optional, Union import docdeid as dd +from deprecated import deprecated +from frozendict import frozendict from deduce import utils -from deduce.annotation_processing import ( +from deduce.annotation_processor import ( CleanAnnotationTag, DeduceMergeAdjacentAnnotations, PersonAnnotationConverter, RemoveAnnotations, ) from deduce.annotator import ContextAnnotator, TokenPatternAnnotator -from deduce.lookup_sets import get_lookup_sets -from deduce.redact import DeduceRedactor +from deduce.lookup_struct_loader import load_interfix_lookup, load_prefix_lookup +from deduce.lookup_structs import 
get_lookup_structs, load_raw_itemsets +from deduce.redactor import DeduceRedactor from deduce.tokenizer import DeduceTokenizer +__version__ = importlib.metadata.version(__package__ or __name__) + + +_BASE_PATH = Path(os.path.dirname(__file__)).parent +_LOOKUP_LIST_PATH = _BASE_PATH / "deduce" / "data" / "lookup" +_BASE_CONFIG_FILE = _BASE_PATH / "base_config.json" + + +logging.basicConfig(stream=sys.stdout, level=logging.DEBUG) +warnings.simplefilter(action="default") -class Deduce(dd.DocDeid): + +class Deduce(dd.DocDeid): # pylint: disable=R0903 """ Main class for de-identification. - Inherits from ``docdeid.DocDeid``, and as such, most information is available - in the documentation there. + Inherits from ``docdeid.DocDeid``, and as such, most information on deidentifying + text with a Deduce object is available there. + + Args: + load_base_config: Whether or not to load the base config that is packaged with + deduce. This loads some sensible defaults, although further customization + is always recommended. + config: A specific user config, either as a dict, or pointing to a `json` file. + When `load_base_config` is set to `True`, only settings defined in `config` + are overwritten, and other defaults are kept. When `load_base_config` is + set to `False`, no defaults are loaded and only configuration from `config` + is applied. + lookup_data_path: The path to look for lookup data, by default included in + the package. If you want to make changes to source files, it's recommended + to copy the source data and point deduce to that folder with this + argument. + build_lookup_structs: Will always reload and rebuild lookup structs rather than + using the cache when this is set to `True`. """ - def __init__( + def __init__( # pylint: disable=R0913 self, + load_base_config: bool = True, + config: Optional[Union[str, dict]] = None, config_file: Optional[str] = None, - use_config_defaults: Optional[bool] = True, + lookup_data_path: Union[str, Path] = _LOOKUP_LIST_PATH, + build_lookup_structs: bool = False, ) -> None: + super().__init__() - self.config_file = config_file - self.use_config_defaults = use_config_defaults + if config_file is not None: + + warnings.warn( + "The config_file keyword is deprecated, please use config " + "instead, which accepts both filenames and dictionaries.", + DeprecationWarning, + ) + + config = config_file - self.config = self._initialize_config() + self.config = self._initialize_config( + load_base_config=load_base_config, user_config=config + ) + + self.lookup_data_path = self._initialize_lookup_data_path(lookup_data_path) + self.tokenizers = {"default": self._initialize_tokenizer(self.lookup_data_path)} + + self.lookup_structs = get_lookup_structs( + lookup_path=self.lookup_data_path, + tokenizer=self.tokenizers["default"], + deduce_version=__version__, + build=build_lookup_structs, + ) + + extras = {"tokenizer": self.tokenizers["default"], "ds": self.lookup_structs} - self.lookup_sets = get_lookup_sets() - self.tokenizers = self._initialize_tokenizers() - self.initialize_doc_processors() + self.processors = _DeduceProcessorLoader().load( + config=self.config, extras=extras + ) - def _initialize_config(self) -> dict: + @staticmethod + def _initialize_config( + load_base_config: bool = True, + user_config: Optional[Union[str, dict]] = None, + ) -> frozendict: """ - Initialize the config file. + Initialize the configuration. Returns: The config as a dictionary, based on the provided input file and default logic.
""" - if self.config_file is None and not self.use_config_defaults: + config: dict[str, Any] = {} + + if load_base_config: + + with open(_BASE_CONFIG_FILE, "r", encoding="utf-8") as file: + base_config = json.load(file) + + utils.overwrite_dict(config, base_config) + + if user_config is not None: + if isinstance(user_config, str): + with open(user_config, "r", encoding="utf-8") as file: + user_config = json.load(file) + + utils.overwrite_dict(config, user_config) + + return frozendict(config) + + @staticmethod + def _initialize_lookup_data_path(lookup_data_path: Union[str, Path]) -> Path: + + if isinstance(lookup_data_path, str): + lookup_data_path = Path(lookup_data_path) + + return lookup_data_path + + @staticmethod + def _initialize_tokenizer(lookup_data_path: Path) -> dd.Tokenizer: + + raw_itemsets = load_raw_itemsets( + base_path=lookup_data_path, + subdirs=["names/lst_interfix", "names/lst_prefix"], + ) + + prefix = load_prefix_lookup(raw_itemsets) + interfix = load_interfix_lookup(raw_itemsets) + + merge_terms = itertools.chain(prefix.items(), interfix.items()) + + return DeduceTokenizer(merge_terms=merge_terms) + + +class _DeduceProcessorLoader: # pylint: disable=R0903 + """Responsible for loading all processors that Deduce should use, based on config + and deduce logic.""" + + @staticmethod + def _get_multi_token_annotator(args: dict, extras: dict) -> dd.process.Annotator: + + lookup_struct = extras["ds"][args["lookup_values"]] + + if isinstance(lookup_struct, dd.ds.LookupSet): + args.update( + lookup_values=lookup_struct.items(), + matching_pipeline=lookup_struct.matching_pipeline, + tokenizer=extras["tokenizer]"], + ) + elif isinstance(lookup_struct, dd.ds.LookupTrie): + args.update(trie=lookup_struct) + del args["lookup_values"] + else: raise ValueError( - "Please specify a config file, or set use_config_defaults to True" + f"Don't know how to present lookup structure with type " + f"{type(lookup_struct)} to MultiTokenLookupAnnotator" ) - default_config_path = Path(os.path.dirname(__file__)).parent / "config.json" + return dd.process.MultiTokenLookupAnnotator(**args) + + @deprecated( + "The multi_token annotatortype is deprecated and will be removed in a " + "future version. Please set annotator_type field to " + "docdeid.process.MultiTokenAnnotator. See " + "https://github.com/vmenger/deduce/blob/main/base_config.json for examples." + ) + def _get_multi_token_annotator_old(self, *args, **kwargs) -> dd.process.Annotator: + return self._get_multi_token_annotator(*args, **kwargs) + + @staticmethod + @deprecated( + "The token_pattern annotatortype is deprecated and will be removed in " + "a future version. Please set annotator_type field to " + "deduce.annotator.TokenPatternAnnotator. See " + "https://github.com/vmenger/deduce/blob/main/base_config.json for " + "examples." + ) + def _get_token_pattern_annotator(args: dict, extras: dict) -> dd.process.Annotator: + + return TokenPatternAnnotator(**args, ds=extras["ds"]) + + @staticmethod + @deprecated( + "The dd_token_pattern annotatortype is deprecated and will be removed " + "in a future version. For patient name patterns, please use " + "deduce.annotator.PatientNameAnnotator. For other patterns, please " + "switch to deduce.annotator.TokenPatternAnnotator. See " + "https://github.com/vmenger/deduce/blob/main/base_config.json for " + "examples." 
+ ) + def _get_dd_token_pattern_annotator( + args: dict, extras: dict + ) -> dd.process.Annotator: + + pattern_args = args.pop("pattern") + module = pattern_args.pop("module") + cls = pattern_args.pop("class") + cls = utils.class_for_name(module, cls) - if self.use_config_defaults: - with open(default_config_path, "r", encoding="utf-8") as file: - config = json.load(file) + pattern = utils.initialize_class(cls, args=pattern_args, extras=extras) - if self.config_file is not None: - with open(Path(self.config_file), "r", encoding="utf-8") as file: - custom_config = json.load(file) + return dd.process.TokenPatternAnnotator(pattern=pattern) - config = utils.overwrite_dict(config, custom_config) + @staticmethod + @deprecated( + "The annotation_context annotatortype is deprecated and will be " + "removed in a future version. Please set annotator_type field to " + "deduce.annotator.ContextAnnotator. See " + "https://github.com/vmenger/deduce/blob/main/base_config.json for " + "examples." + ) + def _get_context_annotator(args: dict, extras: dict) -> dd.process.Annotator: - return config + return ContextAnnotator(**args, ds=extras["ds"]) - def _initialize_tokenizers(self) -> dict: - """Initializes tokenizers.""" + @staticmethod + @deprecated( + "The custom annotatortype is deprecated and will be removed in a " + "future version. Please set annotator_type field to module.class " + "directly, and remove module and class from args. See " + "https://github.com/vmenger/deduce/blob/main/base_config.json for " + "examples." + ) + def _get_custom_annotator(args: dict, extras: dict) -> dd.process.Annotator: - merge_terms = dd.ds.LookupSet() - merge_terms += self.lookup_sets["interfixes"] - merge_terms += self.lookup_sets["prefixes"] + module = args.pop("module") + cls = args.pop("class") - return {"default": DeduceTokenizer(merge_terms=merge_terms)} + cls = utils.class_for_name(module, cls) + return utils.initialize_class(cls, args=args, extras=extras) @staticmethod - def _initialize_annotators( - annotator_cnfg: dict, - lookup_sets: dd.ds.DsCollection, - tokenizer: dd.tokenize.Tokenizer, + @deprecated( + "The regexp annotatortype is deprecated and will be removed in a future " + "version. Please set annotator_type field to " + "deduce.annotator.ContextAnnotator. See " + "https://github.com/vmenger/deduce/blob/main/base_config.json for " + "examples.", + ) + def _get_regexp_annotator( + args: dict, extras: dict # pylint: disable=W0613 + ) -> dd.process.Annotator: + + return dd.process.RegexpAnnotator(**args) + + @staticmethod + def _get_annotator_from_class( + annotator_type: str, args: dict, extras: dict + ) -> dd.process.Annotator: + + elems = annotator_type.split(".") + module_name = ".".join(elems[:-1]) + class_name = elems[-1] + + cls = utils.class_for_name(module_name=module_name, class_name=class_name) + + return utils.initialize_class(cls, args, extras) + + @staticmethod + def _get_or_create_annotator_group( + group_name: Optional[str], processors: dd.process.DocProcessorGroup ) -> dd.process.DocProcessorGroup: - """Initializes annotators.""" - extras = {"ds": lookup_sets, "lookup_sets": lookup_sets, "tokenizer": tokenizer} - return _AnnotatorFactory().get_annotators(annotator_cnfg, extras) + if group_name is None: + group = processors # top level + elif group_name in processors.get_names(recursive=False): + existing_group = processors[group_name] - def initialize_doc_processors(self) -> None: - """ - Initializes document processors. 
+ if not isinstance(existing_group, dd.process.DocProcessorGroup): + raise RuntimeError( + f"processor with name {group_name} already exists, " + f"but is no group" + ) - Need to re-run this when updating lookup sets. - """ + group = existing_group - config = ( - self.config.copy() - ) # copy to prevent accidental overwrites, deletes, etc + else: + group = dd.process.DocProcessorGroup() + processors.add_processor(group_name, group) - self.processors = self._initialize_annotators( - config["annotators"].copy(), self.lookup_sets, self.tokenizers["default"] - ) - self.processors["names"].add_processor( + return group + + def _load_annotators( + self, config: frozendict, extras: dict + ) -> dd.process.DocProcessorGroup: + + annotator_creators = { + "docdeid.process.MultiTokenLookupAnnotator": self._get_multi_token_annotator, # noqa: E501, pylint: disable=C0301 + "multi_token": self._get_multi_token_annotator_old, + "token_pattern": self._get_token_pattern_annotator, + "dd_token_pattern": self._get_dd_token_pattern_annotator, + "annotation_context": self._get_context_annotator, + "regexp": self._get_regexp_annotator, + "custom": self._get_custom_annotator, + } + + annotators = dd.process.DocProcessorGroup() + + for annotator_name, annotator_info in config.items(): + + group = self._get_or_create_annotator_group( + annotator_info.get("group", None), processors=annotators + ) + + annotator_type = annotator_info["annotator_type"] + args = annotator_info["args"] + + if annotator_type in annotator_creators: + annotator = annotator_creators[annotator_type](args, extras) + else: + annotator = self._get_annotator_from_class(annotator_type, args, extras) + + group.add_processor(annotator_name, annotator) + + return annotators + + @staticmethod + def _load_name_processors(name_group: dd.process.DocProcessorGroup) -> None: + + name_group.add_processor( "person_annotation_converter", PersonAnnotationConverter() ) - self.processors["locations"].add_processor( + @staticmethod + def _load_location_processors(location_group: dd.process.DocProcessorGroup) -> None: + + location_group.add_processor( "remove_street_tags", RemoveAnnotations(tags=["straat"]) ) - self.processors["locations"].add_processor( + location_group.add_processor( "clean_street_tags", CleanAnnotationTag( tag_map={ @@ -122,8 +364,14 @@ def initialize_doc_processors(self) -> None: ), ) - sort_by_attrs = self.config["resolve_overlap_strategy"]["attributes"] - sort_by_ascending = self.config["resolve_overlap_strategy"]["ascending"] + @staticmethod + def _load_post_processors( + config: frozendict, post_group: dd.process.DocProcessorGroup + ) -> None: + """TODO.""" + + sort_by_attrs = config["resolve_overlap_strategy"]["attributes"] + sort_by_ascending = config["resolve_overlap_strategy"]["ascending"] sort_by = [] sort_by_callbacks = {} @@ -132,20 +380,17 @@ def initialize_doc_processors(self) -> None: sort_by.append(attr) sort_by_callbacks[attr] = (lambda x: x) if ascending else (lambda y: -y) - post_group = dd.process.DocProcessorGroup() - self.processors.add_processor("post_processing", post_group) - post_group.add_processor( "overlap_resolver", dd.process.OverlapResolver( - sort_by=sort_by, sort_by_callbacks=sort_by_callbacks + sort_by=tuple(sort_by), sort_by_callbacks=frozendict(sort_by_callbacks) ), ) post_group.add_processor( "merge_adjacent_annotations", DeduceMergeAdjacentAnnotations( - slack_regexp=config["adjacent_annotations_slack"] + slack_regexp=config["adjacent_annotations_slack"], check_overlap=False ), ) @@ -157,96 +402,37 @@ def 
initialize_doc_processors(self) -> None: ), ) - -class _AnnotatorFactory: # pylint: disable=R0903 - """Responsible for creating annotators, based on config.""" - - def __init__(self) -> None: - self.annotator_creators = { - "token_pattern": self._get_token_pattern_annotator, - "dd_token_pattern": self._get_dd_token_pattern_annotator, - "annotation_context": self._get_context_annotator, - "regexp": self._get_regexp_annotator, - "multi_token": self._get_multi_token_annotator, - "custom": self._get_custom_annotator, - } - - @staticmethod - def _get_token_pattern_annotator(args: dict, extras: dict) -> dd.process.Annotator: - return TokenPatternAnnotator(**args, ds=extras["ds"]) - - @staticmethod - def _get_dd_token_pattern_annotator( - args: dict, extras: dict - ) -> dd.process.Annotator: - pattern = utils.import_and_initialize(args.pop("pattern"), extras=extras) - return dd.process.TokenPatternAnnotator(pattern=pattern) - - @staticmethod - def _get_context_annotator(args: dict, extras: dict) -> dd.process.Annotator: - return ContextAnnotator(**args, ds=extras["ds"]) - - @staticmethod - def _get_regexp_annotator( - args: dict, extras: dict # pylint: disable=W0613 - ) -> dd.process.Annotator: - args["regexp_pattern"] = re.compile(args["regexp_pattern"]) - return dd.process.RegexpAnnotator(**args) - - @staticmethod - def _get_multi_token_annotator(args: dict, extras: dict) -> dd.process.Annotator: - if isinstance(args["lookup_values"], str): - lookup_set = extras["lookup_sets"][args["lookup_values"]] - - args["lookup_values"] = lookup_set.items() - args["matching_pipeline"] = lookup_set.matching_pipeline - - return dd.process.MultiTokenLookupAnnotator( - **args, tokenizer=extras["tokenizer"] - ) - - @staticmethod - def _get_custom_annotator(args: dict, extras: dict) -> dd.process.Annotator: - return utils.import_and_initialize(args=args, extras=extras) - - def get_annotators( - self, annotator_cnfg: dict, extras: dict - ) -> dd.process.DocProcessorGroup: + def load(self, config: frozendict, extras: dict) -> dd.process.DocProcessorGroup: """ - Get the annotators, requested in the annotator config. + Loads all processors. Loads annotators from config, and then adds document + processors based on logic that is internal to this class. Args: - annotator_cnfg: A dictionary containing configuration on which annotators - to initialize. - extras: Any additional objects passed to pattern or annotator init, - if present. + config: The config. + extras: Any extras that should be passed to annotators/annotation processors + as keyword arguments, e.g. tokenizers or datastructures. Returns: - A DocProcessorGroup containing the initialized annotators specified - in the config dict. + A docprocessorgroup containing all annotators/processors. 
""" - annotators = dd.process.DocProcessorGroup() + processors = self._load_annotators(config=config["annotators"], extras=extras) - for annotator_name, annotator_info in annotator_cnfg.items(): - if annotator_info["annotator_type"] not in self.annotator_creators: - raise ValueError( - f"Unexpected annotator_type {annotator_info['annotator_type']}" - ) - - group = annotators + self._load_name_processors( + name_group=self._get_or_create_annotator_group( + group_name="names", processors=processors + ) + ) - if "group" in annotator_info: - if annotator_info["group"] not in annotators.get_names(recursive=False): - annotators.add_processor( - annotator_info["group"], dd.process.DocProcessorGroup() - ) + self._load_location_processors( + location_group=self._get_or_create_annotator_group( + group_name="locations", processors=processors + ) + ) - group = annotators[annotator_info["group"]] + post_group = dd.process.DocProcessorGroup() + processors.add_processor("post_processing", post_group) - annotator = self.annotator_creators[annotator_info["annotator_type"]]( - annotator_info["args"], extras - ) - group.add_processor(annotator_name, annotator) + self._load_post_processors(config=config, post_group=post_group) - return annotators + return processors diff --git a/deduce/depr.py b/deduce/depr.py new file mode 100644 index 00000000..ec184fe2 --- /dev/null +++ b/deduce/depr.py @@ -0,0 +1,45 @@ +"""Contains deprecated components or functionality for backwards compatibility.""" + +import warnings + +import docdeid as dd + +warnings.simplefilter(action="default") + + +class DeprecatedDsCollection(dd.ds.DsCollection): + """Temporary deprecation wrapper.""" + + def __init__(self, deprecated_items: dict, *args, **kwargs) -> None: + self.deprecated_items = deprecated_items + self.deprecated_lists = { + k: dd.ds.LookupSet() for k, v in deprecated_items.items() if v is None + } + super().__init__(*args, **kwargs) + + def __getitem__(self, key: str) -> dd.ds.Datastructure: + if key in self.deprecated_items: + + new_key = self.deprecated_items[key] + + if new_key is None: + + warnings.warn( + f"The lookup structure '{key}' is no longer " + f"included in Deduce. If it was a list with exceptions, " + f"it is now automatically included in the normal list.", + DeprecationWarning, + ) + + return self.deprecated_lists[key] + + warnings.warn( + f"The lookup structure '{key}' has been renamed to " + f"'{new_key}', pleace replace it accordingly in your " + f"code/config", + DeprecationWarning, + ) + + return super().__getitem__(new_key) + + return super().__getitem__(key) diff --git a/deduce/lookup_sets.py b/deduce/lookup_sets.py deleted file mode 100644 index fa59b803..00000000 --- a/deduce/lookup_sets.py +++ /dev/null @@ -1,298 +0,0 @@ -import os -from pathlib import Path - -import docdeid as dd - -from deduce.str.processor import ( - FilterBasedOnLookupSet, - TitleCase, - UpperCase, - UpperCaseFirstChar, -) - -data_path = Path(os.path.dirname(__file__)).parent / "deduce-data" / "lookup_lists" - - -def _get_prefixes() -> dd.ds.LookupSet: - """Get prefixes LookupSet (e.g. 
'dr', 'mw')""" - - prefixes = dd.ds.LookupSet() - - prefixes.add_items_from_file(os.path.join(data_path, "names", "prefixes.txt")) - prefixes.add_items_from_self(cleaning_pipeline=[UpperCaseFirstChar()]) - - return prefixes - - -def _get_first_names() -> dd.ds.LookupSet: - """Get first names LookupSet.""" - - first_names = dd.ds.LookupSet() - - first_names.add_items_from_file( - os.path.join(data_path, "names", "first_names.txt"), - cleaning_pipeline=[dd.str.FilterByLength(min_len=2)], - ) - - first_name_exceptions = _get_first_name_exceptions() - - first_names.remove_items_from_iterable(first_name_exceptions) - - first_names.add_items_from_self( - cleaning_pipeline=[ - FilterBasedOnLookupSet(filter_set=_get_whitelist(), case_sensitive=False), - ], - replace=True, - ) - - return first_names - - -def _get_first_name_exceptions() -> dd.ds.LookupSet: - """Get first name exceptions.""" - - first_name_exceptions = dd.ds.LookupSet() - - first_name_exceptions.add_items_from_file( - os.path.join(data_path, "names", "first_name_exceptions.txt"), - ) - - return first_name_exceptions - - -def _get_interfixes() -> dd.ds.LookupSet: - """Get interfixes LookupSet ('van der', etc.)""" - - interfixes = dd.ds.LookupSet() - - interfixes.add_items_from_file(os.path.join(data_path, "names", "interfixes.txt")) - interfixes.add_items_from_self(cleaning_pipeline=[UpperCaseFirstChar()]) - interfixes.add_items_from_self(cleaning_pipeline=[TitleCase()]) - interfixes.remove_items_from_iterable(["V."]) - - return interfixes - - -def _get_interfix_surnames() -> dd.ds.LookupSet: - """Get interfix surnames LookupSet (e.g. 'Jong' for 'de Jong')""" - - interfix_surnames = dd.ds.LookupSet() - - interfix_surnames.add_items_from_file( - os.path.join(data_path, "names", "interfix_surnames.txt"), - ) - - interfix_surname_exceptions = dd.ds.LookupSet() - - interfix_surname_exceptions.add_items_from_file( - os.path.join(data_path, "names", "interfix_surname_exceptions.txt") - ) - - interfix_surnames.remove_items_from_iterable(interfix_surname_exceptions) - - return interfix_surnames - - -def _get_surnames() -> dd.ds.LookupSet: - """Get surnames LookupSet.""" - - surnames = dd.ds.LookupSet() - - surnames.add_items_from_file( - os.path.join(data_path, "names", "surnames.txt"), - cleaning_pipeline=[dd.str.FilterByLength(min_len=2)], - ) - - surname_exceptions = _get_surname_exceptions() - - surnames.remove_items_from_iterable(surname_exceptions) - - surnames.add_items_from_self( - cleaning_pipeline=[ - FilterBasedOnLookupSet(filter_set=_get_whitelist(), case_sensitive=False), - ], - replace=True, - ) - - return surnames - - -def _get_surname_exceptions() -> dd.ds.LookupSet: - """Get surname exceptions.""" - - surname_exceptions = dd.ds.LookupSet() - - surname_exceptions.add_items_from_file( - os.path.join(data_path, "names", "surname_exceptions.txt"), - ) - - return surname_exceptions - - -def _get_streets() -> dd.ds.LookupSet: - """Get streets lookupset.""" - - streets = dd.ds.LookupSet() - - streets.add_items_from_file( - file_path=os.path.join(data_path, "locations", "streets", "streets_long.txt"), - cleaning_pipeline=[ - dd.str.StripString(), - dd.str.FilterByLength(min_len=4), - ], - ) - - streets.add_items_from_self(cleaning_pipeline=[dd.str.ReplaceNonAsciiCharacters()]) - - return streets - - -def _get_placenames() -> dd.ds.LookupSet: - """Get place names LookupSet.""" - - placenames = dd.ds.LookupSet() - - placenames.add_items_from_file( - file_path=os.path.join(data_path, "locations", "placenames_long.txt"), - 
cleaning_pipeline=[ - dd.str.StripString(), - ], - ) - - placenames.add_items_from_self( - cleaning_pipeline=[dd.str.ReplaceNonAsciiCharacters()] - ) - - placenames.add_items_from_self( - cleaning_pipeline=[ - dd.str.ReplaceValue("(", ""), - dd.str.ReplaceValue(")", ""), - dd.str.ReplaceValue(" ", " "), - ] - ) - - placenames.add_items_from_self(cleaning_pipeline=[UpperCase()]) - - placenames.add_items_from_self( - cleaning_pipeline=[ - FilterBasedOnLookupSet(filter_set=_get_whitelist(), case_sensitive=False), - ], - replace=True, - ) - - return placenames - - -def _get_hospitals() -> dd.ds.LookupSet: - - hospitals = dd.ds.LookupSet(matching_pipeline=[dd.str.LowercaseString()]) - - hospitals.add_items_from_file( - os.path.join(data_path, "institutions", "hospital_long.txt") - ) - - hospitals.add_items_from_file( - os.path.join(data_path, "institutions", "hospital_abbr.txt") - ) - - hospitals.add_items_from_self( - cleaning_pipeline=[dd.str.ReplaceNonAsciiCharacters()], - ) - - return hospitals - - -def _get_institutions() -> dd.ds.LookupSet: - """Get institutions LookupSet.""" - - institutions = dd.ds.LookupSet() - institutions.add_items_from_file( - os.path.join(data_path, "institutions", "healthcare_institutions_long.txt"), - cleaning_pipeline=[dd.str.StripString(), dd.str.FilterByLength(min_len=4)], - ) - - institutions.add_items_from_self(cleaning_pipeline=[UpperCase()]) - - institutions.add_items_from_self( - cleaning_pipeline=[dd.str.ReplaceNonAsciiCharacters()], - ) - institutions = institutions - _get_whitelist() - - return institutions - - -def _get_top_terms() -> dd.ds.LookupSet: - top1000 = dd.ds.LookupSet() - top1000.add_items_from_file( - os.path.join(data_path, "top_1000_terms.txt"), - ) - - surnames_lowercase = dd.ds.LookupSet() - surnames_lowercase.add_items_from_file( - os.path.join(data_path, "names", "surnames.txt"), - cleaning_pipeline=[ - dd.str.LowercaseString(), - dd.str.FilterByLength(min_len=2), - ], - ) - - top1000 = top1000 - surnames_lowercase - - return top1000 - - -def _get_whitelist() -> dd.ds.LookupSet: - """ - Get whitelist LookupSet. - - Composed of medical terms, top 1000 frequent words (except surnames), and stopwords. - Returns: - """ - med_terms = dd.ds.LookupSet() - med_terms.add_items_from_file( - os.path.join(data_path, "medical_terms.txt"), - ) - - top1000 = _get_top_terms() - - stopwords = dd.ds.LookupSet() - stopwords.add_items_from_file(os.path.join(data_path, "stop_words.txt")) - - whitelist = dd.ds.LookupSet(matching_pipeline=[dd.str.LowercaseString()]) - whitelist.add_items_from_iterable( - med_terms + top1000 + stopwords, - cleaning_pipeline=[dd.str.FilterByLength(min_len=2)], - ) - - return whitelist - - -def get_lookup_sets() -> dd.ds.DsCollection: - """ - Get all lookupsets. - - Returns: - A DsCollection with all lookup sets. 
- """ - - lookup_sets = dd.ds.DsCollection() - - lookup_set_mapping = { - "prefixes": _get_prefixes, - "first_names": _get_first_names, - "first_name_exceptions": _get_first_name_exceptions, - "interfixes": _get_interfixes, - "interfix_surnames": _get_interfix_surnames, - "surnames": _get_surnames, - "surname_exceptions": _get_surname_exceptions, - "streets": _get_streets, - "placenames": _get_placenames, - "hospitals": _get_hospitals, - "healthcare_institutions": _get_institutions, - "whitelist": _get_whitelist, - } - - for name, init_function in lookup_set_mapping.items(): - lookup_sets[name] = init_function() - - return lookup_sets diff --git a/deduce/lookup_struct_loader.py b/deduce/lookup_struct_loader.py new file mode 100644 index 00000000..7d1f642d --- /dev/null +++ b/deduce/lookup_struct_loader.py @@ -0,0 +1,239 @@ +"""Some functions for creating lookup structures from raw items.""" + +import docdeid as dd +from docdeid import Tokenizer + +from deduce.str import FilterBasedOnLookupSet, TitleCase, UpperCase, UpperCaseFirstChar +from deduce.utils import lookup_set_to_trie + + +def load_common_word_lookup(raw_itemsets: dict[str, set[str]]) -> dd.ds.LookupSet: + """Load common_word LookupSet.""" + + common_word = dd.ds.LookupSet() + common_word.add_items_from_iterable( + raw_itemsets["common_word"], + ) + + surnames_lowercase = dd.ds.LookupSet() + surnames_lowercase.add_items_from_iterable( + raw_itemsets["surname"], + cleaning_pipeline=[ + dd.str.LowercaseString(), + dd.str.FilterByLength(min_len=2), + ], + ) + + common_word -= surnames_lowercase + + return common_word + + +def load_whitelist_lookup(raw_itemsets: dict[str, set[str]]) -> dd.ds.LookupSet: + """ + Load whitelist LookupSet. + + Composed of medical terms, top 1000 frequent words (except surnames), and stopwords. + """ + medical_term = dd.ds.LookupSet() + + medical_term.add_items_from_iterable( + raw_itemsets["medical_term"], + ) + + common_word = load_common_word_lookup(raw_itemsets) + + stop_word = dd.ds.LookupSet() + stop_word.add_items_from_iterable(raw_itemsets["stop_word"]) + + whitelist = dd.ds.LookupSet(matching_pipeline=[dd.str.LowercaseString()]) + whitelist.add_items_from_iterable( + medical_term + common_word + stop_word, + cleaning_pipeline=[dd.str.FilterByLength(min_len=2)], + ) + + return whitelist + + +def load_eponymous_disease_lookup( + raw_itemsets: dict[str, set[str]], tokenizer: Tokenizer +) -> dd.ds.LookupTrie: + """Loads eponymous disease LookupTrie (e.g. Henoch-Schonlein).""" + epo_disease = dd.ds.LookupSet() + epo_disease.add_items_from_iterable(raw_itemsets["eponymous_disease"]) + epo_disease.add_items_from_self( + cleaning_pipeline=[dd.str.ReplaceNonAsciiCharacters()] + ) + + return lookup_set_to_trie(epo_disease, tokenizer) + + +def load_prefix_lookup(raw_itemsets: dict[str, set[str]]) -> dd.ds.LookupSet: + """Load prefix LookupSet (e.g. 
'dr', 'mw').""" + + prefix = dd.ds.LookupSet() + + prefix.add_items_from_iterable(raw_itemsets["prefix"]) + prefix.add_items_from_self(cleaning_pipeline=[UpperCaseFirstChar()]) + + return prefix + + +def load_first_name_lookup( + raw_itemsets: dict[str, set[str]], tokenizer: Tokenizer +) -> dd.ds.LookupTrie: + """Load first_name LookupTrie.""" + + first_name = dd.ds.LookupSet() + + first_name.add_items_from_iterable( + raw_itemsets["first_name"], + cleaning_pipeline=[dd.str.FilterByLength(min_len=2)], + ) + + first_name.add_items_from_self( + cleaning_pipeline=[ + FilterBasedOnLookupSet( + filter_set=load_whitelist_lookup(raw_itemsets), case_sensitive=False + ), + ], + replace=True, + ) + + return lookup_set_to_trie(first_name, tokenizer) + + +def load_interfix_lookup(raw_itemsets: dict[str, set[str]]) -> dd.ds.LookupSet: + """Load interfix LookupSet ('van der', etc.).""" + + interfix = dd.ds.LookupSet() + + interfix.add_items_from_iterable(raw_itemsets["interfix"]) + interfix.add_items_from_self(cleaning_pipeline=[UpperCaseFirstChar()]) + interfix.add_items_from_self(cleaning_pipeline=[TitleCase()]) + interfix.remove_items_from_iterable(["V."]) + + return interfix + + +def load_surname_lookup( + raw_itemsets: dict[str, set[str]], tokenizer: Tokenizer +) -> dd.ds.LookupTrie: + """Load surname LookupTrie.""" + + surname = dd.ds.LookupSet() + + surname.add_items_from_iterable( + raw_itemsets["surname"], + cleaning_pipeline=[dd.str.FilterByLength(min_len=2)], + ) + + surname.add_items_from_self( + cleaning_pipeline=[ + FilterBasedOnLookupSet( + filter_set=load_whitelist_lookup(raw_itemsets), case_sensitive=False + ), + ], + replace=True, + ) + + return lookup_set_to_trie(surname, tokenizer) + + +def load_street_lookup( + raw_itemsets: dict[str, set[str]], tokenizer: Tokenizer +) -> dd.ds.LookupTrie: + """Load street LookupTrie.""" + + street = dd.ds.LookupSet() + + street.add_items_from_iterable( + raw_itemsets["street"], + cleaning_pipeline=[ + dd.str.StripString(), + dd.str.FilterByLength(min_len=4), + ], + ) + + street.add_items_from_self(cleaning_pipeline=[dd.str.ReplaceNonAsciiCharacters()]) + + return lookup_set_to_trie(street, tokenizer) + + +def load_placename_lookup( + raw_itemsets: dict[str, set[str]], tokenizer: Tokenizer +) -> dd.ds.LookupTrie: + """Load placename LookupTrie.""" + + placename = dd.ds.LookupSet() + + placename.add_items_from_iterable( + raw_itemsets["placename"], + cleaning_pipeline=[ + dd.str.StripString(), + ], + ) + + placename.add_items_from_self( + cleaning_pipeline=[dd.str.ReplaceNonAsciiCharacters()] + ) + + placename.add_items_from_self( + cleaning_pipeline=[ + dd.str.ReplaceValue("(", ""), + dd.str.ReplaceValue(")", ""), + dd.str.ReplaceValue(" ", " "), + ] + ) + + placename.add_items_from_self(cleaning_pipeline=[UpperCase()]) + + placename.add_items_from_self( + cleaning_pipeline=[ + FilterBasedOnLookupSet( + filter_set=load_whitelist_lookup(raw_itemsets), case_sensitive=False + ), + ], + replace=True, + ) + + return lookup_set_to_trie(placename, tokenizer) + + +def load_hospital_lookup( + raw_itemsets: dict[str, set[str]], tokenizer: Tokenizer +) -> dd.ds.LookupTrie: + """Load hopsital LookupTrie.""" + + hospital = dd.ds.LookupSet(matching_pipeline=[dd.str.LowercaseString()]) + + hospital.add_items_from_iterable(raw_itemsets["hospital"]) + + hospital.add_items_from_iterable(raw_itemsets["hospital_abbr"]) + + hospital.add_items_from_self( + cleaning_pipeline=[dd.str.ReplaceNonAsciiCharacters()], + ) + + return lookup_set_to_trie(hospital, tokenizer) + + 
+def load_institution_lookup( + raw_itemsets: dict[str, set[str]], tokenizer: Tokenizer +) -> dd.ds.LookupTrie: + """Load institution LookupTrie.""" + + institution = dd.ds.LookupSet() + institution.add_items_from_iterable( + raw_itemsets["healthcare_institution"], + cleaning_pipeline=[dd.str.StripString(), dd.str.FilterByLength(min_len=4)], + ) + + institution.add_items_from_self(cleaning_pipeline=[UpperCase()]) + + institution.add_items_from_self( + cleaning_pipeline=[dd.str.ReplaceNonAsciiCharacters()], + ) + institution = institution - load_whitelist_lookup(raw_itemsets) + + return lookup_set_to_trie(institution, tokenizer) diff --git a/deduce/lookup_structs.py b/deduce/lookup_structs.py new file mode 100644 index 00000000..104dfe2e --- /dev/null +++ b/deduce/lookup_structs.py @@ -0,0 +1,278 @@ +"""Responsible for loading, building and caching all lookup structures.""" + +import logging +import os +import pickle +from datetime import datetime +from pathlib import Path +from typing import Optional + +import docdeid as dd +from docdeid.tokenizer import Tokenizer + +from deduce.data.lookup.src import all_lists +from deduce.depr import DeprecatedDsCollection +from deduce.lookup_struct_loader import ( + load_eponymous_disease_lookup, + load_first_name_lookup, + load_hospital_lookup, + load_institution_lookup, + load_interfix_lookup, + load_placename_lookup, + load_prefix_lookup, + load_street_lookup, + load_surname_lookup, + load_whitelist_lookup, +) +from deduce.utils import apply_transform, optional_load_items, optional_load_json + +_SRC_SUBDIR = "src" +_CACHE_SUBDIR = "cache" +_CACHE_FILE = "lookup_structs.pickle" + +_LOOKUP_SET_LOADERS = { + "prefix": load_prefix_lookup, + "interfix": load_interfix_lookup, + "whitelist": load_whitelist_lookup, +} + +_LOOKUP_TRIE_LOADERS = { + "first_name": load_first_name_lookup, + "surname": load_surname_lookup, + "street": load_street_lookup, + "placename": load_placename_lookup, + "hospital": load_hospital_lookup, + "healthcare_institution": load_institution_lookup, + "eponymous_disease": load_eponymous_disease_lookup, +} + + +def load_raw_itemset(path: Path) -> set[str]: + """ + Load the raw items from a lookup list. This works by loading the data in items.txt, + removing the data in exceptions.txt (if any), and then applying the transformations + in transform_config.json (if any). If there are nested lookup lists, they will be + loaded and treated as if they are on items.txt. + + Args: + path: The path. + + Returns: + The raw items, as a set of strings. + """ + + items = optional_load_items(path / "items.txt") + exceptions = optional_load_items(path / "exceptions.txt") + + sub_list_dirs = list(path.glob("lst_*")) + + if items is None: + + if len(sub_list_dirs) == 0: + raise RuntimeError( + f"Cannot import lookup list {path}, did not find " + f"items.txt or any sublists." + ) + + items = set() + + if exceptions is not None: + items -= exceptions + + for sub_list_dir in sub_list_dirs: + items = items.union(load_raw_itemset(sub_list_dir)) + + transform_config = optional_load_json(path / "transform.json") + + if transform_config is not None: + items = apply_transform(items, transform_config) + + return items + + +def load_raw_itemsets(base_path: Path, subdirs: list[str]) -> dict[str, set[str]]: + """ + Loads one or more raw itemsets. Automatically parses its name from the folder name. + + Args: + base_path: The base path containing the lists. + subdirs: The lists to load. 
+ + Returns: + The raw itemsetes, represented as a dictionary mapping the name of the + lookup list to a set of strings. + """ + + lists = {} + + for lst in subdirs: + name = lst.split("/")[-1] + name = name.removeprefix("lst_") + lists[name] = load_raw_itemset(base_path / _SRC_SUBDIR / lst) + + return lists + + +def validate_lookup_struct_cache( + cache: dict, base_path: Path, deduce_version: str +) -> bool: + """ + Validates lookup structure data loaded from cache. Invalidates when changes in + source are detected, or when deduce version doesn't match. + + Args: + cache: The data loaded from the pickled cache. + base_path: The base path to check for changed files. + deduce_version: The current deduce version. + + Returns: + True when the lookup structure data is valid, False otherwise. + """ + + if cache["deduce_version"] != deduce_version: + return False + + src_path = base_path / _SRC_SUBDIR + + for file in src_path.glob("**"): + + if datetime.fromtimestamp(os.stat(file).st_mtime) > datetime.fromisoformat( + cache["saved_datetime"] + ): + return False + + return True + + +def load_lookup_structs_from_cache( + base_path: Path, deduce_version: str +) -> Optional[dd.ds.DsCollection]: + """ + Loads lookup struct data from cache. Returns None when no cache is present, or when + it's invalid. + + Args: + base_path: The base path where to look for the cache. + deduce_version: The current deduce version, used to validate. + + Returns: + A DsCollection if present and valid, None otherwise. + """ + + cache_file = base_path / _CACHE_SUBDIR / _CACHE_FILE + + try: + with open(cache_file, "rb") as file: + cache = pickle.load(file) + except FileNotFoundError: + return None + + if validate_lookup_struct_cache( + cache=cache, base_path=base_path, deduce_version=deduce_version + ): + return cache["lookup_structs"] + + return None + + +def cache_lookup_structs( + lookup_structs: dd.ds.DsCollection, base_path: Path, deduce_version: str +) -> None: + """ + Saves lookup structs to cache, along with some metadata. + + Args: + lookup_structs: The lookup structures to cache. + base_path: The base path for lookup structures. + deduce_version: The current deduce version. + """ + + cache_file = base_path / _CACHE_SUBDIR / _CACHE_FILE + + cache = { + "deduce_version": deduce_version, + "saved_datetime": str(datetime.now()), + "lookup_structs": lookup_structs, + } + + with open(cache_file, "wb") as file: + pickle.dump(cache, file) + + +def get_lookup_structs( + lookup_path: Path, + tokenizer: Tokenizer, + deduce_version: str, + build: bool = False, + save_cache: bool = True, +) -> dd.ds.DsCollection: + """ + Loads all lookup structures, and handles caching. + Args: + lookup_path: The base path for lookup sets. + tokenizer: The tokenizer, used to create sequences for LookupTrie + deduce_version: The current deduce version, used to validate cache. + build: Whether to do a full build, even when cache is present and valid. + save_cache: Whether to save to cache. Only used after building. + + Returns: The lookup structures. + + """ + + if not build: + + lookup_structs = load_lookup_structs_from_cache(lookup_path, deduce_version) + + if lookup_structs is not None: + return lookup_structs + + logging.info( + "Please wait 1-2 minutes while lookup data structures are being " + "loaded and built. This process is only triggered for new installs, " + "when the source lookup lists have changed on disk, or when " + "explicitly triggered with Deduce(build_lookup_structs=True)." 
+ ) + + lookup_structs = DeprecatedDsCollection( + deprecated_items={ + "prefixes": "prefix", + "first_names": "first_name", + "first_name_exceptions": None, + "interfixes": "interfix", + "interfix_surnames": "interfix_surname", + "surnames": "surname", + "surname_exceptions": None, + "streets": "street", + "placenames": "placename", + "hospitals": "hospital", + "healthcare_institutions": "healthcare_institution", + } + ) + + base_items = load_raw_itemsets(base_path=lookup_path, subdirs=all_lists) + + defaults = ( + set(base_items.keys()) + - set(_LOOKUP_SET_LOADERS.keys()) + - set(_LOOKUP_TRIE_LOADERS.keys()) + ) + + for name in defaults: + lookup_set = dd.ds.LookupSet() + lookup_set.add_items_from_iterable(base_items[name]) + lookup_structs[name] = lookup_set + + for name, set_init_function in _LOOKUP_SET_LOADERS.items(): + lookup_structs[name] = set_init_function(base_items) + + for name, trie_init_function in _LOOKUP_TRIE_LOADERS.items(): + lookup_structs[name] = trie_init_function(base_items, tokenizer) + + if save_cache: + cache_lookup_structs( + lookup_structs=lookup_structs, + base_path=lookup_path, + deduce_version=deduce_version, + ) + + return lookup_structs diff --git a/deduce/pattern/__init__.py b/deduce/pattern/__init__.py index 597f9d98..93153d26 100644 --- a/deduce/pattern/__init__.py +++ b/deduce/pattern/__init__.py @@ -1,3 +1,5 @@ +# type: ignore + from .name_patient import ( PersonFirstNamePattern, PersonInitialFromNamePattern, diff --git a/deduce/pattern/name_patient.py b/deduce/pattern/name_patient.py index 149099e4..0c068107 100644 --- a/deduce/pattern/name_patient.py +++ b/deduce/pattern/name_patient.py @@ -1,10 +1,20 @@ +# pylint: disable=R0801 +# type: ignore + from typing import Optional import docdeid as dd +from deprecated import deprecated from deduce.utils import str_match +DEPR_MESSAGE = ( + "Detection of names from metadata has moved to " + "deduce.annotator.PersonNameAnnotator" +) + +@deprecated(DEPR_MESSAGE) class PersonFirstNamePattern(dd.TokenPattern): """Matches the token against all of the patients first names as defined in the "patient" Person in the document metadata, with a max edit distance of @@ -27,6 +37,7 @@ def match( return None +@deprecated(DEPR_MESSAGE) class PersonInitialFromNamePattern(dd.TokenPattern): """ Matches the first characters of the patients first names, as defined in the @@ -54,6 +65,7 @@ def match( return None +@deprecated(DEPR_MESSAGE) class PersonInitialsPattern(dd.TokenPattern): """Matches the patients initials, as defined in the "patient" Person in the document metadata.""" diff --git a/deduce/person.py b/deduce/person.py index 9e20d1a6..b6d1416c 100644 --- a/deduce/person.py +++ b/deduce/person.py @@ -9,8 +9,7 @@ class Person: """ Contains information on a person. - Usable in a document metadata, where annotators can access it for more accurate - annotation. + Usable in a document metadata, where annotators can access it for annotation. 
""" first_names: Optional[list[str]] = None diff --git a/deduce/redact.py b/deduce/redactor.py similarity index 99% rename from deduce/redact.py rename to deduce/redactor.py index a896ca68..8a1dea2a 100644 --- a/deduce/redact.py +++ b/deduce/redactor.py @@ -21,7 +21,7 @@ def redact(self, text: str, annotations: dd.AnnotationSet) -> str: counter = 1 for annotation in sorted( - annotation_group, key=lambda a: a.get_sort_key(by=["end_char"]) + annotation_group, key=lambda a: a.get_sort_key(by=("end_char",)) ): if tag == "patient": annotations_to_intext_replacement[annotation] = ( diff --git a/deduce/tokenizer.py b/deduce/tokenizer.py index 647bdc27..dbe67df4 100644 --- a/deduce/tokenizer.py +++ b/deduce/tokenizer.py @@ -1,13 +1,12 @@ -import re from typing import Iterable, Optional import docdeid as dd import regex -_TOKENIZER_PATTERN = regex.compile(r"\w+|[\n\r\t]|.(? None: super().__init__() self._pattern = _TOKENIZER_PATTERN - self._trie = None + self._trie: Optional[dd.ds.LookupTrie] = None + + self._start_words: set[str] = set() if merge_terms is not None: - trie = dd.ds.LookupTrie() + self._init_merge_structures(merge_terms=merge_terms) + + def _init_merge_structures(self, merge_terms: Iterable) -> None: + """ + Initializes the merge structures. + + Args: + merge_terms: The provided terms that should be merged into a single token. + """ - for term in merge_terms: - tokens = [token.text for token in self._split_text(text=term)] - trie.add_item(tokens) + trie = dd.ds.LookupTrie() - self._trie = trie + for term in merge_terms: + tokens = [token.text for token in self._split_text(text=term)] + trie.add_item(tokens) + self._start_words.add(tokens[0]) + + self._trie = trie @staticmethod - def _join_tokens(text: str, tokens: list[dd.tokenize.Token]) -> dd.tokenize.Token: + def _join_tokens(text: str, tokens: list[dd.tokenizer.Token]) -> dd.tokenizer.Token: """ Join a list of tokens into a single token. Does this by creating a new token, that ranges from the first token start char to the last token end char. @@ -54,8 +66,8 @@ def _join_tokens(text: str, tokens: list[dd.tokenize.Token]) -> dd.tokenize.Toke ) def _merge( - self, text: str, tokens: list[dd.tokenize.Token] - ) -> list[dd.tokenize.Token]: + self, text: str, tokens: list[dd.tokenizer.Token] + ) -> list[dd.tokenizer.Token]: """ Merge a list of tokens based on the trie. @@ -66,13 +78,22 @@ def _merge( A list of tokens, with merge_terms joined in single tokens. """ + if self._trie is None: + return tokens + tokens_text = [token.text for token in tokens] tokens_merged = [] i = 0 while i < len(tokens): + + if tokens_text[i] not in self._start_words: + tokens_merged.append(tokens[i]) + i += 1 + continue + longest_matching_prefix = self._trie.longest_matching_prefix( - tokens_text[i:] + tokens_text, start_i=i ) if longest_matching_prefix is None: @@ -88,7 +109,7 @@ def _merge( return tokens_merged - def _split_text(self, text: str) -> list[dd.tokenize.Token]: + def _split_text(self, text: str) -> list[dd.tokenizer.Token]: """ Split text, based on the regexp pattern. 
diff --git a/deduce/utils.py b/deduce/utils.py index 94068f2d..b8822abd 100644 --- a/deduce/utils.py +++ b/deduce/utils.py @@ -1,28 +1,18 @@ import importlib +import inspect +import json import re -from typing import Any, Optional +from pathlib import Path +from typing import Optional +import docdeid as dd +from docdeid import Tokenizer from rapidfuzz.distance import DamerauLevenshtein -def any_in_text(match_list: list[str], term: str) -> bool: - """ - Check if any of the strings in matchlist are in the term. - - Args: - match_list: A list of strings to match. - term: A string to match against. - - Returns: - ``True`` if any of the terms in match list are contained in the term, - ``False`` otherwise. - """ - return any(m in term for m in match_list) - - def str_match(str_1: str, str_2: str, max_edit_distance: Optional[int] = None) -> bool: """ - Match two strings. + Match two strings, potentially in a fuzzy way. Args: str_1: The first string. @@ -42,7 +32,7 @@ def str_match(str_1: str, str_2: str, max_edit_distance: Optional[int] = None) - return str_1 == str_2 -def class_for_name(module_name: str, class_name: str) -> Any: +def class_for_name(module_name: str, class_name: str) -> type: """ Will import and return the class by name. @@ -58,26 +48,27 @@ def class_for_name(module_name: str, class_name: str) -> Any: return getattr(module, class_name) -def import_and_initialize(args: dict, extras: dict) -> Any: +def initialize_class(cls: type, args: dict, extras: dict) -> object: """ - Import and initialize a module as defined in the args config. This dictionary should - contain a ``module`` and ``class`` key, which is imported. Any other arguments in - args are passed to the class initializer. Any items in extras are passed to the - class initializer if they are present. + Initialize a class. Any arguments in args are passed to the class initializer. Any + items in extras are passed to the class initializer if they are present. Args: + cls: The class to initialze. args: The arguments to pass to the initalizer. extras: A superset of arguments that should be passed to the initializer. Will be checked against the class. Returns: - An instantiated class, with the relevant argumetns and extras. + An instantiated class, with the relevant arguments and extras. """ - cls = class_for_name(args.pop("module"), args.pop("class")) + cls_params = inspect.signature(cls).parameters for arg_name, arg in extras.items(): - if arg_name in cls.__init__.__code__.co_varnames: + + if arg_name in cls_params: + args[arg_name] = arg return cls(**args) @@ -128,11 +119,12 @@ def repl_segments(s: str, matches: list[tuple]) -> list[list[str]]: Args: s: The input string. matches: A list of matches, consisting of a tuple with start- and end char, - followed by a list of options for that substring, e.g. - (5, 8, ["Mr.", "Meester"]). + followed by a list of options for that substring, e.g. + (5, 8, ["Mr.", "Meester"]). - Returns: A list of options that together sgement the entire string, e.g. [["Prof.", - "Professor"], [" "], ["Meester", "Mr."], [" Lievenslaan"]]. + Returns: + A list of options that together segement the entire string, e.g. [["Prof.", + "Professor"], [" "], ["Meester", "Mr."], [" Lievenslaan"]]. """ if len(matches) == 0: @@ -199,3 +191,93 @@ def str_variations(s: str, repl: dict[str, list[str]]) -> list[str]: variations = new_variations return variations + + +def apply_transform(items: set[str], transform_config: dict) -> set[str]: + """ + Applies a transformation to a set of items. 
+ + Args: + items: The input items. + transform_config: The transformation, including configuration (see + transform.json for examples). + + Returns: The transformed items. + """ + + strip_lines = transform_config.get("strip_lines", True) + transforms = transform_config.get("transforms", {}) + + for _, transform in transforms.items(): + + to_add = [] + + for item in items: + to_add += str_variations(item, transform) + + items.update(to_add) + + if strip_lines: + items = {i.strip() for i in items} + + return items + + +def optional_load_items(path: Path) -> Optional[set[str]]: + """ + Load items (lines) from a textfile, returning None if file does not exist. + + Args: + path: The full path to the file. + + Returns: The lines of the file as a set if the file exists, None otherwise. + """ + + try: + with open(path, "r", encoding="utf-8") as file: + items = {line.strip() for line in file.readlines()} + except FileNotFoundError: + return None + + return items + + +def optional_load_json(path: Path) -> Optional[dict]: + """ + Load json, returning None if file does not exist. + + Args: + path: The full path to the file. + + Returns: The json data as a dict if the file exists, None otherwise. + """ + + try: + with open(path, "r", encoding="utf-8") as file: + data = json.load(file) + except FileNotFoundError: + return None + + return data + + +def lookup_set_to_trie( + lookup_set: dd.ds.LookupSet, tokenizer: Tokenizer +) -> dd.ds.LookupTrie: + """ + Converts a LookupSet into an equivalent LookupTrie. + + Args: + lookup_set: The input LookupSet + tokenizer: The tokenizer used to create sequences + + Returns: A LookupTrie with the same items and matching pipeline as the + input LookupSet. + """ + + trie = dd.ds.LookupTrie(matching_pipeline=lookup_set.matching_pipeline) + + for item in lookup_set.items(): + trie.add_item([token.text for token in tokenizer.tokenize(item)]) + + return trie diff --git a/docs/emojize.py b/docs/emojize.py new file mode 100644 index 00000000..ea88b104 --- /dev/null +++ b/docs/emojize.py @@ -0,0 +1,34 @@ +""" +Small script to emojize html files. + +Inspired by: https://bitbucket.org/lbesson/bin/src/master/emojize.py +""" + +import glob +import re +from sys import argv + +from emoji import emojize + + +def match_to_emoji(m: re.Match) -> str: + return emojize(m.group(), language="alias") + + +def emojize_all(s: str) -> str: + return re.sub(r":([0-9a-z_-]+):", match_to_emoji, s) + + +if __name__ == "__main__": + + dir = argv[1] + + for file in glob.glob(dir + "/*.html"): + + with open(file, "r") as f: + html = f.readlines() + + html = [emojize_all(line) for line in html] + + with open(file, "w") as f: + f.write("".join(html)) diff --git a/docs/source/migrating.md b/docs/source/migrating.md index 2d24f327..9bef181b 100644 --- a/docs/source/migrating.md +++ b/docs/source/migrating.md @@ -1,3 +1,49 @@ +# Migrating to version `3.0.0` + +Version `3.0.0` of `deduce` includes many optimizations that allow more accurate de-identification, some already included in `2.1.0` - `2.5.0.` It also includes some structural optimizations. Version `3.0.0` should be backwards compatible, but some functionality is scheduled for removal in `3.1.0`. Those changes are listed below. + +## Custom config + +Adding a custom config is now possible as a `dict` or as a filename pointing to a `json`. 
Both should be presented to `deduce` with the `config` keyword, e.g.: + +```python +deduce = Deduce(config='my_own_config.json') +deduce = Deduce(config={'redactor_open_char': '**', 'redactor_close_char': '**'}) +``` + +The `config_file` keyword is deprecated; please use `config` instead. + +## Lookup structure names + +For consistency, lookup structure names are now all in singular form: + +| **Old name** | **New name** | +|-------------------------|------------------------| +| prefixes | prefix | +| first_names | first_name | +| interfixes | interfix | +| interfix_surnames | interfix_surname | +| surnames | surname | +| streets | street | +| placenames | placename | +| hospitals | hospital | +| healthcare_institutions | healthcare_institution | + +Additionally, the `first_name_exceptions` and `surname_exceptions` lists are removed. The exception items are now simply removed from the original list in a more structured way, so there is no need to explicitly filter exceptions in patterns, etc. + +## The `annotator_type` field in config + +In a config, each annotator should specify `annotator_type`, so `Deduce` knows what annotator to load. In `3.0.0` we simplified this a bit. In most cases, the `annotator_type` field should be set to `module.Class` of the annotator that should be loaded, and `Deduce` will handle the rest (sometimes with a little bit of magic, so all arguments are presented with the right type). You should make the following changes: + +| **annotator_type** | **Change** | +|----------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| multi_token | `docdeid.process.MultiTokenLookupAnnotator` | +| dd_token_pattern | This used to load `docdeid.process.TokenPatternAnnotator`, but this is now replaced by `deduce.annotator.TokenPatternAnnotator`. The latter is more powerful, but needs a different pattern. A `docdeid.process.TokenPatternAnnotator` can no longer be loaded through config, although adding it manually to `Deduce.processors` is always possible. | +| token_pattern | `deduce.annotator.TokenPatternAnnotator` | +| annotation_context | `deduce.annotator.ContextAnnotator` | +| custom | Use `module.Class` directly, where `module` and `class` fields used to be specified in `args`. They should be removed there. | +| regexp | `docdeid.process.RegexpAnnotator` | # Migrating to version `2.0.0` Version `2.0.0` of `deduce` sees a major refactor that enables speedup, configuration, customization, and more. With it, the interface to apply `deduce` to text changes slightly. Updating your code to the new interface should not take more than a few minutes. The details are outlined below. diff --git a/docs/source/tutorial.md b/docs/source/tutorial.md index 39944ceb..69b894aa 100644 --- a/docs/source/tutorial.md +++ b/docs/source/tutorial.md @@ -4,7 +4,7 @@ It's useful to note that from version `2.0.0`, `deduce` is built using `docdeid`([docs](https://docdeid.readthedocs.io/en/latest/), [GitHub](https://github.com/vmenger/docdeid)), a small framework that helps build de-identifiers. Before you start customizing `deduce`, checking the `docdeid` docs will probably make it easier still.
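To make the `annotator_type` change from the `3.0.0` migration notes above more concrete, here is a minimal sketch of how a single annotator definition could look before and after, written as Python dicts of the kind that can be passed via `Deduce(config={...})`. The module path `my_package.annotators`, the class `MyPhiAnnotator`, and the `args` shown are hypothetical placeholders, not annotators shipped with `deduce`:

```python
# Hypothetical example only: the module path, class name and args below are
# illustrative placeholders, not part of deduce itself.

# Before 3.0.0: the deprecated "custom" annotator_type, with the module and
# class specified inside args.
old_style = {
    "annotator_type": "custom",
    "args": {
        "module": "my_package.annotators",
        "class": "MyPhiAnnotator",
        "tag": "my_phi",
    },
}

# From 3.0.0: reference the class directly as module.Class in annotator_type,
# and drop the module/class entries from args.
new_style = {
    "annotator_type": "my_package.annotators.MyPhiAnnotator",
    "args": {
        "tag": "my_phi",
    },
}
```

The same `module.Class` convention applies to the builtin annotators listed in the migration table, e.g. `deduce.annotator.TokenPatternAnnotator` for the old `token_pattern` type.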
-In case you get stuck with applying or modifying `deduce`, its always possible to as for help, by creating an issue in our [issue tracker](https://github.com/vmenger/deduce/issues)! +In case you get stuck with applying or modifying `deduce`, its always possible to ask for help, by creating an issue in our [issue tracker](https://github.com/vmenger/deduce/issues)! ```{include} ../../README.md :start-after: @@ -13,103 +13,124 @@ In case you get stuck with applying or modifying `deduce`, its always possible t ## Included components -A `docdeid` de-identifier is made up of document processors, such as annotators, annotation processors, and redactors, that are applied sequentially in a pipeline. The most important components that make up `deduce` are described below. +A `docdeid` de-identifier is made up of document processors, such as annotators, annotation processors, and redactors, that are applied sequentially in a pipeline. The most important components that make up `deduce` are described below. ### Annotators The `Annotator` is responsible for tagging pieces of information in the text as sensitive information that needs to be removed. `deduce` includes various annotators, described below: -| **Group** | **Annotator name** | **Annotator type** | **Matches** | -|-----------------|--------------------------|--------------------|--------------------------------------------------------------------------------------------| -| names | prefix_with_name | pattern | A prefix followed by a word starting with an uppercase | -| | interfix_with_name | pattern | An interfix followed by a word starting with an uppercase | -| | initial_with_capital | pattern | An initial followed by a word starting with an uppercase | -| | initial_interfix | pattern | An initial followed by an interfix and a word starting with an uppercase | -| | first_name_lookup | pattern | A first name based on builtin lookup lists | -| | surname_lookup | pattern | A surname based on builtin lookup lists | -| | person_first_name | pattern | First name of the patient, based on metadata (fuzzy) | -| | person_initial_from_name | pattern | Initial of patient, based on first names in metadata | -| | person_initials | pattern | Initials of patient, based on metadata | -| | person_surname | pattern | Surname of patient, based on metadata (fuzzy) | -| | annotation_context | context pattern | Multiple based on context, e.g. 
an annotation of a name followed by another word starting with an uppercase | -| institutions | institution | multi token lookup | Institutions, based on builtin lookup lists | -| locations | residence | multi token lookup | Residences, based on builtin lookup lists | -| | street_with_number | regexp | Street names, with optionally a house number | -| | postal_code | regexp | Postal codes | -| | postbus | regexp | Postbussen | -| dates | date_dmy_1 | regexp | Dates dmy format (pattern 1) | -| | date_dmy_2 | regexp | Dates dmy format (pattern 2) | -| | date_ymd_1 | regexp | Dates ymd format (pattern 1) | -| | date_ymd_2 | regexp | Dates ymd format (pattern 2) | -| ages | age | regexp | Ages | -| identifiers | identifier | regexp | Identifiers (7+ digit numbers) | -| | bsn | custom | BSN-numbers (9 digits + specific 'elfproef') | -| phone_numbers | phone | regexp | Phone numbers | -| email_addresses | email | regexp | E-mail addresses | -| urls | url | regexp | URLs | - -It's possible to add, remove, apply subsets or implement custom annotators, those options are described further down under [customizing deduce](#customizing-deduce). +| Group | Annotator Name | Annotator Type | Explanation | +|-----------------|----------------------|---------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| names | prefix_with_initial | `deduce.annotator.TokenPatternAnnotator` | Matches a prefix followed by initial(s) | +| | prefix_with_interfix | `deduce.annotator.TokenPatternAnnotator` | Matches a prefix followed by an interfix and something that resembles a name | +| | prefix_with_name | `deduce.annotator.TokenPatternAnnotator` | Matches a prefix followed by something that resembles a name | +| | interfix_with_name | `deduce.annotator.TokenPatternAnnotator` | Matches an interfix followed by something that resembles a name | +| | initial_with_name | `deduce.annotator.TokenPatternAnnotator` | Matches an initial followed by something that resembles a name | +| | initial_interfix | `deduce.annotator.TokenPatternAnnotator` | Matches an initial followed by an interfix and something that resembles a name | +| | first_name_lookup | `docdeid.process.MultiTokenLookupAnnotator` | Lookup based on first names from Voornamenbank (Meertens Instituut) | +| | surname_lookup | `docdeid.process.MultiTokenLookupAnnotator` | Lookup based on surnames from Familienamenbank (Meertens Instituut) | +| | patient_name | `deduce.annotator.PatientNameAnnotator` | Custom logic to match patient name, if supplied in document metadata | +| | name_context | `deduce.annotator.ContextAnnotator` | Matches names based on annotations found above, with the following context patterns: `interfix_right`: An interfix and something that resembles a name, when preceded by a detected initial or name `initial_left`: An initial, when followed by a detected initial, name or interfix `naam_left`: Something that resembles a name, when followed by a name `naam_right`: Something that resembles a name, when preceded by a name `prefix_left`: A prefix, when followed by a prefix, 
initial, name or interfix | +| | eponymous_disease | `docdeid.process.MultiTokenLookupAnnotator` | Lookup based on eponymous diseases, which will be tagged with `pseudo_name` and removed later (along with any overlap) | +| locations | placename | `docdeid.process.MultiTokenLookupAnnotator` | Lookup based on a compiled list of regions, provinces, municipalities and residences | +| | street_pattern | `docdeid.process.RegexpAnnotator` | Matches streetnames based on a pattern (ending in straat, plein, dam, etc.) | +| | street_lookup | `docdeid.process.MultiTokenLookupAnnotator` | Lookup based on a list of streetnames from Basisadministratie Gemeenten | +| | housenumber | `deduce.annotator.ContextAnnotator` | Matches housenumber and housenumberletters, based on the following context patterns: `housenumber_right`: a 1-4 digit number, preceded by a streetname `housenumber_housenumberletter_right`: a 1-4 digit number and a single letter, preceded by a streetname `housenumberletter_right`: a single letter, preceded by a housenumber | +| | postal_code | `docdeid.process.RegexpAnnotator` | Matches Dutch postal codes, i.e. four digits followed by two letters | +| | postbus | `docdeid.process.RegexpAnnotator` | Matches postbus, i.e. 'Postbus' followed by a 1-5 digit number, optionally with periods between them. | +| institution | hospital | `docdeid.process.MultiTokenLookupAnnotator` | Lookup based on a list of hospitals. | +| | institution | `docdeid.process.MultiTokenLookupAnnotator` | Lookup based on a list of healthcare institutions, based on Zorgkaart Nederland. | +| dates | date_dmy_1 | `docdeid.process.RegexpAnnotator` | Matches dates in dmy format, e.g. 01-01-2012 | +| | date_dmy_2 | `docdeid.process.RegexpAnnotator` | Matches dates in dmy format, e.g. 01 jan 2012 | +| | date_ymd_1 | `docdeid.process.RegexpAnnotator` | Matches dates in ymd format, e.g. 2012-01-01 | +| | date_ymd_2 | `docdeid.process.RegexpAnnotator` | Matches dates in ymd format, e.g. 2012 jan 01 | +| ages | age | `deduce.annotator.RegexpPseudoAnnotator` | Matches ages based on a number of digit patterns followed by jaar/jaar oud. Excludes matches that are preceded/followed by one of the `pre_pseudo` / `post_pseudo` words, e.g. 'sinds 10 jaar` | +| identifiers | bsn | `deduce.annotator.BsnAnnotator` | Matches Dutch social security numbers (BSN), based on a 9-digit pattern that also passes the 'elfproef' | +| | identifier | `docdeid.process.RegexpAnnotator` | Matches any 7+ digit number as identifier | +| phone_numbers | phone | `deduce.annotator.PhoneNumberAnnotator` | Matches phone numbers, based on regular expression pattern, optionally with a digit too few or a digit too much (common typos) | +| email_addresses | email | `docdeid.process.RegexpAnnotator` | Matches e-mail addresses, based on regular expression pattern | +| urls | url | `docdeid.process.RegexpAnnotator` | Matches urls, based on regular expression pattern | + +It's possible to add, remove, apply subsets, or to implement custom annotators, those options are described further down under [customizing `deduce`](#customizing-deduce). ### Other processors In addition to annotators, a `docdeid` de-identifier contains annotation processors, which do some operation to the set of annotations generated previously, and redactors, which take the annotation and replace them in the text. 
Other processors included in `deduce` are listed below: -| **Name** | **Group** | **Description** | -|----------------------------|-----------------|-------------------------------------------------------------------------------------------------------| -| overlap_resolver | post_processing | Makes sure overlap among annotations is resolved. | -| merge_adjacent_annotations | post_processing | If there are any adjacent annotations with the same tag, they are merged into a single annotation. | -| redactor | post_processing | Takes care of replacing the annotated PHIs with `` (e.g. ``, ``) | +| **Name** | **Group** | **Description** | +|-----------------------------|-----------------|-------------------------------------------------------------------------------------------------------| +| person_annotation_converter | names | Maps name tags to either PERSON or PATIENT, and removes overlap with 'pseudo_name'. | +| remove_street_tags | locations | Removes any matched street names that are not followed by a housenumber | +| clean_street_tags | locations | Cleans up street tags, e.g. straat+huisnummer -> locatie | +| overlap_resolver | post_processing | Makes sure overlap among annotations is resolved. | +| merge_adjacent_annotations | post_processing | If there are any adjacent annotations with the same tag, they are merged into a single annotation. | +| redactor | post_processing | Takes care of replacing the annotated PHIs with `[TAG]` (e.g. `[LOCATION-1]`, `[DATE-2]`) | ### Lookup sets In order to match tokens to known identifiable words or concepts, `deduce` has the following builtin lookup sets: -| **Name** | **Size** | **Examples** | -|-------------------|----------|--------------------------------------------------| -| first_names | 9010 | Laurentia, Janny, Chantall | -| surnames | 7767 | Bosland, Winkler, Lunenburg | -| interfixes | 274 | Bij 't, Onder 't, Bij de | -| interfix_surnames | 1920 | Geldorp, Haaster, Overbeek | -| prefixes | 23 | ggn, mr, pt | -| whitelist | 1176 | delen, temesta, lepel | -| institutions | 827 | slingeland ziekenhuis, slingeland zkh, maliebaan | -| residences | 2504 | Oude Wetering, Noordeinde, Jelsum | +| **Name** | **Size** | **Examples** | +|------------------------|----------|----------------------------------------------------------------------------------------| +| prefix | 45 | bc., dhr., mijnheer | +| initial | 54 | Q, I, U | +| interfix | 44 | van de, von, v/d | +| first_name | 14690 | Martin, Alco, Wieke | +| interfix_surname | 2384 | Rijke, Butter, Agtmaal | +| surname | 10346 | Kosters, Hilderink, Kogelman | +| hospital | 9283 | Oude en Nieuwe Gasthuis, sint Jans zkh., Dijklander | +| hospital_abbr | 21 | UMCG, WKZ, PMC | +| healthcare_institution | 244342 | Gezondheidscentrum Wesselerbrink, Fysiotherapie Heer, Ergotherapie Tilburg-Waalwyk eo. | +| placename | 12049 | De Plaats, Diefdijk (U), Het Haantje (DR) | +| street | 769569 | Ds. Van Diemenstraat, Jac. v den Eyndestr, Matenstr | +| eponymous_disease | 22512 | tumor van Brucellosis, Lobomycosis reactie, syndroom van Alagille | +| common_word | 1008 | al, tuin, brengen | +| medical_term | 6939 | bevattingsvermogen, iliacaal, oor | +| stop_word | 101 | kan, heb, dat | ## Customizing deduce -We highly recommend making some effort to customize `deduce`, as even some basic effort will almost surely increase accuracy. Below are outlined some ways to achieve this, including: making changes to `config.json`, adding/removing custom pipeline components, and modifying the builtin lookup sets. 
+We highly recommend making some effort to customize `deduce`, as even some basic effort will almost surely increase accuracy. Below are outlined some ways to achieve this, including: making changes to the config, adding/removing custom pipeline components, and modifying the builtin lookup sets. -### Changing `config.json` +### Adding a custom config -A default `config.json` ([source on GitHub](https://github.com/vmenger/deduce/blob/main/config.json)) file is packaged with `deduce`. Among with some basic settings, it defines all annotators (also listed above). It's possible to add, modify or delete annotators here (e.g. changing regular expressions). After modifying `config.json`, you should save the modified `.json` and pass the path as argument when initializing `Deduce`: +A default `base_config.json` ([source on GitHub](https://github.com/vmenger/deduce/blob/main/base_config.json)) file is packaged with `deduce`. Along with some basic settings, it defines all annotators (also listed above). Override settings by providing an additional user config to Deduce, either as a file or as a dict: ```python from deduce import Deduce -deduce = Deduce(config_file="path/to/custom_config.json") +deduce = Deduce(config='my_own_config.json') +deduce = Deduce(config={'redactor_open_char': '**', 'redactor_close_char': '**'}) ``` -Note that some more basic configuration options can be adjusted in the config file, however, more config options will be added in the future. +This will only override settings that are explicitly set in the user config; all other settings are kept as is. If you want to add or delete annotators (e.g. changing regular expressions), it's easiest to make a copy of `base_config.json` and load it as follows: +```python +from deduce import Deduce + +deduce = Deduce(load_base_config=False, config='my_own_config.json') +``` + +Note that you will now miss out on any updates to the base config that are packaged with new versions of Deduce. For that reason, a better way to add/remove processors is to [interact with `Deduce.processors` directly](#implementing-custom-components) after creating the model. ### Using `disabled` keyword to disable components -It's possible to disable specific (groups of) annotators or processors when deidentifying a text.
For example, to apply all annotators, except those in the identifiers group: ```python from deduce import Deduce deduce = Deduce() -deduce.deidentify(text, disabled={'dates'}) +deduce.deidentify(text, disabled={'identifiers'}) ``` -Or, to disable one specific URL annotator in the URLs group, but keeping the other URL patterns: +Or, to disable one specific date annotator in the dates group, while keeping the other date patterns: ```python from deduce import Deduce deduce = Deduce() -deduce.deidentify("text", disabled={'urls_1'}) +deduce.deidentify("text", disabled={'date_dmy_1'}) ``` ### Using `enabled` keyword @@ -121,7 +142,7 @@ from deduce import Deduce deduce = Deduce() deduce.deidentify("text", enabled={ - 'urls', # annotator group, with annotators: + 'email_addresses', # annotator group, with annotators: 'email', 'post_processing', # post processing group, with processors: 'overlap_resolver', @@ -130,7 +151,7 @@ deduce.deidentify("text", enabled={ }) ``` -The following example however will apply **no annotators**, as the `email` annotator is enabled, but its' group `urls` is not: +The following example, however, will apply **no annotators**, as the `email` annotator is enabled, but its group `email_addresses` is not: ```python from deduce import Deduce @@ -165,7 +186,7 @@ Note that by default, processors are applied in the order they are added to the #### Changing tokenizer -There might be a case where you want to add a custom annotator to `deduce` that requires its own tokenizing logic. Although replacing the builtin tokenizer is not recommended, as builtin annotators depend on it, it's possible to add more tokenizers as follows: +There might be a case where you want to add a custom annotator to `deduce` that requires its own tokenizing logic. Replacing the builtin tokenizer is not recommended, as builtin annotators depend on it, but it's possible to add more tokenizers as follows: ```python from deduce import Deduce @@ -183,23 +204,33 @@ def annotate(doc: dd.Document): tokens = doc.get_tokens("my_custom_tokenizer") ``` -### Tailoring lookup sets +### Tailoring lookup structures -Updating the builtin lookup sets is a very useful and straightforward way to tailor `deduce`. Changes can be made directly from the `Deduce.lookup_sets` attribute, as such: +Updating the builtin lookup sets and tries is a very useful and straightforward way to tailor `deduce`.
Changes can be made directly from the `Deduce.lookup_structs` attribute, as such: ```python from deduce import Deduce deduce = Deduce() -deduce.lookup_sets['first_names'].add_items_from_iterable(["naam", "andere_naam"]) -deduce.lookup_sets['whitelist'].add_items_from_iterable(["woord", "ander_woord"]) -deduce.lookup_sets['residences'].add_items_from_iterable(["kleine plaats in de regio"]) -deduce.lookup_sets['institutions'].add_items_from_iterable(["verzorgingstehuis hier om de hoek"]) +# sets +deduce.lookup_structs['first_names'].add_items_from_iterable(["naam", "andere_naam"]) +deduce.lookup_structs['whitelist'].add_items_from_iterable(["woord", "ander_woord"]) + +# tries +deduce.lookup_structs['residences'].add_item(["kleine", "plaats", "in", "de", "regio"]) +deduce.lookup_structs['institutions'].add_item(["verzorgingstehuis", "hier", "om", "de", "hoek"]) -# need to re-initialize the doc processors, so they know about updated lookup_sets -deduce.initialize_doc_processors() +``` + +Full documentation on sets and tries, and how to modify them, is available in the [docdeid API](https://docdeid.readthedocs.io/en/latest/api/docdeid.ds.html#docdeid.ds.lookup.LookupSet). + +Larger changes may also be made by copying the source files and modifying them directly, then pointing `deduce` to the directory with the modified sources: + +```python +from deduce import Deduce +deduce = Deduce(lookup_data_path="/my/path") ``` -After making changes to `lookup_sets`, it's important to call `Deduce.initialize_doc_processors`, so that the changes get picked up by the annotators. Full documentation on lookup sets and how to modify them is available in the [docdeid API](https://docdeid.readthedocs.io/en/latest/api/docdeid.ds.html#docdeid.ds.lookup.LookupSet). +It's important to copy the directory, or your changes will be overwritten with the next `deduce` update. Currently, there is no additional documentation available on how to structure and transform the lookup items in the directory, other than inspecting the pre-packaged files. Also remember that any updates to lookup values in future releases of Deduce will not be applied if `deduce` loads items from a copy; differences need to be tracked manually with each release. diff --git a/poetry.lock b/poetry.lock index f44be7bb..013b2689 100644 --- a/poetry.lock +++ b/poetry.lock @@ -97,13 +97,13 @@ uvloop = ["uvloop (>=0.15.2)"] [[package]] name = "certifi" -version = "2023.7.22" +version = "2023.11.17" description = "Python package for providing Mozilla's CA Bundle." optional = false python-versions = ">=3.6" files = [ - {file = "certifi-2023.7.22-py3-none-any.whl", hash = "sha256:92d6037539857d8206b8f6ae472e8b77db8058fec5937a1ef3f54304089edbb9"}, - {file = "certifi-2023.7.22.tar.gz", hash = "sha256:539cc1d13202e33ca466e88b2807e29f4c13049d6d87031a3c110744495cb082"}, + {file = "certifi-2023.11.17-py3-none-any.whl", hash = "sha256:e036ab49d5b79556f99cfc2d9320b34cfbe5be05c5871b51de9329f0603b0474"}, + {file = "certifi-2023.11.17.tar.gz", hash = "sha256:9b469f3a900bf28dc19b8cfbf8019bf47f7fdd1a65a1d4ffb98fc14166beb4d1"}, ] [[package]] @@ -297,6 +297,23 @@ tomli = {version = "*", optional = true, markers = "python_full_version <= \"3.1 [package.extras] toml = ["tomli"] +[[package]] +name = "deprecated" +version = "1.2.14" +description = "Python @deprecated decorator to deprecate old python classes, functions or methods."
+optional = false +python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*" +files = [ + {file = "Deprecated-1.2.14-py2.py3-none-any.whl", hash = "sha256:6fac8b097794a90302bdbb17b9b815e732d3c4720583ff1b198499d78470466c"}, + {file = "Deprecated-1.2.14.tar.gz", hash = "sha256:e5323eb936458dccc2582dc6f9c322c852a775a27065ff2b0c4970b9d53d01b3"}, +] + +[package.dependencies] +wrapt = ">=1.10,<2" + +[package.extras] +dev = ["PyTest", "PyTest-Cov", "bump2version (<1)", "sphinx (<2)", "tox"] + [[package]] name = "dill" version = "0.3.7" @@ -313,16 +330,17 @@ graph = ["objgraph (>=1.7.2)"] [[package]] name = "docdeid" -version = "0.1.10" +version = "1.0.0" description = "Create your own document de-identifier using docdeid, a simple framework independent of language or domain." optional = false python-versions = ">=3.9,<4.0" files = [ - {file = "docdeid-0.1.10-py3-none-any.whl", hash = "sha256:49d9ed79c42f90498726b85f1c85c72600041c7fcbaa5399c9618f0b6772594c"}, - {file = "docdeid-0.1.10.tar.gz", hash = "sha256:b96d4dc3ca045a605193becd5e3f2466be525204b27e282dfad2a58d4a957790"}, + {file = "docdeid-1.0.0-py3-none-any.whl", hash = "sha256:d5d93ec3fbd8557a9cd41b56ec3774bc3a86575d8dc6a3becd486cdf2190993b"}, + {file = "docdeid-1.0.0.tar.gz", hash = "sha256:fea630e1dff140eb939c6474df8fcebe428c28c94eed5a5b9ae5c218205b0948"}, ] [package.dependencies] +frozendict = ">=2.3.10,<3.0.0" numpy = ">=1.23.1,<2.0.0" [[package]] @@ -354,15 +372,29 @@ files = [ {file = "docutils-0.19.tar.gz", hash = "sha256:33995a6753c30b7f577febfc2c50411fec6aac7f7ffeb7c4cfe5991072dcf9e6"}, ] +[[package]] +name = "emoji" +version = "2.9.0" +description = "Emoji for Python" +optional = false +python-versions = ">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*" +files = [ + {file = "emoji-2.9.0-py2.py3-none-any.whl", hash = "sha256:17b0d53e1d9f787307a4c65aa19badb0a1ffdbc89b3a3cd851fc77821cdaced2"}, + {file = "emoji-2.9.0.tar.gz", hash = "sha256:5f4a15b7caa9c67fc11be9d90a822e3fa26aeb4e5b7bd2ded754b394d9c47869"}, +] + +[package.extras] +dev = ["coverage", "coveralls", "pytest"] + [[package]] name = "exceptiongroup" -version = "1.1.3" +version = "1.2.0" description = "Backport of PEP 654 (exception groups)" optional = false python-versions = ">=3.7" files = [ - {file = "exceptiongroup-1.1.3-py3-none-any.whl", hash = "sha256:343280667a4585d195ca1cf9cef84a4e178c4b6cf2274caef9859782b567d5e3"}, - {file = "exceptiongroup-1.1.3.tar.gz", hash = "sha256:097acd85d473d75af5bb98e41b61ff7fe35efe6675e4f9370ec6ec5126d160e9"}, + {file = "exceptiongroup-1.2.0-py3-none-any.whl", hash = "sha256:4bfd3996ac73b41e9b9628b04e079f193850720ea5945fc96a08633c66912f14"}, + {file = "exceptiongroup-1.2.0.tar.gz", hash = "sha256:91f5c769735f051a4290d52edd0858999b57e5876e9f85937691bd4c9fa3ed68"}, ] [package.extras] @@ -416,15 +448,61 @@ TOMLi = {version = "*", markers = "python_version < \"3.11\""} [package.extras] dev = ["pyTest", "pyTest-cov"] +[[package]] +name = "frozendict" +version = "2.3.10" +description = "A simple immutable dictionary" +optional = false +python-versions = ">=3.6" +files = [ + {file = "frozendict-2.3.10-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:df2d2afa5af41bfa09dc9d5a8e6d73ae39b677a8572200c65a5ea353387ffccd"}, + {file = "frozendict-2.3.10-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:b10df7f5d8637b1af319434f99dc25ca6f5537e28b293e4c405ebfb4bf9581fa"}, + {file = "frozendict-2.3.10-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:da22a3e873f365f97445c49afc1e6d5198ed6d172f3efaf0e9fde0edcca3cea1"}, + {file = 
"frozendict-2.3.10-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:89218738e2122b50bf8a0444083dbe2de280402e9c2ef0929c0db0f93ff11271"}, + {file = "frozendict-2.3.10-cp310-cp310-musllinux_1_1_aarch64.whl", hash = "sha256:aa11add43a71fd47523fbd011be5cc011df79e25ec0b0339fc0d728623aaa7ec"}, + {file = "frozendict-2.3.10-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:af267bd6d98cbc10580105dc76f28f7156856fa48a5bbcadd40edb85f93657ae"}, + {file = "frozendict-2.3.10-cp310-cp310-win_amd64.whl", hash = "sha256:c112024df64b8926a315d7e36b860967fcad8aae0c592b9f117589391373e893"}, + {file = "frozendict-2.3.10-cp310-cp310-win_arm64.whl", hash = "sha256:a0065db2bc76628853dd620bd08c1ca44ad0b711e92e89b4156493153add6f9d"}, + {file = "frozendict-2.3.10-cp36-cp36m-macosx_10_9_x86_64.whl", hash = "sha256:93634af5a6d71762aebc7d78bdce92890b7e612588faf887c9eaf752dc7ccdb1"}, + {file = "frozendict-2.3.10-cp36-cp36m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7b4d05e231dc1a2ec874f847fd7348cbee469555468efb875a89994ecde31a81"}, + {file = "frozendict-2.3.10-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:6d40d0644f19365fc6cc428db31c0f113fa550bd15920262f9d77ccf6556d87b"}, + {file = "frozendict-2.3.10-cp36-cp36m-musllinux_1_1_aarch64.whl", hash = "sha256:12b40526219f9583b30690011288bca4d6cce8724cda96b3c3ab08b67c5a7f09"}, + {file = "frozendict-2.3.10-cp36-cp36m-musllinux_1_1_x86_64.whl", hash = "sha256:6b552fffeba8e41b43ce10cc0fc467e048a7c9a71ae3241057510342132555b9"}, + {file = "frozendict-2.3.10-cp36-cp36m-win_amd64.whl", hash = "sha256:07208e4718cb70aa259ac886c19b96a4aad1cf00e9199f211746f738951bbf7c"}, + {file = "frozendict-2.3.10-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:e8bec6d11f7254e405290cb1b081caffa0c18b6aa779130da9a546349c56be83"}, + {file = "frozendict-2.3.10-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:b089c7e8c95d8b043e82e7da26e165f4220d7310efaad5e94445db7e3bc8321e"}, + {file = "frozendict-2.3.10-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:08a5829d708657c9d5ad58f4a7e4baa73a3d57290f9613bdd909d481fc203a3a"}, + {file = "frozendict-2.3.10-cp37-cp37m-musllinux_1_1_aarch64.whl", hash = "sha256:1c015852dacf144dbeadf203673d8c714f788fcc2b810a36504994b3c4f5a436"}, + {file = "frozendict-2.3.10-cp37-cp37m-musllinux_1_1_x86_64.whl", hash = "sha256:bb9f15a5ed924be2b1cb3654b7ea3b7bae265ff39e2b5784d42bd4a6e1353e45"}, + {file = "frozendict-2.3.10-cp37-cp37m-win_amd64.whl", hash = "sha256:809bb9c6c657bded925710a309bb2a2350bdbfdc9371df427f1a93cb8ab7ec3e"}, + {file = "frozendict-2.3.10-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:ff7a9cca3a3a1e584349e859d028388bd96a5475f76721471b73797472c6db17"}, + {file = "frozendict-2.3.10-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:0cdd496933ddb428f3854bea9ffdce0245bb27c27909f663ad396409fb4dffb5"}, + {file = "frozendict-2.3.10-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:9df392b655fadaa0174c1923e6205b30ad1ccca248e8e146e63a8147a355ee01"}, + {file = "frozendict-2.3.10-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:7901828700f36fe12486705afe7afc5583434390c8f69b5419de1b6c566fb00d"}, + {file = "frozendict-2.3.10-cp38-cp38-musllinux_1_1_aarch64.whl", hash = "sha256:c9aa28ce48d848ee520409533fd0254de4caf025c5cf1b9f27c98c1dd8cf90aa"}, + {file = "frozendict-2.3.10-cp38-cp38-musllinux_1_1_x86_64.whl", hash = "sha256:0856af4f5b4288b2270e0b74078fad5cbaf4f799326b82183865f6f367008b2c"}, + 
{file = "frozendict-2.3.10-cp38-cp38-win_amd64.whl", hash = "sha256:ac41c671ff33cbefc0f06c4b2a630d18ab59f5256f45f57d5632252ae4a8c07a"}, + {file = "frozendict-2.3.10-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:893205dc5a4e5c4b24e5822ceb21ef14fed8ca4afae7ac688e2fc24294c85225"}, + {file = "frozendict-2.3.10-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:e78c5ac5d71f3b73f07ff9d9e3cc32dfbf7954f2c57c2d0e1fe8f1600e980b40"}, + {file = "frozendict-2.3.10-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:8c4ca4cc42bc30b20476616411d4b49aae6084760b99251f1cbdfed879ae53ea"}, + {file = "frozendict-2.3.10-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:c865962216f7cfd6dac8693f4de431a9d98a7225185ff23613ecd10c42423adc"}, + {file = "frozendict-2.3.10-cp39-cp39-musllinux_1_1_aarch64.whl", hash = "sha256:99b2f47b292cc4d68f6679918e8e9e6dc5e816924d8369d07018be56b93fb20f"}, + {file = "frozendict-2.3.10-cp39-cp39-musllinux_1_1_x86_64.whl", hash = "sha256:8e7abf4539b73c8e5680dd2fdbd19ca4fc3e2b2f3666f80f022217839bb859fd"}, + {file = "frozendict-2.3.10-cp39-cp39-win_amd64.whl", hash = "sha256:901e774629fc63f84d24b5e46b59de1eed22392ee98b7f92e694a127d541edac"}, + {file = "frozendict-2.3.10-cp39-cp39-win_arm64.whl", hash = "sha256:6f8681c0ffe92be9aba40c9b9960c48f0ae7f6ea585af2b93fc9542cc3865969"}, + {file = "frozendict-2.3.10-py3-none-any.whl", hash = "sha256:66cded65f144393b4226bda9fe9ac2f42451d2d603e8a486015744bb566a7008"}, + {file = "frozendict-2.3.10.tar.gz", hash = "sha256:aadc83510ce82751a0bb3575231f778bc37cbb373f5f05a52b888e26cbb92f79"}, +] + [[package]] name = "idna" -version = "3.4" +version = "3.6" description = "Internationalized Domain Names in Applications (IDNA)" optional = false python-versions = ">=3.5" files = [ - {file = "idna-3.4-py3-none-any.whl", hash = "sha256:90b77e79eaa3eba6de819a0c442c0b4ceefc341a7a2ab77d7562bf49f425c5c2"}, - {file = "idna-3.4.tar.gz", hash = "sha256:814f528e8dead7d329833b91c5faa87d60bf71824cd12a7530b5526063d02cb4"}, + {file = "idna-3.6-py3-none-any.whl", hash = "sha256:c05567e9c24a6b9faaa835c4821bad0590fbb9d5779e7caa6e1cc4978e7eb24f"}, + {file = "idna-3.6.tar.gz", hash = "sha256:9ecdbbd083b06798ae1e86adcbfe8ab1479cf864e4ee30fe4e46a003d12491ca"}, ] [[package]] @@ -440,20 +518,20 @@ files = [ [[package]] name = "importlib-metadata" -version = "6.8.0" +version = "6.9.0" description = "Read metadata from Python packages" optional = false python-versions = ">=3.8" files = [ - {file = "importlib_metadata-6.8.0-py3-none-any.whl", hash = "sha256:3ebb78df84a805d7698245025b975d9d67053cd94c79245ba4b3eb694abe68bb"}, - {file = "importlib_metadata-6.8.0.tar.gz", hash = "sha256:dbace7892d8c0c4ac1ad096662232f831d4e64f4c4545bd53016a3e9d4654743"}, + {file = "importlib_metadata-6.9.0-py3-none-any.whl", hash = "sha256:1c8dc6839ddc9771412596926f24cb5a553bbd40624ee2c7e55e531542bed3b8"}, + {file = "importlib_metadata-6.9.0.tar.gz", hash = "sha256:e8acb523c335a91822674e149b46c0399ec4d328c4d1f6e49c273da5ff0201b9"}, ] [package.dependencies] zipp = ">=0.5" [package.extras] -docs = ["furo", "jaraco.packaging (>=9)", "jaraco.tidelift (>=1.4)", "rst.linker (>=1.9)", "sphinx (>=3.5)", "sphinx-lint"] +docs = ["furo", "jaraco.packaging (>=9.3)", "jaraco.tidelift (>=1.4)", "rst.linker (>=1.9)", "sphinx (<7.2.5)", "sphinx (>=3.5)", "sphinx-lint"] perf = ["ipython"] testing = ["flufl.flake8", "importlib-resources (>=1.3)", "packaging", "pyfakefs", "pytest (>=6)", "pytest-black (>=0.3.7)", "pytest-checkdocs (>=2.4)", "pytest-cov", "pytest-enabler 
(>=2.2)", "pytest-mypy (>=0.9.1)", "pytest-perf (>=0.9.2)", "pytest-ruff"] @@ -804,17 +882,18 @@ files = [ [[package]] name = "pygments" -version = "2.16.1" +version = "2.17.2" description = "Pygments is a syntax highlighting package written in Python." optional = false python-versions = ">=3.7" files = [ - {file = "Pygments-2.16.1-py3-none-any.whl", hash = "sha256:13fc09fa63bc8d8671a6d247e1eb303c4b343eaee81d861f3404db2935653692"}, - {file = "Pygments-2.16.1.tar.gz", hash = "sha256:1daff0494820c69bc8941e407aa20f577374ee88364ee10a98fdbe0aece96e29"}, + {file = "pygments-2.17.2-py3-none-any.whl", hash = "sha256:b27c2826c47d0f3219f29554824c30c5e8945175d888647acd804ddd04af846c"}, + {file = "pygments-2.17.2.tar.gz", hash = "sha256:da46cec9fd2de5be3a8a784f434e4c4ab670b4ff54d605c4c2717e9d49c4c367"}, ] [package.extras] plugins = ["importlib-metadata"] +windows-terminal = ["colorama (>=0.4.6)"] [[package]] name = "pylint" @@ -1185,17 +1264,17 @@ use-chardet-on-py3 = ["chardet (>=3.0.2,<6)"] [[package]] name = "setuptools" -version = "68.2.2" +version = "69.0.2" description = "Easily download, build, install, upgrade, and uninstall Python packages" optional = false python-versions = ">=3.8" files = [ - {file = "setuptools-68.2.2-py3-none-any.whl", hash = "sha256:b454a35605876da60632df1a60f736524eb73cc47bbc9f3f1ef1b644de74fd2a"}, - {file = "setuptools-68.2.2.tar.gz", hash = "sha256:4ac1475276d2f1c48684874089fefcd83bd7162ddaafb81fac866ba0db282a87"}, + {file = "setuptools-69.0.2-py3-none-any.whl", hash = "sha256:1e8fdff6797d3865f37397be788a4e3cba233608e9b509382a2777d25ebde7f2"}, + {file = "setuptools-69.0.2.tar.gz", hash = "sha256:735896e78a4742605974de002ac60562d286fa8051a7e2299445e8e8fbb01aa6"}, ] [package.extras] -docs = ["furo", "jaraco.packaging (>=9.3)", "jaraco.tidelift (>=1.4)", "pygments-github-lexers (==0.0.5)", "rst.linker (>=1.9)", "sphinx (>=3.5)", "sphinx-favicon", "sphinx-hoverxref (<2)", "sphinx-inline-tabs", "sphinx-lint", "sphinx-notfound-page (>=1,<2)", "sphinx-reredirects", "sphinxcontrib-towncrier"] +docs = ["furo", "jaraco.packaging (>=9.3)", "jaraco.tidelift (>=1.4)", "pygments-github-lexers (==0.0.5)", "rst.linker (>=1.9)", "sphinx (<7.2.5)", "sphinx (>=3.5)", "sphinx-favicon", "sphinx-inline-tabs", "sphinx-lint", "sphinx-notfound-page (>=1,<2)", "sphinx-reredirects", "sphinxcontrib-towncrier"] testing = ["build[virtualenv]", "filelock (>=3.4.0)", "flake8-2020", "ini2toml[lite] (>=0.9)", "jaraco.develop (>=7.21)", "jaraco.envs (>=2.2)", "jaraco.path (>=3.2.0)", "pip (>=19.1)", "pytest (>=6)", "pytest-black (>=0.3.7)", "pytest-checkdocs (>=2.4)", "pytest-cov", "pytest-enabler (>=2.2)", "pytest-mypy (>=0.9.1)", "pytest-perf", "pytest-ruff", "pytest-timeout", "pytest-xdist", "tomli-w (>=1.0.0)", "virtualenv (>=13.0.0)", "wheel"] testing-integration = ["build[virtualenv] (>=1.0.3)", "filelock (>=3.4.0)", "jaraco.envs (>=2.2)", "jaraco.path (>=3.2.0)", "packaging (>=23.1)", "pytest", "pytest-enabler", "pytest-xdist", "tomli", "virtualenv (>=13.0.0)", "wheel"] @@ -1419,6 +1498,85 @@ brotli = ["brotli (>=1.0.9)", "brotlicffi (>=0.8.0)"] socks = ["pysocks (>=1.5.6,!=1.5.7,<2.0)"] zstd = ["zstandard (>=0.18.0)"] +[[package]] +name = "wrapt" +version = "1.16.0" +description = "Module for decorators, wrappers and monkey patching." 
+optional = false +python-versions = ">=3.6" +files = [ + {file = "wrapt-1.16.0-cp310-cp310-macosx_10_9_x86_64.whl", hash = "sha256:ffa565331890b90056c01db69c0fe634a776f8019c143a5ae265f9c6bc4bd6d4"}, + {file = "wrapt-1.16.0-cp310-cp310-macosx_11_0_arm64.whl", hash = "sha256:e4fdb9275308292e880dcbeb12546df7f3e0f96c6b41197e0cf37d2826359020"}, + {file = "wrapt-1.16.0-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:bb2dee3874a500de01c93d5c71415fcaef1d858370d405824783e7a8ef5db440"}, + {file = "wrapt-1.16.0-cp310-cp310-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:2a88e6010048489cda82b1326889ec075a8c856c2e6a256072b28eaee3ccf487"}, + {file = "wrapt-1.16.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:ac83a914ebaf589b69f7d0a1277602ff494e21f4c2f743313414378f8f50a4cf"}, + {file = "wrapt-1.16.0-cp310-cp310-musllinux_1_1_aarch64.whl", hash = "sha256:73aa7d98215d39b8455f103de64391cb79dfcad601701a3aa0dddacf74911d72"}, + {file = "wrapt-1.16.0-cp310-cp310-musllinux_1_1_i686.whl", hash = "sha256:807cc8543a477ab7422f1120a217054f958a66ef7314f76dd9e77d3f02cdccd0"}, + {file = "wrapt-1.16.0-cp310-cp310-musllinux_1_1_x86_64.whl", hash = "sha256:bf5703fdeb350e36885f2875d853ce13172ae281c56e509f4e6eca049bdfb136"}, + {file = "wrapt-1.16.0-cp310-cp310-win32.whl", hash = "sha256:f6b2d0c6703c988d334f297aa5df18c45e97b0af3679bb75059e0e0bd8b1069d"}, + {file = "wrapt-1.16.0-cp310-cp310-win_amd64.whl", hash = "sha256:decbfa2f618fa8ed81c95ee18a387ff973143c656ef800c9f24fb7e9c16054e2"}, + {file = "wrapt-1.16.0-cp311-cp311-macosx_10_9_x86_64.whl", hash = "sha256:1a5db485fe2de4403f13fafdc231b0dbae5eca4359232d2efc79025527375b09"}, + {file = "wrapt-1.16.0-cp311-cp311-macosx_11_0_arm64.whl", hash = "sha256:75ea7d0ee2a15733684badb16de6794894ed9c55aa5e9903260922f0482e687d"}, + {file = "wrapt-1.16.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a452f9ca3e3267cd4d0fcf2edd0d035b1934ac2bd7e0e57ac91ad6b95c0c6389"}, + {file = "wrapt-1.16.0-cp311-cp311-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:43aa59eadec7890d9958748db829df269f0368521ba6dc68cc172d5d03ed8060"}, + {file = "wrapt-1.16.0-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:72554a23c78a8e7aa02abbd699d129eead8b147a23c56e08d08dfc29cfdddca1"}, + {file = "wrapt-1.16.0-cp311-cp311-musllinux_1_1_aarch64.whl", hash = "sha256:d2efee35b4b0a347e0d99d28e884dfd82797852d62fcd7ebdeee26f3ceb72cf3"}, + {file = "wrapt-1.16.0-cp311-cp311-musllinux_1_1_i686.whl", hash = "sha256:6dcfcffe73710be01d90cae08c3e548d90932d37b39ef83969ae135d36ef3956"}, + {file = "wrapt-1.16.0-cp311-cp311-musllinux_1_1_x86_64.whl", hash = "sha256:eb6e651000a19c96f452c85132811d25e9264d836951022d6e81df2fff38337d"}, + {file = "wrapt-1.16.0-cp311-cp311-win32.whl", hash = "sha256:66027d667efe95cc4fa945af59f92c5a02c6f5bb6012bff9e60542c74c75c362"}, + {file = "wrapt-1.16.0-cp311-cp311-win_amd64.whl", hash = "sha256:aefbc4cb0a54f91af643660a0a150ce2c090d3652cf4052a5397fb2de549cd89"}, + {file = "wrapt-1.16.0-cp312-cp312-macosx_10_9_x86_64.whl", hash = "sha256:5eb404d89131ec9b4f748fa5cfb5346802e5ee8836f57d516576e61f304f3b7b"}, + {file = "wrapt-1.16.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:9090c9e676d5236a6948330e83cb89969f433b1943a558968f659ead07cb3b36"}, + {file = 
"wrapt-1.16.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:94265b00870aa407bd0cbcfd536f17ecde43b94fb8d228560a1e9d3041462d73"}, + {file = "wrapt-1.16.0-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:f2058f813d4f2b5e3a9eb2eb3faf8f1d99b81c3e51aeda4b168406443e8ba809"}, + {file = "wrapt-1.16.0-cp312-cp312-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:98b5e1f498a8ca1858a1cdbffb023bfd954da4e3fa2c0cb5853d40014557248b"}, + {file = "wrapt-1.16.0-cp312-cp312-musllinux_1_1_aarch64.whl", hash = "sha256:14d7dc606219cdd7405133c713f2c218d4252f2a469003f8c46bb92d5d095d81"}, + {file = "wrapt-1.16.0-cp312-cp312-musllinux_1_1_i686.whl", hash = "sha256:49aac49dc4782cb04f58986e81ea0b4768e4ff197b57324dcbd7699c5dfb40b9"}, + {file = "wrapt-1.16.0-cp312-cp312-musllinux_1_1_x86_64.whl", hash = "sha256:418abb18146475c310d7a6dc71143d6f7adec5b004ac9ce08dc7a34e2babdc5c"}, + {file = "wrapt-1.16.0-cp312-cp312-win32.whl", hash = "sha256:685f568fa5e627e93f3b52fda002c7ed2fa1800b50ce51f6ed1d572d8ab3e7fc"}, + {file = "wrapt-1.16.0-cp312-cp312-win_amd64.whl", hash = "sha256:dcdba5c86e368442528f7060039eda390cc4091bfd1dca41e8046af7c910dda8"}, + {file = "wrapt-1.16.0-cp36-cp36m-macosx_10_9_x86_64.whl", hash = "sha256:d462f28826f4657968ae51d2181a074dfe03c200d6131690b7d65d55b0f360f8"}, + {file = "wrapt-1.16.0-cp36-cp36m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:a33a747400b94b6d6b8a165e4480264a64a78c8a4c734b62136062e9a248dd39"}, + {file = "wrapt-1.16.0-cp36-cp36m-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:b3646eefa23daeba62643a58aac816945cadc0afaf21800a1421eeba5f6cfb9c"}, + {file = "wrapt-1.16.0-cp36-cp36m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:3ebf019be5c09d400cf7b024aa52b1f3aeebeff51550d007e92c3c1c4afc2a40"}, + {file = "wrapt-1.16.0-cp36-cp36m-musllinux_1_1_aarch64.whl", hash = "sha256:0d2691979e93d06a95a26257adb7bfd0c93818e89b1406f5a28f36e0d8c1e1fc"}, + {file = "wrapt-1.16.0-cp36-cp36m-musllinux_1_1_i686.whl", hash = "sha256:1acd723ee2a8826f3d53910255643e33673e1d11db84ce5880675954183ec47e"}, + {file = "wrapt-1.16.0-cp36-cp36m-musllinux_1_1_x86_64.whl", hash = "sha256:bc57efac2da352a51cc4658878a68d2b1b67dbe9d33c36cb826ca449d80a8465"}, + {file = "wrapt-1.16.0-cp36-cp36m-win32.whl", hash = "sha256:da4813f751142436b075ed7aa012a8778aa43a99f7b36afe9b742d3ed8bdc95e"}, + {file = "wrapt-1.16.0-cp36-cp36m-win_amd64.whl", hash = "sha256:6f6eac2360f2d543cc875a0e5efd413b6cbd483cb3ad7ebf888884a6e0d2e966"}, + {file = "wrapt-1.16.0-cp37-cp37m-macosx_10_9_x86_64.whl", hash = "sha256:a0ea261ce52b5952bf669684a251a66df239ec6d441ccb59ec7afa882265d593"}, + {file = "wrapt-1.16.0-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7bd2d7ff69a2cac767fbf7a2b206add2e9a210e57947dd7ce03e25d03d2de292"}, + {file = "wrapt-1.16.0-cp37-cp37m-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:9159485323798c8dc530a224bd3ffcf76659319ccc7bbd52e01e73bd0241a0c5"}, + {file = "wrapt-1.16.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:a86373cf37cd7764f2201b76496aba58a52e76dedfaa698ef9e9688bfd9e41cf"}, + {file = "wrapt-1.16.0-cp37-cp37m-musllinux_1_1_aarch64.whl", hash = "sha256:73870c364c11f03ed072dda68ff7aea6d2a3a5c3fe250d917a429c7432e15228"}, + {file = 
"wrapt-1.16.0-cp37-cp37m-musllinux_1_1_i686.whl", hash = "sha256:b935ae30c6e7400022b50f8d359c03ed233d45b725cfdd299462f41ee5ffba6f"}, + {file = "wrapt-1.16.0-cp37-cp37m-musllinux_1_1_x86_64.whl", hash = "sha256:db98ad84a55eb09b3c32a96c576476777e87c520a34e2519d3e59c44710c002c"}, + {file = "wrapt-1.16.0-cp37-cp37m-win32.whl", hash = "sha256:9153ed35fc5e4fa3b2fe97bddaa7cbec0ed22412b85bcdaf54aeba92ea37428c"}, + {file = "wrapt-1.16.0-cp37-cp37m-win_amd64.whl", hash = "sha256:66dfbaa7cfa3eb707bbfcd46dab2bc6207b005cbc9caa2199bcbc81d95071a00"}, + {file = "wrapt-1.16.0-cp38-cp38-macosx_10_9_x86_64.whl", hash = "sha256:1dd50a2696ff89f57bd8847647a1c363b687d3d796dc30d4dd4a9d1689a706f0"}, + {file = "wrapt-1.16.0-cp38-cp38-macosx_11_0_arm64.whl", hash = "sha256:44a2754372e32ab315734c6c73b24351d06e77ffff6ae27d2ecf14cf3d229202"}, + {file = "wrapt-1.16.0-cp38-cp38-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:8e9723528b9f787dc59168369e42ae1c3b0d3fadb2f1a71de14531d321ee05b0"}, + {file = "wrapt-1.16.0-cp38-cp38-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:dbed418ba5c3dce92619656802cc5355cb679e58d0d89b50f116e4a9d5a9603e"}, + {file = "wrapt-1.16.0-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:941988b89b4fd6b41c3f0bfb20e92bd23746579736b7343283297c4c8cbae68f"}, + {file = "wrapt-1.16.0-cp38-cp38-musllinux_1_1_aarch64.whl", hash = "sha256:6a42cd0cfa8ffc1915aef79cb4284f6383d8a3e9dcca70c445dcfdd639d51267"}, + {file = "wrapt-1.16.0-cp38-cp38-musllinux_1_1_i686.whl", hash = "sha256:1ca9b6085e4f866bd584fb135a041bfc32cab916e69f714a7d1d397f8c4891ca"}, + {file = "wrapt-1.16.0-cp38-cp38-musllinux_1_1_x86_64.whl", hash = "sha256:d5e49454f19ef621089e204f862388d29e6e8d8b162efce05208913dde5b9ad6"}, + {file = "wrapt-1.16.0-cp38-cp38-win32.whl", hash = "sha256:c31f72b1b6624c9d863fc095da460802f43a7c6868c5dda140f51da24fd47d7b"}, + {file = "wrapt-1.16.0-cp38-cp38-win_amd64.whl", hash = "sha256:490b0ee15c1a55be9c1bd8609b8cecd60e325f0575fc98f50058eae366e01f41"}, + {file = "wrapt-1.16.0-cp39-cp39-macosx_10_9_x86_64.whl", hash = "sha256:9b201ae332c3637a42f02d1045e1d0cccfdc41f1f2f801dafbaa7e9b4797bfc2"}, + {file = "wrapt-1.16.0-cp39-cp39-macosx_11_0_arm64.whl", hash = "sha256:2076fad65c6736184e77d7d4729b63a6d1ae0b70da4868adeec40989858eb3fb"}, + {file = "wrapt-1.16.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:c5cd603b575ebceca7da5a3a251e69561bec509e0b46e4993e1cac402b7247b8"}, + {file = "wrapt-1.16.0-cp39-cp39-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl", hash = "sha256:b47cfad9e9bbbed2339081f4e346c93ecd7ab504299403320bf85f7f85c7d46c"}, + {file = "wrapt-1.16.0-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:f8212564d49c50eb4565e502814f694e240c55551a5f1bc841d4fcaabb0a9b8a"}, + {file = "wrapt-1.16.0-cp39-cp39-musllinux_1_1_aarch64.whl", hash = "sha256:5f15814a33e42b04e3de432e573aa557f9f0f56458745c2074952f564c50e664"}, + {file = "wrapt-1.16.0-cp39-cp39-musllinux_1_1_i686.whl", hash = "sha256:db2e408d983b0e61e238cf579c09ef7020560441906ca990fe8412153e3b291f"}, + {file = "wrapt-1.16.0-cp39-cp39-musllinux_1_1_x86_64.whl", hash = "sha256:edfad1d29c73f9b863ebe7082ae9321374ccb10879eeabc84ba3b69f2579d537"}, + {file = "wrapt-1.16.0-cp39-cp39-win32.whl", hash = "sha256:ed867c42c268f876097248e05b6117a65bcd1e63b779e916fe2e33cd6fd0d3c3"}, + {file = "wrapt-1.16.0-cp39-cp39-win_amd64.whl", 
hash = "sha256:eb1b046be06b0fce7249f1d025cd359b4b80fc1c3e24ad9eca33e0dcdb2e4a35"}, + {file = "wrapt-1.16.0-py3-none-any.whl", hash = "sha256:6906c4100a8fcbf2fa735f6059214bb13b97f75b1a61777fcf6432121ef12ef1"}, + {file = "wrapt-1.16.0.tar.gz", hash = "sha256:5f370f952971e7d17c7d1ead40e49f32345a7f7a5373571ef44d800d06b1899d"}, +] + [[package]] name = "zipp" version = "3.17.0" @@ -1437,4 +1595,4 @@ testing = ["big-O", "jaraco.functools", "jaraco.itertools", "more-itertools", "p [metadata] lock-version = "2.0" python-versions = "^3.9" -content-hash = "bd6a2bdd8680a1c1c11a216d2274b32045e28f7f08b452878cd25b8028b3c443" +content-hash = "c7ff1e47b6f1206bf584e2b34a612de375f6815072c649d9d27c6d8b2e6133a2" diff --git a/pyproject.toml b/pyproject.toml index 1ba9416d..b2dd4ac8 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,6 +1,6 @@ [tool.poetry] name = "deduce" -version = "2.5.0" +version = "3.0.0" description = "Deduce: de-identification method for Dutch medical text" authors = ["Vincent Menger "] maintainers = ["Vincent Menger "] @@ -19,7 +19,7 @@ classifiers = [ "Topic :: Text Processing", "Topic :: Text Processing :: Linguistic", ] -include = ["deduce-data/lookup_lists/**/*", "config.json"] +include = ["config.json"] [tool.sphinx] author = "Vincent Menger" @@ -27,8 +27,10 @@ author = "Vincent Menger" [tool.poetry.dependencies] python = "^3.9" rapidfuzz = "^2.11.1" -docdeid = "0.1.10" +docdeid = "1.0.0" regex = "^2022.9.13" +frozendict = "^2.3.10" +deprecated = "^1.2.14" [tool.poetry.group.dev] optional = false @@ -49,6 +51,7 @@ toml = "^0.10.2" sphinx = "^5.3.0" myst-parser = "^0.18.1" karma-sphinx-theme = "^0.0.8" +emoji = "^2.9.0" [build-system] requires = ["poetry-core>=1.0.0"] diff --git a/tests/conftest.py b/tests/conftest.py index 928bda05..e7139230 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -5,4 +5,4 @@ @pytest.fixture(scope="session") def model(): - return Deduce() + return Deduce(build_lookup_structs=True) diff --git a/tests/data/lookup/cache/lookup_structs.pickle b/tests/data/lookup/cache/lookup_structs.pickle new file mode 100644 index 00000000..7f7289db Binary files /dev/null and b/tests/data/lookup/cache/lookup_structs.pickle differ diff --git a/tests/data/lookup/src/lst_test/exceptions.txt b/tests/data/lookup/src/lst_test/exceptions.txt new file mode 100644 index 00000000..7259a680 --- /dev/null +++ b/tests/data/lookup/src/lst_test/exceptions.txt @@ -0,0 +1 @@ +Wolter \ No newline at end of file diff --git a/tests/data/lookup/src/lst_test/items.txt b/tests/data/lookup/src/lst_test/items.txt new file mode 100644 index 00000000..461ecbac --- /dev/null +++ b/tests/data/lookup/src/lst_test/items.txt @@ -0,0 +1,4 @@ +de Vries +Pieters +Sijbrand +Wolter \ No newline at end of file diff --git a/tests/data/lookup/src/lst_test/transform.json b/tests/data/lookup/src/lst_test/transform.json new file mode 100644 index 00000000..86e85aff --- /dev/null +++ b/tests/data/lookup/src/lst_test/transform.json @@ -0,0 +1,20 @@ +{ + "transforms": { + "prop": { + "de": [ + "de", + "De" + ] + }, + "spell": { + "y": [ + "y", + "ij" + ], + "ij": [ + "ij", + "y" + ] + } + } +} \ No newline at end of file diff --git a/tests/data/lookup/src/lst_test_nested/items.txt b/tests/data/lookup/src/lst_test_nested/items.txt new file mode 100644 index 00000000..0a207c06 --- /dev/null +++ b/tests/data/lookup/src/lst_test_nested/items.txt @@ -0,0 +1,2 @@ +a +b \ No newline at end of file diff --git a/tests/data/lookup/src/lst_test_nested/lst_sublist/items.txt 
b/tests/data/lookup/src/lst_test_nested/lst_sublist/items.txt new file mode 100644 index 00000000..7abb43b5 --- /dev/null +++ b/tests/data/lookup/src/lst_test_nested/lst_sublist/items.txt @@ -0,0 +1,2 @@ +c +d \ No newline at end of file diff --git a/tests/regression/data/ages.json b/tests/data/regression_cases/ages.json similarity index 100% rename from tests/regression/data/ages.json rename to tests/data/regression_cases/ages.json diff --git a/tests/regression/data/dates.json b/tests/data/regression_cases/dates.json similarity index 100% rename from tests/regression/data/dates.json rename to tests/data/regression_cases/dates.json diff --git a/tests/regression/data/emails.json b/tests/data/regression_cases/emails.json similarity index 88% rename from tests/regression/data/emails.json rename to tests/data/regression_cases/emails.json index 3eb06439..63c02106 100644 --- a/tests/regression/data/emails.json +++ b/tests/data/regression_cases/emails.json @@ -8,7 +8,7 @@ "text": "een-afdeling@prinsesmaximacentrum.nl", "start_char": 0, "end_char": 36, - "tag": "email" + "tag": "emailadres" } ] }, @@ -20,7 +20,7 @@ "text": "poli-specialisme@umcutrecht.nl", "start_char": 0, "end_char": 30, - "tag": "email" + "tag": "emailadres" } ] }, @@ -32,7 +32,7 @@ "text": "ditmailadres@umcutrecht.nl", "start_char": 10, "end_char": 36, - "tag": "email" + "tag": "emailadres" } ] }, @@ -44,7 +44,7 @@ "text": "ABC@umcutrecht.nl", "start_char": 0, "end_char": 17, - "tag": "email" + "tag": "emailadres" } ] }, @@ -56,7 +56,7 @@ "text": "Poli.Opname@umcutrecht.nl", "start_char": 0, "end_char": 25, - "tag": "email" + "tag": "emailadres" } ] }, @@ -68,7 +68,7 @@ "text": "persoon@gmail.com", "start_char": 4, "end_char": 21, - "tag": "email" + "tag": "emailadres" } ] }, @@ -80,7 +80,7 @@ "text": "a.b.c.naam@umcutrecht.nl", "start_char": 0, "end_char": 24, - "tag": "email" + "tag": "emailadres" } ] }, @@ -92,7 +92,7 @@ "text": "a.b.c.naam-12@umcutrecht.nl", "start_char": 0, "end_char": 27, - "tag": "email" + "tag": "emailadres" } ] }, @@ -104,7 +104,7 @@ "text": "A.B.C.Naam@umcutrecht.nl", "start_char": 0, "end_char": 24, - "tag": "email" + "tag": "emailadres" } ] }, @@ -116,7 +116,7 @@ "text": "henk1993@outlook.com", "start_char": 0, "end_char": 20, - "tag": "email" + "tag": "emailadres" } ] }, @@ -128,7 +128,7 @@ "text": "achternaam_voornaam@gmail.com", "start_char": 0, "end_char": 29, - "tag": "email" + "tag": "emailadres" } ] } diff --git a/tests/regression/data/identifiers.json b/tests/data/regression_cases/identifiers.json similarity index 100% rename from tests/regression/data/identifiers.json rename to tests/data/regression_cases/identifiers.json diff --git a/tests/regression/data/institutions.json b/tests/data/regression_cases/institutions.json similarity index 86% rename from tests/regression/data/institutions.json rename to tests/data/regression_cases/institutions.json index 9754fe82..8cc630f4 100644 --- a/tests/regression/data/institutions.json +++ b/tests/data/regression_cases/institutions.json @@ -8,7 +8,7 @@ "text": "UMC Utrecht", "start_char": 0, "end_char": 11, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -20,7 +20,7 @@ "text": "UMC UTRECHT", "start_char": 0, "end_char": 11, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -32,7 +32,7 @@ "text": "UMCU", "start_char": 0, "end_char": 4, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -49,7 +49,7 @@ "text": "Academisch Ziekenhuis Utrecht", "start_char": 0, "end_char": 29, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -61,7 +61,7 @@ 
"text": "ACADEMISCH ZIEKENHUIS UTRECHT", "start_char": 0, "end_char": 29, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -73,7 +73,7 @@ "text": "AZU", "start_char": 0, "end_char": 3, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -85,13 +85,13 @@ "text": "Prinses Maxima Centrum", "start_char": 0, "end_char": 22, - "tag": "instelling" + "tag": "ziekenhuis" }, { "text": "Maxima", "start_char": 8, "end_char": 14, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -103,13 +103,13 @@ "text": "Prinses M\u00e1xima Centrum", "start_char": 0, "end_char": 22, - "tag": "instelling" + "tag": "ziekenhuis" }, { "text": "M\u00e1xima", "start_char": 8, "end_char": 14, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -121,7 +121,7 @@ "text": "PMC", "start_char": 0, "end_char": 3, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -133,7 +133,7 @@ "text": "Wilhelmina Kinderziekenhuis", "start_char": 0, "end_char": 27, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -145,7 +145,7 @@ "text": "WKZ", "start_char": 0, "end_char": 3, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -157,13 +157,13 @@ "text": "Militair Hospitaal", "start_char": 9, "end_char": 27, - "tag": "instelling" + "tag": "ziekenhuis" }, { "text": "Centraal Militair Hospitaal", "start_char": 0, "end_char": 27, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -175,7 +175,7 @@ "text": "CMH", "start_char": 0, "end_char": 3, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -187,13 +187,13 @@ "text": "Meander Medisch Centrum", "start_char": 0, "end_char": 23, - "tag": "instelling" + "tag": "ziekenhuis" }, { "text": "Meander", "start_char": 0, "end_char": 7, - "tag": "instelling" + "tag": "zorginstelling" } ] }, @@ -205,8 +205,15 @@ "text": "Meander", "start_char": 0, "end_char": 7, - "tag": "instelling" + "tag": "ziekenhuis" + }, + { + "text": "Meander", + "start_char": 0, + "end_char": 7, + "tag": "zorginstelling" } + ] }, { @@ -217,7 +224,7 @@ "text": "MMC", "start_char": 0, "end_char": 3, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -229,13 +236,13 @@ "text": "Antoniusziekenhuis", "start_char": 5, "end_char": 23, - "tag": "instelling" + "tag": "ziekenhuis" }, { "text": "Sint Antoniusziekenhuis", "start_char": 0, "end_char": 23, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -247,7 +254,7 @@ "text": "Antonius", "start_char": 5, "end_char": 13, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -259,7 +266,7 @@ "text": "Antonius", "start_char": 4, "end_char": 12, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -291,7 +298,7 @@ "text": "Isala", "start_char": 0, "end_char": 5, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -303,7 +310,7 @@ "text": "Isala ziekenhuis", "start_char": 0, "end_char": 16, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -315,7 +322,7 @@ "text": "Isala kliniek", "start_char": 0, "end_char": 13, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -327,7 +334,7 @@ "text": "Erasmus Medisch Centrum", "start_char": 0, "end_char": 23, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -339,7 +346,7 @@ "text": "Erasmus MC", "start_char": 0, "end_char": 10, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -351,13 +358,13 @@ "text": "Lange Land Ziekenhuis", "start_char": 4, "end_char": 25, - "tag": "instelling" + "tag": "ziekenhuis" }, { "text": "Het Lange Land Ziekenhuis", "start_char": 0, "end_char": 25, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -369,13 +376,13 @@ "text": "'t Lange Land Ziekenhuis", "start_char": 0, 
"end_char": 24, - "tag": "instelling" + "tag": "ziekenhuis" }, { "text": "Lange Land Ziekenhuis", "start_char": 3, "end_char": 24, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -387,7 +394,7 @@ "text": "Academisch Ziekenhuis Nymegen", "start_char": 0, "end_char": 29, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -399,7 +406,7 @@ "text": "Alryne Ziekenhuis", "start_char": 0, "end_char": 17, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -411,13 +418,13 @@ "text": "Canisius-Wilhelmina Ziekenhuis", "start_char": 0, "end_char": 30, - "tag": "instelling" + "tag": "ziekenhuis" }, { "text": "Wilhelmina Ziekenhuis", "start_char": 9, "end_char": 30, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -429,13 +436,13 @@ "text": "Canisius Wilhelmina Ziekenhuis", "start_char": 0, "end_char": 30, - "tag": "instelling" + "tag": "ziekenhuis" }, { "text": "Wilhelmina Ziekenhuis", "start_char": 9, "end_char": 30, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -447,7 +454,7 @@ "text": "P.W. Janssen Ziekenhuis", "start_char": 0, "end_char": 23, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -459,7 +466,7 @@ "text": "PW Janssen Ziekenhuis", "start_char": 0, "end_char": 21, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -471,7 +478,7 @@ "text": "UMCU", "start_char": 26, "end_char": 30, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -483,7 +490,7 @@ "text": "UMCU", "start_char": 3, "end_char": 7, - "tag": "instelling" + "tag": "ziekenhuis" } ] }, @@ -500,13 +507,13 @@ "text": "De Hoogstraat", "start_char": 0, "end_char": 13, - "tag": "instelling" + "tag": "zorginstelling" }, { "text": "Hoogstraat", "start_char": 3, "end_char": 13, - "tag": "instelling" + "tag": "zorginstelling" } ] }, @@ -518,7 +525,7 @@ "text": "Hoogstraat", "start_char": 0, "end_char": 10, - "tag": "instelling" + "tag": "zorginstelling" } ] }, @@ -530,13 +537,13 @@ "text": "De Hoogstraat", "start_char": 0, "end_char": 13, - "tag": "instelling" + "tag": "zorginstelling" }, { "text": "Hoogstraat", "start_char": 3, "end_char": 13, - "tag": "instelling" + "tag": "zorginstelling" } ] }, @@ -548,7 +555,7 @@ "text": "Altrecht", "start_char": 0, "end_char": 8, - "tag": "instelling" + "tag": "zorginstelling" } ] }, @@ -560,7 +567,7 @@ "text": "Altrecht Bipolair", "start_char": 0, "end_char": 17, - "tag": "instelling" + "tag": "zorginstelling" } ] }, @@ -572,7 +579,7 @@ "text": "Cant\u00e9 praktijk voor Pedagogiek en Psychologie", "start_char": 0, "end_char": 45, - "tag": "instelling" + "tag": "zorginstelling" } ] }, @@ -584,7 +591,7 @@ "text": "Cante praktijk voor Pedagogiek en Psychologie", "start_char": 0, "end_char": 45, - "tag": "instelling" + "tag": "zorginstelling" } ] }, @@ -596,7 +603,7 @@ "text": "Careyn", "start_char": 0, "end_char": 6, - "tag": "instelling" + "tag": "zorginstelling" } ] }, @@ -608,7 +615,7 @@ "text": "Careijn", "start_char": 0, "end_char": 7, - "tag": "instelling" + "tag": "zorginstelling" } ] }, @@ -655,7 +662,7 @@ "text": "Compas Huisartsenpraktijk", "start_char": 0, "end_char": 25, - "tag": "instelling" + "tag": "zorginstelling" } ] }, @@ -667,7 +674,7 @@ "text": "Compas huisartsenpraktijk", "start_char": 0, "end_char": 25, - "tag": "instelling" + "tag": "zorginstelling" } ] }, @@ -679,7 +686,7 @@ "text": "Compas Huisartspraktijk", "start_char": 0, "end_char": 23, - "tag": "instelling" + "tag": "zorginstelling" } ] }, @@ -691,7 +698,7 @@ "text": "Compas huisartspraktijk", "start_char": 0, "end_char": 23, - "tag": "instelling" + "tag": "zorginstelling" } ] }, @@ -703,7 
+710,7 @@ "text": "Zon en Schild", "start_char": 0, "end_char": 13, - "tag": "instelling" + "tag": "zorginstelling" } ] }, @@ -715,7 +722,7 @@ "text": "Daan & Van Ardenne Huisartsen", "start_char": 0, "end_char": 29, - "tag": "instelling" + "tag": "zorginstelling" } ] }, @@ -727,7 +734,7 @@ "text": "Daan en Van Ardenne Huisartsen", "start_char": 0, "end_char": 30, - "tag": "instelling" + "tag": "zorginstelling" } ] }, @@ -739,7 +746,7 @@ "text": "De Kind- en Jeugdspecialist", "start_char": 0, "end_char": 27, - "tag": "instelling" + "tag": "zorginstelling" } ] }, @@ -751,7 +758,7 @@ "text": "De Kind en Jeugdspecialist", "start_char": 0, "end_char": 26, - "tag": "instelling" + "tag": "zorginstelling" } ] }, @@ -763,7 +770,7 @@ "text": "De Koperhorst", "start_char": 0, "end_char": 13, - "tag": "instelling" + "tag": "zorginstelling" } ] }, @@ -775,7 +782,7 @@ "text": "de Koperhorst", "start_char": 0, "end_char": 13, - "tag": "instelling" + "tag": "zorginstelling" } ] }, @@ -787,7 +794,7 @@ "text": "Altrecht", "start_char": 22, "end_char": 30, - "tag": "instelling" + "tag": "zorginstelling" } ] }, @@ -799,7 +806,7 @@ "text": "Altrecht", "start_char": 4, "end_char": 12, - "tag": "instelling" + "tag": "zorginstelling" } ] } diff --git a/tests/regression/data/locations.json b/tests/data/regression_cases/locations.json similarity index 100% rename from tests/regression/data/locations.json rename to tests/data/regression_cases/locations.json diff --git a/tests/regression/data/names.json b/tests/data/regression_cases/names.json similarity index 98% rename from tests/regression/data/names.json rename to tests/data/regression_cases/names.json index 51b815d5..348cebb3 100644 --- a/tests/regression/data/names.json +++ b/tests/data/regression_cases/names.json @@ -1114,6 +1114,21 @@ "tag": "persoon" } ] + }, + { + "id": 124, + "text": "ziekte van Zoon", + "annotations": [] + }, + { + "id": 125, + "text": "Turner's ziekte", + "annotations": [] + }, + { + "id": 126, + "text": "Pierre-Robin sequentie", + "annotations": [] } ] } \ No newline at end of file diff --git a/tests/regression/data/phone_numbers.json b/tests/data/regression_cases/phone_numbers.json similarity index 100% rename from tests/regression/data/phone_numbers.json rename to tests/data/regression_cases/phone_numbers.json diff --git a/tests/regression/data/urls.json b/tests/data/regression_cases/urls.json similarity index 100% rename from tests/regression/data/urls.json rename to tests/data/regression_cases/urls.json diff --git a/tests/data/small.json b/tests/data/small.json new file mode 100644 index 00000000..58c2f5d4 --- /dev/null +++ b/tests/data/small.json @@ -0,0 +1,3 @@ +{ + "test": true +} \ No newline at end of file diff --git a/tests/pipeline/test_deduce.py b/tests/pipeline/test_deduce.py index 35e2590e..3bc80771 100644 --- a/tests/pipeline/test_deduce.py +++ b/tests/pipeline/test_deduce.py @@ -31,7 +31,7 @@ def test_annotate(self, model): text="j.JNSEN.123@gmail.com", start_char=247, end_char=268, - tag="email", + tag="emailadres", ), dd.Annotation( text="J. Jansen", start_char=64, end_char=73, tag="patient" @@ -48,7 +48,7 @@ def test_annotate(self, model): text="Utrecht", start_char=106, end_char=113, tag="locatie" ), dd.Annotation( - text="UMCU", start_char=202, end_char=206, tag="instelling" + text="UMCU", start_char=202, end_char=206, tag="ziekenhuis" ), ] ) @@ -60,11 +60,11 @@ def test_deidentify(self, model): doc = model.deidentify(text, metadata=metadata) expected_deidentified = ( - "betreft: , bsn , patnr . 
De patient is " - " jaar oud en woonachtig in . Hij werd op " - " door arts ontslagen van de kliniek van het " - ". Voor nazorg kan hij worden bereikt via " - "of ." + "betreft: [PATIENT], bsn [BSN-1], patnr [ID-1]. De patient [PATIENT] is " + "[LEEFTIJD-1] jaar oud en woonachtig in [LOCATIE-1]. Hij werd op " + "[DATUM-1] door arts [PERSOON-1] ontslagen van de kliniek van het " + "[ZIEKENHUIS-1]. Voor nazorg kan hij worden bereikt via [EMAILADRES-1] " + "of [TELEFOONNUMMER-1]." ) assert doc.deidentified_text == expected_deidentified @@ -79,8 +79,8 @@ def test_annotate_intext(self, model): "64 jaar oud en woonachtig in Utrecht" ". Hij werd op 10 oktober 2018 door arts " "Peter de Visser ontslagen van de kliniek van het " - "UMCU. Voor nazorg kan hij worden bereikt " - "via j.JNSEN.123@gmail.com of " + "UMCU. Voor nazorg kan hij worden bereikt " + "via j.JNSEN.123@gmail.com of " "(06)12345678." ) diff --git a/tests/regression/test_regression.py b/tests/regression/test_regression.py index 2294dba8..3bdb442e 100644 --- a/tests/regression/test_regression.py +++ b/tests/regression/test_regression.py @@ -3,9 +3,11 @@ from docdeid import Annotation, AnnotationSet +from deduce import Deduce + def regression_test( - model, + model: Deduce, examples_file: str, enabled: set[str], known_failures: Optional[set[int]] = None, @@ -32,95 +34,70 @@ def regression_test( assert failures == known_failures +def annotators_from_group(model: Deduce, group: str) -> set[str]: + return {name for name, _ in model.processors[group]}.union({group}) + + class TestRegression: def test_regression_name(self, model): regression_test( model=model, - examples_file="tests/regression/data/names.json", - enabled={ - "names", - "prefix_with_initial", - "prefix_with_name", - "interfix_with_name", - "initial_with_capital", - "initial_interfix", - "first_name_lookup", - "surname_lookup", - "person_first_name", - "person_initial_from_name", - "person_initials", - "person_surname", - "name_context", - "person_annotation_converter", - }, + examples_file="tests/data/regression_cases/names.json", + enabled=annotators_from_group(model, "names"), ) def test_regression_location(self, model): regression_test( model=model, - examples_file="tests/regression/data/locations.json", - enabled={ - "locations", - "placename", - "street_pattern", - "street_lookup", - "housenumber", - "postal_code", - "postbus", - "remove_street_tags", - "clean_street_tags", - }, + examples_file="tests/data/regression_cases/locations.json", + enabled=annotators_from_group(model, "locations"), ) def test_regression_institution(self, model): regression_test( model=model, - examples_file="tests/regression/data/institutions.json", - enabled={ - "institutions", - "hospital", - "institution", - }, + examples_file="tests/data/regression_cases/institutions.json", + enabled=annotators_from_group(model, "institutions"), ) def test_regression_date(self, model): regression_test( model=model, - examples_file="tests/regression/data/dates.json", - enabled={"dates", "date_dmy_1", "date_dmy_2", "date_ymd_1", "date_ymd_2"}, + examples_file="tests/data/regression_cases/dates.json", + enabled=annotators_from_group(model, "dates"), ) def test_regression_age(self, model): regression_test( model=model, - examples_file="tests/regression/data/ages.json", - enabled={"ages", "age"}, + examples_file="tests/data/regression_cases/ages.json", + enabled=annotators_from_group(model, "ages"), ) def test_regression_identifier(self, model): regression_test( model=model, - 
examples_file="tests/regression/data/identifiers.json", - enabled={"identifiers", "bsn", "identifier"}, + examples_file="tests/data/regression_cases/identifiers.json", + enabled=annotators_from_group(model, "identifiers"), ) def test_regression_phone(self, model): regression_test( model=model, - examples_file="tests/regression/data/phone_numbers.json", - enabled={"phone_numbers", "phone"}, + examples_file="tests/data/regression_cases/phone_numbers.json", + enabled=annotators_from_group(model, "phone_numbers"), ) def test_regression_email(self, model): regression_test( model=model, - examples_file="tests/regression/data/emails.json", - enabled={"email_addresses", "email"}, + examples_file="tests/data/regression_cases/emails.json", + enabled=annotators_from_group(model, "email_addresses"), ) def test_regression_url(self, model): regression_test( model=model, - examples_file="tests/regression/data/urls.json", - enabled={"urls", "url"}, + examples_file="tests/data/regression_cases/urls.json", + enabled=annotators_from_group(model, "urls"), ) diff --git a/tests/unit/pattern/test_name_patient.py b/tests/unit/pattern/test_name_patient.py deleted file mode 100644 index dbc800af..00000000 --- a/tests/unit/pattern/test_name_patient.py +++ /dev/null @@ -1,153 +0,0 @@ -from unittest.mock import patch - -import docdeid as dd - -from deduce.pattern.name_patient import ( - PersonFirstNamePattern, - PersonInitialFromNamePattern, - PersonInitialsPattern, - PersonSurnamePattern, -) -from deduce.person import Person -from deduce.tokenizer import DeduceTokenizer -from tests.helpers import linked_tokens - - -class TestPersonFirstNamePattern: - pattern = PersonFirstNamePattern(tag="_") - - def test_match_first_name_multiple(self): - metadata = {"patient": Person(first_names=["Jan", "Adriaan"])} - tokens = linked_tokens(["Jan", "Adriaan"]) - - assert self.pattern.match(tokens[0], metadata=metadata) == ( - tokens[0], - tokens[0], - ) - assert self.pattern.match(tokens[1], metadata=metadata) == ( - tokens[1], - tokens[1], - ) - - def test_match_first_name_fuzzy(self): - metadata = {"patient": Person(first_names=["Adriaan"])} - tokens = linked_tokens(["Adriana"]) - - assert self.pattern.match(tokens[0], metadata=metadata) == ( - tokens[0], - tokens[0], - ) - - def test_match_first_name_fuzzy_short(self): - metadata = {"patient": Person(first_names=["Jan"])} - tokens = linked_tokens(["Dan"]) - - assert self.pattern.match(tokens[0], metadata=metadata) is None - - -class TestPersonInitialFromNamePattern: - pattern = PersonInitialFromNamePattern(tag="_") - - def test_match(self): - metadata = {"patient": Person(first_names=["Jan", "Adriaan"])} - - tokens = linked_tokens(["A", "J"]) - - assert self.pattern.match(tokens[0], metadata=metadata) == ( - tokens[0], - tokens[0], - ) - assert self.pattern.match(tokens[1], metadata=metadata) == ( - tokens[1], - tokens[1], - ) - - def test_match_with_period(self): - metadata = {"patient": Person(first_names=["Jan", "Adriaan"])} - tokens = linked_tokens(["J", ".", "A", "."]) - - assert self.pattern.match(tokens[0], metadata=metadata) == ( - tokens[0], - tokens[1], - ) - assert self.pattern.match(tokens[2], metadata=metadata) == ( - tokens[2], - tokens[3], - ) - - def test_no_match(self): - metadata = {"patient": Person(first_names=["Jan", "Adriaan"])} - tokens = linked_tokens(["F", "T"]) - - assert self.pattern.match(tokens[0], metadata=metadata) is None - assert self.pattern.match(tokens[1], metadata=metadata) is None - - -class TestPersonInitialsPattern: - pattern = 
PersonInitialsPattern(tag="_") - - def test_match(self): - metadata = {"patient": Person(initials="AFTH")} - tokens = linked_tokens(["AFTH", "THFA"]) - - assert self.pattern.match(tokens[0], metadata=metadata) == ( - tokens[0], - tokens[0], - ) - assert self.pattern.match(tokens[1], metadata=metadata) is None - - -class TestPersonSurnamePattern: - surname = "Van der Heide-Ginkel" - surname_pattern = linked_tokens(["Van der", "Heide", "-", "Ginkel"]) - - tokenizer = DeduceTokenizer() - patch.object(tokenizer, "tokenize", return_value=surname_pattern).start() - - pattern = PersonSurnamePattern(tokenizer=tokenizer, tag="_") - - def test_doc_precondition(self): - metadata = {"patient": Person(surname=self.surname)} - doc = dd.Document(text="_", metadata=metadata) - self.pattern.doc_precondition(doc) - - assert metadata["surname_pattern"] == self.surname_pattern - - def test_match_equal(self): - metadata = {"surname_pattern": self.surname_pattern} - tokens = linked_tokens(["Van der", "Heide", "-", "Ginkel", "is", "de", "naam"]) - - assert self.pattern.match(tokens[0], metadata=metadata) == ( - tokens[0], - tokens[3], - ) - - def test_match_longer_than_tokens(self): - metadata = {"surname_pattern": self.surname_pattern} - tokens = linked_tokens(["Van der", "Heide"]) - - assert self.pattern.match(tokens[0], metadata=metadata) is None - - def test_match_fuzzy(self): - metadata = {"surname_pattern": self.surname_pattern} - tokens = linked_tokens(["Van der", "Heijde", "-", "Ginkle", "is", "de", "naam"]) - - assert self.pattern.match(tokens[0], metadata=metadata) == ( - tokens[0], - tokens[3], - ) - - def test_match_unequal_first(self): - metadata = {"surname_pattern": self.surname_pattern} - tokens = linked_tokens(["v/der", "Heide", "-", "Ginkel", "is", "de", "naam"]) - - assert self.pattern.match(tokens[0], metadata=metadata) is None - - def test_match_unequal_first_fuzzy(self): - metadata = {"surname_pattern": self.surname_pattern} - tokens = linked_tokens(["Van den", "Heide", "-", "Ginkel", "is", "de", "naam"]) - - assert self.pattern.match(tokens[0], metadata=metadata) == ( - tokens[0], - tokens[3], - ) diff --git a/tests/unit/test_annotation_proc.py b/tests/unit/test_annotation_processor.py similarity index 92% rename from tests/unit/test_annotation_proc.py rename to tests/unit/test_annotation_processor.py index c50f6a00..8e3ce8f5 100644 --- a/tests/unit/test_annotation_proc.py +++ b/tests/unit/test_annotation_processor.py @@ -1,6 +1,6 @@ import docdeid as dd -from deduce.annotation_processing import ( +from deduce.annotation_processor import ( CleanAnnotationTag, DeduceMergeAdjacentAnnotations, PersonAnnotationConverter, @@ -160,6 +160,25 @@ def test_mixed_with_overlap(self): assert proc.process_annotations(annotations, text) == expected_annotations + def test_pseudo(self): + + proc = PersonAnnotationConverter() + text = "Henoch Schonlein" + + annotations = dd.AnnotationSet( + [ + dd.Annotation(text="Henoch", start_char=0, end_char=6, tag="voornaam"), + dd.Annotation( + text="Henoch Schonlein", + start_char=0, + end_char=16, + tag="pseudo_naam", + ), + ] + ) + + assert proc.process_annotations(annotations, text) == dd.AnnotationSet() + class TestRemoveAnnotations: def test_remove_annotations(self): diff --git a/tests/unit/test_annotator.py b/tests/unit/test_annotator.py index bbc64ef7..2f7daec3 100644 --- a/tests/unit/test_annotator.py +++ b/tests/unit/test_annotator.py @@ -1,4 +1,5 @@ import re +from unittest.mock import patch import docdeid as dd import pytest @@ -6,12 +7,15 @@ from 
deduce.annotator import ( BsnAnnotator, ContextAnnotator, + PatientNameAnnotator, PhoneNumberAnnotator, RegexpPseudoAnnotator, TokenPatternAnnotator, _PatternPositionMatcher, ) +from deduce.person import Person from deduce.tokenizer import DeduceTokenizer +from tests.helpers import linked_tokens @pytest.fixture @@ -31,19 +35,24 @@ def ds(): @pytest.fixture -def regexp_pseudo_doc(): +def tokenizer(): + return DeduceTokenizer() + + +@pytest.fixture +def regexp_pseudo_doc(tokenizer): return dd.Document( text="De patient is Na 12 jaar gestopt met medicijnen.", - tokenizers={"default": DeduceTokenizer()}, + tokenizers={"default": tokenizer}, ) @pytest.fixture -def pattern_doc(): +def pattern_doc(tokenizer): return dd.Document( text="De man heet Andries Meijer-Heerma, voornaam Andries.", - tokenizers={"default": DeduceTokenizer()}, + tokenizers={"default": tokenizer}, ) @@ -68,11 +77,16 @@ def phone_number_doc(): ) +@pytest.fixture +def surname_pattern(): + return linked_tokens(["Van der", "Heide", "-", "Ginkel"]) + + def token(text: str): return dd.Token(text=text, start_char=0, end_char=len(text)) -class TestPatternPositionMatcher: +class TestPositionMatcher: def test_equal(self): assert _PatternPositionMatcher.match({"equal": "test"}, token=token("test")) assert not _PatternPositionMatcher.match({"equal": "_"}, token=token("test")) @@ -90,17 +104,22 @@ def test_re_match(self): {"re_match": "[a-z]"}, token=token("123abc") ) - def test_match_is_initial(self): - pattern_position = {"is_initial": True} + def test_is_initials(self): - assert _PatternPositionMatcher.match(pattern_position, token=token("A")) - assert _PatternPositionMatcher.match(pattern_position, token=token("Ch")) - assert _PatternPositionMatcher.match(pattern_position, token=token("Chr")) - assert _PatternPositionMatcher.match(pattern_position, token=token("Ph")) - assert _PatternPositionMatcher.match(pattern_position, token=token("Th")) - assert not _PatternPositionMatcher.match(pattern_position, token=token("a")) - assert not _PatternPositionMatcher.match(pattern_position, token=token("Ah")) - assert not _PatternPositionMatcher.match(pattern_position, token=token("Abcd")) + assert _PatternPositionMatcher.match({"is_initials": True}, token=token("A")) + assert _PatternPositionMatcher.match({"is_initials": True}, token=token("AB")) + assert _PatternPositionMatcher.match({"is_initials": True}, token=token("ABC")) + assert _PatternPositionMatcher.match({"is_initials": True}, token=token("ABCD")) + assert not _PatternPositionMatcher.match( + {"is_initials": True}, token=token("ABCDE") + ) + assert not _PatternPositionMatcher.match({"is_initials": True}, token=token("")) + assert not _PatternPositionMatcher.match( + {"is_initials": True}, token=token("abcd") + ) + assert not _PatternPositionMatcher.match( + {"is_initials": True}, token=token("abcde") + ) def test_match_like_name(self): pattern_position = {"like_name": True} @@ -154,28 +173,6 @@ def test_match_neg_lookup(self, ds): {"neg_lookup": "surnames"}, token=token("smit"), ds=ds ) - def test_match_lowercase_lookup(self, ds): - assert _PatternPositionMatcher.match( - {"lowercase_lookup": "first_names"}, token=token("Pieter"), ds=ds - ) - assert _PatternPositionMatcher.match( - {"lowercase_lookup": "first_names"}, token=token("pieter"), ds=ds - ) - assert not _PatternPositionMatcher.match( - {"lowercase_lookup": "first_names"}, token=token("smit"), ds=ds - ) - - def test_match_lowercase_neg_lookup(self, ds): - assert _PatternPositionMatcher.match( - {"lowercase_neg_lookup": 
"first_names"}, token=token("Andries"), ds=ds - ) - assert _PatternPositionMatcher.match( - {"lowercase_neg_lookup": "first_names"}, token=token("andries"), ds=ds - ) - assert not _PatternPositionMatcher.match( - {"lowercase_neg_lookup": "first_names"}, token=token("pieter"), ds=ds - ) - def test_match_and(self): assert _PatternPositionMatcher.match( {"and": [{"equal": "Abcd"}, {"like_name": True}]}, @@ -216,11 +213,13 @@ def test_match_sequence(self, pattern_doc, ds): tpa = TokenPatternAnnotator(pattern=[{}], ds=ds, tag="_") assert tpa._match_sequence( - pattern_doc, start_token=pattern_doc.get_tokens()[3], pattern=pattern + pattern_doc.text, start_token=pattern_doc.get_tokens()[3], pattern=pattern ) == dd.Annotation(text="Andries Meijer", start_char=12, end_char=26, tag="_") assert ( tpa._match_sequence( - pattern_doc, start_token=pattern_doc.get_tokens()[7], pattern=pattern + pattern_doc.text, + start_token=pattern_doc.get_tokens()[7], + pattern=pattern, ) is None ) @@ -231,7 +230,7 @@ def test_match_sequence_left(self, pattern_doc, ds): tpa = TokenPatternAnnotator(pattern=[{}], ds=ds, tag="_") assert tpa._match_sequence( - pattern_doc, + pattern_doc.text, start_token=pattern_doc.get_tokens()[4], pattern=pattern, direction="left", @@ -239,7 +238,7 @@ def test_match_sequence_left(self, pattern_doc, ds): assert ( tpa._match_sequence( - pattern_doc, + pattern_doc.text, start_token=pattern_doc.get_tokens()[8], direction="left", pattern=pattern, @@ -253,17 +252,17 @@ def test_match_sequence_skip(self, pattern_doc, ds): tpa = TokenPatternAnnotator(pattern=[{}], ds=ds, tag="_") assert tpa._match_sequence( - pattern_doc, + pattern_doc.text, start_token=pattern_doc.get_tokens()[4], pattern=pattern, skip={"-"}, ) == dd.Annotation(text="Meijer-Heerma", start_char=20, end_char=33, tag="_") assert ( tpa._match_sequence( - pattern_doc, + pattern_doc.text, start_token=pattern_doc.get_tokens()[4], pattern=pattern, - skip=[], + skip=set(), ) is None ) @@ -296,7 +295,7 @@ def test_apply_context_pattern(self, pattern_doc): ) assert annotator._apply_context_pattern( - pattern_doc, + pattern_doc.text, annotations, { "pattern": [{"like_name": True}], @@ -332,7 +331,7 @@ def test_apply_context_pattern_left(self, pattern_doc): ) assert annotator._apply_context_pattern( - pattern_doc, + pattern_doc.text, annotations, { "pattern": [{"like_name": True}], @@ -368,7 +367,7 @@ def test_apply_context_pattern_skip(self, pattern_doc): ) assert annotator._apply_context_pattern( - pattern_doc, + pattern_doc.text, annotations, { "pattern": [{"like_name": True}], @@ -420,7 +419,7 @@ def test_annotate_multiple(self, pattern_doc): ] ) - assert annotator._annotate(pattern_doc, annotations) == dd.AnnotationSet( + assert annotator._annotate(pattern_doc.text, annotations) == dd.AnnotationSet( { dd.Annotation( text="Andries Meijer-Heerma", @@ -437,7 +436,7 @@ def test_annotate_iterative(self, pattern_doc): "pattern": [{"like_name": True}], "direction": "right", "skip": ["-"], - "pre_tag": "naam", + "pre_tag": ["naam", "voornaam"], "tag": "{tag}+naam", } ] @@ -457,7 +456,7 @@ def test_annotate_iterative(self, pattern_doc): ] ) - assert annotator._annotate(pattern_doc, annotations) == dd.AnnotationSet( + assert annotator._annotate(pattern_doc.text, annotations) == dd.AnnotationSet( { dd.Annotation( text="Andries Meijer-Heerma", @@ -469,6 +468,287 @@ def test_annotate_iterative(self, pattern_doc): ) +class TestPatientNameAnnotator: + def test_match_first_name_multiple(self, tokenizer): + + metadata = {"patient": 
Person(first_names=["Jan", "Adriaan"])} + tokens = linked_tokens(["Jan", "Adriaan"]) + ann = PatientNameAnnotator(tokenizer=tokenizer, tag="_") + doc = dd.Document(text="_", metadata=metadata) + + assert ann._match_first_names(doc=doc, token=tokens[0]) == ( + tokens[0], + tokens[0], + ) + + assert ann._match_first_names(doc=doc, token=tokens[1]) == ( + tokens[1], + tokens[1], + ) + + def test_match_first_name_fuzzy(self, tokenizer): + + metadata = {"patient": Person(first_names=["Adriaan"])} + tokens = linked_tokens(["Adriana"]) + + ann = PatientNameAnnotator(tokenizer=tokenizer, tag="_") + doc = dd.Document(text="_", metadata=metadata) + + assert ann._match_first_names(doc=doc, token=tokens[0]) == ( + tokens[0], + tokens[0], + ) + + def test_match_first_name_fuzzy_short(self, tokenizer): + + metadata = {"patient": Person(first_names=["Jan"])} + tokens = linked_tokens(["Dan"]) + + ann = PatientNameAnnotator(tokenizer=tokenizer, tag="_") + doc = dd.Document(text="_", metadata=metadata) + + assert ann._match_first_names(doc=doc, token=tokens[0]) is None + + def test_match_initial_from_name(self, tokenizer): + + metadata = {"patient": Person(first_names=["Jan", "Adriaan"])} + tokens = linked_tokens(["A", "J"]) + + ann = PatientNameAnnotator(tokenizer=tokenizer, tag="_") + doc = dd.Document(text="_", metadata=metadata) + + assert ann._match_initial_from_name(doc=doc, token=tokens[0]) == ( + tokens[0], + tokens[0], + ) + + assert ann._match_initial_from_name(doc=doc, token=tokens[1]) == ( + tokens[1], + tokens[1], + ) + + def test_match_initial_from_name_with_period(self, tokenizer): + + metadata = {"patient": Person(first_names=["Jan", "Adriaan"])} + tokens = linked_tokens(["J", ".", "A", "."]) + + ann = PatientNameAnnotator(tokenizer=tokenizer, tag="_") + doc = dd.Document(text="_", metadata=metadata) + + assert ann._match_initial_from_name(doc=doc, token=tokens[0]) == ( + tokens[0], + tokens[1], + ) + + assert ann._match_initial_from_name(doc=doc, token=tokens[2]) == ( + tokens[2], + tokens[3], + ) + + def test_match_initial_from_name_no_match(self, tokenizer): + + metadata = {"patient": Person(first_names=["Jan", "Adriaan"])} + tokens = linked_tokens(["F", "T"]) + + ann = PatientNameAnnotator(tokenizer=tokenizer, tag="_") + doc = dd.Document(text="_", metadata=metadata) + + assert ann._match_initial_from_name(doc=doc, token=tokens[0]) is None + assert ann._match_initial_from_name(doc=doc, token=tokens[1]) is None + + def test_match_initials(self, tokenizer): + + metadata = {"patient": Person(initials="AFTH")} + tokens = linked_tokens(["AFTH", "THFA"]) + + ann = PatientNameAnnotator(tokenizer=tokenizer, tag="_") + doc = dd.Document(text="_", metadata=metadata) + + assert ann._match_initials(doc=doc, token=tokens[0]) == (tokens[0], tokens[0]) + assert ann._match_initials(doc=doc, token=tokens[1]) is None + + def test_match_surname_equal(self, tokenizer, surname_pattern): + + metadata = {"surname_pattern": surname_pattern} + tokens = linked_tokens(["Van der", "Heide", "-", "Ginkel", "is", "de", "naam"]) + + ann = PatientNameAnnotator(tokenizer=tokenizer, tag="_") + doc = dd.Document(text="_", metadata=metadata) + + with patch.object(tokenizer, "tokenize", return_value=surname_pattern): + + assert ann._match_surname(doc=doc, token=tokens[0]) == ( + tokens[0], + tokens[3], + ) + + def test_match_surname_longer_than_tokens(self, tokenizer, surname_pattern): + + metadata = {"surname_pattern": surname_pattern} + tokens = linked_tokens(["Van der", "Heide"]) + + ann = 
PatientNameAnnotator(tokenizer=tokenizer, tag="_") + doc = dd.Document(text="_", metadata=metadata) + + with patch.object(tokenizer, "tokenize", return_value=surname_pattern): + + assert ann._match_surname(doc=doc, token=tokens[0]) is None + + def test_match_surname_fuzzy(self, tokenizer, surname_pattern): + + metadata = {"surname_pattern": surname_pattern} + tokens = linked_tokens(["Van der", "Heijde", "-", "Ginkle", "is", "de", "naam"]) + + ann = PatientNameAnnotator(tokenizer=tokenizer, tag="_") + doc = dd.Document(text="_", metadata=metadata) + + with patch.object(tokenizer, "tokenize", return_value=surname_pattern): + + assert ann._match_surname(doc=doc, token=tokens[0]) == ( + tokens[0], + tokens[3], + ) + + def test_match_surname_unequal_first(self, tokenizer, surname_pattern): + + metadata = {"surname_pattern": surname_pattern} + tokens = linked_tokens(["v/der", "Heide", "-", "Ginkel", "is", "de", "naam"]) + + ann = PatientNameAnnotator(tokenizer=tokenizer, tag="_") + doc = dd.Document(text="_", metadata=metadata) + + with patch.object(tokenizer, "tokenize", return_value=surname_pattern): + + assert ann._match_surname(doc=doc, token=tokens[0]) is None + + def test_match_surname_unequal_first_fuzzy(self, tokenizer, surname_pattern): + + metadata = {"surname_pattern": surname_pattern} + tokens = linked_tokens(["Van den", "Heide", "-", "Ginkel", "is", "de", "naam"]) + + ann = PatientNameAnnotator(tokenizer=tokenizer, tag="_") + doc = dd.Document(text="_", metadata=metadata) + + with patch.object(tokenizer, "tokenize", return_value=surname_pattern): + + assert ann._match_surname(doc=doc, token=tokens[0]) == ( + tokens[0], + tokens[3], + ) + + def test_annotate_first_name(self, tokenizer): + + metadata = { + "patient": Person( + first_names=["Jan", "Johan"], initials="JJ", surname="Jansen" + ) + } + text = "De patient heet Jan" + tokens = tokenizer.tokenize(text) + + ann = PatientNameAnnotator(tokenizer=tokenizer, tag="_") + doc = dd.Document(text=text, metadata=metadata) + + with patch.object(doc, "get_tokens", return_value=tokens): + with patch.object( + tokenizer, "tokenize", return_value=linked_tokens(["Jansen"]) + ): + annotations = ann.annotate(doc) + + assert annotations == [ + dd.Annotation( + text="Jan", + start_char=16, + end_char=19, + tag="voornaam_patient", + ) + ] + + def test_annotate_initials_from_name(self, tokenizer): + + metadata = { + "patient": Person( + first_names=["Jan", "Johan"], initials="JJ", surname="Jansen" + ) + } + text = "De patient heet JJ" + tokens = tokenizer.tokenize(text) + + ann = PatientNameAnnotator(tokenizer=tokenizer, tag="_") + doc = dd.Document(text=text, metadata=metadata) + + with patch.object(doc, "get_tokens", return_value=tokens): + with patch.object( + tokenizer, "tokenize", return_value=linked_tokens(["Jansen"]) + ): + annotations = ann.annotate(doc) + + assert annotations == [ + dd.Annotation( + text="JJ", + start_char=16, + end_char=18, + tag="initiaal_patient", + ) + ] + + def test_annotate_initial(self, tokenizer): + + metadata = { + "patient": Person( + first_names=["Jan", "Johan"], initials="JJ", surname="Jansen" + ) + } + text = "De patient heet J." 
+ tokens = tokenizer.tokenize(text) + + ann = PatientNameAnnotator(tokenizer=tokenizer, tag="_") + doc = dd.Document(text=text, metadata=metadata) + + with patch.object(doc, "get_tokens", return_value=tokens): + with patch.object( + tokenizer, "tokenize", return_value=linked_tokens(["Jansen"]) + ): + annotations = ann.annotate(doc) + + assert annotations == [ + dd.Annotation( + text="J.", + start_char=16, + end_char=18, + tag="initiaal_patient", + ) + ] + + def test_annotate_surname(self, tokenizer): + + metadata = { + "patient": Person( + first_names=["Jan", "Johan"], initials="JJ", surname="Jansen" + ) + } + text = "De patient heet Jansen" + tokens = tokenizer.tokenize(text) + + ann = PatientNameAnnotator(tokenizer=tokenizer, tag="_") + doc = dd.Document(text=text, metadata=metadata) + + with patch.object(doc, "get_tokens", return_value=tokens): + with patch.object( + tokenizer, "tokenize", return_value=linked_tokens(["Jansen"]) + ): + annotations = ann.annotate(doc) + + assert annotations == [ + dd.Annotation( + text="Jansen", + start_char=16, + end_char=22, + tag="achternaam_patient", + ) + ] + + class TestRegexpPseudoAnnotator: def test_is_word_char(self): diff --git a/tests/unit/test_lookup_struct.py b/tests/unit/test_lookup_struct.py new file mode 100644 index 00000000..978ac121 --- /dev/null +++ b/tests/unit/test_lookup_struct.py @@ -0,0 +1,121 @@ +import io +from pathlib import Path +from unittest.mock import patch + +import docdeid as dd + +from deduce.lookup_structs import ( + cache_lookup_structs, + load_lookup_structs_from_cache, + load_raw_itemset, + load_raw_itemsets, + validate_lookup_struct_cache, +) + +DATA_PATH = Path(".").cwd() / "tests" / "data" / "lookup" + + +class TestLookupStruct: + def test_load_raw_itemset(self): + + raw_itemset = load_raw_itemset(DATA_PATH / "src" / "lst_test") + + assert len(raw_itemset) == 5 + assert "de Vries" in raw_itemset + assert "De Vries" in raw_itemset + assert "Sijbrand" in raw_itemset + assert "Sybrand" in raw_itemset + assert "Pieters" in raw_itemset + assert "Wolter" not in raw_itemset + + def test_load_raw_itemset_nested(self): + + raw_itemset = load_raw_itemset(DATA_PATH / "src" / "lst_test_nested") + + assert raw_itemset == {"a", "b", "c", "d"} + + def test_load_raw_itemsets(self): + + raw_itemsets = load_raw_itemsets( + base_path=DATA_PATH, subdirs=["lst_test", "lst_test_nested"] + ) + + assert "test" in raw_itemsets + assert len(raw_itemsets["test"]) == 5 + assert "test_nested" in raw_itemsets + assert len(raw_itemsets["test_nested"]) == 4 + + def test_validate_lookup_struct_cache_valid(self): + + cache = { + "deduce_version": "2.5.0", + "saved_datetime": "2023-12-06 10:19:39.198133", + "lookup_structs": "_", + } + + class MockStats: + st_mtime = 1000000000 # way in the past + + with patch("pathlib.Path.glob", return_value=[1, 2, 3]): + with patch("os.stat", return_value=MockStats()): + assert validate_lookup_struct_cache( + cache=cache, base_path=DATA_PATH, deduce_version="2.5.0" + ) + + def test_validate_lookup_struct_cache_file_changes(self): + + cache = { + "deduce_version": "2.5.0", + "saved_datetime": "2023-12-06 10:19:39.198133", + "lookup_structs": "_", + } + + class MockStats: + st_mtime = 2000000000 # way in the future + + with patch("pathlib.Path.glob", return_value=[1, 2, 3]): + with patch("os.stat", return_value=MockStats()): + assert not validate_lookup_struct_cache( + cache=cache, base_path=DATA_PATH, deduce_version="2.5.0" + ) + + @patch("deduce.lookup_structs.validate_lookup_struct_cache", return_value=True) + 
def test_load_lookup_structs_from_cache(self, _): + + ds_collection = load_lookup_structs_from_cache( + base_path=DATA_PATH, deduce_version="_" + ) + + assert len(ds_collection) == 2 + assert "test" in ds_collection + assert "test_nested" in ds_collection + + @patch("deduce.lookup_structs.validate_lookup_struct_cache", return_value=True) + def test_load_lookup_structs_from_cache_nofile(self, _): + + ds_collection = load_lookup_structs_from_cache( + base_path=DATA_PATH / "non_existing_dir", deduce_version="_" + ) + + assert ds_collection is None + + @patch("deduce.lookup_structs.validate_lookup_struct_cache", return_value=False) + def test_load_lookup_structs_from_cache_invalid(self, _): + + ds_collection = load_lookup_structs_from_cache( + base_path=DATA_PATH, deduce_version="_" + ) + + assert ds_collection is None + + @patch("builtins.open", return_value=io.BytesIO()) + @patch("pickle.dump") + def test_cache_lookup_structs(self, _, mock_pickle_dump): + + cache_lookup_structs( + lookup_structs=dd.ds.DsCollection(), + base_path=DATA_PATH, + deduce_version="2.5.0", + ) + + assert mock_pickle_dump.called_once() diff --git a/tests/unit/test_redact.py b/tests/unit/test_redactor.py similarity index 98% rename from tests/unit/test_redact.py rename to tests/unit/test_redactor.py index c5b60ef4..96607b8c 100644 --- a/tests/unit/test_redact.py +++ b/tests/unit/test_redactor.py @@ -1,6 +1,6 @@ import docdeid as dd -from deduce.redact import DeduceRedactor +from deduce.redactor import DeduceRedactor class TestDeduceRedactor: diff --git a/tests/unit/test_utils.py b/tests/unit/test_utils.py index 52c1e4ce..0f0dff1a 100644 --- a/tests/unit/test_utils.py +++ b/tests/unit/test_utils.py @@ -1,20 +1,13 @@ +from pathlib import Path + +import docdeid as dd import pytest from deduce import utils +from deduce.annotator import TokenPatternAnnotator -class TestUtils: - def test_any_in_text(self): - assert utils.any_in_text(["hans", "piet", "karel"], "ik heet hans") - assert utils.any_in_text(["hans", "piet", "karel"], "ik heet piet") - assert utils.any_in_text(["hans", "piet", "karel"], "ik heet karel") - assert utils.any_in_text( - ["hans", "piet", "karel"], "wij heten hans, piet en karel" - ) - assert not utils.any_in_text(["hans", "piet", "karel"], "ik heet peter") - assert utils.any_in_text(["hans", "piet", "karel"], "wat een leuk hansopje") - assert utils.any_in_text(["hans", "piet", "karel"], "mijn oom heet pieter") - +class TestStrMatch: def test_str_match(self): assert utils.str_match("a", "a") assert utils.str_match("willem", "willem") @@ -34,6 +27,50 @@ def test_str_match_fuzzy(self): assert not utils.str_match("willem", "klaas", max_edit_distance=1) +class TestClassForName: + def test_class_for_name(self): + assert ( + utils.class_for_name( + module_name="deduce.annotator", class_name="TokenPatternAnnotator" + ) + == TokenPatternAnnotator + ) + + +class TestInitializeClass: + def test_initialize_class(self): + + cls = TokenPatternAnnotator + + tag = "_" + pattern = [{"key": "value"}] + + annotator = utils.initialize_class( + cls, args={"tag": tag, "pattern": pattern}, extras={} + ) + + assert annotator.tag == tag + assert annotator.pattern == pattern + + def test_initialize_class_with_extras(self): + + cls = TokenPatternAnnotator + + tag = "_" + pattern = [{"key": "value"}] + ds = dd.ds.DsCollection() + + annotator = utils.initialize_class( + cls, + args={"tag": tag, "pattern": pattern}, + extras={"ds": ds, "unused_argument": "_"}, + ) + + assert annotator.tag == tag + assert annotator.pattern == 
pattern + assert annotator.ds is ds + + class TestOverwriteDict: def test_empty(self): for add in [{}, {"a": 1}, {"a": 1, "b": {}}, {"a": 1, "b": {"c": 2}}]: @@ -55,6 +92,17 @@ def test_nonempty_with_nesting(self): } +class TestHasOverlap: + def test_has_overlap(self): + + assert not utils.has_overlap([]) + assert not utils.has_overlap([(0, 10)]) + assert utils.has_overlap([(0, 10), (5, 15)]) + assert not utils.has_overlap([(0, 10), (10, 15)]) + assert not utils.has_overlap([(0, 10), (15, 25)]) + assert not utils.has_overlap([(15, 25), (0, 10)]) + + class TestStrVariations: def test_has_overlap(self): @@ -150,3 +198,57 @@ def test_str_variations_regexp(self): variations = utils.str_variations(s, repl) assert variations == ["Van Bevanstraat", "van Bevanstraat"] + + def test_apply_transform(self): + + s = {"Prof. Lieflantlaan"} + repl = {"Prof.": ["Prof.", "Professor"]} + + transform_config = {"transforms": {"prefix": repl}} + variations = utils.apply_transform(s, transform_config) + + assert variations == {"Prof. Lieflantlaan", "Professor Lieflantlaan"} + + def test_apply_transform2(self): + + items = {"den Burg", "Rotterdam"} + transform = {"transforms": {"name": {"den": ["den", ""]}}} + + transformed_items = utils.apply_transform(items, transform) + + assert transformed_items == {"den Burg", "Burg", "Rotterdam"} + + def test_apply_transform_no_strip_lines(self): + + items = {"den Burg", "Rotterdam"} + transform = {"transforms": {"name": {"den": ["den", ""]}}, "strip_lines": False} + + transformed_items = utils.apply_transform(items, transform) + + assert transformed_items == {"den Burg", " Burg", "Rotterdam"} + + +class TestOptionalLoad: + def test_optional_load_items(self): + + path = Path("tests/data/lookup/src/lst_test_nested/items.txt") + + assert utils.optional_load_items(path) == {"a", "b"} + + def test_optional_load_items_nonexisting(self): + + path = Path("tests/data/non/existing/file.txt") + + assert utils.optional_load_items(path) is None + + def test_optional_load_json(self): + + path = Path("tests/data/small.json") + + assert utils.optional_load_json(path) == {"test": True} + + def test_optional_load_json_nonexisting(self): + + path = Path("tests/data/non/existing/file.json") + + assert utils.optional_load_json(path) is None
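
Note on the new cache tests: the `test_validate_lookup_struct_cache_*` cases above pin down a contract for when a pickled lookup-structure cache may be reused — same deduce version, and no lookup source file modified after the cache was saved. The sketch below is only an illustration of that contract as the tests encode it (cache field names taken from the test fixtures; the function name, return type, and the exact glob scope under `base_path` are assumptions), not the actual implementation in `deduce.lookup_structs`.

import os
from datetime import datetime
from pathlib import Path


def validate_cache_sketch(cache: dict, base_path: Path, deduce_version: str) -> bool:
    """Hedged sketch of the validation behaviour the tests assert.

    A cache counts as valid when it was written by the same deduce version
    and none of the lookup source files changed after it was saved.
    """

    # Version mismatch always invalidates the cache.
    if cache.get("deduce_version") != deduce_version:
        return False

    # "saved_datetime" in the fixtures is an ISO timestamp, e.g.
    # "2023-12-06 10:19:39.198133".
    saved_at = datetime.fromisoformat(cache["saved_datetime"]).timestamp()

    # Any source file modified after the cache was written invalidates it
    # (the tests patch Path.glob and os.stat to simulate old/new mtimes).
    for src_file in base_path.glob("**/*"):
        if os.stat(src_file).st_mtime > saved_at:
            return False

    return True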