Merge pull request #129 from vmenger/v3

v3.0.0
vmenger · Dec 20, 2023 · 8e0c7fa
2 parents e43b062 + 287cc0e
commit 8e0c7fa

Showing 95 changed files with 5,589 additions and 1,037,094 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,4 +1,8 @@
 # General file exclusion 
+deduce/data/lookup/cache/*
+
+# Top level scripts ignored by default
+/*.py
 
 # Exclude the following filetypes
 *.sav

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,15 +5,47 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## 3.0.0 (2023-12-20)
+
+### Added
+- speed optimizations, ~250%
+- pseudo-annotating eponymous diseases (e.g. Creutzfeldt-Jakob)
+- `PatientNameAnnotator`, which replaces `deduce.pattern`
+- a structured way for loading and building lookup structures (lists and tries), including caching
+- `pre_match_words` for some regexp annotators, speeding up the annotating
+- option to present a user config as dict (using `config` keyword)
+
+### Changed
+- speedup for `TokenPatternAnnotator`
+- some internals of `ContextPatternAnnotator`
+- initials now detected by lookup list, rather than pattern
+- redactor open and close chars from `<` `>` to `[` `]`, as previous chars caused issues in html (so deidentified text now shows `[PATIENT]`, `[LOCATIE]`, etc.)
+- names of lookup structures to singular (`prefix`, rather than `prefixes`)
+- `INSTELLING` tag to `ZIEKENHUIS` and `ZORGINSTELLING`
+- refactored and simplified annotator loading, specifically the `annotator_type` config keyword now accepts references to classes (e.g. `deduce.annotator.TokenPatternAnnotator`)
+- renamed `interfix_with_capital` annotator to `interfix_with_name` 
+
+### Deprecated
+- the `config_file` keyword, now replaced by `config` which accepts both filenames and dicts
+- old lookup list names, e.g. `prefixes` now replaced by `prefix`
+- annotator types 'custom', 'regexp', 'token_pattern', 'dd_token_pattern' and 'annotation_context', all replaced by setting class directly as annotator_type 
+
+### Removed
+- automated coverage reporting on coveralls.io
+- options `lowercase_lookup`, `lowercase_neg_lookup` for token patterns
+- everything in `deduce.pattern`, patient patterns now replaced by `PatientNameAnnotator`
+- `utils.any_in_text` 
+
+### Fixed
+- some small additions/removals for specific lookup lists
+- smaller bugs related to overlapping matches
+
 ## 2.5.0 (2023-11-28)
 
 ### Added
 - the `RegexpPseudoAnnotator` component for filtering regexp matches based on preceding/following words
 - a `prefix_with_interfix` pattern for names, detecting e.g. `Dr. van Loon`
 
-### Fixed
-- a bug with `BsnAnnotator` with non-digit characters in regexp
-
 ### Changed
 - the age detection component, with improved logic and pseudo patterns
 - annotations are no longer counted adjacent when separated by a comma
@@ -22,6 +54,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - extended the postbus pattern for `xx.xxx` format (old notation)
 - some smaller optimizations and exceptions for institution, hospital, placename, residence, medical term, first name, and last name lookup lists
 
+### Fixed
+- a bug with `BsnAnnotator` with non-digit characters in regexp
+
 ## 2.4.2 (2023-11-22)
 
 ### Changed
@@ -98,15 +133,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   - detects year-month-day format in addition to (day-month-year)
 - loading a custom config now only replaces the config options that are explicitly set, using defaults for those not included in the custom config
 
-### Fixed
-- annotations can no longer be counted as adjacent when separated by newline or tab (and will thus not be merged)
+### Deprecated
+- backwards compatibility, which was temporarily added to transition from v1 to v2
 
 ### Removed
 - a separate patient identifier tag, now superseded by a generic tag
 - detection of day/month combinations for dates, as this caused many false positives (e.g. lab values, numeric scores) 
 
-### Deprecated
-- backwards compatibility, which was temporary added to transition from v1 to v2
+### Fixed
+- annotations can no longer be counted as adjacent when separated by newline or tab (and will thus not be merged)
 
 ## 2.0.3 (2023-04-06)
 
@@ -125,6 +160,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## 2.0.0 (2022-12-05)
 
+### Added
+- introduced new interface for deidentification, using `Deduce()` class
+- a separate documentation page, with tutorial and migration guide
+- support for python 3.10 and 3.11
+
 ### Changed
 - major refactor that touches pretty much every line of code
 - use `docdeid` package for logic
@@ -134,12 +174,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - refactor annotators into separate classes, using structured annotations
 - guidelines for contributing
 
-### Added
-- introduced new interface for deidentification, using `Deduce()` class
-- a separate documentation page, with tutorial and migration guide
-- support for python 3.10 and 3.11
-
-
 ### Removed
 - the `annotate_text` and `deidentify_annotations` functions
 - all in-text annotation (under the hood) and associated functions
@@ -152,14 +186,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## 1.0.8 (2021-11-29)
 
+### Added
+- warn if there are any structured annotations whose annotated text does not match the original text in the span denoted by the structured annotation
+
 ### Fixed
 - various modifications related to adding or subtracting spaces in annotated texts
 - remove the lowercasing of institutions' names
 - therefore, all structured annotations have texts matching the original text in the same span
 
-### Added
-- warn if there are any structured annotations whose annotated text does not match the original text in the span denoted by the structured annotation
-
 ## 1.0.7 (2021-11-03)
 
 ### Changed
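
The 3.0.0 entry changes the redactor's open and close characters from `<` `>` to `[` `]`. For text that was deidentified with an earlier version and then stored, a one-off conversion might look like the sketch below. `convert_tags` is a hypothetical helper, not part of deduce. It assumes tags are uppercase words with an optional numeric suffix (the suffix is an assumption, not something the changelog states), so adjust the pattern to the tags your data actually contains.

```python
import re

# Assumption: old-style tags are uppercase words, optionally followed by
# "-<number>", wrapped in angle brackets (e.g. <PATIENT>, <LOCATIE-1>).
OLD_TAG = re.compile(r"<([A-Z_]+(?:-\d+)?)>")


def convert_tags(text: str) -> str:
    """Rewrite <TAG> markers as [TAG]; leave everything else unchanged."""
    return OLD_TAG.sub(r"[\1]", text)


if __name__ == "__main__":
    print(convert_tags("<PATIENT> is opgenomen in <LOCATIE-1>"))
```

The sketch only swaps the brackets. It does not rename `INSTELLING`, because the changelog splits that tag into `ZIEKENHUIS` and `ZORGINSTELLING` and the right choice cannot be derived from the tag alone. Uppercase HTML tags such as `<B>` would also match the pattern, so check for those before running it on mixed content.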

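The 3.0.0 entry says the `annotator_type` config keyword now accepts references to classes such as `deduce.annotator.TokenPatternAnnotator`. One common way to turn a dotted string like that into a class is `importlib`. The sketch below shows that general technique, not necessarily how deduce implements it. `resolve_class` is a hypothetical name, and the demo resolves a standard-library class so that it runs without deduce installed.

```python
import importlib


def resolve_class(reference: str) -> type:
    """Import and return the class named by a dotted 'package.module.Class' string."""
    module_name, _, class_name = reference.rpartition(".")
    if not module_name:
        raise ValueError(f"expected a dotted path, got {reference!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


if __name__ == "__main__":
    cls = resolve_class("collections.OrderedDict")
    print(cls.__name__)
```

Resolving the class by name keeps the config plain text (JSON-friendly) while still letting users point at their own annotator classes.
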
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -13,23 +13,29 @@ Before starting, some things to consider:
 * This project uses poetry for package management. Install it with ```pip install poetry```
 * Set up the environment is easy, just use ```poetry install```
 * The makefile contains some useful commands when developing:
-  * `make test` runs the tests (including coverage)
   * `make format` formats the package code
   * `make lint` runs the linters (check the output)
   * `make clean` removes build/test artifacts, etc
 * And for docs:
   * `make build-docs` builds the docs
-  * `make clean-docs` removes docs build
+
+## Running the tests
+
+```bash
+pytest .
+```
 
 ## PR checklist
 
 * Verify that tests are passing
 * Verify that tests are updated/added according to changes
 * Run the formatters (`make format`)
-* Run the linters (`make lint`) and check the output for anything preventable
+* Run the linters (`make lint`)
 * Add a section to the changelog
 * Add a description to your PR
 
+If all the steps above are followed, this ensures a quick review and release of your contribution. 
+
 ## Releasing
 * Readthedocs has a webhook connected to pushes on the main branch. It will trigger and update automatically. 
 * Create a [release on github](https://github.com/vmenger/docdeid/releases/new), create a tag with the right version, manually copy and paste from the changelog