Merge pull request #129 from vmenger/v3
v3.0.0
vmenger authored Dec 20, 2023
2 parents e43b062 + 287cc0e commit 8e0c7fa
Showing 95 changed files with 5,589 additions and 1,037,094 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -1,4 +1,8 @@
# General file exclusion
deduce/data/lookup/cache/*

# Top level scripts ignored by default
/*.py

# Exclude the following filetypes
*.sav
66 changes: 50 additions & 16 deletions CHANGELOG.md
@@ -5,15 +5,47 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## 3.0.0 (2023-12-20)

### Added
- speed optimizations, ~250%
- pseudo-annotating eponymous diseases (e.g. Creutzfeldt-Jakob)
- `PatientNameAnnotator`, which replaces `deduce.pattern`
- a structured way for loading and building lookup structures (lists and tries), including caching
- `pre_match_words` for some regexp annotators, speeding up the annotating
- option to present a user config as dict (using `config` keyword)
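
As a rough illustration of the `pre_match_words` idea above: a cheap substring check can skip the regular expression entirely for texts that cannot match. This is a sketch under assumptions, not Deduce's actual implementation; the helper name `annotate_with_prefilter` and its parameters are hypothetical.

```python
import re


def annotate_with_prefilter(text, pattern, pre_match_words):
    """Return regexp matches, skipping the regexp when no pre-match word occurs.

    Hypothetical helper for illustration only; not part of the Deduce API.
    """
    lowered = text.lower()
    # Cheap substring check first: only run the (slower) regexp when needed.
    if not any(word in lowered for word in pre_match_words):
        return []
    return [match.group(0) for match in pattern.finditer(text)]


postbus = re.compile(r"[Pp]ostbus\s\d{1,5}")
print(annotate_with_prefilter("Stuur naar Postbus 123", postbus, ["postbus"]))
print(annotate_with_prefilter("Geen adres bekend", postbus, ["postbus"]))
```

The first call runs the regexp and finds the match; the second returns early without evaluating the pattern at all.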

### Changed
- speedup for `TokenPatternAnnotator`
- some internals of `ContextPatternAnnotator`
- initials now detected by lookup list, rather than pattern
- redactor open and close chars from `<` `>` to `[` `]`, as previous chars caused issues in html (so deidentified text now shows `[PATIENT]`, `[LOCATIE]`, etc.)
- names of lookup structures to singular (`prefix`, rather than `prefixes`)
- `INSTELLING` tag to `ZIEKENHUIS` and `ZORGINSTELLING`
- refactored and simplified annotator loading, specifically the `annotator_type` config keyword now accepts references to classes (e.g. `deduce.annotator.TokenPatternAnnotator`)
- renamed `interfix_with_capital` annotator to `interfix_with_name`
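
The switch from `<` `>` to `[` `]` can be motivated with a small standard-library example. It only shows how the two tag styles behave under HTML escaping; the sample sentences are made up and no Deduce code is involved.

```python
import html

old_style = "<PATIENT> is opgenomen"
new_style = "[PATIENT] is opgenomen"

# Angle brackets must be escaped to survive in HTML; square brackets need no escaping.
print(html.escape(old_style))  # &lt;PATIENT&gt; is opgenomen
print(html.escape(new_style))  # [PATIENT] is opgenomen
```

Unescaped, a browser would treat `<PATIENT>` as an unknown element and hide it, whereas `[PATIENT]` passes through unchanged.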

### Deprecated
- the `config_file` keyword, now replaced by `config` which accepts both filenames and dicts
- old lookup list names, e.g. `prefixes` now replaced by `prefix`
- annotator types 'custom', 'regexp', 'token_pattern', 'dd_token_pattern' and 'annotation_context', all replaced by setting class directly as annotator_type
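
One plausible way a dotted class reference in `annotator_type` could be turned into a class is via `importlib`. The sketch below is an assumption about the general technique, not the loader Deduce actually uses; `resolve_class` is a hypothetical name, and a standard-library class stands in for an annotator class.

```python
import importlib


def resolve_class(reference):
    """Resolve a dotted reference such as 'package.module.ClassName' to a class.

    Hypothetical helper for illustration only.
    """
    module_name, _, class_name = reference.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# A standard-library class stands in for e.g. deduce.annotator.TokenPatternAnnotator.
cls = resolve_class("collections.OrderedDict")
print(cls.__name__)  # OrderedDict
```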

### Removed
- automated coverage reporting on coveralls.io
- options `lowercase_lookup`, `lowercase_neg_lookup` for token patterns
- everything in `deduce.pattern`, patient patterns now replaced by `PatientNameAnnotator`
- `utils.any_in_text`

### Fixed
- some small additions/removals for specific lookup lists
- smaller bugs related to overlapping matches

## 2.5.0 (2023-11-28)

### Added
- the `RegexpPseudoAnnotator` component for filtering regexp matches based on preceding/following words
- a `prefix_with_interfix` pattern for names, detecting e.g. `Dr. van Loon`

### Fixed
- a bug with `BsnAnnotator` with non-digit characters in regexp

### Changed
- the age detection component, with improved logic and pseudo patterns
- annotations are no longer counted adjacent when separated by a comma
@@ -22,6 +54,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- extended the postbus pattern for `xx.xxx` format (old notation)
- some smaller optimizations and exceptions for institution, hospital, placename, residence, medical term, first name, and last name lookup lists

### Fixed
- a bug with `BsnAnnotator` with non-digit characters in regexp

## 2.4.2 (2023-11-22)

### Changed
@@ -98,15 +133,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- detects year-month-day format in addition to (day-month-year)
- loading a custom config now only replaces the config options that are explicitly set, using defaults for those not included in the custom config
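
The "only replace what is explicitly set" behaviour described above can be sketched as a recursive dictionary merge. Whether the merge in Deduce is recursive is an assumption here, and both `merge_config` and the config keys below are hypothetical.

```python
def merge_config(defaults, custom):
    """Return a copy of defaults with the options from custom overriding them.

    Nested dicts are merged recursively, so unspecified options keep their defaults.
    Hypothetical helper for illustration only.
    """
    merged = dict(defaults)
    for key, value in custom.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical config keys, chosen for illustration.
defaults = {"open_char": "<", "annotators": {"bsn": {"enabled": True}}}
custom = {"open_char": "["}
print(merge_config(defaults, custom))
```

Only `open_char` changes; the `annotators` section is carried over from the defaults untouched.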

### Fixed
- annotations can no longer be counted as adjacent when separated by newline or tab (and will thus not be merged)
### Deprecated
- backwards compatibility, which was temporarily added to transition from v1 to v2

### Removed
- a separate patient identifier tag, now superseded by a generic tag
- detection of day/month combinations for dates, as this caused many false positives (e.g. lab values, numeric scores)

### Deprecated
- backwards compatibility, which was temporarily added to transition from v1 to v2
### Fixed
- annotations can no longer be counted as adjacent when separated by newline or tab (and will thus not be merged)

## 2.0.3 (2023-04-06)

@@ -125,6 +160,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## 2.0.0 (2022-12-05)

### Added
- introduced new interface for deidentification, using `Deduce()` class
- a separate documentation page, with tutorial and migration guide
- support for python 3.10 and 3.11

### Changed
- major refactor that touches pretty much every line of code
- use `docdeid` package for logic
@@ -134,12 +174,6 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- refactor annotators into separate classes, using structured annotations
- guidelines for contributing

### Added
- introduced new interface for deidentification, using `Deduce()` class
- a separate documentation page, with tutorial and migration guide
- support for python 3.10 and 3.11


### Removed
- the `annotate_text` and `deidentify_annotations` functions
- all in-text annotation (under the hood) and associated functions
@@ -152,14 +186,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## 1.0.8 (2021-11-29)

### Added
- warn if there are any structured annotations whose annotated text does not match the original text in the span denoted by the structured annotation

### Fixed
- various modifications related to adding or subtracting spaces in annotated texts
- remove the lowercasing of institutions' names
- therefore, all structured annotations have texts matching the original text in the same span

### Added
- warn if there are any structured annotations whose annotated text does not match the original text in the span denoted by the structured annotation

## 1.0.7 (2021-11-03)

### Changed
12 changes: 9 additions & 3 deletions CONTRIBUTING.md
@@ -13,23 +13,29 @@ Before starting, some things to consider:
* This project uses poetry for package management. Install it with ```pip install poetry```
* Setting up the environment is easy: just use ```poetry install```
* The makefile contains some useful commands when developing:
  * `make test` runs the tests (including coverage)
  * `make format` formats the package code
  * `make lint` runs the linters (check the output)
  * `make clean` removes build/test artifacts, etc
* And for docs:
  * `make build-docs` builds the docs
  * `make clean-docs` removes docs build

## Running the tests

```bash
pytest .
```

## PR checklist

* Verify that tests are passing
* Verify that tests are updated/added according to changes
* Run the formatters (`make format`)
* Run the linters (`make lint`) and check the output for anything preventable
* Run the linters (`make lint`)
* Add a section to the changelog
* Add a description to your PR

If all the steps above are followed, this ensures a quick review and release of your contribution.

## Releasing
* Readthedocs has a webhook connected to pushes on the main branch. It will trigger and update automatically.
* Create a [release on github](https://github.com/vmenger/docdeid/releases/new), create a tag with the right version, manually copy and paste from the changelog