# Stick old data product definitions in an archive folder (#24)
We have various ideas around Data as a Product floating about, the earliest being this repository: https://github.com/ministryofjustice/data-platform-products. We are no longer actively developing this idea, but we may reuse some of our earlier ideas when deciding what custom metadata we want to capture when registering metadata in the catalogue, so I want to gather these examples in one place and then get rid of the old repo.
Showing 33 changed files with 23,197 additions and 0 deletions.
`archive/data_product_examples/2023-05-example_prison_data_product/README.md` (30 additions & 0 deletions)
This is an early exploration of what a "Data Product" may look like, as of May 2023.
Migrated from https://github.com/ministryofjustice/data-platform-products

# Data products

## Purpose
A data product exists to make interoperability simpler. The intention of data products, and the resulting data mesh, is to eliminate the struggle to get timely access to data and the consequent loss of trust. The purpose of the data product is to serve the consumer's needs.

A Data Product is created and owned by a Data Product Owner, a person with comprehensive domain knowledge. Data Product Owners are not part of the Data Platform team; they work in other teams and have systems containing data that they would like to share. This is because it is essential to know a domain before creating a Data Product, and it would be impossible for the Data Platform team to have deep knowledge of every domain that uses the platform.
## Goals

Our goals are:

- Make data easily discoverable by users who wish to use that data. We do this by encouraging the owners of data products to supply high-quality [metadata](https://en.wikipedia.org/wiki/Metadata)
- Make data more usable, whatever the purpose, by applying product thinking to our data, and clearly describing the lineage and transformations of our data products
- Make it easier for us to provide governance for data, for example by identifying owners, protective markings and retention periods.
## Defining a data product

A data product has a unique name and is defined using a collection of YAML files.

| File name | Purpose | Documentation |
| --- | --- | --- |
| `00-specification.yml` | Aids data discoverability by providing a name, description and tags for a data product. It also contains contact details of the data product owner. | [Data product specification](./_docs/product-specification.md) |
| `01-governance.yml` | Contains protective marking, retention period, expected update frequency and data lineage. | [Defining product governance](./_docs/product-governance.md) |
| `02-data-dictionary.yml` | Contains the field and column (or domain, attribute and value) definitions comprising the data product, along with type information and user-friendly names. | [Data dictionary guidance](./_docs/data-dictionary.md) |
| `03-transformations.yml` | Describes the [cleaning](./_docs/cleansing-definitions.md) and [transformation](./_docs/transform-definitions.md) data will undergo before it is made available to consumers. | [Cleaning](./_docs/cleansing-definitions.md) and [transformation](./_docs/transform-definitions.md) |
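
As a rough illustration of the first of these files, a `00-specification.yml` might look something like the sketch below. The field names and values are assumptions for illustration only; the real schema is described in [Data product specification](./_docs/product-specification.md).

```yaml
# Illustrative sketch only - field names are assumed, not taken from the documented schema
name: example_prison_data_product
description: >
  Anonymised, row-level data about individuals released from prison,
  sourced from a case management system.
tags:
  - prisons
  - releases
owner:
  name: A Data Product Owner                # hypothetical contact
  email: data.product.owner@example.gov.uk  # hypothetical address
```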

`archive/data_product_examples/2023-05-example_prison_data_product/_docs/cleansing-definitions.md` (99 additions & 0 deletions)
# Describing data cleansing

When describing what has happened to your data before users interact with it, you should document any data cleaning processes you have used, at either a table or column level. Cleansing should be documented in the `03-transformations.yml` file in your data product definition folder.

See also our guidance on describing [transformations](./transform-definitions.md) that may be applied to your data.
## Types of data cleansing

These have been derived from the contributions made to the [Data Management Wiki](https://datamanagement.wiki/data_quality_management_system/data_cleansing). <!--Some of these contain US spellings - we also accept the UK equivalent (for example "normalisation" and "normalization" are both accepted).-->

Please use the identifier (ID) for the cleaning types when populating your `03-transformations.yml` file.

| ID | Method | Description |
| --- | --- | --- |
| abbreviation-expansion | Abbreviation expansion | Abbreviation expansion transforms abbreviations into their full form. There are different kinds of abbreviation. These can be shortened words such as "mgmt", or acronyms, where the abbreviation consists of a prefix of the original data value, e.g. "MDT" stands for "mandatory drug testing". |
| cross-checking | Cross-checking with a validated data set | Some data cleansing solutions clean data by cross-checking with a validated data set. In these cases you should reference the dataset used for cross-checking. |
| correct-typos | Correct typographical errors | A typographical error (often shortened to typo) is a mistake, such as a spelling mistake, made in the data. |
| drop-or-impute-missing-values | Drop or impute missing values | Missing values are data or data points of a variable that are missing. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data. |
| group-duplicates | Group duplicates | Duplicates are data points that are repeated in your dataset. Every duplicate detection method requires an algorithm for determining whether two or more records are duplicate representations of the same entity. |
| parsing | Parsing | Parsing is a method where one string of data gets converted into a different type of data. Parsing in data cleansing is performed for the detection of syntax errors. |
| remove-duplicates | Remove duplicates | Duplicates are data points that are repeated in your dataset. Every duplicate detection method requires an algorithm for determining whether two or more records are duplicate representations of the same entity. |
| remove-inconsistency | Remove inconsistency | Data inconsistency occurs when similar data is kept in different formats in different tables. Data inconsistency creates unreliable information, because it is difficult to determine which version of the information is correct. |
| remove-irrelevant-data | Remove irrelevant data | Irrelevant data are data that are not actually needed and do not fit the context of the problem we are trying to solve. |
| remove-outliers | Remove outliers | Outliers are values that are significantly different from all other observations. They should not be removed unless there is a good reason to do so, since users may need to see them. |
| standardisation | Standardisation | Standardisation transforms data into a standard form. For example, if the same information is represented slightly differently across different systems, you may wish to standardise those values so they match other data products. |
| statistical-methods | Statistical methods | Statistical methods are used to identify data issues and provide cleaning or flagging of data issues. |
| suppress-small-values | Suppress small values | Suppression is when small values are removed or replaced to avoid identifying individuals. |
| redaction | Redaction | Redaction is the removal of sensitive or other restricted information. |
| type-conversion | Type conversion | Type conversion (also called casting) is an operation that converts a piece of data of one data type to another data type. Type conversion can be used to make sure that numbers are stored as numerical data types and that dates are stored as date objects. |

## Examples

You have a table called `prison_releases` defined in `02-data-dictionary.yml` which contains anonymised, row-level data from a case management system about individuals released from prison. Since an individual can be "released" from more than one prison sentence at the same time, or because of data entry errors, the raw data may appear to, or actually, contain duplicates. You know your users only want a single record per release, so you choose to de-duplicate the data by deleting all but one instance of each duplicate. This is a table-level cleaning process which would be documented in `03-transformations.yml` like this:

```yaml
cleansing:
  prison_releases:
    - type: "remove-duplicates"
      description: "Duplicate releases removed by choosing the duplicate with the highest numbered record ID and discarding the others"
```

Your users aren't happy with that, so instead you apply a `GROUP BY` to your dataset, so that there is only one row per release but you retain information which may otherwise be lost:

```yaml
cleansing:
  prison_releases:
    - type: "group-duplicates"
      description: "Duplicate releases removed by concatenating multiple sentences into one string"
```

Multiple cleaning steps can be added. For example, you might want to clearly show any enhancements you added as part of the above `group-duplicates`:

```yaml
cleansing:
  prison_releases:
    - type: "group-duplicates"
      description: "Duplicate releases removed by concatenating multiple sentences into one string"
    - type: "data-enhancement"
      description: "Add a flag column to indicate which records were affected by the de-duplication process"
```

Furthermore, there may be a column of data which contains jargon or abbreviations which you wish to make more user-friendly. Assuming you cannot take the preferred approach of providing a reference table for this, you may decide to expand or replace these as part of your cleansing. For example, there's a column which contains some information about whether the release was early, on time, or late, but users of the case management system have taken to entering "E", "OT", "L", "VL", and there's no reference data or data entry validation for this. You use your knowledge of the system to provide an enhancement to expand those abbreviations.

```yaml
cleansing:
  prison_releases:
    - type: "group-duplicates"
      description: "Duplicate releases removed by concatenating multiple sentences into one string"
    - type: "data-enhancement"
      description: "Add a flag column to indicate which records were affected by the de-duplication process"
  columns:
    timely_release:
      - type: "abbreviation-expansion"
        description: "Replace data entry shorthand with expanded text (e.g. 'VL' => 'very late'); or 'unknown'"
```
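
For context, the `prison_releases` table and its `timely_release` column used in these examples would already be described in `02-data-dictionary.yml`. The sketch below is illustrative only; its structure and field names are assumptions rather than the documented schema (see [Data dictionary guidance](./data-dictionary.md)).

```yaml
# Illustrative sketch only - structure and field names are assumed
prison_releases:
  description: "Anonymised, row-level data about individuals released from prison"
  columns:
    timely_release:
      type: string
      friendly_name: "Timeliness of release"
      description: "Whether the release was early, on time, late or very late"
```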

Our [example data product](./_example/) contains a few more examples.

### Data with no cleansing applied

If your data has not undergone any cleansing, we suggest you make this explicit by adding `type: none`; otherwise we will assume there has been undocumented cleansing. An exception is made for tables [defined as reference data](./_example/02-data-dictionary.yml#L67) - here we assume no cleansing has been applied if none is specified.
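
A minimal sketch of how that might look, assuming the same list structure as the examples above:

```yaml
cleansing:
  prison_releases:
    - type: none
```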

## Considerations

Wherever possible, it is better to have clean data captured at source. If you are regularly applying certain types of cleaning, consider whether it is possible and practical for additional data validation to be added to the originating system(s).

<!--## Template generation

Our roadmap contains plans for tools to aid in template generation - for example generating a skeleton `03-transformations.yml` given a `02-data-dictionary.yml` as input.-->

## Suggesting changes

If you wish to suggest additions or improvements to the cleansing types, please [follow our guidance](https://github.com/ministryofjustice/data-platform-products) on submitting a pull request.

## Further reading

[Index of documentation for data product definition](../README.md#defining-a-data-product)

[Example data product](../_example/)