Exact Deduplication Transform

Exact deduplication transform identifies and removes identical documents in a dataset by comparing them hash-for-hash to ensure exact matching. Please see the set of transform project conventions for details on general project conventions, transform configuration, testing and IDE set up.

Contributors

Boris Lublinsky (blublinsk@ibm.com)

Description

This Python implementation of the exact deduplication transform uses "streaming" deduplication based on a central hash. As shown below, it relies on a distributed hash cache and data processors that read documents, generate hashes, coordinate with the cache to remove duplicates, and store unique documents in the data plane.

Mapping this model to the transform model is complicated by the need for a hash cache, which the transform model does not recognize. The solution is to have the transform runtime create the hash cache and pass it as a parameter to the transforms. The transform runtime handles hash cache creation and enhances statistics with details about cache size and utilization.

Incremental Execution and Snapshotting

The current implementation includes snapshotting, where the hash cache is saved to storage (local disk or S3) at the end of execution. This enables incremental deduplication: you can run deduplication on existing files, save the hash cache, and later load the snapshot to deduplicate only new files, avoiding reprocessing the entire dataset.

Input Columns Used by This Transform

Input Column Name	Data Type	Description
Column specified by the contents_column configuration argument	str	Column that stores document text
Column specified by the document_id_column configuration argument	int64	Column that stores document ID

Output Columns Annotated by This Transform

This transform does not perform any annotations; it only filters out the documents that are marked as duplicates.

Configuration

The set of dictionary keys holding EdedupTransform configuration for values (common for Python and Ray) are as follows:

doc_column - specifies name of the column containing documents
doc_id_column - specifies the name of the column containing a document id
use_snapshot - specifies that ededup execution starts with a set of pre-existing hashes, enabling incremental execution
snapshot_directory - specifies the directory for reading snapshots. If not provided, the default is output_folder/snapshot

Usage

The following command line arguments (corresponding to the configuration keys described above) are available in addition to the options provided by the python launcher.

  --ededup_doc_column EDEDUP_DOC_COLUMN
                        name of the column containing document
  --ededup_doc_id_column EDEDUP_DOC_ID_COLUMN
                        name of the column containing document id
  --ededup_use_snapshot EDEDUP_USE_SNAPSHOT
                        flag to continue from snapshot
  --ededup_snapshot_directory EDEDUP_SNAPSHOT_DIRECTORY
                        location of snapshot files

Code example

notebook

Transforming data using the transform image

To use the transform image to transform your data, please refer to the running images quickstart, substituting the name of this transform image and runtime as appropriate.

Testing

Following the testing strategy of data-processing-lib

Currently we have:

Unit test
Integration test

Exact Dedup Ray Annotator

Please see the set of transform project conventions for details on general project conventions, transform configuration, testing and IDE set up.

Additional parameters

In addition to common ededup parameters, Ray implementation provides two additional ones

hash_cpu - specifies amount of CPU per hash actor
num_hashes - specifies number of hash actors

Additional support

We also provide an estimate to roughly determine cluster size for running transformer.

Running the samples

To run the samples, use the following make target

run-cli-sample - runs dpk_ededup/ray/transform.py using command line args

This target will activate the virtual environment and set up any configuration needed. Use the -n option of make to see the detail of what is done to run the sample.

For example,

make run-cli-sample
...

Then

ls output

To see results of the transform.

Code example

notebook

Launched Command Line Options

When running the transform with the Ray launcher (i.e., RayTransformLauncher), these additional command line arguments are available the options provided by the launcher.

  --ededup_hash_cpu EDEDUP_HASH_CPU
                        number of CPUs per hash
  --ededup_num_hashes EDEDUP_NUM_HASHES
                        number of hash actors to use
  --ededup_doc_column EDEDUP_DOC_COLUMN
                        name of the column containing document
  --ededup_doc_id_column EDEDUP_DOC_ID_COLUMN
                        name of the column containing document id
  --ededup_use_snapshot EDEDUP_USE_SNAPSHOT
                        flag to continue from snapshot
  --ededup_snapshot_directory EDEDUP_SNAPSHOT_DIRECTORY
                        location of snapshot files

These correspond to the configuration keys described above.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Exact Deduplication Transform

Contributors

Description

Incremental Execution and Snapshotting

Input Columns Used by This Transform

Output Columns Annotated by This Transform

Configuration

Usage

Code example

Transforming data using the transform image

Testing

Exact Dedup Ray Annotator

Additional parameters

Additional support

Running the samples

Code example

Launched Command Line Options

Files

README.md

Latest commit

History

README.md

File metadata and controls

Exact Deduplication Transform

Contributors

Description

Incremental Execution and Snapshotting

Input Columns Used by This Transform

Output Columns Annotated by This Transform

Configuration

Usage

Code example

Transforming data using the transform image

Testing

Exact Dedup Ray Annotator

Additional parameters

Additional support

Running the samples

Code example

Launched Command Line Options