Add bulk method, revamp documentation structure #7

Merged
merged 6 commits into from
Nov 21, 2024
2 changes: 2 additions & 0 deletions docs/source/index.rst
@@ -25,3 +25,5 @@ The quickstart provides information on how to install the SDK and set up an API
quickstart/getting_started
parse/index
redact/index
redact/api
parse/api
1 change: 0 additions & 1 deletion docs/source/parse/index.rst
@@ -15,4 +15,3 @@ To learn more about how to use Textual to redact entities within text and files
parsing_files
pipelines
working_with_parsed_output
api
48 changes: 32 additions & 16 deletions docs/source/redact/index.rst
@@ -1,28 +1,39 @@
Redact
=============

The Textual redact functionality allows you to identify entities in files, and then optionally redact/synthesize these entities to create a safe version of your unstructured text. This functionality works on both raw strings and files, including PDF, DOCX, XLSX, and other formats.
The Textual redact functionality allows you to identify entities in files, and then optionally tokenize/synthesize these entities to create a safe version of your unstructured text. This functionality works on both raw strings and files, including PDF, DOCX, XLSX, and other formats.

Before you can use these functions, read the :doc:`Getting started <../quickstart/getting_started>` guide and create an API key.

Redacting strings
Redacting Text
-----------------

To identify entities in a raw string, call the **redact** function.
You can redact text directly in a variety of formats, such as plain text, JSON, XML, and HTML. All redaction requests return a response that includes the original text, the redacted text, and a list of the entities found along with their locations. Additionally, all redact functions allow you to specify which entities are tokenized and which are synthesized.

.. code-block:: python

from tonic_textual.redact_api import TextualNer
The inputs common to all redact functions are:

textual = TextualNer("https://textual.tonic.ai")

raw_redaction = textual.redact("My name is John, and today I am demo-ing Textual, a software product created by Tonic")
* **generator_default**
The default operation performed on an entity. The options are 'Redact', 'Synthesis', and 'Off'.
* **generator_config**
A dictionary whose keys are entity labels and whose values specify how to handle that entity. The options are 'Redact', 'Synthesis', and 'Off'.

Example: {'NAME_GIVEN': 'Synthesis'}
* **label_allow_lists**
A dictionary whose keys are entity labels and whose values are lists of regexes. If a piece of text matches a regex, it is flagged as that entity type.

Example: {'HEALTHCARE_ID': [r'[a-zA-Z]{3}\\d{6,}']}
* **label_block_lists**
A dictionary whose keys are entity labels and whose values are lists of regexes. If a piece of text matches a regex, it is ignored for that entity type.

Example: {'NUMERIC_VALUE': [r'\\d{3}']}

The response provides a list of identified entities, with information about each entity.
The JSON and XML redact functions also have additional inputs which you can read about in their respective sections.
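
The inputs listed above are ordinary Python values, so their shape can be sketched without calling the SDK. The snippet below is only an illustration: ``action_for`` and ``matches_any`` are hypothetical helpers, the precedence rule (a per-label entry overrides the default) and the use of ``re.search`` are assumptions, and the SDK's own matching may differ. Note that in Python source the regexes are written with a single backslash inside a raw string.

.. code-block:: python

    import re

    # Hypothetical values for the inputs described above.
    generator_default = "Off"
    generator_config = {"NAME_GIVEN": "Synthesis"}
    label_allow_lists = {"HEALTHCARE_ID": [r"[a-zA-Z]{3}\d{6,}"]}
    label_block_lists = {"NUMERIC_VALUE": [r"\d{3}"]}


    def action_for(label):
        """Assumed precedence: a per-label entry overrides the default."""
        return generator_config.get(label, generator_default)


    def matches_any(patterns, text):
        """Check text against a list of regexes (search semantics assumed)."""
        return any(re.search(pattern, text) for pattern in patterns)


    print(action_for("NAME_GIVEN"))    # Synthesis
    print(action_for("ORGANIZATION"))  # Off
    print(matches_any(label_allow_lists["HEALTHCARE_ID"], "ABC1234567"))  # True
    print(matches_any(label_block_lists["NUMERIC_VALUE"], "42"))          # False

Checking the patterns locally in this way can catch regex typos before the dictionaries are passed to a redact function.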

It also returns a redacted string that replaces the found entities with tokens. You can configure how to handle each type of entity: whether to redact or synthesize it.
.. toctree::
:hidden:
:maxdepth: 2

To learn more about how to redact raw strings, go to :doc:`Redacting text <redacting_text>`.
redacting_text

Redacting files
---------------
@@ -51,6 +62,12 @@ To generate redacted/synthesized files:

To learn more about how to generate redacted and synthesized files, go to :doc:`Redacting files <redacting_files>`.

.. toctree::
:maxdepth: 2
:hidden:

redacting_files

Working with datasets
---------------------

@@ -60,8 +77,7 @@ To help automate workflows, you can work with datasets directly from the SDK. To


.. toctree::

redacting_text
redacting_files
:maxdepth: 2
:hidden:

datasets
api
214 changes: 144 additions & 70 deletions docs/source/redact/redacting_text.rst
@@ -1,6 +1,3 @@
🅰 Text
=========================

Redact raw text
---------------
To redact sensitive information from a text string, pass the string to the `redact` method:
@@ -18,7 +15,7 @@ This produces the following output:

.. code-block:: console

My name is Alfonzo, and today I am demoing Textual, a software product created by New Ignition Worldwide
My name is [NAME_GIVEN_HI1h7], and [DATE_TIME_4hKfrH] I am demoing Textual, a software product created by [ORGANIZATION_P5XLAH]
{
"start": 11,
"end": 15,
@@ -28,97 +25,94 @@ This produces the following output:
"text": "John",
"score": 0.9,
"language": "en",
"new_text": "[NAME_GIVEN_dySb5]"
"new_text": "[NAME_GIVEN_HI1h7]"
}
{
"start": 79,
"end": 84,
"new_start": 93,
"new_end": 114,
"label": "ORGANIZATION",
"text": "Tonic",
"score": 0.9,
"language": "en",
"new_text": "[ORGANIZATION_5Ve7OH]"
}

Synthesize raw text
-------------------
The following example passes the same string to the `redact` method, but sets some categories to `Synthesis`, which indicates to use realistic replacement values:

.. code-block:: python

from tonic_textual.redact_api import TextualNer

textual = TextualNer()
generator_config = {"NAME_GIVEN":"Synthesis", "ORGANIZATION":"Synthesis"}
raw_synthesis = textual.redact(
"My name is John, and today I am demoing Textual, a software product created by Tonic",
generator_config=generator_config)
print(raw_synthesis.describe())

This produces the following output:

.. code-block:: console

My name is Alfonzo, and today I am demoing Textual, a software product created by New Ignition Worldwide
{
"start": 11,
"end": 15,
"new_start": 11,
"new_end": 18,
"label": "NAME_GIVEN",
"text": "John",
"start": 21,
"end": 26,
"new_start": 35,
"new_end": 53,
"label": "DATE_TIME",
"text": "today",
"score": 0.9,
"language": "en",
"new_text": "Alfonzo"
"new_text": "[DATE_TIME_4hKfrH]"
}
{
"start": 79,
"end": 84,
"new_start": 82,
"new_end": 104,
"new_start": 106,
"new_end": 127,
"label": "ORGANIZATION",
"text": "Tonic",
"score": 0.9,
"language": "en",
"new_text": "New Ignition Worldwide"
}
"new_text": "[ORGANIZATION_P5XLAH]"
}

Using LLM synthesis
-------------------
The following example passes the same string to the `llm_synthesis` method:
Bulk redact raw text
---------------------
Just as the `redact` method redacts a single string, the `redact_bulk` method redacts many strings in one call. Each string is redacted independently: the strings are passed to the model separately, so the content of one string cannot affect the results for another. To redact sensitive information from a list of text strings, pass the list to the `redact_bulk` method:

.. code-block:: python

from tonic_textual.redact_api import TextualNer

textual = TextualNer()

raw_synthesis = textual.llm_synthesis("My name is John, and today I am demoing Textual, a software product created by Tonic")
print(raw_synthesis.describe())
raw_redaction = textual.redact_bulk(["Tonic was founded in 2018", "John Smith is a person"])
print(raw_redaction.describe())

This produces the following output:

.. code-block:: console

My name is Matthew, and today I am demoing Textual, a software product created by Google.
{
"start": 11,
"end": 15,
"label": "NAME_GIVEN",
"text": "John",
"score": 0.9
}
{
"start": 79,
"end": 84,
"label": "ORGANIZATION",
"text": "Tonic",
"score": 0.9
}

Note that LLM synthesis is non-deterministic, so you will likely get different results each time you run it.
[ORGANIZATION_5Ve7OH] was founded in [DATE_TIME_DnuC1]
{
"start": 0,
"end": 5,
"new_start": 0,
"new_end": 21,
"label": "ORGANIZATION",
"text": "Tonic",
"score": 0.9,
"language": "en",
"new_text": "[ORGANIZATION_5Ve7OH]"
}
{
"start": 21,
"end": 25,
"new_start": 37,
"new_end": 54,
"label": "DATE_TIME",
"text": "2018",
"score": 0.9,
"language": "en",
"new_text": "[DATE_TIME_DnuC1]"
}
[NAME_GIVEN_dySb5] [NAME_FAMILY_7w4Db3] is a person
{
"start": 0,
"end": 4,
"new_start": 0,
"new_end": 18,
"label": "NAME_GIVEN",
"text": "John",
"score": 0.9,
"language": "en",
"new_text": "[NAME_GIVEN_dySb5]"
}
{
"start": 5,
"end": 10,
"new_start": 19,
"new_end": 39,
"label": "NAME_FAMILY",
"text": "Smith",
"score": 0.9,
"language": "en",
"new_text": "[NAME_FAMILY_7w4Db3]"
}

Redact JSON data
----------------
@@ -220,3 +214,83 @@ To redact sensitive information from HTML, pass the HTML document string to the
xml_redaction = textual.redact_html(html_content)

The response includes entity-level information, including the XPATH at which the sensitive entity is found. The start and end positions are relative to the beginning of the XPATH location where the entity is found.

Choosing tokenization or synthesis for raw text
-----------------------------------------------
You can choose whether a given entity is synthesized or tokenized. By default, all entities are tokenized. To specify which entities to synthesize or tokenize, use the `generator_config` parameter. This works the same way for all of our `redact` functions.

The following example passes the same string to the `redact` method, but sets some entities to `Synthesis`, so that they are replaced with realistic values:

.. code-block:: python

from tonic_textual.redact_api import TextualNer

textual = TextualNer()
generator_config = {"NAME_GIVEN":"Synthesis", "ORGANIZATION":"Synthesis"}
raw_synthesis = textual.redact(
"My name is John, and today I am demoing Textual, a software product created by Tonic",
generator_config=generator_config)
print(raw_synthesis.describe())

This produces the following output:

.. code-block:: console

My name is Alfonzo, and today I am demoing Textual, a software product created by New Ignition Worldwide
{
"start": 11,
"end": 15,
"new_start": 11,
"new_end": 18,
"label": "NAME_GIVEN",
"text": "John",
"score": 0.9,
"language": "en",
"new_text": "Alfonzo"
}
{
"start": 79,
"end": 84,
"new_start": 82,
"new_end": 104,
"label": "ORGANIZATION",
"text": "Tonic",
"score": 0.9,
"language": "en",
"new_text": "New Ignition Worldwide"
}
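
The `new_start` and `new_end` values in this output follow from simple offset arithmetic: each replacement shifts every later entity by the difference between the length of the new text and the length of the original text. The sketch below reproduces the numbers shown above in plain Python. It is not the SDK's implementation, and ``apply_replacements`` is a hypothetical helper introduced only for illustration.

.. code-block:: python

    def apply_replacements(text, entities):
        """Replace each (start, end, new_text) span and report its new offsets.

        Entities must be sorted by start and must not overlap.
        """
        pieces = []
        new_spans = []
        cursor = 0  # position reached in the original text
        shift = 0   # cumulative length change from earlier replacements
        for start, end, new_text in entities:
            pieces.append(text[cursor:start])
            pieces.append(new_text)
            new_start = start + shift
            new_spans.append((new_start, new_start + len(new_text)))
            shift += len(new_text) - (end - start)
            cursor = end
        pieces.append(text[cursor:])
        return "".join(pieces), new_spans


    original = ("My name is John, and today I am demoing Textual, "
                "a software product created by Tonic")
    entities = [(11, 15, "Alfonzo"), (79, 84, "New Ignition Worldwide")]

    new_text, new_spans = apply_replacements(original, entities)
    print(new_text)
    print(new_spans)  # [(11, 18), (82, 104)]

Replacing "John" with "Alfonzo" adds three characters, which is why "Tonic" moves from offset 79 in the original text to offset 82 in the new text.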

Using LLM synthesis
-------------------
The following example passes the same string to the `llm_synthesis` method:

.. code-block:: python

from tonic_textual.redact_api import TextualNer

textual = TextualNer()

raw_synthesis = textual.llm_synthesis("My name is John, and today I am demoing Textual, a software product created by Tonic")
print(raw_synthesis.describe())

This produces the following output:

.. code-block:: console

My name is Matthew, and today I am demoing Textual, a software product created by Google.
{
"start": 11,
"end": 15,
"label": "NAME_GIVEN",
"text": "John",
"score": 0.9
}
{
"start": 79,
"end": 84,
"label": "ORGANIZATION",
"text": "Tonic",
"score": 0.9
}

Note that LLM synthesis is non-deterministic, so you will likely get different results each time you run it.