Merge pull request #4 from TonicAI/f/readmup-update
README Updates
akamor authored Nov 15, 2024
2 parents a6549d6 + d0ae423 commit d68746d
Showing 2 changed files with 92 additions and 86 deletions.
178 changes: 92 additions & 86 deletions README.md
@@ -1,64 +1,56 @@
<a id="readme-top"></a>

<h1 align="center">
<img style="vertical-align:middle" height="200" src="https://raw.githubusercontent.com/TonicAI/textual/main/images/textual-logo.png">
</h1>

<p align="center">Unblock AI initiatives by maximizing your free-text assets through realistic data de-identification and high-quality data extraction 🚀</p>

<p align="center">
<a href="https://www.python.org/">
<img alt="Build" src="https://img.shields.io/badge/Made%20with-Python-1f425f.svg?color=purple">
<img alt="Build" src="https://img.shields.io/badge/Made%20with-Python-1f425f.svg?color=purple">
</a>
<a href="https://github.com/tonicai/textual_sdk_internal/blob/master/LICENSE">
<img alt="License" src="https://img.shields.io/badge/license-MIT-blue">
<img alt="License" src="https://img.shields.io/badge/license-MIT-blue">
</a>
<a href='https://tonic-ai-textual-sdk.readthedocs-hosted.com/en/latest/?badge=latest'>
<img src='https://readthedocs.com/projects/tonic-ai-textual-sdk/badge/?version=latest' alt='Documentation Status' />
</a>
</p>

<p align="center">
<a href="https://tonic-ai-textual-sdk.readthedocs-hosted.com/en/latest/">Documentation</a>
|
<a href="https://textual.tonic.ai/signup">Get an API key</a>
|
<a href="https://github.com/tonicai/textual_sdk/issues/new?labels=bug&template=bug-report---.md">Report a bug</a>
|
<a href="https://github.com/tonicai/textual_sdk/issues/new?labels=enhancement&template=feature-request---.md">Request a feature</a>
</p>

<!-- PROJECT LOGO -->
<br />
<div align="center">
<a href="https://github.com/tonicai/textual_sdk">
<img src="https://raw.githubusercontent.com/TonicAI/textual/main/images/tonic-textual.svg" alt="Logo" width="80" height="80">
</a>
<h1 align="center">Tonic Textual</h1>
Textual makes it easy to build safe AI models and applications on sensitive customer data. It is used across industries, with a primary focus on finance, healthcare, and customer support. Build safe models by using Textual to identify customer PII/PHI, then generate synthetic text and documents that you can use to train your models without inadvertently embedding PII/PHI into your model weights.

<h3 align="center">Tonic Textual SDK for Python</h3>
<p align="center">
<p>AI-ready data, with privacy at the core. Unblock AI initiatives by maximizing your free-text assets through realistic data de-identification and high quality data extraction</p>
</p>
<br />
<a href="https://tonic-ai-textual-sdk.readthedocs-hosted.com/en/latest/"><strong>Explore the docs »</strong></a>
<br />
<br />
<a href="https://textual.tonic.ai/signup">Get an API Key</a>
·
<a href="https://github.com/tonicai/textual_sdk/issues/new?labels=bug&template=bug-report---.md">Report Bug</a>
·
<a href="https://github.com/tonicai/textual_sdk/issues/new?labels=enhancement&template=feature-request---.md">Request Feature</a>
</p>
</div>
Textual comes with built-in data pipeline functionality so that it scales with you. Use our SDK to redact text or to extract relevant information from complex documents before you build your data pipelines.


## Key Features

- 🔎 NER. Our models are fast and accurate. Use them on real-world, complex, and messy unstructured data to find the exact entities that you care about.
- 🧬 Synthesis. We don't just find sensitive data. We also synthesize it, to provide you with a new version of your data that
is suitable for model training and AI development.
- ⛏️ Extraction. We support a variety of file formats in addition to txt. We can extract interesting data from PDFs, DOCX files, images, and more.


<!-- TABLE OF CONTENTS -->

## Table of Contents
## Contents
<ol>
<li>
<a href="#getting-started">Getting Started</a>
<ul>
<li><a href="#prerequisites">Prerequisites</a></li>
<li><a href="#installation">Installation</a></li>
</ul>
</li>
<li>
<a href="#usage">Usage</a>
<ul>
<li><a href="#ner_usage">NER Usage</a></li>
<li><a href="#parse_usage">Parse Usage</a></li>
<li><a href="#ui_automation">UI Automation</a></li>
</ul>
</li>
<li><a href="#roadmap">Bug Reports and Feature Requests</a></li>
<li><a href="#prerequisites">Prerequisites</a></li>
<li><a href="#getting-started">Getting started</a></li>
<li><a href="#ner_usage">NER usage</a></li>
<li><a href="#parse_usage">Parse usage</a></li>
<li><a href="#ui_automation">UI automation</a></li>
<li><a href="#roadmap">Bug reports and feature requests</a></li>
<li><a href="#contributing">Contributing</a></li>
<li><a href="#license">License</a></li>
<li><a href="#contact">Contact</a></li>
@@ -69,23 +61,26 @@
<!-- GETTING STARTED -->
## Prerequisites

1. Get a free API Key at [Textual](https://textual.tonic.ai)
1. Get a free API key at [Textual](https://textual.tonic.ai).
2. Install the package from PyPI
```sh
pip install tonic-textual
```
3. Your API Key can be passed as an argument directly into SDK calls or you can save it to your environment
3. You can pass your API key as an argument directly into SDK calls, or you can save it to your environment.
```sh
export TONIC_TEXTUAL_API_KEY=<API Key>
```

<p align="right">(<a href="#readme-top">back to top</a>)</p>

### Getting Started
## Getting started

This library supports two different workflows, NER detection (along with entity tokenization and synthesis) and data extraction of unstructured files like PDF and Office documents (docx, xlsx).
This library supports the following workflows:

Each workflow, has its own respective client. Each client, supports the same set of constructor arguments.
* NER detection, along with entity tokenization and synthesis
* Data extraction from unstructured files such as PDFs and Office documents (docx, xlsx)

Each workflow has its own client. Each client supports the same set of constructor arguments.

```
from tonic_textual.redact_api import TextualNer
@@ -95,29 +90,31 @@ textual_ner = TextualNer()
textual_parse = TextualParse()
```

Both clients support the following optional arguments
Both clients support the following optional arguments:

1. base_url - The URL of the server, hosting Tonic Textual. Defaults to https://textual.tonic.ai
- ```base_url``` - The URL of the server that hosts Tonic Textual. Defaults to https://textual.tonic.ai

2. api_key - Your API key. If not specified you must set the TONIC_TEXTUAL_API_KEY in your environment
- ```api_key``` - Your API key. If not specified, you must set TONIC_TEXTUAL_API_KEY in your environment.

3. verify - Whether SSL Certification verification is performed. Default is enabled.
- ```verify``` - Whether to verify SSL certificates. Default is true.
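
As a rough sketch of how these arguments fit together, the following helper resolves the API key from an explicit argument or from ```TONIC_TEXTUAL_API_KEY```, then passes the three documented arguments to a client. ```resolve_api_key``` and ```make_ner_client``` are hypothetical helpers written for this example. They are not part of the SDK, which does its own lookup when ```api_key``` is omitted.

```python
import os

DEFAULT_BASE_URL = "https://textual.tonic.ai"


def resolve_api_key(api_key=None):
    """Return the explicit key if given, else the TONIC_TEXTUAL_API_KEY environment variable."""
    key = api_key or os.environ.get("TONIC_TEXTUAL_API_KEY")
    if not key:
        raise ValueError("Pass api_key or set TONIC_TEXTUAL_API_KEY in your environment.")
    return key


def make_ner_client(base_url=DEFAULT_BASE_URL, api_key=None, verify=True):
    """Build a TextualNer client with the three optional arguments listed above."""
    # Imported lazily so that this module loads even where the SDK is not installed.
    from tonic_textual.redact_api import TextualNer

    return TextualNer(base_url=base_url, api_key=resolve_api_key(api_key), verify=verify)
```

Failing early with a clear message when no key is available can be easier to debug than an authentication error on the first request.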



<!-- USAGE -->

<!-- NER USAGE -->
## NER Usage

Textual can identify entities within free text. It works on both raw text and on content found within files such as pdf, docx, xlsx, images, txt, and csv files. For raw text,
## NER usage

Textual can identify entities within free text. It works on raw text and on content from files, including pdf, docx, xlsx, images, txt, and csv files.

### Free text

```python
raw_redaction = textual_ner.redact("My name is John and I live in Atlanta.")
```

The ```raw_redaction``` returns a response like the following:
```raw_redaction``` returns a response similar to the following:

```json
{
@@ -151,23 +148,23 @@ The ```raw_redaction``` returns a response like the following:
}
```

The ```redacted_text``` property provides the new text, with identified entities replaced with tokenized values. Each identified entity will be listed in the ```de_identify_results``` array.
The ```redacted_text``` property provides the new text. In the new text, identified entities are replaced with tokenized values. Each identified entity is listed in the ```de_identify_results``` array.

In addition to tokenizing entities, they can also be synthesized. To synthesize specific entities use the optional ```generator_config``` argument.
You can also choose to synthesize entities instead of tokenizing them. To synthesize specific entities, use the optional ```generator_config``` argument.

```python
raw_redaction = textual_ner.redact("My name is John and I live in Atlanta.", generator_config={'LOCATION_CITY':'Synthesis', 'NAME_GIVEN':'Synthesis'})
```

This will generate a new ```redacted_text``` value in the response with synthetic entites. For example, it could look like
In the response, this generates a new ```redacted_text``` value that contains the synthetic entities. For example:

| My name is Alfonzo and I live in Wilkinsburg.
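
When several entity types need synthesis, the ```generator_config``` dictionary can be built programmatically. This is a minimal sketch that assumes only the ```'Synthesis'``` value and the entity labels shown above. ```synthesis_config``` and ```redact_with_synthesis``` are hypothetical helper names, not SDK functions.

```python
def synthesis_config(entity_labels):
    """Map each entity label to 'Synthesis', the generator_config value used above."""
    return {label: "Synthesis" for label in entity_labels}


def redact_with_synthesis(textual_ner, text, entity_labels):
    """Call redact() on an existing TextualNer client with the generated config."""
    return textual_ner.redact(text, generator_config=synthesis_config(entity_labels))


config = synthesis_config(["LOCATION_CITY", "NAME_GIVEN"])
print(config)
```

With a client created as shown earlier, ```redact_with_synthesis(textual_ner, "My name is John and I live in Atlanta.", ["LOCATION_CITY", "NAME_GIVEN"])``` would issue the same request as the example above.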

### Files

Textual can also identify, tokenize, and synthesize text within files such as PDF and DOCX. The result is a new file with specified entities either tokenized or synthesized.
Textual can also identify, tokenize, and synthesize text within files such as PDF and DOCX. The result is a new file where the specified entities are either tokenized or synthesized.

To generate a redacted file,
To generate a redacted file:

```python
with open('file.pdf','rb') as f:
@@ -178,23 +175,23 @@ with open('redacted_file.pdf','wb') as of:
of.write(file_bytes)
```

The ```download_redacted_file``` takes similar arguments to the ```redact()``` method and supports a ```generator_config``` parameter to adjust which entities are tokenized and synthesized.
The ```download_redacted_file``` method takes similar arguments to the ```redact()``` method. It also supports a ```generator_config``` parameter to adjust which entities are tokenized and synthesized.

### Consistency

When entities are tokenized, the tokenized values we generate are unique to the original value. A given entity will also generate to the same, unique token. Tokens can be mapped back to their original value via the ```unredact``` function call.
When entities are tokenized, the tokenized values are unique to the original value. A given entity always maps to the same unique token. To map a token back to its original value, use the ```unredact``` function call.

Synthetic entities are consistent. This means, a given entity, such as 'Atlanta' will always get mapped to the same fake city. Synthetic values can potentially collide and are not reversible.
Synthetic entities are consistent. This means that a given entity, such as 'Atlanta', is always mapped to the same fake city. Synthetic values can potentially collide and are not reversible.

To change the underlying mapping of both tokens and synthetic values, you can pass in the optional ```random_seed``` parameter in the ```redact()``` function call.
To change the underlying mapping of both tokens and synthetic values, pass the optional ```random_seed``` parameter to the ```redact()``` function call.
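
To make the idea of consistency concrete, here is a toy illustration. It is not Textual's algorithm, and the token format is invented for this example. The hypothetical ```toy_token``` function hashes the seed, label, and original value together, so the same inputs always yield the same token and a different seed yields a different mapping.

```python
import hashlib


def toy_token(value, label, random_seed=0):
    """Toy stand-in: derive a stable token from the seed, label, and original value."""
    digest = hashlib.sha256(f"{random_seed}:{label}:{value}".encode("utf-8")).hexdigest()
    return f"[{label}_{digest[:8]}]"


# Same value and same seed: the two tokens are identical.
same_a = toy_token("Atlanta", "LOCATION_CITY")
same_b = toy_token("Atlanta", "LOCATION_CITY")

# A different seed feeds different input to the hash, which changes the mapping.
reseeded = toy_token("Atlanta", "LOCATION_CITY", random_seed=42)

print(same_a == same_b)
print(same_a, reseeded)
```

In the SDK itself, the equivalent check would be to call ```redact()``` twice on the same text with the same ```random_seed``` and compare the ```redacted_text``` values.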

_For more examples, please refer to the [Documentation](https://textual.tonic.ai/docs/index.html)_
_For more examples, refer to the [Textual SDK documentation](https://textual.tonic.ai/docs/index.html)._

<p align="right">(<a href="#readme-top">back to top</a>)</p>

## Parse Usage
## Parse usage

Textual supports the extraction of text and other content from files. Textual currently supports
Textual supports the extraction of text and other content from files. Textual currently supports:

- pdf
- png, tif, jpg
@@ -203,7 +200,7 @@ Textual supports the extraction of text and other content from files. Textual c

Textual takes these unstructured files and converts them to a structured representation in JSON.

The JSON output has file specific pieces, for example, table and KVP detection is performed on PDFs and images but all files support the following JSON properties:
The JSON output has file-specific pieces. For example, table and KVP detection is only performed on PDFs and images. However, all files support the following JSON properties:

```json
{
@@ -225,9 +222,13 @@ The JSON output has file specific pieces, for example, table and KVP detection i
}
```

PDFs and images additionally have properties for ```tables``` and ```kvps```. DocX files have support for ```headers```, ```footers```, and ```endnotes``` and Xlsx files break content down a per-sheet basis.
PDFs and images have additional properties for ```tables``` and ```kvps```.

DocX files support ```headers```, ```footers```, and ```endnotes```.

For a detailed breakdown of the JSON schema for each file type please reference on documentation, [here](https://docs.tonic.ai/textual/pipelines/viewing-pipeline-results/pipeline-json-structure).
Xlsx files break down the content by individual sheet.

For a detailed breakdown of the JSON schema for each file type, go to the [JSON schema information in the Textual guide](https://docs.tonic.ai/textual/pipelines/viewing-pipeline-results/pipeline-json-structure).


To parse a file one time, you can use our SDK.
@@ -237,32 +238,34 @@ with open('invoice.pdf','rb') as f:
parsed_file = textual_parse.parse_file(f.read(), 'invoice.pdf')
```

The parsed_file is a ```FileParseResult``` type and has various helper methods to retrieve content from the document.
```parsed_file``` is a ```FileParseResult``` object, which has helper methods that you can use to retrieve content from the document.

- ```get_markdown(generator_config={})``` retrieves the document as markdown. The markdown can be optionally tokenized/synthesized by passing in a list of entities to ```generator_config```
- ```get_markdown(generator_config={})``` retrieves the document as Markdown. To tokenize or synthesize the Markdown, pass in ```generator_config```, which maps entity types to how each one is handled.

- ```get_chunks(generator_config={}, metadata_entities=[])``` chunks the files in a form suitable for vector DB ingestion. Chunks can be tokenized/synthesized and additionally can be enriched with entity level metadata by providing a list of entities. The entity list should be entities that are relevant to questions being asked to the RAG system. e.g. if you are building a RAG for front line customer support reps, you might expect to include 'PRODUCT' and 'ORGANIZATION' as metadata entities.
- ```get_chunks(generator_config={}, metadata_entities=[])``` chunks the files in a form suitable for vector database ingestion. To tokenize or synthesize the chunks, pass in ```generator_config```. To enrich the chunks with entity-level metadata, pass a list of entities in ```metadata_entities```. The listed entities should be relevant to the questions that are asked of the RAG system. For example, if you are building a RAG system for frontline customer support reps, you might include 'PRODUCT' and 'ORGANIZATION' as metadata entities.
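
The two helper methods might be wrapped as follows. This is a sketch under stated assumptions: ```save_markdown``` and ```support_chunks``` are hypothetical names, ```get_markdown``` is assumed to return a string, and nothing is assumed about the shape of the value that ```get_chunks``` returns.

```python
from pathlib import Path


def save_markdown(parsed_file, out_path, generator_config=None):
    """Write the Markdown from a FileParseResult-like object to out_path."""
    # An empty dict matches the default shown in the get_markdown signature above.
    markdown = parsed_file.get_markdown(generator_config=generator_config or {})
    path = Path(out_path)
    path.write_text(markdown, encoding="utf-8")
    return path


def support_chunks(parsed_file):
    """Request chunks enriched with the metadata entities from the example above."""
    return parsed_file.get_chunks(metadata_entities=["PRODUCT", "ORGANIZATION"])
```

Both functions take the result of ```parse_file``` as their first argument, for example ```save_markdown(parsed_file, 'invoice.md')```.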

In addition for processing files from you local system, you can reference files directly in S3. The ```parse_s3_file``` function call behaves the same as ```parse_file``` but requires a bucket and key argument to specify your specific file in S3. It uses boto3 to retrieve files in S3.
In addition to processing files from your local system, you can reference files directly from Amazon S3. The ```parse_s3_file``` function call behaves the same as ```parse_file```, but requires a bucket and key argument to specify your specific file in Amazon S3. It uses boto3 to retrieve the files from Amazon S3.
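
If your file locations are stored as ```s3://``` URIs, a small helper can split them into the bucket and key that ```parse_s3_file``` needs. This is a sketch: ```split_s3_uri``` and ```parse_from_s3_uri``` are hypothetical helpers, and passing the bucket and key positionally in that order is an assumption based on the description above.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/path/to/key' into (bucket, key)."""
    # Note: urlparse treats '?' and '#' as delimiters, so keys that contain
    # those characters need different handling.
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"Not a valid S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def parse_from_s3_uri(textual_parse, uri):
    """Parse an S3 object with an existing TextualParse client."""
    bucket, key = split_s3_uri(uri)
    # Assumption: bucket and key are passed positionally, in that order.
    return textual_parse.parse_s3_file(bucket, key)
```

Because the SDK uses boto3 to retrieve the file, AWS credentials that boto3 can find must be available in the environment where this code runs.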

_For more examples, please refer to the [Documentation](https://textual.tonic.ai/docs/index.html)_
_For more examples, refer to the [Textual SDK documentation](https://textual.tonic.ai/docs/index.html)._

<p align="right">(<a href="#readme-top">back to top</a>)</p>


## UI Automation
## UI automation

The Textual UI supports file redaction and parsing. It provides an experience for users to orchestrate jobs and process files at scale. It supports integrations with various bucket solutions such as Amazon S3, as well as systems such as Sharepoint and Databricks Unity Catalog volumes.

The Textual UI supports file redactionand parsing. It provides an experience for users to orchestrate jobs and process files at scale. It supports integrations with various bucket solutions like S3 as well as systems like Sharepoint and Databricks Unity Catalog volumes. Actions such as building smart pipelines (for parsing) and Dataset collections (file redaction) can be completed via the SDK.
You can use the SDK for actions such as building smart pipelines (for parsing) and dataset collections (for file redaction).

_For more examples, please refer to the [Documentation](https://textual.tonic.ai/docs/index.html)_
_For more examples, refer to the [Textual SDK documentation](https://textual.tonic.ai/docs/index.html)._

<p align="right">(<a href="#readme-top">back to top</a>)</p>


<!-- ROADMAP -->
## Bug Reports and Feature Requests
## Bug reports and feature requests

Bugs and Feature requests can be submitted via the [open issues](https://github.com/tonicai/textual_sdk/issues). We try to be responsive here so any issues filed should expect a prompt response from the Textual team.
To submit a bug report or feature request, go to [open issues](https://github.com/tonicai/textual_sdk/issues). The Textual team tries to respond promptly to any issue that you file.

<p align="right">(<a href="#readme-top">back to top</a>)</p>

@@ -272,22 +275,25 @@ Bugs and Feature requests can be submitted via the [open issues](https://github.

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement".
Don't forget to give the project a star! Thanks again!
If you have a suggestion that would make this better, fork the repo and create a pull request.

1. Fork the project
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a pull request

1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
You can also simply open an issue with the tag "enhancement".

Don't forget to give the project a star! Thanks again!

<p align="right">(<a href="#readme-top">back to top</a>)</p>


<!-- LICENSE -->
## License

Distributed under the MIT License. See `LICENSE.txt` for more information.
Distributed under the MIT License. For more information, see `LICENSE.txt`.


<!-- CONTACT -->
Binary file added images/textual-logo.png