added missing ray notebooks for doc_quality and filter #927

Merged 5 commits on Jan 14, 2025
6 changes: 6 additions & 0 deletions transforms/README-list.md
@@ -22,6 +22,7 @@ Note: This list includes the transforms that were part of the release starting w
* [header_cleanser (Not available on MacOS)](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/header_cleanser/python/README.md)
* [code_quality](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/code_quality/python/README.md)
* [proglang_select](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/proglang_select/python/README.md)
* [code_profiler](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/code_profiler/README.md)
* language
* [doc_chunk](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/doc_chunk/README.md)
* [doc_quality](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/doc_quality/README.md)
@@ -40,6 +41,11 @@ Note: This list includes the transforms that were part of the release starting w

## Release notes:

### 1.0.0.a4
Added missing Ray implementations for lang_id, doc_quality, tokenization, and filter
Added Ray notebooks for lang_id, doc_quality, tokenization, and filter
### 1.0.0.a3
Added code_profiler
### 1.0.0.a2
Relax dependencies on pandas (use latest or whatever is installed by application)
Relax dependencies on requests (use latest or whatever is installed by application)
9 changes: 4 additions & 5 deletions transforms/language/doc_quality/README.md
@@ -91,11 +91,6 @@ To see results of the transform.

[notebook](./doc_quality.ipynb)

### Transforming data using the transform image

To use the transform image to transform your data, please refer to the
[running images quickstart](../../../doc/quick-start/run-transform-image.md),
substituting the name of this transform image and runtime as appropriate.

## Testing

@@ -161,6 +156,10 @@ ls output
```
To see results of the transform.

### Code example (Ray)

[notebook](./doc_quality-ray.ipynb)


#### Transforming data using the transform image

140 changes: 140 additions & 0 deletions transforms/language/doc_quality/doc_quality-ray.ipynb
@@ -0,0 +1,140 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "afd55886-5f5b-4794-838e-ef8179fb0394",
"metadata": {},
"source": [
"##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, the venv running Jupyter Lab could be pre-configured with a requirements file that includes the right release. Example for transform developers working from a git clone:\n",
"```\n",
"make venv \n",
"source venv/bin/activate \n",
"pip install jupyterlab\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"%%capture\n",
"## This is here as a reference only\n",
"# Users and application developers must use the right tag for the latest from pypi\n",
"%pip install \"data-prep-toolkit-transforms[ray,doc_quality]==1.0.0a4\""
]
},
{
"cell_type": "markdown",
"id": "407fd4e4-265d-4ec7-bbc9-b43158f5f1f3",
"metadata": {
"jp-MarkdownHeadingCollapsed": true
},
"source": [
"##### **** Configure the transform parameters. The set of dictionary keys holding DocQualityTransform configuration values is as follows: \n",
"* text_lang - specifies language used in the text content. By default, \"en\" is used.\n",
"* doc_content_column - specifies column name that contains document text. By default, \"contents\" is used.\n",
"* bad_word_filepath - specifies a path to the bad-word file: a local file listing the bad words to check for. This parameter can be omitted if no bad-word checking is needed.\n",
"#####"
]
},
{
"cell_type": "markdown",
"id": "ebf1f782-0e61-485c-8670-81066beb734c",
"metadata": {},
"source": [
"##### ***** Import required classes and modules"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c2a12abc-9460-4e45-8961-873b48a9ab19",
"metadata": {},
"outputs": [],
"source": [
"from dpk_doc_quality.ray.transform import DocQuality\n",
"from data_processing.utils import GB"
]
},
{
"cell_type": "markdown",
"id": "7234563c-2924-4150-8a31-4aec98c1bf33",
"metadata": {},
"source": [
"##### ***** Set up runtime parameters and invoke the transform"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "95737436",
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"DocQuality(input_folder='test-data/input',\n",
" output_folder= 'output',\n",
" run_locally= True,\n",
" num_cpus= 0.8,\n",
" memory= 2 * GB,\n",
" runtime_num_workers = 3,\n",
" runtime_creation_delay = 0,\n",
" docq_text_lang = \"en\",\n",
" docq_doc_content_column =\"contents\").transform()"
]
},
{
"cell_type": "markdown",
"id": "c3df5adf-4717-4a03-864d-9151cd3f134b",
"metadata": {},
"source": [
"##### **** The specified folder will include the transformed parquet files."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7276fe84-6512-4605-ab65-747351e13a7c",
"metadata": {},
"outputs": [],
"source": [
"import glob\n",
"glob.glob(\"output/*\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "845a75cf-f4a9-467d-87fa-ccbac1c9beb8",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
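When `bad_word_filepath` is not supplied, the Ray wrapper falls back to a language-specific word list bundled next to the module (the `ldnoobw/<lang>` directory referenced in `dpk_doc_quality/ray/transform.py`). A minimal sketch of that resolution; the helper name and the `/opt/dpk/...` path are illustrative, not part of the package's API:

```python
import os

def default_badwords_path(package_dir: str, lang: str = "en") -> str:
    """Resolve a language-specific bad-word list relative to a package directory,
    mirroring the fallback in dpk_doc_quality/ray/transform.py, which joins the
    module directory with "ldnoobw/<lang>" when no path is supplied."""
    return os.path.abspath(os.path.join(package_dir, "ldnoobw", lang))

path = default_badwords_path("/opt/dpk/dpk_doc_quality", "en")
# on POSIX: "/opt/dpk/dpk_doc_quality/ldnoobw/en"
```

This keeps the default self-contained with the installed package, so notebooks work without the user downloading a word list first.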
44 changes: 5 additions & 39 deletions transforms/language/doc_quality/doc_quality.ipynb
@@ -23,16 +23,13 @@
"%%capture\n",
"## This is here as a reference only\n",
"# Users and application developers must use the right tag for the latest from pypi\n",
"%pip install data-prep-toolkit\n",
"%pip install data-prep-toolkit-transforms[doc_quality]"
"%pip install \"data-prep-toolkit-transforms[doc_quality]==1.0.0a4\""
]
},
{
"cell_type": "markdown",
"id": "407fd4e4-265d-4ec7-bbc9-b43158f5f1f3",
"metadata": {
"jp-MarkdownHeadingCollapsed": true
},
"metadata": {},
"source": [
"##### **** Configure the transform parameters. The set of dictionary keys holding DocQualityTransform configuration values is as follows: \n",
"* text_lang - specifies language used in the text content. By default, \"en\" is used.\n",
@@ -72,27 +69,7 @@
"execution_count": null,
"id": "95737436",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"11:54:20 INFO - doc_quality parameters are : {'text_lang': 'en', 'doc_content_column': 'contents', 'bad_word_filepath': '/Users/touma/data-prep-kit-pkg/transforms/language/doc_quality/dpk_doc_quality/ldnoobw/en', 's3_cred': None, 'docq_data_factory': <data_processing.data_access.data_access_factory.DataAccessFactory object at 0x10b1d1c50>}\n",
"11:54:20 INFO - pipeline id pipeline_id\n",
"11:54:20 INFO - code location None\n",
"11:54:20 INFO - data factory data_ is using local data access: input_folder - test-data/input output_folder - output\n",
"11:54:20 INFO - data factory data_ max_files -1, n_sample -1\n",
"11:54:20 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n",
"11:54:20 INFO - orchestrator docq started at 2024-12-04 11:54:20\n",
"11:54:20 INFO - Number of files is 1, source profile {'max_file_size': 0.0009870529174804688, 'min_file_size': 0.0009870529174804688, 'total_file_size': 0.0009870529174804688}\n",
"11:54:20 INFO - Load badwords found locally from /Users/touma/data-prep-kit-pkg/transforms/language/doc_quality/dpk_doc_quality/ldnoobw/en\n",
"11:54:20 INFO - Completed 1 files (100.0%) in 0.002 min\n",
"11:54:20 INFO - Done processing 1 files, waiting for flush() completion.\n",
"11:54:20 INFO - done flushing in 0.0 sec\n",
"11:54:20 INFO - Completed execution in 0.003 min, execution result 0\n"
]
}
],
"outputs": [],
"source": [
"%%capture\n",
"DocQuality(input_folder='test-data/input',\n",
@@ -111,21 +88,10 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": null,
"id": "7276fe84-6512-4605-ab65-747351e13a7c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['output/metadata.json', 'output/test1.parquet']"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"outputs": [],
"source": [
"import glob\n",
"glob.glob(\"output/*\")"
47 changes: 44 additions & 3 deletions transforms/language/doc_quality/dpk_doc_quality/ray/transform.py
@@ -10,14 +10,18 @@
# limitations under the License.
################################################################################

import pyarrow as pa
from data_processing.utils import get_logger
import os
import sys
from data_processing.utils import ParamsUtils, get_logger
from data_processing_ray.runtime.ray import RayTransformLauncher
from data_processing_ray.runtime.ray.runtime_configuration import (
RayTransformRuntimeConfiguration,
)
from dpk_doc_quality.transform import DocQualityTransformConfiguration

from dpk_doc_quality.transform import (
DocQualityTransformConfiguration,
bad_word_filepath_cli_param,
text_lang_cli_param,
)

logger = get_logger(__name__)

@@ -37,6 +41,43 @@ def __init__(self):
super().__init__(transform_config=DocQualityTransformConfiguration())


# Wrapper class used by the notebooks to configure and launch the Ray doc_quality transform
class DocQuality:
def __init__(self, **kwargs):
self.params = {}
for key in kwargs:
self.params[key] = kwargs[key]
        # if input_folder and output_folder are specified, assume they represent data_local_config
try:
local_conf = {k: self.params[k] for k in ("input_folder", "output_folder")}
self.params["data_local_config"] = ParamsUtils.convert_to_ast(local_conf)
del self.params["input_folder"]
del self.params["output_folder"]
        except KeyError:
            # not both folders were provided; leave params unchanged
            pass
try:
worker_options = {k: self.params[k] for k in ("num_cpus", "memory")}
self.params["runtime_worker_options"] = ParamsUtils.convert_to_ast(worker_options)
del self.params["num_cpus"]
del self.params["memory"]
        except KeyError:
            # not both worker options were provided; leave params unchanged
            pass

if text_lang_cli_param not in self.params:
self.params[text_lang_cli_param] = "en"
if bad_word_filepath_cli_param not in self.params:
self.params[bad_word_filepath_cli_param] = os.path.abspath(
os.path.join(os.path.dirname(__file__), "../ldnoobw", self.params[text_lang_cli_param])
)


def transform(self):
        sys.argv = ParamsUtils.dict_to_req(d=self.params)
launcher = RayTransformLauncher(DocQualityRayTransformConfiguration())
return_code = launcher.launch()
return return_code


if __name__ == "__main__":
launcher = RayTransformLauncher(DocQualityRayTransformConfiguration())
logger.info("Launching doc_quality transform")
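The `DocQuality` wrapper in this diff folds flat keyword arguments (`input_folder`, `num_cpus`, ...) into the nested config entries the Ray launcher expects. A simplified, self-contained sketch of that folding pattern; `fold_params` is a stand-in for the `ParamsUtils`-based logic, not the library API:

```python
def fold_params(kwargs: dict) -> dict:
    """Group flat keyword arguments into nested launcher config entries,
    in the spirit of DocQuality.__init__ above."""
    params = dict(kwargs)  # copy so the caller's dict is untouched
    groups = {
        "data_local_config": ("input_folder", "output_folder"),
        "runtime_worker_options": ("num_cpus", "memory"),
    }
    for target, keys in groups.items():
        if all(k in params for k in keys):  # fold only when every key is present
            params[target] = {k: params.pop(k) for k in keys}
    return params

folded = fold_params({
    "input_folder": "test-data/input",
    "output_folder": "output",
    "num_cpus": 0.8,
    "memory": 2 * 1024**3,
    "docq_text_lang": "en",
})
# folded["data_local_config"] == {"input_folder": "test-data/input", "output_folder": "output"}
```

The `all(...)` guard matches the try/except behavior in the diff: if either key of a group is missing, the group is skipped and the remaining keys pass through untouched.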
34 changes: 32 additions & 2 deletions transforms/language/lang_id/dpk_lang_id/ray/transform.py
@@ -10,8 +10,8 @@
# limitations under the License.
################################################################################

import pyarrow as pa
from data_processing.utils import get_logger
import sys
from data_processing.utils import ParamsUtils, get_logger
from data_processing_ray.runtime.ray import RayTransformLauncher
from data_processing_ray.runtime.ray.runtime_configuration import (
RayTransformRuntimeConfiguration,
@@ -36,6 +36,36 @@ def __init__(self):
"""
super().__init__(transform_config=LangIdentificationTransformConfiguration())

# Wrapper class used by the notebooks to configure and launch the Ray lang_id transform
class LangId:
def __init__(self, **kwargs):
self.params = {}
for key in kwargs:
self.params[key] = kwargs[key]
        # if input_folder and output_folder are specified, assume they represent data_local_config
try:
local_conf = {k: self.params[k] for k in ("input_folder", "output_folder")}
self.params["data_local_config"] = ParamsUtils.convert_to_ast(local_conf)
del self.params["input_folder"]
del self.params["output_folder"]
        except KeyError:
            # not both folders were provided; leave params unchanged
            pass
try:
worker_options = {k: self.params[k] for k in ("num_cpus", "memory")}
self.params["runtime_worker_options"] = ParamsUtils.convert_to_ast(worker_options)
del self.params["num_cpus"]
del self.params["memory"]
        except KeyError:
            # not both worker options were provided; leave params unchanged
            pass

def transform(self):
        sys.argv = ParamsUtils.dict_to_req(d=self.params)
# create launcher
launcher = RayTransformLauncher(LangIdentificationRayTransformConfiguration())
# launch
return_code = launcher.launch()
return return_code


if __name__ == "__main__":
launcher = RayTransformLauncher(LangIdentificationRayTransformConfiguration())
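Both wrappers finish by converting the params dict into a CLI-style `sys.argv` before handing off to `RayTransformLauncher`. A hedged sketch of that conversion; `dict_to_argv` is a simplified stand-in for `ParamsUtils.dict_to_req`, not the actual implementation:

```python
def dict_to_argv(params: dict, prog: str = "lang_id") -> list:
    """Flatten a params dict into ["prog", "--key", "value", ...] so an
    argparse-based launcher can consume it via sys.argv."""
    argv = [prog]
    for key, value in params.items():
        argv.extend([f"--{key}", str(value)])
    return argv

argv = dict_to_argv({"runtime_num_workers": 3, "data_local_config": "{'input_folder': 'in'}"})
# → ["lang_id", "--runtime_num_workers", "3", "--data_local_config", "{'input_folder': 'in'}"]
```

This is why the wrapper classes can reuse the existing command-line launcher unchanged: the notebook kwargs are simply replayed as if they had been passed on the command line.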