This component can be used in Kubeflow Pipelines to read two input datasets (e.g. test and training data) and to write the results of whatever processing is done on that data, in a secure and governed fashion, by leveraging Fybrik.
It is assumed that Fybrik is installed together with the chosen Data Catalog and Data Governance engine, and that:
- training and testing datasets have been registered in the data catalog
- governance policies have been defined in the data governance engine
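For orientation, when Fybrik's built-in katalog is used as the data catalog, registering a dataset amounts to applying an Asset resource roughly like the sketch below. The exact schema, connection details, and tag taxonomy depend on your Fybrik version and chosen catalog, and every name, bucket, and tag here is made up for illustration; follow the Fybrik documentation for the real format.

apiVersion: katalog.fybrik.io/v1alpha1
kind: Asset
metadata:
  name: train-data                        # illustrative asset name
spec:
  secretRef:
    name: train-data-credentials          # illustrative secret with access credentials
  details:
    dataFormat: csv
    connection:
      name: s3
      s3:
        endpoint: "http://s3.example.com" # illustrative endpoint
        bucket: "training-data"
        object_key: "train.csv"
  metadata:
    name: training data
    geography: theshire                   # illustrative region referenced by policies
    columns:
      - name: nameOrig
        tags:
          PII: true                       # tag that governance policies can act on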
- Install Kubeflow Pipelines.
- To install on kind, see these instructions; example commands are sketched after this list.
- Please note that if you are running Kubernetes 1.22 or higher on kind, PIPELINE_VERSION should be 1.8.5 and not the version indicated in the instructions.
- Installation takes time. Pods often restart repeatedly until all become ready.
- Install Fybrik
This component is compatible with Fybrik v1.3.
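For reference, the standalone Kubeflow Pipelines install on kind boiled down to the following commands at the time of writing, with PIPELINE_VERSION pinned to 1.8.5 as noted above; verify them against the linked instructions, which may have changed.

export PIPELINE_VERSION=1.8.5
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic-pns?ref=$PIPELINE_VERSION"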
Ensure that the pipeline has the appropriate RBAC privileges to create FybrikApplication resources:
kubectl apply -f rbac_resources.yaml -n kubeflow
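rbac_resources.yaml ships with this component; conceptually it grants the service account that runs the pipeline steps permission to manage FybrikApplication resources, roughly along the lines below. The API group/version, role names, and service account are illustrative and may differ from the actual file.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: fybrikapplication-editor          # illustrative name
  namespace: kubeflow
rules:
  - apiGroups: ["app.fybrik.io"]          # FybrikApplication API group (may differ by release)
    resources: ["fybrikapplications"]
    verbs: ["create", "get", "list", "watch", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: fybrikapplication-editor          # illustrative name
  namespace: kubeflow
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: fybrikapplication-editor
subjects:
  - kind: ServiceAccount
    name: pipeline-runner                 # assumed service account used by the pipeline pods
    namespace: kubeflow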
Register a storage account in which the results can be written. Example files named kfp-storage-secret.yaml and kfp-storage-account.yaml are provided. Please replace the values in these files with your storage endpoint and credential details.
kubectl apply -f kfp-storage-secret.yaml -n fybrik-system
kubectl apply -f kfp-storage-account.yaml -n fybrik-system
The Fybrik documentation has more details on how to create an account in object storage and how to deploy the resources needed for write scenarios. The Fybrik examples create two storage accounts, but in our example one is sufficient.
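For orientation only, kfp-storage-secret.yaml typically holds the object storage credentials as a plain Kubernetes Secret along the lines below; the secret name and key names here are assumptions, so keep the structure of the provided example file. kfp-storage-account.yaml then typically registers a Fybrik storage account that references this secret and the storage endpoint.

apiVersion: v1
kind: Secret
metadata:
  name: kfp-storage-credentials           # assumed name; use the one from the example file
  namespace: fybrik-system
type: Opaque
stringData:
  access_key: "<object storage access key>"   # assumed key names
  secret_key: "<object storage secret key>"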
This component receives the following parameters, all of which are strings.
Inputs:
- train_dataset_id - data catalog ID of the dataset on which the ML model is trained
- test_dataset_id - data catalog ID of the dataset containing the testing data
Outputs:
- train_endpoint - virtual endpoint used to read the training data
- test_endpoint - virtual endpoint used to read the testing data
- result_endpoint - virtual endpoint used to write the results
The component is used from a pipeline as in the following example (imports included for completeness):

from kfp import components, dsl


def pipeline(
        test_dataset_id: str,
        train_dataset_id: str
):
    # Default - could also be read from the environment
    namespace = "kubeflow"

    # Get the ID of the run. Make sure it's lower case and starts with a letter
    run_name = "run-" + dsl.RUN_ID_PLACEHOLDER.lower()

    # Name of the result dataset, also used to pass parameters between workflow steps
    result_name = "submission-" + str(run_name)

    getDataEndpointsOp = components.load_component_from_url(
        'https://github.com/fybrik/kfp-components/blob/master/get_data_endpoints/component.yaml')
    getDataEndpointsStep = getDataEndpointsOp(
        train_dataset_id=train_dataset_id,
        test_dataset_id=test_dataset_id,
        namespace=namespace,
        run_name=run_name,
        result_name=result_name)

    # ...

    trainModelOp = components.load_component_from_file(
        './train_model/component.yaml')
    trainModelStep = trainModelOp(
        train_endpoint_path='%s' % getDataEndpointsStep.outputs['train_endpoint'],
        test_endpoint_path='%s' % getDataEndpointsStep.outputs['test_endpoint'],
        result_name=result_name,
        result_endpoint_path='%s' % getDataEndpointsStep.outputs['result_endpoint'],
        train_dataset_id=train_dataset_id,
        test_dataset_id=test_dataset_id,
        namespace=namespace)
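To run it, the pipeline function is compiled into a package that can be uploaded to the Kubeflow Pipelines UI or submitted via the client. Below is a minimal sketch assuming the kfp v1 SDK; the output file name is arbitrary, and on a Tekton-backed installation kfp_tekton's TektonCompiler would be used instead.

import kfp.compiler

# Compile the pipeline function defined above into a package for the KFP UI/API
kfp.compiler.Compiler().compile(
    pipeline_func=pipeline,
    package_path='fybrik_pipeline.yaml')

# On a Tekton backend, the equivalent would be:
# from kfp_tekton.compiler import TektonCompiler
# TektonCompiler().compile(pipeline, 'fybrik_pipeline.yaml')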
If you wish to enhance or contribute to this component, please note that it is written in Python and packaged as a Docker image.
Create a virtual Python environment by typing the following commands on the command line:
python3 -m venv .venv-kfp
source .venv-kfp/bin/activate
Install the required kfp Python libraries:
pip3 install kfp_tekton --upgrade
pip3 install kfp --upgrade
pip3 install kubernetes --upgrade
Edit get_data_endpoints.py
When done working, the virtual environment may be turned off by entering deactivate.
Register training and testing datasets in the catalog as per the instructions here.
To test the component independently of Kubeflow Pipelines, you may run the following command:
python3 get_data_endpoints.py --train_dataset_id <train dataset id> --test_dataset_id <test dataset id> --train_endpoint ./train.txt --test_endpoint ./test.txt --result_name <name for result dataset> --result_endpoint ./result.txt --result_catalogid ./resultcatalogid.txt --run_name <run name>
To build the Docker image for use from Kubeflow Pipelines, use:
sh build_image.sh
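The actual script is part of this repository; conceptually it does little more than a docker build and push, along the lines of the sketch below. The image name here is a placeholder, not the one the script actually uses.

#!/bin/sh
# Illustrative sketch only - see build_image.sh for the real image name and steps
IMAGE=ghcr.io/fybrik/get_data_endpoints:latest   # placeholder image name
docker build -t "$IMAGE" .
docker push "$IMAGE"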