---
copyright:
  years: 2024
lastupdated: "2024-12-11"
keywords: watsonxdata, staging, config file, target iceberg table, parquet, csv, command line, cli
subcollection: watsonxdata
---
{:javascript: #javascript .ph data-hd-programlang='javascript'}
{:java: #java .ph data-hd-programlang='java'}
{:ruby: #ruby .ph data-hd-programlang='ruby'}
{:php: #php .ph data-hd-programlang='php'}
{:python: #python .ph data-hd-programlang='python'}
{:external: target="_blank" .external}
{:shortdesc: .shortdesc}
{:codeblock: .codeblock}
{:screen: .screen}
{:tip: .tip}
{:important: .important}
{:note: .note}
{:deprecated: .deprecated}
{:pre: .pre}
{:video: .video}
# Creating an ingestion job by using the config file
{: #create_ingestconfig}

This topic provides step-by-step instructions to ingest data into {{site.data.keyword.lakehouse_full}} by using the config file in Presto ingestion mode.
{: shortdesc}
## Before you begin
{: #bybconfig}

Set the mandatory environment variable `ENABLED_INGEST_MODE` to `PRESTO` before starting an ingestion job by running the following command:

```
export ENABLED_INGEST_MODE=PRESTO
```
{: codeblock}
Set the environment variables `SOURCE_S3_CREDS` and `STAGING_S3_CREDS` based on the requirements before starting an ingestion job by running the following commands:

```
export SOURCE_S3_CREDS="AWS_ACCESS_KEY_ID=<access_key>,AWS_SECRET_ACCESS_KEY=<secret_key>,ENDPOINT_URL=<endpoint_url>,AWS_REGION=<region>,BUCKET_NAME=<bucket_name>"
```
{: codeblock}

```
export STAGING_S3_CREDS="AWS_ACCESS_KEY_ID=<access_key>,AWS_SECRET_ACCESS_KEY=<secret_key>,ENDPOINT_URL=<endpoint_url>,AWS_REGION=<region>,BUCKET_NAME=<bucket_name>"
```
{: codeblock}
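For example, with hypothetical credentials and an IBM COS endpoint (every value below is a placeholder; substitute values from your own environment):

```
# Hypothetical values for illustration only
export SOURCE_S3_CREDS="AWS_ACCESS_KEY_ID=AKIAEXAMPLEKEY,AWS_SECRET_ACCESS_KEY=exampleSecretKey,ENDPOINT_URL=https://s3.us-south.cloud-object-storage.appdomain.cloud,AWS_REGION=us-south,BUCKET_NAME=cust-bucket"
export STAGING_S3_CREDS="AWS_ACCESS_KEY_ID=AKIAEXAMPLEKEY,AWS_SECRET_ACCESS_KEY=exampleSecretKey,ENDPOINT_URL=https://s3.us-south.cloud-object-storage.appdomain.cloud,AWS_REGION=us-south,BUCKET_NAME=cust-bucket"
```
{: codeblock}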
## About this task
{: #attask}

To ingest data into watsonx.data, you can use the configuration file option. The advantage of using a configuration file is that you can run multiple ingestion jobs in batches.

After you update the configuration file, run the following command to start the ingestion jobs:

```
ibm-lh data-copy --ingest-config /<your_ingest_configfilename>
```
{: codeblock}
The commands must run within the `ibm-lh` container. For more details and instructions to install the `ibm-lh-client` package and use the `ibm-lh` tool for ingestion, see Installing ibm-lh-client{: external} and Setting up the ibm-lh command-line utility{: external}.
{: note}

To access IBM Cloud Object Storage (COS) and MinIO object storage, specify the `ENDPOINT_URL` variable to pass the corresponding URL to the tool. For more information about IBM COS endpoints, see Endpoints and storage locations{: external}.
{: note}

Replace the placeholder values inside angle brackets in the command examples with values applicable to your environment. See Options and variables supported in ibm-lh tool.
{: note}
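The section naming in the examples that follow suggests that each numbered `[ingest-config<n>]` section defines one ingestion job, so a batched run can be expressed in a single config file. The following sketch (bucket, file, and table names are hypothetical) runs two jobs against the same target table:

```
[global-ingest-config]
target-table:iceberg_cat.ice_schema.customer1_tab
ingestion-engine:hostname=localhost,port=8080
create-if-not-exist:true
[ingest-config1]
source-files:s3://cust-bucket/warehouse/a_source_file1.parquet
staging-location:s3://cust-bucket/warehouse/staging/
[ingest-config2]
source-files:s3://cust-bucket/warehouse/a_source_file2.parquet
staging-location:s3://cust-bucket/warehouse/staging/
```
{: screen}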
The following sections describe how to use the config file option to ingest data files from an S3 or local location into a {{site.data.keyword.lakehouse_short}} Iceberg table.
## Ingesting a single Parquet file from an S3 location
{: #ingest1}

To ingest a single Parquet file from an S3 location, use the following config file:

```
[global-ingest-config]
target-table:table_name
ingestion-engine:hostname=<hostname>,port=<port>
create-if-not-exist:<true>
[ingest-config1]
source-files:SOURCE_DATA_FILE
staging-location:STAGING_LOCATION
```
{: codeblock}
For example:

```
[global-ingest-config]
target-table:iceberg_cat.ice_schema.customer1_tab
ingestion-engine:hostname=localhost,port=8080
create-if-not-exist:true
[ingest-config1]
source-files:s3://cust-bucket/warehouse/a_source_file.parquet
staging-location:s3://cust-bucket/warehouse/staging/
```
{: screen}
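Assuming the config file above is saved inside the container as `/ibmlhdata/ingest.cfg` (a hypothetical path), you would start the job with:

```
ibm-lh data-copy --ingest-config /ibmlhdata/ingest.cfg
```
{: codeblock}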
## Ingesting multiple files from an S3 location
{: #ingest2}

To ingest multiple files from an S3 location, use the following config file:

```
[global-ingest-config]
target-table:table_name
ingestion-engine:hostname=<hostname>,port=<port>
create-if-not-exist:<true>
[ingest-config1]
source-files:SOURCE_DATA_FILE
staging-location:STAGING_LOCATION
```
{: codeblock}
For example, to ingest specific files, provide a comma-separated list:

```
[global-ingest-config]
target-table:iceberg_cat.ice_schema.customer1_tab
ingestion-engine:hostname=localhost,port=8080
create-if-not-exist:true
[ingest-config1]
source-files:s3://cust-bucket/warehouse/a_source_file1.csv,s3://cust-bucket/warehouse/a_source_file2.csv
staging-location:s3://cust-bucket/warehouse/staging/
```
{: screen}
To ingest all files in a folder, specify the folder path as the source:

```
[global-ingest-config]
target-table:iceberg_cat.ice_schema.customer1_tab
ingestion-engine:hostname=localhost,port=8080
create-if-not-exist:true
[ingest-config1]
source-files:s3://cust-bucket/warehouse/
staging-location:s3://cust-bucket/warehouse/staging/
```
{: screen}
## Ingesting all Parquet files in a folder from an S3 location
{: #ingest3}

To ingest all Parquet files in a folder from an S3 location, use the following config file:

```
[global-ingest-config]
target-table:table_name
ingestion-engine:hostname=<hostname>,port=<port>
create-if-not-exist:<true>
[ingest-config1]
source-files:SOURCE_DATA_FILE
```
{: codeblock}
For example:

```
[global-ingest-config]
target-table:iceberg_cat.ice_schema.customer1_tab
ingestion-engine:hostname=localhost,port=8080
create-if-not-exist:true
[ingest-config1]
source-files:s3://cust-bucket/warehouse/
```
{: screen}
In general, this option does not require a staging location. However, there are a few exceptional scenarios in which a staging location must be specified. When no staging location is used, make sure that the Hive catalog that is configured with Presto can be used with the source-files location. A staging location is required in the following cases (see the sketch after this note):
- Any or all Parquet files in the folder are huge.
- Any or all Parquet files in the folder have special columns, such as TIME.
{: note}
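For those exceptional cases, add a `staging-location` entry to the same config; a sketch that reuses the hypothetical bucket from the example above:

```
[global-ingest-config]
target-table:iceberg_cat.ice_schema.customer1_tab
ingestion-engine:hostname=localhost,port=8080
create-if-not-exist:true
[ingest-config1]
source-files:s3://cust-bucket/warehouse/
staging-location:s3://cust-bucket/warehouse/staging/
```
{: screen}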
## Ingesting a single file from a local location
{: #ingest4}

To ingest a single file from a local location, use the following config file:

```
[global-ingest-config]
target-table:table_name
ingestion-engine:hostname=<hostname>,port=<port>
create-if-not-exist:<true>
[ingest-config1]
source-files:SOURCE_DATA_FILE
staging-location:STAGING_LOCATION
```
{: codeblock}
For example:

```
[global-ingest-config]
target-table:iceberg_cat.ice_schema.customer1_tab
ingestion-engine:hostname=localhost,port=8080
create-if-not-exist:true
[ingest-config1]
source-files:/tmp/customer1.parquet
staging-location:s3://cust-bucket/warehouse/staging/
```
{: screen}
To ingest all files in a local folder, specify the folder path:

```
[global-ingest-config]
target-table:iceberg_cat.ice_schema.customer1_tab
ingestion-engine:hostname=localhost,port=8080
create-if-not-exist:true
[ingest-config1]
source-files:/tmp/
staging-location:s3://cust-bucket/warehouse/staging/
```
{: screen}
## Ingesting any data file from a local location
{: #ingest5}

To ingest any type of data file from a local file system, the data files must first be copied to the `~/ibm-lh-client/localstorage/volumes/ibm-lh` directory. You can then access the data files from the `/ibmlhdata/` directory by using the `ibm-lh data-copy` command.
{: note}
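For example, to stage a local CSV file for ingestion (the file name `reptile.csv` matches the example below; the source path is hypothetical):

```
# Copy the data file into the ibm-lh-client local storage volume
cp /path/to/reptile.csv ~/ibm-lh-client/localstorage/volumes/ibm-lh/
# Inside the container, the file is then available as /ibmlhdata/reptile.csv
```
{: codeblock}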
To ingest any data file from a local location, use the following config file:

```
[global-ingest-config]
target-table:table_name
ingestion-engine:hostname=<hostname>,port=<port>
create-if-not-exist:<true>
[ingest-config1]
source-files:SOURCE_DATA_FILE
staging-location:STAGING_LOCATION
staging-hive-catalog:<catalog_name>
schema:<SCHEMA>
dbuser:<DBUSER>
dbpassword:<DBPASSWORD>
trust-store-path:<TRUST_STORE_PATH>
trust-store-password:<TRUST_STORE_PASSWORD>
```
{: codeblock}
For example:

```
[global-ingest-config]
target-table:iceberg_data.ivt_sanity_test_1.reptile
ingestion-engine:hostname=ibm-lh-lakehouse-presto-01-presto-svc-cpd-instance.apps.ivt384.cp.fyre.ibm.com,port=443
create-if-not-exist:true
[ingest-config1]
source-files:/ibmlhdata/reptile.csv
staging-location:s3://watsonx.data/staging
staging-hive-catalog:hive_test
schema:schema.cfg
dbuser:xxxx
dbpassword:xxxx
trust-store-path:/mnt/infra/tls/aliases/ibm-lh-lakehouse-presto-01-presto-svc-cpd-instance.apps.ivt384.cp.fyre.ibm.com:443.crt
trust-store-password:changeit
```
{: screen}
## Ingesting CSV or Parquet files by using a staging location
{: #ingest6}

To ingest CSV files, local Parquet files, or S3 Parquet files by using a staging location, use the following config file:

```
[global-ingest-config]
target-table:table_name
ingestion-engine:hostname=<hostname>,port=<port>
create-if-not-exist:<true>
[ingest-config1]
source-files:SOURCE_DATA_FILE
staging-location:STAGING_LOCATION
staging-hive-catalog:<catalog_name>
schema:<SCHEMA>
dbuser:<DBUSER>
dbpassword:<DBPASSWORD>
trust-store-path:<TRUST_STORE_PATH>
trust-store-password:<TRUST_STORE_PASSWORD>
```
{: codeblock}
For example:

```
[global-ingest-config]
target-table:iceberg_data.test_iceberg.gvt_data_v
ingestion-engine:hostname=ibm-lh-lakehouse-presto-01-presto-svc-cpd-instance.apps.ivt384.cp.fyre.ibm.com,port=443
create-if-not-exist:true
[ingest-config1]
source-files:s3://watsonx-data-0823-2/test_icos/GVT-DATA-C.csv
staging-location:s3://watsonx.data-staging
staging-hive-catalog:staging_catalog
staging-hive-schema:staging_schema
dbuser:xxxx
dbpassword:xxxx
trust-store-path:/mnt/infra/tls/aliases/ibm-lh-lakehouse-presto-01-presto-svc-cpd-instance.apps.ivt384.cp.fyre.ibm.com:443.crt
trust-store-password:changeit
```
{: screen}
In this example, `staging-location` is `s3://watsonx.data-staging`. The `staging-hive-catalog`, which is `staging_catalog`, must be associated with the storage bucket `watsonx.data-staging`.