---
copyright:
  years: 2024
lastupdated: "2024-12-11"
keywords: watsonxdata, staging, config file, target iceberg table, parquet, csv, command line, cli
subcollection: watsonxdata
---
{:javascript: #javascript .ph data-hd-programlang='javascript'}
{:java: #java .ph data-hd-programlang='java'}
{:ruby: #ruby .ph data-hd-programlang='ruby'}
{:php: #php .ph data-hd-programlang='php'}
{:python: #python .ph data-hd-programlang='python'}
{:external: target="_blank" .external}
{:shortdesc: .shortdesc}
{:codeblock: .codeblock}
{:screen: .screen}
{:tip: .tip}
{:important: .important}
{:note: .note}
{:deprecated: .deprecated}
{:pre: .pre}
{:video: .video}
# Creating an ingestion job by using the config file
{: #create_ingestconfig}

This topic provides step-by-step instructions to ingest data into {{site.data.keyword.lakehouse_full}} by using the config file in Presto ingestion mode.
{: shortdesc}
## Before you begin
{: #bybconfig}

Set the mandatory environment variable `ENABLED_INGEST_MODE` to `PRESTO` before starting an ingestion job by running the following command:

```
export ENABLED_INGEST_MODE=PRESTO
```
{: codeblock}
Set the environment variables `SOURCE_S3_CREDS` and `STAGING_S3_CREDS` based on the requirements before starting an ingestion job by running the following commands:

```
export SOURCE_S3_CREDS="AWS_ACCESS_KEY_ID=<access_key>,AWS_SECRET_ACCESS_KEY=<secret_key>,ENDPOINT_URL=<endpoint_url>,AWS_REGION=<region>,BUCKET_NAME=<bucket_name>"
```
{: codeblock}

```
export STAGING_S3_CREDS="AWS_ACCESS_KEY_ID=<access_key>,AWS_SECRET_ACCESS_KEY=<secret_key>,ENDPOINT_URL=<endpoint_url>,AWS_REGION=<region>,BUCKET_NAME=<bucket_name>"
```
{: codeblock}
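For example, with hypothetical credentials and an IBM COS endpoint (every value below is a placeholder; substitute values from your own environment):

```
# Hypothetical values for illustration only
export SOURCE_S3_CREDS="AWS_ACCESS_KEY_ID=AKIAEXAMPLEKEY,AWS_SECRET_ACCESS_KEY=exampleSecretKey,ENDPOINT_URL=https://s3.us-south.cloud-object-storage.appdomain.cloud,AWS_REGION=us-south,BUCKET_NAME=cust-bucket"
export STAGING_S3_CREDS="AWS_ACCESS_KEY_ID=AKIAEXAMPLEKEY,AWS_SECRET_ACCESS_KEY=exampleSecretKey,ENDPOINT_URL=https://s3.us-south.cloud-object-storage.appdomain.cloud,AWS_REGION=us-south,BUCKET_NAME=cust-bucket"
```
{: codeblock}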
## About this task
{: #attask}

To ingest data into watsonx.data, you can use the configuration file option. The advantage of using a configuration file is that you can run multiple ingestion jobs in batches.

After you update the configuration file, run the following command to start the ingestion jobs:

```
ibm-lh data-copy --ingest-config /<your_ingest_configfilename>
```
{: codeblock}
The commands must run within the `ibm-lh` container. For more details and instructions to install the `ibm-lh-client` package and use the `ibm-lh` tool for ingestion, see Installing ibm-lh-client{: external} and Setting up the ibm-lh command-line utility{: external}.
{: note}

To access IBM Cloud Object Storage (COS) and MinIO object storage, specify the `ENDPOINT_URL` variable to pass the corresponding URL to the tool. For more information about IBM COS endpoints, see Endpoints and storage locations{: external}.
{: note}

Replace the placeholder values inside angle brackets in the command examples with values applicable to your environment. See Options and variables supported in ibm-lh tool.
{: note}
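The section naming in the examples that follow suggests that each numbered `[ingest-config<n>]` section defines one ingestion job, so a batched run can be expressed in a single config file. The following sketch (bucket, file, and table names are hypothetical) runs two jobs against the same target table:

```
[global-ingest-config]
target-table:iceberg_cat.ice_schema.customer1_tab
ingestion-engine:hostname=localhost,port=8080
create-if-not-exist:true
[ingest-config1]
source-files:s3://cust-bucket/warehouse/a_source_file1.parquet
staging-location:s3://cust-bucket/warehouse/staging/
[ingest-config2]
source-files:s3://cust-bucket/warehouse/a_source_file2.parquet
staging-location:s3://cust-bucket/warehouse/staging/
```
{: screen}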
The following sections describe how to use the config file option to ingest data files from an S3 or local location into a {{site.data.keyword.lakehouse_short}} Iceberg table.
## Ingesting a single Parquet file from an S3 location
{: #ingest1}

To ingest a single Parquet file from an S3 location, use the following config file:

```
[global-ingest-config]
target-table:table_name
ingestion-engine:hostname=<hostname>,port=<port>
create-if-not-exist:<true>
[ingest-config1]
source-files:SOURCE_DATA_FILE
staging-location:STAGING_LOCATION
```
{: codeblock}
For example:

```
[global-ingest-config]
target-table:iceberg_cat.ice_schema.customer1_tab
ingestion-engine:hostname=localhost,port=8080
create-if-not-exist:true
[ingest-config1]
source-files:s3://cust-bucket/warehouse/a_source_file.parquet
staging-location:s3://cust-bucket/warehouse/staging/
```
{: screen}
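Assuming the config file above is saved inside the container as `/ibmlhdata/ingest.cfg` (a hypothetical path), you would start the job with:

```
ibm-lh data-copy --ingest-config /ibmlhdata/ingest.cfg
```
{: codeblock}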
## Ingesting multiple files from an S3 location
{: #ingest2}

To ingest multiple files from an S3 location, use the following config file:

```
[global-ingest-config]
target-table:table_name
ingestion-engine:hostname=<hostname>,port=<port>
create-if-not-exist:<true>
[ingest-config1]
source-files:SOURCE_DATA_FILE
staging-location:STAGING_LOCATION
```
{: codeblock}
For example, to ingest specific files, provide a comma-separated list:

```
[global-ingest-config]
target-table:iceberg_cat.ice_schema.customer1_tab
ingestion-engine:hostname=localhost,port=8080
create-if-not-exist:true
[ingest-config1]
source-files:s3://cust-bucket/warehouse/a_source_file1.csv,s3://cust-bucket/warehouse/a_source_file2.csv
staging-location:s3://cust-bucket/warehouse/staging/
```
{: screen}
To ingest all files in a folder, specify the folder path as the source:

```
[global-ingest-config]
target-table:iceberg_cat.ice_schema.customer1_tab
ingestion-engine:hostname=localhost,port=8080
create-if-not-exist:true
[ingest-config1]
source-files:s3://cust-bucket/warehouse/
staging-location:s3://cust-bucket/warehouse/staging/
```
{: screen}
## Ingesting all Parquet files in a folder from an S3 location
{: #ingest3}

To ingest all Parquet files in a folder from an S3 location, use the following config file:

```
[global-ingest-config]
target-table:table_name
ingestion-engine:hostname=<hostname>,port=<port>
create-if-not-exist:<true>
[ingest-config1]
source-files:SOURCE_DATA_FILE
```
{: codeblock}
For example:

```
[global-ingest-config]
target-table:iceberg_cat.ice_schema.customer1_tab
ingestion-engine:hostname=localhost,port=8080
create-if-not-exist:true
[ingest-config1]
source-files:s3://cust-bucket/warehouse/
```
{: screen}
In general, this option does not require a staging location. However, there are a few exceptional scenarios in which a staging location must be specified. When no staging location is used, make sure that the Hive catalog that is configured with Presto can be used with the source-files location. A staging location is required in the following cases (see the sketch after this note):
- Any or all Parquet files in the folder are huge.
- Any or all Parquet files in the folder have special columns, such as TIME.
{: note}
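For those exceptional cases, add a `staging-location` entry to the same config; a sketch that reuses the hypothetical bucket from the example above:

```
[global-ingest-config]
target-table:iceberg_cat.ice_schema.customer1_tab
ingestion-engine:hostname=localhost,port=8080
create-if-not-exist:true
[ingest-config1]
source-files:s3://cust-bucket/warehouse/
staging-location:s3://cust-bucket/warehouse/staging/
```
{: screen}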
## Ingesting a single file from a local location
{: #ingest4}

To ingest a single file from a local location, use the following config file:

```
[global-ingest-config]
target-table:table_name
ingestion-engine:hostname=<hostname>,port=<port>
create-if-not-exist:<true>
[ingest-config1]
source-files:SOURCE_DATA_FILE
staging-location:STAGING_LOCATION
```
{: codeblock}
For example:

```
[global-ingest-config]
target-table:iceberg_cat.ice_schema.customer1_tab
ingestion-engine:hostname=localhost,port=8080
create-if-not-exist:true
[ingest-config1]
source-files:/tmp/customer1.parquet
staging-location:s3://cust-bucket/warehouse/staging/
```
{: screen}
To ingest all files in a local folder, specify the folder path:

```
[global-ingest-config]
target-table:iceberg_cat.ice_schema.customer1_tab
ingestion-engine:hostname=localhost,port=8080
create-if-not-exist:true
[ingest-config1]
source-files:/tmp/
staging-location:s3://cust-bucket/warehouse/staging/
```
{: screen}
## Ingesting any data file from a local location
{: #ingest5}

To ingest any type of data file from a local file system, the data files must first be copied to the `~/ibm-lh-client/localstorage/volumes/ibm-lh` directory. You can then access the data files from the `/ibmlhdata/` directory by using the `ibm-lh data-copy` command.
{: note}
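For example, to stage a local CSV file for ingestion (the file name `reptile.csv` matches the example below; the source path is hypothetical):

```
# Copy the data file into the ibm-lh-client local storage volume
cp /path/to/reptile.csv ~/ibm-lh-client/localstorage/volumes/ibm-lh/
# Inside the container, the file is then available as /ibmlhdata/reptile.csv
```
{: codeblock}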
To ingest any data file from a local location, use the following config file:

```
[global-ingest-config]
target-table:table_name
ingestion-engine:hostname=<hostname>,port=<port>
create-if-not-exist:<true>
[ingest-config1]
source-files:SOURCE_DATA_FILE
staging-location:STAGING_LOCATION
staging-hive-catalog:<catalog_name>
schema:<SCHEMA>
dbuser:<DBUSER>
dbpassword:<DBPASSWORD>
trust-store-path:<TRUST_STORE_PATH>
trust-store-password:<TRUST_STORE_PASSWORD>
```
{: codeblock}
For example:

```
[global-ingest-config]
target-table:iceberg_data.ivt_sanity_test_1.reptile
ingestion-engine:hostname=ibm-lh-lakehouse-presto-01-presto-svc-cpd-instance.apps.ivt384.cp.fyre.ibm.com,port=443
create-if-not-exist:true
[ingest-config1]
source-files:/ibmlhdata/reptile.csv
staging-location:s3://watsonx.data/staging
staging-hive-catalog:hive_test
schema:schema.cfg
dbuser:xxxx
dbpassword:xxxx
trust-store-path:/mnt/infra/tls/aliases/ibm-lh-lakehouse-presto-01-presto-svc-cpd-instance.apps.ivt384.cp.fyre.ibm.com:443.crt
trust-store-password:changeit
```
{: screen}
## Ingesting CSV or Parquet files by using a staging location
{: #ingest6}

To ingest CSV files, local Parquet files, or S3 Parquet files by using a staging location, use the following config file:

```
[global-ingest-config]
target-table:table_name
ingestion-engine:hostname=<hostname>,port=<port>
create-if-not-exist:<true>
[ingest-config1]
source-files:SOURCE_DATA_FILE
staging-location:STAGING_LOCATION
staging-hive-catalog:<catalog_name>
schema:<SCHEMA>
dbuser:<DBUSER>
dbpassword:<DBPASSWORD>
trust-store-path:<TRUST_STORE_PATH>
trust-store-password:<TRUST_STORE_PASSWORD>
```
{: codeblock}
For example:

```
[global-ingest-config]
target-table:iceberg_data.test_iceberg.gvt_data_v
ingestion-engine:hostname=ibm-lh-lakehouse-presto-01-presto-svc-cpd-instance.apps.ivt384.cp.fyre.ibm.com,port=443
create-if-not-exist:true
[ingest-config1]
source-files:s3://watsonx-data-0823-2/test_icos/GVT-DATA-C.csv
staging-location:s3://watsonx.data-staging
staging-hive-catalog:staging_catalog
staging-hive-schema:staging_schema
dbuser:xxxx
dbpassword:xxxx
trust-store-path:/mnt/infra/tls/aliases/ibm-lh-lakehouse-presto-01-presto-svc-cpd-instance.apps.ivt384.cp.fyre.ibm.com:443.crt
trust-store-password:changeit
```
{: screen}
In this example, `staging-location` is `s3://watsonx.data-staging`. The `staging-hive-catalog`, which is `staging_catalog`, must be associated with the storage bucket `watsonx.data-staging`.