Commit b69c14c

See open-metadata/OpenMetadata@a3a1c57 from refs/heads/main

open-metadata committed Dec 5, 2024
1 parent 5acbd19 commit b69c14c
Showing 4 changed files with 320 additions and 0 deletions.
73 changes: 73 additions & 0 deletions content/v1.5.x/getting-started/day-1/hybrid-saas/index.md
@@ -6,6 +6,8 @@ collate: true

# Hybrid SaaS

{% youtube url="https://drive.google.com/file/d/16-2l9EYBE9DjlHepPKTpVFvashMy1buu/preview" start="0:00" end="6:47" width="560px" height="315px" /%}

There are two options for setting up a data connector:
1. **Run the connector in Collate SaaS**: In this scenario, you'll get an IP address when you add the service. You need to allow
access from this IP in your data sources.
@@ -20,6 +22,77 @@ Any tool capable of running Python code can be used to configure the metadata ex

In this section we'll show you how the ingestion process works and how to test it from your laptop.

## Collate Ingestion Agent

The Collate Ingestion Agent is designed to facilitate metadata ingestion for hybrid deployments, allowing organizations to securely push metadata from their infrastructure into the Collate platform without exposing their internal systems. It provides a secure and efficient channel for running ingestion workflows while maintaining full control over data processing within your network. This document outlines the setup and usage of the Collate Ingestion Agent, emphasizing its role in hybrid environments and key functionalities.

### Overview

The Collate Ingestion Agent is ideal for scenarios where connectors must run on-premises, providing a secure and efficient way to process metadata within your infrastructure. Because metadata is processed inside your network, it reduces data privacy concerns and streamlines the ingestion process.

With the Collate Ingestion Agent, you can:
- Set up ingestion workflows easily without configuring YAML files manually.
- Leverage the Collate UI for a seamless and user-friendly experience.
- Manage various ingestion types, including metadata, profiling, lineage, usage, dbt, and data quality.

### Setting Up the Collate Ingestion Agent

#### 1. Prepare Your Environment
To begin, download the Collate-provided Docker image for the Ingestion Agent. The Collate team will provide the necessary credentials to authenticate and pull the image from the repository.

**Run the following commands:**
- **Log in to Docker**: Use the credentials provided by Collate to authenticate.
- **Pull the Docker Image**: Run the command to pull the image into your local environment.

Once the image is downloaded, you can start the Docker container to initialize the Ingestion Agent.
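
A minimal sketch of these commands, assuming a placeholder registry, image name, tag, and port (use the exact values and credentials provided by the Collate team):

```bash
# Log in to the registry with the credentials provided by Collate
docker login registry.example.com

# Pull the Ingestion Agent image (registry, name, and tag are placeholders)
docker pull registry.example.com/collate/ingestion-agent:latest

# Start the container and expose the local agent UI on a placeholder port
docker run -d --name collate-ingestion-agent -p 8080:8080 \
  registry.example.com/collate/ingestion-agent:latest
```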

#### 2. Configure the Agent

##### Access the Local Agent UI:
- Open your browser and navigate to the local instance of the Collate Ingestion Agent.

##### Set Up the Connection:
- Enter your Collate platform URL (e.g., `https://<your-company>.collate.com/api`).
- Add the ingestion bot token from the Collate settings under **Settings > Bots > Ingestion Bot**.

##### Verify Services:
- Open the Collate UI and confirm that all available services (e.g., databases) are visible in the Ingestion Agent interface.

#### 3. Add a New Service

1. Navigate to the **Database Services** section in the Ingestion Agent UI.
2. Click **Add New Service** and select the database type (e.g., Redshift).
3. Enter the necessary service configuration:
- **Service Name**: A unique name for the database service.
- **Host and Port**: Connection details for the database.
- **Username and Password**: Credentials to access the database.
- **Database Name**: The target database for ingestion.
4. Test the connection to ensure the service is properly configured.

#### 4. Run Metadata Ingestion

1. After creating the service, navigate to the **Ingestion** tab and click **Add Ingestion**.
2. Select the ingestion type (e.g., metadata) and specify any additional configurations:
- Include specific schemas or tables.
- Enable options like DDL inclusion if required.
3. Choose whether to:
- Run the ingestion immediately via the agent.
- Download the YAML configuration file for running ingestion on an external scheduler (a sample YAML is sketched below).
4. Monitor the logs in real-time to track the ingestion process.
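
If you choose to download the YAML, it follows the standard Collate/OpenMetadata workflow format. Below is a hedged sketch for a Redshift metadata workflow; the host, credentials, token, and service name are placeholders, and the authoritative version is the file generated by the agent:

```yaml
source:
  type: redshift
  serviceName: my_redshift          # service name created in step 3
  serviceConnection:
    config:
      type: Redshift
      hostPort: my-cluster.example.com:5439
      username: ingestion_user
      password: <password>
      database: dev
  sourceConfig:
    config:
      type: DatabaseMetadata        # add schema/table filters or DDL options here if required
workflowConfig:
  openMetadataServerConfig:
    hostPort: https://<your-company>.collate.com/api
    authProvider: openmetadata
    securityConfig:
      jwtToken: <ingestion-bot-token>
```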

#### 5. Verify Ingested Data

1. Return to the Collate platform and refresh the database service.
2. Verify that the ingested metadata, including schemas, tables, and column details, is available.
3. Explore additional ingestion options like profiling, lineage, or data quality for the service.

### Additional Features

The Collate Ingestion Agent supports various ingestion workflows, allowing you to:
- **Generate YAML Configurations**: Download YAML files for external scheduling (a scheduling sketch follows below).
- **Manage Ingestion Types**: Run metadata, profiling, lineage, usage, and other workflows as needed.
- **Monitor Progress**: View logs and monitor real-time ingestion activity.
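
For example, a downloaded YAML file can be run on your own scheduler with the ingestion CLI from the `openmetadata-ingestion` Python package; the paths, file names, and schedule below are placeholders:

```bash
# Run the workflow once from a machine with network access to the data source
metadata ingest -c /path/to/redshift_metadata.yaml

# Or schedule it daily at 02:00 with cron
0 2 * * * metadata ingest -c /path/to/redshift_metadata.yaml >> /var/log/collate-ingestion.log 2>&1
```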

## 1. How does the Ingestion Framework work?

The Ingestion Framework contains all the logic about how to connect to the sources, extract their metadata
87 changes: 87 additions & 0 deletions content/v1.5.x/how-to-guides/admin-guide/Reindexing-Search.md
@@ -21,3 +21,90 @@ Perform the reindexing process described below.
{% partial file="/v1.5/deployment/reindex.md" /%}

**Note:** If you continue to experience issues, consider re-installing the search application.

## Configuration Parameters for Reindexing

This document provides detailed descriptions and best practices for configuring the reindexing process parameters. Proper configuration ensures efficient and reliable reindexing while minimizing potential system bottlenecks.

{% image
src="/images/v1.5/deployment/upgrade/reindex.png"
alt="Reindex Configuration"
/%}

### 1. `recreateIndex`
**Description**:
Determines whether to recreate the index from scratch during the reindexing process. Setting this to `true` will drop the existing index and create a new one.

**Best Practice**:
Use this option with caution. Set it to `true` only when you need a clean slate, such as after significant changes to your data model or during data migration. For routine updates, keep it `false` to preserve the existing index.

### 2. `batchSize`
**Description**:
Defines the maximum number of events sent in a single batch during reindexing. The default value is `100`.

**Best Practice**:
Adjust the batch size based on system capabilities and event size. A larger batch size improves throughput but may increase memory usage and processing time. Monitor performance and fine-tune accordingly.

### 3. `payLoadSize`
**Description**:
Specifies the maximum payload size (in bytes) for events sent in a batch. Default: `104,857,600 bytes` (100 MB).

**Best Practice**:
Ensure the payload size is within your system’s handling capacity. If memory issues or timeouts occur, reduce this value to improve stability.

### 4. `producerThreads`
**Description**:
Indicates the number of threads used for producing events during reindexing. Default: `10`.

**Best Practice**:
Balance the number of threads with system CPU and I/O capacity. Increasing this number can improve throughput but may lead to contention if set too high.

### 5. `maxConcurrentRequests`
**Description**:
Specifies the maximum number of concurrent requests sent to the search index at any given time. Default: `100`.

**Best Practice**:
Tune this value based on the indexing server’s capacity. Too many concurrent requests can overwhelm the server, leading to failures or slowdowns.

### 6. `maxRetries`
**Description**:
Specifies the maximum number of retry attempts for failed requests. Default: `3 retries`.

**Best Practice**:
Keep this value reasonable to avoid excessive load during failures. Analyze failure patterns to optimize this setting.

### 7. `initialBackoff`
**Description**:
Defines the initial backoff time (in milliseconds) before retrying a failed request. Default: `1000 ms` (1 second).

**Best Practice**:
Start with the default value. Increase it if failures occur frequently due to server overload or network issues.

### 8. `maxBackoff`
**Description**:
Specifies the maximum backoff time (in milliseconds) for retries. Default: `10,000 ms` (10 seconds).

**Best Practice**:
Set this value to align with your application’s latency tolerance. A longer backoff can reduce system load during peak times but may slow recovery from errors.
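
As an illustration of how `maxRetries`, `initialBackoff`, and `maxBackoff` interact, the sketch below assumes a simple exponential (doubling) backoff capped at `maxBackoff`; the server's actual retry strategy may differ:

```python
import time

def send_with_retries(send_batch, max_retries=3, initial_backoff_ms=1000, max_backoff_ms=10000):
    """Retry a failing batch, doubling the wait between attempts (assumed strategy)."""
    backoff_ms = initial_backoff_ms
    for attempt in range(max_retries + 1):
        try:
            return send_batch()
        except Exception:
            if attempt == max_retries:
                raise  # give up after maxRetries retries
            time.sleep(backoff_ms / 1000)
            backoff_ms = min(backoff_ms * 2, max_backoff_ms)  # never wait longer than maxBackoff
```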

### 9. `queueSize`
**Description**:
Defines the internal queue size used for reindexing operations. Default: `100`.

**Best Practice**:
Adjust the queue size based on expected load and available memory resources. A larger queue can handle spikes in processing but requires more memory.
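
Putting the defaults listed above together, a baseline configuration would look like the sketch below (`recreateIndex` is set to `false` for routine runs, as recommended; confirm the field names against your deployment):

```json
{
  "recreateIndex": false,
  "batchSize": 100,
  "payLoadSize": 104857600,
  "producerThreads": 10,
  "maxConcurrentRequests": 100,
  "maxRetries": 3,
  "initialBackoff": 1000,
  "maxBackoff": 10000,
  "queueSize": 100
}
```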

## Example Configuration for Best Practices

For high-performance systems, consider the following values as a starting point:

```json
{
"batchSize": 300,
"queueSize": 500,
"producerThreads": 20,
"maxConcurrentRequests": 500
}
```

Monitor system performance and adjust these parameters to optimize throughput and resource usage.
73 changes: 73 additions & 0 deletions content/v1.6.x-SNAPSHOT/getting-started/day-1/hybrid-saas/index.md
@@ -6,6 +6,8 @@ collate: true

# Hybrid SaaS

{% youtube url="https://drive.google.com/file/d/16-2l9EYBE9DjlHepPKTpVFvashMy1buu/preview" start="0:00" end="6:47" width="560px" height="315px" /%}

There are two options for setting up a data connector:
1. **Run the connector in Collate SaaS**: In this scenario, you'll get an IP address when you add the service. You need to allow
access from this IP in your data sources.
@@ -20,6 +22,77 @@ Any tool capable of running Python code can be used to configure the metadata ex

In this section we'll show you how the ingestion process works and how to test it from your laptop.

## Collate Ingestion Agent

The Collate Ingestion Agent is designed to facilitate metadata ingestion for hybrid deployments, allowing organizations to securely push metadata from their infrastructure into the Collate platform without exposing their internal systems. It provides a secure and efficient channel for running ingestion workflows while maintaining full control over data processing within your network. This document outlines the setup and usage of the Collate Ingestion Agent, emphasizing its role in hybrid environments and key functionalities.

### Overview

The Collate Ingestion Agent is ideal for scenarios where connectors must run on-premises, providing a secure and efficient way to process metadata within your infrastructure. Because metadata is processed inside your network, it reduces data privacy concerns and streamlines the ingestion process.

With the Collate Ingestion Agent, you can:
- Set up ingestion workflows easily without configuring YAML files manually.
- Leverage the Collate UI for a seamless and user-friendly experience.
- Manage various ingestion types, including metadata, profiling, lineage, usage, dbt, and data quality.

### Setting Up the Collate Ingestion Agent

#### 1. Prepare Your Environment
To begin, download the Collate-provided Docker image for the Ingestion Agent. The Collate team will provide the necessary credentials to authenticate and pull the image from the repository.

**Run the following commands:**
- **Log in to Docker**: Use the credentials provided by Collate to authenticate.
- **Pull the Docker Image**: Run the command to pull the image into your local environment.

Once the image is downloaded, you can start the Docker container to initialize the Ingestion Agent.

#### 2. Configure the Agent

##### Access the Local Agent UI:
- Open your browser and navigate to the local instance of the Collate Ingestion Agent.

##### Set Up the Connection:
- Enter your Collate platform URL (e.g., `https://<your-company>.collate.com/api`).
- Add the ingestion bot token from the Collate settings under **Settings > Bots > Ingestion Bot**.

##### Verify Services:
- Open the Collate UI and confirm that all available services (e.g., databases) are visible in the Ingestion Agent interface.

#### 3. Add a New Service

1. Navigate to the **Database Services** section in the Ingestion Agent UI.
2. Click **Add New Service** and select the database type (e.g., Redshift).
3. Enter the necessary service configuration:
- **Service Name**: A unique name for the database service.
- **Host and Port**: Connection details for the database.
- **Username and Password**: Credentials to access the database.
- **Database Name**: The target database for ingestion.
4. Test the connection to ensure the service is properly configured.

#### 4. Run Metadata Ingestion

1. After creating the service, navigate to the **Ingestion** tab and click **Add Ingestion**.
2. Select the ingestion type (e.g., metadata) and specify any additional configurations:
- Include specific schemas or tables.
- Enable options like DDL inclusion if required.
3. Choose whether to:
- Run the ingestion immediately via the agent.
- Download the YAML configuration file for running ingestion on an external scheduler.
4. Monitor the logs in real-time to track the ingestion process.

#### 5. Verify Ingested Data

1. Return to the Collate platform and refresh the database service.
2. Verify that the ingested metadata, including schemas, tables, and column details, is available.
3. Explore additional ingestion options like profiling, lineage, or data quality for the service.

### Additional Features

The Collate Ingestion Agent supports various ingestion workflows, allowing you to:
- **Generate YAML Configurations**: Download YAML files for external scheduling.
- **Manage Ingestion Types**: Run metadata, profiling, lineage, usage, and other workflows as needed.
- **Monitor Progress**: View logs and monitor real-time ingestion activity.

## 1. How does the Ingestion Framework work?

The Ingestion Framework contains all the logic about how to connect to the sources, extract their metadata
@@ -21,3 +21,90 @@ Perform the reindexing process described below.
{% partial file="/v1.6/deployment/reindex.md" /%}

**Note:** If you continue to experience issues, consider re-installing the search application.

## Configuration Parameters for Reindexing

This document provides detailed descriptions and best practices for configuring the reindexing process parameters. Proper configuration ensures efficient and reliable reindexing while minimizing potential system bottlenecks.

{% image
src="/images/v1.6/deployment/upgrade/reindex.png"
alt="Reindex Configuration"
/%}

### 1. `recreateIndex`
**Description**:
Determines whether to recreate the index from scratch during the reindexing process. Setting this to `true` will drop the existing index and create a new one.

**Best Practice**:
Use this option with caution. Set it to `true` only when you need a clean slate, such as after significant changes to your data model or during data migration. For routine updates, keep it `false` to preserve the existing index.

### 2. `batchSize`
**Description**:
Defines the maximum number of events sent in a single batch during reindexing. The default value is `100`.

**Best Practice**:
Adjust the batch size based on system capabilities and event size. A larger batch size improves throughput but may increase memory usage and processing time. Monitor performance and fine-tune accordingly.

### 3. `payLoadSize`
**Description**:
Specifies the maximum payload size (in bytes) for events sent in a batch. Default: `104,857,600 bytes` (100 MB).

**Best Practice**:
Ensure the payload size is within your system’s handling capacity. If memory issues or timeouts occur, reduce this value to improve stability.

### 4. `producerThreads`
**Description**:
Indicates the number of threads used for producing events during reindexing. Default: `10`.

**Best Practice**:
Balance the number of threads with system CPU and I/O capacity. Increasing this number can improve throughput but may lead to contention if set too high.

### 5. `maxConcurrentRequests`
**Description**:
Specifies the maximum number of concurrent requests sent to the search index at any given time. Default: `100`.

**Best Practice**:
Tune this value based on the indexing server’s capacity. Too many concurrent requests can overwhelm the server, leading to failures or slowdowns.

### 6. `maxRetries`
**Description**:
Specifies the maximum number of retry attempts for failed requests. Default: `3 retries`.

**Best Practice**:
Keep this value reasonable to avoid excessive load during failures. Analyze failure patterns to optimize this setting.

### 7. `initialBackoff`
**Description**:
Defines the initial backoff time (in milliseconds) before retrying a failed request. Default: `1000 ms` (1 second).

**Best Practice**:
Start with the default value. Increase it if failures occur frequently due to server overload or network issues.

### 8. `maxBackoff`
**Description**:
Specifies the maximum backoff time (in milliseconds) for retries. Default: `10,000 ms` (10 seconds).

**Best Practice**:
Set this value to align with your application’s latency tolerance. A longer backoff can reduce system load during peak times but may slow recovery from errors.

### 9. `queueSize`
**Description**:
Defines the internal queue size used for reindexing operations. Default: `100`.

**Best Practice**:
Adjust the queue size based on expected load and available memory resources. A larger queue can handle spikes in processing but requires more memory.

## Example Configuration for Best Practices

For high-performance systems, consider the following values as a starting point:

```json
{
"batchSize": 300,
"queueSize": 500,
"producerThreads": 20,
"maxConcurrentRequests": 500
}
```

Monitor system performance and adjust these parameters to optimize throughput and resource usage.
