Text updates
shayonj committed Sep 2, 2024
1 parent f1f4ab8 commit a6626f5
Showing 4 changed files with 66 additions and 3 deletions.
12 changes: 10 additions & 2 deletions README.md
@@ -1,5 +1,9 @@
# 🌊 pg_flo

## ![](internal/demo.gif)

[![CI](https://github.com/shayonj/pg_flo/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/shayonj/pg_flo/actions/workflows/ci.yml)

`pg_flo` is the easiest way to move and transform data from PostgreSQL. It uses PostgreSQL Logical Replication to stream inserts, updates, deletes, and DDL changes to multiple destinations. With support for parallelizable bulk copy, near real-time streaming, and powerful transformation and filtering rules, `pg_flo` simplifies data sync and ETL processes.

⚠️ CURRENTLY UNDER ACTIVE DEVELOPMENT. ACCEPTING FEEDBACK/ISSUES/PULL REQUESTS 🚀
@@ -30,6 +34,7 @@
- Supports tracking DDL changes.
- Configurable via command-line flags or environment variables.
- Supports copy and stream mode to parallelize bulk copy and stream changes.
- Resumable streaming from the last `lsn` position.

I invite you to take a look through [issues](https://github.com/shayonj/pg_flo/issues) to see what's coming next 🤗.

@@ -81,7 +86,6 @@ You can configure `pg_flo` using a YAML configuration file or environment variables

### Example 1: Basic streaming of changes to STDOUT

```shell
pg_flo stream stdout \
--host localhost \
@@ -92,7 +96,7 @@
--group your_group \
--schema public \
--tables table1,table2
```

### Example 2: Using Configuration File

@@ -149,6 +153,10 @@ pg_flo stream file \
- `make test`
- `make lint`

## How it Works

You can read a brief overview of how the tool works [here](internal/how-it-works.md).

### End-to-End Tests

To run the end-to-end tests, use the provided script:
Binary file added internal/demo.gif
55 changes: 55 additions & 0 deletions internal/how-it-works.md
@@ -0,0 +1,55 @@
# How it works

`pg_flo` leverages PostgreSQL's logical replication system to capture and stream data while applying transformations and filters on the fly.

1. **Publication Creation**: It creates a PostgreSQL publication for the specified tables or all tables (per `group`).
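
   To make this step concrete, here is a minimal Go sketch of issuing the `CREATE PUBLICATION` statement for a group's tables. pg_flo is written in Go, but this is an illustration rather than its actual code; the `pg_flo_<group>_publication` naming scheme and the `lib/pq` driver are assumptions.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"strings"

	_ "github.com/lib/pq" // assumption: any PostgreSQL driver would do here
)

// createPublication creates a publication covering the given tables, or all
// tables when none are listed. The name format is a hypothetical convention.
func createPublication(db *sql.DB, group string, tables []string) error {
	forClause := "FOR ALL TABLES"
	if len(tables) > 0 {
		forClause = "FOR TABLE " + strings.Join(tables, ", ")
	}
	stmt := fmt.Sprintf("CREATE PUBLICATION pg_flo_%s_publication %s", group, forClause)
	_, err := db.Exec(stmt)
	return err
}

func main() {
	db, err := sql.Open("postgres", "postgres://localhost:5432/mydb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := createPublication(db, "your_group", []string{"table1", "table2"}); err != nil {
		log.Fatal(err)
	}
}
```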

2. **Replication Slot**: A replication slot is created to ensure no data is lost between streaming sessions.
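
   Creating the slot is a single SQL call against the source database. A hedged sketch, with the same illustrative driver and naming assumptions as above, that skips creation when the slot already exists:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // assumption: driver choice is illustrative
)

// ensureReplicationSlot creates a logical replication slot using the built-in
// pgoutput plugin, unless a slot with that name is already present.
func ensureReplicationSlot(db *sql.DB, slot string) error {
	var exists bool
	err := db.QueryRow(
		"SELECT EXISTS (SELECT 1 FROM pg_replication_slots WHERE slot_name = $1)", slot,
	).Scan(&exists)
	if err != nil || exists {
		return err
	}
	_, err = db.Exec("SELECT pg_create_logical_replication_slot($1, 'pgoutput')", slot)
	return err
}

func main() {
	db, err := sql.Open("postgres", "postgres://localhost:5432/mydb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// "pg_flo_your_group_slot" is a hypothetical slot-naming convention.
	if err := ensureReplicationSlot(db, "pg_flo_your_group_slot"); err != nil {
		log.Fatal(err)
	}
	fmt.Println("slot ready")
}
```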

3. **Operation Modes**:

- Users can choose between two modes of operation:
a) **Copy-and-Stream**: Performs an initial bulk copy followed by streaming changes.
b) **Stream-Only**: Starts streaming changes immediately from the last known position.
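
   Put differently, the only extra input the decision needs is whether the destination already holds a usable LSN. A tiny illustrative sketch (the function and constant names are hypothetical):

```go
package main

import "fmt"

// Mode is the user-selected operation mode.
type Mode string

const (
	CopyAndStream Mode = "copy-and-stream"
	StreamOnly    Mode = "stream-only"
)

// needsInitialCopy reports whether a bulk copy should run before streaming:
// only in copy-and-stream mode, and only when the sink has no stored LSN yet.
func needsInitialCopy(mode Mode, storedLSN uint64) bool {
	return mode == CopyAndStream && storedLSN == 0
}

func main() {
	fmt.Println(needsInitialCopy(CopyAndStream, 0)) // true: fresh sink, copy first
	fmt.Println(needsInitialCopy(CopyAndStream, 1)) // false: resume streaming
	fmt.Println(needsInitialCopy(StreamOnly, 0))    // false: stream immediately
}
```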

4. **Initial Bulk Copy** (for Copy-and-Stream mode):

- If no valid LSN (Log Sequence Number) is found in the target sink, `pg_flo` performs an initial bulk copy of existing data.
- This process is parallelized for fast data sync:
- Tables are analyzed to optimize the copy process.
- A snapshot is taken to ensure consistency.
- Each table is divided into page ranges.
- Multiple workers copy different ranges concurrently.
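
   One way to picture the parallel copy: split each table into page (block) ranges and let a pool of workers copy ranges concurrently by filtering on `ctid`. The sketch below illustrates the technique only; it omits the snapshot pinning and the sink writes that a real implementation needs, and the chunk size and worker count are arbitrary.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"sync"

	_ "github.com/lib/pq" // assumption: driver choice is illustrative
)

// copyRange copies one page range of a table by filtering on ctid. Real code
// would run inside a transaction pinned to the shared snapshot
// (SET TRANSACTION SNAPSHOT '...') and stream the rows to the sink.
func copyRange(db *sql.DB, table string, startPage, endPage int64) error {
	q := fmt.Sprintf(
		"SELECT * FROM %s WHERE ctid >= '(%d,0)'::tid AND ctid < '(%d,0)'::tid",
		table, startPage, endPage)
	rows, err := db.Query(q)
	if err != nil {
		return err
	}
	defer rows.Close()
	for rows.Next() {
		// ... scan the row and hand it to the sink ...
	}
	return rows.Err()
}

func copyTableInParallel(db *sql.DB, table string, workers, pagesPerChunk int64) error {
	var pages int64
	// relpages is an estimate refreshed by ANALYZE, which is why tables are
	// analyzed before the copy starts.
	if err := db.QueryRow(
		"SELECT relpages FROM pg_class WHERE relname = $1", table).Scan(&pages); err != nil {
		return err
	}
	chunks := make(chan [2]int64)
	var wg sync.WaitGroup
	for i := int64(0); i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for c := range chunks {
				if err := copyRange(db, table, c[0], c[1]); err != nil {
					log.Println("copy chunk failed:", err)
				}
			}
		}()
	}
	for start := int64(0); start <= pages; start += pagesPerChunk {
		chunks <- [2]int64{start, start + pagesPerChunk}
	}
	close(chunks)
	wg.Wait()
	return nil
}

func main() {
	db, err := sql.Open("postgres", "postgres://localhost:5432/mydb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := copyTableInParallel(db, "table1", 4, 1000); err != nil {
		log.Fatal(err)
	}
}
```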

5. **Resumable Streaming**:

- After the initial copy (or immediately in Stream-Only mode), streaming starts from the last known position.
- The last processed LSN is stored in the target sink/destination, allowing `pg_flo` to resume operations from where it left off in case of interruptions.
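
   Resumability needs just two operations against the destination: read the last stored LSN at startup and record a newer one after each successful flush. A simplified sketch that keeps the LSN in a local state file, as a stand-in for whatever the real sink stores:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// loadLastLSN returns the LSN recorded by a previous run, or 0 when the
// destination has never been written to (which triggers the initial copy).
func loadLastLSN(path string) uint64 {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0
	}
	lsn, err := strconv.ParseUint(string(b), 10, 64)
	if err != nil {
		return 0
	}
	return lsn
}

// storeLastLSN durably records the newest position that reached the sink.
func storeLastLSN(path string, lsn uint64) error {
	return os.WriteFile(path, []byte(strconv.FormatUint(lsn, 10)), 0o644)
}

func main() {
	const state = "/tmp/pg_flo_last_lsn" // hypothetical state location
	start := loadLastLSN(state)
	fmt.Println("resuming from LSN", start)
	// ... stream changes, then after a successful flush:
	_ = storeLastLSN(state, start+1)
}
```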

6. **Message Processing**: It processes various types of messages:

- Relation messages to understand table structures
- Insert, Update, and Delete messages containing actual data changes
- Begin and Commit messages for transaction boundaries
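
   A sketch of the dispatch over decoded `pgoutput` messages, assuming the `github.com/jackc/pglogrepl` package (a common Go library for this protocol); it is shown as a helper function to illustrate the message types, not pg_flo's own handler:

```go
package pgfloexample

import (
	"log"

	"github.com/jackc/pglogrepl"
)

// handleWALMessage decodes one pgoutput message and routes it by type.
// relations caches table structure from Relation messages so that later
// Insert/Update/Delete tuples can be interpreted column by column.
func handleWALMessage(walData []byte, relations map[uint32]*pglogrepl.RelationMessage) error {
	msg, err := pglogrepl.Parse(walData)
	if err != nil {
		return err
	}
	switch m := msg.(type) {
	case *pglogrepl.RelationMessage:
		relations[m.RelationID] = m // remember column names and type OIDs
	case *pglogrepl.BeginMessage:
		// transaction starts: note the final LSN / commit time if needed
	case *pglogrepl.InsertMessage:
		if rel, ok := relations[m.RelationID]; ok {
			log.Printf("INSERT into %s.%s", rel.Namespace, rel.RelationName)
		}
	case *pglogrepl.UpdateMessage:
		log.Printf("UPDATE on relation %d", m.RelationID)
	case *pglogrepl.DeleteMessage:
		log.Printf("DELETE on relation %d", m.RelationID)
	case *pglogrepl.CommitMessage:
		// transaction boundary: a natural point to flush buffered rows
	}
	return nil
}
```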

7. **Data Transformation**: Received data is converted into a structured format, with type-aware conversions for different PostgreSQL data types.
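
   For example, values arrive from `pgoutput` in text form and can be converted based on the column's type OID. A simplified, assumption-laden version of that idea (only a few common OIDs shown):

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// convertValue turns the text form sent by pgoutput into a Go value based on
// the column's PostgreSQL type OID.
func convertValue(raw string, typeOID uint32) (interface{}, error) {
	switch typeOID {
	case 16: // bool
		return raw == "t", nil
	case 20, 21, 23: // int8, int2, int4
		return strconv.ParseInt(raw, 10, 64)
	case 700, 701: // float4, float8
		return strconv.ParseFloat(raw, 64)
	case 1184: // timestamptz, e.g. "2024-09-02 12:34:56.123456+00"
		return time.Parse("2006-01-02 15:04:05.999999-07", raw)
	default: // text, varchar, json, etc. stay as strings
		return raw, nil
	}
}

func main() {
	v, _ := convertValue("42", 23)
	fmt.Printf("%v (%T)\n", v, v) // 42 (int64)
}
```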

8. **Rule Application**: If configured, transformation and filtering rules are applied to the data:

- **Transform Rules**:
- Regex: Apply regular expression transformations to string values.
- Mask: Mask sensitive data, keeping the first and last characters visible.
- **Filter Rules**:
- Comparison: Filter based on equality, inequality, greater than, less than, etc.
- Contains: Filter string values based on whether they contain a specific substring.
- Rules can be applied selectively to insert, update, or delete operations.
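
   A compact sketch of how the mask, regex, and contains rules might behave; the function shapes here are hypothetical, and the real rule configuration lives in pg_flo's own docs:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// maskMiddle keeps the first and last character and masks the rest,
// mirroring the "mask" transform described above (byte-based, ASCII example).
func maskMiddle(s string) string {
	if len(s) <= 2 {
		return s
	}
	return s[:1] + strings.Repeat("*", len(s)-2) + s[len(s)-1:]
}

// regexReplace applies a regular-expression transform to a string value.
func regexReplace(s, pattern, replacement string) string {
	return regexp.MustCompile(pattern).ReplaceAllString(s, replacement)
}

// containsFilter reports whether a row should be kept based on a substring;
// a rule engine would apply it only to the configured operations.
func containsFilter(value, substr string) bool {
	return strings.Contains(value, substr)
}

func main() {
	fmt.Println(maskMiddle("secret-token"))                           // s**********n
	fmt.Println(regexReplace("user@example.com", "@.*", "@redacted")) // user@redacted
	fmt.Println(containsFilter("new_york", "york"))                   // true
}
```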

9. **Buffering**: Processed data is buffered and written in batches to optimize write operations to the destination.
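
   Batching can be modeled as a buffer that flushes either when it reaches a size threshold or when a timer fires. A rough sketch of that pattern (thresholds and types are illustrative, not pg_flo's exact buffering code):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Buffer accumulates processed changes and flushes them in batches.
type Buffer struct {
	mu      sync.Mutex
	items   []string
	maxSize int
	flush   func(batch []string)
}

func (b *Buffer) Add(item string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.items = append(b.items, item)
	if len(b.items) >= b.maxSize {
		b.flushLocked()
	}
}

func (b *Buffer) flushLocked() {
	if len(b.items) == 0 {
		return
	}
	b.flush(b.items)
	b.items = nil
}

// FlushEvery flushes on an interval so small trickles of changes still reach
// the sink promptly.
func (b *Buffer) FlushEvery(d time.Duration) {
	for range time.Tick(d) {
		b.mu.Lock()
		b.flushLocked()
		b.mu.Unlock()
	}
}

func main() {
	buf := &Buffer{maxSize: 2, flush: func(batch []string) { fmt.Println("flush:", batch) }}
	go buf.FlushEvery(500 * time.Millisecond)
	buf.Add("change-1")
	buf.Add("change-2") // size threshold reached: flushes immediately
	buf.Add("change-3")
	time.Sleep(time.Second) // the timer flushes the remaining item
}
```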

10. **Writing to Sink**: Data is periodically flushed from the buffer to the configured sink (e.g., stdout, file, or other destinations).
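
    The different destinations can be thought of as implementations of one small interface, so the same flush loop can target any of them. The `Sink` interface, JSON shape, and file handling below are illustrative assumptions, not pg_flo's actual sink API:

```go
package main

import (
	"encoding/json"
	"os"
)

// Sink receives batches of changes.
type Sink interface {
	WriteBatch(batch []map[string]interface{}) error
}

// StdoutSink writes each change as one JSON line to standard output.
type StdoutSink struct{}

func (StdoutSink) WriteBatch(batch []map[string]interface{}) error {
	enc := json.NewEncoder(os.Stdout)
	for _, change := range batch {
		if err := enc.Encode(change); err != nil {
			return err
		}
	}
	return nil
}

// FileSink appends JSON lines to a file on disk.
type FileSink struct{ Path string }

func (s FileSink) WriteBatch(batch []map[string]interface{}) error {
	f, err := os.OpenFile(s.Path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	enc := json.NewEncoder(f)
	for _, change := range batch {
		if err := enc.Encode(change); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	var s Sink = StdoutSink{}
	_ = s.WriteBatch([]map[string]interface{}{
		{"operation": "INSERT", "table": "table1", "new": map[string]interface{}{"id": 1}},
	})
}
```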

11. **State Management**:

- The tool keeps track of its progress by updating the Last LSN in the target sink/destination.
- This allows for resumable operations across multiple runs.
- Periodic status updates are sent to PostgreSQL to maintain the replication connection.
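
    On the protocol side, the periodic status update is a standby status message acknowledging how far the client has written. The sketch below assumes the `github.com/jackc/pglogrepl` and `github.com/jackc/pgx/v5/pgconn` packages and shows only the keepalive loop; a real loop interleaves this with reading WAL data from the same connection, and the slot and publication names are hypothetical.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/jackc/pglogrepl"
	"github.com/jackc/pgx/v5/pgconn"
)

func main() {
	ctx := context.Background()
	// "replication=database" puts the connection into logical replication mode.
	conn, err := pgconn.Connect(ctx, "postgres://localhost:5432/mydb?replication=database")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	slot := "pg_flo_your_group_slot"                  // hypothetical slot name
	lastWritten, _ := pglogrepl.ParseLSN("0/16B3748") // would come from the sink

	err = pglogrepl.StartReplication(ctx, conn, slot, lastWritten,
		pglogrepl.StartReplicationOptions{
			Mode:       pglogrepl.LogicalReplication,
			PluginArgs: []string{"proto_version '1'", "publication_names 'pg_flo_your_group_publication'"},
		})
	if err != nil {
		log.Fatal(err)
	}

	// A real loop also calls conn.ReceiveMessage(...) to read WAL data; here we
	// only show the periodic acknowledgement that keeps the connection healthy
	// and lets the server recycle WAL up to the acknowledged position.
	for range time.Tick(10 * time.Second) {
		err := pglogrepl.SendStandbyStatusUpdate(ctx, conn, pglogrepl.StandbyStatusUpdate{
			WALWritePosition: lastWritten,
		})
		if err != nil {
			log.Fatal(err)
		}
	}
}
```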
2 changes: 1 addition & 1 deletion internal/webhook_test.sh
@@ -3,7 +3,7 @@ set -euo pipefail

source "$(dirname "$0")/e2e_common.sh"

WEBHOOK_URL="https://big-lamp-86.webhook.cool"
WEBHOOK_URL="https://deep-article-49.webhook.cool"

setup_docker() {
rm -Rf /tmp/pg*
