Limitations of ETL Pipelines and Design Proposals for Improvement #1253
youngmoneee
started this conversation in
Ideas
Replies: 1 comment
-
Sounds interesting!
-
Hi, guys ✋
An ETL pipeline extracts data from various sources, transforms it into a consistent format, and loads it into a destination. The whole process can be modeled as a stream.
However, the current pipeline passes a List between stages, which presents the following limitations:
Current Behavior
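For reference, a List-based pipeline has roughly the shape below. This is a minimal sketch using only JDK types, not the project's actual source: the class name `ListEtl`, the `run` helper, and the use of `String` as a stand-in for the document type are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collectors;

class ListEtl {
    // Hypothetical helper: every stage receives and returns a fully built List.
    static void run(Supplier<List<String>> reader,
                    Function<List<String>, List<String>> transformer,
                    Consumer<List<String>> writer) {
        List<String> extracted = reader.get();                   // whole source held in memory
        List<String> transformed = transformer.apply(extracted); // second full list
        writer.accept(transformed);                               // writing starts only after both finish
    }

    public static void main(String[] args) {
        List<String> sink = new ArrayList<>();
        run(() -> List.of("a", "b"),
            docs -> docs.stream().map(String::toUpperCase).collect(Collectors.toList()),
            sink::addAll);
        System.out.println(sink); // [A, B]
    }
}
```

Each stage must finish completely before the next one starts, so peak memory grows with the size of the input.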
Expected Behavior
I also considered keeping a Consumer for the writer, as in the original interface, but I think this approach is better suited to pipelines that need to write to multiple destinations (File, VectorStore, DB, etc.). I would like to discuss this further.
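The idea of a writer that passes data on instead of consuming it can be sketched as follows. To stay dependency-free, this sketch uses `java.util.stream.Stream` as a stand-in for Flux (with Reactor, `Flux.doOnNext` would play a similar role to `peek` here). `StreamingEtl`, `writerTo`, and the in-memory lists acting as destinations are hypothetical names, not part of any existing API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Stream;

class StreamingEtl {
    // Hypothetical writer factory: records each element in `sink`, then passes it downstream.
    static Function<Stream<String>, Stream<String>> writerTo(List<String> sink) {
        return docs -> docs.peek(sink::add);
    }

    public static void main(String[] args) {
        List<String> file = new ArrayList<>();        // stands in for a file destination
        List<String> vectorStore = new ArrayList<>(); // stands in for a vector store

        Supplier<Stream<String>> reader = () -> Stream.of("a", "b", "c");
        Function<Stream<String>, Stream<String>> transformer =
                docs -> docs.map(String::toUpperCase);

        // Because a writer returns the stream, writers compose like any other stage.
        Function<Stream<String>, Stream<String>> stages =
                transformer.andThen(writerTo(file)).andThen(writerTo(vectorStore));

        // Nothing runs until a terminal operation pulls elements through.
        stages.apply(reader.get()).forEach(doc -> {});

        System.out.println(file);        // [A, B, C]
        System.out.println(vectorStore); // [A, B, C]
    }
}
```

With a Consumer-style writer the chain would end at the first destination; returning the stream is what lets a second writer be appended.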
Ultimately, I think we can encapsulate the flow of this pipeline through an ETLPipeLine object that extends Runnable.
The example usage after the changes is as follows.
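One possible shape for such a pipeline object, as a rough sketch only: the class name `EtlPipeline` and its `from`/`then` methods are hypothetical, `String` stands in for the document type, and `java.util.stream.Stream` stands in for Flux so that the example runs on the JDK alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Stream;

class EtlPipeline implements Runnable {
    private final Supplier<Stream<String>> reader;
    private final Function<Stream<String>, Stream<String>> stages;

    private EtlPipeline(Supplier<Stream<String>> reader,
                        Function<Stream<String>, Stream<String>> stages) {
        this.reader = reader;
        this.stages = stages;
    }

    // Starts a pipeline from a reader; no stages are attached yet.
    static EtlPipeline from(Supplier<Stream<String>> reader) {
        return new EtlPipeline(reader, Function.identity());
    }

    // Appends a transformer or writer stage and returns a new pipeline.
    EtlPipeline then(Function<Stream<String>, Stream<String>> stage) {
        return new EtlPipeline(reader, stages.andThen(stage));
    }

    // Drains the stream so that every stage sees every element.
    @Override
    public void run() {
        try (Stream<String> docs = stages.apply(reader.get())) {
            docs.forEach(doc -> {});
        }
    }

    public static void main(String[] args) {
        List<String> store = new ArrayList<>();
        Runnable pipeline = EtlPipeline.from(() -> Stream.of("a", "b"))
                .then(docs -> docs.map(String::toUpperCase)) // transform
                .then(docs -> docs.peek(store::add));        // write
        pipeline.run();
        System.out.println(store); // [A, B]
    }
}
```

Since the result is a plain Runnable, it can be handed to an executor or scheduler without any pipeline-specific glue.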
Benefits of the Proposed Design
This interface change makes the data flow controllable and allows more flexible implementations using lambdas or method references. Moving to Flux should also improve memory efficiency, because data is processed as a stream instead of being loaded into memory all at once.
It also enables asynchronous processing, which should let the pipeline handle larger datasets and real-time data streams more effectively.
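The memory argument rests on laziness: a streaming source only produces what downstream stages ask for. Below is a small illustration, again using `java.util.stream.Stream` as a JDK-only stand-in. Note that Stream is pull-based and synchronous, while Flux adds asynchronous delivery and backpressure, so this shows only the "no full materialization" part of the claim. `LazyReadDemo` and `producedFor` are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

class LazyReadDemo {
    // Pulls `take` documents from an unbounded source and reports
    // how many elements the source actually had to produce.
    static int producedFor(int take) {
        AtomicInteger produced = new AtomicInteger();
        Stream.iterate(1, i -> i + 1)                 // unbounded, lazily generated source
              .peek(i -> produced.incrementAndGet())  // count each element the source emits
              .map(i -> "doc-" + i)                   // transform one element at a time
              .limit(take)
              .forEach(doc -> {});                    // stand-in for a writer
        return produced.get();
    }

    public static void main(String[] args) {
        System.out.println(producedFor(3)); // 3
    }
}
```

A List-based reader could not represent this source at all, because it would have to build the whole list before returning.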
Anticipated Challenges and Solutions
Interface Changes: The goal is a more flexible and testable codebase, but changing the interfaces means existing implementations will have to be modified.
Existing synchronous implementations can be wrapped, either with hand-written adapters or with AOP, so the change can be introduced with minimal conflict.
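A hand-written wrapper for existing List-based implementations could look like the sketch below. The adapter names are hypothetical, and `java.util.stream.Stream` again stands in for Flux; with Reactor, the analogous building blocks would presumably be `Flux.fromIterable` and `Flux.collectList`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LegacyAdapters {
    // Wraps a List-returning reader; the list is still built eagerly by the legacy code.
    static Supplier<Stream<String>> adaptReader(Supplier<List<String>> legacy) {
        return () -> legacy.get().stream();
    }

    // Buffers the stream into a List because the legacy transformer needs all elements at once.
    static Function<Stream<String>, Stream<String>> adaptTransformer(
            Function<List<String>, List<String>> legacy) {
        return docs -> legacy.apply(docs.collect(Collectors.toList())).stream();
    }

    // Buffers, hands the batch to the legacy writer, then re-emits it for later stages.
    static Function<Stream<String>, Stream<String>> adaptWriter(Consumer<List<String>> legacy) {
        return docs -> {
            List<String> batch = docs.collect(Collectors.toList());
            legacy.accept(batch);
            return batch.stream();
        };
    }

    public static void main(String[] args) {
        List<String> written = new ArrayList<>();
        adaptTransformer(docs -> docs.subList(0, 1))
                .andThen(adaptWriter(written::addAll))
                .apply(adaptReader(() -> List.of("a", "b")).get())
                .forEach(doc -> {});
        System.out.println(written); // [a]
    }
}
```

Adapters like these keep old implementations working behind the new interfaces, but each one buffers its input, so the memory benefit only appears once a stage is rewritten natively.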
Also, please let me know if there are any issues I might have overlooked or if you have any better ideas for this proposal.
Thanks 🧑🏼‍💻