Make Stetl multithreaded #41

Open
fsteggink opened this issue Jul 4, 2016 · 3 comments

Comments

@fsteggink
Collaborator

fsteggink commented Jul 4, 2016

Stetl is an ideal candidate for multithreading. Most of the time it processes datasets that consist of multiple files, and it runs in a (server or desktop) environment where multiple processors or cores are available.

See also nlextract/NLExtract#194

@justb4
Member

justb4 commented Jul 22, 2020

From the Stetl Gitter conversation:

"Was at the PyAmsterdam virtual Meetup yesterday (June 24, 2020, JvdB). Very interesting presentation by Clayton Bezuidenhout, see YouTube from minute 16: https://youtu.be/Aqu5PE3tzV0?t=998 . In essence something Stetl-like (basic Pipeline architecture, driven by configuration), but every module is a Thread. Communication runs via Queues. I have asked him whether he is willing to share code. Celery is a kind of alternative, but I think that is multi-process with messaging etc., too heavy. In GeoHealthCheck I have had good experience with scheduling (the APScheduler package) and multi-threading (every Healthcheck is a thread), very stable. I am posting it here so I remember it..."

The framework is Open Source:
https://bitbucket.org/clayton-bezuidenhout/threads-and-queues-example-app/src/master/

@justb4
Member

justb4 commented Jul 22, 2020

So the core architecture of Stetl is a Chain/Pipeline of Components (Inputs, Filters, Outputs) that pass Data Packets to each other. Likewise, a Component (or group of linked Components) could run in its own Thread and pass Data Packets via Queues to other Component Threads. So instead of being connected directly, Components could be connected via Queues.
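To make the idea concrete, here is a minimal sketch of Components connected via Queues, using only the Python standard library. The names `input_component`, `filter_component`, `output_component`, `run_chain` and the `_DONE` sentinel are hypothetical and are not part of Stetl's API; plain integers stand in for Data Packets.

```python
import queue
import threading

# Hypothetical end-of-stream marker; not part of Stetl.
_DONE = object()


def input_component(out_q):
    """Stand-in for an Input: emits packets, then signals end of stream."""
    for packet in range(5):
        out_q.put(packet)
    out_q.put(_DONE)


def filter_component(in_q, out_q):
    """Stand-in for a Filter: transforms each packet and passes it on."""
    while True:
        packet = in_q.get()
        if packet is _DONE:
            out_q.put(_DONE)  # propagate end of stream downstream
            break
        out_q.put(packet * 2)


def output_component(in_q, results):
    """Stand-in for an Output: collects packets until end of stream."""
    while True:
        packet = in_q.get()
        if packet is _DONE:
            break
        results.append(packet)


def run_chain():
    # Bounded queues give back-pressure: a fast Input blocks on put()
    # instead of filling memory when a downstream Component is slower.
    q1 = queue.Queue(maxsize=10)
    q2 = queue.Queue(maxsize=10)
    results = []
    threads = [
        threading.Thread(target=input_component, args=(q1,)),
        threading.Thread(target=filter_component, args=(q1, q2)),
        threading.Thread(target=output_component, args=(q2, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With one Thread per Component and FIFO queues, packet order is preserved. Running several Threads for the same Component would need one sentinel per consumer and would no longer guarantee order.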

In other cases we may consider running multiple instances of a Chain in parallel, e.g. with the Dutch Key Registries (Basisregistraties), which typically consist of multiple files where the order of processing is not significant.
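The multiple-instances variant could look roughly like this, as a sketch under assumptions: `run_chain_on_file` is a hypothetical placeholder for running one complete Chain on one input file, and the file names in the usage note are made up. It uses `concurrent.futures.ThreadPoolExecutor` from the standard library.

```python
from concurrent.futures import ThreadPoolExecutor


def run_chain_on_file(path):
    """Hypothetical placeholder for running one complete Chain on one file."""
    return f"processed {path}"


def run_chains(paths, max_workers=4):
    # Each file gets its own Chain run on a worker thread. Files may be
    # processed in any order, but map() yields results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_chain_on_file, paths))
```

For example, `run_chains(["a.xml", "b.xml"])` would run the two files on separate worker threads. This only fits when the files are independent, as in the Basisregistraties case.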

@fsteggink
Collaborator Author

The best solution depends on the workflow. I would keep Stetl as 'atomic' as possible: just use it for a single task. IMO this means that it should be executed on a single machine, and in that case I agree that threads are much more efficient than processes. An example is loading the BGT into a database. This can be seen as a single job, which lends itself very well to parallelization.

On the other hand, there are many situations in which you want to run multiple Stetl jobs. In that case processes should be used, and if you want to perform the processing on multiple machines, Celery or a similar task queue for distributed processing is needed.

So, I would suggest focusing on options to make Stetl multithreaded when performing one single job.
