Chris Kotfila
Kitware Inc.
- What is parallel programming?
- How can I do it in Python?
- Notebook Examples
+++
ds = xr.open_dataset('/path/to/some.nc')
Inputs - Functions - Outputs
+++
ds = xr.open_dataset('/path/to/some.nc')
avg = ds.groupby('time.season') / ds.astype(float).groupby('time.season').sum()
- Some times the work takes a long time.
- Can't calculate the second statement until the first statement is done.
- One after another, input - transformation - output
+++
- Parallel programming is the process of 'pitching' work to something else that 'catches' that work and then does the heavy lifting.
- Once we've pitched our work we can continue to the next statement.
+++
- We're still doing input, function, output
- Except now the 'work' is to pass our real work on to someone else.
- The output is usually a reference to where we can get the result of the work when it's done.
+++
- If we don't want the output we can continue to the next statement.
- If we do want the output, we use the reference to wait until the output is available.
- This is called blocking.
+++
Wait.. if we're pitching work then just waiting for it to complete, what is the point? Isn't that just like serial programming?
+++
We can pitch as many jobs as we like off to other workers then wait for them all to complete.
+++
This is like me trying to solve four Rubik's cubes versus handing off four Rubik's cubes to my friends.
+++
If I have a function that takes 30 seconds and I need to run it 10 times, that will take me 5 minutes in wall time. If I can pitch those 10 jobs to 10 different CPU's, then waiting for it to complete should only take ~30 seconds (that's 10x improvement!)
+++
- File conversion
- Extract, Transform, Load (ETL) pipelines
- Parameter Sweeps
- Anything that can be cut up, worked on, and stitched back together.
+++
- Who's pitching the work?
- Who's catching the work?
- What exactly is being thrown?
+++
- Subprocess
- Multiprocessing
- IPyParallel
- Celery
+++
- Run a command line script from python
- Built-in module (batteries included)
- The
subprocess
module pitches - The operating system catches that work
- Pitch a string that runs the command
+++
- Run a function in a different process
- Built-in module (batteries included)
- the
multiprocessing
module pitches - A
Pool
of process objects catches - Pitch a python function through the magic of forking
+++
- Run functions in parallel from Jupyter Cells
- a special
view
object does the pitching - The
ipcluster
oripengine
commands catch - Can be run across computers (but this takes a little work)
- Pitch messages that contain data/function
+++
- Resiliently run hundreds of millions of functions.
- A decorated function pitches itself
- A message queue (
rabbitmq
) catches the work - Queue passes this to the
celery worker
- Pitch messages the contain data/function.
- The technology that powers Instagram.