Pool.imap runs indefinitely on a Windows machine #211

Open
lucazav opened this issue May 9, 2021 · 1 comment

lucazav commented May 9, 2021

I'm trying to parallelize a row-wise pandas DataFrame apply() call, as described in this Stack Overflow question. Following albert's hint, I ran the following code in a conda environment with Python 3.9.1 (64-bit) on a Windows machine:

import pandas as pd
import time
from pathos.multiprocessing import Pool

def enrich_str(text):
    # Parameter named `text` so it does not shadow the built-in str()
    val1 = f'{text}_1'
    val2 = f'{text}_2'
    val3 = f'{text}_3'
    time.sleep(3)
    
    return val1, val2, val3
    
def enrich_row(row_tuple):
    passed_row = row_tuple[1]
    col_name = str(passed_row['colName'])
    my_string = str(passed_row[col_name])
    
    val1, val2, val3 = enrich_str(my_string)
    
    passed_row['enriched1'] = val1
    passed_row['enriched2'] = val2
    passed_row['enriched3'] = val3
    
    return passed_row

df = pd.DataFrame({'numbers': [1, 2, 3, 4, 5], 'colors': ['red', 'white', 'blue', 'orange', 'red']}, 
                  columns=['numbers', 'colors'])

df['colName'] = 'colors'

tic = time.perf_counter()
result = Pool(8).imap(enrich_row, df.iterrows(), chunksize=1)
enriched_df = pd.DataFrame(result)
toc = time.perf_counter()

print(f"{enriched_df.shape[0]} rows enriched in {toc - tic:0.4f} seconds")
print(enriched_df)

Unfortunately, it runs indefinitely on my machine, with all cores at 100%.
Any hints?

@mmckerns
Member

@lucazav: I tested this on a Mac, and it works for me, so I'm going to assume that it's a Windows issue. Python spawns processes differently on Windows than on other systems, and there are a few workarounds when things get stuck on Windows.

On Windows, you should generally call pathos.helpers.freeze_support(), which requires an if __name__ == '__main__': block. There's also multiprocess.set_start_method, which changes how the Pool starts its processes, but I don't have much experience with that function on Windows, so I'm not sure it's as functional there as it is on a Mac. I'd set that aside for now. Going back to freeze_support: if it throws an error once it's added, the natural next step would be either to use set_start_method (to change how pools are created) or to set dill.settings['recurse'] = True (to change how objects are serialized).

Give freeze_support a try.
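
For reference, here is a minimal sketch of what a guarded script could look like. It is simplified on purpose: a plain list of strings stands in for the DataFrame rows, the 3-second sleep is shortened through a hypothetical DELAY_SECONDS constant, and the pool size of 4 is arbitrary. The try/except fallback to the standard-library multiprocessing module is an assumption added only so the sketch runs where pathos is not installed. It has not been verified on Windows here, so treat it as a starting point rather than a confirmed fix.

```python
import time

try:
    from pathos.multiprocessing import Pool
    from pathos.helpers import freeze_support
except ImportError:
    # Assumption: fall back to the standard library so the sketch still runs
    # without pathos installed.
    from multiprocessing import Pool, freeze_support

# Hypothetical constant: stand-in for the 3-second sleep in the original report.
DELAY_SECONDS = 0.1


def enrich_str(text):
    """Return three suffixed copies of `text` after a simulated delay."""
    time.sleep(DELAY_SECONDS)
    return f"{text}_1", f"{text}_2", f"{text}_3"


def main():
    colors = ["red", "white", "blue", "orange", "red"]
    tic = time.perf_counter()
    with Pool(4) as pool:
        # imap is lazy, so consume it before the pool is torn down.
        results = list(pool.imap(enrich_str, colors, chunksize=1))
    toc = time.perf_counter()
    print(f"{len(results)} items enriched in {toc - tic:0.4f} seconds")
    return results


if __name__ == "__main__":
    # Only the parent process runs this block; child processes that re-import
    # the script on Windows skip it and do not create pools of their own.
    freeze_support()
    main()
```

The point of the layout is that the pool is created and imap is called only under the guard, while the worker function stays at module level so child processes can import it.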
