Multiprocessing and Performance Improvements #1117

carlsonp · 2024-03-14T21:18:12Z

This is related to some of the discussion in #1098

In my testing I profile a single dataset inside a Docker container, using the following script and settings:

import json
import time
import dataprofiler as dp

filename = "myfile.parquet"

def profile_test(filename):
    data = dp.Data(filename)
    profile_options = dp.ProfilerOptions()
    
    profile_options.set({
        "structured_options.data_labeler.is_enabled": False,
        "unstructured_options.data_labeler.is_enabled": False,
        "structured_options.correlation.is_enabled": False,
        "structured_options.multiprocess.is_enabled": True,
        "structured_options.chi2_homogeneity.is_enabled": False,
        "structured_options.category.max_sample_size_to_check_stop_condition": 1,
        "structured_options.category.stop_condition_unique_value_ratio": 0.001,
        "structured_options.sampling_ratio": 0.3,
        "structured_options.null_replication_metrics.is_enabled": False
    })

    print(profile_options)

    profile = dp.Profiler(data, options=profile_options)

    human_readable_report = profile.report(report_options={"output_format":"pretty"})

    with open("reportfile.json", "w") as outfile:
        outfile.write(json.dumps(human_readable_report, indent=4))

start_time = time.time()
profile_test(filename)
end_time = time.time()

print(f"Profile runtime for {filename}: {end_time - start_time:.2f} seconds")

When Data Profiler reaches the first tqdm loop ("Finding the Null values in the columns..."), it's pretty quick. It also lists 19 processes, matching the pool_size of the Python multiprocessing pool. This works fine.

When it reaches the second tqdm loop ("Calculating the statistics..."), it only uses 4 processes, and when I look at what is actually running I see only a single core in use. Looking at the code, profile_builder.py has the pool size hard-coded to 4, which doesn't seem right. There is a utility function, profiler_utils.suggest_pool_size, that returns a pool size but, as far as I can tell, isn't used anywhere in the codebase. So I swapped that in. Now it shows 19 processes instead of 4, which seems better: at least we're not leaving potential performance on the table by hard-coding.
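As a rough illustration of sizing a pool from the machine instead of a constant, here is a minimal sketch. `usable_cpu_count` and `pick_pool_size` are hypothetical helpers written for this example; they are not `profiler_utils.suggest_pool_size` and make no claim about how that function decides.

```python
import os


def usable_cpu_count() -> int:
    """Return how many CPUs this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        # Available on Linux: the set of CPUs the current process may use.
        return len(os.sched_getaffinity(0))
    # Fallback for platforms without sched_getaffinity.
    return os.cpu_count() or 1


def pick_pool_size(n_tasks: int, reserve: int = 1) -> int:
    """Hypothetical helper: cap workers by usable CPUs and by task count."""
    if n_tasks < 1:
        return 1
    # Leave `reserve` CPUs free for the parent process, but keep at least one worker.
    workers = max(1, usable_cpu_count() - reserve)
    # Never start more workers than there are tasks to run.
    return min(workers, n_tasks)


if __name__ == "__main__":
    print("usable CPUs:", usable_cpu_count())
    print("pool size for 50 tasks:", pick_pool_size(50))
```

A larger pool only helps if enough independent jobs are in flight at the same time, so the process count by itself says little about how many cores end up busy.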

However, I'm still seeing only a single core being used. I also checked the CPU affinity after reading some comments on Stack Overflow, and it looks reasonable to me:

import os
print(f"Affinity: {os.sched_getaffinity(os.getpid())}")
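Affinity only says which cores the process may use. A complementary check is whether the pool's workers actually pick up tasks. The sketch below is a standalone diagnostic, not part of Data Profiler; `busy_pid` and `distinct_workers` are hypothetical names, and the task sizes are arbitrary.

```python
import os
from multiprocessing import Pool


def busy_pid(n: int) -> int:
    """Burn some CPU, then report which process did the work."""
    total = 0
    for i in range(n):
        total += i * i
    return os.getpid()


def distinct_workers(pool, n_tasks: int = 32, work: int = 200_000) -> int:
    """Count how many different processes handled the submitted tasks."""
    # chunksize=1 hands tasks out one at a time so idle workers can grab them.
    pids = pool.map(busy_pid, [work] * n_tasks, chunksize=1)
    return len(set(pids))


if __name__ == "__main__":
    with Pool(processes=4) as pool:
        print("distinct worker PIDs:", distinct_workers(pool))
```

If a check like this spreads across several PIDs on the same machine while the profiler still keeps one core busy, that points at how the jobs are scheduled rather than at the container or the affinity mask.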

I'm going to profile a bit more and see if I can figure out where it's hanging. Calculating all the statistics is taxing, but it seems like it should be faster, particularly on a multi-core machine.
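One way to capture such a profile is with the standard-library cProfile module, sketched below. `profile_call` is a hypothetical wrapper, and the workload in the demo is a stand-in; in the script above the call would wrap `profile_test(filename)` instead.

```python
import cProfile
import os
import pstats
import tempfile


def profile_call(func, *args, out_path="profile_test.prof", top=15, **kwargs):
    """Run func under cProfile, save the stats to out_path, print the top entries."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    profiler.dump_stats(out_path)
    # Sort by cumulative time to see which call chains dominate the runtime.
    pstats.Stats(out_path).sort_stats("cumulative").print_stats(top)
    return result


if __name__ == "__main__":
    # Stand-in workload, written to a temporary directory for the demo.
    out_path = os.path.join(tempfile.mkdtemp(), "demo.prof")
    profile_call(sum, range(1_000_000), out_path=out_path)
    print("stats written to", out_path)
```

The saved file can then be opened with `snakeviz <path to the .prof file>`. Note that cProfile only records the process it runs in, so time spent inside pool workers shows up in the parent mainly as waiting on results.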


taylorfturner · 2024-03-15T14:53:19Z

> I'm going to try profiling a bit more and see if I can figure out where it's hanging.

Testing is welcome, @carlsonp. I'll keep an eye on the comments here and in #1098.

carlsonp · 2024-03-20T13:30:47Z

I made a bit more progress in understanding what's going on. No solutions yet though. Maybe someone will have suggestions. The profiling via snakeviz looks like this:

It iterates through the profile_types and adds work to the pool via apply_async. That part is fine, but it then seems to immediately enter a for loop that blocks on get, waiting for those jobs to finish. The result is small batches of 2-4 jobs that kick off and must all finish before the next set of 2-4 jobs starts. I can sort of see now why the pool size of 4 was hard-coded. If I comment out that wait so all the jobs are submitted at once, and only then wait for the work to complete, I get incomplete and incorrect results. Maybe these really can't be parallelized any further and this is the best we can get, but you all know the codebase much better than I do. Any ideas?
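The two scheduling patterns described above can be sketched in isolation. This is not Data Profiler's code: `slow_square` is a stand-in job, the function names are hypothetical, and the jobs here are fully independent, which may not hold for the real profile updates.

```python
import time
from multiprocessing import Pool


def slow_square(x: int) -> int:
    """Stand-in for one independent profiling job."""
    time.sleep(0.05)
    return x * x


def run_in_batches(pool, items, batch_size):
    """Submit a small batch, block on every result, then submit the next batch."""
    results = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        pending = [pool.apply_async(slow_square, (x,)) for x in batch]
        # get() blocks, so no new work is queued until this batch is finished.
        results.extend(job.get() for job in pending)
    return results


def run_all_at_once(pool, items):
    """Submit every job first, then collect results in submission order."""
    pending = [pool.apply_async(slow_square, (x,)) for x in items]
    return [job.get() for job in pending]


if __name__ == "__main__":
    items = list(range(16))
    with Pool(processes=8) as pool:
        t0 = time.perf_counter()
        batched = run_in_batches(pool, items, batch_size=4)
        t1 = time.perf_counter()
        all_at_once = run_all_at_once(pool, items)
        t2 = time.perf_counter()
    print(f"batches of 4: {t1 - t0:.2f}s")
    print(f"all at once:  {t2 - t1:.2f}s")
    print("same results:", batched == all_at_once)
```

Submitting everything up front is only safe when jobs do not depend on each other's results or on shared state that is updated between batches. If the real jobs have such a dependency, that could be one explanation for the incomplete results, though this sketch does not verify it.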

taylorfturner · 2024-06-07T18:34:27Z

you'll want a rebase onto dev here

gliptak and others added 2 commits March 14, 2024 09:12

Replace snappy with cramjam (capitalone#1091)

f411110

* add downloads tile (capitalone#1085)
* Replace snappy with cramjam
* Delete test_no_snappy

---------

Co-authored-by: Taylor Turner <[email protected]>

fix: switch to utility function to return suggested pool size for pool object

a16e9ab

carlsonp requested a review from a team as a code owner March 14, 2024 21:18

carlsonp marked this pull request as draft March 14, 2024 21:18

taylorfturner changed the base branch from 0.10.9-dev to dev June 7, 2024 18:34
