Multiprocessing and Performance Improvements #1117

Draft
wants to merge 2 commits into base: dev

Conversation

carlsonp
Contributor

@carlsonp carlsonp commented Mar 14, 2024

This is related to some of the discussion in #1098

In my testing, I profile a single dataset inside a Docker container, using the following settings:

import json
import time
import dataprofiler as dp

filename = "myfile.parquet"

def profile_test(filename):
    data = dp.Data(filename)
    profile_options = dp.ProfilerOptions()
    
    profile_options.set({
        "structured_options.data_labeler.is_enabled": False,
        "unstructured_options.data_labeler.is_enabled": False,
        "structured_options.correlation.is_enabled": False,
        "structured_options.multiprocess.is_enabled": True,
        "structured_options.chi2_homogeneity.is_enabled": False,
        "structured_options.category.max_sample_size_to_check_stop_condition": 1,
        "structured_options.category.stop_condition_unique_value_ratio": 0.001,
        "structured_options.sampling_ratio": 0.3,
        "structured_options.null_replication_metrics.is_enabled": False
    })

    print(profile_options)

    profile = dp.Profiler(data, options=profile_options)

    human_readable_report = profile.report(report_options={"output_format":"pretty"})

    with open("reportfile.json", "w") as outfile:
        outfile.write(json.dumps(human_readable_report, indent=4))

# perf_counter is a monotonic clock, which suits elapsed-time measurements
start_time = time.perf_counter()
profile_test(filename)
end_time = time.perf_counter()

print(f"Profile runtime for {filename}: {end_time - start_time:.2f} seconds")

When Data Profiler reaches the first tqdm loop and displays Finding the Null values in the columns..., it is pretty quick. It also lists 19 processes, matching the pool_size available in the Python multiprocessing pool. This part works fine.

Then, when it reaches the second tqdm loop and displays Calculating the statistics..., I noticed it was using only 4 processes, and when I looked at what was actually running, only a single core was busy. Looking at the code, profile_builder.py hard-codes a pool size of 4, which doesn't seem right. There is a utility function, profiler_utils.suggest_pool_size, that returns a pool size but, as far as I can tell, is not used anywhere in the codebase, so I swapped it in. Now the run shows 19 processes instead of 4, which seems better: at least the hard-coded value no longer leaves potential performance on the table.
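To illustrate the idea of deriving the pool size rather than hard-coding it, here is a minimal sketch. It is not the body of profiler_utils.suggest_pool_size; `pick_pool_size` and its parameters are hypothetical names, and reserving one CPU for the parent process is an assumption made only because it would match the 19 processes reported above on a 20-CPU machine.

```python
import os


def pick_pool_size(n_jobs, usable_cpus=None, max_pool_size=None):
    """Hypothetical helper: choose a worker count from the CPUs available.

    This is an illustration only, not DataProfiler's own logic.
    """
    if usable_cpus is None:
        # os.cpu_count() can return None, so fall back to 1
        usable_cpus = os.cpu_count() or 1

    # Assumption: keep one CPU free for the parent process
    size = max(1, usable_cpus - 1)

    # Never start more workers than there are jobs to run
    size = min(size, max(1, n_jobs))

    # Optional upper bound, e.g. to limit memory use per run
    if max_pool_size is not None:
        size = max(1, min(size, max_pool_size))
    return size
```

With `usable_cpus=20` and plenty of jobs this yields 19 workers, while passing `max_pool_size=4` reproduces the old fixed limit as an explicit, visible choice.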

However, I'm still seeing only a single core in use. I also checked the CPU affinity after reading some comments on Stack Overflow, and it looks reasonable to me.

import os

print(f"Affinity: {os.sched_getaffinity(os.getpid())}")
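A slightly fuller version of that check, as a sketch: it reports the total CPU count next to the set of CPUs the process is allowed to run on, so a restrictive affinity mask is easy to spot. `cpu_report` is a hypothetical helper name. `os.sched_getaffinity` is not available on every platform, hence the guard, and an affinity mask does not reflect a CPU time quota imposed on a container, so a reasonable-looking result does not rule that out.

```python
import os


def cpu_report():
    """Hypothetical helper: collect CPU visibility info for this process."""
    if hasattr(os, "sched_getaffinity"):
        # 0 means "the calling process"
        allowed = sorted(os.sched_getaffinity(0))
    else:
        # Platform without the affinity API
        allowed = None
    return {
        "pid": os.getpid(),
        "cpu_count": os.cpu_count(),  # CPUs in the system, may be None
        "allowed_cpus": allowed,      # CPUs this process may be scheduled on
    }


report = cpu_report()
print(f"PID {report['pid']}: cpu_count={report['cpu_count']}, "
      f"allowed={report['allowed_cpus']}")
```

If `allowed_cpus` contains a single entry while `cpu_count` is larger, every worker in the pool would be competing for one core.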

I'm going to profile a bit more and see if I can figure out where it's hanging. Calculating all the statistics is taxing, but it seems like it should be faster than this, particularly on a multi-core machine.
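One possible way to do that profiling, as a sketch using the standard-library cProfile module: `profile_to_file` and the output path are hypothetical names, and the workload here is a stand-in rather than the profiling script above. Note that cProfile records only the process it runs in, so time spent inside pool worker processes would show up in the parent mainly as time spent waiting.

```python
import cProfile
import pstats


def profile_to_file(func, *args, out_path="profile.prof", **kwargs):
    """Hypothetical helper: run func under cProfile and save the stats."""
    profiler = cProfile.Profile()
    # runcall profiles exactly this one call and returns its result
    result = profiler.runcall(func, *args, **kwargs)
    profiler.dump_stats(out_path)
    return result


def busy_work(n):
    """Stand-in workload; replace with the function under investigation."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    profile_to_file(busy_work, 100_000, out_path="profile.prof")
    # Print the ten most expensive entries by cumulative time
    pstats.Stats("profile.prof").sort_stats("cumulative").print_stats(10)
```

The saved file can then be opened in a viewer such as snakeviz by passing it the path to `profile.prof`.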

gliptak and others added 2 commits March 14, 2024 09:12
* add downloads tile (capitalone#1085)

* Replace snappy with cramjam

* Delete test_no_snappy

---------

Co-authored-by: Taylor Turner <[email protected]>
@carlsonp carlsonp requested a review from a team as a code owner March 14, 2024 21:18
@carlsonp carlsonp marked this pull request as draft March 14, 2024 21:18
@taylorfturner
Contributor

I'm going to try profiling a bit more and see if I can figure out where it's hanging.

Testing is welcome @carlsonp -- I'll keep up with comments here and in #1098

@carlsonp
Contributor Author

I made a bit more progress in understanding what's going on, though I have no solution yet. Maybe someone will have suggestions. The profile, viewed in snakeviz, looks like this:

[Image: snakeviz-profile]

It iterates through the profile_types and adds work to the pool via apply_async. That part is fine, but it then seems to go immediately into a for loop that blocks on get, waiting for those jobs to finish. The result seems to be small batches of 2-4 jobs that kick off and must all finish before the next set of 2-4 jobs starts. I can sort of see now why the pool size of 4 was hard-coded. If I try to comment that out so that all the jobs are submitted at once and then just wait for the work to complete, I get incomplete and incorrect results. Maybe these really can't be parallelized any further and this is the best we can get, but you all know the codebase much better than I do. Any ideas?
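The scheduling difference described above can be shown in isolation with a small sketch. This is not the code in profile_builder.py: `fake_stat`, `run_in_batches`, `run_all_at_once` and `compare` are hypothetical names, the jobs are independent sleeps, and a ThreadPool (which offers the same apply_async and get interface as multiprocessing.Pool) is used only to keep the example self-contained. It says nothing about whether the real per-column jobs are safe to run all at once; the incorrect results reported above suggest they may depend on shared state or ordering.

```python
import time
from multiprocessing.pool import ThreadPool


def fake_stat(column_id, delay=0.05):
    """Stand-in for one statistics job; sleeps instead of computing."""
    time.sleep(delay)
    return column_id, column_id * 2


def run_in_batches(pool, column_ids, batch_size):
    """Submit a small batch, block on get() for all of it, then repeat."""
    results = {}
    for start in range(0, len(column_ids), batch_size):
        batch = column_ids[start:start + batch_size]
        pending = [pool.apply_async(fake_stat, (c,)) for c in batch]
        for job in pending:
            key, value = job.get()  # blocks until this job has finished
            results[key] = value
    return results


def run_all_at_once(pool, column_ids):
    """Submit every job first, then collect, so workers stay busy."""
    pending = [pool.apply_async(fake_stat, (c,)) for c in column_ids]
    return dict(job.get() for job in pending)


def compare(n_jobs=16, workers=8, batch_size=2):
    """Run both strategies and return their results and elapsed times."""
    column_ids = list(range(n_jobs))
    with ThreadPool(workers) as pool:
        t0 = time.perf_counter()
        batched = run_in_batches(pool, column_ids, batch_size)
        t1 = time.perf_counter()
        full = run_all_at_once(pool, column_ids)
        t2 = time.perf_counter()
    return batched, full, t1 - t0, t2 - t1
```

With batches of 2 on 8 workers, at most 2 workers are ever busy, so a larger pool size alone cannot help; the batching loop, not the pool size, is the limiting factor in this toy setup.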

@taylorfturner taylorfturner changed the base branch from 0.10.9-dev to dev June 7, 2024 18:34
@taylorfturner
Contributor

you'll want a rebase onto dev here


3 participants