Scikit learn library not working with n_jobs for RandomForestClassifier? #69

Open
RoKu-1999 opened this issue May 3, 2023 · 12 comments

Comments

@RoKu-1999

Hello,

I am trying to run ML workloads from the scikit-learn library and want to see what performance benefit the n_jobs parameter of RandomForestClassifier gives inside the enclave. In native execution with Python, performance improves by roughly 50% when I double the number of threads (n_jobs). Inside SGX with Gramine, however, I get the same performance for 1, 2, 4, 8, and 16 threads/jobs. Is it possible to use n_jobs inside SGX with Gramine?

In my manifest template I tried both passing OMP_NUM_THREADS through from the host environment and assigning it directly:

a)

loader.insecure__use_host_env = true
loader.insecure__use_cmdline_argv = true
loader.env.OMP_NUM_THREADS = { passthrough = true }

b)

loader.env.OMP_NUM_THREADS = "8"

Besides that, I have set sgx.max_threads=128 and increased my enclave_size to 64GB. From scikit-learn I understand that n_jobs can also be set with OMP_NUM_THREADS. Am I doing something wrong in the manifest template or is it just not possible to use n_jobs with Gramine?

Thanks in advance and best regards,
Robert Kubicek

@dimakuv

dimakuv commented May 4, 2023

Is it possible to use n_jobs inside SGX with Gramine?

Yes, of course, it is possible. Gramine fully supports multi-threaded workloads.

Am I doing something wrong in the manifest template or is it just not possible to use n_jobs with Gramine?

You seem to be doing it correctly in the manifest template. Both options that you showed are correct. The first option (option a) is insecure but allows fast checking. The second option (option b) is more secure but requires rebuilding of the SGX app each time you want to change the number of threads.

I'm not sure what you're doing wrong. Could you use option a and then show us the commands that you run and the outputs that you get? I would assume you do something like OMP_NUM_THREADS=16 gramine-sgx my_scikit_app.

Also, did you try gramine-direct? Does it have the same problem of not changing performance no matter how many threads you set?

@RoKu-1999
Author

Hi, thanks for the quick answer! :)
I'm using the following bash script, which exports OMP_NUM_THREADS and also, as you suggested, sets OMP_NUM_THREADS on the same line before gramine-sgx and gramine-direct. Neither exporting it nor setting it inline works.

for thread in 1 4 8 16; do
    export OMP_NUM_THREADS="$thread"
    echo "$OMP_NUM_THREADS"
    for filename in datasets/*.csv; do
        OMP_NUM_THREADS=$thread python3 ml_workloads.py ${filename} $thread >> OUTPUTS/OUTPUT_py_${thread}
        OMP_NUM_THREADS=$thread gramine-sgx ./sklearnex ./ml_workloads.py ${filename} $thread >> OUTPUTS/OUTPUT_sgx_${thread}
        OMP_NUM_THREADS=$thread gramine-direct ./sklearnex ./ml_workloads.py ${filename} $thread >> OUTPUTS/OUTPUT_direct_${thread}
    done
done

My results are as follows. Unlike the native Python execution, neither gramine-sgx nor gramine-direct improves with more threads. In my application I measure only the training time of the random forest, like this:

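import time

def train(clf, X_train, y_train):
    # Time only the fit call, using a nanosecond-resolution performance counter
    start = time.perf_counter_ns()
    # fit the classifier
    clf = clf.fit(X_train, y_train)
    end = time.perf_counter_ns()
    train_patched = end - start  # elapsed training time in nanoseconds
    # Emit the elapsed time as one comma-terminated field of the output row
    print(str(train_patched) + ',', end=' ')
    return clf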

That means the measurement takes place entirely inside SGX.

Python
1 Thread: 13.70142874 sec
4 Threads: 3.932495128 sec
8 Threads: 2.416561571 sec
16 Threads: 1.560428782 sec

gramine-sgx
1 Thread: 20.466038 sec
4 Threads: 20.402158 sec
8 Threads: 20.573651 sec
16 Threads: 20.390825 sec

gramine-direct
1 Thread: 14.417658 sec
4 Threads: 14.459555 sec
8 Threads: 14.424367 sec
16 Threads: 14.454447 sec

As you can see, there is no measurable performance change. Is there perhaps an error in my bash script?
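To put numbers on that comparison, here is a small sketch that turns the reported training times into speedup and parallel efficiency relative to the single-thread run. The timings are copied from the results above; the `speedups` helper is only an illustrative name, not part of any library.

```python
# Training times in seconds, copied from the results listed above.
timings = {
    "python": {1: 13.70142874, 4: 3.932495128, 8: 2.416561571, 16: 1.560428782},
    "gramine-sgx": {1: 20.466038, 4: 20.402158, 8: 20.573651, 16: 20.390825},
    "gramine-direct": {1: 14.417658, 4: 14.459555, 8: 14.424367, 16: 14.454447},
}


def speedups(times_by_threads):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread run."""
    baseline = times_by_threads[1]
    return {
        n: (baseline / t, baseline / t / n)
        for n, t in sorted(times_by_threads.items())
    }


for mode, times in timings.items():
    print(mode)
    for n, (speedup, efficiency) in speedups(times).items():
        print(f"  {n:>2} threads: speedup {speedup:5.2f}x, efficiency {efficiency:6.1%}")
```

With these inputs, native Python reaches a speedup of roughly 8.8x at 16 threads, while both Gramine modes stay within about 1% of 1.0x at every thread count, which is what one would expect if only a single core were doing the work.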

Thanks a lot!
Robert

@dimakuv

dimakuv commented May 5, 2023

Something is definitely broken in your setup. Could you please show us two things:

  1. The contents of your manifest.template file.
  2. Please set loader.log_level = "all" in the manifest.template file, rebuild the sklearnex files, and run one experiment (e.g. with thread=4). Then share the Gramine log with us.

We need more info to debug this issue.

@RoKu-1999
Author

Hi, here is my manifest.template:

# Intel(R) Extension for Scikit-learn* manifest example

loader.entrypoint = "file:{{ gramine.libos }}"
libos.entrypoint = "{{ entrypoint }}"

#loader.log_level = "{{ log_level }}"
loader.log_level = "all"
loader.log_file = "logs.txt"

loader.env.LD_LIBRARY_PATH = "/lib:{{ arch_libdir }}:/usr/{{ arch_libdir }}:/home/user/.local/lib"
loader.env.HOME = "{{ env.HOME }}"


loader.insecure__use_host_env = true
loader.insecure__use_cmdline_argv = true
loader.env.OMP_NUM_THREADS = { passthrough = true }

# Restrict the maximum number of threads to prevent insufficient memory
# issue, observed on CentOS/RHEL.

# loader.insecure__use_host_env = true

#loader.insecure__use_host_env = true

#loader.insecure__use_cmdline_argv = true
#loader.env.OMP_NUM_THREADS = { passthrough = true }

loader.uid = {{ env_user_uid }}
loader.gid = {{ env_user_gid }}

loader.pal_internal_mem_size = "128M"

sys.stack.size = "8M"
sys.enable_extra_runtime_domain_names_conf = true

fs.mounts = [
  { path = "{{ entrypoint }}", uri = "file:{{ entrypoint }}" },
  { path = "/lib", uri = "file:{{ gramine.runtimedir() }}" },
  { path = "{{ arch_libdir }}", uri = "file:{{ arch_libdir }}" },
  { path = "/usr/{{ arch_libdir }}", uri = "file:/usr/{{ arch_libdir }}" },
{% for path in python.get_sys_path(entrypoint) %}
  { path = "{{ path }}", uri = "file:{{ path }}" },
{% endfor %}

  # scikit-learn and its dependencies install shared libs under this path (e.g. daal4py package
  # installs libonedal_core.so lib); note that we use `/home/user/` prefix inside Gramine and
  # specify this prefix in LD_LIBRARY_PATH envvar above
  { path = "/home/user/.local/lib", uri = "file:{{ env.HOME }}/.local/lib" },

  { type = "tmpfs", path = "/tmp" },
]

sgx.enclave_size = "64G"
sgx.max_threads = 128
sgx.edmm_enable = {{ 'true' if env.get('EDMM', '0') == '1' else 'false' }}

sgx.trusted_files = [
  "file:{{ gramine.libos }}",
  "file:{{ entrypoint }}",
  "file:{{ gramine.runtimedir() }}/",
  "file:{{ arch_libdir }}/",
  "file:/usr/{{ arch_libdir }}/",
  "file:datasets/",
  "file:datasets_old/",
  "file:done_datasets/",
  "file:datasets_large/",
{% for path in python.get_sys_path(entrypoint) %}
  "file:{{ path }}{{ '/' if path.is_dir() else '' }}",
{% endfor %}
  "file:{{ env.HOME }}/.local/lib/",
  "file:ml_workloads.py",
]

My experiment:

set -e
mkdir -p OUTPUTS
OMP_NUM_THREADS=8 gramine-sgx ./sklearnex ./ml_workloads.py datasets/card_transdata.csv 8 >> OUTPUTS/OUTPUT_sgx_8

This creates the following output:

rf, datasets/card_transdata.csv, 8, 7, 700000, 9570718000, 300000, 253467000, 0.9999966666666666, 0.9999966666955177
rf, datasets/card_transdata.csv, 8, 7, 700000, 9574023000, 300000, 253537000, 0.9999966666666666, 0.9999966666955177
rf, datasets/card_transdata.csv, 8, 7, 700000, 9551418000, 300000, 253593000, 0.9999966666666666, 0.9999966666955177
rf, datasets/card_transdata.csv, 8, 7, 700000, 9555182000, 300000, 253467000, 0.9999966666666666, 0.9999966666955177

I have attached my log file.
logs.txt.zip

@RoKu-1999
Author

For my output the columns are:

model, dataset, num_threads, num_features, train_size, train_time, test_size, test_time, f1-score, accuracy
rf, datasets/card_transdata.csv, 8, 7, 700000, 9570718000, 300000, 253467000, 0.9999966666666666, 0.9999966666955177
rf, datasets/card_transdata.csv, 8, 7, 700000, 9574023000, 300000, 253537000, 0.9999966666666666, 0.9999966666955177
rf, datasets/card_transdata.csv, 8, 7, 700000, 9551418000, 300000, 253593000, 0.9999966666666666, 0.9999966666955177
rf, datasets/card_transdata.csv, 8, 7, 700000, 9555182000, 300000, 253467000, 0.9999966666666666, 0.9999966666955177

Train and test time are measured in nanoseconds.
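For anyone post-processing these rows, the following is a minimal sketch that parses them with the column names listed above and converts the nanosecond timings to seconds. The `parse_rows` helper and the derived `train_seconds` / `test_seconds` keys are hypothetical names introduced here for illustration.

```python
import csv
import io

# Column names as listed above.
COLUMNS = [
    "model", "dataset", "num_threads", "num_features", "train_size",
    "train_time", "test_size", "test_time", "f1-score", "accuracy",
]


def parse_rows(text):
    """Parse comma-separated result rows into dicts, adding times in seconds."""
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        # Skip blank lines, unrelated log lines, and the header row itself.
        if len(fields) != len(COLUMNS) or fields == COLUMNS:
            continue
        row = dict(zip(COLUMNS, fields))
        row["num_threads"] = int(row["num_threads"])
        # train_time and test_time are reported in nanoseconds.
        row["train_seconds"] = int(row["train_time"]) / 1e9
        row["test_seconds"] = int(row["test_time"]) / 1e9
        rows.append(row)
    return rows


sample = (
    "model, dataset, num_threads, num_features, train_size, train_time, "
    "test_size, test_time, f1-score, accuracy\n"
    "rf, datasets/card_transdata.csv, 8, 7, 700000, 9570718000, 300000, "
    "253467000, 0.9999966666666666, 0.9999966666955177\n"
)

for row in parse_rows(sample):
    print(row["num_threads"], row["train_seconds"], row["test_seconds"])
```

Read this way, the first row corresponds to about 9.57 seconds of training and about 0.25 seconds of testing with 8 requested threads.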

@dimakuv

dimakuv commented May 5, 2023

I don't see anything suspicious in the logs.

Where do you run this workload? Is this a bare metal machine, or some VM?

@RoKu-1999
Author

Where do you run this workload? Is this a bare metal machine, or some VM?

I run it on a server with the following CPU: Intel(R) Xeon(R) Platinum 8352S CPU @ 2.20GHz. It is a bare metal machine.

@dimakuv

dimakuv commented May 5, 2023

Can you set the number of threads programmatically in your script? Just to experiment, because I see no reason for OMP_NUM_THREADS to be ignored...
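One quick way to tell whether OMP_NUM_THREADS is being ignored or simply never reaches the process is to print what Python itself sees when launched natively, under gramine-direct, and under gramine-sgx. The sketch below uses only the standard library; `parallel_environment` is a hypothetical helper name, and the output is a data point for comparison rather than a diagnosis.

```python
import os


def parallel_environment():
    """Collect settings that commonly influence how many threads a library starts."""
    info = {
        "OMP_NUM_THREADS": os.environ.get("OMP_NUM_THREADS"),
        "cpu_count": os.cpu_count(),
    }
    # sched_getaffinity is not available on every platform, so guard the call.
    if hasattr(os, "sched_getaffinity"):
        info["affinity_cpus"] = len(os.sched_getaffinity(0))
    else:
        info["affinity_cpus"] = None
    return info


for name, value in parallel_environment().items():
    print(f"{name}: {value}")
```

If the three launch modes print different values for the same command line, that difference is a good place to start looking.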

@RoKu-1999
Author

RoKu-1999 commented May 5, 2023

Can you set the number of threads programmatically in your script?

Sure, just did it with

loader.env.OMP_NUM_THREADS = "8"

Previous results (from above):

model, dataset, num_threads, num_features, train_size, train_time, test_size, test_time, f1-score, accuracy

rf, datasets/card_transdata.csv, 8, 7, 700000, 9570718000, 300000, 253467000, 0.9999966666666666, 0.9999966666955177
rf, datasets/card_transdata.csv, 8, 7, 700000, 9574023000, 300000, 253537000, 0.9999966666666666, 0.9999966666955177
rf, datasets/card_transdata.csv, 8, 7, 700000, 9551418000, 300000, 253593000, 0.9999966666666666, 0.9999966666955177
rf, datasets/card_transdata.csv, 8, 7, 700000, 9555182000, 300000, 253467000, 0.9999966666666666, 0.9999966666955177

Results now:

The scikit-learn version is 1.2.2.
rf, datasets/card_transdata.csv, 8, 7, 700000, 9236753000, 300000, 246944000, 0.9999966666666666, 0.9999966666955177
rf, datasets/card_transdata.csv, 8, 7, 700000, 9252438000, 300000, 247544000, 0.9999966666666666, 0.9999966666955177
rf, datasets/card_transdata.csv, 8, 7, 700000, 9258795000, 300000, 247450000, 0.9999966666666666, 0.9999966666955177
rf, datasets/card_transdata.csv, 8, 7, 700000, 9217154000, 300000, 247485000, 0.9999966666666666, 0.9999966666955177

Still, there is only one core used.
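A rough way to check this from inside the script is to sample the number of live Python threads while the training call runs. This is only a sketch: `peak_thread_count` is a hypothetical helper, and `threading.active_count()` only sees threads known to the Python interpreter, so native worker threads started by OpenMP or other C/C++ runtimes would not show up in the count.

```python
import threading
import time


def peak_thread_count(work, interval=0.01):
    """Run work() while sampling threading.active_count(); return (result, peak)."""
    peak = threading.active_count()
    stop = threading.Event()

    def sample():
        nonlocal peak
        while not stop.is_set():
            peak = max(peak, threading.active_count())
            time.sleep(interval)

    sampler = threading.Thread(target=sample, daemon=True)
    sampler.start()
    try:
        result = work()
    finally:
        stop.set()
        sampler.join()
    return result, peak


def demo_work():
    """Stand-in workload: four Python threads that each sleep briefly."""
    workers = [threading.Thread(target=time.sleep, args=(0.2,)) for _ in range(4)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return "done"


result, peak = peak_thread_count(demo_work)
print(f"result={result}, peak Python threads={peak}")
```

In the real script, `work` could be something like `lambda: clf.fit(X_train, y_train)`. The peak includes the main thread and the sampler thread, so a value that never rises above that baseline would be consistent with the fit not starting any Python-level worker threads.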

Here is the log file as well:
logs.txt.zip

@dimakuv

dimakuv commented May 8, 2023

Can you set the number of threads programmatically in your script?

Sure, just did it with

loader.env.OMP_NUM_THREADS = "8"

No, sorry, that's not what I meant. By programmatically, I meant a literal code snippet in your Python script. In other words, setting the n_jobs parameter to a hard-coded value, like 1 for the single-threaded experiment, then 2 for the 2-thread experiment, etc. Then check whether this changes anything in Gramine's behavior.

@RoKu-1999
Author

No, sorry, that's not what I meant. By programmatically, I meant a literal code snippet in your Python script.

I see 👍 I just did this, and unfortunately the output remains the same.
clf = RandomForestClassifier(n_estimators=16, random_state=42, n_jobs=8)
I also checked with and without the Intel scikit-learn extension. Neither run used multiple threads.
The library versions also match those given in the Intel scikit-learn example.
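The hard-coded experiment could be sketched as a small harness that builds one fresh classifier per n_jobs value and times only the fit call, in the same way as the train function earlier in the thread. `time_fit` and `run_random_forest_demo` are hypothetical names, the synthetic dataset stands in for the real CSV files, and the demo function assumes scikit-learn is installed.

```python
import time


def time_fit(make_classifier, X, y, n_jobs_values=(1, 2, 4, 8)):
    """Fit one fresh classifier per n_jobs value and return {n_jobs: seconds}."""
    results = {}
    for n_jobs in n_jobs_values:
        clf = make_classifier(n_jobs)
        start = time.perf_counter_ns()
        clf.fit(X, y)
        results[n_jobs] = (time.perf_counter_ns() - start) / 1e9
    return results


def run_random_forest_demo():
    """Illustrative driver on synthetic data; requires scikit-learn."""
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic stand-in for the real dataset (an assumption for this sketch).
    X, y = make_classification(n_samples=20000, n_features=7, random_state=42)

    def make_classifier(n_jobs):
        # Same constructor arguments as in the comment above, with n_jobs varied.
        return RandomForestClassifier(n_estimators=16, random_state=42, n_jobs=n_jobs)

    for n_jobs, seconds in time_fit(make_classifier, X, y).items():
        print(f"n_jobs={n_jobs}: {seconds:.3f} s")
```

Calling `run_random_forest_demo()` natively and then under gramine-direct keeps the thread count out of the environment and the manifest entirely, so any remaining difference between the two runs cannot come from OMP_NUM_THREADS handling.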

Is there anything else you can think of to check?

logs.txt 2.zip

@dimakuv

dimakuv commented May 8, 2023

I find it weird. Unfortunately, at this point you'll need to dig deep into what happens in gramine-direct and gramine-sgx. For this, you'll have to use GDB and/or perf analysis. Please check https://gramine.readthedocs.io/en/stable/devel/debugging.html and https://gramine.readthedocs.io/en/stable/performance.html#profiling-with-perf.

I would also suggest starting with gramine-direct, as it is easier to debug and profile a non-SGX environment than an SGX environment.
