cuEquivariance library has only a slight performance improvement for the inference of MACE models #740

LHJ1098826475 · 2024-12-09T07:16:37Z

I tried to accelerate inference with the cuEquivariance library by following the MACE User Guide (https://mace-docs.readthedocs.io/en/latest/guide/cuda_acceleration.html).
However, adding enable_cueq=True gives only a slight performance improvement, far from the several-fold speedup described in the NVIDIA blog post (https://developer.nvidia.com/blog/accelerate-drug-and-material-discovery-with-new-math-library-nvidia-cuequivariance/).

`import time

from ase import build
from mace.calculators import mace_mp

atoms = build.molecule('H2O')
macemp = mace_mp(model="MACE_MPtrj_2022.9.model", enable_cueq=True)  # Return ASE calculator
descriptors_mp = macemp.get_descriptors(atoms)  # first call, not timed

start = time.time()
loop = 5
for i in range(loop):
    descriptors_mp = macemp.get_descriptors(atoms)
print((time.time() - start) / loop)  # average seconds per call`

Is the test case I constructed correct? Could you provide reference test cases?
Thanks.

ThePauliPrinciple · 2024-12-09T08:48:45Z

I see a similar result: about a 10-20% speed improvement (for a 25x25x25 angstrom water box). I do notice that memory usage is roughly halved, which is extremely convenient because it allows larger systems to fit on the same GPU.
I obtained these results with a 4090.

13 replies

ThePauliPrinciple Dec 9, 2024

How about the following script:

from mace.calculators import mace_mp
from ase.io import read
from torch.utils.benchmark import Timer

cueq_calculator = mace_mp(enable_cueq=True)
original_calculator = mace_mp(enable_cueq=False)

atoms = read('carbon.xyz')

def benchmark_calculator(calculator, num_iterations=100, warmup=100):
    def run_inference():
        calculator.calculate(atoms)
        return calculator.results

    # Warmup
    for _ in range(warmup):
        run_inference()

    # Benchmark
    timer = Timer(
        stmt="run_inference()",
        globals={
            "run_inference": run_inference,
        },
    )
    #warmu_up_measurement = timer.timeit(num_iterations)
    measurement = timer.timeit(num_iterations)
    return measurement

original_time = benchmark_calculator(original_calculator)
cueq_time = benchmark_calculator(cueq_calculator)
print(original_time)
print(cueq_time)
# Speedup of the cueq calculator over the original one (above 1 means cueq is faster)
print(original_time.mean / cueq_time.mean)

ilyes319 Dec 9, 2024
Maintainer

ASE has some cache handling that might be broken by that. The safest approach is to run 24 hours of MD on your system, writing to the log only every 1000 steps, and compare the number of steps completed.

Can you please share the output produced by my script?
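For anyone who wants to try the step-counting comparison without waiting a full day, here is a minimal sketch of the idea: run MD in chunks for a fixed wall-clock budget and count the steps completed. `count_steps`, `run_chunk`, and `budget_s` are hypothetical names for illustration, not part of MACE or ASE, and the demo uses a sleep as a stand-in for real dynamics.

```python
import time


def count_steps(run_chunk, budget_s, chunk=1000):
    """Run MD in chunks until budget_s seconds have elapsed; return steps completed."""
    steps = 0
    deadline = time.perf_counter() + budget_s
    while time.perf_counter() < deadline:
        run_chunk(chunk)  # advance the dynamics by `chunk` steps
        steps += chunk
    return steps


if __name__ == "__main__":
    # Stand-in for a real MD driver: each "chunk" just sleeps for a few milliseconds.
    done = count_steps(lambda n: time.sleep(0.005), budget_s=0.05, chunk=1000)
    print(f"completed {done} steps within the budget")
```

With ASE, `run_chunk` could presumably be the `run` method of an MD object (for example one built with `ase.md.langevin.Langevin`), with the logger attached at an interval of 1000. Running the same budget once with `enable_cueq=True` and once with `enable_cueq=False`, then dividing the two step counts, gives the speedup. Note that the last chunk can overshoot the budget, so the budget should be long compared with the time for one chunk.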

ThePauliPrinciple Dec 9, 2024

This was the output of your script:

run_inference()
  98.37 ms
  1 measurement, 100 runs , 1 thread
/home/user/prog/anaconda3/envs/mace-env-cueq/lib/python3.11/site-packages/mace/modules/models.py:69: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  "atomic_numbers", torch.tensor(atomic_numbers, dtype=torch.int64)

CUET Measurement:
<torch.utils.benchmark.utils.common.Measurement object at 0x7adaa1de1e10>
run_inference()
  31.66 ms
  1 measurement, 100 runs , 1 thread

Speedup: 3.11x
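As a quick sanity check on the number above, the reported speedup is consistent with dividing the first mean time by the second. A minimal sketch (the helper name `speedup` is just for illustration):

```python
def speedup(baseline_ms, accelerated_ms):
    """Ratio of baseline time to accelerated time; values above 1 mean faster."""
    return baseline_ms / accelerated_ms


# Mean times taken from the two measurements in the output above.
print(f"Speedup: {speedup(98.37, 31.66):.2f}x")  # prints "Speedup: 3.11x"
```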

ThePauliPrinciple Dec 10, 2024

It would still be appreciated if there is a simple benchmark script (with output) available that does use mace_mp and demonstrates the speedup. I am not able to observe it.

LHJ1098826475 Dec 19, 2024
Author

After running the above script, I did not see any speedup. @ThePauliPrinciple, can you share your GPU, driver, CUDA, and torch versions so that I can try to reproduce your result?

Thanks.

ilyes319 · 2024-12-09T10:48:15Z

Maintainer

Hello,
Did you make sure to compile the kernels before using them?
Also, the way you are timing the model call is not reliable. You need to do warm-up calls first, because the first step is slow due to kernel setup. In addition, calls like these include a lot of work that is external to the model evaluation, such as building the neighbor list. If you want examples of benchmark scripts, you can look here: https://github.com/ACEsuit/mace/blob/main/tests/test_benchmark.py

You should see up to a 5x speedup for the large model on A100 and H100; I confirmed this during MD. Be aware that the actual speedup depends on many things: small models are accelerated less than large ones, and changing GPUs also affects the speedup. Timing is very delicate, so be careful. What GPU are you using?
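To make the warm-up advice concrete, here is a minimal sketch of a timing helper that discards warm-up calls and optionally synchronizes after each call. `time_call` and its parameters are hypothetical names, not part of MACE or ASE; the workload in the demo is a plain Python stand-in so the snippet runs without a GPU.

```python
import time
from statistics import mean


def time_call(fn, warmup=10, repeats=50, sync=None):
    """Time fn() per call, discarding warm-up calls (hypothetical helper)."""
    # Warm-up: the first calls pay for kernel setup and caching, so they are not timed.
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()  # make sure warm-up work has finished before timing starts

    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for queued GPU work so it is included in this sample
        times.append(time.perf_counter() - start)
    return times


if __name__ == "__main__":
    # Stand-in workload; replace with the real model call.
    samples = time_call(lambda: sum(range(10_000)), warmup=3, repeats=20)
    print(f"mean {mean(samples) * 1e3:.3f} ms, min {min(samples) * 1e3:.3f} ms")
```

On a GPU one would presumably pass `sync=torch.cuda.synchronize` and something like `fn=lambda: macemp.get_descriptors(atoms)`. Keep in mind that a calculator-level call still includes neighbor-list construction and other overhead outside the model, so the measured ratio can be smaller than the kernel-level speedup.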

12 replies

ilyes319 Dec 12, 2024
Maintainer

Can you run nvidia-smi on your GPU and share the output with me?

LHJ1098826475 Dec 12, 2024
Author

ilyes319 Dec 12, 2024
Maintainer

Can you reinstall the kernels, but using CUDA 12?

pip install cuequivariance-ops-torch-cu12

Please start from a fresh env.

LHJ1098826475 Dec 13, 2024
Author

I have created a new conda env and installed cuequivariance-ops-torch-cu12, but there is still no performance acceleration.

cu12_equ/lib/python3.10/site-packages/torch/jit/_check.py:178: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
  warnings.warn(
cu12_equ/lib/python3.10/site-packages/torch/jit/_check.py:178: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
  warnings.warn(
E3NN Measurement:
<torch.utils.benchmark.utils.common.Measurement object at 0x7fa59207cd60>
run_inference()
  35.72 ms
  1 measurement, 1000 runs , 1 thread
cu12_equ/lib/python3.10/site-packages/mace/modules/models.py:69: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  "atomic_numbers", torch.tensor(atomic_numbers, dtype=torch.int64)

CUET Measurement:
<torch.utils.benchmark.utils.common.Measurement object at 0x7fa50538a590>
run_inference()
  38.47 ms
  1 measurement, 1000 runs , 1 thread

Speedup: 0.93x
(cu12_equ) root@ubuntu:/data/cu_equ# pip list | grep "cuequ"
cuequivariance                0.1.0
cuequivariance-ops-torch-cu12 0.1.0
cuequivariance-torch          0.1.0
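The same package check can be done from inside Python, which helps rule out the benchmark running under a different interpreter than the one `pip` reports on. This is a minimal sketch using only the standard library; `installed_matching` is a hypothetical helper name.

```python
from importlib.metadata import distributions


def installed_matching(prefix):
    """Return {name: version} for installed distributions whose name starts with prefix."""
    found = {}
    for dist in distributions():
        name = dist.metadata["Name"]
        if name and name.lower().startswith(prefix.lower()):
            found[name] = dist.version
    return dict(sorted(found.items()))


# List every cuequivariance-related package visible to this interpreter.
for name, version in installed_matching("cuequivariance").items():
    print(f"{name:32s}{version}")
```

If this prints nothing when run with the same interpreter as the benchmark, the cuEquivariance packages are not visible in that environment.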

LHJ1098826475 Dec 17, 2024
Author

@ilyes319
Do you have any other suggestions regarding the above phenomenon?

Thanks and looking forward to seeing your answer!
