Account for input data size in memory hook test without inline cupy array copy

The change is small, so it may not be obvious why it's needed or what
it's doing; I'll provide some details that I think are relevant.

When the GPU memory requirements of a method are being calculated by a
memory estimator, the estimation needs to account for the input data of
the method already being in GPU memory. In other words, when calculating
the maximum number of slices that can fit into GPU memory, the GPU
memory estimation must include both of the following (see the sketch
after this list):
- the bytes required to store the input data in GPU memory prior to the
  method being called
- the bytes required by the method during processing (because the method
  will be doing some allocations of GPU memory)
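
For illustration, a minimal sketch of how an estimator might combine the
two terms when computing a slice count (the function name and signature
here are hypothetical, not httomo's actual estimator API):

```python
def max_slices_fitting_in_gpu(
    available_bytes: int,
    input_bytes_per_slice: int,
    processing_bytes_per_slice: int,
) -> int:
    # Each slice costs its share of the input data already resident in
    # GPU memory, plus whatever the method allocates while processing it
    bytes_per_slice = input_bytes_per_slice + processing_bytes_per_slice
    return available_bytes // bytes_per_slice
```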

The custom memory hook tracks the memory allocations and deallocations
that occur while the GPU method is running, to determine the method's
peak memory usage. However, while it can see the allocations and
deallocations that the method performs, it can't see the GPU memory used
to store the method's input cupy array (because that allocation occurs
before the method is called, and thus before the hook is active).
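
The hook used in the test is along the lines of the following sketch of
a peak-tracking `cupy.cuda.MemoryHook` subclass (simplified; details may
differ from the actual `MaxMemoryHook` in the test suite):

```python
import cupy as cp

class MaxMemoryHook(cp.cuda.MemoryHook):
    """Tracks peak GPU memory usage while the hook is active."""

    def __init__(self, initial: int = 0):
        self.max_mem = initial
        self.current = initial

    def malloc_postprocess(self, device_id, size, mem_size, mem_ptr, pmem_id):
        # Runs after every allocation made while the hook is active
        self.current += mem_size
        self.max_mem = max(self.max_mem, self.current)

    def free_postprocess(self, device_id, mem_size, mem_ptr, pmem_id):
        # Runs after every deallocation; memory allocated before the hook
        # was entered (eg, the method's input array) is never seen here
        self.current -= mem_size
```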

The purpose of the "inline copy" of the cupy array being passed to the
GPU method (ie, doing `cp.copy(data)`) in the test is to enable the
memory hook to "see" the size of the input data, so that the custom
memory hook can include the input cupy array's size in the "total memory
used" by the method.

For a while, this strategy for including the input data size in the
memory hook's "total memory used" result seemed to be fine for our
needs in testing the GPU memory estimators. However, after some
investigation into the possible cause of #512, there appears to be a
subtlety with this approach that has unintended consequences for
tracking peak GPU memory usage.

Using a combination of cupy's `LineProfileHook` and print statements in
the custom memory hook's `malloc_postprocess()` and `free_postprocess()`
methods (a sketch of this instrumentation follows the list below), the
core of the issue appears to be the following:
- if `cp.copy(data)` is used to copy the input cupy array (the
  previously mentioned strategy for making the custom memory hook see
  the input data size)
- then the allocations/deallocations in the method are done differently,
  compared to when the method is given a cupy array that is not an
  "inline copy" (ie, providing `data` rather than `cp.copy(data)`)
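
For reference, the instrumentation was roughly along these lines (a
hypothetical sketch; the print statements are not part of the committed
code):

```python
import cupy as cp
from cupy.cuda import memory_hooks

class PrintingHook(cp.cuda.MemoryHook):
    # Print every allocation/deallocation, to see exactly when the
    # input cupy array's memory is freed
    def malloc_postprocess(self, device_id, size, mem_size, mem_ptr, pmem_id):
        print(f"alloc: {mem_size} bytes, ptr={hex(mem_ptr)}")

    def free_postprocess(self, device_id, mem_size, mem_ptr, pmem_id):
        print(f"free:  {mem_size} bytes, ptr={hex(mem_ptr)}")

# `data`, `kwargs`, and `paganin_filter_savu` as in the test being changed
with PrintingHook(), memory_hooks.LineProfileHook() as line_hook:
    paganin_filter_savu(cp.copy(data), **kwargs)
line_hook.print_report()  # attributes each allocation to a source line
```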

For this particular case of the Savu paganin filter method, what appears
to happen is the following:
- if `cp.copy(data)` is provided to the method, then the unpadded input
  cupy array to the method actually gets deallocated during the method's
  execution
- if just `data` is provided to the method, then the unpadded input cupy
  array is not deallocated during the method's execution

For reference, the "unpadded input cupy array" I'm referring to above
is the `data` variable (which is the "input data" parameter to the
method) in the Savu paganin filter function
[here](https://github.com/DiamondLightSource/httomolibgpu/blob/a77f3346460d26a9c490cbb0675df6ed2ea45266/httomolibgpu/prep/phase.py#L53).
So, the input cupy array, when given as an "inline copy" to the method
function, is somehow being deallocated. This is likely due to how python
reference counting works: the "inline copy" value was never assigned to
a binding/variable in the memory hook test that calls the method
function, so the reference count of the cupy array value could drop to
zero during execution of the method function. That would allow the cupy
array value to be garbage collected, and its underlying memory
deallocated, while the method function is still executing.
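
A minimal, self-contained sketch of this mechanism (the `method`
function here is hypothetical, standing in for the Savu paganin filter's
rebinding of its `data` parameter):

```python
import cupy as cp

def method(data):
    # Rebinding `data` drops this function's only reference to the
    # original (unpadded) array
    data = cp.pad(data, 8)
    # If the caller passed an inline temporary (`cp.copy(x)`), the
    # original array's reference count is now 0, so its GPU memory can be
    # freed here, mid-method. If the caller passed `x` itself, the
    # caller's binding keeps the array alive for the whole call.
    return float(data.sum())

x = cp.ones((8, 16, 16), dtype=cp.float32)
method(cp.copy(x))  # temporary copy may be deallocated mid-method
method(x)           # `x` stays allocated for the duration of the call
```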

At any rate, however it was occurring, this is what the printing in the
custom memory hook's methods showed: the input cupy array being
deallocated during the method function's execution. This is not what the
memory estimator assumes as far as allocations/deallocations go, and it
is not what would typically be expected when inspecting the code of the
method.

What this means is that, in the context of the memory hook test, the
method's allocations and deallocations do not replicate what occurs when
the method is used outside of memory hook tests. The memory estimator,
on the other hand, is trying to calculate the amount of memory needed by
the GPU method when running in the "real world" (ie, outside of a memory
hook test).

This disparity between:
- what the memory hook test is doing to make the custom memory hook
  account for the input data size
- how the method's memory estimator is written, and how the method is
  used in practice

appears to be the root of the failing Savu paganin filter memory hook
tests.

This change does not do an "inline copy" of the input cupy array (ie,
it drops the existing strategy for making the custom memory hook include
the input data size). Instead (see the condensed pattern after this
list):
- the input cupy array is not copied (so the custom memory hook sees
  allocations and deallocations performed during method execution only)
- the size of the input cupy array is added to the peak memory usage
  reported by the custom memory hook after the method has finished
  executing
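
Condensed from the diff below, the pattern in the test becomes:

```python
hook = MaxMemoryHook()
with hook:
    data_filtered = paganin_filter_savu(data, **kwargs).get()  # no inline copy
# The hook only saw the allocations the method itself performed, so the
# bytes holding the input array are added on top
max_mem = hook.max_mem + data.nbytes
```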

Note that this assumes that the input cupy array is never deallocated
during method execution. I believe that this is a valid
assumption due to the combination of the following two reasons:
1. how python's reference counting works
    - the input cupy array to the method function can never be
      deallocated during the method function, due to the caller having a
      binding assigned to the cupy array value
    - so the reference count of the cupy array value can never drop to 0
      during method execution
2. how httomo executes method functions
    - wrappers execute methods by passing a cupy array that is inside a
      block
    - so httomo never does anything like passing a "cupy array pointer"
      to a method function (method functions may do this internally, but
      not httomo itself), which would in theory enable the input cupy
      array's memory to be deallocated during method execution
yousefmoazzam committed Oct 25, 2024
1 parent 64069cb commit f89a7f4
11 changes: 6 additions & 5 deletions tests/test_backends/test_httomolibgpu.py
@@ -202,12 +202,11 @@ def test_paganin_filter_savu_memoryhook(slices, dim_x, dim_y, ensure_clean_memor
     kwargs["increment"] = 0.0
     hook = MaxMemoryHook()
     with hook:
-        data_filtered = paganin_filter_savu(cp.copy(data), **kwargs).get()
+        data_filtered = paganin_filter_savu(data, **kwargs).get()
 
-    # make sure estimator function is within range (80% min, 100% max)
-    max_mem = (
-        hook.max_mem
-    )  # the amount of memory in bytes needed for the method according to memoryhook
+    # The amount of bytes used by the method's processing according to the memory hook, plus
+    # the amount of bytes needed to hold the method's input in GPU memory
+    max_mem = hook.max_mem + data.nbytes
 
     # now we estimate how much of the total memory required for this data
     (estimated_memory_bytes, subtract_bytes) = _calc_memory_bytes_paganin_filter_savu(
@@ -218,6 +217,8 @@
     max_mem_mb = round(max_mem / (1024**2), 2)
 
     # now we compare both memory estimations
+    #
+    # make sure estimator function is within range (80% min, 100% max)
     difference_mb = abs(estimated_memory_mb - max_mem_mb)
     percents_relative_maxmem = round((difference_mb / max_mem_mb) * 100)
     # the estimated_memory_mb should be LARGER or EQUAL to max_mem_mb
