Account for input data size in memory hook test without inline cupy array copy

The change is small, so it may not be obvious why it's needed or what it's doing; here are the details I think are relevant.

When the GPU memory requirements of a method are being calculated by a memory estimator, the estimation needs to account for the method's input data already being in GPU memory. In other words, when calculating the maximum number of slices that can fit into GPU memory, the GPU memory estimation must include both:

- the bytes required to store the input data in GPU memory prior to the method being called
- the bytes required by the method during processing (because the method will be doing some allocations of GPU memory)

The custom memory hook tracks the memory allocations and deallocations that occur while the GPU method is running, to see what the peak memory usage of the method is. However, while it can see the allocations and deallocations that the method performs, it can't see the GPU memory used to store the method's input cupy array (because that allocation occurs before the method is called). The purpose of the "inline copy" of the cupy array being passed to the GPU method (ie, doing `cp.copy(data)`) in the test is to enable the memory hook to "see" the size of the input data, so that the custom memory hook can include the input cupy array's size in the "total memory used" by the method.

For a while, this strategy for including the input data size in the memory hook's "total memory used" result seemed to be fine for our needs in testing the GPU memory estimators. However, after some investigation into the possible cause of #512, there appears to be a subtlety with this approach that has unintended consequences for tracking peak GPU memory usage.
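The two-part accounting above can be sketched as simple arithmetic (a hypothetical estimator; the function name and byte counts are illustrative, not httomo's actual API):

```python
def estimate_max_slices(available_gpu_bytes: int,
                        input_bytes_per_slice: int,
                        processing_bytes_per_slice: int) -> int:
    """Max slices that fit in GPU memory, counting BOTH the resident
    input data and the method's own allocations during processing."""
    per_slice = input_bytes_per_slice + processing_bytes_per_slice
    return available_gpu_bytes // per_slice

# eg, a float32 slice of 2048x2048 pixels resident as input, plus
# roughly 3x that in temporaries while the method runs (illustrative)
slice_bytes = 2048 * 2048 * 4
max_slices = estimate_max_slices(8 * 1024**3, slice_bytes, 3 * slice_bytes)
print(max_slices)  # → 128
```

Omitting the `input_bytes_per_slice` term would overestimate how many slices fit, which is exactly why the test needs the hook's result to reflect the input array's size as well.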
Using a combination of cupy's `LineProfileHook` and print statements in the custom memory hook's `malloc_postprocess()` and `free_postprocess()` methods, the core of the issue appears to be the following:

- if `cp.copy(data)` is used to copy the input cupy array (the previously mentioned strategy to make the custom memory hook see the input data size), the allocations/deallocations in the method are done differently, compared to when the method is given a cupy array that is not an "inline copy" (ie, providing `data` rather than `cp.copy(data)`)

For this particular case of the Savu paganin filter method, what appears to happen is the following:

- if `cp.copy(data)` is provided to the method, the unpadded input cupy array actually gets deallocated during the method's execution
- if just `data` is provided to the method, the unpadded input cupy array is not deallocated during the method's execution

For reference, the "unpadded input cupy array" I'm referring to above is the `data` variable (the "input data" parameter to the method) in the Savu paganin filter function [here](https://github.com/DiamondLightSource/httomolibgpu/blob/a77f3346460d26a9c490cbb0675df6ed2ea45266/httomolibgpu/prep/phase.py#L53).

So the input cupy array, when given as an "inline copy" to the method function, is somehow being deallocated. This is likely due to how python reference counting works: the "inline copy" value was not assigned to a binding/variable in the memory hook test that was calling the method function, meaning that the reference count of the cupy array value could in theory drop to zero during execution of the method function. That would allow the cupy array value to be garbage collected, and its underlying memory deallocated, while the method function is still executing.
At any rate, however it was occurring, this is what the printing in the custom memory hook's methods showed: the input cupy array being deallocated during the method function's execution. This is not what the memory estimator assumes as far as allocations/deallocations, and is not what would typically be expected when inspecting the code of the method. What this means is that, in the context of the memory hook test, the method's allocations and deallocations do not replicate what occurs when the method is used outside of memory hook tests. The memory estimator, on the other hand, is trying to calculate the amount of memory needed by the GPU method when running in the "real world" (ie, outside of a memory hook test). This disparity between:

- what the memory hook test is doing to make the custom memory hook account for the input data size
- how the method's memory estimator is written, and how the method is used in practice

appears to be the root of the failing Savu paganin filter memory hook tests.

This change does not do an "inline copy" of the input cupy array (as per the existing strategy to make the custom memory hook include the input data size). Instead:

- the input cupy array is not copied (so the custom memory hook sees only the allocations and deallocations performed during method execution)
- the size of the input cupy array is added to the peak memory usage reported by the custom memory hook after the method has finished executing

Note that this assumes the input cupy array is never deallocated during the method execution. I believe this is a valid assumption due to the combination of the following two reasons:

1. how python's reference counting works - the input cupy array can never be deallocated during the method function, because the caller holds a binding to the cupy array value, so its reference count can never drop to 0 during method execution
2. how httomo executes method functions - wrappers execute methods by passing a cupy array that is inside a block, so httomo never does anything like passing a "cupy array pointer" to a method function (method functions may do this internally, but not httomo itself), which would in theory enable the input cupy array's memory to be deallocated during method execution
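The new accounting can be sketched with a minimal stand-in for the custom memory hook (pure Python; `PeakTracker` and the byte counts are hypothetical, and in the real test cupy `MemoryHook` callbacks such as `malloc_postprocess()`/`free_postprocess()` would drive the tracking):

```python
class PeakTracker:
    """Tracks peak usage from malloc/free events, standing in for what
    the custom memory hook sees during method execution only."""

    def __init__(self):
        self.current = 0
        self.peak = 0

    def malloc(self, nbytes: int):
        self.current += nbytes
        self.peak = max(self.peak, self.current)

    def free(self, nbytes: int):
        self.current -= nbytes


hook = PeakTracker()
# Allocations the hook CAN see: temporaries made inside the method.
hook.malloc(400)  # eg, padded copy of the input (illustrative sizes)
hook.malloc(400)  # eg, filter workspace
hook.free(400)
hook.free(400)

# The input array was allocated BEFORE the method ran, so the hook
# never saw it; add its size to the reported peak after the fact.
input_nbytes = 1000
total_peak = hook.peak + input_nbytes  # 800 + 1000 = 1800
```

Because the input array is assumed to stay resident for the whole call (per the two reasons above), simply adding `input_nbytes` to the hook's peak gives the same total the estimator is modelling, without perturbing the method's allocation pattern the way the inline copy did.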