[CONV] fix naive conv kernel for large tensors #3434

Open · wants to merge 5 commits into base: develop
81 changes: 73 additions & 8 deletions in src/solver/conv/conv_direct_naive_conv.cpp

@@ -361,11 +361,27 @@ GetConv2DFWDSolution(const ExecutionContext& ctx, const ::miopen::conv::ProblemD
     size_t grid_size = 1;
     if(problem.IsLayoutDefault())
     {
-        grid_size = static_cast<size_t>(n) * k;
+        size_t all_workload = static_cast<size_t>(n) * k;
+        if(all_workload <= block_size)
+        {
+            grid_size = all_workload;
+        }
+        else
+        {
+            grid_size = (all_workload + block_size - 1) / block_size;
Contributor:

See #2748
It is an integer Ceil() function. It's just a reminder that the problem still exists.
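
For reference, `(all_workload + block_size - 1) / block_size` is the standard integer ceiling-division idiom. A minimal standalone illustration (the `CeilDiv` helper name is hypothetical, not MIOpen API):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper showing the integer Ceil() idiom from the diff:
// rounds a/b up to the nearest whole number without floating point.
constexpr std::size_t CeilDiv(std::size_t a, std::size_t b)
{
    return (a + b - 1) / b;
}

int main()
{
    assert(CeilDiv(256, 256) == 1); // exact multiple: no rounding
    assert(CeilDiv(257, 256) == 2); // one item over: rounds up
    return 0;
}
```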

+        }
     }
     else if(problem.IsLayoutNHWC())
     {
-        grid_size = static_cast<size_t>(group) * n * ho;
+        size_t all_workload = static_cast<size_t>(group) * n * ho;
+        if(all_workload <= block_size)
+        {
+            grid_size = all_workload;
+        }
+        else
+        {
+            grid_size = (all_workload + block_size - 1) / block_size;
+        }
Comment on lines +376 to +384
Contributor:

I don't feel like the problem is solved here; actually, I see a few more problems.
Now it divides the total workload by 256, so technically the overflow is just 256 times further away. Quite far, but still there.
And since it divides the total workload by 256, we get a GPU that is 256 times underloaded. That can be a huge performance drop for a wide range of legitimate tensor sizes, and even though this is a naive algorithm, we use it everywhere in the tests to compute reference data.
The last concern is the kernel itself: it should be aware of the fact that the number of workgroups can be capped, and it should contain an extra loop to handle it.

Contributor (author):

I will implement the kernel to handle the capped number of workgroups.

Contributor:

I'm not sure whether a new kernel should be implemented, whether the old one can be changed, or whether the old one already has this support and nothing needs changing; that should be checked first.

The underloaded-GPU problem should be fixed too.
Imagine all_workload is 256: we get a grid size of 256. When it is 257, the grid size suddenly becomes 2.
We have more work but fewer workers.
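
A minimal sketch of that cliff, reusing the selection logic from the patch with the same block_size of 256:

```cpp
#include <cstddef>
#include <cstdio>

int main()
{
    const std::size_t block_size = 256;
    for(std::size_t all_workload : {255, 256, 257, 1024})
    {
        // Same logic as the patch: one workgroup per work item while it
        // fits, otherwise divide the workload (rounding up).
        std::size_t grid_size = (all_workload <= block_size)
                                    ? all_workload
                                    : (all_workload + block_size - 1) / block_size;
        std::printf("all_workload=%zu -> grid_size=%zu\n", all_workload, grid_size);
    }
    // Prints 255 -> 255, 256 -> 256, 257 -> 2, 1024 -> 4: the grid size
    // collapses as soon as the workload exceeds block_size.
    return 0;
}
```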

Contributor (author), @bghimireamd, Dec 16, 2024:

Maybe we don't need to modify the kernel. We could loop over the same kernel, adjusting the chunk size and buffer offsets as needed. That would handle the uint32_t limitation in hipExtModuleLaunchKernel, which currently overflows when we pass a global work size such as gridX = (589824 * 256) * 256.
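
A rough host-side sketch of that chunking idea, assuming the kernel can take a base offset into its buffers (launch_chunk and kMaxChunk are hypothetical names; the real launch path is hipExtModuleLaunchKernel, whose global work sizes are uint32_t):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical stand-in for the real kernel launch.
static void launch_chunk(std::size_t base_offset, std::size_t work_items)
{
    std::printf("launch: offset=%zu items=%zu\n", base_offset, work_items);
}

void launch_in_chunks(std::size_t total_work)
{
    // Keep every launch safely inside the uint32_t global-size limit.
    constexpr std::size_t kMaxChunk = UINT32_MAX;

    for(std::size_t offset = 0; offset < total_work; offset += kMaxChunk)
    {
        const std::size_t chunk = std::min(kMaxChunk, total_work - offset);
        // A real kernel would fold `offset` into its buffer indexing, so
        // the same kernel binary covers every chunk.
        launch_chunk(offset, chunk);
    }
}
```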

Contributor:

I was looking to see how we handle this issue in other locations (since it seems like it would be a global constraint).

It looks like the batched_transpose solver also has a version of this issue (and it seems somewhat likely we have it throughout MIOpen).

For HIP, I think this is a general constraint on any kernel launch:

> What are the maximum limits of kernel launch parameters?
>
> Product of block.x, block.y, and block.z should be less than 1024. Please note, HIP does not support kernel launch with total work items defined in dimension with size gridDim x blockDim >= 2^32, so gridDim.x * blockDim.x, gridDim.y * blockDim.y and gridDim.z * blockDim.z are always less than 2^32.

I think we might need to come up with a general solution for this, and make sure it's implemented broadly.
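
A sketch of what a shared pre-launch guard could look like, following the quoted limits (CheckLaunchLimits is a hypothetical name, not an existing MIOpen or HIP function):

```cpp
#include <cstdint>

// Hypothetical guard encoding the HIP FAQ limits quoted above.
bool CheckLaunchLimits(uint64_t gridX, uint64_t gridY, uint64_t gridZ,
                       uint64_t blockX, uint64_t blockY, uint64_t blockZ)
{
    constexpr uint64_t kMaxGlobal = uint64_t{1} << 32;

    if(blockX * blockY * blockZ >= 1024) // FAQ: product "should be less than 1024"
        return false;

    // Each gridDim * blockDim product must stay below 2^32.
    return gridX * blockX < kMaxGlobal && gridY * blockY < kMaxGlobal &&
           gridZ * blockZ < kMaxGlobal;
}
```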

Contributor:

> I was looking to see how we handle this issue in other locations (since it seems like it would be a global constraint).

We don't, kind of. There are some places where the kernel is aware of the workgroup-count limit, and sometimes the number of workgroups is capped at some value like 4096. That's mostly it.

> I think we might need to come up with a general solution for this, and make sure it's implemented broadly.

I'm not sure that can be easily implemented. The main reason is that the number of workgroups depends heavily on the algorithm and, most importantly, on the kernel itself, and sometimes it even comes from heuristics.
Putting a hard limit in the library will not solve the problem, and it can even do harm: previously the runtime explicitly failed to launch the kernel, but now the grid would be silently capped, the kernel launched, and a wrong result produced, which is much harder to notice, especially when you run production code without any verification.
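
For context, the usual way to make a kernel tolerate a capped grid without silently dropping work is a grid-stride loop; a generic HIP sketch (an illustration only, not the actual naive conv kernel):

```cpp
#include <hip/hip_runtime.h>

// Generic grid-stride loop: every thread walks the whole problem in steps
// of the total launched thread count, so a capped grid still visits every
// element instead of silently skipping the tail.
__global__ void scale_kernel(float* data, size_t total, float alpha)
{
    const size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;
    for(size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
        i < total;
        i += stride)
    {
        data[i] *= alpha;
    }
}
```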

     }
     else
         MIOPEN_THROW("Unsupported layout");
@@ -507,13 +523,30 @@ GetConv3DFWDSolution(const ExecutionContext& ctx, const ::miopen::conv::ProblemD

     size_t block_size = 256;
     size_t grid_size = 1;
+
     if(problem.IsLayoutDefault())
     {
-        grid_size = static_cast<size_t>(n) * k;
+        size_t all_workload = static_cast<size_t>(n) * k;
+        if(all_workload <= block_size)
+        {
+            grid_size = all_workload;
+        }
+        else
+        {
+            grid_size = (all_workload + block_size - 1) / block_size;
+        }
     }
     else if(problem.IsLayoutNHWC())
     {
-        grid_size = static_cast<size_t>(group) * n * do_;
+        size_t all_workload = static_cast<size_t>(group) * n * do_;
+        if(all_workload <= block_size)
+        {
+            grid_size = all_workload;
+        }
+        else
+        {
+            grid_size = (all_workload + block_size - 1) / block_size;
+        }
     }
     else
         MIOPEN_THROW("Unsupported layout");
@@ -867,11 +900,27 @@ GetConv2DBWDSolution(const ExecutionContext& ctx, const ::miopen::conv::ProblemD
     size_t grid_size = 1;
     if(problem.IsLayoutDefault())
     {
-        grid_size = static_cast<size_t>(n) * c;
+        size_t all_workload = static_cast<size_t>(n) * c;
+        if(all_workload <= block_size)
+        {
+            grid_size = all_workload;
+        }
+        else
+        {
+            grid_size = (all_workload + block_size - 1) / block_size;
+        }
     }
     else if(problem.IsLayoutNHWC())
     {
-        grid_size = static_cast<size_t>(group) * n * hi;
+        size_t all_workload = static_cast<size_t>(group) * n * hi;
+        if(all_workload <= block_size)
+        {
+            grid_size = all_workload;
+        }
+        else
+        {
+            grid_size = (all_workload + block_size - 1) / block_size;
+        }
     }
     else
     {
@@ -1017,11 +1066,27 @@ GetConv3DBWDSolution(const ExecutionContext& ctx, const ::miopen::conv::ProblemD
     size_t grid_size = 1;
     if(problem.IsLayoutDefault())
     {
-        grid_size = static_cast<size_t>(n) * c;
+        size_t all_workload = static_cast<size_t>(n) * c;
+        if(all_workload <= block_size)
+        {
+            grid_size = all_workload;
+        }
+        else
+        {
+            grid_size = (all_workload + block_size - 1) / block_size;
+        }
     }
     else if(problem.IsLayoutNHWC())
     {
-        grid_size = static_cast<size_t>(group) * n * di;
+        size_t all_workload = static_cast<size_t>(group) * n * di;
+        if(all_workload <= block_size)
+        {
+            grid_size = all_workload;
+        }
+        else
+        {
+            grid_size = (all_workload + block_size - 1) / block_size;
+        }
     }
     else
     {