[QUESTION]: Need help with race condition #1317

Closed
elekta-cp-rnca opened this issue Jan 28, 2025 · 3 comments

Comments

@elekta-cp-rnca

Question

Soliciting help.

I'm trying to sort a 1D array on the GPU with an odd-even (brick) sort. A race condition shows up for arrays of roughly 100 elements or more, and it gets worse as the array grows (500, 1000, etc.). I assumed the trouble would start once the array is larger than the threads per block (768 on this GPU), but I see it for smaller arrays too. I noticed that adding Interop.WriteLine calls can either help or worsen the results. In some cases array values are lost and others are duplicated.

Is there a way to prevent this? __syncthreads(), atomic bools, etc.?

I'm trying to keep a minimal memory footprint on the device: only one array.

Program.txt

Environment

ILGPU version: 1.5.1
.NET 4.8
Windows 10 Enterprise LTSC (v. 10.0.17763 Build 17763)
NVIDIA RTX 6000 Ada

Additional context

namespace GPU_1D_Sort
{
internal class Program
{
    static void Main(string[] args)
    {

        //ILGPU.Context ctx = ILGPU.Context.Create(builder => builder.Cuda().Debug());

        //Console.WriteLine(Debugger.IsAttached.ToString());

        //Device gpuDevice = ctx.GetCudaDevice(0);
        //Accelerator gpu_acc = ctx.GetPreferredDevice(preferCPU: false).CreateAccelerator(ctx);

        //gpuDevice.PrintInformation();

        var sw = Stopwatch.StartNew();

        var ndim = 100;

        var e_matrix = new double[ndim];
        var f_matrix = new double[ndim];

        for (int i = 0; i < e_matrix.Length; i++)
            e_matrix[i] = Convert.ToDouble(i);                  

        var rand = new Random();
        for (int i=0; i<e_matrix.Length; i++)
        {
            int j = rand.Next(e_matrix.Length);
            double temp = e_matrix[i];
            e_matrix[i] = e_matrix[j];
            e_matrix[j] = temp;
        }
        for (int i = 0; i < f_matrix.Length; i++)
        {
            f_matrix[i] = e_matrix[i];
        }

        if (e_matrix.GroupBy(x => x).Any(g => g.Count() >1))
        {
            Console.WriteLine("Contains Duplicates!");
        }
        else
        {
            Console.WriteLine("Does not contain Duplicates!");
        }

        sw.Restart();

        var ctx_cu_1d = Context.Create(builder => builder.Cuda().EnableAlgorithms().Debug());

        var device_1d = ctx_cu_1d.GetCudaDevices()[0];

        var gpu_cu_1d_acc = device_1d.CreateCudaAccelerator(ctx_cu_1d);

        MemoryBuffer1D<double, Stride1D.Dense> gpu_e_matrix = gpu_cu_1d_acc.Allocate1D<double>(e_matrix.GetExtent());

        gpu_e_matrix.CopyFromCPU(e_matrix);

        Action<Index1D, ArrayView1D<double, Stride1D.Dense>> loadsort1dkernel =
           gpu_cu_1d_acc.LoadAutoGroupedStreamKernel<Index1D, ArrayView1D<double, Stride1D.Dense >> (Sort_1D_Kernel);

        loadsort1dkernel(gpu_e_matrix.Extent.ToIntIndex(), gpu_e_matrix);
        //loadsort1dkernel(1, gpu_e_matrix, gpu_isSort);

        //Console.WriteLine("I am here!");
        //Console.ReadLine();
        //Environment.Exit(0);

        gpu_cu_1d_acc.Synchronize();

        gpu_e_matrix.CopyToCPU(e_matrix);

        
       for (int i = 0; i < e_matrix.Length; i++)
       //for (int i = 0; i < 50; i++)
       {
            Console.WriteLine(String.Format("{0},{1},{2}", e_matrix[i], i, f_matrix[i]));
       }
        Console.WriteLine("End 1D Sort on GPU");

        gpu_e_matrix.Dispose();

        gpu_cu_1d_acc.Dispose();
        ctx_cu_1d.Dispose();

        sw.Stop();
        Console.WriteLine($"- Sort-1D Array GPU Atomic.Exchange Calculation Time: {sw.Elapsed}");



        Console.ReadKey();

        Environment.Exit(0);

    }

    public static void Sort_1D_Kernel(Index1D index, ArrayView1D<double, Stride1D.Dense> gpu_e_matrix)
    {
        var x = index.X;

        int col_num = gpu_e_matrix.IntExtent.X;

        bool sw_OE = false;

        //Interop.WriteLine("ThreadID: {0}", Grid.GlobalIndex.X);
        //Interop.WriteLine("Threads/block: {0}", ILGPU.Group.Dimension.X);
        //Interop.WriteLine("BlockID: {0}", Grid.Index.X);

        //while (gpu_e_matrix[x] > gpu_e_matrix[x + 1])
        for (int i = 0; i < col_num*2; i++)                 // array[N], N iterations guarantees solution worst case (supposedly!)
        {

            if (!sw_OE)             //Odd-phase
            {
                if ((x < col_num - 1) && (x % 2 != 0))
                {
                    //if(x == 1)
                    //Interop.WriteLine("Index OP is: {0}, {1}, {2}, {3}", x, x+1, gpu_e_matrix[x], gpu_e_matrix[x+1]);

                    if (gpu_e_matrix[x] > gpu_e_matrix[x + 1])
                    {
                        //if (x == 1)
                        //{
                            //Interop.WriteLine("Index OP Before is: {0}, {1}, {2}", 1, gpu_e_matrix[1], gpu_e_matrix[2]);
                        //}
                        
                        gpu_e_matrix[x + 1] = Atomic.Exchange(ref gpu_e_matrix[x], gpu_e_matrix[x + 1]);

                        //if (x == 1)
                        //{
                            //Interop.WriteLine("Index OP After is: {0}, {1}, {2}", 1, gpu_e_matrix[1], gpu_e_matrix[2]);
                        //}

                    }
                }
                sw_OE = true;

            }
            //Interop.WriteLine("gpu_isSort post odd is: {0}", gpu_isSort[0]);


            if (sw_OE)              //Even-phase
            {
                
                if ((x < col_num - 1) && (x % 2 == 0))
                {

                    //Interop.WriteLine("Index EP is: {0}", x);
                    if (gpu_e_matrix[x] > gpu_e_matrix[x + 1])
                    {
                        //if (x == 0)
                        //{
                            //Interop.WriteLine("Index EP Before is: {0}, {1}, {2}", 0, gpu_e_matrix[0], gpu_e_matrix[1]);
                        //}

                        gpu_e_matrix[x + 1] = Atomic.Exchange(ref gpu_e_matrix[x], gpu_e_matrix[x + 1]);

                        //if (x == 0)
                        //{
                            //Interop.WriteLine("Index EP After is: {0}, {1}, {2}", 0, gpu_e_matrix[0], gpu_e_matrix[1]);
                        //}

                    }
                }
                sw_OE = false;
            }

        }
    }


}

}
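
For comparison with the kernel above, here is a minimal sequential sketch of odd-even transposition sort that can serve as a host-side reference for checking GPU output. The class name `OddEvenReference` and the test harness are hypothetical, not part of the original program. The point it illustrates is that each pass must be fully finished before the next one starts; a sequential loop gets that ordering for free, while a GPU kernel has to enforce it between threads.

```csharp
using System;
using System.Linq;

static class OddEvenReference
{
    // Sequential odd-even transposition sort: n passes over n elements.
    // Even passes compare pairs (0,1), (2,3), ...; odd passes compare (1,2), (3,4), ...
    public static void Sort(double[] data)
    {
        int n = data.Length;
        for (int pass = 0; pass < n; pass++)
        {
            for (int left = pass % 2; left + 1 < n; left += 2)
            {
                if (data[left] > data[left + 1])
                {
                    // Plain swap of the two neighbours.
                    double tmp = data[left];
                    data[left] = data[left + 1];
                    data[left + 1] = tmp;
                }
            }
        }
    }

    static void Main()
    {
        // Hypothetical check: shuffle 0..999, sort, compare with the expected order.
        var rng = new Random(42);
        double[] data = Enumerable.Range(0, 1000)
            .Select(i => (double)i)
            .OrderBy(_ => rng.Next())
            .ToArray();

        Sort(data);

        bool ok = data.SequenceEqual(Enumerable.Range(0, 1000).Select(i => (double)i));
        Console.WriteLine("Sorted correctly: " + ok);
    }
}
```

Running the same shuffled input through this routine and through the GPU kernel, then comparing the two arrays element by element, makes lost or duplicated values easy to spot.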

@elekta-cp-rnca
Author

Group.Barrier for thread sync solved it for a single block (768 threads, so arrays up to 768 elements long). Above 768 elements we are into two or more blocks.

Is there a way to do Cooperative_Groups in ILGPU?

@elekta-cp-rnca
Author

It seems explicit kernels and strides are my friends ...
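
One way to read the explicit-kernel-plus-stride idea is sketched below, under the assumption that a single group is launched and each thread walks the pairs of the current phase with a group-sized stride. `Group.Barrier()` then separates the phases, and the array can be longer than the group. The class name, method names and host harness are hypothetical, and the sketch assumes a non-empty input. Because only one group runs, this trades throughput for simplicity.

```csharp
using System;
using System.Linq;
using ILGPU;
using ILGPU.Runtime;

static class SingleGroupOddEvenSort
{
    // Explicitly grouped kernel, intended to be launched with exactly ONE group.
    static void Kernel(ArrayView1D<double, Stride1D.Dense> data)
    {
        int n = data.IntExtent.X;
        for (int pass = 0; pass < n; pass++)
        {
            int phase = pass % 2;
            // Strided walk over the pairs of this phase; pairs never overlap within a phase.
            for (int left = 2 * Group.IdxX + phase; left + 1 < n; left += 2 * Group.DimX)
            {
                if (data[left] > data[left + 1])
                {
                    double tmp = data[left];
                    data[left] = data[left + 1];
                    data[left + 1] = tmp;
                }
            }
            // Every thread of the group reaches this barrier once per pass.
            Group.Barrier();
        }
    }

    public static double[] Sort(Accelerator accelerator, double[] input)
    {
        var kernel = accelerator.LoadStreamKernel<ArrayView1D<double, Stride1D.Dense>>(Kernel);
        using (var buffer = accelerator.Allocate1D<double>(input.Length))
        {
            buffer.CopyFromCPU(input);
            int pairs = Math.Max(1, input.Length / 2);
            int groupSize = Math.Min(accelerator.MaxNumThreadsPerGroup, pairs);
            // One group only, so Group.Barrier() covers every participating thread.
            kernel(new KernelConfig(new Index1D(1), new Index1D(groupSize)), buffer.View);
            accelerator.Synchronize();
            var result = new double[input.Length];
            buffer.CopyToCPU(result);
            return result;
        }
    }

    static void Main()
    {
        using (var context = Context.CreateDefault())
        using (var accelerator = context.GetPreferredDevice(preferCPU: false).CreateAccelerator(context))
        {
            var rng = new Random(42);
            double[] input = Enumerable.Range(0, 2000).Select(i => (double)i)
                .OrderBy(_ => rng.Next()).ToArray();
            double[] sorted = Sort(accelerator, input);
            bool ok = sorted.SequenceEqual(input.OrderBy(v => v));
            Console.WriteLine("Sorted correctly: " + ok);
        }
    }
}
```

The barrier sits outside the compare-and-swap branch on purpose: if only some threads of a group reached it, the behaviour would be undefined.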

@elekta-cp-rnca
Author

Does anyone have ideas on block sync? The code works until you use more than one block. Grid.Sync?
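
A common way to get synchronization across blocks without any in-kernel grid barrier is to launch one kernel per phase from the host. Launches on the same stream run in order, so every block finishes a phase before the next phase begins. The sketch below assumes that approach; the class name, method names and host harness are hypothetical and not taken from the original program. It still uses only the one device array, and it assumes a non-empty input.

```csharp
using System;
using System.Linq;
using ILGPU;
using ILGPU.Runtime;

static class PerPhaseOddEvenSort
{
    // One thread per pair; phase 0 handles (0,1), (2,3), ..., phase 1 handles (1,2), (3,4), ...
    static void PhaseKernel(Index1D pair, ArrayView1D<double, Stride1D.Dense> data, int phase)
    {
        int left = 2 * pair.X + phase;
        if (left + 1 < data.IntExtent.X && data[left] > data[left + 1])
        {
            // Pairs within one phase never overlap, so a plain swap is enough.
            double tmp = data[left];
            data[left] = data[left + 1];
            data[left + 1] = tmp;
        }
    }

    public static double[] Sort(Accelerator accelerator, double[] input)
    {
        var kernel = accelerator.LoadAutoGroupedStreamKernel<
            Index1D, ArrayView1D<double, Stride1D.Dense>, int>(PhaseKernel);

        int n = input.Length;
        int pairs = Math.Max(1, n / 2);
        using (var buffer = accelerator.Allocate1D<double>(n))
        {
            buffer.CopyFromCPU(input);
            // n passes; each launch acts as the barrier between phases.
            for (int pass = 0; pass < n; pass++)
            {
                kernel(pairs, buffer.View, pass % 2);
            }
            accelerator.Synchronize();
            var result = new double[n];
            buffer.CopyToCPU(result);
            return result;
        }
    }

    static void Main()
    {
        using (var context = Context.CreateDefault())
        using (var accelerator = context.GetPreferredDevice(preferCPU: false).CreateAccelerator(context))
        {
            var rng = new Random(42);
            double[] input = Enumerable.Range(0, 5000).Select(i => (double)i)
                .OrderBy(_ => rng.Next()).ToArray();
            double[] sorted = Sort(accelerator, input);
            bool ok = sorted.SequenceEqual(input.OrderBy(v => v));
            Console.WriteLine("Sorted correctly: " + ok);
        }
    }
}
```

The cost is one launch per pass, so launch overhead grows with the array length. In exchange, the kernel needs no atomics and no barrier, and it works for any number of blocks.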
