Complex64 not working #23
Comments
And the only function that really calls in https://github.com/clMathLibraries/clBLAS/blob/8b5f7a0e6f800b9597319d71f70fbf67e410b004/src/library/blas/generic/common.c#L311 and so my idea is that it is probably not related to alignment, but to something funky going on with the queues (or that, because of misalignment, we are overwriting queue information). |
@dfdx When building with debug information I got it to happen with |
Currently all high-level functions in CLBLAS add a task to a queue and return the corresponding event. However, in tests we immediately call BTW and coming a little bit back, I think
So any type larger than 8 bytes is aligned to 16 bytes. |
@dfdx Here is a Travis build that builds clBLAS from scratch with debug information enabled, and it clearly tells us where the segfault comes from: https://travis-ci.org/JuliaGPU/CLBLAS.jl/jobs/103249068#L1672 That's on the branch https://github.com/JuliaGPU/CLBLAS.jl/tree/vc/complex32 |
Maybe we are encountering something similar to clMathLibraries/clBLAS#187 |
@dfdx I won't have time this week to look into this, feel free to continue the bug hunt or ping me in a week or so. |
Got it. I think I'll have some spare time later this week to debug it. |
For convenience of debugging, here's a short test calling
Which gives |
If the event list is empty you need to pass in C_NULL and not a pointer to --edit: |
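For context, the rule mentioned here matches how the OpenCL enqueue functions treat their wait list: an empty list is passed as a count of 0 together with a NULL pointer, and a mismatch between the two is rejected with CL_INVALID_EVENT_WAIT_LIST. A minimal C sketch of that check follows. It assumes nothing from the OpenCL headers, and `wait_list_is_valid` and `wait_list_examples` are hypothetical names used only for illustration, not part of OpenCL or clBLAS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenCL or clBLAS function): mirrors the
   argument rule that enqueue-style calls apply to their wait list. */
bool wait_list_is_valid(unsigned num_events, const void *const *wait_list)
{
    if (num_events == 0)
        return wait_list == NULL;   /* an empty list must be passed as NULL */
    return wait_list != NULL;       /* a non-empty list needs a real pointer */
}

/* Small usage example covering the four combinations. */
void wait_list_examples(void)
{
    const void *const events[1] = { NULL };   /* placeholder storage only */

    assert(wait_list_is_valid(0, NULL));      /* what C_NULL maps to in a ccall */
    assert(!wait_list_is_valid(0, events));   /* pointer given, but count is 0 */
    assert(wait_list_is_valid(1, events));
    assert(!wait_list_is_valid(1, NULL));
}
```

On the Julia side, `C_NULL` is the value that arrives in C as a NULL pointer, which is why it is the right argument when the count is 0.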
so running the script under lldb
The first thing I noticed is that you have an error in your ccall: the second argument should be
But even after that:
|
I'm not really familiar with |
Yes, and alpha e.g. DX is not correctly passed through. So I am back
|
I just created a small C example to see if I could figure out the correct way of working with Complex64 and cl_float2. If I compile the library with gcc and the test program with clang, I get the same problem at the C level. Right now the only way of solving this for users is to recommend that they compile clBLAS with clang/clang++. |
Sounds unpleasant, but reasonable. I'll add a corresponding note to the README. |
@dfdx Maybe just maybe we could use https://strpackjl.readthedocs.org/en/latest/ |
|
You need to use the current master. Then there should be no warnings.
|
Seems like data packed using
gives:
Obviously, |
What happens if you do
|
Actually, it's not about mutability - packed data is However, I found out that using
If we change But what is not intuitive is that changing definition to:
leads to a segmentation fault:
and the only difference between our
|
Ok, it's not even about type parameters, but about inheritance. This works fine:
but the following modification makes it break with the same segfault again:
And, by the way, none of these options work with |
Hey, I think I may have found a solution to this problem that could also apply to Linux; it works for me on Windows 7 64-bit. Initially, the zGEMM function on Windows 7 x64 was throwing segmentation faults. However, I found that the ccall to clblasZgemm() stops throwing segmentation faults if I change the argument type of alpha and beta to Ref{cl_double2} or Ptr{cl_double2}. Accordingly, alpha and beta become 1-element cl_double2 arrays. Here is the function I used in my code while working on my project:
My suspicion is that the libclBLAS library is treating the alpha and beta arguments as pointers like the event variable. Can you guys check if this fixes the segmentation faults on Linux or OSX? |
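To make the distinction concrete: in a Julia `ccall` signature a bare type corresponds to a by-value C parameter, while `Ref{T}` or `Ptr{T}` corresponds to a `T*` parameter. The C sketch below shows the two shapes side by side. The names `cplx`, `re_by_value` and `re_by_pointer` are hypothetical and are not clBLAS symbols.

```c
#include <assert.h>

/* Hypothetical stand-in for a double-precision complex scalar. */
typedef struct { double re, im; } cplx;

/* By-value parameter: declared as the bare type in a Julia ccall. */
double re_by_value(cplx alpha)
{
    return alpha.re;
}

/* Pointer parameter: declared as Ref{T} or Ptr{T} in a Julia ccall. */
double re_by_pointer(const cplx *alpha)
{
    return alpha->re;
}
```

The mangled symbol in the backtrace further down this thread suggests that clBLAS takes the scalar by value rather than through a pointer. One possible explanation for the Windows result, offered as an assumption rather than a diagnosis, is that the 64-bit Windows calling convention passes a 16-byte struct as a hidden pointer to a copy, so a `Ref` argument can line up with what the callee reads. The System V x86-64 convention used on Linux and OS X can pass the same struct in registers, so the trick need not carry over.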
I won't have access to GPU-enabled laptop till the end of this week, but I think you can test your code on Travis. The easiest way to go should be to:
Note, that you may need to setup your own Travis account and add Mac OS X to |
I've just checked it on @mikhail-j: could you please provide full code you used for testing? Just to be on the same page. |
The following code is copied from test_zgemm.jl. I have aliased clblasDoubleComplex to Complex{Float64} in clblas_typedef.jl, as CLBLAS.jl does too.
const libclblas = Libdl.find_library(["clBLAS","libclBLAS"],["C:\\AMD\\clBLA-2.10.0\\bin","C:\\AMD\\acml6.1.0.33\\ifort64\\lib\\"])
#const libopencl = Libdl.find_library(["libOpenCL","OpenCL"],["."])
const libopencl = Libdl.find_library(["OpenCL64","OpenCL"],["C:\\Program Files\\NVIDIA Corporation\\OpenCL\\","C:\\Program Files (x86)\\AMD APP SDK\\2.9-1\\bin\\x86_64"])
if (isempty(libclblas))
    print("clBLAS can't be found!")
end
include("cl_typedef.jl")
include("clblas_typedef.jl")
include("cl_functions.jl")
include("clblas_functions.jl")
#ccall((:function, "library"), return_type, (argtype,), arg)
function clblasZgemm(o,tA,tB,M,N,K,alpha,A,offA,lda,B,offB,ldb,beta,C,offC,ldc,ncq,cq,ne,wle,e)
    return ccall((:clblasZgemm, libclblas), cl_int, (clblasOrder,
            clblasTranspose,
            clblasTranspose,
            Csize_t,
            Csize_t,
            Csize_t,
            Ref{clblasDoubleComplex},#treating this as a pointer fixed a segmentation fault
            #clblasDoubleComplex,
            cl_mem,
            Csize_t,
            Csize_t,
            cl_mem,
            Csize_t,
            Csize_t,
            Ref{clblasDoubleComplex},#treating this as a pointer fixed a segmentation fault
            #clblasDoubleComplex,
            #Base.cconvert(Ptr{Void}, Ref{cl_mem}),
            #Ref{cl_mem},
            cl_mem,
            Csize_t,
            Csize_t,
            cl_uint,
            Ref{cl_command_queue},
            cl_uint,
            #Ref{cl_event},
            #AMD's OpenCL driver (Windows 7 x64) throws invalid event if argument type is Ref{cl_event}
            Ptr{cl_event},
            Ptr{cl_event}),
            #Ptr{cl_event_info},
            #Ptr{cl_event_info}),
        o,tA,tB,M,N,K,alpha,A,offA,lda,B,offB,ldb,beta,C,offC,ldc,ncq,cq,ne,wle,e)
end
function main()
    local props = vec(convert(Array{cl_context_properties, 2}, [CL_CONTEXT_PLATFORM 0 0]))
    devs = Array(cl_device_id, 1)
    devs[1] = clGetFirstGPU()
    local platform = clGetGPUPlatform(devs[1])
    println(string("Selected GPU: ",clGetDeviceVendor(devs[1])), " ", clGetDeviceName(devs[1]))
    props[2] = Base.cconvert(cl_context_properties,platform)
    err = Array(cl_int, 1)
    local ctx = clCreateContext(props,1,devs[1],C_NULL,C_NULL,err)
    statusCheck(err[1])
    err = Array(cl_int, 1)
    local queue = Array(cl_command_queue, 1)
    queue[1] = clCreateCommandQueue(ctx, devs[1], cl_command_queue_properties(0), err)
    statusCheck(err[1])
    ################################ create arrays
    A = convert(Array{clblasDoubleComplex,2}, [[11, 12, 13, 14, 15]';[21, 22, 23, 24, 25]';[31, 32, 33, 34, 35]';[41, 42, 43, 44, 45]'])
    B = convert(Array{clblasDoubleComplex,2}, [[11, 12, 13]';[21, 22, 23]';[31, 32, 33]';[41, 42, 43]';[51, 52, 53]'])
    C = convert(Array{clblasDoubleComplex,2}, [[11, 12, 13]';[21, 22, 23]';[31, 32, 33]';[41, 42, 43]'])
    ##A = convert(Array{cl_double2,2}, convert(Array{clblasDoubleComplex,2}, [[11, 12, 13, 14, 15]';[21, 22, 23, 24, 25]';[31, 32, 33, 34, 35]';[41, 42, 43, 44, 45]']))
    ##B = convert(Array{cl_double2,2}, convert(Array{clblasDoubleComplex,2}, [[11, 12, 13]';[21, 22, 23]';[31, 32, 33]';[41, 42, 43]';[51, 52, 53]']))
    ##C = convert(Array{cl_double2,2}, convert(Array{clblasDoubleComplex,2}, [[11, 12, 13]';[21, 22, 23]';[31, 32, 33]';[41, 42, 43]']))
    A1 = vec(A)
    B1 = vec(B)
    C1 = vec(C)
    M = Csize_t(length(A[:,1]))
    K = Csize_t(length(B[:,1]))
    N = Csize_t(length(B[1,:]))
    order = clblasColumnMajor ##julia uses column major
    alpha = Array(clblasDoubleComplex, 1)
    alpha[1] = convert(clblasDoubleComplex, 10)
    #println(string("alpha: ",alpha))
    beta = Array(clblasDoubleComplex, 1)
    beta[1] = convert(clblasDoubleComplex, 20)
    #println(string("beta: ",beta))
    transA = clblasNoTrans;
    transB = clblasNoTrans;
    off = convert(Csize_t, 0)
    offA = convert(Csize_t, 0)
    offB = convert(Csize_t, 0)
    offC = convert(Csize_t, 0)
    #Now initialize OpenCLBLAS and buffers
    statusCheck(clblasSetup())
    statusCheck(clFlush(queue[1]))
    err = Array(cl_int, 1)
    bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY, M * K * sizeof(clblasDoubleComplex), C_NULL, err)
    statusCheck(err[1])
    err = Array(cl_int, 1)
    bufB = clCreateBuffer(ctx, CL_MEM_READ_ONLY, K * N * sizeof(clblasDoubleComplex), C_NULL, err)
    statusCheck(err[1])
    err = Array(cl_int, 1)
    bufC = clCreateBuffer(ctx, CL_MEM_READ_WRITE, M * N * sizeof(clblasDoubleComplex), C_NULL, err)
    statusCheck(err[1])
    statusCheck(clFlush(queue[1]))
    event = Array(cl_event, 1)
    event[1] = C_NULL
    statusCheck(clEnqueueWriteBuffer(queue[1], bufA, CL_TRUE, Csize_t(0), M * K * sizeof(clblasDoubleComplex), A1, cl_uint(0), C_NULL, event))
    statusCheck(clWaitForEvents(1,event))
    statusCheck(clReleaseEvent(event[1])) #free the memory
    event[1] = C_NULL
    statusCheck(clEnqueueWriteBuffer(queue[1], bufB, CL_TRUE, Csize_t(0), K * N * sizeof(clblasDoubleComplex), B1, cl_uint(0), C_NULL, event))
    statusCheck(clWaitForEvents(1,event))
    statusCheck(clReleaseEvent(event[1])) #free the memory
    event[1] = C_NULL
    statusCheck(clEnqueueWriteBuffer(queue[1], bufC, CL_TRUE, Csize_t(0), M * N * sizeof(clblasDoubleComplex), C1, cl_uint(0), C_NULL, event))
    statusCheck(clWaitForEvents(1,event))
    statusCheck(clReleaseEvent(event[1])) #free the memory
    #=================Check respective buffer sizes in GPU
    ref_count = Array(Csize_t, 1)
    statusCheck(clGetMemObjectInfo(bufA, CL_MEM_SIZE, Csize_t(sizeof(ref_count)), ref_count, C_NULL))
    println(string("bufA memory object size: ", Int32(ref_count[1])))
    ref_count = 0
    ref_count = Array(Csize_t, 1)
    statusCheck(clGetMemObjectInfo(bufB, CL_MEM_SIZE, Csize_t(sizeof(ref_count)), ref_count, C_NULL))
    println(string("bufB memory object size: ", Int32(ref_count[1])))
    ref_count = 0
    ref_count = Array(Csize_t, 1)
    statusCheck(clGetMemObjectInfo(bufC, CL_MEM_SIZE, Csize_t(sizeof(ref_count)), ref_count, C_NULL))
    println(string("bufC memory object size: ", Int32(ref_count[1])))
    ref_count = 0
    =====#
    event[1] = C_NULL
    #=
    statusCheck(clblasSgemm(clblasRowMajor, clblasNoTrans, clblasNoTrans, M, N, K,
        alpha, bufA, 0, K,
        bufB, 0, N, beta,
        bufC, 0, N,
        1, queue, 0, C_NULL, event))
    =#
    statusCheck(clblasZgemm(clblasColumnMajor, clblasNoTrans, clblasNoTrans, M, N, K,
        alpha, bufA, 0, M,
        bufB, 0, K, beta,
        bufC, 0, M,
        1, queue, 0, C_NULL, event))
    statusCheck(clFlush(queue[1]))
    statusCheck(clWaitForEvents(1,event))
    statusCheck(clReleaseEvent(event[1])) #free the memory
    C2=Array(clblasDoubleComplex,length(C1))
    event[1] = C_NULL
    statusCheck(clEnqueueReadBuffer(queue[1], bufC, CL_TRUE, Csize_t(0), length(C1)*sizeof(clblasDoubleComplex), C2, cl_uint(0), C_NULL, event))
    statusCheck(clWaitForEvents(1,event))
    statusCheck(clReleaseEvent(event[1])) #free the memory
    statusCheck(clFlush(queue[1]))
    statusCheck(clReleaseMemObject(bufC))
    statusCheck(clFlush(queue[1]))
    statusCheck(clReleaseMemObject(bufB))
    statusCheck(clFlush(queue[1]))
    statusCheck(clReleaseMemObject(bufA))
    statusCheck(clFlush(queue[1]))
    #statusCheck(clGetMemObjectInfo(bufA, CL_MEM_REFERENCE_COUNT, Csize_t(sizeof(ref_count)), ref_count, C_NULL))
    #bufA = C_NULL
    #bufB = C_NULL
    #bufC = C_NULL
    clblasTeardown()
    statusCheck(clFlush(queue[1]))
    statusCheck(clReleaseCommandQueue(queue[1]))
    statusCheck(clReleaseContext(ctx))
    bufC = C_NULL
    bufB = C_NULL
    bufA = C_NULL
    queue[1] = C_NULL
    event[1] = C_NULL
    ctx = C_NULL
    devs[1] = C_NULL
    Base.gc() ##not sure if julia has been garbage collecting, now is a good time though
    return reshape(C2, Int(M), Int(N))
end
if (!isempty(libclblas) && !isempty(libopencl))
    main()
end
|
I'm afraid this doesn't fix the error for me (Ubuntu 15.10, NVIDIA GeForce GT 630M):
Yet I'm curious what was your idea when you tried to pass pointer to |
I came across the possible solution when I started writing these wrapper ccall functions myself. I found that some functions threw a segmentation fault if I passed a normal variable rather than a pointer. So, I tweaked my clbla @dfdx, I noticed that you had changed the line numbers in the code when the error occurs on line 166. If the message is CL_INVALID_COMMAND_QUEUE, could you change the Ref{cl_command_queue} in the wrapper to Ptr{cl_command_queue}? Or do a git pull for the revised version (and then add your path to the libraries)? |
@mikhail-j: I only changed code for finding libraries, the rest of the code is the same. I'm using another laptop right now, so will check your suggestion in the evening (~10 hours from now). |
@mikhail-j: nope, changing Just for reference, on what CPU/GPU do you test it? |
I've tested my code on Windows 7 x64 with an NVIDIA GTX 780 Ti GPU (CUDA 7.5) and an AMD R9 390 GPU (Crimson 14.2 hotfix). As for the CPU, I used an Intel Core i7-3930K. |
@mikhail-j May I ask which compiler you are using for CLBLAS? I found that different compilers have different alignments and as such influence which call works and which doesn't. |
@vchuravy I used MinGW-w64 on Windows 7 x64. However, I recently tested the cGEMM and zGEMM functions on SUSE SLES 11 SP3 Linux (customized kernel version 3.18.36). At first, libclBLAS.so refused to load because my glibc version was too old (I had 2.11.3). After updating glibc to 2.23, libclBLAS.so finally loaded into julia (I compiled julia v0.4.6 with gcc 4.8.5 x86_64). I found that Complex{Float64} worked properly without Ptr{T}/Ref{T}. When I tested the Complex{Float32} function, it threw a segmentation fault, as you noted earlier. This was tested on an NVIDIA GTX 780 Ti GPU:
julia> include("test_cgemm.jl")
Selected GPU: NVIDIA Corporation GeForce GTX 780 Ti
signal (11): Segmentation fault
_Z10clblasGemmI9cl_float2E13clblasStatus_12clblasOrder_16clblasTranspose_S3_mmmT_P7_cl_memmmS6_mmS4_S6_mmjPP17_cl_command_queuejPKP9_cl_eventPSB_ at ../clBLAS-2.10.0-Hawaii-Linux-x64-CL2.0/lib64/libclBLAS.so (unknown line)
clblasCgemm at ~/OpenCLBLAS.jl/src/test_cgemm.jl:38
main at ~/OpenCLBLAS.jl/src/test_cgemm.jl:173
jlcall_main_21183 at (unknown line)
jl_apply_generic at ~/julia/0.4.5/usr/bin/../lib/libjulia.so (unknown line)
unknown function (ip: 0x7fe4a04ec0f3)
unknown function (ip: 0x7fe4a04eb527)
unknown function (ip: 0x7fe4a04ec988)
unknown function (ip: 0x7fe4a04ea84d)
unknown function (ip: 0x7fe4a050094f)
unknown function (ip: 0x7fe4a05011c9)
jl_load at ~/julia/0.4.5/usr/bin/../lib/libjulia.so (unknown line)
include at ./boot.jl:261
jl_apply_generic at ~/julia/0.4.5/usr/bin/../lib/libjulia.so (unknown line)
include_from_node1 at ./loading.jl:320
jl_apply_generic at ~/julia/0.4.5/usr/bin/../lib/libjulia.so (unknown line)
unknown function (ip: 0x7fe4a04ec0f3)
unknown function (ip: 0x7fe4a04eb527)
unknown function (ip: 0x7fe4a05004d8)
jl_toplevel_eval_in at ~/julia/0.4.5/usr/bin/../lib/libjulia.so (unknown line)
eval_user_input at REPL.jl:62
jlcall_eval_user_input_21160 at (unknown line)
jl_apply_generic at ~/julia/0.4.5/usr/bin/../lib/libjulia.so (unknown line)
anonymous at REPL.jl:92
unknown function (ip: 0x7fe4a04f252c)
unknown function (ip: (nil))
Segmentation fault
I wonder if a fresh compilation of libclBLAS.so would generate better behavior with complex GEMM. |
As discussed in #21, Complex64 is currently not working: we get a segfault when passing it to clBLAS, whereas Complex128 works without issue. Complex64 maps to cl_float2 and Complex128 maps to cl_double2.
The definitions of both types in cl_platform.h are:
The only difference I can see is that cl_float2 uses 8-byte alignment and cl_double2 uses 16-byte alignment, and if I remember correctly Julia uses 16-byte alignment for nearly everything.
Complex is defined here:
https://github.com/JuliaLang/julia/blob/02aeb44299d090d50d2c58e004f58a8b8d4f3da6/base/complex.jl
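The alignment difference described above can be checked at the C level without the OpenCL headers. The following is a minimal sketch, assuming GCC or Clang attribute syntax and a C11 compiler. `float2_like`, `double2_like` and `plain_complex32` are hypothetical stand-ins that model only the alignment attribute, not the full `cl_platform.h` definitions.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Hypothetical stand-ins: the real cl_float2 / cl_double2 unions have
   more members. Only the explicit alignment is modeled here. */
typedef union {
    float s[2] __attribute__((aligned(8)));
    struct { float x, y; };
} float2_like;

typedef union {
    double s[2] __attribute__((aligned(16)));
    struct { double x, y; };
} double2_like;

/* Plain pair of floats: two consecutive 4-byte fields and no attribute. */
typedef struct { float re, im; } plain_complex32;

/* Same size as the plain pair, but a stricter alignment requirement. */
static_assert(sizeof(float2_like) == sizeof(plain_complex32), "same size");
static_assert(alignof(float2_like) == 8, "8-byte aligned");
static_assert(alignof(double2_like) == 16, "16-byte aligned");
static_assert(alignof(plain_complex32) == alignof(float), "field alignment only");
```

The point of the comparison is that a plain two-float struct and the attributed union occupy the same number of bytes but do not promise the same alignment. That is the kind of mismatch a foreign caller has to reproduce if the callee relies on it.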