run dynolog Segmentation fault #168
Comments
CPU profiling was successful.
@zhuzhenxxx I looked at the error you were seeing: -33.
This means that some feature has not been loaded or that certain field groups are not supported, which might be due to the container environment. As a workaround, you can pass a single field, "--dcgm_fields 100" (SM Clock), on the dynolog command line.
Thank you very much. I tried the method you suggested, and with --dcgm_fields 100 there is no longer a segfault. But there is a new problem: I get a return value of -6 from dcgm, and it reports "Feature not supported". What does that mean?
I also got -6 :(.
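For readers following along, here is a minimal sketch of a lookup for the two return codes mentioned in this thread. The meanings are the ones given in the comments above; the symbolic names are an assumption (recalled from DCGM's dcgm_structs.h, not confirmed anywhere in this issue), so please check them against the header installed on your machine. The helper name describe_dcgm_code is hypothetical.

```python
# Return codes discussed in this thread, with the meanings given by the commenters.
# ASSUMPTION: the symbolic names below are recalled from DCGM's dcgm_structs.h
# and are not confirmed in this issue; verify them against your installed header.
DCGM_RETURN_CODES = {
    -6: ("DCGM_ST_NOT_SUPPORTED", "Feature not supported"),
    -33: ("DCGM_ST_MODULE_NOT_LOADED", "A required module or feature has not been loaded"),
}


def describe_dcgm_code(code: int) -> str:
    """Return a one-line description for a DCGM return code seen in the logs."""
    name, meaning = DCGM_RETURN_CODES.get(
        code, ("UNKNOWN", "not covered in this thread")
    )
    return f"{code}: {name} ({meaning})"


if __name__ == "__main__":
    for code in (-33, -6):
        print(describe_dcgm_code(code))
```

Any code not listed here falls through to "UNKNOWN" rather than guessing at a meaning.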
I searched for an answer in the source code. In the header file
The host machine runs CentOS. In a container built with dynolog's Dockerfile, running ./build/dynolog/src/dynolog -enable_gpu_monitor -use_JSON causes a segmentation fault.
DCGM fails to start via systemctl, so after installation I manually ran /usr/bin/nv-hostengine -n --service-account nvidia-dcgm from the command line to provide the service.
[root@j66f07370 dynolog]# Started host engine version 3.1.8 using port number: 5555
/usr/bin/nv-hostengine -n ./build/dynolog/src/dynolog -enable_gpu_monitor -use_JSON
I20230726 09:32:50.613127 4285 Main.cpp:163] Starting dynolog, version = 0.3.0, build git-hash = bb3c3a0
I20230726 09:32:50.613193 4285 DcgmGroupInfo.cpp:125] Creating DCGM instance with fields: 100 155 204 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012
I20230726 09:32:50.613641 4285 DcgmApiStub.cpp:144] Parse "libdcgm.so.3", dcgm version = 3
I20230726 09:32:50.613673 4285 DcgmApiStub.cpp:175] Loaded dcgm dynamic library
I20230726 09:32:50.634850 4285 DcgmGroupInfo.cpp:172] Added group id 2
I20230726 09:32:50.634871 4285 DcgmGroupInfo.cpp:182] Found 2 supported devices, with id:
I20230726 09:32:50.634882 4285 DcgmGroupInfo.cpp:187] Successfully add device: 0
I20230726 09:32:50.634891 4285 DcgmGroupInfo.cpp:187] Successfully add device: 1
I20230726 09:32:50.634907 4285 DcgmGroupInfo.cpp:218] Added field group 4 to group 2
I20230726 09:32:50.634915 4285 DcgmGroupInfo.cpp:228] Watching DCGM fields at interval (ms) = 10000
E20230726 09:32:50.715700 4285 DcgmGroupInfo.cpp:239] Failed dcgmWatchFields() return: -33 with group 2, field group 4
I20230726 09:32:50.715747 4285 DcgmGroupInfo.cpp:414] Unwatched profiling fields for group id 2
E20230726 09:32:50.715778 4285 DcgmGroupInfo.cpp:420] Failed dcgmUnwatchFields() for field group 4, return: -33
I20230726 09:32:50.715791 4285 DcgmGroupInfo.cpp:431] Destroyed field group 4
I20230726 09:32:50.715837 4285 DcgmGroupInfo.cpp:439] Destroyed group 2
I20230726 09:32:51.638577 4285 DcgmGroupInfo.cpp:445] Stopped embedded mode
I20230726 09:32:51.638674 4285 DcgmGroupInfo.cpp:451] Shutdown DCGM
I20230726 09:32:51.638762 4291 Main.cpp:143] Running DCGM loop : interval = 10 s.
I20230726 09:32:51.638803 4285 SimpleJsonServer.cpp:82] Listening to connections on port 1778
I20230726 09:32:51.638808 4291 Main.cpp:145] DCGM fields: 100,155,204,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012
I20230726 09:32:51.638819 4285 SimpleJsonServer.cpp:229] Launching RPC thread
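When triaging output like the log above, it can help to pull out only the error lines and their return codes. The following is a rough sketch, assuming the glog-style prefix seen in this log (severity letter, date, time, thread id, file:line]); the function name find_failures and the regular expressions are illustrative and not part of dynolog.

```python
import re

# glog-style prefix as it appears in the log above:
# <severity><YYYYMMDD> <HH:MM:SS.uuuuuu> <thread id> <file>:<line>] <message>
LOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<date>\d{8}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<file>[\w.]+):(?P<line>\d+)\] (?P<msg>.*)$"
)
# Matches the "return: -33" fragment in the failure messages.
RETURN_CODE = re.compile(r"return: (?P<code>-?\d+)")


def find_failures(lines):
    """Return (file, line, return_code) for every error-severity log line."""
    failures = []
    for raw in lines:
        m = LOG_LINE.match(raw.strip())
        if not m or m["severity"] != "E":
            continue
        code = RETURN_CODE.search(m["msg"])
        failures.append(
            (m["file"], int(m["line"]), int(code["code"]) if code else None)
        )
    return failures


# Three lines copied from the log in this issue.
SAMPLE = [
    "I20230726 09:32:50.634915 4285 DcgmGroupInfo.cpp:228] Watching DCGM fields at interval (ms) = 10000",
    "E20230726 09:32:50.715700 4285 DcgmGroupInfo.cpp:239] Failed dcgmWatchFields() return: -33 with group 2, field group 4",
    "E20230726 09:32:50.715778 4285 DcgmGroupInfo.cpp:420] Failed dcgmUnwatchFields() for field group 4, return: -33",
]

if __name__ == "__main__":
    for source_file, line_no, code in find_failures(SAMPLE):
        print(f"{source_file}:{line_no} returned {code}")
```

On the sample lines this reports the two failing calls in DcgmGroupInfo.cpp (lines 239 and 420), both with return code -33.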