run dynolog Segmentation fault #168

Open
zhuzhenxxx opened this issue Jul 26, 2023 · 5 comments

Comments

@zhuzhenxxx

The host machine runs CentOS. In a container built from dynolog's Dockerfile, running ./build/dynolog/src/dynolog -enable_gpu_monitor -use_JSON causes a segmentation fault.
DCGM fails to start via systemctl, so after installation I started the service manually from the command line with /usr/bin/nv-hostengine -n --service-account nvidia-dcgm.

root@j66f07370 dynolog]# Started host engine version 3.1.8 using port number: 5555
/usr/bin/nv-hostengine -n ./build/dynolog/src/dynolog -enable_gpu_monitor -use_JSON
I20230726 09:32:50.613127 4285 Main.cpp:163] Starting dynolog, version = 0.3.0, build git-hash = bb3c3a0
I20230726 09:32:50.613193 4285 DcgmGroupInfo.cpp:125] Creating DCGM instance with fields: 100 155 204 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012
I20230726 09:32:50.613641 4285 DcgmApiStub.cpp:144] Parse "libdcgm.so.3", dcgm version = 3
I20230726 09:32:50.613673 4285 DcgmApiStub.cpp:175] Loaded dcgm dynamic library
I20230726 09:32:50.634850 4285 DcgmGroupInfo.cpp:172] Added group id 2
I20230726 09:32:50.634871 4285 DcgmGroupInfo.cpp:182] Found 2 supported devices, with id:
I20230726 09:32:50.634882 4285 DcgmGroupInfo.cpp:187] Successfully add device: 0
I20230726 09:32:50.634891 4285 DcgmGroupInfo.cpp:187] Successfully add device: 1
I20230726 09:32:50.634907 4285 DcgmGroupInfo.cpp:218] Added field group 4 to group 2
I20230726 09:32:50.634915 4285 DcgmGroupInfo.cpp:228] Watching DCGM fields at interval (ms) = 10000
E20230726 09:32:50.715700 4285 DcgmGroupInfo.cpp:239] Failed dcgmWatchFields() return: -33 with group 2, field group 4
I20230726 09:32:50.715747 4285 DcgmGroupInfo.cpp:414] Unwatched profiling fields for group id 2
E20230726 09:32:50.715778 4285 DcgmGroupInfo.cpp:420] Failed dcgmUnwatchFields() for field group 4, return: -33
I20230726 09:32:50.715791 4285 DcgmGroupInfo.cpp:431] Destroyed field group 4
I20230726 09:32:50.715837 4285 DcgmGroupInfo.cpp:439] Destroyed group 2
I20230726 09:32:51.638577 4285 DcgmGroupInfo.cpp:445] Stopped embedded mode
I20230726 09:32:51.638674 4285 DcgmGroupInfo.cpp:451] Shutdown DCGM
I20230726 09:32:51.638762 4291 Main.cpp:143] Running DCGM loop : interval = 10 s.
I20230726 09:32:51.638803 4285 SimpleJsonServer.cpp:82] Listening to connections on port 1778
I20230726 09:32:51.638808 4291 Main.cpp:145] DCGM fields: 100,155,204,1001,1002,1003,1004,1005,1006,1007,1008,1009,1010,1011,1012
I20230726 09:32:51.638819 4285 SimpleJsonServer.cpp:229] Launching RPC thread

@zhuzhenxxx
Author

CPU profiling was successful.

@briancoutinho
Contributor

@zhuzhenxxx I looked at the error you were seeing, -33.
You can find it in dcgm_structs.h:

DCGM_ST_MODULE_NOT_LOADED = -33, //!< This request is serviced by a module of DCGM that is not currently loaded

This means that a required DCGM module has not been loaded, or that certain field groups are not supported; this might be due to the container environment.

How about this: pass just one field, "--dcgm_fields 100" (SM Clock), on the dynolog command line.
https://github.com/NVIDIA/DCGM/blob/master/dcgmlib/dcgm_fields.h#L435
The command line flags are documented here: https://github.com/facebookincubator/dynolog#gpu-monitoring
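
A sketch of how the single-field suggestion could be extended to find which fields get rejected: generate one dynolog invocation per field ID and run them one at a time. The binary path and the -enable_gpu_monitor, -use_JSON, and --dcgm_fields flags are taken from this thread, and the field IDs come from the startup log in the original report. Probing fields individually is an assumption of this sketch, not a workflow the maintainers describe.

```python
import shlex

# Path and flags as used earlier in this thread.
DYNOLOG_BIN = "./build/dynolog/src/dynolog"

# Field IDs listed in the "Creating DCGM instance with fields" log line.
DEFAULT_FIELDS = [100, 155, 204] + list(range(1001, 1013))


def probe_command(field_id: int) -> str:
    """Build a dynolog command line that watches a single DCGM field."""
    args = [
        DYNOLOG_BIN,
        "-enable_gpu_monitor",
        "-use_JSON",
        "--dcgm_fields",
        str(field_id),
    ]
    return shlex.join(args)


if __name__ == "__main__":
    # Print the commands instead of running them, so each one can be
    # started by hand and its log checked for "Failed dcgmWatchFields()".
    for field_id in DEFAULT_FIELDS:
        print(probe_command(field_id))
```

Running each printed command in turn and checking the log for a failed dcgmWatchFields() call may show whether only some fields trigger the error.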

@zhuzhenxxx
Author

Thank you very much. I tried the method you suggested, and with --dcgm_fields 100 there is no segfault. But there is a new problem: I receive a return value of -6 from DCGM, which it reports as "Feature not supported". What does that mean?

@stricklandye

I have also got -6 :(.

@stricklandye

I searched the source code for an answer. According to the header file dynolog/src/gpumon/dcgm_structs.h, -6 indicates that a feature is not supported, yet I can run dcgm-exporter directly. I don't know why.
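
For reference, the return codes discussed in this thread can be pulled out of a dynolog log and matched against dcgm_structs.h with a small script. This is a minimal sketch: the regular expression is fitted only to the two "Failed dcgm...()" line shapes shown in the log from the original report, the name for -33 is the one quoted earlier in the thread, and DCGM_ST_NOT_SUPPORTED for -6 is an assumption based on the "Feature not supported" message, so check it against your copy of the header.

```python
import re

# Status codes mentioned in this thread. The -6 name is an assumption;
# verify it in dcgm_structs.h before relying on it.
KNOWN_STATUS = {
    -33: "DCGM_ST_MODULE_NOT_LOADED",
    -6: "DCGM_ST_NOT_SUPPORTED",
}

# Matches e.g. "Failed dcgmWatchFields() return: -33 with group 2"
# and "Failed dcgmUnwatchFields() for field group 4, return: -33".
FAILURE_RE = re.compile(r"Failed (dcgm\w+)\(\).*?return: (-?\d+)")


def find_dcgm_failures(lines):
    """Return (function, code, name) for every failed DCGM call in the log."""
    failures = []
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            code = int(match.group(2))
            name = KNOWN_STATUS.get(code, "unknown")
            failures.append((match.group(1), code, name))
    return failures


if __name__ == "__main__":
    import sys

    # Usage: pipe dynolog's log output into this script.
    for func, code, name in find_dcgm_failures(sys.stdin):
        print(f"{func} returned {code} ({name})")
```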
