
A collection of fixes to improve performance and stability #34

Merged
merged 10 commits into main from various on Oct 28, 2024

Conversation

ErwanAliasr1
Collaborator

No description provided.

get_oem_system() tries to find the right endpoint, as it may vary
depending on the system.

During the monitoring loop, when we are in the case where the /Attributes
endpoint must be used, the function generates two redfish calls.

Once the right endpoint is found, let's reuse it directly at each call
to save time and reduce the number of redfish calls we perform.
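
A minimal sketch of the idea; the attribute and helper names below are
hypothetical, only get_oem_system() is the real entry point:

  def get_oem_system(self):
      # Resolve the OEM system endpoint once, then reuse the cached value
      # on every later call instead of probing again.
      if getattr(self, "_oem_system_endpoint", None) is None:
          # The probe may need up to two redfish calls (e.g. to find /Attributes).
          self._oem_system_endpoint = self._detect_oem_system_endpoint()
      return self.get_redfish_url(self._oem_system_endpoint)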

Signed-off-by: Erwan Velu <[email protected]>
A typical crash trace looks like:
Traceback (most recent call last):
  File "/usr/lib64/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/e.velu/hwbench/hwbench.py", line 117, in <module>
    main()
  File "/home/e.velu/hwbench/hwbench.py", line 44, in main
    benches.parse_jobs_config()
  File "/home/e.velu/hwbench/bench/benchmarks.py", line 128, in parse_jobs_config
    self.__schedule_benchmarks(
  File "/home/e.velu/hwbench/bench/benchmarks.py", line 145, in __schedule_benchmarks
    self.__schedule_benchmark(job, pinned_cpu, emp, validate_parameters)
  File "/home/e.velu/hwbench/bench/benchmarks.py", line 160, in __schedule_benchmark
    self.hardware.vendor.get_bmc().connect_redfish()
  File "/home/e.velu/hwbench/environment/vendors/bmc.py", line 82, in connect_redfish
    return super().connect_redfish(bmc_username, bmc_password, self.get_url())
  File "/home/e.velu/hwbench/environment/vendors/hpe/hpe.py", line 31, in get_url
    ipv4 = self.ilo.get_bmc_ipv4()
  File "/home/e.velu/hwbench/environment/vendors/hpe/ilorest.py", line 149, in get_bmc_ipv4
    if "Manager Dedicated Network Interface" in nc.get("Name"):
AttributeError: 'str' object has no attribute 'get'

get_bmc_ipv4() makes the following ilorest call: ilorest list --select ethernetinterface --filter "id=1" -j

On multi-node chassis like Apollo 2000 Gen10, the output is a list of interfaces.
On single-node chassis like a DL380, the output is a dict.

The code only handled the first case and crashed on single-node servers.

This patch ensures that if we get a single interface, we convert it into
a single-item list, which allows us to use the same parsing code for both
cases.
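
A minimal sketch of the normalization, assuming the JSON returned by the
ilorest call above is either a dict (single-node) or a list of dicts
(multi-node); the function name is hypothetical, the real fix lives in
get_bmc_ipv4():

  import json
  import subprocess

  def list_bmc_ethernet_interfaces() -> list:
      raw = subprocess.run(
          ["ilorest", "list", "--select", "ethernetinterface", "--filter", "id=1", "-j"],
          capture_output=True, text=True, check=True,
      ).stdout
      interfaces = json.loads(raw)
      # Single-node chassis (e.g. DL380) return a single dict: wrap it in a
      # one-item list so the same parsing loop works for both cases.
      if isinstance(interfaces, dict):
          interfaces = [interfaces]
      return interfaces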

Tested on DL380, ilorest 4.9.0.0, Model: 'iLO 5' FW: 'iLO 5 v3.03'
Tested on XL225n (Apollo 2000 Gen10), ilorest 4.9.0.0, Model: 'iLO 5' FW: 'iLO 5 v3.08'
Tested on DL340Gen12, ilorest 4.9.0.0, Model: 'iLO 6' FW: 'iLO 6 v1.62'

Signed-off-by: Erwan Velu <[email protected]>
When a PSU is missing, the PowerSupplies endpoint looks like the
following:

{'@odata.id': '/redfish/v1/Chassis/1/Power/#PowerSupplies/0',
 'FirmwareVersion': '0.23.49',
 'LastPowerOutputWatts': 144,
 'LineInputVoltage': 233,
 'LineInputVoltageType': 'ACHighLine',
 'Manufacturer': None,
 'MemberId': '0',
 'Model': 'P67240-B21',
 'Name': 'HpeServerPowerSupply',
 'Oem': {'Hpe': {'@odata.context': '/redfish/v1/$metadata#HpeServerPowerSupply.HpeServerPowerSupply',
                 '@odata.type': '#HpeServerPowerSupply.v2_0_0.HpeServerPowerSupply',
                 'AveragePowerOutputWatts': 144,
                 'BayNumber': 1,
                 'Domain': 'System',
                 'HotplugCapable': True,
                 'MaxPowerOutputWatts': 143,
                 'Mismatched': False,
                 'PowerSupplyStatus': {'State': 'Ok'},
                 'iPDUCapable': False}},
 'PowerCapacityWatts': 1000,
 'PowerSupplyType': 'Unknown',
 'SerialNumber': 'XXXXXXXXXXU',
 'SparePartNumber': 'P68455-001',
 'Status': {'Health': 'OK', 'State': 'Enabled'}}
{'@odata.id': '/redfish/v1/Chassis/1/Power/#PowerSupplies/1',
 'MemberId': '1',
 'Oem': {'Hpe': {'@odata.context': '/redfish/v1/$metadata#HpeServerPowerSupply.HpeServerPowerSupply',
                 '@odata.type': '#HpeServerPowerSupply.v2_0_0.HpeServerPowerSupply',
                 'BayNumber': 2}},
 'Status': {'Health': 'Warning', 'State': 'Absent'}}

The missing 'Name' member, for PSU 1 here, crashed the application.

This commit manages PSU presence by checking the 'Status' member and its content.
If no 'Status' and/or 'State' is found, a custom error message is reported to the user.
A cache is used to avoid polluting the output with repeated messages.
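
A minimal sketch of the presence check, with hypothetical names; the cache
ensures the warning is printed only once per PSU:

  _missing_status_reported: set = set()

  def psu_is_usable(psu: dict) -> bool:
      # A PSU without Status/State information cannot be trusted.
      state = psu.get("Status", {}).get("State")
      if state is None:
          member = psu.get("MemberId", "?")
          if member not in _missing_status_reported:
              _missing_status_reported.add(member)
              print(f"PSU {member}: no Status/State reported by the BMC, ignoring it")
          return False
      # 'Absent' is what the BMC reports for an empty bay (see PSU 1 above).
      return state != "Absent"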

Tested on DL380, ilorest 4.9.0.0, Model: 'iLO 5' FW: 'iLO 5 v3.03'
Tested on XL225n (Apollo 2000 Gen10), ilorest 4.9.0.0, Model: 'iLO 5' FW: 'iLO 5 v3.08'
Tested on DL340Gen12, ilorest 4.9.0.0, Model: 'iLO 6' FW: 'iLO 6 v1.62'

Reported-by: Elyes ZEKRI <[email protected]>
Signed-off-by: Erwan Velu <[email protected]>
On mono-node HPE chassis, the current code calls the enclosurechassis
endpoint on every monitoring loop.

This call is useless since the enclosurechassis endpoint does not exist
on such systems.

This patch adds an is_multinode_chassis() function whose result is
cached: it tries this endpoint once and returns a constant answer.

Based on this cached status, a real read of the enclosurechassis is
done for multi-node chassis only.
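
A minimal sketch of the cached probe; the helper names and the
_is_multinode attribute (assumed initialized to None in __init__) are
hypothetical:

  def is_multinode_chassis(self) -> bool:
      # Probe the enclosurechassis endpoint only once and reuse the cached
      # answer in every monitoring iteration.
      if self._is_multinode is None:
          self._is_multinode = self._probe_enclosurechassis() is not None
      return self._is_multinode

  def read_enclosure_power(self):
      # A real read is only performed on multi-node chassis.
      if not self.is_multinode_chassis():
          return None
      return self._read_enclosurechassis()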

Tested on DL380, ilorest 4.9.0.0, Model: 'iLO 5' FW: 'iLO 5 v3.03'
Tested on XL225n (Apollo 2000 Gen10), ilorest 4.9.0.0, Model: 'iLO 5' FW: 'iLO 5 v3.08'
Tested on DL340Gen12, ilorest 4.9.0.0, Model: 'iLO 6' FW: 'iLO 6 v1.62'

Signed-off-by: Erwan Velu <[email protected]>
By default, turbostat was given 200ms to start & complete.
When CPUs do not have hyperthreading and all cores are loaded by a
benchmark, like stress-ng, turbostat takes more than 200ms.

As a result, the self.turbostat.parse() call in monitoring.py waits for
turbostat to complete, which makes the monitoring iterations run late.

This was generating traces like:
	hwbench: 1 jobs, 1 benchmarks, ETA 0h 01m 00s
	[full_cpu_load] stressng/cpu/matrixprod(M): 128 stressor on CPU [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127] for 60s
	Monitoring iteration 8 is 0.12ms late
	Monitoring iteration 9 is 0.47ms late
	Monitoring iteration 10 is 0.78ms late
	Monitoring iteration 11 is 1.19ms late
	Monitoring iteration 12 is 1.28ms late
	Monitoring iteration 13 is 1.59ms late
	Monitoring iteration 14 is 1.90ms late
	Monitoring iteration 15 is 2.22ms late
	Monitoring iteration 16 is 2.53ms late
	Monitoring iteration 17 is 2.85ms late
	Monitoring iteration 18 is 3.54ms late
	Monitoring iteration 19 is 3.48ms late
	Monitoring iteration 20 is 3.82ms late
	Monitoring iteration 21 is 4.15ms late
	Monitoring iteration 22 is 4.42ms late
	Monitoring iteration 23 is 4.71ms late
	Monitoring iteration 24 is 5.02ms late
	Monitoring iteration 25 is 5.57ms late
	Monitoring iteration 26 is 5.66ms late
	Monitoring iteration 27 is 5.89ms late
	Monitoring iteration 28 is 5.77ms late

This patch lamely increases the time budget to 500ms.
On the affected machines, it solved the issue.

Maybe at some point we'll have to increase the 'precision' window to get
a better ratio between time budget and run time.
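
An illustrative sketch of the timing relationship; turbostat.parse() is the
only real name here, everything else is hypothetical:

  import time

  TURBOSTAT_TIME_BUDGET_MS = 500  # previously 200, too short on loaded non-HT CPUs

  def collect_turbostat_sample(turbostat):
      time.sleep(TURBOSTAT_TIME_BUDGET_MS / 1000.0)
      # If turbostat has not completed yet, parse() waits for it and the
      # whole monitoring iteration ends up being reported as "late".
      return turbostat.parse()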

Tested on Intel(R) Xeon(R) 6756E systems.

Signed-off-by: Erwan Velu <[email protected]>
If a monitoring metric is empty, do not crash hwgraph but rather display
a warning.

This was seen in issue #35.

The following metric was causing the issue:
  "PDU": {
    "Polling": {
      "name": "Polling",
      "unit": "ms",
      "mean": [],
      "min": [],
      "max": [],
      "stdev": [],
      "samples": []
    }
  },
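
A minimal sketch of the guard, with hypothetical names:

  def render_metric(job_name: str, component: str, metric: dict):
      # Skip metrics with no samples instead of crashing on them.
      if not metric.get("samples"):
          print(f"{job_name}: No samples found in {component}.{metric['name']}, ignoring metric.")
          return None
      return plot_metric(metric)  # hypothetical rendering call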

With this commit, a typical output looks like the following:

  environment: rendering 131 jobs from scaleway/hwbench-out-20241023125613/results.json (DL320)
  avx_1: No samples found in PDU.Polling, ignoring metric.
  avx_10: No samples found in PDU.Polling, ignoring metric.
  avx_11: No samples found in PDU.Polling, ignoring metric.
  avx_12: No samples found in PDU.Polling, ignoring metric.
  avx_13: No samples found in PDU.Polling, ignoring metric.

Reported-by: Elyes ZEKRI <[email protected]>
Signed-off-by: Erwan Velu <[email protected]>
When loading metrics, it appears that some can have the same name, which
produces invalid data, wrong data names and misleading graphs.

This was seen in issue #36.

The BMC reported the following two structures:

          "CPU": {
            "02-CPU 1 PkgTmp": {
              "name": "CPU1",
              "unit": "Celsius",
              "mean": [
	  [...]
            "12-CPU 1": {
              "name": "CPU1",
              "unit": "Celsius",
              "mean": [

As we use the 'name' member, the two different metrics were merged into
a single one.

This patch adds a workaround: the parent's key is kept as the
"full_name" of the metric.

This value, e.g. "12-CPU 1", is stored as the full name and is reachable
via get_full_name().

This commit ensures the environment graphs use this full name to avoid
merging distinct metrics.
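
A minimal sketch of the idea; the class layout is hypothetical,
get_full_name() is the real accessor:

  class Metric:
      def __init__(self, full_name: str, data: dict):
          self.full_name = full_name    # parent key, e.g. "12-CPU 1"
          self.name = data["name"]      # "CPU1" for both entries above
          self.unit = data["unit"]

      def get_full_name(self) -> str:
          # Used by the environment graphs so that "02-CPU 1 PkgTmp" and
          # "12-CPU 1" stay two distinct series despite sharing name "CPU1".
          return self.full_name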

Signed-off-by: Erwan Velu <[email protected]>
As per issue #38, the dmidecode call that saves the SMBIOS table in a
binary file for later reading is broken: the file is not present in
the hwbench-out directory.

It appears that when dmidecode --dump-file is called, it is done with a
relative path.

The stderr log shows the following error:
	hwbench-out-20241025123459/dmidecode.bin: open: No such file or directory

It appears the relative path passed to dmidecode does not exist because
dmidecode is not started from the same working directory as the one we
are currently running the tool from.

This commit uses an absolute path instead of a relative one to ensure
that every command/tool can dump content into the proper directory.

With this patch, the dmidecode-bin file is now present in the
hwbench-out archive.
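
A minimal sketch of the fix, assuming the output directory is passed as
out_dir (hypothetical name); this sketch uses dmidecode's --dump-bin
option to produce the binary dump:

  import pathlib
  import subprocess

  def dump_smbios(out_dir: str) -> pathlib.Path:
      # Resolve to an absolute path so dmidecode can open the file even if
      # it is not started from the directory hwbench is running in.
      dump_file = pathlib.Path(out_dir).resolve() / "dmidecode.bin"
      subprocess.run(["dmidecode", "--dump-bin", str(dump_file)], check=True)
      return dump_file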

Signed-off-by: Erwan Velu <[email protected]>
As reported in issue #40, when data points are missing,
hwgraph was crashing because graphs must have the same
number of data points.

In such a case, let's default to a fatal message to the user instead of
crashing.

This commit also adds a new hwgraph option, --ignore-missing-datapoint,
that can be used to ignore this case and still produce a graph:

  --ignore-missing-datapoint=zero replaces the missing data points with a 0 value.
  --ignore-missing-datapoint=last replaces the missing data points with the last known value of the series.

It's up to the user to decide which one is best, while understanding
that the resulting graphs will be partially wrong.

By default, hwgraph generates a fatal error.
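
A minimal sketch of the padding logic; the function is hypothetical, the
option values are the ones added by this commit:

  import sys

  def pad_series(series, expected_len, mode=None):
      missing = expected_len - len(series)
      if missing <= 0:
          return series
      if mode == "zero":
          return series + [0.0] * missing
      if mode == "last":
          return series + [series[-1] if series else 0.0] * missing
      # Default behavior: fatal error instead of drawing a wrong graph.
      sys.exit("FATAL: series have different numbers of data points, "
               "use --ignore-missing-datapoint=zero|last to graph anyway")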

Signed-off-by: Erwan Velu <[email protected]>
As per issue #39, it could be super useful to have a new type of graph
to represent how the CPU clock performed during a benchmark.

This new graph gives a quick overview of how the CPU frequency behaves
during a scaling benchmark.

The rendering is not as precise as the environment graphs, but it gives a
brief overview with the following trade-offs for the y-err bars (see the
sketch after this list):
- the min of the y-err bar is the min of the min values
- the mean of the y-err bar is the mean of the mean values
- the max of the y-err bar is the max of the max values
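
A minimal sketch of that aggregation (the data layout is hypothetical):

  import statistics

  def clock_yerr_bar(mins, means, maxs):
      # One point per benchmark: aggregate the per-sample statistics.
      low = min(mins)                  # min of the min values
      mid = statistics.mean(means)     # mean of the mean values
      high = max(maxs)                 # max of the max values
      return low, mid, high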

Signed-off-by: Erwan Velu <[email protected]>
@ErwanAliasr1 ErwanAliasr1 removed the request for review from Arno500 October 28, 2024 15:11
@ErwanAliasr1 ErwanAliasr1 merged commit a930da3 into main Oct 28, 2024
4 checks passed
@ErwanAliasr1 ErwanAliasr1 deleted the various branch October 28, 2024 15:12