A collection of fixes to improve performance and stability #34
Merged
Conversation
get_oem_system() tries to find the right endpoint, which can vary from one system to another. During the monitoring loop, when we are in the case where the /Attributes endpoint must be used, the function generates two redfish calls. Once the right endpoint has been found, let's reuse it directly on each call to save time and reduce the number of redfish calls we perform.

Signed-off-by: Erwan Velu <[email protected]>
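A minimal sketch of the caching pattern described above, assuming hypothetical names (the class layout, the candidate endpoint strings and the redfish_get callable are illustrative, not the actual hwbench API):

    class Bmc:
        """Sketch: cache the OEM endpoint once it has been resolved."""

        def __init__(self, redfish_get):
            self.redfish_get = redfish_get  # callable performing one redfish GET
            self.oem_endpoint = ""          # cached endpoint, empty until resolved

        def get_oem_system(self) -> dict:
            # Reuse the endpoint found by a previous call so every monitoring
            # iteration costs a single redfish call instead of two.
            if self.oem_endpoint:
                return self.redfish_get(self.oem_endpoint)
            # First call: probe the candidate endpoints and keep the one that answers.
            for candidate in ("/Systems/1/Oem", "/Systems/1/Bios/Attributes"):
                answer = self.redfish_get(candidate)
                if answer:
                    self.oem_endpoint = candidate
                    return answer
            return {}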
A typical crash trace looks like:

    Traceback (most recent call last):
      File "/usr/lib64/python3.9/runpy.py", line 197, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/home/e.velu/hwbench/hwbench.py", line 117, in <module>
        main()
      File "/home/e.velu/hwbench/hwbench.py", line 44, in main
        benches.parse_jobs_config()
      File "/home/e.velu/hwbench/bench/benchmarks.py", line 128, in parse_jobs_config
        self.__schedule_benchmarks(
      File "/home/e.velu/hwbench/bench/benchmarks.py", line 145, in __schedule_benchmarks
        self.__schedule_benchmark(job, pinned_cpu, emp, validate_parameters)
      File "/home/e.velu/hwbench/bench/benchmarks.py", line 160, in __schedule_benchmark
        self.hardware.vendor.get_bmc().connect_redfish()
      File "/home/e.velu/hwbench/environment/vendors/bmc.py", line 82, in connect_redfish
        return super().connect_redfish(bmc_username, bmc_password, self.get_url())
      File "/home/e.velu/hwbench/environment/vendors/hpe/hpe.py", line 31, in get_url
        ipv4 = self.ilo.get_bmc_ipv4()
      File "/home/e.velu/hwbench/environment/vendors/hpe/ilorest.py", line 149, in get_bmc_ipv4
        if "Manager Dedicated Network Interface" in nc.get("Name"):
    AttributeError: 'str' object has no attribute 'get'

get_bmc_ipv4() makes the following ilorest call: ilorest list --select ethernetinterface --filter "id=1" -j. On multi-node chassis like the Apollo 2000 Gen10, the output is a list of interfaces. On single-node chassis like a DL380, the output is a dict. The code only handled the first case and crashed on single-node servers. This patch ensures that if we get a single interface, we convert it into a single-item list. This allows us to use the same parsing code for both cases.

Tested on DL380, ilorest 4.9.0.0, Model: 'iLO 5' FW: 'iLO 5 v3.03'
Tested on XL225n (Apollo 2000 Gen10), ilorest 4.9.0.0, Model: 'iLO 5' FW: 'iLO 5 v3.08'
Tested on DL340Gen12, ilorest 4.9.0.0, Model: 'iLO 6' FW: 'iLO 6 v1.62'

Signed-off-by: Erwan Velu <[email protected]>
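The fix boils down to normalizing the ilorest output before parsing. A sketch, assuming a hypothetical helper function and variable names (only the dict-to-list wrapping reflects what the commit describes):

    def find_dedicated_interface(ilorest_output):
        """Sketch: accept both single-node (dict) and multi-node (list) outputs."""
        interfaces = ilorest_output
        # Single-node chassis return a single dict; wrap it so the parsing
        # code below can always iterate over a list of interfaces.
        if isinstance(interfaces, dict):
            interfaces = [interfaces]
        for nc in interfaces:
            if "Manager Dedicated Network Interface" in nc.get("Name", ""):
                return nc
        return None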
anisse reviewed Oct 24, 2024
When a PSU is missing, the PowerSupplies endpoint looks like the following:

    {'@odata.id': '/redfish/v1/Chassis/1/Power/#PowerSupplies/0',
     'FirmwareVersion': '0.23.49',
     'LastPowerOutputWatts': 144,
     'LineInputVoltage': 233,
     'LineInputVoltageType': 'ACHighLine',
     'Manufacturer': None,
     'MemberId': '0',
     'Model': 'P67240-B21',
     'Name': 'HpeServerPowerSupply',
     'Oem': {'Hpe': {'@odata.context': '/redfish/v1/$metadata#HpeServerPowerSupply.HpeServerPowerSupply',
                     '@odata.type': '#HpeServerPowerSupply.v2_0_0.HpeServerPowerSupply',
                     'AveragePowerOutputWatts': 144,
                     'BayNumber': 1,
                     'Domain': 'System',
                     'HotplugCapable': True,
                     'MaxPowerOutputWatts': 143,
                     'Mismatched': False,
                     'PowerSupplyStatus': {'State': 'Ok'},
                     'iPDUCapable': False}},
     'PowerCapacityWatts': 1000,
     'PowerSupplyType': 'Unknown',
     'SerialNumber': 'XXXXXXXXXXU',
     'SparePartNumber': 'P68455-001',
     'Status': {'Health': 'OK', 'State': 'Enabled'}}

    {'@odata.id': '/redfish/v1/Chassis/1/Power/#PowerSupplies/1',
     'MemberId': '1',
     'Oem': {'Hpe': {'@odata.context': '/redfish/v1/$metadata#HpeServerPowerSupply.HpeServerPowerSupply',
                     '@odata.type': '#HpeServerPowerSupply.v2_0_0.HpeServerPowerSupply',
                     'BayNumber': 2}},
     'Status': {'Health': 'Warning', 'State': 'Absent'}}

The missing 'Name' member, for PSU 1 here, crashed the application. This commit manages PSU presence by checking the 'Status' member and its content. If no 'Status' and/or 'State' is found, a custom error message is reported to the user. A cache is used to avoid polluting the output.

Tested on DL380, ilorest 4.9.0.0, Model: 'iLO 5' FW: 'iLO 5 v3.03'
Tested on XL225n (Apollo 2000 Gen10), ilorest 4.9.0.0, Model: 'iLO 5' FW: 'iLO 5 v3.08'
Tested on DL340Gen12, ilorest 4.9.0.0, Model: 'iLO 6' FW: 'iLO 6 v1.62'

Reported-by: Elyes ZEKRI <[email protected]>
Signed-off-by: Erwan Velu <[email protected]>
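A sketch of the presence check described above; the warned_psus cache and the psu_is_present helper are hypothetical names, not the actual hwbench code:

    warned_psus: set[str] = set()  # cache so each problematic PSU is reported once

    def psu_is_present(psu: dict) -> bool:
        """Sketch: only parse PSUs whose Status/State says they are enabled."""
        status = psu.get("Status")
        state = status.get("State") if isinstance(status, dict) else None
        if state is None:
            member = str(psu.get("MemberId", "?"))
            if member not in warned_psus:
                warned_psus.add(member)
                print(f"PSU {member}: no Status/State reported, ignoring it")
            return False
        return state == "Enabled"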
On mono-node HPE chassis, the current code calls the enclosurechassis endpoint on each monitoring loop. This call is perfectly useless since the enclosurechassis endpoint does not exist there. This patch adds a cached is_multinode_chassis() function that tries this endpoint once and returns a constant answer. Depending on this cached status, a real read of the enclosurechassis is done for multinode chassis only.

Tested on DL380, ilorest 4.9.0.0, Model: 'iLO 5' FW: 'iLO 5 v3.03'
Tested on XL225n (Apollo 2000 Gen10), ilorest 4.9.0.0, Model: 'iLO 5' FW: 'iLO 5 v3.08'
Tested on DL340Gen12, ilorest 4.9.0.0, Model: 'iLO 6' FW: 'iLO 6 v1.62'

Signed-off-by: Erwan Velu <[email protected]>
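A sketch of the cached detection using functools.cache; the endpoint path, class layout and redfish_get callable are assumptions made for illustration:

    import functools

    class Ilo:
        def __init__(self, redfish_get):
            self.redfish_get = redfish_get

        @functools.cache
        def is_multinode_chassis(self) -> bool:
            # Probed once: mono-node chassis have no enclosurechassis endpoint,
            # so the answer is constant for the lifetime of the run.
            return bool(self.redfish_get("/redfish/v1/Chassis/enclosurechassis/"))

        def read_enclosure(self) -> dict:
            # Only multinode chassis pay the cost of the real read on each loop.
            if self.is_multinode_chassis():
                return self.redfish_get("/redfish/v1/Chassis/enclosurechassis/")
            return {}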
By default, turbostat was given 200ms to start & complete. When CPUs do not have hyperthreading and all cores are loaded by a benchmark like stress-ng, turbostat takes more than 200ms. As a result, the self.turbostat.parse() call in monitoring.py waits for turbostat to complete, which makes the iteration run over its time budget. This was generating traces like:

    hwbench: 1 jobs, 1 benchmarks, ETA 0h 01m 00s
    [full_cpu_load] stressng/cpu/matrixprod(M): 128 stressor on CPU [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127] for 60s
    Monitoring iteration 8 is 0.12ms late
    Monitoring iteration 9 is 0.47ms late
    Monitoring iteration 10 is 0.78ms late
    Monitoring iteration 11 is 1.19ms late
    Monitoring iteration 12 is 1.28ms late
    Monitoring iteration 13 is 1.59ms late
    Monitoring iteration 14 is 1.90ms late
    Monitoring iteration 15 is 2.22ms late
    Monitoring iteration 16 is 2.53ms late
    Monitoring iteration 17 is 2.85ms late
    Monitoring iteration 18 is 3.54ms late
    Monitoring iteration 19 is 3.48ms late
    Monitoring iteration 20 is 3.82ms late
    Monitoring iteration 21 is 4.15ms late
    Monitoring iteration 22 is 4.42ms late
    Monitoring iteration 23 is 4.71ms late
    Monitoring iteration 24 is 5.02ms late
    Monitoring iteration 25 is 5.57ms late
    Monitoring iteration 26 is 5.66ms late
    Monitoring iteration 27 is 5.89ms late
    Monitoring iteration 28 is 5.77ms late

This patch lamely increases the time budget to 500ms. On the affected machines, it solved the issue. Maybe at some point we'll have to increase the 'precision' window to get a better ratio of time budget vs run time.

Tested on Intel(R) Xeon(R) 6756E systems.

Signed-off-by: Erwan Velu <[email protected]>
If a monitoring metric is empty, do not crash hwgraph but rather display a warning. This was seen in issue #35. The following metric was causing the issue:

    "PDU": {
        "Polling": {
            "name": "Polling",
            "unit": "ms",
            "mean": [],
            "min": [],
            "max": [],
            "stdev": [],
            "samples": []
        }
    },

With this commit, a typical output looks like the following:

    environment: rendering 131 jobs from scaleway/hwbench-out-20241023125613/results.json (DL320)
    avx_1: No samples found in PDU.Polling, ignoring metric.
    avx_10: No samples found in PDU.Polling, ignoring metric.
    avx_11: No samples found in PDU.Polling, ignoring metric.
    avx_12: No samples found in PDU.Polling, ignoring metric.
    avx_13: No samples found in PDU.Polling, ignoring metric.

Reported-by: Elyes ZEKRI <[email protected]>
Signed-off-by: Erwan Velu <[email protected]>
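A sketch of the guard, assuming a hypothetical metric dict shaped like the JSON above; the function name and arguments are illustrative:

    def render_metric(job: str, component: str, name: str, metric: dict) -> None:
        """Sketch: skip metrics that carry no samples instead of crashing."""
        if not metric.get("samples"):
            print(f"{job}: No samples found in {component}.{name}, ignoring metric.")
            return
        # ... normal graph rendering would go here ...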
When loading metrics, it appears that some can have the same name, producing invalid data, ambiguous data names and misleading graphs. This was seen in issue #36. The BMC reported the two following structures:

    "CPU": {
        "02-CPU 1 PkgTmp": {
            "name": "CPU1",
            "unit": "Celsius",
            "mean": [
        [...]
        "12-CPU 1": {
            "name": "CPU1",
            "unit": "Celsius",
            "mean": [

As we use the name member, the two different metrics were merged into a single one. This patch adds a workaround by considering the parent's name as the "full_name" of the metric. So this patch stores this value, like "12-CPU 1", as the full name, reachable via get_full_name(). This commit ensures the env graphs use this full name to avoid smashing metrics together.

Signed-off-by: Erwan Velu <[email protected]>
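A sketch of the full-name workaround; MonitorMetric here is a stand-in for the real class, and the attribute names are assumptions:

    class MonitorMetric:
        """Sketch: keep both the BMC-reported name and the unambiguous parent key."""

        def __init__(self, name: str, unit: str):
            self.name = name        # e.g. "CPU1", may collide between sensors
            self.unit = unit
            self.full_name = name   # overridden with the parent key when loading

        def get_full_name(self) -> str:
            return self.full_name

    # When loading "12-CPU 1": {"name": "CPU1", ...}, the parent key is kept
    # so the env graphs can tell the two CPU1 sensors apart.
    metric = MonitorMetric("CPU1", "Celsius")
    metric.full_name = "12-CPU 1"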
As per issue #38, the dmidecode call to save the SMBIOS table in a binary file for later reading is broken: the file is not present in the hwbench-out directory. It appears that when dmidecode --dump-file is called, it is done with a relative path. The stderr log shows the following error:

    hwbench-out-20241025123459/dmidecode.bin: open: No such file or directory

The relative path passed to dmidecode does not exist because dmidecode is not started from the same directory the tool is currently running in. This commit uses an absolute path instead of a relative one to ensure that every command/tool can dump content in the proper directory. With this patch, the dmidecode-bin file is now present in the hwbench-out archive.

Signed-off-by: Erwan Velu <[email protected]>
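A sketch of the path handling under the assumption that the output directory is built with pathlib; the directory name and the exact dmidecode invocation (here --dump-bin, which needs root) are illustrative, not taken from the hwbench source:

    import pathlib
    import subprocess

    # Resolving the output directory to an absolute path means external tools
    # such as dmidecode can write into it regardless of their working directory.
    out_dir = pathlib.Path("hwbench-out-20241025123459").resolve()
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["dmidecode", "--dump-bin", str(out_dir / "dmidecode.bin")],
        check=False,
    )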
As reported in issue #40, when data points are missing, hwgraph was crashing because graphs must have the same number of data points. In such a case, let's default to a fatal message to the user instead of crashing. This commit also adds a new hwgraph option, --ignore-missing-datapoint, that can be used to ignore this case and graph anyway. --ignore-missing-datapoint=zero replaces the missing data point with a 0 value. --ignore-missing-datapoint=last replaces the missing data point with the last known value of the series. It's up to the user to decide which one is best, but also to understand that graphs will be partially wrong. By default, hwgraph generates a fatal error.

Signed-off-by: Erwan Velu <[email protected]>
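A sketch of the two fill strategies, with a hypothetical helper name; only the zero/last semantics come from the commit message:

    def fill_missing(samples: list[float], expected: int, mode: str) -> list[float]:
        """Sketch: pad a short series so every trace has the same length.

        mode is the value of --ignore-missing-datapoint: "zero" or "last".
        """
        missing = expected - len(samples)
        if missing <= 0:
            return samples
        if mode == "zero":
            return samples + [0.0] * missing
        if mode == "last":
            last = samples[-1] if samples else 0.0
            return samples + [last] * missing
        # Default behavior: report a fatal error instead of graphing bogus data.
        raise SystemExit("fatal: series have different numbers of data points")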
As per issue #39, it could be super useful to have a new type of graph representing how the CPU clock performed during a benchmark. This new graph gives a quick overview of how the CPU frequency behaved during a scaling benchmark. The rendering is not as precise as the environment graphs, but it gives a brief overview with the following trade-offs for the y-err bars (see the sketch after this list):
- the min of the y-err bar is the min of the min values
- the mean of the y-err bar is the mean of the mean values
- the max of the y-err bar is the max of the max values

Signed-off-by: Erwan Velu <[email protected]>
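The y-err aggregation can be summarized like this; the function name and argument layout are assumptions for illustration:

    import statistics

    def clock_yerr(mins: list[float], means: list[float], maxs: list[float]):
        """Sketch: collapse per-sample min/mean/max series into one y-err bar."""
        return min(mins), statistics.mean(means), max(maxs)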
anisse reviewed Oct 28, 2024
anisse reviewed Oct 28, 2024
anisse approved these changes Oct 28, 2024