
A collection of fixes to improve performance and stability #34

Merged
merged 10 commits into main from various on Oct 28, 2024

Conversation

ErwanAliasr1
Collaborator

No description provided.

get_oem_system() tries to find the right endpoint, as it may vary
depending on the system.

During the monitoring loop, when we are in the case where the /Attributes
endpoint must be used, the function generates two redfish calls.

Once the right endpoint is found, let's reuse it directly at each call
to save time and reduce the number of redfish calls we perform.
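
A minimal sketch of the idea; the attribute and helper names below are
hypothetical, only get_oem_system() is the real entry point:

  def get_oem_system(self):
      # Resolve the OEM system endpoint once, then reuse the cached value
      # on every later call instead of probing again.
      if getattr(self, "_oem_system_endpoint", None) is None:
          # The probe may need up to two redfish calls (e.g. to find /Attributes).
          self._oem_system_endpoint = self._detect_oem_system_endpoint()
      return self.get_redfish_url(self._oem_system_endpoint)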

Signed-off-by: Erwan Velu <[email protected]>
A typical crash trace looks like:
Traceback (most recent call last):
  File "/usr/lib64/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/e.velu/hwbench/hwbench.py", line 117, in <module>
    main()
  File "/home/e.velu/hwbench/hwbench.py", line 44, in main
    benches.parse_jobs_config()
  File "/home/e.velu/hwbench/bench/benchmarks.py", line 128, in parse_jobs_config
    self.__schedule_benchmarks(
  File "/home/e.velu/hwbench/bench/benchmarks.py", line 145, in __schedule_benchmarks
    self.__schedule_benchmark(job, pinned_cpu, emp, validate_parameters)
  File "/home/e.velu/hwbench/bench/benchmarks.py", line 160, in __schedule_benchmark
    self.hardware.vendor.get_bmc().connect_redfish()
  File "/home/e.velu/hwbench/environment/vendors/bmc.py", line 82, in connect_redfish
    return super().connect_redfish(bmc_username, bmc_password, self.get_url())
  File "/home/e.velu/hwbench/environment/vendors/hpe/hpe.py", line 31, in get_url
    ipv4 = self.ilo.get_bmc_ipv4()
  File "/home/e.velu/hwbench/environment/vendors/hpe/ilorest.py", line 149, in get_bmc_ipv4
    if "Manager Dedicated Network Interface" in nc.get("Name"):
AttributeError: 'str' object has no attribute 'get'

get_bmc_ipv4() makes the following ilorest call: ilorest list --select ethernetinterface --filter "id=1" -j

On multi-node chassis like Apollo 2000 Gen10, the output is a list of interfaces.
On single-node chassis like a DL380, the output is a dict.

The code only handled the first case and crashed on single-node servers.

This patch ensures that if we get a single interface, we convert it into
a single-item list, which allows us to use the same parsing code for both
cases.
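
A minimal sketch of the normalization, assuming the JSON returned by the
ilorest call above is either a dict (single-node) or a list of dicts
(multi-node); the function name is hypothetical, the real fix lives in
get_bmc_ipv4():

  import json
  import subprocess

  def list_bmc_ethernet_interfaces() -> list:
      raw = subprocess.run(
          ["ilorest", "list", "--select", "ethernetinterface", "--filter", "id=1", "-j"],
          capture_output=True, text=True, check=True,
      ).stdout
      interfaces = json.loads(raw)
      # Single-node chassis (e.g. DL380) return a single dict: wrap it in a
      # one-item list so the same parsing loop works for both cases.
      if isinstance(interfaces, dict):
          interfaces = [interfaces]
      return interfaces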

Tested on DL380, ilorest 4.9.0.0, Model: 'iLO 5' FW: 'iLO 5 v3.03'
Tested on XL225n (Apollo 2000 Gen10), ilorest 4.9.0.0, Model: 'iLO 5' FW: 'iLO 5 v3.08'
Tested on DL340Gen12, ilorest 4.9.0.0, Model: 'iLO 6' FW: 'iLO 6 v1.62'

Signed-off-by: Erwan Velu <[email protected]>
When a PSU is missing, the PowerSupplies endpoint looks like the
following:

{'@odata.id': '/redfish/v1/Chassis/1/Power/#PowerSupplies/0',
 'FirmwareVersion': '0.23.49',
 'LastPowerOutputWatts': 144,
 'LineInputVoltage': 233,
 'LineInputVoltageType': 'ACHighLine',
 'Manufacturer': None,
 'MemberId': '0',
 'Model': 'P67240-B21',
 'Name': 'HpeServerPowerSupply',
 'Oem': {'Hpe': {'@odata.context': '/redfish/v1/$metadata#HpeServerPowerSupply.HpeServerPowerSupply',
                 '@odata.type': '#HpeServerPowerSupply.v2_0_0.HpeServerPowerSupply',
                 'AveragePowerOutputWatts': 144,
                 'BayNumber': 1,
                 'Domain': 'System',
                 'HotplugCapable': True,
                 'MaxPowerOutputWatts': 143,
                 'Mismatched': False,
                 'PowerSupplyStatus': {'State': 'Ok'},
                 'iPDUCapable': False}},
 'PowerCapacityWatts': 1000,
 'PowerSupplyType': 'Unknown',
 'SerialNumber': 'XXXXXXXXXXU',
 'SparePartNumber': 'P68455-001',
 'Status': {'Health': 'OK', 'State': 'Enabled'}}
{'@odata.id': '/redfish/v1/Chassis/1/Power/#PowerSupplies/1',
 'MemberId': '1',
 'Oem': {'Hpe': {'@odata.context': '/redfish/v1/$metadata#HpeServerPowerSupply.HpeServerPowerSupply',
                 '@odata.type': '#HpeServerPowerSupply.v2_0_0.HpeServerPowerSupply',
                 'BayNumber': 2}},
 'Status': {'Health': 'Warning', 'State': 'Absent'}}

The missing 'Name' member, for PSU 1 here, crashed the application.

This commit manages PSU presence by checking the 'Status' member and its content.
If no 'Status' and/or 'State' is found, a custom error message is reported to the user.
A cache is used to avoid polluting the output with repeated messages.
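
A minimal sketch of the presence check, with hypothetical names; the cache
ensures the warning is printed only once per PSU:

  _missing_status_reported: set = set()

  def psu_is_usable(psu: dict) -> bool:
      # A PSU without Status/State information cannot be trusted.
      state = psu.get("Status", {}).get("State")
      if state is None:
          member = psu.get("MemberId", "?")
          if member not in _missing_status_reported:
              _missing_status_reported.add(member)
              print(f"PSU {member}: no Status/State reported by the BMC, ignoring it")
          return False
      # 'Absent' is what the BMC reports for an empty bay (see PSU 1 above).
      return state != "Absent"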

Tested on DL380, ilorest 4.9.0.0, Model: 'iLO 5' FW: 'iLO 5 v3.03'
Tested on XL225n (Apollo 2000 Gen10), ilorest 4.9.0.0, Model: 'iLO 5' FW: 'iLO 5 v3.08'
Tested on DL340Gen12, ilorest 4.9.0.0, Model: 'iLO 6' FW: 'iLO 6 v1.62'

Reported-by: Elyes ZEKRI <[email protected]>
Signed-off-by: Erwan Velu <[email protected]>
On mono-node HPE chassis, the current code calls the enclosurechassis
endpoint on every monitoring loop.

This call is useless since the enclosurechassis endpoint does not exist
on such systems.

This patch adds an is_multinode_chassis() function whose result is
cached: it tries this endpoint once and returns a constant answer.

Based on this cached status, a real read of the enclosurechassis is
done for multi-node chassis only.
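
A minimal sketch of the cached probe; the helper names and the
_is_multinode attribute (assumed initialized to None in __init__) are
hypothetical:

  def is_multinode_chassis(self) -> bool:
      # Probe the enclosurechassis endpoint only once and reuse the cached
      # answer in every monitoring iteration.
      if self._is_multinode is None:
          self._is_multinode = self._probe_enclosurechassis() is not None
      return self._is_multinode

  def read_enclosure_power(self):
      # A real read is only performed on multi-node chassis.
      if not self.is_multinode_chassis():
          return None
      return self._read_enclosurechassis()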

Tested on DL380, ilorest 4.9.0.0, Model: 'iLO 5' FW: 'iLO 5 v3.03'
Tested on XL225n (Apollo 2000 Gen10), ilorest 4.9.0.0, Model: 'iLO 5' FW: 'iLO 5 v3.08'
Tested on DL340Gen12, ilorest 4.9.0.0, Model: 'iLO 6' FW: 'iLO 6 v1.62'

Signed-off-by: Erwan Velu <[email protected]>
By default, turbostat was given 200ms to start & complete.
When CPUs do not have hyperthreading and all cores are loaded by a
benchmark, like stress-ng, turbostat takes more than 200ms.

As a result, the self.turbostat.parse() call in monitoring.py waits for
turbostat to complete, which makes the monitoring iterations run late.

This was generating traces like:
	hwbench: 1 jobs, 1 benchmarks, ETA 0h 01m 00s
	[full_cpu_load] stressng/cpu/matrixprod(M): 128 stressor on CPU [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127] for 60s
	Monitoring iteration 8 is 0.12ms late
	Monitoring iteration 9 is 0.47ms late
	Monitoring iteration 10 is 0.78ms late
	Monitoring iteration 11 is 1.19ms late
	Monitoring iteration 12 is 1.28ms late
	Monitoring iteration 13 is 1.59ms late
	Monitoring iteration 14 is 1.90ms late
	Monitoring iteration 15 is 2.22ms late
	Monitoring iteration 16 is 2.53ms late
	Monitoring iteration 17 is 2.85ms late
	Monitoring iteration 18 is 3.54ms late
	Monitoring iteration 19 is 3.48ms late
	Monitoring iteration 20 is 3.82ms late
	Monitoring iteration 21 is 4.15ms late
	Monitoring iteration 22 is 4.42ms late
	Monitoring iteration 23 is 4.71ms late
	Monitoring iteration 24 is 5.02ms late
	Monitoring iteration 25 is 5.57ms late
	Monitoring iteration 26 is 5.66ms late
	Monitoring iteration 27 is 5.89ms late
	Monitoring iteration 28 is 5.77ms late

This patch lamely increases the time budget to 500ms.
On the affected machines, it solved the issue.

Maybe at some point we'll have to increase the 'precision' window to get
a better ratio between time budget and run time.
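
An illustrative sketch of the timing relationship; turbostat.parse() is the
only real name here, everything else is hypothetical:

  import time

  TURBOSTAT_TIME_BUDGET_MS = 500  # previously 200, too short on loaded non-HT CPUs

  def collect_turbostat_sample(turbostat):
      time.sleep(TURBOSTAT_TIME_BUDGET_MS / 1000.0)
      # If turbostat has not completed yet, parse() waits for it and the
      # whole monitoring iteration ends up being reported as "late".
      return turbostat.parse()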

Tested on Intel(R) Xeon(R) 6756E systems.

Signed-off-by: Erwan Velu <[email protected]>
If a monitoring metric is empty, do not crash hwgraph but rather display
a warning.

This was seen in issue #35.

The following metric was causing the issue:
  "PDU": {
    "Polling": {
      "name": "Polling",
      "unit": "ms",
      "mean": [],
      "min": [],
      "max": [],
      "stdev": [],
      "samples": []
    }
  },
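
A minimal sketch of the guard, with hypothetical names:

  def render_metric(job_name: str, component: str, metric: dict):
      # Skip metrics with no samples instead of crashing on them.
      if not metric.get("samples"):
          print(f"{job_name}: No samples found in {component}.{metric['name']}, ignoring metric.")
          return None
      return plot_metric(metric)  # hypothetical rendering call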

With this commit, a typical output looks like the following:

  environment: rendering 131 jobs from scaleway/hwbench-out-20241023125613/results.json (DL320)
  avx_1: No samples found in PDU.Polling, ignoring metric.
  avx_10: No samples found in PDU.Polling, ignoring metric.
  avx_11: No samples found in PDU.Polling, ignoring metric.
  avx_12: No samples found in PDU.Polling, ignoring metric.
  avx_13: No samples found in PDU.Polling, ignoring metric.

Reported-by: Elyes ZEKRI <[email protected]>
Signed-off-by: Erwan Velu <[email protected]>
When loading metrics, it appears that some can have the same name, which
produces invalid data, wrong data names and misleading graphs.

This was seen in issue #36.

The BMC reported the following two structures:

          "CPU": {
            "02-CPU 1 PkgTmp": {
              "name": "CPU1",
              "unit": "Celsius",
              "mean": [
	  [...]
            "12-CPU 1": {
              "name": "CPU1",
              "unit": "Celsius",
              "mean": [

As we use the 'name' member, the two different metrics were merged into
a single one.

This patch adds a workaround: the parent's key is kept as the
"full_name" of the metric.

This value, e.g. "12-CPU 1", is stored as the full name and is reachable
via get_full_name().

This commit ensures the environment graphs use this full name to avoid
merging distinct metrics.
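
A minimal sketch of the idea; the class layout is hypothetical,
get_full_name() is the real accessor:

  class Metric:
      def __init__(self, full_name: str, data: dict):
          self.full_name = full_name    # parent key, e.g. "12-CPU 1"
          self.name = data["name"]      # "CPU1" for both entries above
          self.unit = data["unit"]

      def get_full_name(self) -> str:
          # Used by the environment graphs so that "02-CPU 1 PkgTmp" and
          # "12-CPU 1" stay two distinct series despite sharing name "CPU1".
          return self.full_name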

Signed-off-by: Erwan Velu <[email protected]>
As per issue #38, the dmidecode call that saves the SMBIOS table in a
binary file for later reading is broken: the file is not present in
the hwbench-out directory.

It appears that when dmidecode --dump-file is called, it is done with a
relative path.

The stderr log shows the following error:
	hwbench-out-20241025123459/dmidecode.bin: open: No such file or directory

It appears the relative path passed to dmidecode does not exist because
dmidecode is not started from the same working directory as the one we
are currently running the tool from.

This commit uses an absolute path instead of a relative one to ensure
that every command/tool can dump content into the proper directory.

With this patch, the dmidecode-bin file is now present in the
hwbench-out archive.
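
A minimal sketch of the fix, assuming the output directory is passed as
out_dir (hypothetical name); this sketch uses dmidecode's --dump-bin
option to produce the binary dump:

  import pathlib
  import subprocess

  def dump_smbios(out_dir: str) -> pathlib.Path:
      # Resolve to an absolute path so dmidecode can open the file even if
      # it is not started from the directory hwbench is running in.
      dump_file = pathlib.Path(out_dir).resolve() / "dmidecode.bin"
      subprocess.run(["dmidecode", "--dump-bin", str(dump_file)], check=True)
      return dump_file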

Signed-off-by: Erwan Velu <[email protected]>
As reported in issue #40, when data points are missing,
hwgraph was crashing because graphs must have the same
number of data points.

In such a case, let's default to a fatal message to the user instead of
crashing.

This commit also adds a new hwgraph option, --ignore-missing-datapoint,
that can be used to ignore this case and still produce a graph:

  --ignore-missing-datapoint=zero replaces the missing data points with a 0 value.
  --ignore-missing-datapoint=last replaces the missing data points with the last known value of the series.

It's up to the user to decide which one is best, while understanding
that the resulting graphs will be partially wrong.

By default, hwgraph generates a fatal error.
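
A minimal sketch of the padding logic; the function is hypothetical, the
option values are the ones added by this commit:

  import sys

  def pad_series(series, expected_len, mode=None):
      missing = expected_len - len(series)
      if missing <= 0:
          return series
      if mode == "zero":
          return series + [0.0] * missing
      if mode == "last":
          return series + [series[-1] if series else 0.0] * missing
      # Default behavior: fatal error instead of drawing a wrong graph.
      sys.exit("FATAL: series have different numbers of data points, "
               "use --ignore-missing-datapoint=zero|last to graph anyway")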

Signed-off-by: Erwan Velu <[email protected]>
As per issue #39, it could be super useful to have a new type of graph
to represent how the CPU clock performed during a benchmark.

This new graph gives a quick overview of how the CPU frequency behaves
during a scaling benchmark.

The rendering is not as precise as the environment graphs, but it gives a
brief overview with the following trade-offs for the y-err bars (see the
sketch after this list):
- the min of the y-err bar is the min of the min values
- the mean of the y-err bar is the mean of the mean values
- the max of the y-err bar is the max of the max values
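
A minimal sketch of that aggregation (the data layout is hypothetical):

  import statistics

  def clock_yerr_bar(mins, means, maxs):
      # One point per benchmark: aggregate the per-sample statistics.
      low = min(mins)                  # min of the min values
      mid = statistics.mean(means)     # mean of the mean values
      high = max(maxs)                 # max of the max values
      return low, mid, high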

Signed-off-by: Erwan Velu <[email protected]>
@ErwanAliasr1 ErwanAliasr1 removed the request for review from Arno500 October 28, 2024 15:11
@ErwanAliasr1 ErwanAliasr1 merged commit a930da3 into main Oct 28, 2024
4 checks passed
@ErwanAliasr1 ErwanAliasr1 deleted the various branch October 28, 2024 15:12