monitoring: Increasing turbostat time budget
By default, turbostat was given 200ms to start and complete.
When CPUs do not have hyperthreading and all cores are loaded by a
benchmark such as stress-ng, turbostat takes more than 200ms.

As a result, the self.turbostat.parse() call in monitoring.py waits
for turbostat to complete, which makes the monitoring iteration overrun
its deadline.

This generated traces like:
	hwbench: 1 jobs, 1 benchmarks, ETA 0h 01m 00s
	[full_cpu_load] stressng/cpu/matrixprod(M): 128 stressor on CPU [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127] for 60s
	Monitoring iteration 8 is 0.12ms late
	Monitoring iteration 9 is 0.47ms late
	Monitoring iteration 10 is 0.78ms late
	Monitoring iteration 11 is 1.19ms late
	Monitoring iteration 12 is 1.28ms late
	Monitoring iteration 13 is 1.59ms late
	Monitoring iteration 14 is 1.90ms late
	Monitoring iteration 15 is 2.22ms late
	Monitoring iteration 16 is 2.53ms late
	Monitoring iteration 17 is 2.85ms late
	Monitoring iteration 18 is 3.54ms late
	Monitoring iteration 19 is 3.48ms late
	Monitoring iteration 20 is 3.82ms late
	Monitoring iteration 21 is 4.15ms late
	Monitoring iteration 22 is 4.42ms late
	Monitoring iteration 23 is 4.71ms late
	Monitoring iteration 24 is 5.02ms late
	Monitoring iteration 25 is 5.57ms late
	Monitoring iteration 26 is 5.66ms late
	Monitoring iteration 27 is 5.89ms late
	Monitoring iteration 28 is 5.77ms late
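
The growing lateness in the trace above is what a fixed-deadline loop produces when each iteration takes slightly longer than its period. The following is a minimal sketch of that effect, not hwbench's actual loop: the function name, the scheduling model (iteration i is due at i * period) and the numbers are assumptions made for illustration.

```python
def lateness_per_iteration(period, work_time, iterations):
    """Return how late each iteration finishes relative to its deadline.

    Hypothetical model: iteration i (1-based) is due at i * period, and an
    iteration starts at the later of its scheduled start and the moment
    the previous iteration finished.
    """
    lateness = []
    finish = 0.0
    for i in range(1, iterations + 1):
        scheduled_start = (i - 1) * period
        start = max(finish, scheduled_start)
        finish = start + work_time
        # An iteration that finishes before its deadline is simply on time.
        lateness.append(max(0.0, finish - i * period))
    return lateness


if __name__ == "__main__":
    # Assumed values: a 2s period and work that overruns it by 0.3ms.
    for i, late in enumerate(lateness_per_iteration(2.0, 2.0003, 5), start=1):
        print(f"Monitoring iteration {i} is {late * 1e3:.2f}ms late")
```

In this model a small per-iteration overrun is never recovered, so the reported lateness keeps climbing, while any work time below the period leaves every iteration on time.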

This patch simply increases the time budget to 500ms.
On the affected machines, this solved the issue.

Maybe at some point we'll have to increase the 'precision' window to get
a better ratio of time budget to run time.
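
The trade-off between the time budget and the 'precision' window can be sketched as follows. This is only an illustration of the arithmetic behind interval=(precision - budget): the helper names and the example precision values are assumptions, not taken from hwbench.

```python
def turbostat_interval(precision, budget):
    """Seconds turbostat may sample for, given the window and the slack.

    Mirrors the expression interval=(precision - budget) from the patch;
    the function itself is a hypothetical helper.
    """
    if budget >= precision:
        raise ValueError("time budget must be smaller than the precision window")
    return precision - budget


def sampling_ratio(precision, budget):
    """Fraction of the precision window actually covered by turbostat."""
    return turbostat_interval(precision, budget) / precision


if __name__ == "__main__":
    # Assumed precision windows, in seconds, compared for both budgets.
    for precision in (2.0, 5.0):
        for budget in (0.2, 0.5):
            ratio = sampling_ratio(precision, budget)
            print(f"precision={precision}s budget={budget}s ratio={ratio:.0%}")
```

With an assumed 2s window, moving the budget from 200ms to 500ms reduces the sampled share from about 90% to 75%; a 5s window with the 500ms budget would bring it back to about 90%.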

Tested on Intel(R) Xeon(R) 6756E systems.

Signed-off-by: Erwan Velu <[email protected]>
ErwanAliasr1 committed Oct 25, 2024
1 parent d79bb7f commit 47bc82b
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions hwbench/bench/monitoring.py
@@ -192,8 +192,8 @@ def next_iter():
 start_time = self.get_monotonic_clock()
 if self.turbostat:
 # Turbostat will run for the whole duration of this loop
-# We just retract a 2/10th of second to ensure it will not overdue
-self.turbostat.run(interval=(precision - 0.2))
+# We just retract a 5/10th of second to ensure it will not overdue
+self.turbostat.run(interval=(precision - 0.5))
 # Let's monitor the time spent at monitoring the CPU
 self.get_metric(Metrics.MONITOR)["CPU"]["Polling"].add(
 (self.get_monotonic_clock() - start_time) * 1e-6