monitoring: Increasing turbostat time budget

By default, turbostat was given 200ms to start & complete.

When CPUs do not have hyperthreading and all cores are loaded by a benchmark, like stress-ng, turbostat takes more than 200ms.

As a result, the self.turbostat.parse() call in monitoring.py waits for turbostat to complete, which makes the monitoring iteration overrun its time window.

This was generating traces like:

  hwbench: 1 jobs, 1 benchmarks, ETA 0h 01m 00s
  [full_cpu_load] stressng/cpu/matrixprod(M): 128 stressor on CPU [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127] for 60s
  Monitoring iteration 8 is 0.12ms late
  Monitoring iteration 9 is 0.47ms late
  Monitoring iteration 10 is 0.78ms late
  Monitoring iteration 11 is 1.19ms late
  Monitoring iteration 12 is 1.28ms late
  Monitoring iteration 13 is 1.59ms late
  Monitoring iteration 14 is 1.90ms late
  Monitoring iteration 15 is 2.22ms late
  Monitoring iteration 16 is 2.53ms late
  Monitoring iteration 17 is 2.85ms late
  Monitoring iteration 18 is 3.54ms late
  Monitoring iteration 19 is 3.48ms late
  Monitoring iteration 20 is 3.82ms late
  Monitoring iteration 21 is 4.15ms late
  Monitoring iteration 22 is 4.42ms late
  Monitoring iteration 23 is 4.71ms late
  Monitoring iteration 24 is 5.02ms late
  Monitoring iteration 25 is 5.57ms late
  Monitoring iteration 26 is 5.66ms late
  Monitoring iteration 27 is 5.89ms late
  Monitoring iteration 28 is 5.77ms late

This patch lamely increases the time budget to 500ms. On the affected machines, it solved the issue.

Maybe at some point we'll have to increase the 'precision' window to get a better ratio of time budget vs run time.

Tested on Intel(R) Xeon(R) 6756E systems.

Signed-off-by: Erwan Velu <[email protected]>
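
To make the trade-off between the time budget and the 'precision' window concrete, here is a minimal sketch of the arithmetic, not the hwbench implementation. The helper names turbostat_interval and budget_ratio are hypothetical, and the 2-second precision used in the usage note is an assumed value chosen only for illustration.

```python
def turbostat_interval(precision: float, budget: float = 0.5) -> float:
    """Return how long turbostat may sample within one monitoring window.

    precision: length of one monitoring window, in seconds.
    budget: time reserved for turbostat to start and complete, in seconds
            (0.2 before this patch, 0.5 after it).
    """
    interval = precision - budget
    if interval <= 0:
        # The reserved budget would consume the whole window.
        raise ValueError(
            f"budget ({budget}s) must be smaller than precision ({precision}s)"
        )
    return interval


def budget_ratio(precision: float, budget: float = 0.5) -> float:
    """Return the share of each window that is reserved rather than sampled."""
    if precision <= 0:
        raise ValueError("precision must be positive")
    return budget / precision
```

With an assumed 2-second window, moving the budget from 0.2s to 0.5s shortens the sampling interval from about 1.8s to 1.5s and raises the reserved share from 10% to 25% of the window, which is why widening 'precision' may become worthwhile later.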
criteo · Oct 23, 2024 · 4ea10b0
1 parent b13b607
Showing 1 changed file with 2 additions and 2 deletions.
diff --git a/hwbench/bench/monitoring.py b/hwbench/bench/monitoring.py
@@ -192,8 +192,8 @@ def next_iter():
             start_time = self.get_monotonic_clock()
             if self.turbostat:
                 # Turbostat will run for the whole duration of this loop
-                # We just retract a 2/10th of second to ensure it will not overdue
-                self.turbostat.run(interval=(precision - 0.2))
+                # We just retract a 5/10th of second to ensure it will not overdue
+                self.turbostat.run(interval=(precision - 0.5))
                 # Let's monitor the time spent at monitoring the CPU
                 self.get_metric(Metrics.MONITOR)["CPU"]["Polling"].add(
                     (self.get_monotonic_clock() - start_time) * 1e-6