ConcurrentBag optimize internal lists #60574
Replies: 2 comments 5 replies
-
Another idea besides using volatile int m_addIndex;
private int InterlockedGetAddIndex()
{
// Interlocked make m_addIndex loop around the array length.
int initialValue, newValue;
do
{
initialValue = m_addIndex;
newValue = (initialValue + 1) % Environment.ProcessorCount;
} while (Interlocked.CompareExchange(ref m_addIndex, newValue, initialValue) != initialValue);
return newValue;
} The downside to this method is the higher likelihood of add/take contention, but that's mitigated by each operating on opposite ends of the list (like the current steal). Effectively, we always steal, instead of conditionally steal. But we only need to steal from 1 list instead of scanning all lists, making the runtime true Also, this would make local pooling of the list nodes for amortized zero-allocations make sense. (I expect to use a ConcurrentBag as an object pool to reduce GC pressure, constantly adding and taking items throughout its lifetime, why does it allocate on every add itself?) |
Beta Was this translation helpful? Give feedback.
-
This is not a reliable technique for determining "the number of hardware threads". You can tweak this value with an environment variable these days. In general, it seems very hard/impossible to query the "true" number of processors reliably, due to virtualization and emulation layers that exist. |
Beta Was this translation helpful? Give feedback.
-
I was looking at the source code for
ConcurrentBag<T>
and noticed its implementation potentially creates many more lists than are theoretically necessary, due to theThreadLocal<ThreadLocalList> m_locals;
field. This is done for a single list for each software thread for fast access on a single thread.Accessing the bag from background/threadpool threads could make thousands of
ThreadLocal
objects (1 for each thread), causingO(n+m)
memory consumption andO(n)
runtime for take/peek. But we really only need a single list for each hardware thread, as we know that 2 software threads cannot run simultaneously on the same hardware thread. Even if we have 1000 software threads running, we may only have 16 hardware threads executing those software threads.So I was thinking, instead of having these fields:
it can use a readonly array with its size set to the number of hardware threads:
Then, when doing a take/peek/add operation, we use the list according to the current processor id, and use a spinlock technique in case of the operating system moving the thread to another cpu after we get the processor id snapshot (very low probability, so it will 99%+ not spin at all).
And the steal operation can use the same algorithm it is currently using, except that it will only have to search through the fixed-size array instead of the n-sized linked list.
This reduces memory to
O(n)
and take/peek toO(1)
(because the hardware threads do not change, it is constant while the program is running).Thoughts?
Beta Was this translation helpful? Give feedback.
All reactions