ConcurrentBag optimize internal lists #60574

timcassell · 2021-10-18T19:57:24Z

I was looking at the source code for ConcurrentBag<T> and noticed that its implementation potentially creates many more lists than are theoretically necessary, because of the ThreadLocal<ThreadLocalList> m_locals; field. It keeps one list per software thread so that each thread has fast access to its own list.

Accessing the bag from background/threadpool threads could create thousands of thread-local lists (one for each thread), causing O(n+m) memory consumption and O(n) runtime for take/peek. But we really only need a single list for each hardware thread, since two software threads cannot run simultaneously on the same hardware thread. Even if we have 1000 software threads running, we may only have 16 hardware threads executing them.

So I was thinking, instead of having these fields:

ThreadLocal<ThreadLocalList> m_locals;
// This head and tail pointers points to the first and last local lists, to allow enumeration on the thread locals objects
volatile ThreadLocalList m_headList, m_tailList;

it can use a readonly array with its size set to the number of hardware threads:

// One list per hardware thread, indexed by processor id.
readonly ProcessorLocalList[] m_locals = new ProcessorLocalList[Environment.ProcessorCount];

Then, for a take/peek/add operation, we pick the list for the current processor id and guard it with a spinlock, in case the operating system moves the thread to another CPU after we take the processor id snapshot (this has a very low probability, so more than 99% of the time the lock should not spin at all).

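// GetCurrentProcessorId() may not fall in [0, ProcessorCount), so wrap it into the array bounds.
ProcessorLocalList currentList = m_locals[(uint)Thread.GetCurrentProcessorId() % (uint)m_locals.Length];
bool lockTaken = false;
try
{
    currentList.spinlock.Enter(ref lockTaken);
    // do operation...
}
finally { if (lockTaken) currentList.spinlock.Exit(); }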

And the steal operation can use the same algorithm it is currently using, except that it will only have to search through the fixed-size array instead of the n-sized linked list.

This reduces memory to O(n) and take/peek to O(1) (the number of hardware threads does not change while the program is running, so it is a constant).
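
To make the proposal above more concrete, here is a minimal sketch of a bag striped over a fixed array of slots. It is an illustration under assumptions, not the ConcurrentBag<T> implementation: `StripedBag<T>` and its members are hypothetical names, each slot is a plain `Stack<T>` guarded by a `lock` rather than a spinlock, and the processor id is wrapped with a modulo because it is only a hint and may not fall inside the array bounds.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.CodeAnalysis;
using System.Threading;

// Hypothetical illustration of the proposal, not the real ConcurrentBag<T>.
public sealed class StripedBag<T>
{
    // One slot per logical processor visible to the process.
    private readonly Stack<T>[] _slots;

    public StripedBag()
    {
        _slots = new Stack<T>[Environment.ProcessorCount];
        for (int i = 0; i < _slots.Length; i++)
        {
            _slots[i] = new Stack<T>();
        }
    }

    // The processor id is only a hint, so wrap it into the array bounds.
    private int CurrentIndex() =>
        (int)((uint)Thread.GetCurrentProcessorId() % (uint)_slots.Length);

    public void Add(T item)
    {
        Stack<T> slot = _slots[CurrentIndex()];
        lock (slot)
        {
            slot.Push(item);
        }
    }

    public bool TryTake([MaybeNullWhen(false)] out T item)
    {
        int start = CurrentIndex();
        // Try the current processor's slot first, then "steal" from the others.
        for (int offset = 0; offset < _slots.Length; offset++)
        {
            Stack<T> slot = _slots[(start + offset) % _slots.Length];
            lock (slot)
            {
                if (slot.Count > 0)
                {
                    item = slot.Pop();
                    return true;
                }
            }
        }

        item = default;
        return false;
    }
}
```

In this sketch the steal scan is bounded by the array length rather than by the number of threads that have ever touched the bag, which is the point of the proposal. Every operation takes a lock, though, so it says nothing about how this would compare with a lock-free local path.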

Thoughts?

timcassell · 2021-10-19T22:11:25Z

Author

Another idea, besides using Thread.GetCurrentProcessorId() as the list index and searching all lists for steals, is to store the add and remove indices as fields and, for each operation, increment them in an interlocked manner while wrapping at the array length. That way the load is spread evenly across all lists no matter which thread accesses the bag most. Always add/peek at the head of the list, and always take from the tail of the list (or vice versa). This would effectively make it behave like a queue, but without the ordering guarantees due to thread races (and peek returns a different value than take).

volatile int m_addIndex;

private int InterlockedGetAddIndex()
{
    // Atomically advance m_addIndex, wrapping around at the array length.
    int initialValue, newValue;
    do
    {
        initialValue = m_addIndex;
        newValue = (initialValue + 1) % Environment.ProcessorCount;
    } while (Interlocked.CompareExchange(ref m_addIndex, newValue, initialValue) != initialValue);
    return newValue;
}

The downside to this method is the higher likelihood of add/take contention, but that's mitigated by each operating on opposite ends of the list (like the current steal). Effectively, we always steal, instead of conditionally steal. But we only need to steal from 1 list instead of scanning all lists, making the runtime true O(1) regardless of how many hardware threads there are. It also reduces contention on a single list due to steals (the current algorithm directs multiple threads to attempt to steal from the same list).
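
As a point of comparison for the compare-exchange loop above, the same round-robin slot selection can be sketched with a single `Interlocked.Increment` and an unsigned modulo. This is a swapped-in alternative technique, not what the proposal or ConcurrentBag<T> uses, and `RoundRobinIndex` is a hypothetical name. Neither version stops two threads from working on the same slot at the same time, so each slot would still need its own synchronization.

```csharp
using System;
using System.Threading;

// Hypothetical helper: hands out slot indices in round-robin order across threads.
public sealed class RoundRobinIndex
{
    private readonly uint _slotCount;
    private int _counter = -1; // so the first call to Next() returns 0

    public RoundRobinIndex(int slotCount)
    {
        if (slotCount <= 0) throw new ArgumentOutOfRangeException(nameof(slotCount));
        _slotCount = (uint)slotCount;
    }

    public int Next()
    {
        // Interlocked.Increment wraps from int.MaxValue to int.MinValue without throwing;
        // the cast to uint keeps the modulo result non-negative.
        int ticket = Interlocked.Increment(ref _counter);
        return (int)((uint)ticket % _slotCount);
    }
}
```

One design caveat: when the 32-bit counter wraps, the sequence has a one-time jump unless the slot count is a power of two. That is harmless for load spreading but worth knowing if strict rotation matters.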

Also, this would make local pooling of the list nodes, for amortized zero allocations, make sense. (I expect to use a ConcurrentBag as an object pool to reduce GC pressure, constantly adding and taking items throughout its lifetime, so why does the bag itself allocate on every add?)

3 replies

danmoseley Oct 19, 2021
Collaborator

cc @stephentoub who probably has context on the implementation choices made.

timcassell Nov 1, 2021
Author

I just realized I was looking at .NET Framework 4.8's implementation rather than the current one. The current implementation uses WorkStealingQueues with internal arrays instead of linked lists, but it still uses the ThreadLocal field with links to each other thread's locals, so my thoughts are still valid.

stephentoub Nov 1, 2021
Collaborator

The primary benefit of the work-stealing queues is not just reduced contention but actually the ability to avoid a lock at all for local access when there are at least a couple of items in the queue. I don't see how that's achieved with your proposal, nor, in your alternative proposal, how you're handling thread-safety in the face of thread growth, or even multiple threads adding and taking at the same time (which can result in them all trying to use the same slot at the same time even with the interlocked there). At the end of the day, this all comes down to measuring on the target scenarios and making a call as to which data structures make the most sense given the scenarios of the day. At the time ConcurrentBag was written (and then rewritten), the work-stealing queues per thread worked out the best, compared to all the other approaches measured, which included variations like a locked stack per core (similar to what's used in ArrayPool after a thread-local cache), a ConcurrentQueue per core, etc. You're of course welcome to implement your proposed solution along with various benchmarks that you believe represent real-world usage as well as expected best/worst case microbenchmarks.

SingleAccretion · 2021-11-01T13:09:10Z

Collaborator

readonly ProcessorLocalList[] m_locals = new ProcessorLocalList[Environment.ProcessorCount];

This is not a reliable technique for determining "the number of hardware threads". You can tweak this value with an environment variable these days. In general, it seems very hard/impossible to query the "true" number of processors reliably, due to virtualization and emulation layers that exist.

2 replies

timcassell Nov 1, 2021
Author

I'm not sure what tweaking an env var for that would do, but the other things you mention actually give us a number we want. We only need the number of hardware threads available to the process, not necessarily the "true" number of hardware threads in the physical computer. Though I'm not entirely sure if the Environment.ProcessorCount value can change while the process is running (like changing the processor affinity in the Task Manager), which would throw a wrench in this... 🤔

svick Nov 1, 2021

Though I'm not entirely sure if the Environment.ProcessorCount value can change while the process is running (like changing the processor affinity in the Task Manager), which would throw a wrench in this... 🤔

At least with the current implementation, it can't:

runtime/src/libraries/System.Private.CoreLib/src/System/Environment.cs

Line 14 in 2a87ffe

public static int ProcessorCount { get; } = GetProcessorCount();
