GNU General Public License v3.0 licensed. Source available on github.com/zifeo/EPFL.
Spring 2015: Concurrency
[TOC]
- Moore's law : 2x transistors every 24 months
- hardware got faster, and software grew to match
- but in 2005 : chips got bigger but not faster (15% not 50% anymore per year)
- reason : keep power at 100W
- power : proportional to operating voltage squared times clock frequency ($P \propto V^2 f$)
- mobile cpu : few W
- laptop : 10s of W (can burn you)
- server : 100 W (cooling)
- datacenter : 20 MW
- voltages are not going down enough (which would allow higher clock frequencies)
- new Moore's law : more cores
- simpler cores (Prius instead of Audi)
- each core : fewer joules/op
- need for parallelism/concurrency : software can no longer get faster just by waiting a few months for new hardware
- scaling process : not intuitive
- require great care : synchronization
- historical view
- I/O : message passing, hadoop, sql
- memory : shared memory, java threads
- processor : dataflow, GPU, CUDA
- model
- multiple threads (or so called processes)
- objects live in shared memory
- unpredictable asynchronous delays
- cache misses (short)
- page faults (long)
- scheduling quantum used up (really long)
- multicore architecture
- processors : on the same chip
- SMP : symmetric multiprocessor (less common these days, but assumed in this course)
- NUMA : non-uniform memory access (organized in a grid)
- CC-NUMA : cache-coherent
- bottleneck : bus
- 1 cycle : fetch and execute one instruction
- threads
- execution of a sequential program
- does I/O
- spinning : busy waiting
- interconnect
- broadcast medium
- connects : processor to processor and processor to memory
- network : tiny LAN
- finite resource : avoid using too much bandwidth (above ~50% utilization, waiting time grows exponentially)
- memory
- 100s cycles : access
- far apart
- caches
- keep memory close
- avoid use of interconnect
- replacement policy
- types
- fully associative : any line can be anywhere in the cache (flexible replace, hard to find)
- direct mapped cache : every address has exactly 1 slot (fixed replace, easy to find)
- K-way set associative : each set holds k lines (flexible replace, easy to find)
- memory hierarchies
- L1 cache : small & fast (1/2 cycles)
- L2 cache : larger & slower (10s of cycles ~ 128 byte block)
- lower level can be more associative (> 16 ways) : share sets
- latency : average memory access time (AMAT)
- $\text{AMAT} = \text{cache hit time} \times (1 - \text{miss rate}) + \text{memory latency} \times \text{miss rate}$
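- worked example (numbers assumed for illustration) : hit time 2 cycles, miss rate 5%, memory latency 200 cycles give $\text{AMAT} = 2 \times 0.95 + 200 \times 0.05 = 11.9$ cycles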
- misses
- compulsory
- conflict
- capacity misses
- coherence : read/write from multiple processors (need protocol)
- true sharing : read/write same variable, communicate unique value in block from one thread to another
- false sharing : read/write distinct variables that reside in the same cache block, reduces performance (causes extra misses) and increases with more spatial locality
- when writing software, try to divide up data structures to avoid false sharing
- write policy
- write-through : memory up-to-date, cache and memory always agree, too much bus traffic (especially for unshared data such as a loop counter)
- write-back : accumulate changes, write when the block is evicted (what happens if another processor wants it : need invalidate)
- need for write buffer : absorbing / batching
- parallel thinking
- maximize parallelism
- reduce overhead
- galactic motion : simulation of n bodies can be approximated
- ocean simulation : grid of element with velocity, compute impact and iterate
- deep learning
- concurrency : two or more threads make forward progress together
- parallelism : two or more threads execute at the same time
- all parallel threads are concurrent but not the converse
- task : piece of work
- thread : performs tasks
- processors : execute threads
- forms of parallelism
- throughput parallelism : perform many identical sequential tasks
- functional or task parallelism : perform tasks that are functionally different
- pipeline parallelism : perform tasks that are different in a particular order
- data parallelism : perform the same task on different data
- division of work
- balance workload
- reduce communication : overhead
- reduce extra work : thread creations, scheduling
- threads require resources
- memory for stacks
- a copy of the register file
- program state : program counter, stack pointer
- setup
- short-lived : bad ratio of useful work to overhead
- performance
- Amdahl's law : if you speed up only a small fraction of the execution time of a program or a computation, the speedup you achieve on the whole application is limited
- $\text{speedup} = \dfrac{1}{(1 - f) + f / s}$ where $f$ is the fraction enhanced and $s$ is the speedup of that fraction (e.g. the number of cores); see the worked example below
- parallel execution is not ideal : 10 processors rarely get a speedup of 10
- load imbalance
- thread start/join overhead
- communication overhead
- even if the enhanced fraction is almost everything : the achieved speedup is << # processors
- best efficiency : concurrency ~ parallelism (1 thread per processor)
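- worked example (numbers assumed for illustration) : parallelizing $f = 90\%$ of a program on $s = 10$ cores gives $\text{speedup} = 1 / (0.1 + 0.9/10) = 1/0.19 \approx 5.3$, far from 10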
- prime finding :
- statically splitting the range rejected : unbalanced workload
- need a shared counter : ticketing
- but counter not atomic : need for mutual exclusion (synchronization)
- share items
- cannot just see and check : not atomic
- need explicit communication for coordination
- communication must be : persistent (like writing) and not transient (like speaking)
- protocols
- cans (interrupt) : fallen = ready, upright = not ready
- finite resources
- what if long reset
- flag : raise flag, wait until the other's flag is down, unleash the pet, lower flag when it returns
- danger at start, but once started it is good
- flag 2 : same, but always let Alice go first
- proof : derive a contradiction by reasoning backwards
- assume Alice was the last to look
- therefore her flag is down
- if her flag is down Bob can go
- but Bob only goes in with his flag up
- thus Alice will wait until he lowers his flag
- properties
- correctness
- safety : nothing bad happens ever
- never both inside the critical section simultaneously
- liveness : something good happens eventually
- no deadlock
- if one wants in, it gets in
- if both want in, one gets in
- quality
- fairness : both are equally getting in
- performance (waiting) : everyone else is stuck if the thread inside is stuck
- common problems
- producer-consumer
- cannot meet
- cyclic dependence : the producer must not add to a full place, the consumer must not take from an empty one
- implies some waiting (idle time) : problematic if one party is delayed, but common
- safe : mutual exclusion
- starvation : if producer is willing, consumer will always buy (liveness)
- reader-writer
- multiple read
- if one write, no one reads : no mixed message
- mutual exclusion : requires waiting but reduce performance
- solvable without mutual exclusion : add validity flag (semaphore)
- Alice only writes to invalid billboards (validate them after)
- Bob only reads from valid billboards (invalidate them after)
- Alice waits only if all billboards are written
- Bob only waits if there is no message
- read/write in parallel (impossible with locks)
- mutual exclusion
- formal definition
- time
- event $a_0$ : instantaneous, no simultaneous events
- thread $a_0, a_1, \ldots$ : sequence of events, state machine (events are transitions)
- states : thread state (program counter, local variables), system state (shared variables, union of thread states)
- critical section $CS_i^k$ (thread $i$, $k$th execution) : piece of code that accesses a shared resource (data structure or device) that must not be concurrently accessed by more than one thread of execution
- interleavings : events of two or more threads (not necessarily independent)
- interval $A = (a_0, a_1)$ : time between events (may overlap or be disjoint)
- precedence $A_0 \to B_0$ : end event of $A_0$ before start event of $B_0$
- irreflexive : $A \to A$ never true
- antisymmetric : if $A \to B$ then not true that $B \to A$
- transitive : if $A \to B \wedge B \to C$ then $A \to C$
- $k$th occurrence : $a_0^k$
- locks : acquire lock and release unlock methods
- properties
- deadlock-free : system as a whole makes progress even if individuals starve
- starvation-free : individual threads make progress
- variables should be volatile : prevents optimizations (i.e. not reorderable, not cached in a register)
- solutions for 2
- lockOne : raise flag, idle until other flag is down
- source : write A (flag A = true) -> read A (flag B = false) -> CS A
- source : write B (flag B = true) -> read B (flag A = false) -> CS B
- assumption : read A (flag B = false) -> write B (flag B = true)
- assumption : read B (flag A = false) -> write A (flag A = true)
- give a cycle !
- fails deadlock-freedom (can deadlock if both try to enter concurrently)
- lockTwo : set itself as victim, idle until the victim is itself
- satisfy mutual exclusion
- not deadlock-free (a thread running alone is stuck forever)
- Peterson's algorithm : combine both
flag[i] = true; victim = i; while (flag[j] && victim == i) {};
- source : write B (flag B = true) -> write B (victim = B)
- source : write A (victim = A) -> read A (flag B) -> read A (victim)
- assumption : write B (victim = B) -> write A (victim = A)
- observation : write B (flag B = true) -> write B (victim = B) -> write A (victim = A) -> read A (flag B) -> read A (victim)
- last two steps : A reads flag B = true and victim = A, so A could not have entered the CS
- deadlock free
- starvation free : a thread lets the other go first, but never twice in a row
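- a minimal Java sketch of Peterson's lock (not the lecture code; thread ids 0/1 passed explicitly, AtomicIntegerArray used because volatile does not cover array elements):

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch of Peterson's two-thread lock; lock(i)/unlock(i) take the caller's id (0 or 1).
class PetersonLock {
    private final AtomicIntegerArray flag = new AtomicIntegerArray(2); // 1 = interested
    private volatile int victim;                                       // who defers

    public void lock(int i) {
        int j = 1 - i;
        flag.set(i, 1);                              // I am interested
        victim = i;                                  // you go first
        while (flag.get(j) == 1 && victim == i) {}   // spin while the other is in and I am the victim
    }

    public void unlock(int i) {
        flag.set(i, 0);                              // no longer interested
    }
}
```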
- solutions for n
- filter algorithm : n-1 waiting rooms (levels)
- at each level : at least one enters the level and at least one is blocked if many try
- for (int L = 1; L < n; L++) { level[i] = L; victim[L] = i; while ((∃ k != i : level[k] >= L) && victim[L] == i); }
- claim : start at level 0, at most n-L threads enter level L
- no starvation
- fairness : threads can be overtaken by others
- bounded waiting : stronger fairness guarantees
- divide lock into 2 parts : doorway interval (finite) and waiting interval (unbounded)
- Bakery algorithm : wait until lower numbers are served
- lexicographic order with tickets and thread id
- First come first served (FIFO)
- flag[i] = true; label[i] = max(label[0], ..., label[n-1]) + 1; while (∃ k : flag[k] && (label[i], i) > (label[k], k));
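- a Java sketch of the same idea (illustrative, not the lecture code; atomic arrays stand in for volatile array elements):

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch of Lamport's Bakery lock for n threads; lock(i)/unlock(i) take the caller's id.
class BakeryLock {
    private final AtomicIntegerArray flag;   // 1 = wants to enter
    private final AtomicIntegerArray label;  // ticket numbers
    private final int n;

    BakeryLock(int n) {
        this.n = n;
        flag = new AtomicIntegerArray(n);
        label = new AtomicIntegerArray(n);
    }

    public void lock(int i) {
        flag.set(i, 1);
        int max = 0;                                    // doorway: take the next ticket
        for (int k = 0; k < n; k++) max = Math.max(max, label.get(k));
        label.set(i, max + 1);
        for (int k = 0; k < n; k++) {                   // wait for all smaller (label, id) pairs
            while (k != i && flag.get(k) == 1
                    && (label.get(k) < label.get(i)
                        || (label.get(k) == label.get(i) && k < i))) {}
        }
    }

    public void unlock(int i) {
        flag.set(i, 0);
    }
}
```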
- synchronization hardware
- disable interrupts : too inefficient on multiprocessors
- atomics : non-interruptible
- test memory words and set value
boolean TestAndSet(boolean *target) { boolean rv = *target; *target = TRUE; return rv; }
while (TestAndSet(&lock)); /* critical section */ lock = FALSE;
- swap memory words
void Swap (boolean *a, boolean *b){ boolean temp = *a; *a = *b; *b = temp; }
key = TRUE; while (key) Swap(&lock, &key); /* critical section */ lock = FALSE;
- not perfect : threads wait by spinning, huge traffic on the bus, not fair
- above 50% bus utilization : exponential response time
- add cache test before
- Test&Test&Set :
do { while (lock); /* spin until lock == 0 */ } while (TestAndSet(&lock)); /* critical section */ lock = FALSE;
- Test&Swap :
do { key = TRUE; while (lock); Swap(&lock, &key); } while (key); /* critical section */ lock = FALSE;
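- the same idea as a Java sketch with AtomicBoolean (illustrative names):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Test&Test&Set spin lock: spin on a plain read (served from the local cache),
// and only attempt the bus-heavy atomic getAndSet when the lock looks free.
class TTASLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    public void lock() {
        while (true) {
            while (locked.get()) {}               // test: local spinning
            if (!locked.getAndSet(true)) return;  // test&set: try to grab it
        }
    }

    public void unlock() {
        locked.set(false);
    }
}
```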
- transactions
- atomic instruction sequence
- all memory changes visible before/after but not during
- sophisticated primitives
- semaphores S : atomic integer variable
- wait :
while (S <= 0); S--;
- signal :
S++
- mutex : binary semaphore (initialized to 1) provide mutual exclusion
- each wait : there has to be a corresponding signal
- busy waiting : repeatedly checking something
- without busy waiting : implement a waiting queue
- wait :
S->value--; if (S->value < 0) { add this thread to S->list; block(); }
- signal :
S->value++; if (S->value <= 0) { remove a thread P from S->list; wakeup(P); }
- deadlock & starvation : e.g. if two mutexes are acquired in opposite order at the same time
- priority inversion : problem when a lower-priority thread holds a lock needed by a higher-priority one (solvable by the priority-inheritance protocol)
- monitors (synchronized method in Java)
- abstract data type, variables visible only inside
- only one thread may be active within the monitor at a time
- condition variables : queues of threads waiting on a condition inside monitors
- on a condition variable
- wait suspends the thread until signal
- signal resumes a thread (if any) that invoked wait
- options
- signal & wait : P waits until Q leaves or waits for another condition
- signal & continue : Q waits until P leaves the monitor or waits for another condition
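- in Java, synchronized methods with wait/notifyAll behave as a monitor with signal & continue semantics; a minimal sketch (class and field names are illustrative):

```java
// One-slot mailbox as a Java monitor: only one thread is active inside at a time,
// and signalled threads re-acquire the monitor later (signal & continue).
class Mailbox {
    private Object item;                         // state visible only inside the monitor

    public synchronized void put(Object x) throws InterruptedException {
        while (item != null) wait();             // condition: slot must be empty
        item = x;
        notifyAll();                             // wake up waiting takers
    }

    public synchronized Object take() throws InterruptedException {
        while (item == null) wait();             // condition: slot must be full
        Object x = item;
        item = null;
        notifyAll();                             // wake up waiting putters
        return x;
    }
}
```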
- common problems
- bounded buffer
- 1 buffer holds N items
- 1 mutex initialized to 1
- semaphore full initialized to 0
- semaphore empty initialized to N
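- a sketch of this setup with java.util.concurrent.Semaphore (class and field names are illustrative):

```java
import java.util.concurrent.Semaphore;

// Bounded buffer: mutex protects the slots, empty counts free slots (starts at N),
// full counts filled slots (starts at 0), matching the initial values above.
class BoundedBuffer<T> {
    private final T[] items;
    private int putIdx = 0, takeIdx = 0;
    private final Semaphore mutex = new Semaphore(1);
    private final Semaphore empty;
    private final Semaphore full = new Semaphore(0);

    @SuppressWarnings("unchecked")
    BoundedBuffer(int n) {
        items = (T[]) new Object[n];
        empty = new Semaphore(n);
    }

    public void put(T x) throws InterruptedException {
        empty.acquire();                          // wait for a free slot
        mutex.acquire();
        items[putIdx] = x;
        putIdx = (putIdx + 1) % items.length;
        mutex.release();
        full.release();                           // one more item available
    }

    public T take() throws InterruptedException {
        full.acquire();                           // wait for an item
        mutex.acquire();
        T x = items[takeIdx];
        takeIdx = (takeIdx + 1) % items.length;
        mutex.release();
        empty.release();                          // one more free slot
        return x;
    }
}
```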
- reader/writer
- mutex at 1
- semaphore write at 1
- readcount : incremented by reader
- first reader in : acquires write (keeps the writer out)
- last reader out : releases write (lets the writer in)
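- a sketch of this protocol with semaphores (illustrative names):

```java
import java.util.concurrent.Semaphore;

// Readers/writer: mutex protects readcount, write excludes the writer while readers are in.
class ReadWriteGuard {
    private final Semaphore mutex = new Semaphore(1);
    private final Semaphore write = new Semaphore(1);
    private int readcount = 0;

    public void readLock() throws InterruptedException {
        mutex.acquire();
        if (++readcount == 1) write.acquire();   // first reader in: lock the writer out
        mutex.release();
    }

    public void readUnlock() throws InterruptedException {
        mutex.acquire();
        if (--readcount == 0) write.release();   // last reader out: let the writer in
        mutex.release();
    }

    public void writeLock() throws InterruptedException { write.acquire(); }

    public void writeUnlock() { write.release(); }
}
```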
- dining philosophers
- 5 philosophers : either thinking or eating
- 5 chopsticks : need 2 for eating
- semaphore chopstick[5] at 1
- deadlock : each philosopher grabs the left chopstick (solvable if they always grab the smaller chopstick id first, see the sketch below)
- solution : monitor + states
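- a sketch of the chopstick-ordering fix (illustrative; picking up the lower-numbered chopstick first prevents a circular wait):

```java
import java.util.concurrent.Semaphore;

// Dining philosophers: acquiring chopsticks in a global order breaks the deadlock cycle.
class Table {
    private final Semaphore[] chopstick = new Semaphore[5];

    Table() {
        for (int i = 0; i < 5; i++) chopstick[i] = new Semaphore(1);
    }

    public void eat(int i) throws InterruptedException {
        int left = i, right = (i + 1) % 5;
        int first = Math.min(left, right), second = Math.max(left, right);
        chopstick[first].acquire();    // always the smaller id first
        chopstick[second].acquire();
        try {
            // ... eating ...
        } finally {
            chopstick[second].release();
            chopstick[first].release();
        }
    }
}
```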
- contention : when many threads compete for a lock
- concurrent object : should increase throughput without contention
- coarse-grained synchronization : every method locks (with queue)
- sequential bottleneck : stand in line
- does not improve throughput
- fine-grained synchronization : split object into independently-synchronized components
- only conflict : same component at same time
- finally : always release the lock in a finally block that wraps up the try block
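- a sketch with java.util.concurrent.locks.ReentrantLock (illustrative class):

```java
import java.util.concurrent.locks.ReentrantLock;

// The unlock sits in finally, so the lock is released even if the critical section throws.
class LockedCounter {
    private final ReentrantLock lock = new ReentrantLock();
    private int value = 0;

    public int increment() {
        lock.lock();
        try {
            return ++value;       // critical section
        } finally {
            lock.unlock();        // runs on both normal and exceptional exit
        }
    }
}
```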
- traffic jam : any concurrent data structure based on mutual exclusion has a big weakness (if one thread stalls, e.g. on a cache or page fault, all the others have to wait for it)
- optimistic strategies
- optimistic traversal : traverse without locking, then lock and validate (gives up guarantees during the traversal)
- lazy synchronization : break removal into two parts, logical removal (mark the component) and physical removal
- lock-free data structures
- CAS (compare-and-swap) : compare with an expected value and, if it matches, set to the new value, all of it atomically
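- a sketch of the usual CAS retry loop with AtomicInteger (illustrative lock-free counter):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Lock-free increment: read, compute, and retry the CAS if another thread interfered.
class CasCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    public int increment() {
        while (true) {
            int current = value.get();
            int next = current + 1;
            if (value.compareAndSet(current, next))  // succeeds only if value is still current
                return next;
        }
    }
}
```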
- linked list : using a list-based set
- set : add, remove, contains
- node : item, key (hashcode), next
- invariant : tail reachable from head, sorted, no duplicates
- coarse-grained locking : one lock to rule them all
- fine-grained locking : each piece has own lock
- hand-over-hand locking : hold two consecutive locks while traversing (inefficient)
- lock-free : CAS not enough, need logical removal (AtomicMarkableReference)
- remove : set mark bit in next field, redirect predecessor's pointer
- AtomicMarkableReference API :
- get(boolean[] markHolder) : returns the reference and stores the mark at array index 0
- public boolean compareAndSet(Object expectedRef, Object updateRef, boolean expectedMark, boolean updateMark);
- public boolean attemptMark(Object expectedRef, boolean updateMark);
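- a sketch of logical then physical removal with AtomicMarkableReference (node layout assumed as above, helper names illustrative):

```java
import java.util.concurrent.atomic.AtomicMarkableReference;

// Logical removal: mark the victim's next field; physical removal: swing pred's pointer.
class Node {
    final int key;
    final AtomicMarkableReference<Node> next;

    Node(int key, Node succ) {
        this.key = key;
        this.next = new AtomicMarkableReference<>(succ, false);
    }
}

class Remover {
    // pred -> victim -> succ : returns true if this thread performed the logical removal
    static boolean remove(Node pred, Node victim) {
        Node succ = victim.next.getReference();
        // logical removal: only succeeds if next is still succ and not yet marked
        if (!victim.next.compareAndSet(succ, succ, false, true)) return false;
        // physical removal: best effort, other threads may finish it later
        pred.next.compareAndSet(victim, succ, false, false);
        return true;
    }
}
```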
- timing : advance pointer $T_p$, perform CAS $T_C$, remove $T_r$
- concurrent hash tables
- bucket : avoid collision
- methods (from lock-free list) : add (9%), remove (1%), contains (90%)
- resize : grows array, adjust hash function
- when to resize
- bucket threshold : number in one exceed
- global threshold : quarter of buckets exceed this value
- coarse-grained locking : no concurrency
- striped locks : each lock now associated with one bucket
- array of lock
- acquire locks in ascending order : make sure table references didn't change once all lock acquired (concurrent resize)
- allocate a new super-sized table
- resize : lock whole array and each lock now associated with two or more buckets
- fine-grained locking
- resize : exclusive mode
- add, remove, contains : shared mode
- read/write locks : provide two different lock modes
- read : locks out writers, allow concurrent readers
- write : locks out writers and readers
- safe, but liveness is hard (a continual stream of readers can starve the writer)
- FIFO R/W lock : as soon as writer requests a lock, no more readers accepted
public boolean add(Object key) {
  int keyHash = key.hashCode() % lock.length;     // pick the lock stripe
  synchronized (lock[keyHash]) {
    int tabHash = key.hashCode() % table.length;  // pick the bucket (table may have grown)
    return table[tabHash].add(key);
  }
}
private void resize(int depth, List[] oldTab) {
  synchronized (lock[depth]) {                    // acquire stripes in ascending order
    if (oldTab == this.table) {                   // give up if someone already resized
      int next = depth + 1;
      if (next < lock.length) resize(next, oldTab);
      else sequentialResize();                    // all locks held: safe to resize
    }
  }
}
- lock-free : don't move items, move buckets instead (all items in a single lock-free list, buckets are shortcut)
- quick access, reasonable load balance
- split-order : $2^i$ keys sorted in reverse-LSB order (Least Significant Bit, allows recursive splits)
- resize : split each bucket in two or more ($b = k \mod 2^{i+1}$ stays, $b + 2^i = k \mod 2^{i+1}$ moves, determined by bit $i+1$ counting backwards)
- sentinel nodes : avoid removing a node pointed to by a shortcut, by adding sentinels
- set the MSB to one before reversing for all regular (non-sentinel) keys
- split-ordered set
- set size / number of buckets ratio with threshold
- initialize buckets : originally null, if you find one uninitialized, initialize it
public class SOSet {
protected LockFreeList[] table;
protected AtomicInteger tableSize;
protected AtomicInteger setSize;
public SOSet(int capacity) {
table = new LockFreeList[capacity];
table[0] = new LockFreeList();
tableSize = new AtomicInteger(2);
setSize = new AtomicInteger(0);
}
}
public boolean add(Object object) {
int hash = object.hashCode();
int bucket = hash % tableSize.get();
int key = makeRegularKey(hash);
LockFreeList list = getBucketList(bucket);
if (!list.add(object, key))
return false;
resizeCheck();
return true;
}
void initializeBucket(int bucket) {
  int parent = getParent(bucket);
  if (table[parent] == null)
    initializeBucket(parent);                     // initialize ancestors first
  int key = makeSentinelKey(bucket);
  table[bucket] = new LockFreeList(table[parent], key); // insert sentinel and publish the bucket
}
- threads
- overhead : memory for stacks, setup, teardown, scheduler
- short-lived thread (~200 $\mu$s) : bad ratio work/overhead
- pool abstraction (long-lived) : run task, rejoin pool, wait for next
- ExecutorService
- task : Runnable (run() returns nothing) or Callable (call() returns a result)
- submit task to executor (~$\mu$s) : future in return
- futures : get the result (blocking)
- efficiency : ratio of task length to submission overhead
class FibTask implements Callable<Integer> {
static ExecutorService exec = Executors.newCachedThreadPool();
int arg;
public FibTask(int n) {
arg = n;
}
public Integer call() throws Exception { // Future.get() throws checked exceptions
if (arg > 2) {
Future<Integer> left = exec.submit(new FibTask(arg-1));
Future<Integer> right = exec.submit(new FibTask(arg-2));
return left.get() + right.get();
} else {
return 1;
}
}
}
- Blumofe-Leiserson DAG model : multithreaded program is Directed Acyclic Graph that unfolds dynamically
- timing
- time on $P$ processors : $T_P$
- work : $T_1$
- critical path : $T_\infty$ (longest dependency path)
- parallelism : $T_1/T_\infty$
- work law : $T_P \ge T_1/P$ (cannot do more than $P$ work in one step)
- critical path law : $T_P \ge T_\infty$ (finite resources)
- speedup on $P$ processors : ratio $T_1/T_P$ (linear if $\propto P$, max $T_1/T_\infty$)
- work dealing : everyone passes extra work to others
- lock-free work stealing : each thread has a pool, remove without synchronization, steal someone at random
- double-ended queue ideal : lock-free, constant time, linearizable
- pushBottom, popBottom, popTop
- need to guard stealing accesses : stamped top (avoids ABA) plus bottom index
public class BDEQueue {
AtomicStampedReference<Integer> top;
volatile int bottom;
Runnable[] tasks;
}
void pushBottom(Runnable r){
tasks[bottom] = r;
bottom++;
}
public Runnable popTop() {
int[] stamp = new int[1];
int oldTop = top.get(stamp), newTop = oldTop + 1;
int oldStamp = stamp[0], newStamp = oldStamp + 1;
if (bottom <= oldTop)
return null;
Runnable r = tasks[oldTop];
if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp))
return r;
return null;
}
Runnable popBottom() {
if (bottom == 0) return null;
bottom--;
Runnable r = tasks[bottom];
int[] stamp = new int[1];
int oldTop = top.get(stamp), newTop = 0;
int oldStamp = stamp[0], newStamp = oldStamp + 1;
if (bottom > oldTop) return r;
if (bottom == oldTop){ // last task: race against a thief
bottom = 0;
if (top.compareAndSet(oldTop, newTop, oldStamp, newStamp))
return r; // we won the race
}
top.set(newTop, newStamp); // thief won: reset the deque
bottom = 0;
return null;
}
- work balancing : stealing is costly, move more than one task randomly
- perform same task on different data
- map reduce
- divide up the data among servers
- compute the stats
- reduce
- vector processors : pipelined execution, single instruction multiple data (SIMD)
- reduce ops by 1.2x
- reduce instructions by 20x
- automatic code vectorization with some compilers (but not for indirect accesses such as for (i = 0; i < N; i++) a[i] = a[b[i]] * fade;)
- ISA SIMD support (also on x86) : Streaming SIMD Extensions (SSE)
- 128-bit vector register holds : integers (16 bytes, 8 shorts, or 4 ints), single precision (4 floats), double precision (2 doubles)
- instruction : add, sub, mul, div, sqrt, max, min, and, or, xor, andn, cmpps
- in ARM (NEON) : 32x64bits registers
- micro architecture : functional units & vector registers
- graphics processing units (GPU) : dense grid of ALUs, single instruction multiple threads (SIMT)
- thousands of tiny cores : little cache, mostly ALUs
- integrated GPU (AMD) : shared cache hierarchy, one memory
- discrete GPU (nVidia) : specialized memory, must move data
- CPU/GPU connections
- CPU <-> memory : 30 GB/s
- GPU <-> GPU memory : 300 GB/s
- CPU <-> GPU : 3 GB/s
- GPU thread : little overhead (microsecond), full hardware, thousands
- help : parallel computation, memory overlap
- execution timeline : copy data to GPU, execution configuration launch kernel, synchronize, copy data to CPU
- CUDA (nVidia, market leader) vs OpenCL (open standard with a similar model, newer)
- no execution order
- computation partitioning
- threads are grouped in blocks, blocks in a grid
- thread within same block can communicate/synchronize
- grid : machine dependent
- threads per block : max 1024
- memory model
- thread : local memory and registers; all threads within a block share on-chip shared memory
- block : global memory (may be cached), constant memory (read-only), texture memory (read-only)
- CUDA function qualifiers : where it executes and who can call it
- __device__ : device code, called from the device
- __global__ : device code, called from the host (must return void)
- __host__ : host code, called from the host
- __device__ and __host__ can be combined on the same function
- CUDA memory copy :
cudaMemcpy( void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction)
- cudaMemcpyHostToDevice
- cudaMemcpyDeviceToHost
- cudaMemcpyDeviceToDevice
__global__ void fade (int *a, int fade, int N) {
int x = blockDim.x * blockIdx.x + threadIdx.x;
int y = blockDim.y * blockIdx.y + threadIdx.y;
int offset = y * (blockDim.x * gridDim.x) + x;
if (offset >= N) return; // guard threads past the end of the array
int v = a[offset];
v = v * fade;
a[offset] = v;
}
cudaMalloc((void **) &da, sizeof (float) * N);
int threads_block = 64;
int blocks = (N + threads_block - 1) / threads_block; // round up
arradd<<<blocks, threads_block>>>(da, 10.0f, N);
cudaDeviceSynchronize(); // forces CPU to wait
cudaMemcpy((void *) ha, (void *) da, sizeof(float) * N, cudaMemcpyDeviceToHost); // dest, source
cudaFree(da);
- FLOPS : floating-point operations per second
- shared memory : fastest computation
- tiled kernel : matrix multiply; __syncthreads() is required so that no thread computes while others are still loading the tile
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width) {
__shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
__shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
int bx = blockIdx.x; int by = blockIdx.y; int tx = threadIdx.x; int ty = threadIdx.y;
int Row = by * TILE_WIDTH + ty; int Col = bx * TILE_WIDTH + tx; float Pvalue = 0;
for (int m = 0; m < Width/TILE_WIDTH; ++m) {
// Collaborative loading of Md and Nd tiles into shared memory
Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
__syncthreads();
for (int k = 0; k < TILE_WIDTH; ++k)
Pvalue += Mds[ty][k] * Nds[k][tx];
__syncthreads();
}
Pd[Row*Width+Col] = Pvalue;
}
- warp : subgroup of a block (32 threads executing in lockstep)
- banks : memory division to achieve high bandwidth (1 access per 2 cycles)
- serialization : threads accessing different words in same bank (sequential)
- multicasting : threads accessing same word in same bank (parallel)
- the fewer bank conflicts, the faster it is (cost = max simultaneous accesses to a single bank)
- conflict-free if the stride shares no common factors with the number of banks
- reduce operations
- pairwise : $\log_2(n)$ steps
- strategy
- interleaved accesses : shared memory, each step reduce by 2
- non-divergent threads : group all active threads together
- thread-sequential accesses : 2-way conflict at every step
- read two elements and do the first step
- unrolling the last 6 iterations
__global__ void reduce0(int *g_idata, int *g_odata, int n) {
extern __shared__ int sdata[];
// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x* blockDim.x * 2 + threadIdx.x;
sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
__syncthreads();
for (unsigned int s = blockDim.x/2; s > 32; s /= 2) {
if (tid < s) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
if (tid < 32) { // unrolled last warp: relies on warp-synchronous (lock-step) execution
sdata[tid] += sdata[tid + 32];
sdata[tid] += sdata[tid + 16];
sdata[tid] += sdata[tid + 8];
sdata[tid] += sdata[tid + 4];
sdata[tid] += sdata[tid + 2];
sdata[tid] += sdata[tid + 1];
}
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}