diff --git a/content/blog/2023/08/22-first-conformant-m1-gpu-driver.md b/content/blog/2023/08/22-first-conformant-m1-gpu-driver.md
new file mode 100644
index 0000000..f80a9f4
--- /dev/null
+++ b/content/blog/2023/08/22-first-conformant-m1-gpu-driver.md
@@ -0,0 +1,215 @@
+++
date = "2023-08-22T23:30:00+09:00"
draft = false
title = "The first conformant M1 GPU driver"
slug = "first-conformant-m1-gpu-driver"
author = "Alyssa Rosenzweig"
+++

Conformant OpenGL® ES 3.1 drivers are now available for M1- and M2-family GPUs. That means the drivers are compatible with any OpenGL ES 3.1 application. Interested? [Just install Linux!](https://fedora-asahi-remix.org/)

For existing [Asahi Linux](https://asahilinux.org/) users, upgrade your system with `dnf upgrade` (Fedora) or `pacman -Syu` (Arch) for the latest drivers.

Our reverse-engineered, free and [open source graphics drivers](https://gitlab.freedesktop.org/asahi/mesa) are the world's ***only*** conformant OpenGL ES 3.1 implementation for M1- and M2-family graphics hardware. That means our driver passed tens of thousands of tests to demonstrate correctness and is now recognized by the industry.

To become conformant, an "implementation" must pass the official conformance test suite, designed to verify every feature in the specification. The test results are submitted to Khronos, the standards body. After a [30-day review period](https://www.khronos.org/conformance/adopters/), if no issues are found, the implementation becomes conformant. The Khronos website lists all conformant implementations, including our drivers for the [M1](https://www.khronos.org/conformance/adopters/conformant-products/opengles#submission_1007), [M1 Pro/Max/Ultra](https://www.khronos.org/conformance/adopters/conformant-products/opengles#submission_1014), [M2](https://www.khronos.org/conformance/adopters/conformant-products/opengles#submission_1016), and [M2 Pro/Max](https://www.khronos.org/conformance/adopters/conformant-products/opengles#submission_1017).

Today's milestone isn't just about OpenGL ES. We're releasing the first conformant implementation of *any* graphics standard for the M1. And we don't plan to stop here ;-)

[![Teaser of the "Vulkan instancing" demo running on Asahi Linux](/img/blog/2023/08/vkinstancing2.webp)](/img/blog/2023/08/vkinstancing.webp)

Unlike ours, the manufacturer's M1 drivers are unfortunately not conformant for _any_ standard graphics API, whether Vulkan or OpenGL or OpenGL ES. That means there is no guarantee that applications using the standards will work on your M1/M2 (if you're not running Linux). This isn't just a theoretical issue. Consider Vulkan. The third-party [MoltenVK](https://github.com/KhronosGroup/MoltenVK) layers a subset of Vulkan on top of the proprietary drivers. However, those drivers lack key functionality, breaking valid Vulkan applications. That hinders developers and users alike, if they haven't yet switched their M1/M2 computers to Linux.

Why did *we* pursue standards conformance when the manufacturer did not? Above all, our commitment to quality. We want our users to know that they can depend on our Linux drivers. We want standard software to run without M1-specific hacks or porting. We want to set the right example for the ecosystem: the way forward is implementing open standards, conformant to the specifications, without compromises for "portability".

We are not satisfied with proprietary drivers, proprietary APIs, and refusal to implement standards. The rest of the industry knows that progress comes from cross-vendor collaboration. We know it, too. Achieving conformance is a win for our community, for open source, and for open graphics.

Of course, [Asahi Lina](https://vt.social/@lina/) and I are two individuals with minimal funding. It's a little awkward that we beat the big corporation...

It's not too late though. They should follow our lead!

---

OpenGL ES 3.1 updates the experimental [OpenGL ES 3.0 and OpenGL 3.1](/blog/opengl3-on-asahi-linux.html) we shipped in June. Notably, ES 3.1 adds compute shaders, typically used to accelerate general computations within graphics applications. For example, a 3D game could run its physics simulations in a compute shader. The simulation results can then be used for rendering, eliminating stalls that would otherwise be required to synchronize the GPU with a CPU physics simulation. That lets the game run faster.

Let's zoom in on one new feature: atomics on images. Older versions of OpenGL ES allowed an application to read an image in order to display it on screen. ES 3.1 allows the application to *write* to the image, typically from a compute shader. This new feature enables flexible image processing algorithms, which previously needed to fit into the fixed-function 3D pipeline. However, GPUs are massively parallel, running thousands of threads at the same time. If two threads write to the same location, there is a conflict: depending on which thread runs first, the result will be different. We have a race condition.

"Atomic" access to memory provides a solution to race conditions. With atomics, special hardware in the memory subsystem guarantees consistent, well-defined results for select operations, regardless of the order of the threads. Modern graphics hardware supports various atomic operations, like addition, serving as building blocks for complex parallel algorithms.
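To make the contrast concrete, here is a CPU-side C sketch of the same idea -- plain C11 atomics standing in for the GPU's atomic hardware, not anything from the driver itself:

```c
#include <stdatomic.h>
#include <stdint.h>

uint32_t         plain_counter  = 0;  /* plain increment: racy          */
_Atomic uint32_t atomic_counter = 0;  /* atomic add: always well-defined */

/* Imagine thousands of threads running this concurrently. */
void worker(void)
{
    /* Two threads can read the same old value and both write old+1,
     * losing an update: the final result depends on scheduling. */
    plain_counter = plain_counter + 1;

    /* The memory subsystem serialises the read-modify-write, so the
     * result is the same no matter which thread runs first. */
    atomic_fetch_add_explicit(&atomic_counter, 1, memory_order_relaxed);
}
```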
Can we put these two features together to write to an image atomically?

Yes. A ubiquitous OpenGL ES [extension](https://registry.khronos.org/OpenGL/extensions/OES/OES_shader_image_atomic.txt), required for ES 3.2, adds atomics operating on pixels in an image. For example, a compute shader could atomically increment the value at pixel (10, 20).

Other GPUs have dedicated instructions to perform atomics on images, making the driver implementation straightforward. For us, the story is more complicated. The M1 lacks hardware instructions for image atomics, even though it has non-image atomics and non-atomic images. We need to reframe the problem.

The idea is simple: to perform an atomic on a pixel, we instead calculate the address of the pixel in memory and perform a regular atomic on that address. Since the hardware supports regular atomics, our task is "just" calculating the pixel's address.

If the image were laid out linearly in memory, this would be straightforward: multiply the Y-coordinate by the number of bytes per row ("stride"), multiply the X-coordinate by the number of bytes per pixel, and add. That gives the pixel's offset in bytes relative to the first pixel of the image. To get the final address, we add that offset to the address of the first pixel:

$$\text{Address}(X, Y) = \text{Address}(0, 0) + Y \times \text{Stride} + X \times \text{BytesPerPixel}$$
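As a sketch of where this is heading -- assuming a linear layout for now, and using C atomics as a stand-in for the GPU's -- lowering an image atomic is just a little address arithmetic followed by an ordinary atomic. The helper below is illustrative, not the actual driver code:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative lowering of imageAtomicAdd on a *linear* 32-bit-per-pixel
 * image: compute the pixel's address, then perform an ordinary atomic there.
 * Assumes 4-byte-aligned pixels. */
static inline uint32_t
image_atomic_add_linear(void *base, uint32_t stride_bytes,
                        uint32_t x, uint32_t y, uint32_t value)
{
    uint8_t *pixel = (uint8_t *)base + (uint64_t)y * stride_bytes
                                     + (uint64_t)x * sizeof(uint32_t);

    return atomic_fetch_add((_Atomic uint32_t *)pixel, value);
}
```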
Alas, images are rarely linear in memory. To improve cache efficiency, modern graphics hardware interleaves the X- and Y-coordinates. Instead of one row after the next, pixels in memory follow a [spiral-like curve](https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-swizzling/).

We need to amend our previous equation to interleave the coordinates. We could use many instructions to mask one bit at a time, shifting to construct the interleaved result, but that's inefficient. We can do better.

There is a well-known ["bit twiddling" algorithm to interleave bits](https://graphics.stanford.edu/~seander/bithacks.html#InterleaveBMN). Rather than shuffle one bit at a time, the algorithm shuffles groups of bits, parallelizing the problem. Implementing this algorithm in shader code improves performance.
In practice, only the lower 7 bits (or fewer) of each coordinate are interleaved. That lets us use 32-bit instructions to "vectorize" the interleave, by putting the X- and Y-coordinates in the low and high 16 bits of a 32-bit register. Those 32-bit instructions let us interleave X and Y at the same time, halving the instruction count. Plus, we can exploit the GPU's combined shift-and-add instruction. Putting the tricks together, we interleave in 10 instructions of M1 GPU assembly:

```asm
# Inputs x, y in r0l, r0h.
# Output in r1.

add r2, #0, r0, lsl 4
or r1, r0, r2
and r1, r1, #0xf0f0f0f
add r2, #0, r1, lsl 2
or r1, r1, r2
and r1, r1, #0x33333333
add r2, #0, r1, lsl 1
or r1, r1, r2
and r1, r1, #0x55555555
add r1, r1l, r1h, lsl 1
```
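For readers who prefer C to GPU assembly, here is a model of the same trick. It is a sketch of the algorithm rather than driver code, and it assumes small tile-local coordinates as described above:

```c
#include <stdint.h>

/* C model of the assembly above: both coordinates ride in one 32-bit value
 * (x in the low half, y in the high half), each round spreads the bits of
 * both halves at once, and the final step merges x onto the even bits and
 * y onto the odd bits. Only valid for small (tile-local) coordinates. */
static uint32_t
interleave_coords(uint32_t x, uint32_t y)
{
    uint32_t t = (y << 16) | x;

    t = (t | (t << 4)) & 0x0f0f0f0f;
    t = (t | (t << 2)) & 0x33333333;
    t = (t | (t << 1)) & 0x55555555;

    return (t & 0xffff) | ((t >> 16) << 1);
}
```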
We could stop here, but what if there's a *dedicated* instruction to interleave bits? PowerVR has a "shuffle" instruction [`shfl`](https://docs.imgtec.com/reference-manuals/powervr-instruction-set-reference/topics/bitwise-instructions/SHFL.html), and the M1 GPU borrows from PowerVR. Perhaps that instruction was borrowed too. Unfortunately, even if it was, the proprietary compiler won't use it when compiling our test shaders. That makes it difficult to reverse-engineer the instruction -- if it exists -- by observing compiled shaders.

It's time to dust off a powerful reverse-engineering technique from magic kindergarten: guess and check.

[Dougall Johnson](https://mastodon.social/@dougall) provided the guess. When considering the instructions we already know about, he took special notice of the "reverse bits" instruction. Since reversing bits is a type of bit shuffle, the interleave instruction should be encoded similarly. The bit reverse instruction has a two-bit field specifying the operation, with value `01`. Related instructions to _count the number of set bits_ and _find the first set bit_ have values `10` and `11` respectively. That encompasses all known "complex bit manipulation" instructions.

There is one value of the two-bit enumeration that is unobserved and unknown: `00`. If this interleave instruction exists, it's probably encoded like the bit reverse but with operation code `00` instead of `01`.

There's a difficulty: the three known instructions have one single input source, but our instruction interleaves two sources. Where does the second source go? We can make a guess based on symmetry. Presumably to simplify the hardware decoder, M1 GPU instructions usually encode their sources in consistent locations across instructions. The other three instructions have a gap where we would expect the second source to be, in a two-source arithmetic instruction. Probably the second source is there.

Armed with a guess, it's our turn to check. Rather than handwrite GPU assembly, we can hack our compiler to replace some two-source integer operation (like multiply) with our guessed encoding of "interleave". Then we write a compute shader using this operation (by "multiplying" numbers) and run it with the newfangled compute support in our driver.

All that's left is writing a [shader](/img/blog/2024/02/blog/interleave.shader_test) that checks that the mystery instruction returns the interleaved result for each possible input. Since the instruction takes two 16-bit sources, there are about 4 billion ($2^{32}$) inputs. With our driver, the M1 GPU manages to check them all in under a second, and the verdict is in: this is our interleave instruction.

As for our clever vectorized assembly to interleave coordinates? We can replace it with one instruction. It's anticlimactic, but it's fast and it passes the conformance tests.

And that's what matters.

---

_Thank you to [Khronos](https://www.khronos.org/) and [Software in the Public Interest](https://www.spi-inc.org/) for supporting open drivers._

diff --git a/content/blog/2024/02/14-conformant-gl46-on-the-m1.md b/content/blog/2024/02/14-conformant-gl46-on-the-m1.md
new file mode 100644
index 0000000..fa2ac9f
--- /dev/null
+++ b/content/blog/2024/02/14-conformant-gl46-on-the-m1.md
@@ -0,0 +1,308 @@
+++
date = "2024-02-14T12:00:00+09:00"
draft = false
title = "Conformant OpenGL 4.6 on the M1"
slug = "conformant-gl46-on-the-m1"
author = "Alyssa Rosenzweig"
+++

For years, the M1 has only supported OpenGL 4.1. That changes today -- with our release of full OpenGL® 4.6 and OpenGL® ES 3.2! [Install Fedora](https://fedora-asahi-remix.org/) for the latest M1/M2-series drivers.

Already installed? Just `dnf upgrade --refresh`.

Unlike the vendor's non-conformant 4.1 drivers, our [open source](https://gitlab.freedesktop.org/asahi/mesa) Linux drivers are **conformant** to the latest OpenGL versions, finally promising broad compatibility with modern OpenGL workloads, like [Blender](https://www.blender.org/).

[![Blender "Wanderer" demo rendered on Asahi Linux](/img/blog/2024/02/Blender-Wanderer.avif)](/img/blog/2024/02/Blender-Wanderer-high.avif)

Conformant 4.6/3.2 drivers must pass over 100,000 tests to ensure correctness. The official list of conformant drivers now includes [our OpenGL 4.6]() and [ES 3.2](https://www.khronos.org/conformance/adopters/conformant-products/opengles#submission_1045).

While the vendor doesn't yet support graphics standards like modern OpenGL, we do. For this Valentine's Day, we want to profess our love for interoperable open standards. We want to free users and developers from lock-in, enabling applications to run anywhere the heart wants without special ports. For that, we need standards conformance. Six months ago, we became the [first conformant driver for any standard graphics API for the M1](/blog/first-conformant-m1-gpu-driver.html) with the release of OpenGL ES 3.1 drivers. Today, we've finished OpenGL with the full 4.6... and we're well on the road to Vulkan.

---

Compared to 4.1, OpenGL 4.6 adds dozens of required features, including:

* Robustness
* SPIR-V
* [Clip control](/blog/asahi-gpu-part-6.html)
* Cull distance
* [Compute shaders](/blog/first-conformant-m1-gpu-driver.html)
* Upgraded transform feedback

Regrettably, the M1 doesn't map well to any graphics standard newer than OpenGL ES 3.1. While Vulkan makes some of these features optional, the missing features are required to layer DirectX and OpenGL on top. No existing solution on M1 gets past the OpenGL 4.1 feature set.

How do we break the 4.1 barrier? Without hardware support, new features need new tricks. Geometry shaders, tessellation, and transform feedback become compute shaders. Cull distance becomes a transformed interpolated value. Clip control becomes a vertex shader epilogue. The list goes on.

For a taste of the challenges we overcame, let's look at **robustness**.

Built for gaming, GPUs traditionally prioritize raw performance over safety. Invalid application code, like a shader that reads a buffer out-of-bounds, can trigger undefined behaviour. Drivers exploit that to maximize performance.

For applications like web browsers, that trade-off is undesirable. Browsers handle untrusted shaders, which they must sanitize to ensure stability and security. Clicking a malicious link should not crash the browser. While some sanitization is necessary as graphics APIs are not security barriers, reducing undefined behaviour in the API can assist "defence in depth".

"Robustness" features can help. Without robustness, out-of-bounds buffer access in a shader can crash. With robustness, the application can opt for defined out-of-bounds behaviour, trading some performance for less attack surface.

All modern cross-vendor APIs include robustness. Many games even (accidentally?) rely on robustness. Strangely, the vendor's proprietary API omits buffer robustness. We must do better for conformance, correctness, and compatibility.

Let's first define the problem. Different APIs have different definitions of what an out-of-bounds load returns when robustness is enabled:

* Zero (Direct3D, Vulkan with `robustBufferAccess2`)
* Either zero or some data in the buffer (OpenGL, Vulkan with `robustBufferAccess`)
* Arbitrary values, but can't crash (OpenGL ES)

OpenGL uses the second definition: return zero or data from the buffer. One approach is to return the *last* element of the buffer for out-of-bounds access. Given the buffer size, we can calculate the last index. Now consider the *minimum* of the index being accessed and the last index. That equals the index being accessed if it is valid, and some other valid index otherwise. Loading the minimum index is safe and gives a spec-compliant result.

As an example, a uniform buffer load without robustness might look like:

```asm
load.i32 result, buffer, index
```

Robustness adds a single unsigned minimum (`umin`) instruction:

```asm
umin idx, index, last
load.i32 result, buffer, idx
```
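The same clamping trick, modelled on the CPU for clarity (an illustration only, not the compiler's actual output):

```c
#include <stddef.h>
#include <stdint.h>

/* OpenGL-style robustness permits an out-of-bounds read to return *some*
 * value from the buffer, so clamping the index to the last element is a
 * valid implementation: one unsigned min, then an ordinary load. */
static uint32_t
robust_load(const uint32_t *buffer, size_t last_index, size_t index)
{
    size_t clamped = index < last_index ? index : last_index; /* umin */
    return buffer[clamped];                                   /* load */
}
```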
Is the robust version slower? It can be. The difference should be small percentage-wise, as arithmetic is faster than memory. With thousands of threads running in parallel, the arithmetic cost may even be hidden by the load's latency.

There's another trick that speeds up robust uniform buffers. Like other GPUs, the M1 supports "preambles". The idea is simple: instead of calculating the same value in every thread, it's faster to calculate once and reuse the result. The compiler identifies eligible calculations and moves them to a preamble executed before the main shader. These redundancies are common, so preambles provide a nice speed-up.

We usually move uniform buffer loads to the preamble when every thread loads the same index. Since the size of a uniform buffer is fixed, extra robustness arithmetic is *also* moved to the preamble. The robustness is "free" for the main shader. For robust storage buffers, the clamping might move to the preamble even if the load or store cannot.

Armed with robust uniform and storage buffers, let's consider robust "vertex buffers". In graphics APIs, the application can set vertex buffers with a base GPU address and a chosen layout of "attributes" within each buffer. Each attribute has an offset and a format, and the buffer has a "stride" indicating the number of bytes per vertex. The vertex shader can then read attributes, implicitly indexing by the vertex. To do so, the shader loads the address:

$$\text{Base} + \text{Stride} \times \text{Vertex} + \text{Offset}$$

Some hardware implements robust vertex fetch natively. Other hardware has bounds-checked buffers to accelerate robust software vertex fetch. Unfortunately, the M1 has neither. We need to implement vertex fetch with raw memory loads.

One instruction set feature helps. In addition to a 64-bit base address, the M1 GPU's memory loads also take an offset in *elements*. The hardware shifts the offset and adds it to the 64-bit base to determine the address to fetch. Additionally, the M1 has a combined integer multiply-add instruction `imad`. Together, these features let us implement vertex loads in two instructions. For example, a 32-bit attribute load looks like:

```asm
imad idx, stride/4, vertex, offset/4
load.i32 result, base, idx
```

The hardware load can perform an additional small shift. Suppose our attribute is a vector of 4 32-bit values, densely packed into a buffer with no offset. We can load that attribute in one instruction:

```asm
load.v4i32 result, base, vertex << 2
```

...with the hardware calculating the address:

$$\text{Base} + 4 \times (\text{Vertex} \ll 2) = \text{Base} + 16 \times \text{Vertex}$$

What about robustness?

We want to implement robustness with a clamp, like we did for uniform buffers. The problem is that the vertex buffer size is given in bytes, while our optimized load takes an index in "vertices". A single vertex buffer can contain multiple attributes with different formats and offsets, so we can't convert the size in bytes to a size in "vertices".

Let's handle the latter problem. We can rewrite the addressing equation as:

$$\underbrace{\text{Base} + \text{Offset}}_{\text{attribute base}} + \text{Stride} \times \text{Vertex}$$

That is: one buffer with many attributes at different offsets is equivalent to many buffers with one attribute and no offset. This gives an alternate perspective on the same data layout. Is this an improvement? It avoids an addition in the shader, at the cost of passing more data -- addresses are 64-bit while attribute offsets are [16-bit](https://vulkan.gpuinfo.org/listreports.php?limit=maxVertexInputAttributeOffset&value=4294967295&platform=all0). More importantly, it lets us translate the vertex buffer size in bytes into a size in "vertices" for *each* vertex attribute. Instead of clamping the offset, we clamp the vertex index. We still make full use of the hardware addressing modes, now with robustness:

```asm
umin idx, vertex, last valid
load.v4i32 result, base, idx << 2
```

We need to calculate the last valid vertex index ahead-of-time for each attribute. Each attribute has a format with a particular size. Manipulating the addressing equation, we can calculate the last *byte* accessed in the buffer (plus 1) relative to the base:

$$\text{Offset} + \text{Stride} \times \text{Vertex} + \text{FormatSize}$$

The load is valid when that value is bounded by the buffer size in bytes:

$$\text{Offset} + \text{Stride} \times \text{Vertex} + \text{FormatSize} \le \text{Size}$$

We solve the integer inequality as:

$$\text{Vertex} \le \left\lfloor \frac{\text{Size} - \text{Offset} - \text{FormatSize}}{\text{Stride}} \right\rfloor$$

The driver calculates the right-hand side and passes it into the shader.

One last problem: what if a buffer is too small to load *anything*? Clamping won't save us -- the code would clamp to a negative index. In that case, the attribute is entirely invalid, so we swap the application's buffer for a small buffer of zeroes. Since we gave each attribute its own base address, this determination is per-attribute. Then clamping the index to zero correctly loads zeroes.

Putting it together, a little driver math gives us robust buffers at the cost of one `umin` instruction.
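As a sketch of that driver math -- in illustrative C with made-up names, not the real driver's structures -- the per-attribute setup looks something like this:

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-attribute setup for robust vertex fetch: compute the attribute's own
 * base address and the last vertex index it may safely load. Returns false
 * when even vertex 0 would be out-of-bounds; the driver then binds a small
 * buffer of zeroes instead, so the clamped load reads zeroes. */
static bool
setup_robust_attribute(uint64_t buffer_base, uint64_t buffer_size_B,
                       uint32_t offset_B, uint32_t format_size_B,
                       uint32_t stride_B,
                       uint64_t *attribute_base, uint64_t *last_valid_vertex)
{
    if (buffer_size_B < (uint64_t)offset_B + format_size_B)
        return false; /* too small to load anything */

    *attribute_base = buffer_base + offset_B;

    /* Vertex <= floor((size - offset - format size) / stride). A stride of
     * zero means every vertex reads the same bytes, so nothing to clamp. */
    uint64_t slack = buffer_size_B - offset_B - format_size_B;
    *last_valid_vertex = stride_B ? slack / stride_B : UINT64_MAX;

    return true;
}
```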
---

In addition to buffer robustness, we need image robustness. Like its buffer counterpart, image robustness requires that out-of-bounds image loads return zero. That formalizes a guarantee that reasonable hardware already makes.

...But it would be no fun if our hardware was reasonable.

Running the conformance tests for image robustness, there is a single test failure affecting "mipmapping".

For background, mipmapped images contain multiple "levels of detail". The base level is the original image; each successive level is the previous level downscaled. When rendering, the hardware selects the level closest to matching the on-screen size, improving efficiency and visual quality.

With robustness, the specifications all agree that image loads return...

* Zero if the X- or Y-coordinate is out-of-bounds
* Zero if the level is out-of-bounds

Meanwhile, image loads on the M1 GPU return...

* Zero if the X- or Y-coordinate is out-of-bounds
* Values from the last level if the level is out-of-bounds

Uh-oh. Rather than returning zero for out-of-bounds levels, the hardware clamps the level and returns nonzero values. It's a mystery why. The vendor does not document their hardware publicly, forcing us to rely on reverse engineering to build drivers. Without documentation, we don't know if this behaviour is intentional or a hardware bug. Either way, we need a workaround to pass conformance.

The obvious workaround is to never load from an invalid level:

```glsl
if (level <= levels) {
    return imageLoad(x, y, level);
} else {
    return 0;
}
```

That involves branching, which is inefficient. Loading an out-of-bounds level doesn't crash, so we can speculatively load and then use a compare-and-select operation instead of branching:

```glsl
vec4 data = imageLoad(x, y, level);

return (level <= levels) ? data : 0;
```

This workaround is okay, but it could be improved. While the M1 GPU has combined compare-and-select instructions, the instruction set is *scalar*. Each thread processes one value at a time, not a vector of multiple values. However, image loads return a vector of four components (red, green, blue, alpha). While the pseudo-code looks efficient, the resulting assembly is not:

```asm
image_load R, x, y, level
ulesel R[0], level, levels, R[0], 0
ulesel R[1], level, levels, R[1], 0
ulesel R[2], level, levels, R[2], 0
ulesel R[3], level, levels, R[3], 0
```

Fortunately, the vendor driver has a trick. We know the hardware returns zero if either X or Y is out-of-bounds, so we can *force* a zero output by *setting* X or Y out-of-bounds. As the maximum image size is 16384 pixels wide, any X greater than 16384 is out-of-bounds. That justifies an alternate workaround:

```glsl
bool valid = (level <= levels);
int x_ = valid ? x : 20000;

return imageLoad(x_, y, level);
```

Why is this better? We only change a single scalar, not a whole vector, compiling to compact scalar assembly:

```asm
ulesel x_, level, levels, x, #20000
image_load R, x_, y, level
```

If we preload the constant to a uniform register, the workaround is a single instruction. That's optimal -- and it passes conformance.

---

_Blender ["Wanderer"](https://download.blender.org/demo/eevee/wanderer/wanderer.blend) demo by [Daniel Bystedt](https://www.artstation.com/dbystedt), licensed CC BY-SA._

diff --git a/content/blog/2024/06/05-vk13-on-the-m1-in-1-month.md b/content/blog/2024/06/05-vk13-on-the-m1-in-1-month.md
new file mode 100644
index 0000000..09f970b
--- /dev/null
+++ b/content/blog/2024/06/05-vk13-on-the-m1-in-1-month.md
@@ -0,0 +1,412 @@
+++
date = "2024-06-05T12:00:00+09:00"
draft = false
title = "Vulkan 1.3 on the M1 in 1 month"
slug = "vk13-on-the-m1-in-1-month"
author = "Alyssa Rosenzweig"
+++

Finally, conformant Vulkan for the M1! The new "Honeykrisp" driver is the first [conformant Vulkan®](https://www.khronos.org/conformance/adopters/conformant-products/vulkan#submission_780) for Apple hardware on any operating system, implementing the full 1.3 spec without "portability" waivers.

Honeykrisp is **not yet released** for end users. We're continuing to add features, improve performance, and port to more hardware. [Source code](https://gitlab.freedesktop.org/alyssa/mesa/-/tree/honeykrisp-20240506-2/src/asahi/vulkan?ref_type=heads) is available for developers.

[![HoloCure running on Honeykrisp ft. DXVK, FEX, and Proton](/img/blog/2024/06/holocure.avif)](/img/blog/2024/06/holocure.png)

Honeykrisp is not based on prior M1 Vulkan efforts, but rather [Faith Ekstrand](https://mastodon.gamedev.place/@gfxstrand)'s open source [NVK driver](https://www.collabora.com/news-and-blog/news-and-events/introducing-nvk.html) for NVIDIA GPUs. In her words:

> All Vulkan drivers in Mesa trace their lineage to the Intel Vulkan driver and started by copying+pasting from it. My hope is that NVK will eventually become the driver that everyone copies and pastes from. To that end, I'm building NVK with all the best practices we've developed for Vulkan drivers over the last 7.5 years and trying to keep the code-base clean and well-organized.

Why spend years implementing features from scratch when we can reuse NVK? There will be friction starting out, given NVIDIA's desktop architecture differs from the M1's mobile roots. In exchange, we get a modern driver designed for desktop games.

We'll need to pass a half-million tests ensuring correctness, [submit the results](https://www.khronos.org/conformance/adopters), and then we'll become conformant after 30 days of industry review. Starting from NVK and our OpenGL 4.6 driver... can we write a driver passing the Vulkan 1.3 conformance test suite *faster* than the 30 day review period?

It's unprecedented...

Challenge accepted.

### April 2

It begins with a text.

> _Faith... I think I want to write a Vulkan driver._

Her advice?

> _Just start typing._

There's no copy-pasting yet -- we just add M1 code to NVK and remove NVIDIA as we go. Since the kernel mediates our access to the hardware, we begin connecting "NVK" to [Asahi Lina](https://vt.social/@lina)'s kernel driver using code shared with OpenGL. Then we plug in our shader compiler and hit the hay.

### April 3

To access resources, GPUs use "descriptors" containing the address, format, and size of a resource. Vulkan bundles descriptors into "sets" per the application's "descriptor set layout". When compiling shaders, the driver lowers descriptor accesses to marry the set layout with the hardware's data structures. As our descriptors differ from NVIDIA's, our next task is adapting NVK's descriptor set lowering. We start with a simple but correct approach, deleting far more code than we add.

### April 4

With working descriptors, we can compile compute shaders. Now we program the fixed-function hardware to dispatch compute. We first add bookkeeping to map Vulkan command buffers to lists of M1 "control streams", then we generate a compute control stream. We copy that code from our OpenGL driver, translate the GL into Vulkan, and compute works.

That's enough to move on to "copies" of buffers and images. We implement Vulkan's copies with compute shaders, internally dispatched with Vulkan commands as if we were the application. The first copy test passes.

### April 5

Fleshing out yesterday's code, *all* copy tests pass.

### April 6

We're ready to tackle graphics. The novelty is handling graphics state like depth/stencil. That's straightforward, but there's a *lot* of state to handle. Faith's code collects all "dynamic state" into a single structure, which we translate into hardware control words. As usual, we grab that translation from our OpenGL driver, blend with NVK, and move on.

### April 7

What makes state "dynamic"? Dynamic state can change without recompiling shaders. By contrast, static state is baked into shader binaries called "pipelines". If games create all their pipelines during a loading screen, there is no compiler "stutter" during gameplay. The idea hasn't quite panned out: many game developers don't know their state ahead-of-time so cannot create pipelines early. In response, Vulkan has [made](https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_extended_dynamic_state.html) [ever](https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_extended_dynamic_state2.html) [more](https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_extended_dynamic_state3.html) [state](https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_vertex_input_dynamic_state.html) [dynamic](https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_graphics_pipeline_library.html), punctuated with the [`EXT_shader_object`](https://www.khronos.org/blog/you-can-use-vulkan-without-pipelines-today) extension that makes pipelines *optional*.

We want full dynamic state and shader objects. Unfortunately, the M1 bakes random state into shaders: vertex attributes, fragment outputs, blending, even linked interpolation qualifiers. Like most of the industry in the 2010s, the M1's designers bet on pipelines.

Faced with this hardware, a reasonable driver developer would double down on pipelines. DXVK would stutter, but we'd pass conformance.

I am not reasonable.

To eliminate stuttering in OpenGL, we make state dynamic with four strategies:

* Conditional code.
* Precompiled variants.
* Indirection.
* Prologs and epilogs.

Wait, what-a-logs?

AMD also bakes state into shaders... with a twist. They divide the hardware binary into three parts: a *prolog*, the shader, and an *epilog*. Confining dynamic state to the periphery eliminates shader variants. They compile prologs and epilogs on the fly, but that's fast and doesn't stutter. Linking shader parts is a quick concatenation, or long jumps avoid linking altogether. This strategy works for the M1, too.

For Honeykrisp, let's follow NVK's lead and treat _all_ state as dynamic. No other Vulkan driver has implemented full dynamic state and shader objects this early on, but it avoids refactoring later. Today we add the code to build, compile, and cache prologs and epilogs.
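To make "build, compile, and cache" slightly more concrete, here is a deliberately simplified C sketch of the idea. The key layout and the `compile_epilog()` entry point are invented for illustration; the real driver's keys, caching, and locking are more involved:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* State the hardware wants baked into the fragment shader is confined to a
 * small "epilog", keyed by exactly that state. Epilogs are compiled on first
 * use and cached, so changing blend state or render target formats never
 * recompiles the (large) main shader. */
struct epilog_key {
    uint32_t rt_format[8];      /* render target formats             */
    uint32_t blend_enable_mask; /* one bit per enabled render target */
};

struct epilog_entry {
    struct epilog_key key;
    void *binary;               /* compiled hardware code */
};

extern void *compile_epilog(const struct epilog_key *key); /* small and fast */

static struct epilog_entry cache[64];
static unsigned cache_len;

void *get_epilog(const struct epilog_key *key)
{
    for (unsigned i = 0; i < cache_len; ++i)
        if (memcmp(&cache[i].key, key, sizeof(*key)) == 0)
            return cache[i].binary;

    assert(cache_len < 64); /* a real driver uses a hash table and a lock */
    struct epilog_entry *e = &cache[cache_len++];
    e->key = *key;
    e->binary = compile_epilog(key);
    return e->binary;
}
```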
Putting it together, we get a (dynamic) triangle:

[![Classic rainbow triangle](/img/blog/2024/06/hk-triangle.avif)](/img/blog/2024/06/hk-triangle.png)

### April 8

Guided by the list of failing tests, we wire up the little bits missed along the way, like translating border colours.

```c
/* Translate an American VkBorderColor into a Canadian agx_border_colour */
enum agx_border_colour
translate_border_color(VkBorderColor color)
{
   switch (color) {
   case VK_BORDER_COLOR_INT_TRANSPARENT_BLACK:
      return AGX_BORDER_COLOUR_TRANSPARENT_BLACK;
   ...
   }
}
```

Test results are getting there.

> **Pass**: 149770, **Fail**: 7741, **Crash**: 2396

That's good enough for [vkQuake](https://github.com/Novum/vkQuake).

[![Vulkan port of Quake running on Honeykrisp](/img/blog/2024/06/vkquake.avif)](/img/blog/2024/06/vkquake.png)

### April 9

Lots of little fixes bring us to a 99.6% pass rate... for Vulkan 1.1. Why stop there? NVK is 1.3 conformant, so let's claim 1.3 and skip to the finish line.

> **Pass**: 255209, **Fail**: 3818, **Crash**: 599

98.3% pass rate for 1.3 on our 1 week anniversary.

Not bad.

### April 10

SuperTuxKart has a Vulkan renderer.

[![SuperTuxKart rendering with Honeykrisp, showing Pepper (from Pepper and Carrot) riding her broomstick in the STK Enterprise](/img/blog/2024/06/hkr-stk.avif)](/img/blog/2024/06/hkr-stk.png)

### April 11

[Zink](https://docs.mesa3d.org/drivers/zink.html) works too.

[![SuperTuxKart rendering with Zink on Honeykrisp, same scene but with better lighting](/img/blog/2024/06/hkr-stk-zink.avif)](/img/blog/2024/06/hkr-stk-zink.png)

### April 12

I tracked down some fails to a test bug, where an arbitrary verification threshold was too strict to pass on some devices. I filed a bug report, and it's [resolved](https://github.com/KhronosGroup/VK-GL-CTS/commit/5fd73c841d775dff1ad52d8340d79dc120d64696) within a few weeks.

### April 16

The tests for "descriptor indexing" revealed a compiler bug affecting subgroup shuffles in non-uniform control flow. The M1's shuffle instruction is quirky, but it's easy to work around. Fixing that fixes the descriptor indexing tests.

### April 17

A few tests crash inside our register allocator. Their shaders contain a peculiar construction:

```c
if (condition) {
   while (true) { }
}
```

`condition` is always false, but the compiler doesn't know that.

Infinite loops are nominally invalid since shaders must terminate in finite time, but this shader is syntactically valid. "All loops contain a break" seems obvious for a shader, but it's false. It's straightforward to fix register allocation, but what a doozy.

### April 18

Remember copies? They're slow, and every frame currently requires a copy to get on screen.

For "zero copy" rendering, we need enough Linux window system integration to negotiate an efficient surface layout across process boundaries. Linux uses "modifiers" for this purpose, so we implement the [`EXT_image_drm_format_modifier`](https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_image_drm_format_modifier.html) extension. And by implement, I mean copy.

Copies to avoid copies.
### April 20

> _"I'd like a 4K x86 Windows Direct3D PC game on a 16K arm64 Linux Vulkan Mac."_
>
> ...
>
> _"Ma'am, this is a Wendy's."_

### April 22

As bug fixing slows down, we step back and check our driver architecture. Since we treat all state as dynamic, we don't pre-pack control words during pipeline creation. That adds theoretical CPU overhead.

Is that a problem? After some optimization, [vkoverhead](https://github.com/zmike/vkoverhead) says we're pushing 100 million draws per second.

I think we're okay.

### April 24

Time to light up YCbCr. If we don't use special YCbCr hardware, this feature is "software-only". However, it touches a *lot* of code.

It touches so much code that [Mohamed Ahmed](https://mohamexiety.github.io/posts/final_report/) spent an entire summer adding it to NVK.

Which means he spent a summer adding it to Honeykrisp.

Thanks, Mohamed ;-)

### April 25

Query copies are next. In Vulkan, the application can query the number of samples rendered, writing the result into an opaque "query pool". The result can be copied from the query pool on the CPU or GPU.

For the CPU, the driver maps the pool's internal data structure and copies the result. This may require nontrivial repacking.

For the GPU, we need to repack in a compute shader. That's harder, because we can't just run C code on the GPU, right?

...Actually, we can.

A little witchcraft makes GPU query copies as easy as C.

```c
void copy_query(struct params *p, int i) {
   uintptr_t dst = p->dest + i * p->stride;
   int query = p->first + i;

   if (p->available[query] || p->partial) {
      int q = p->index[query];
      write_result(dst, p->_64, p->results[q]);
   }

   ...
}
```

### April 26

The final boss: border colours, hard mode.

Direct3D lets the application choose an arbitrary border colour when creating a sampler. By contrast, Vulkan only requires three border colours:

* **`(0, 0, 0, 0)`** -- transparent black
* **`(0, 0, 0, 1)`** -- opaque black
* **`(1, 1, 1, 1)`** -- opaque white

We handled these on April 8. Unfortunately, there are two problems.

First, we need custom border colours for Direct3D compatibility. Both [DXVK](https://github.com/doitsujin/dxvk) and [vkd3d-proton](https://github.com/HansKristian-Work/vkd3d-proton) require the [`EXT_custom_border_color`](https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_custom_border_color.html) extension.

Second, there's a subtle problem with our hardware, causing dozens of fails even without custom border colours. To understand the issue, let's revisit texture descriptors, which contain a pixel _format_ and a component reordering _swizzle_.

Some formats are implicitly reordered. Common "BGRA" formats swap red and blue for [historical reasons](https://stackoverflow.com/questions/74924790/why-bgra-instead-of-rgba). The M1 does not directly support these formats. Instead, the driver composes the swizzle with the format's reordering. If the application uses a `BARB` swizzle with a `BGRA` format, the driver uses an `RABR` swizzle with an `RGBA` format.

There's a catch: swizzles apply to the border colour, but formats do not. We need to *undo* the format reordering when programming the border colour for correct results after the hardware applies the composed swizzle. Our OpenGL driver implements border colours this way, because it knows the texture format when creating the sampler. Unfortunately, Vulkan doesn't give us that information.

Without custom border colour support, we "should" be okay. Swapping red and blue doesn't change anything if the colour is white or black.

There's an even *subtler* catch. Vulkan mandates support for a packed 16-bit format with 4-bit components. The M1 supports a similar format... but with reversed "endianness", swapping red and *alpha*.

That still seems okay. For transparent black (all zero) and opaque white (all one), swapping components doesn't change the result.

The problem is opaque black: (0, 0, 0, 1). Swapping red and alpha gives (1, 0, 0, 0). Transparent red? Uh-oh.

We're stuck. No known hardware configuration implements correct Vulkan semantics.

Is hope lost?

Do we give up?

A reasonable person would.

I am not reasonable.

Let's jump into the deep end. If we implement custom border colours, opaque black becomes a special case. But how? The M1's custom border colours entangle the texture format with the sampler. A reasonable person would skip Direct3D support.

As you know, I am not reasonable.

Although the hardware is unsuitable, we control software. Whenever a shader samples a texture, we'll inject code to fix up the border colour. This emulation is simple, correct, and slow. We'll use dirty driver tricks to speed it up later. For now, we eat the cost, advertise full custom border colours, and pass the opaque black tests.

### April 27

All that's left is some last minute bug fixing, and...

> **Pass**: 686930, **Fail**: 0

Success.

### The future

The next task is implementing everything that [DXVK](https://github.com/doitsujin/dxvk/blob/master/VP_DXVK_requirements.json) and [vkd3d-proton](https://github.com/HansKristian-Work/vkd3d-proton/blob/master/VP_D3D12_VKD3D_PROTON_profile.json) require to layer Direct3D. That includes esoteric extensions like [transform feedback](https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_transform_feedback.html). Then [Wine](https://www.winehq.org/) and an [open source x86 emulator](https://github.com/FEX-Emu/FEX) will run Windows games on [Asahi Linux](https://asahilinux.org/).

That's getting ahead of ourselves. In the meantime, enjoy Linux games with our [conformant OpenGL 4.6](/blog/conformant-gl46-on-the-m1.html) drivers... and stay tuned.
[![Baby Storm running on Honeykrisp ft. DXVK, FEX, and Proton](/img/blog/2024/06/babystorm.avif)](/img/blog/2024/06/babystorm.png)

---

diff --git a/static/img/blog/2023/08/vkinstancing.webp b/static/img/blog/2023/08/vkinstancing.webp
new file mode 100644
index 0000000..b3d47ed
Binary files /dev/null and b/static/img/blog/2023/08/vkinstancing.webp differ
diff --git a/static/img/blog/2023/08/vkinstancing2.webp b/static/img/blog/2023/08/vkinstancing2.webp
new file mode 100644
index 0000000..39d925e
Binary files /dev/null and b/static/img/blog/2023/08/vkinstancing2.webp differ
diff --git a/static/img/blog/2024/02/Blender-Wanderer-high.avif b/static/img/blog/2024/02/Blender-Wanderer-high.avif
new file mode 100644
index 0000000..60ae75f
Binary files /dev/null and b/static/img/blog/2024/02/Blender-Wanderer-high.avif differ
diff --git a/static/img/blog/2024/02/Blender-Wanderer.avif b/static/img/blog/2024/02/Blender-Wanderer.avif
new file mode 100644
index 0000000..ecd31da
Binary files /dev/null and b/static/img/blog/2024/02/Blender-Wanderer.avif differ
diff --git a/static/img/blog/2024/06/babystorm.avif b/static/img/blog/2024/06/babystorm.avif
new file mode 100644
index 0000000..e1132a5
Binary files /dev/null and b/static/img/blog/2024/06/babystorm.avif differ
diff --git a/static/img/blog/2024/06/babystorm.png b/static/img/blog/2024/06/babystorm.png
new file mode 100644
index 0000000..711231a
Binary files /dev/null and b/static/img/blog/2024/06/babystorm.png differ
diff --git a/static/img/blog/2024/06/hk-triangle.avif b/static/img/blog/2024/06/hk-triangle.avif
new file mode 100644
index 0000000..5c58531
Binary files /dev/null and b/static/img/blog/2024/06/hk-triangle.avif differ
diff --git a/static/img/blog/2024/06/hk-triangle.png b/static/img/blog/2024/06/hk-triangle.png
new file mode 100644
index 0000000..da8b2f9
Binary files /dev/null and b/static/img/blog/2024/06/hk-triangle.png differ
diff --git a/static/img/blog/2024/06/hkr-stk-zink.avif b/static/img/blog/2024/06/hkr-stk-zink.avif
new file mode 100644
index 0000000..31f8e24
Binary files /dev/null and b/static/img/blog/2024/06/hkr-stk-zink.avif differ
diff --git a/static/img/blog/2024/06/hkr-stk-zink.png b/static/img/blog/2024/06/hkr-stk-zink.png
new file mode 100644
index 0000000..5912aa6
Binary files /dev/null and b/static/img/blog/2024/06/hkr-stk-zink.png differ
diff --git a/static/img/blog/2024/06/hkr-stk.avif b/static/img/blog/2024/06/hkr-stk.avif
new file mode 100644
index 0000000..e8ce80a
Binary files /dev/null and b/static/img/blog/2024/06/hkr-stk.avif differ
diff --git a/static/img/blog/2024/06/hkr-stk.png b/static/img/blog/2024/06/hkr-stk.png
new file mode 100644
index 0000000..408f181
Binary files /dev/null and b/static/img/blog/2024/06/hkr-stk.png differ
diff --git a/static/img/blog/2024/06/holocure.avif b/static/img/blog/2024/06/holocure.avif
new file mode 100644
index 0000000..84832c4
Binary files /dev/null and b/static/img/blog/2024/06/holocure.avif differ
diff --git a/static/img/blog/2024/06/holocure.png b/static/img/blog/2024/06/holocure.png
new file mode 100644
index 0000000..a3929bb
Binary files /dev/null and b/static/img/blog/2024/06/holocure.png differ
diff --git a/static/img/blog/2024/06/vkquake.avif b/static/img/blog/2024/06/vkquake.avif
new file mode 100644
index 0000000..5e6967a
Binary files /dev/null and b/static/img/blog/2024/06/vkquake.avif differ
diff --git a/static/img/blog/2024/06/vkquake.png b/static/img/blog/2024/06/vkquake.png
new file mode 100644
index 0000000..7ce58c6
Binary files /dev/null and b/static/img/blog/2024/06/vkquake.png differ