diff --git a/content/blog/2023/08/22-first-conformant-m1-gpu-driver.md b/content/blog/2023/08/22-first-conformant-m1-gpu-driver.md
new file mode 100644
index 0000000..f80a9f4
--- /dev/null
+++ b/content/blog/2023/08/22-first-conformant-m1-gpu-driver.md
@@ -0,0 +1,215 @@
++++
+date = "2023-08-22T23:30:00+09:00"
+draft = false
+title = "The first conformant M1 GPU driver"
+slug = "first-conformant-m1-gpu-driver"
+author = "Alyssa Rosenzweig"
++++
+
+Conformant OpenGL® ES 3.1 drivers are now available for M1- and M2-family GPUs.
+That means the drivers are compatible with any OpenGL ES 3.1 application.
+Interested? [Just install Linux!](https://fedora-asahi-remix.org/)
+
+For existing [Asahi Linux](https://asahilinux.org/) users,
+upgrade your system with `dnf upgrade` (Fedora) or `pacman -Syu` (Arch)
+for the latest drivers.
+
+Our reverse-engineered, free and [open source graphics
+drivers](https://gitlab.freedesktop.org/asahi/mesa) are the world's ***only***
+conformant OpenGL ES 3.1 implementation for M1- and M2-family graphics
+hardware. That means our driver passed tens of thousands of tests to
+demonstrate correctness and is now recognized by the industry.
+
+To become conformant, an "implementation" must pass the official conformance
+test suite, designed to verify every feature in the specification. The test
+results are submitted to Khronos, the standards body. After a [30-day review
+period](https://www.khronos.org/conformance/adopters/), if no issues are found,
+the implementation becomes conformant. The Khronos website lists all conformant
+implementations, including our drivers for the
+[M1](https://www.khronos.org/conformance/adopters/conformant-products/opengles#submission_1007),
+[M1
+Pro/Max/Ultra](https://www.khronos.org/conformance/adopters/conformant-products/opengles#submission_1014),
+[M2](https://www.khronos.org/conformance/adopters/conformant-products/opengles#submission_1016),
+and [M2
+Pro/Max](https://www.khronos.org/conformance/adopters/conformant-products/opengles#submission_1017).
+
+Today's milestone isn't just about OpenGL ES. We're releasing the first
+conformant implementation of *any* graphics standard for the M1. And we don't
+plan to stop here ;-)
+
+[![Teaser of the "Vulkan instancing" demo running on Asahi Linux](/img/blog/2023/08/vkinstancing2.webp)](/img/blog/2023/08/vkinstancing.webp)
+
+Unlike ours, the manufacturer's M1 drivers are unfortunately not conformant for _any_
+standard graphics API, whether Vulkan or OpenGL or OpenGL ES. That means that
+there is no guarantee that applications using the standards will work on your M1/M2 (if you're
+not running Linux). This isn't just a theoretical issue. Consider Vulkan.
+The third-party [MoltenVK](https://github.com/KhronosGroup/MoltenVK)
+layers a subset of Vulkan on top of the proprietary drivers. However, those drivers
+lack key functionality, breaking valid Vulkan applications. That hinders
+developers and users alike, if they haven't yet switched their M1/M2 computers
+to Linux.
+
+Why did *we* pursue standards conformance when the manufacturer did not? Above
+all, our commitment to quality. We want our users to know that they can depend
+on our Linux drivers. We want standard software to run without M1-specific
+hacks or porting. We want to set the right example for the ecosystem: the way forward is
+implementing open standards, conformant to the specifications, without
+compromises for "portability". We are not satisfied with proprietary
+drivers, proprietary APIs, and refusal to implement standards. The rest of the
+industry knows that progress comes from cross-vendor collaboration. We know it,
+too. Achieving conformance is a win for our community, for open source, and for
+open graphics.
+
+Of course, [Asahi Lina](https://vt.social/@lina/) and I are two individuals
+with minimal funding. It's a little awkward that we beat the big corporation...
+
+It's not too late though. They should follow our lead!
+
+---
+
+OpenGL ES 3.1 updates the experimental [OpenGL ES 3.0 and OpenGL
+3.1](/blog/opengl3-on-asahi-linux.html) we shipped in
+June. Notably, ES 3.1 adds compute shaders, typically used to accelerate
+general computations within graphics applications. For example, a 3D game could
+run its physics simulations in a compute shader. The simulation results can
+then be used for rendering, eliminating stalls that would otherwise be required
+to synchronize the GPU with a CPU physics simulation. That lets the game run
+faster.
+
+Let's zoom in on one new feature: atomics on images. Older versions of
+OpenGL ES allowed an application to read an image in order to display it on screen.
+ES 3.1 allows the application to *write* to the image, typically from a
+compute shader. This new feature enables flexible image processing algorithms, which
+previously needed to fit into the fixed-function 3D pipeline. However, GPUs
+are massively parallel, running thousands of threads at the same time. If two
+threads write to the same location, there is a conflict: depending which thread
+runs first, the result will be different. We have a race condition.
+
+"Atomic" access to memory provides a solution to race conditions. With atomics,
+special hardware in the memory subsystem guarantees consistent, well-defined
+results for select operations, regardless of the order of the threads. Modern
+graphics hardware supports various atomic operations, like addition,
+serving as building blocks to complex parallel algorithms.
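+
+The same idea exists on the CPU. As a rough analogy (plain C11 atomics, not
+GPU code), an atomic add gives a well-defined count no matter how the threads
+interleave:
+
+```c
+#include <stdatomic.h>
+
+/* Many threads bumping one counter: a plain `counter++` is a race, while
+ * atomic_fetch_add produces a well-defined result regardless of thread order. */
+static _Atomic unsigned int counter;
+
+void count_one(void)
+{
+    atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
+}
+```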
+
+Can we put these two features together to write to an image atomically?
+
+Yes. A ubiquitous OpenGL ES
+[extension](https://registry.khronos.org/OpenGL/extensions/OES/OES_shader_image_atomic.txt),
+required for ES 3.2, adds atomics operating on pixels in an image. For
+example, a compute shader could atomically increment the value at pixel (10,
+20).
+
+Other GPUs have dedicated instructions to perform atomics on images, making
+the driver implementation straightforward. For us, the story is more
+complicated. The M1 lacks hardware instructions for image atomics, even though
+it has non-image atomics and non-atomic images. We need to reframe the
+problem.
+
+The idea is simple: to perform an atomic on a pixel, we instead calculate
+the address of the pixel in memory and perform a regular atomic on that
+address. Since the hardware supports regular atomics, our task is "just"
+calculating the pixel's address.
+
+If the image were laid out linearly in memory, this would be straightforward:
+multiply the Y-coordinate by the number of bytes per row ("stride"), multiply
+the X-coordinate by the number of bytes per pixel, and add. That gives the
+pixel's offset in bytes relative to the first pixel of the image. To get the
+final address, we add that offset to the address of the first pixel.
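+
+In symbols:
+
+$$\text{address} = \text{base} + y \cdot \text{stride} + x \cdot \text{bytes per pixel}$$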
+
+
+
+Alas, images are rarely linear in memory. To improve cache
+efficiency, modern graphics hardware interleaves the X- and Y-coordinates.
+Instead of one row after the next, pixels in memory follow a [spiral-like
+curve](https://fgiesen.wordpress.com/2011/01/17/texture-tiling-and-swizzling/).
+
+We need to amend our previous equation to interleave the coordinates. We could
+use many instructions to mask one bit at a time, shifting to construct the
+interleaved result, but that's inefficient. We can do better.
+
+There is a well-known ["bit twiddling" algorithm to interleave
+bits](https://graphics.stanford.edu/~seander/bithacks.html#InterleaveBMN).
+Rather than shuffle one bit at a time, the algorithm shuffles groups of bits,
+parallelizing the problem. Implementing this algorithm in shader code improves
+performance.
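+
+In C, that algorithm looks roughly like this (an illustrative transcription of
+the bit hack, not our driver's code):
+
+```c
+#include <stdint.h>
+
+/* Interleave the bits of x and y: bit i of x lands in bit 2i of the result,
+ * bit i of y in bit 2i+1. Each step spreads groups of bits apart in parallel. */
+static uint32_t interleave_bits(uint16_t x, uint16_t y)
+{
+    uint32_t vx = x, vy = y;
+
+    vx = (vx | (vx << 8)) & 0x00ff00ff;
+    vx = (vx | (vx << 4)) & 0x0f0f0f0f;
+    vx = (vx | (vx << 2)) & 0x33333333;
+    vx = (vx | (vx << 1)) & 0x55555555;
+
+    vy = (vy | (vy << 8)) & 0x00ff00ff;
+    vy = (vy | (vy << 4)) & 0x0f0f0f0f;
+    vy = (vy | (vy << 2)) & 0x33333333;
+    vy = (vy | (vy << 1)) & 0x55555555;
+
+    return vx | (vy << 1);
+}
+```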
+
+In practice, only the lower 7 bits (or fewer) of each coordinate are
+interleaved. That lets us use 32-bit instructions to "vectorize" the
+interleave, by putting the X- and Y-coordinates in the low and high 16-bits of
+a 32-bit register. Those 32-bit instructions let us interleave X and Y at the
+same time, halving the instruction count. Plus, we can exploit the GPU's
+combined shift-and-add instruction. Putting the tricks together, we interleave
+in 10 instructions of M1 GPU assembly:
+
+```asm
+# Inputs x, y in r0l, r0h.
+# Output in r1.
+
+add r2, #0, r0, lsl 4
+or r1, r0, r2
+and r1, r1, #0xf0f0f0f
+add r2, #0, r1, lsl 2
+or r1, r1, r2
+and r1, r1, #0x33333333
+add r2, #0, r1, lsl 1
+or r1, r1, r2
+and r1, r1, #0x55555555
+add r1, r1l, r1h, lsl 1
+```
+
+We could stop here, but what if there's a *dedicated* instruction to interleave
+bits? PowerVR has a "shuffle" instruction
+[`shfl`](https://docs.imgtec.com/reference-manuals/powervr-instruction-set-reference/topics/bitwise-instructions/SHFL.html),
+and the M1 GPU borrows from PowerVR. Perhaps that instruction was borrowed too.
+Unfortunately, even if it was, the proprietary compiler won't use it when
+compiling our test shaders. That makes it difficult to reverse-engineer the
+instruction -- if it exists -- by observing compiled shaders.
+
+It's time to dust off a powerful reverse-engineering technique from
+magic kindergarten: guess and check.
+
+[Dougall Johnson](https://mastodon.social/@dougall) provided the guess.
+When considering the instructions we already know about, he took special notice
+of the "reverse bits" instruction. Since reversing bits is a type of bit
+shuffle, the interleave instruction should be encoded similarly. The bit
+reverse instruction has a two-bit field specifying the operation, with value
+`01`. Related instructions to _count the number of set bits_ and _find the
+first set bit_ have values `10` and `11` respectively. That encompasses all
+known "complex bit manipulation" instructions.
+
+There is one value of the two-bit enumeration that is unobserved and unknown:
+`00`. If this interleave instruction exists, it's probably encoded like the bit
+reverse but with operation code `00` instead of `01`.
+
+There's a difficulty: the three known instructions have one single input
+source, but our instruction interleaves two sources. Where does the second
+source go? We can make a guess based on symmetry. Presumably to simplify the
+hardware decoder, M1 GPU instructions usually encode their sources
+in consistent locations across instructions. The other three instructions have
+a gap where we would expect the second source to be, in a two-source
+arithmetic instruction. Probably the second source is there.
+
+Armed with a guess, it's our turn to check. Rather than handwrite GPU assembly,
+we can hack our compiler to replace some two-source integer operation (like
+multiply) with our guessed encoding of "interleave". Then we write a compute
+shader using this operation (by "multiplying" numbers) and run it with the
+newfangled compute support in our driver.
+
+All that's left is writing a
+[shader](/blog/interleave.shader_test) that checks that
+the mystery instruction returns the interleaved result for each possible input.
+Since the instruction takes two 16-bit sources, there are about 4 billion
+($2^{32}$) inputs. With our driver, the M1 GPU manages to check them all in under
+a second, and the verdict is in: this is our interleave instruction.
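+
+Conceptually, the check amounts to the following loop, written here in C for
+clarity (the real test is a compute shader using the hacked "multiply";
+`mystery_interleave` stands in for the guessed instruction):
+
+```c
+#include <stdbool.h>
+#include <stdint.h>
+
+uint32_t mystery_interleave(uint16_t x, uint16_t y); /* the guessed instruction */
+uint32_t interleave_bits(uint16_t x, uint16_t y);    /* software reference from above */
+
+/* Exhaustively compare the mystery instruction against the known-good
+ * software interleave for every pair of 16-bit inputs. */
+bool check_all_inputs(void)
+{
+    uint32_t i = 0;
+
+    do {
+        uint16_t x = i & 0xffff, y = i >> 16;
+
+        if (mystery_interleave(x, y) != interleave_bits(x, y))
+            return false;
+    } while (i++ != UINT32_MAX);
+
+    return true;
+}
+```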
+
+As for our clever vectorized assembly to interleave coordinates? We can replace
+it with one instruction. It's anticlimactic, but it's fast and it passes
+the conformance tests.
+
+And that's what matters.
+
+---
+
+_Thank you to [Khronos](https://www.khronos.org/) and [Software in the Public Interest](https://www.spi-inc.org/) for supporting open
+drivers._
diff --git a/content/blog/2024/02/14-conformant-gl46-on-the-m1.md b/content/blog/2024/02/14-conformant-gl46-on-the-m1.md
new file mode 100644
index 0000000..fa2ac9f
--- /dev/null
+++ b/content/blog/2024/02/14-conformant-gl46-on-the-m1.md
@@ -0,0 +1,308 @@
++++
+date = "2024-02-14T12:00:00+09:00"
+draft = false
+title = "Conformant OpenGL 4.6 on the M1"
+slug = "conformant-gl46-on-the-m1"
+author = "Alyssa Rosenzweig"
++++
+
+For years, the M1 has only supported OpenGL 4.1. That changes
+today -- with our release of full OpenGL® 4.6 and OpenGL® ES 3.2!
+[Install Fedora](https://fedora-asahi-remix.org/) for the latest M1/M2-series
+drivers.
+
+Already installed? Just `dnf upgrade --refresh`.
+
+Unlike the vendor's non-conformant 4.1 drivers, our [open
+source](https://gitlab.freedesktop.org/asahi/mesa) Linux drivers are
+**conformant** to the latest OpenGL versions, finally promising broad
+compatibility with modern OpenGL workloads, like
+[Blender](https://www.blender.org/).
+
+
+
+Conformant 4.6/3.2 drivers must pass over 100,000 tests to ensure correctness. The
+official list of conformant drivers now includes [our OpenGL
+4.6]()
+and [ES
+3.2](https://www.khronos.org/conformance/adopters/conformant-products/opengles#submission_1045).
+
+While the vendor doesn't yet support graphics standards like modern OpenGL, we do. For this Valentine's Day, we want to profess our love for
+interoperable open standards. We want to free users and developers from lock-in, enabling applications to run anywhere the heart wants without special
+ports. For that, we need standards conformance. Six months ago, we became the [first
+conformant driver for any standard graphics API for the
+M1](/blog/first-conformant-m1-gpu-driver.html) with the release of OpenGL ES
+3.1 drivers. Today, we've finished OpenGL with the full 4.6... and we're well
+on the road to Vulkan.
+
+---
+
+Compared to 4.1, OpenGL 4.6 adds dozens of required features, including:
+
+* Robustness
+* SPIR-V
+* [Clip control](/blog/asahi-gpu-part-6.html)
+* Cull distance
+* [Compute shaders](/blog/first-conformant-m1-gpu-driver.html)
+* Upgraded transform feedback
+
+Regrettably, the M1 doesn't map well to any graphics standard newer than OpenGL
+ES 3.1. While Vulkan makes some of these features optional, the missing features are
+required to layer DirectX and OpenGL on top. No existing solution on M1 gets
+past the OpenGL 4.1 feature set.
+
+How do we break the 4.1 barrier? Without hardware support, new features need
+new tricks. Geometry shaders, tessellation, and transform feedback become
+compute shaders. Cull distance becomes a transformed interpolated value. Clip
+control becomes a vertex shader epilogue. The list goes on.
+
+For a taste of the challenges we overcame, let's look at **robustness**.
+
+Built for gaming, GPUs traditionally prioritize raw performance over safety.
+Invalid application code, like a shader that reads a buffer out-of-bounds,
+can trigger undefined behaviour. Drivers exploit that to maximize performance.
+
+For applications like web browsers, that trade-off is undesirable.
+Browsers handle untrusted shaders, which they must sanitize to ensure stability
+and security. Clicking a malicious link should not crash the browser. While
+some sanitization is necessary as graphics APIs are not security barriers,
+reducing undefined behaviour in the API can assist "defence in depth".
+
+"Robustness" features can help. Without robustness, out-of-bounds buffer access
+in a shader can crash. With robustness, the application can opt for
+defined out-of-bounds behaviour, trading some performance for less attack
+surface.
+
+All modern cross-vendor APIs include robustness. Many games even
+(accidentally?) rely on robustness. Strangely, the vendor's proprietary
+API omits buffer robustness. We must do better for conformance, correctness,
+and compatibility.
+
+Let's first define the problem. Different APIs have different definitions of
+what an out-of-bounds load returns when robustness is enabled:
+
+* Zero (Direct3D, Vulkan with `robustBufferAccess2`)
+* Either zero or some data in the buffer (OpenGL, Vulkan with
+ `robustBufferAccess`)
+* Arbitrary values, but can't crash (OpenGL ES)
+
+OpenGL uses the second definition: return zero or data from the buffer.
+One approach is to return the *last* element of the buffer for
+out-of-bounds access. Given the buffer size, we can calculate the last index.
+Now consider the *minimum* of the index being accessed and the last index. That
+equals the index being accessed if it is valid, and some other valid index
+otherwise. Loading the minimum index is safe and gives a spec-compliant result.
+
+As an example, a uniform buffer load without robustness might look like:
+
+```asm
+load.i32 result, buffer, index
+```
+
+Robustness adds a single unsigned minimum (`umin`) instruction:
+
+```asm
+umin idx, index, last
+load.i32 result, buffer, idx
+```
+
+Is the robust version slower? It can be. The difference should be small
+percentage-wise, as arithmetic is faster than memory. With thousands
+of threads running in parallel, the arithmetic cost may even be hidden by the
+load's latency.
+
+There's another trick that speeds up robust uniform buffers. Like other GPUs,
+the M1 supports "preambles". The idea is simple: instead of calculating the
+same value in every thread, it's faster to calculate once and reuse the result.
+The compiler identifies eligible calculations and moves them to a preamble
+executed before the main shader. These redundancies are common, so preambles
+provide a nice speed-up.
+
+We usually move uniform buffer loads to the preamble when every thread loads
+the same index. Since the size of a uniform buffer is fixed, extra robustness
+arithmetic is *also* moved to the preamble. The robustness is "free" for the
+main shader. For robust storage buffers, the clamping might move to the
+preamble even if the load or store cannot.
+
+Armed with robust uniform and storage buffers, let's consider robust "vertex
+buffers". In graphics APIs, the application can set vertex buffers with a base
+GPU address and a chosen layout of "attributes" within each buffer. Each
+attribute has an offset and a format, and the buffer has a "stride" indicating
+the number of bytes per vertex. The vertex shader can then read attributes,
+implicitly indexing by the vertex. To do so, the shader loads the address:
+
+
+
+Some hardware implements robust vertex fetch natively. Other hardware has
+bounds-checked buffers to accelerate robust software vertex fetch.
+Unfortunately, the M1 has neither. We need to implement vertex fetch with raw
+memory loads.
+
+One instruction set feature helps. In addition to a 64-bit base address,
+the M1 GPU's memory loads also take an offset in *elements*. The hardware shifts the offset and
+adds to the 64-bit base to determine the address to
+fetch. Additionally, the M1 has a combined integer multiply-add instruction
+`imad`. Together, these features let us implement vertex loads in two
+instructions. For example, a 32-bit attribute load looks like:
+
+```asm
+imad idx, stride/4, vertex, offset/4
+load.i32 result, base, idx
+```
+
+The hardware load can perform an additional small shift. Suppose our attribute
+is a vector of 4 32-bit values, densely packed into a buffer with no offset. We can load that attribute in one instruction:
+
+```asm
+load.v4i32 result, base, vertex << 2
+```
+
+...with the hardware calculating the address:
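+
+$$\text{address} = \text{base} + 16 \cdot \text{vertex}$$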
+
+
+
+What about robustness?
+
+We want to implement robustness with a clamp, like we did for uniform
+buffers. The problem is that the vertex buffer size is given in bytes, while
+our optimized load takes an index in "vertices". A single vertex buffer can
+contain multiple attributes with different formats and offsets, so we can't
+convert the size in bytes to a size in "vertices".
+
+Let's handle the latter problem. We can rewrite the addressing equation as:
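+
+$$\text{base} + \text{stride} \cdot \text{vertex} + \text{offset} = (\text{base} + \text{offset}) + \text{stride} \cdot \text{vertex}$$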
+
+
+
+That is: one buffer with many attributes at different offsets is equivalent to
+many buffers with one attribute and no offset. This gives an alternate
+perspective on the same data layout. Is this an improvement? It avoids an
+addition in the shader, at the cost of passing more data -- addresses are
+64-bit while attribute offsets are
+[16-bit](https://vulkan.gpuinfo.org/listreports.php?limit=maxVertexInputAttributeOffset&value=4294967295&platform=all0).
+More importantly, it lets us translate the vertex buffer size in bytes into a
+size in "vertices" for *each* vertex attribute. Instead of clamping the offset,
+we clamp the vertex index. We still make full use of the hardware addressing
+modes, now with robustness:
+
+```asm
+umin idx, vertex, last valid
+load.v4i32 result, base, idx << 2
+```
+
+We need to calculate the last valid vertex index ahead-of-time for each
+attribute. Each attribute has a format with a particular size. Manipulating
+the addressing equation, we can calculate the last *byte* accessed in the
+buffer (plus 1) relative to the base:
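+
+$$\text{stride} \cdot \text{vertex} + \text{format size}$$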
+
+
+
+The load is valid when that value is bounded by the buffer size in bytes. We
+solve the integer inequality as:
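+
+$$\text{vertex} \le \left\lfloor \frac{\text{size} - \text{format size}}{\text{stride}} \right\rfloor$$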
+
+
+
+The driver calculates the right-hand side and passes it into the shader.
+
+One last problem: what if a buffer is too small to load *anything*? Clamping
+won't save us -- the code would clamp to a negative index. In that case,
+the attribute is entirely invalid, so we swap the application's buffer for a
+small buffer of zeroes. Since we gave each attribute its own base address,
+this determination is per-attribute. Then clamping the index to zero
+correctly loads zeroes.
+
+Putting it together, a little driver math gives us robust buffers at the
+cost of one `umin` instruction.
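+
+Concretely, the per-attribute setup looks something like this (a C sketch with
+illustrative names, not the actual Mesa code):
+
+```c
+#include <stdint.h>
+
+struct attr_desc {
+    uint64_t base;       /* buffer address plus the attribute's offset */
+    uint32_t last_valid; /* largest vertex index that stays in bounds */
+};
+
+/* Compute the per-attribute base address and last valid vertex index,
+ * falling back to a small zero-filled buffer when nothing can be loaded. */
+static struct attr_desc
+setup_robust_attribute(uint64_t buf_addr, uint32_t buf_size,
+                       uint32_t offset, uint32_t stride, uint32_t format_size,
+                       uint64_t zero_buffer_addr)
+{
+    struct attr_desc d = { .base = buf_addr + offset, .last_valid = 0 };
+
+    if (offset + format_size > buf_size) {
+        /* Too small to load anything: point at zeroes, clamp the index to 0. */
+        d.base = zero_buffer_addr;
+    } else if (stride > 0) {
+        /* Solve: stride * vertex + format_size <= buf_size - offset */
+        d.last_valid = (buf_size - offset - format_size) / stride;
+    }
+
+    return d;
+}
+```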
+
+---
+
+In addition to buffer robustness, we need image robustness. Like its buffer
+counterpart, image robustness requires that out-of-bounds image loads return
+zero. That formalizes a guarantee that reasonable hardware already makes.
+
+...But it would be no fun if our hardware was reasonable.
+
+Running the conformance tests for image robustness, there is a single
+test failure affecting "mipmapping".
+
+For background, mipmapped images contain multiple "levels of detail". The base
+level is the original image; each successive level is the previous level
+downscaled. When rendering, the hardware selects the level closest to matching
+the on-screen size, improving efficiency and visual quality.
+
+With robustness, the specifications all agree that image loads return...
+
+* Zero if the X- or Y-coordinate is out-of-bounds
+* Zero if the level is out-of-bounds
+
+Meanwhile, image loads on the M1 GPU return...
+
+* Zero if the X- or Y-coordinate is out-of-bounds
+* Values from the last level if the level is out-of-bounds
+
+Uh-oh. Rather than returning zero for out-of-bounds levels, the hardware
+clamps the level and returns nonzero values. It's a mystery why.
+The vendor does not document their hardware publicly, forcing us to
+rely on reverse engineering to build drivers. Without documentation,
+we don't know if this behaviour is intentional or a hardware bug. Either way,
+we need a workaround to pass conformance.
+
+The obvious workaround is to never load from an invalid level:
+
+```glsl
+if (level <= levels) {
+ return imageLoad(x, y, level);
+} else {
+ return 0;
+}
+```
+
+That involves branching, which is inefficient. Loading an out-of-bounds level
+doesn't crash, so we can speculatively load and then use a compare-and-select
+operation instead of branching:
+
+```glsl
+vec4 data = imageLoad(x, y, level);
+
+return (level <= levels) ? data : 0;
+```
+
+This workaround is okay, but it could be improved. While the M1 GPU has combined
+compare-and-select instructions, the instruction set is *scalar*. Each thread
+processes one value at a time, not a vector of multiple values. However, image
+loads return a vector of four components (red, green, blue, alpha). While the
+pseudo-code looks efficient, the resulting assembly is not:
+
+```asm
+image_load R, x, y, level
+ulesel R[0], level, levels, R[0], 0
+ulesel R[1], level, levels, R[1], 0
+ulesel R[2], level, levels, R[2], 0
+ulesel R[3], level, levels, R[3], 0
+```
+
+Fortunately, the vendor driver has a trick. We know the hardware returns zero
+if either X or Y is out-of-bounds, so we can *force* a zero output by *setting*
+X or Y out-of-bounds. As the maximum image size is 16384 pixels wide, any X
+greater than 16384 is out-of-bounds. That justifies an alternate workaround:
+
+```glsl
+bool valid = (level <= levels);
+int x_ = valid ? x : 20000;
+
+return imageLoad(x_, y, level);
+```
+
+Why is this better? We only change a single scalar, not a whole vector,
+compiling to compact scalar assembly:
+
+```asm
+ulesel x_, level, levels, x, #20000
+image_load R, x_, y, level
+```
+
+If we preload the constant to a uniform register, the workaround is a single
+instruction. That's optimal -- and it passes conformance.
+
+---
+
+_Blender ["Wanderer"](https://download.blender.org/demo/eevee/wanderer/wanderer.blend) demo by [Daniel Bystedt](https://www.artstation.com/dbystedt), licensed CC BY-SA._
diff --git a/content/blog/2024/06/05-vk13-on-the-m1-in-1-month.md b/content/blog/2024/06/05-vk13-on-the-m1-in-1-month.md
new file mode 100644
index 0000000..09f970b
--- /dev/null
+++ b/content/blog/2024/06/05-vk13-on-the-m1-in-1-month.md
@@ -0,0 +1,412 @@
++++
+date = "2024-06-05T12:00:00+09:00"
+draft = false
+title = "Vulkan 1.3 on the M1 in 1 month"
+slug = "vk13-on-the-m1-in-1-month"
+author = "Alyssa Rosenzweig"
++++
+
+
+
+Finally, conformant Vulkan for the M1! The new "Honeykrisp" driver is the first
+[conformant
+Vulkan®](https://www.khronos.org/conformance/adopters/conformant-products/vulkan#submission_780)
+for Apple hardware on any operating system, implementing the full 1.3 spec
+without "portability" waivers.
+
+Honeykrisp is **not yet released** for end users. We're
+continuing to add features, improve performance, and port to more hardware.
+[Source
+code](https://gitlab.freedesktop.org/alyssa/mesa/-/tree/honeykrisp-20240506-2/src/asahi/vulkan?ref_type=heads)
+is available for developers.
+
+
+
+Honeykrisp is not based on prior M1 Vulkan efforts, but rather
+[Faith Ekstrand](https://mastodon.gamedev.place/@gfxstrand)'s open source [NVK
+driver](https://www.collabora.com/news-and-blog/news-and-events/introducing-nvk.html)
+for NVIDIA GPUs. In her words:
+
+> All Vulkan drivers in Mesa trace their lineage to the Intel Vulkan
+> driver and started by copying+pasting from it. My hope is that NVK
+> will eventually become the driver that everyone copies and pastes from. To
+> that end, I'm building NVK with all the best practices we've developed for
+> Vulkan drivers over the last 7.5 years and trying to keep the code-base clean
+> and well-organized.
+
+Why spend years implementing features from scratch when we can reuse NVK?
+There will be friction starting out, given NVIDIA's desktop architecture
+differs from the M1's mobile roots. In exchange, we get a modern driver
+designed for desktop games.
+
+We'll need to pass a half-million tests ensuring correctness, [submit the
+results](https://www.khronos.org/conformance/adopters), and then we'll become
+conformant after 30 days of industry review. Starting from NVK and our OpenGL
+4.6 driver... can we write a driver passing the Vulkan 1.3 conformance test
+suite *faster* than the 30 day review period?
+
+It's unprecedented...
+
+Challenge accepted.
+
+### April 2
+
+It begins with a text.
+
+> _Faith... I think I want to write a Vulkan driver._
+
+Her advice?
+
+> _Just start typing._
+
+There's no copy-pasting yet -- we just add M1 code to NVK and
+remove NVIDIA as we go. Since the kernel mediates our access to the hardware, we
+begin connecting "NVK" to [Asahi Lina](https://vt.social/@lina)'s kernel
+driver using code shared with OpenGL. Then we plug in our shader
+compiler and hit the hay.
+
+### April 3
+
+To access resources, GPUs use "descriptors" containing the address, format, and
+size of a resource. Vulkan bundles descriptors into "sets" per the application's "descriptor
+set layout". When compiling shaders, the driver lowers descriptor accesses to
+marry the set layout with the hardware's data structures. As our descriptors
+differ from NVIDIA's, our next task is adapting NVK's descriptor set lowering.
+We start with a simple but correct approach, deleting far more code than we
+add.
+
+### April 4
+
+With working descriptors, we can compile compute shaders. Now we program
+the fixed-function hardware to dispatch compute. We first add
+bookkeeping to map Vulkan command buffers to lists of M1 "control streams",
+then we generate a compute control stream. We copy that code from our OpenGL
+driver, translate the GL into Vulkan, and compute works.
+
+That's enough to move on to "copies" of buffers and images. We implement
+Vulkan's copies with compute shaders, internally dispatched
+with Vulkan commands as if we were the application. The first copy test
+passes.
+
+### April 5
+
+Fleshing out yesterday's code, *all* copy tests pass.
+
+### April 6
+
+We're ready to tackle graphics. The novelty is handling graphics state like
+depth/stencil. That's straightforward, but there's a *lot*
+of state to handle. Faith's code collects all "dynamic state" into a single
+structure, which we translate into hardware control words. As usual, we grab
+that translation from our OpenGL driver, blend with NVK, and move on.
+
+### April 7
+
+What makes state "dynamic"? Dynamic state can change without
+recompiling shaders. By contrast, static state is baked into shader
+binaries called "pipelines". If games create all their pipelines
+during a loading screen, there is no compiler "stutter" during gameplay. The
+idea hasn't quite panned out: many game developers don't know their state
+ahead-of-time so cannot create pipelines early. In response, Vulkan has
+[made](https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_extended_dynamic_state.html)
+[ever](https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_extended_dynamic_state2.html)
+[more](https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_extended_dynamic_state3.html)
+[state](https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_vertex_input_dynamic_state.html)
+[dynamic](https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_graphics_pipeline_library.html), punctuated with the
+[`EXT_shader_object`](https://www.khronos.org/blog/you-can-use-vulkan-without-pipelines-today)
+extension that makes pipelines *optional*.
+
+We want full dynamic state and shader objects. Unfortunately, the M1 bakes
+random state into shaders: vertex attributes, fragment outputs, blending, even
+linked interpolation qualifiers. Like most of the industry in the 2010s, the
+M1's designers bet on pipelines.
+
+Faced with this hardware, a reasonable driver developer would double-down on
+pipelines. DXVK would stutter, but we'd pass conformance.
+
+I am not reasonable.
+
+To eliminate stuttering in OpenGL, we make state dynamic with four strategies:
+
+* Conditional code.
+* Precompiled variants.
+* Indirection.
+* Prologs and epilogs.
+
+Wait, what-a-logs?
+
+AMD also bakes state into shaders... with a twist. They divide the
+hardware binary into three parts: a *prolog*, the shader, and an *epilog*.
+Confining dynamic state to the periphery eliminates shader variants. They
+compile prologs and epilogs on the fly, but that's fast and doesn't stutter.
+Linking shader parts is a quick concatenation, or long jumps avoid linking
+altogether. This strategy works for the M1, too.
+
+For Honeykrisp, let's follow NVK's lead and treat _all_ state as dynamic.
+No other Vulkan driver has implemented full dynamic state and shader objects
+this early on, but it avoids refactoring later. Today we add the code to build,
+compile, and cache prologs and epilogs.
+
+Putting it together, we get a (dynamic) triangle:
+
+[![Classic rainbow triangle](/img/blog/2024/06/hk-triangle.avif)](/img/blog/2024/06/hk-triangle.png)
+
+### April 8
+
+Guided by the list of failing tests, we wire up the little bits missed along
+the way, like translating border colours.
+
+```c
+/* Translate an American VkBorderColor into a Canadian agx_border_colour */
+enum agx_border_colour
+translate_border_color(VkBorderColor color)
+{
+ switch (color) {
+ case VK_BORDER_COLOR_INT_TRANSPARENT_BLACK:
+ return AGX_BORDER_COLOUR_TRANSPARENT_BLACK;
+ ...
+ }
+}
+```
+
+Test results are getting there.
+
+> **Pass**: 149770, **Fail**: 7741, **Crash**: 2396
+
+That's good enough for [vkQuake](https://github.com/Novum/vkQuake).
+
+[![Vulkan port of Quake running on Honeykrisp](/img/blog/2024/06/vkquake.avif)](/img/blog/2024/06/vkquake.png)
+
+### April 9
+
+Lots of little fixes bring us to a 99.6% pass rate... for Vulkan 1.1. Why stop
+there? NVK is 1.3 conformant, so let's claim 1.3 and skip to the finish line.
+
+> **Pass**: 255209, **Fail**: 3818, **Crash**: 599
+
+98.3% pass rate for 1.3 on our 1 week anniversary.
+
+Not bad.
+
+### April 10
+
+SuperTuxKart has a Vulkan renderer.
+
+[![SuperTuxKart rendering with Honeykrisp, showing Pepper (from Pepper and Carrot) riding her broomstick in the STK Enterprise](/img/blog/2024/06/hkr-stk.avif)](/img/blog/2024/06/hkr-stk.png)
+
+### April 11
+
+[Zink](https://docs.mesa3d.org/drivers/zink.html) works too.
+
+[![SuperTuxKart rendering with Zink on Honeykrisp, same scene but with better lighting](/img/blog/2024/06/hkr-stk-zink.avif)](/img/blog/2024/06/hkr-stk-zink.png)
+
+### April 12
+
+I tracked down some fails to a test bug, where an arbitrary verification
+threshold was too strict to pass on some devices. I filed a bug report, and it
+was [resolved](https://github.com/KhronosGroup/VK-GL-CTS/commit/5fd73c841d775dff1ad52d8340d79dc120d64696)
+within a few weeks.
+
+### April 16
+
+The tests for "descriptor indexing" revealed a compiler bug affecting subgroup
+shuffles in non-uniform control flow. The M1's shuffle instruction is quirky,
+but it's easy to work around. Fixing that fixes the descriptor indexing tests.
+
+### April 17
+
+A few tests crash inside our register allocator. Their shaders contain a
+peculiar construction:
+
+```c
+if (condition) {
+ while (true) { }
+}
+```
+
+`condition` is always false, but the compiler doesn't know that.
+
+Infinite loops are nominally invalid since shaders must terminate in finite
+time, but this shader is syntactically valid. "All loops contain a break" seems
+obvious for a shader, but it's false. It's straightforward to fix register
+allocation, but what a doozy.
+
+### April 18
+
+Remember copies? They're slow, and every frame currently requires a copy to get
+on screen.
+
+For "zero copy" rendering, we need enough Linux window system integration to
+negotiate an efficient surface layout across process boundaries. Linux uses
+"modifiers" for this purpose, so we implement the
+[`EXT_image_drm_format_modifier`](https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_image_drm_format_modifier.html)
+extension. And by implement, I mean copy.
+
+Copies to avoid copies.
+
+### April 20
+
+> _"I'd like a 4K x86 Windows Direct3D PC game on a 16K arm64 Linux Vulkan Mac."_
+>
+> ...
+>
+> _"Ma'am, this is a Wendy's."_
+
+### April 22
+
+As bug fixing slows down, we step back and check our driver architecture.
+Since we treat all state as dynamic, we don't pre-pack control words during
+pipeline creation. That adds theoretical CPU overhead.
+
+Is that a problem? After some optimization,
+[vkoverhead](https://github.com/zmike/vkoverhead) says we're pushing 100
+million draws per second.
+
+I think we're okay.
+
+### April 24
+
+Time to light up YCbCr. If we don't use special YCbCr hardware,
+this feature is "software-only". However, it touches a *lot* of code.
+
+It touches so much code that [Mohamed
+Ahmed](https://mohamexiety.github.io/posts/final_report/) spent an entire
+summer adding it to NVK.
+
+Which means he spent a summer adding it to Honeykrisp.
+
+Thanks, Mohamed ;-)
+
+### April 25
+
+Query copies are next. In Vulkan, the application can query the number of samples rendered,
+writing the result into an opaque
+"query pool". The result can be copied from the query pool on the CPU or GPU.
+
+For the CPU, the driver maps the pool's internal data structure and copies the
+result. This may require nontrivial repacking.
+
+For the GPU, we need to repack in a compute shader. That's harder, because
+we can't just run C code on the GPU, right?
+
+...Actually, we can.
+
+A little witchcraft makes GPU query copies as easy as C.
+
+```c
+void copy_query(struct params *p, int i) {
+ uintptr_t dst = p->dest + i * p->stride;
+ int query = p->first + i;
+
+ if (p->available[query] || p->partial) {
+ int q = p->index[query];
+ write_result(dst, p->_64, p->results[q]);
+ }
+
+ ...
+}
+```
+
+### April 26
+
+The final boss: border colours, hard mode.
+
+Direct3D lets the application choose an arbitrary border colour when
+creating a sampler. By contrast, Vulkan only requires three border colours:
+
+* **`(0, 0, 0, 0)`** -- transparent black
+* **`(0, 0, 0, 1)`** -- opaque black
+* **`(1, 1, 1, 1)`** -- opaque white
+
+We handled these on April 8. Unfortunately, there are two problems.
+
+First, we need custom border colours for Direct3D compatibility. Both [DXVK](https://github.com/doitsujin/dxvk) and
+[vkd3d-proton](https://github.com/HansKristian-Work/vkd3d-proton) require the
+[`EXT_custom_border_color`](https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_custom_border_color.html)
+extension.
+
+Second, there's a subtle problem with our hardware, causing dozens of fails
+even without custom border colours. To understand the issue, let's revisit
+texture descriptors, which contain a
+pixel _format_ and a component reordering _swizzle_.
+
+Some formats are implicitly reordered. Common "BGRA" formats swap red and blue
+for [historical
+reasons](https://stackoverflow.com/questions/74924790/why-bgra-instead-of-rgba).
+The M1 does not directly support these formats. Instead, the driver composes
+the swizzle with the format's reordering. If the application uses a `BARB`
+swizzle with a `BGRA` format, the driver uses an `RABR` swizzle with an
+`RGBA` format.
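+
+In code, the composition is a simple lookup. Here is a C sketch with made-up
+names, not the driver's actual data structures:
+
+```c
+enum chan { R = 0, G = 1, B = 2, A = 3 };
+
+/* For a BGRA format emulated as RGBA, logical blue lives in physical red
+ * and logical red in physical blue. */
+static const enum chan bgra_reorder[4] = { [R] = B, [G] = G, [B] = R, [A] = A };
+
+/* Compose the application's swizzle with the format's implicit reordering. */
+static void compose_swizzle(enum chan out[4], const enum chan app[4],
+                            const enum chan reorder[4])
+{
+    for (int i = 0; i < 4; i++)
+        out[i] = reorder[app[i]];
+}
+
+/* compose_swizzle(out, (enum chan[4]){B, A, R, B}, bgra_reorder)
+ * yields {R, A, B, R}: the BARB-on-BGRA case described above. */
+```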
+
+There's a catch: swizzles apply to the border colour, but formats do not. We
+need to *undo* the format reordering when programming the border colour for
+correct results after the hardware applies the composed swizzle. Our OpenGL
+driver implements border colours this way, because it knows the texture format
+when creating the sampler. Unfortunately, Vulkan doesn't give us that
+information.
+
+Without custom border colour support, we "should" be okay. Swapping red
+and blue doesn't change anything if the colour is white or black.
+
+There's an even *subtler* catch. Vulkan mandates support for a
+packed 16-bit format with 4-bit components. The M1 supports a similar format...
+but with reversed "endianness", swapping red and *alpha*.
+
+That still seems okay. For transparent black (all zero) and opaque white (all
+one), swapping components doesn't change the result.
+
+The problem is opaque black: (0, 0, 0,
+1). Swapping red and alpha gives (1,
+0, 0, 0). Transparent red? Uh-oh.
+
+We're stuck. No known hardware configuration implements correct Vulkan
+semantics.
+
+Is hope lost?
+
+Do we give up?
+
+A reasonable person would.
+
+I am not reasonable.
+
+Let's jump into the deep end. If we implement custom border colours, opaque
+black becomes a special case. But how? The M1's custom border colours entangle
+the texture format with the sampler. A reasonable person would skip Direct3D
+support.
+
+As you know, I am not reasonable.
+
+Although the hardware is unsuitable, we control software. Whenever a shader
+samples a texture, we'll inject code to fix up the border colour. This
+emulation is simple, correct, and slow. We'll use dirty driver
+tricks to speed it up later. For now, we eat the cost, advertise full custom border
+colours, and pass the opaque black tests.
+
+### April 27
+
+All that's left is some last minute bug fixing, and...
+
+> **Pass**: 686930, **Fail**: 0
+
+Success.
+
+### The future
+
+The next task is implementing everything that
+[DXVK](https://github.com/doitsujin/dxvk/blob/master/VP_DXVK_requirements.json)
+and
+[vkd3d-proton](https://github.com/HansKristian-Work/vkd3d-proton/blob/master/VP_D3D12_VKD3D_PROTON_profile.json)
+require to layer Direct3D. That includes esoteric extensions like
+[transform feedback](https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VK_EXT_transform_feedback.html). Then [Wine](https://www.winehq.org/) and an [open source x86
+emulator](https://github.com/FEX-Emu/FEX) will run Windows games on [Asahi
+Linux](https://asahilinux.org/).
+
+That's getting ahead of ourselves. In the mean time, enjoy Linux games with
+our [conformant OpenGL
+4.6](/blog/conformant-gl46-on-the-m1.html) drivers... and
+stay tuned.
+
+
+
+---
diff --git a/static/img/blog/2023/08/vkinstancing.webp b/static/img/blog/2023/08/vkinstancing.webp
new file mode 100644
index 0000000..b3d47ed
Binary files /dev/null and b/static/img/blog/2023/08/vkinstancing.webp differ
diff --git a/static/img/blog/2023/08/vkinstancing2.webp b/static/img/blog/2023/08/vkinstancing2.webp
new file mode 100644
index 0000000..39d925e
Binary files /dev/null and b/static/img/blog/2023/08/vkinstancing2.webp differ
diff --git a/static/img/blog/2024/02/Blender-Wanderer-high.avif b/static/img/blog/2024/02/Blender-Wanderer-high.avif
new file mode 100644
index 0000000..60ae75f
Binary files /dev/null and b/static/img/blog/2024/02/Blender-Wanderer-high.avif differ
diff --git a/static/img/blog/2024/02/Blender-Wanderer.avif b/static/img/blog/2024/02/Blender-Wanderer.avif
new file mode 100644
index 0000000..ecd31da
Binary files /dev/null and b/static/img/blog/2024/02/Blender-Wanderer.avif differ
diff --git a/static/img/blog/2024/06/babystorm.avif b/static/img/blog/2024/06/babystorm.avif
new file mode 100644
index 0000000..e1132a5
Binary files /dev/null and b/static/img/blog/2024/06/babystorm.avif differ
diff --git a/static/img/blog/2024/06/babystorm.png b/static/img/blog/2024/06/babystorm.png
new file mode 100644
index 0000000..711231a
Binary files /dev/null and b/static/img/blog/2024/06/babystorm.png differ
diff --git a/static/img/blog/2024/06/hk-triangle.avif b/static/img/blog/2024/06/hk-triangle.avif
new file mode 100644
index 0000000..5c58531
Binary files /dev/null and b/static/img/blog/2024/06/hk-triangle.avif differ
diff --git a/static/img/blog/2024/06/hk-triangle.png b/static/img/blog/2024/06/hk-triangle.png
new file mode 100644
index 0000000..da8b2f9
Binary files /dev/null and b/static/img/blog/2024/06/hk-triangle.png differ
diff --git a/static/img/blog/2024/06/hkr-stk-zink.avif b/static/img/blog/2024/06/hkr-stk-zink.avif
new file mode 100644
index 0000000..31f8e24
Binary files /dev/null and b/static/img/blog/2024/06/hkr-stk-zink.avif differ
diff --git a/static/img/blog/2024/06/hkr-stk-zink.png b/static/img/blog/2024/06/hkr-stk-zink.png
new file mode 100644
index 0000000..5912aa6
Binary files /dev/null and b/static/img/blog/2024/06/hkr-stk-zink.png differ
diff --git a/static/img/blog/2024/06/hkr-stk.avif b/static/img/blog/2024/06/hkr-stk.avif
new file mode 100644
index 0000000..e8ce80a
Binary files /dev/null and b/static/img/blog/2024/06/hkr-stk.avif differ
diff --git a/static/img/blog/2024/06/hkr-stk.png b/static/img/blog/2024/06/hkr-stk.png
new file mode 100644
index 0000000..408f181
Binary files /dev/null and b/static/img/blog/2024/06/hkr-stk.png differ
diff --git a/static/img/blog/2024/06/holocure.avif b/static/img/blog/2024/06/holocure.avif
new file mode 100644
index 0000000..84832c4
Binary files /dev/null and b/static/img/blog/2024/06/holocure.avif differ
diff --git a/static/img/blog/2024/06/holocure.png b/static/img/blog/2024/06/holocure.png
new file mode 100644
index 0000000..a3929bb
Binary files /dev/null and b/static/img/blog/2024/06/holocure.png differ
diff --git a/static/img/blog/2024/06/vkquake.avif b/static/img/blog/2024/06/vkquake.avif
new file mode 100644
index 0000000..5e6967a
Binary files /dev/null and b/static/img/blog/2024/06/vkquake.avif differ
diff --git a/static/img/blog/2024/06/vkquake.png b/static/img/blog/2024/06/vkquake.png
new file mode 100644
index 0000000..7ce58c6
Binary files /dev/null and b/static/img/blog/2024/06/vkquake.png differ