Recommend cpython agnosticism in cpp custom op tutorial #3250

Merged: 8 commits, Jan 28, 2025

advanced_source/cpp_custom_ops.rst: 187 changes (150 additions, 37 deletions)
Using ``cpp_extension`` is as simple as writing the following ``setup.py``:

   setup(name="extension_cpp",
         ext_modules=[
             cpp_extension.CppExtension(
                 "extension_cpp",
                 ["muladd.cpp"],
                 # define Py_LIMITED_API with min version 3.9 to expose only the stable
                 # limited API subset from Python.h
                 extra_compile_args={"cxx": ["-DPy_LIMITED_API=0x03090000"]},
                 py_limited_api=True)],  # Build 1 wheel across multiple Python versions
         cmdclass={'build_ext': cpp_extension.BuildExtension},
         options={"bdist_wheel": {"py_limited_api": "cp39"}}  # 3.9 is minimum supported Python version
   )

If you need to compile CUDA code (for example, ``.cu`` files), then instead use
`torch.utils.cpp_extension.CUDAExtension <https://pytorch.org/docs/stable/cpp_extension.html#torch.utils.cpp_extension.CUDAExtension>`_.
Please see `extension-cpp <https://github.com/pytorch/extension-cpp>`_ for an
example of how this is set up.
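
For reference, below is a minimal sketch of what the CUDA variant of the
``setup.py`` above might look like; the ``muladd.cu`` file name is our
illustrative assumption, not a file from this tutorial:

.. code-block:: python

   from setuptools import setup
   from torch.utils import cpp_extension

   setup(name="extension_cpp",
         ext_modules=[
             cpp_extension.CUDAExtension(
                 "extension_cpp",
                 # .cu sources are routed through nvcc; file names illustrative
                 ["muladd.cpp", "muladd.cu"],
                 extra_compile_args={"cxx": ["-DPy_LIMITED_API=0x03090000"]},
                 py_limited_api=True)],
         cmdclass={'build_ext': cpp_extension.BuildExtension},
         options={"bdist_wheel": {"py_limited_api": "cp39"}}
   )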

The first ``setup.py`` example above represents what we refer to as a CPython agnostic wheel,
meaning we are building a single wheel that can be run across multiple CPython versions
(similar to pure Python packages); building such wheels is supported starting with PyTorch 2.6.
CPython agnosticism is desirable because it minimizes the number of wheels your custom library
needs to build and release. The minimum version we'd like to support is 3.9, since it is the
oldest CPython version currently supported, so we use the corresponding hex code and specifier
throughout the setup code. We suggest building the extension in the same environment as the
minimum CPython version you'd like to support to minimize unknown behavior, so, here, we build
the extension in a CPython 3.9 environment. Once built, this single wheel will be runnable in
any CPython environment 3.9+. To achieve this, there are three key lines to note.

The first is the specification of ``Py_LIMITED_API`` in ``extra_compile_args`` to the
minimum CPython version you would like to support:

.. code-block:: python

   extra_compile_args={"cxx": ["-DPy_LIMITED_API=0x03090000"]},

Defining the ``Py_LIMITED_API`` flag helps verify that the extension is in fact
only using the `CPython Stable Limited API <https://docs.python.org/3/c-api/stable.html>`_,
which is a requirement for building a CPython agnostic wheel. If this requirement
is not met, it is possible to build a wheel that looks CPython agnostic but will crash,
or worse, be silently incorrect, in another CPython environment. Take care to avoid
using unstable CPython APIs, for example APIs from libtorch_python (in particular,
the PyTorch/Python bindings), and to use only APIs from libtorch (ATen objects, operators,
and the dispatcher). For example, to give access to custom ops from Python, the library
should register the ops through the dispatcher (covered below!). We strongly recommend
defining the ``Py_LIMITED_API`` flag to help ascertain that the extension is compliant
and safe as a CPython agnostic wheel. Note that defining this flag is not a full guarantee
that the built wheel is CPython agnostic, but it is better than the wild wild west. There are
several caveats mentioned in the
`Python docs <https://docs.python.org/3/c-api/stable.html#limited-api-caveats>`_,
and you should test and verify yourself that the wheel is truly agnostic for the relevant
CPython versions.
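
As an aside, the value passed to ``Py_LIMITED_API`` uses the same layout as
CPython's ``PY_VERSION_HEX``: one byte each for the major and minor version,
followed by zeros. A small sketch (the helper name is ours, for illustration):

.. code-block:: python

   # 0xMMmm0000: MM is the major version byte, mm the minor version byte.
   def limited_api_hex(major: int, minor: int) -> str:
       return f"0x{major:02X}{minor:02X}0000"

   assert limited_api_hex(3, 9) == "0x03090000"   # the value used above
   assert limited_api_hex(3, 10) == "0x030A0000"  # if 3.10 were your minimum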

The second and third lines specifying ``py_limited_api`` inform setuptools that you intend
to build a CPython agnostic wheel and will influence the naming of the wheel accordingly:

.. code-block:: python

   setup(name="extension_cpp",
         ext_modules=[
             cpp_extension.CppExtension(
                 ...,
                 py_limited_api=True)],  # Build 1 wheel across multiple Python versions
         ...,
         options={"bdist_wheel": {"py_limited_api": "cp39"}}  # 3.9 is minimum supported Python version
   )

It is necessary to specify ``py_limited_api=True`` both as an argument to
``CppExtension``/``CUDAExtension`` and as an option to the ``"bdist_wheel"`` command
with the minimal supported CPython version (in this case, 3.9). Consequently, the
``setup`` in our tutorial builds one properly named wheel that can be installed
across multiple CPython versions ``>=3.9``. Please see
`torchao <https://github.com/pytorch/ao>`_ for an example.
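
One way to sanity-check the naming is to inspect the built artifact: a
limited-API build carries the ``cp39-abi3`` tag rather than a per-interpreter
tag. A small sketch, assuming the wheel was built into the default ``dist/``
directory:

.. code-block:: python

   from pathlib import Path

   # A CPython agnostic build is named like
   # extension_cpp-0.0.1-cp39-abi3-<platform>.whl, whereas a per-version
   # build would be tagged, e.g., cp39-cp39 or cp310-cp310.
   for whl in Path("dist").glob("*.whl"):
       assert "cp39-abi3" in whl.name, f"{whl.name} is not a limited-API wheel"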

If your extension uses CPython APIs outside the stable limited set, then you cannot
build a CPython agnostic wheel! You should build one wheel per CPython version instead,
like so:

.. code-block:: python

   setup(name="extension_cpp",
         ext_modules=[
             cpp_extension.CppExtension(
                 "extension_cpp",
                 ["muladd.cpp"])],
         cmdclass={'build_ext': cpp_extension.BuildExtension},
   )
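
In this case, each wheel is tagged for the exact interpreter that built it, so
you would repeat the build once per supported CPython version. A sketch of
inspecting the ABI tag the current interpreter would produce (the exact value
varies by platform):

.. code-block:: python

   import sysconfig

   # e.g. 'cpython-310-x86_64-linux-gnu': the extension suffix, and hence the
   # wheel, is bound to this specific interpreter version.
   print(sysconfig.get_config_var("SOABI"))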


Defining the custom op and adding backend implementations
---------------------------------------------------------
…matters (importing in the wrong order will lead to an error).

To use the custom operator with hybrid Python/C++ registrations, we must
first load the C++ library that holds the custom operator definition
and then call the ``torch.library`` registration APIs. This can happen in
three ways:


1. The first way to load the C++ library that holds the custom operator definition
is to define a dummy Python module for ``_C``. Then, in Python, when you import the
module with ``import _C``, the ``.so`` files corresponding to the extension will
be loaded, and the ``TORCH_LIBRARY`` and ``TORCH_LIBRARY_IMPL`` static initializers
will run. One can create a dummy Python module with ``PYBIND11_MODULE`` as shown below,
but you will notice that this does not compile with ``Py_LIMITED_API`` defined, because
``pybind11`` does not promise to use only the stable limited CPython API! With
the code below, you sadly cannot build a CPython agnostic wheel for your extension.
(Foreshadowing: I wonder what the second way is ;) ).

.. code-block:: cpp

   // in, say, not_agnostic/csrc/extension_BAD.cpp
   #include <pybind11/pybind11.h>

   PYBIND11_MODULE(_C, m) {}

.. code-block:: python

   # in, say, extension/__init__.py
   from . import _C

2. In this tutorial, because we value being able to build a single wheel across multiple
CPython versions, we will replace the unstable ``PYBIND11_MODULE`` call with stable API calls.
The below code compiles with ``-DPy_LIMITED_API=0x03090000`` and successfully creates
a dummy Python module for our ``_C`` extension so that it can be imported from Python.
See `extension_cpp/__init__.py <https://github.com/pytorch/extension-cpp/blob/38ec45e/extension_cpp/__init__.py>`_
and `extension_cpp/csrc/muladd.cpp <https://github.com/pytorch/extension-cpp/blob/38ec45e/extension_cpp/csrc/muladd.cpp>`_
for more details:

.. code-block:: cpp

   #include <Python.h>

   extern "C" {
     /* Creates a dummy empty _C module that can be imported from Python.
        The import from Python will load the .so consisting of this file
        in this extension, so that the TORCH_LIBRARY static initializers
        below are run. */
     PyObject* PyInit__C(void)
     {
         static struct PyModuleDef module_def = {
             PyModuleDef_HEAD_INIT,
             "_C",   /* name of module */
             NULL,   /* module documentation, may be NULL */
             -1,     /* size of per-interpreter state of the module,
                        or -1 if the module keeps state in global variables. */
             NULL,   /* methods */
         };
         return PyModule_Create(&module_def);
     }
   }

.. code-block:: python

   # in, say, extension/__init__.py
   from . import _C

3. If you want to avoid ``Python.h`` entirely in your C++ custom operator, you may
use ``torch.ops.load_library("/path/to/library.so")`` in Python to load the ``.so``
file(s) compiled from the extension. Note that, with this method, no ``_C``
Python module is created for the extension, so you cannot call ``import _C`` from Python.
Instead of relying on an import statement to trigger registration of the custom
operators, ``torch.ops.load_library("/path/to/library.so")`` will do the trick.
The challenge then shifts to understanding where the ``.so`` files are
located so that you can load them, which is not always trivial:

.. code-block:: python

   import torch
   from pathlib import Path

   so_files = list(Path(__file__).parent.glob("_C*.so"))
   assert (
       len(so_files) == 1
   ), f"Expected one _C*.so file, found {len(so_files)}"
   torch.ops.load_library(so_files[0])

   from . import ops
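
Whichever of the three mechanisms you use, once the ``.so`` is loaded, the
operators registered via ``TORCH_LIBRARY`` become reachable under the
``torch.ops`` namespace. A usage sketch, assuming the ``mymuladd`` operator
that this tutorial defines below:

.. code-block:: python

   import torch
   import extension_cpp  # triggers one of the loading mechanisms above

   a, b = torch.randn(3), torch.randn(3)
   # mymuladd(Tensor a, Tensor b, float c) -> Tensor, defined later in this tutorial
   out = torch.ops.extension_cpp.mymuladd(a, b, 1.0)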


Adding training (autograd) support for an operator
--------------------------------------------------