[Question] Could use Python bindings instead of Cython #1887

penguin-wwy · 2024-10-16T10:51:51Z

From the code, the Buffer in Cython is just a wrapper around the Buffer in cpp. It might be beneficial to use Python bindings directly, as this could reduce performance overhead and the cost of code maintenance.

chaokunyang · 2024-10-16T11:59:28Z

You mean implement Buffer using python only instead of CBuffer? We implement some varint encoding in Cpp Buffer. I'm not sure using python will make it faster, maybe need some benchmarks

penguin-wwy · 2024-10-16T12:10:36Z

My suggestion is to use pybind11 or directly write a C-API to call the cpp fury::Buffer. The current Cython-generated C++ code still calls the fury::Buffer, which I think is unnecessary.

static CYTHON_INLINE int64_t __pyx_f_6pyfury_5_util_6Buffer_get_int64(struct __pyx_obj_6pyfury_5_util_Buffer *__pyx_v_self, uint32_t __pyx_v_offset, CYTHON_UNUSED int __pyx_skip_dispatch) {
  ...

  /* "pyfury/_util.pyx":140
 *     cpdef inline int64_t get_int64(self, uint32_t offset):
 *         self.check_bound(offset, <int32_t>8)
 *         return self.c_buffer.get().GetInt64(offset)             # <<<<<<<<<<<<<<
 * 
 *     cpdef inline float get_float(self, uint32_t offset):
 */
  __pyx_r = __pyx_v_self->c_buffer.get()->GetInt64(__pyx_v_offset);  // This is an indirect call, which is not really necessary.
  goto __pyx_L0;

  /* function exit code */
  ...
}

If you agree, I will try to provide a simple demo to verify the performance and code maintenance benefits.

chaokunyang · 2024-10-16T12:36:12Z

That would be great, if we get a big performance gain, we can compromise some code maintenance costs

penguin-wwy · 2024-10-23T11:15:53Z

Hi, I conducted comparative experiments, trying out pybind11, nanobind, and directly writing C-API code.

pybind11 had the worst performance, which aligns with my understanding. It doesn't perform any specific optimizations for different scenarios and has relatively complex type conversion operations. However, its maintenance code is the simplest. For code that only requires API binding, it can be written as follows:

PYBIND11_MODULE(_pyutil, util_mod) {
  py::class_<fury::Buffer>(util_mod, "Buffer")
      .def(py::init<>())
      .def("own_data", &fury::Buffer::own_data)
      .def("reserve", &fury::Buffer::Reserve)
      .def("put_bool", [](fury::Buffer &self, uint32_t offset,
                          bool v) { self.UnsafePutByte(offset, v); })
      .def("put_int8", [](fury::Buffer &self, uint32_t offset,
                          int8_t v) { self.UnsafePutByte(offset, v); })
      .def("get_bool", &fury::Buffer::GetBool)
      .def("get_int8", &fury::Buffer::GetInt8)
      ...
      .def_static("allocate", [](uint32_t size) { return fury::AllocateBuffer(size); });
}

Nanobind's performance is slightly better than Cython's, and its binding method is not much different from pybind11. However, it only supports Python 3.8+.
Directly writing C-API code can perform better than Cython if optimized for different versions (especially >= 3.11). However, is detrimental to the goal of maintaining code more easily. For example:
- Cython generates redundant checks when creating get_bool, and due to the unreasonable setting of ml_flag (it should choose METH_O instead of METH_FASTCALL | METH_KEYWORDS), parameter parsing also introduces additional overhead.

static PyObject *
cbuffer_get_bool(CBufferObject *self, PyObject *offset)
{
    long off_val = PyLong_AsLong(offset);
    assert(off_val <= UINT32_MAX);
    return self->buffer->GetBool(off_val) ? Py_NewRef(Py_True) : Py_NewRef(Py_False);
}

static PyMethodDef cbuffer_methods[] = {
    {"get_bool", (PyCFunction)cbuffer_get_bool, METH_O, nullptr},
    ...
    {NULL, NULL}           /* sentinel */
};

Additionally, after analyzing the Cython code, I found that some performance optimizations can be achieved by directly calling certain C-API functions in the .pyx file. The principle behind this is to use some higher-level knowledge to avoid Cython generating certain guard code. I will attempt to submit these optimizations as a PR in the future.

chaokunyang · 2024-10-23T14:53:16Z

Thanks for sharing so insightful experiments. I used to think pybind is fast since it avoid generating code like cython. It turns out it's the slowest, this superised me a little. Cython seems provide some annotations to help generate c++ code, are you going to employ such kind of optimization?

penguin-wwy added the question Further information is requested label Oct 16, 2024

penguin-wwy mentioned this issue Oct 24, 2024

feat(python): guide cython to optimize code generation #1905

Draft

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Question] Could use Python bindings instead of Cython #1887

[Question] Could use Python bindings instead of Cython #1887

penguin-wwy commented Oct 16, 2024 •

edited

Loading

chaokunyang commented Oct 16, 2024 •

edited

Loading

penguin-wwy commented Oct 16, 2024 •

edited

Loading

chaokunyang commented Oct 16, 2024 •

edited

Loading

penguin-wwy commented Oct 23, 2024

chaokunyang commented Oct 23, 2024

[Question] Could use Python bindings instead of Cython #1887

[Question] Could use Python bindings instead of Cython #1887

Comments

penguin-wwy commented Oct 16, 2024 • edited Loading

chaokunyang commented Oct 16, 2024 • edited Loading

penguin-wwy commented Oct 16, 2024 • edited Loading

chaokunyang commented Oct 16, 2024 • edited Loading

penguin-wwy commented Oct 23, 2024

chaokunyang commented Oct 23, 2024

penguin-wwy commented Oct 16, 2024 •

edited

Loading

chaokunyang commented Oct 16, 2024 •

edited

Loading

penguin-wwy commented Oct 16, 2024 •

edited

Loading

chaokunyang commented Oct 16, 2024 •

edited

Loading