use internal str Py_UNICODE array, avoid extra encoding and copy
anthrotype committed Dec 11, 2018
1 parent d381921 commit ebdf61e
Showing 1 changed file with 6 additions and 11 deletions.
17 changes: 6 additions & 11 deletions src/uharfbuzz/_harfbuzz.pyx
@@ -1,10 +1,10 @@
#cython: language_level=3
from charfbuzz cimport *
from cpython.unicode cimport PyUnicode_AS_UNICODE, PyUnicode_GET_SIZE
from libc.stdint cimport uint16_t, uint32_t
from libc.stdlib cimport free, malloc
from libc.string cimport const_char
from typing import Callable, Dict, List, Tuple
import array
from cpython cimport array


cdef bint PY_NARROW_UNICODE = sizeof(Py_UNICODE) != 4
@@ -181,29 +181,24 @@ cdef class Buffer:

def add_str(self, text: str,
item_offset: int = None, item_length: int = None) -> None:
cdef array.array packed
# handle both "wide" and "narrow" python builds; strip the BOM
if PY_NARROW_UNICODE:
packed = array.array("H", text.encode("UTF-16")[2:])
else:
packed = array.array("I", text.encode("UTF-32")[4:])
cdef unsigned int size = len(packed)
cdef Py_UNICODE* array = PyUnicode_AS_UNICODE(text)

khaledhosny (Collaborator) commented on Dec 11, 2018:
PyUnicode_AS_UNICODE() is actually deprecated and can also return NULL, see https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_AS_UNICODE.

anthrotype (Author, Member) commented on Dec 11, 2018:

Right. After PEP 393 (Python >= 3.3), the internal Unicode representation no longer uses Py_UNICODE* (a 2- or 4-byte wchar_t) but a more compact layout of 1, 2 or 4 bytes per character, depending on the highest code point in the string.
Calling the deprecated API (which is guaranteed to stay until Python 4.0) on Python 3 no longer reads the internal buffer directly but has to create a separate wchar_t copy (hence it may fail and return NULL if there's no memory left).

I would prefer to just check for NULL right now. The alternative is to go back to using array.array and do the explicit encoding (btw, we haven't thought about endianness..). That's probably fine.
We can't just take the inner void* with PyUnicode_DATA because here we need to explicitly get either an array of uint16_t (for narrow py2 builds) or uint32_t, whereas the inner data can be an array of Py_UCS1, Py_UCS2 or Py_UCS4 depending on the PyUnicode_KIND.
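(For reference, a minimal pure-Python sketch of that array.array alternative with the endianness handled explicitly: selecting UTF-32-LE or UTF-32-BE from sys.byteorder means no BOM is emitted and no slicing is needed. The helper name is illustrative, not part of the diff.)

```python
import sys
from array import array

def utf32_units(text: str) -> array:
    # Encode with an explicit byte order matching the host: the plain
    # "UTF-32" codec prepends a BOM, the byte-order-suffixed codecs do not.
    endian = "le" if sys.byteorder == "little" else "be"
    return array("I", text.encode(f"UTF-32-{endian}"))

units = utf32_units("a\u00e9\U0001F600")
# each array item is now one code point, ready for hb_buffer_add_utf32
```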

There's PyUnicode_READ to get a Py_UCS4 codepoint (one at a time) from the inner data obtained by PyUnicode_DATA(). We could call that in a loop to initialize our temporary array.array of unsigned integers and then call hb_buffer_add_codepoints.
However, this new API for Unicode strings from PEP 393 only works on Python 3.3 and above.
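(A pure-Python sketch of that loop: in the C API it would call PyUnicode_READ over the buffer returned by PyUnicode_DATA(); here ord() stands in for PyUnicode_READ, and the harfbuzz call is only indicated in a comment.)

```python
from array import array

def codepoints(text: str) -> array:
    # One Py_UCS4 code point per character, regardless of the
    # internal PyUnicode_KIND (1, 2 or 4 bytes per character).
    return array("I", (ord(ch) for ch in text))

cp = codepoints("a\u00e9\U0001F600")
# hypothetical Cython-side follow-up:
# hb_buffer_add_codepoints(self._hb_buffer, cp.data.as_uints,
#                          len(cp), item_offset, item_length)
```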

I may just do that, and thus kill Python 2.7 support (which never really existed).

khaledhosny (Collaborator) commented on Dec 11, 2018:

I think you can check the PyUnicode_KIND and then call add_utf8, add_utf16 or add_utf32 as appropriate; the cluster indices would still match the Python string indices, and you avoid recreating the buffer.
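(A rough pure-Python sketch of that dispatch: unicode_kind mimics what PyUnicode_KIND reports for a given string, and the returned names only indicate which harfbuzz entry point would be called; both helpers are hypothetical.)

```python
def unicode_kind(text: str) -> int:
    # Width in bytes of the compact PEP 393 representation,
    # i.e. what PyUnicode_KIND reports: 1, 2 or 4.
    highest = max(map(ord, text), default=0)
    if highest < 0x100:
        return 1
    if highest < 0x10000:
        return 2
    return 4

def pick_add_func(text: str) -> str:
    # Hypothetical dispatch table, per the suggestion above.
    return {1: "hb_buffer_add_utf8",
            2: "hb_buffer_add_utf16",
            4: "hb_buffer_add_utf32"}[unicode_kind(text)]
```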

cdef Py_ssize_t size = PyUnicode_GET_SIZE(text)
if item_offset is None:
item_offset = 0
if item_length is None:
item_length = size
if PY_NARROW_UNICODE:
hb_buffer_add_utf16(
self._hb_buffer,
packed.data.as_ushorts,
<uint16_t*>array,
size,
item_offset,
item_length,
)
else:
hb_buffer_add_utf32(
self._hb_buffer,
packed.data.as_uints,
<uint32_t*>array,
size,
item_offset,
item_length,

0 comments on commit ebdf61e
