Musings on PyUnicodeObject

Posted by: Salman Ali

Original Post Date: March 20, 2016

Last modified: June 23, 2016

The reasons for my delving into CPython internals are a story for another day, yet quite early on it became clear that I needed to better understand how Python 3 deals with text. Most of us probably interact mainly with the Python interpreter via text input and output, and with Python 3 all of these interactions are encapsulated inside PyUnicodeObjects. Common advice is to learn by putting a bunch of print statements in the code, but print... what? Not being able to decipher these objects in CPython is akin to groping around the code with a metaphorical blindfold on.

To help me learn I searched the Web but was surprised at how little detail is out there. Philip Guo has a nice set of videos on YouTube on Python internals, yet those deal with Python 2, which handles text differently. Ultimately I had to roll my sleeves up, dive into the code, and tinker until I pieced a few things together.

I thought I'd share my learnings as I'm sure there are others out there who were wondering the same thing. I've found that sometimes it takes just a small yet key piece of the puzzle to make everything else fall into place in terms of understanding. This was certainly the case with me and I hope it may be for you as well. As a result, this blog page is geared towards relative newbies to CPython internals, yet if you have more experience with the subject matter I certainly welcome any constructive feedback.

So What Exactly is a PyUnicodeObject?

To understand PyUnicodeObject, you need to look in Include/unicodeobject.h, which defines it as below:

PyObject

⇑

PyASCIIObject

⇑

PyCompactUnicodeObject

⇑

PyUnicodeObject

I'm showing inheritance loosely here, as C doesn't formally support inheritance, although embedding each "parent" struct as the first member of its "child" and passing every object around as a PyObject * that gets cast to the concrete type works convincingly enough.
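
To make the struct-embedding trick concrete, here is a minimal sketch. BaseObject, DerivedObject, as_base and derived_length are hypothetical names invented for this illustration; they are not the CPython definitions, they only mimic the pattern of placing the "parent" struct first so that a pointer to the child can be treated as a pointer to the parent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only, not CPython's structs. */
typedef struct {
    long refcnt;            /* plays the role of the common object header */
    const char *type_name;
} BaseObject;

typedef struct {
    BaseObject base;        /* the "parent" is embedded as the FIRST member */
    long length;            /* a field that only the "child" has */
} DerivedObject;

/* "Upcast": a pointer to a struct also points at its first member. */
BaseObject *as_base(DerivedObject *d)
{
    assert(d != NULL);
    return (BaseObject *)d;
}

/* "Downcast": only valid when b really points at a DerivedObject. */
long derived_length(BaseObject *b)
{
    assert(b != NULL);
    return ((DerivedObject *)b)->length;
}
```

Because the parent sits at offset 0, both casts leave the address unchanged; only the type the compiler uses to interpret the bytes differs.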

For strings, a PyObject plays the role of an abstract base class: it only contributes the common object header, so PyASCIIObject is where everything string-specific starts. A PyASCIIObject contains the following:

PyObject header
length
hash
state
    interned
    kind
    compact
    ascii
    ready
    (and 24 bits of padding)
wstr

The key to understanding PyUnicodeObjects is in state, as explained in the comments in the struct definition of PyASCIIObject where the rules are all laid out.

      /* There are 4 forms of Unicode strings:

         - PyASCIIObject (compact ascii):
	           * kind = PyUnicode_1BYTE_KIND
	           * compact = 1
	           * ascii = 1
	           * ready = 1
	           * (length is the length of the utf8 and wstr strings)
	           * (data starts just after the structure)
	           * (since ASCII is decoded from UTF-8, the utf8 string are the data)

         - PyCompactUnicodeObject (compact):
	           * kind = PyUnicode_1BYTE_KIND, PyUnicode_2BYTE_KIND or
	             PyUnicode_4BYTE_KIND
	           * compact = 1
	           * ready = 1
	           * ascii = 0
	           * utf8 is not shared with data
	           * utf8_length = 0 if utf8 is NULL
	           * wstr is shared with data and wstr_length=length
	             if kind=PyUnicode_2BYTE_KIND and sizeof(wchar_t)=2
	             or if kind=PyUnicode_4BYTE_KIND and sizeof(wchar_t)=4
	           * wstr_length = 0 if wstr is NULL
	           * (data starts just after the structure)

         - PyUnicodeObject (legacy string, not ready):
	           * length = 0 (use wstr_length)
	           * hash = -1
	           * kind = PyUnicode_WCHAR_KIND
	           * compact = 0
	           * ascii = 0
	           * ready = 0
	           * interned = SSTATE_NOT_INTERNED
	           * wstr is not NULL
	           * data.any is NULL
	           * utf8 is NULL
	           * utf8_length = 0

         - PyUnicodeObject (legacy string, ready):
	           * kind = PyUnicode_1BYTE_KIND, PyUnicode_2BYTE_KIND or
	             PyUnicode_4BYTE_KIND
	           * compact = 0
	           * ready = 1
	           * data.any is not NULL
	           * utf8 is shared and utf8_length = length with data.any if ascii = 1
	           * utf8_length = 0 if utf8 is NULL
	           * wstr is shared with data.any and wstr_length = length
	             if kind=PyUnicode_2BYTE_KIND and sizeof(wchar_t)=2
	             or if kind=PyUnicode_4BYTE_KIND and sizeof(wchar_4)=4
	           * wstr_length = 0 if wstr is NULL

         Compact strings use only one memory block (structure + characters),
         whereas legacy strings use one block for the structure and one block
         for characters.

         Legacy strings are created by PyUnicode_FromUnicode() and
         PyUnicode_FromStringAndSize(NULL, size) functions. They become ready
         when PyUnicode_READY() is called.
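
Read as code, the rules in that comment come down to checking three of the state flags. The sketch below is my own illustration, not code from CPython: MockState is a hypothetical mirror of the state bit-field (8 flag bits plus 24 bits of padding), and describe_form is a made-up helper that names which of the four forms a given set of flags corresponds to.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the state bit-field, for illustration only. */
typedef struct {
    unsigned int interned : 2;
    unsigned int kind : 3;      /* 0 = wchar, 1, 2 or 4 bytes per character */
    unsigned int compact : 1;
    unsigned int ascii : 1;
    unsigned int ready : 1;
    unsigned int : 24;          /* unused padding bits */
} MockState;

/* Name the form of Unicode string implied by the flags. */
const char *describe_form(const MockState *s)
{
    assert(s != NULL);
    if (s->compact) {
        /* Compact strings: the characters follow the struct directly. */
        return s->ascii ? "compact ascii (PyASCIIObject)"
                        : "compact (PyCompactUnicodeObject)";
    }
    /* Legacy strings: the characters live in a separate memory block. */
    return s->ready ? "legacy, ready (PyUnicodeObject)"
                    : "legacy, not ready (PyUnicodeObject)";
}
```

Filling a MockState with the flag values printed by a debugger and passing it to describe_form tells you which struct to cast the object to before poking at its fields.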

An Example

Note: most of what I'm showing in the example is meant to expose what is fundamentally going on in the data structures of CPython, and I wouldn't recommend these techniques when programming against the Python/C API. Please refer to the Python/C API documentation for the helper functions and macros that are the proper way to work with PyUnicodeObjects programmatically. But these techniques might work well if you just want to put in a few quick print statements while debugging.

On to the example. Say you are interested in the LOAD_NAME opcode in the main ceval loop while a call to print is being executed. The first statement in the code that handles that opcode is:

PyObject *name = GETITEM(names, oparg);

If we were to examine name with gdb, we would see the following:

(gdb) print name->ob_type
$1 = (struct _typeobject *) 0x89f840 <PyUnicode_Type>

Confirming that this is a PyUnicodeObject

(gdb) print ((PyUnicodeObject *)name)->_base->_base->state
$2 = {interned = 1, kind = 1, compact = 1, ascii = 1, ready = 1}

Specifically this is a PyASCIIObject based on the state flags

(gdb) print ((PyASCIIObject *)name)->state
$3 = {interned = 1, kind = 1, compact = 1, ascii = 1, ready = 1}

The same flags are also reachable when name is cast to a PyASCIIObject.

(gdb) print *(PyASCIIObject *)name
$4 = {ob_base = {_ob_next = 0x7ffff7f63d00, _ob_prev = 0x7ffff7f63d78, ob_refcnt = 14, ob_type = 0x89f840 <PyUnicode_Type>}, length = 5, hash = 4032701448170679507, state = {interned = 1, kind = 1, compact = 1, ascii = 1, ready = 1}, wstr = 0x0}

Looking at the entire PyASCIIObject struct, you notice there's a problem: Where's the actual ASCII data?

(gdb) print (char *)(name + 1)
$5 = 0x7ffff7f5ff68 "\005"
(gdb) print (char *)(name + 2)
$6 = 0x7ffff7f5ff88 "print"

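Wait, say what? A quick recap on C pointer arithmetic: adding 1 to a pointer advances it by the size of the type it points to, not by one byte. Here name is a PyObject *, and in this build a PyObject is 32 bytes (0x20 in hex), which is why the two addresses above differ by 0x20.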

Per the comments in the PyASCIIObject struct shown above, the data starts just after the struct, and in this build a PyASCIIObject is 64 bytes long, exactly the size of two PyObjects. So that explains the (name + 2), but what's going on with (name + 1)? Well, a PyObject is 32 bytes, so (name + 1) points at the first field after the PyObject header in PyASCIIObject, which is length. Its value is 5, the length of "print", and gdb displays that byte as "\005" when told to treat it as a string.
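
The arithmetic can be reproduced outside of CPython with a small mock-up. Everything below is a sketch under stated assumptions: MockObject and MockASCIIObject are hypothetical structs that imitate the field order seen in the gdb output (including the _ob_next and _ob_prev pointers), and mock_ascii_new and mock_ascii_data are made-up helpers, not part of any real API. On a typical 64-bit platform the two structs should come out at 32 and 64 bytes, matching the numbers above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical imitation of the object header seen in the gdb output. */
typedef struct mock_object {
    struct mock_object *_ob_next;
    struct mock_object *_ob_prev;
    ptrdiff_t ob_refcnt;
    void *ob_type;
} MockObject;

/* Hypothetical imitation of the PyASCIIObject field order. */
typedef struct {
    MockObject ob_base;
    ptrdiff_t length;
    ptrdiff_t hash;
    unsigned int state;     /* stands in for the 32-bit flag field */
    wchar_t *wstr;
} MockASCIIObject;

/* Allocate ONE block holding the struct followed by the characters. */
MockObject *mock_ascii_new(const char *text)
{
    assert(text != NULL);
    size_t n = strlen(text);
    MockASCIIObject *s = calloc(1, sizeof(MockASCIIObject) + n + 1);
    if (s == NULL) {
        return NULL;
    }
    s->length = (ptrdiff_t)n;
    s->hash = -1;
    memcpy((char *)(s + 1), text, n + 1);   /* data starts just after the struct */
    return (MockObject *)s;
}

/* Layout-independent way to reach the data: step over one whole struct. */
const char *mock_ascii_data(MockObject *name)
{
    assert(name != NULL);
    return (const char *)((MockASCIIObject *)name + 1);
}
```

With a MockObject *name returned by mock_ascii_new("print"), the expression (name + 1) lands on the length field and (name + 2) lands on the characters, which mirrors the two gdb commands. Casting to the string struct first and adding 1, as mock_ascii_data does, avoids hard-coding how many headers fit in the struct.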

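That's it: add 2 to the PyObject pointer (or, equivalently, 1 to the same pointer cast to PyASCIIObject *) and you can access the underlying ASCII data! Keep in mind that the 32- and 64-byte sizes come from this particular build, whose PyObject header carries the extra _ob_next and _ob_prev pointers visible in the gdb output; on a build without them the header is smaller and the (name + 2) offset no longer holds, whereas the cast-and-add-1 form still does. Happy hacking!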

This work is licensed under a Creative Commons Attribution 4.0 International License