Page requests to the metal - Backend implementation
This post is part of the page requests to the metal series, where we cover all the steps from a web request down to the electrons that move in the CPU. See the intro/overview page to see where this fits in.
Language implementation/execution
From this point on we have left the realm of "web frameworks" and are now looking at the Python language implementation details that underpin many different paradigms and programs.
Python code execution
Python is a language specification which doesn't dictate exact implementation details, so depending on the implementation of Python being used, the execution of the code can vary. For the sake of this example we are using the CPython implementation, because it's the reference implementation but also because the way in which CPython executes code makes it easier to trace exactly what's happening. (Figuring out exactly what is happening in an implementation such as PyPy would be substantially more difficult because it would require a discussion of the tracing JIT approach PyPy uses. Specifically, because of PyPy's JIT strategy, exact execution details can vary from invocation to invocation of the same function.)
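You can check which implementation is running your code from the code itself; a small sketch using the standard library:

```python
import platform
import sys

# Reports the implementation executing this code, e.g. "CPython" or "PyPy"
print(platform.python_implementation())

# sys.implementation carries the same information in lowercase,
# along with version and cache-tag details
print(sys.implementation.name)
```

Everything in the rest of this article assumes the first line prints "CPython".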
As the name hints, the CPython interpreter is written in C, and since the CPython repo is now on GitHub you can have a look at the code yourself! From here on we will deal with some of the internals of CPython.
Adding the numbers: Python bytecode
If you have used CPython you might have been aware of the presence of those *.pyc and *.pyo files. These are the files that contain the Python bytecode of your programs.
One thing every production-ready Python implementation does is a pre-compilation step, where the source code is first parsed from text into some intermediate representation that can run faster. In the case of CPython the source is first parsed and then compiled into a cross-platform bytecode, so that no slow text parsing has to occur at run time. (It is at this step that syntax errors are raised, which is why you cannot catch that exception from within the file being compiled: this step runs before your code is executed.)
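You can observe this compile-before-run behaviour from Python itself: the built-in compile() performs the same parse-and-compile step, so a syntax error surfaces immediately, before any of the compiled code executes. (You can catch it here only because we are compiling a *separate* piece of source, not the file we're running.)

```python
# compile() runs the same parse + bytecode-compile step the interpreter
# applies to a module before executing it
code = compile("1 + 1", "<example>", "eval")
print(eval(code))  # 2

# A syntax error is raised at compile time, before anything executes
try:
    compile("1 +", "<example>", "eval")
except SyntaxError as exc:
    print("caught at compile time:", exc.msg)
```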
When we are adding the 2 numbers together this is accomplished via interpreted Python bytecode. This is implementation dependent, and in the case of CPython we can see how in the CPython source code. CPython doesn't take many liberties with how it executes the bytecode; the design is based on being correct and reliable, not fast. It uses a fairly simple stack-based approach which, conveniently, makes it really easy to follow how the execution of the bytecode happens. (If you want faster execution check out PyPy or ways to incorporate compiled code.)
We can use the dis module to see the bytecode from the function:
def add_two_numbers(first, second):
    """Add two numbers together

    :first: first number
    :second: second number
    """
    result = first + second
    return result

import dis
dis.dis(add_two_numbers)
Bytecode:
 11           0 LOAD_FAST                0 (first)
              3 LOAD_FAST                1 (second)
              6 BINARY_ADD
              7 STORE_FAST               2 (result)

 12          10 LOAD_FAST                2 (result)
             13 RETURN_VALUE
So we see on line 11 we load the 2 variables with LOAD_FAST, then we perform a BINARY_ADD, then store the sum into the result variable with STORE_FAST. Then on line 12 we fetch the result of the addition with LOAD_FAST, and RETURN_VALUE returns it from the function.
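The dis module also exposes the instruction stream programmatically, which is handy for poking around (note that exact opcode names vary between CPython versions; for example BINARY_ADD was folded into BINARY_OP in 3.11):

```python
import dis

def add_two_numbers(first, second):
    result = first + second
    return result

# dis.get_instructions yields one Instruction object per bytecode op
opnames = [instr.opname for instr in dis.get_instructions(add_two_numbers)]
print(opnames)
```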
Execution of the CPython code:
When the CPython interpreter is run, the machine code instructions that carry out the addition come from the compiled code that constitutes the core of the Python implementation. In practice this step goes from the C source to machine code via some internal state of the compiler. In the case of GCC, the source code is first compiled into GCC's intermediate representations (GIMPLE and then RTL), then output with the appropriate backend into the needed machine instructions. In this case we are on x86, so the backend used by GCC is the x86 one, and the CPython interpreter has been compiled such that we are running x86 machine code.
Specifically on the first run through the code the CPython interpreter compiles the source code down to bytecode. This bytecode is then executed via the interpreter at runtime. Understanding how the bytecode is interpreted is therefore the next step in our journey.
Going into the CPython 3.7 source, we find that the execution of bytecode occurs in ceval.c, within which there's a switch statement that starts with:
switch (opcode) {
The opcode in this case is the opcode for the Python bytecode currently being encountered. In our case we are dealing with the bytecode to add two entities on the top of the bytecode stack. After falling through the many different preceding cases we get to the case handled by BINARY_ADD (note that TARGET is a macro containing the case statement), which we enter:
TARGET(BINARY_ADD) {
    PyObject *right = POP();
    PyObject *left = TOP();
    PyObject *sum;
    /* NOTE(haypo): Please don't try to micro-optimize int+int on
       CPython using bytecode, it is simply worthless.
       See http://bugs.python.org/issue21955 and
       http://bugs.python.org/issue10044 for the discussion. In short,
       no patch shown any impact on a realistic benchmark, only a minor
       speedup on microbenchmarks. */
    if (PyUnicode_CheckExact(left) &&
             PyUnicode_CheckExact(right)) {
        sum = unicode_concatenate(left, right, f, next_instr);
        /* unicode_concatenate consumed the ref to left */
    }
    else {
        sum = PyNumber_Add(left, right);
        Py_DECREF(left);
    }
    Py_DECREF(right);
    SET_TOP(sum);
    if (sum == NULL)
        goto error;
    DISPATCH();
}
First we retrieve the two numbers off the stack. Note that these are PyObject's and not integers, so some more work must be done before being able to evaluate. We first follow the check to see if their types are strings, since the plus operator is overloaded. Because we are not dealing with unicode strings we follow the else branch:
sum = PyNumber_Add(left, right);
Py_DECREF(left);
This calls PyNumber_Add to perform the actual addition. Once PyNumber_Add is finished we decrease the reference counts of the left and right operands with:
Py_DECREF(left);
Py_DECREF(right);
Then we set the top of the bytecode execution stack with SET_TOP, which is a macro defined by:
#define SET_TOP(v) (stack_pointer[-1] = (v))
At this point the sum of the numbers is on the top of the stack ready to be used.
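The stack manipulation in the handler above is easier to see as a small Python model of the value stack (POP, TOP and SET_TOP map onto plain list operations):

```python
# A toy model of the CPython value stack during BINARY_ADD.
#   POP()      -> stack.pop()
#   TOP()      -> stack[-1]
#   SET_TOP(v) -> stack[-1] = v
stack = [1, 2]            # operands pushed by the two LOAD_FAST instructions

right = stack.pop()       # POP(): remove the right operand
left = stack[-1]          # TOP(): peek at the left operand
stack[-1] = left + right  # SET_TOP(): replace it with the sum

print(stack)  # [3]
```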
How we get from Python objects to adding the underlying integers (PyNumber_Add and down)
Warning, rabbit-hole alert! Skip past this if you aren't interested in gory Python implementation details.
In abstract.c we see the definition of PyNumber_Add:
PyObject *
PyNumber_Add(PyObject *v, PyObject *w)
{
    PyObject *result = binary_op1(v, w, NB_SLOT(nb_add));
    if (result == Py_NotImplemented) {
        PySequenceMethods *m = v->ob_type->tp_as_sequence;
        Py_DECREF(result);
        if (m && m->sq_concat) {
            return (*m->sq_concat)(v, w);
        }
        result = binop_type_error(v, w, "+");
    }
    return result;
}
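From the Python level, the closest thing to calling PyNumber_Add directly is operator.add, which as far as I can tell is a thin wrapper around this C function in CPython. Both branches of the C code above are observable:

```python
import operator

# Numbers go through the nb_add slot via binary_op1
assert operator.add(1, 2) == 3

# Sequences have no numeric add, so binary_op1 returns NotImplemented
# and PyNumber_Add falls through to the sq_concat branch seen above
assert operator.add([1], [2]) == [1, 2]
assert operator.add("a", "b") == "ab"
```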
So in this case we called this with 2 parameters of Python type int. (Note that this is Python's type system, not C's, so a Python int is stored in a C PyObject.) We then have to apply binary_op1 (which is also found in abstract.c) to evaluate:
/*
  Calling scheme used for binary operations:

  Order operations are tried until either a valid result or error:
    w.op(v,w)[*], v.op(v,w), w.op(v,w)

  [*] only when v->ob_type != w->ob_type && w->ob_type is a subclass of
      v->ob_type
*/
static PyObject *
binary_op1(PyObject *v, PyObject *w, const int op_slot)
{
    PyObject *x;
    binaryfunc slotv = NULL;
    binaryfunc slotw = NULL;

    if (v->ob_type->tp_as_number != NULL)
        slotv = NB_BINOP(v->ob_type->tp_as_number, op_slot);
    if (w->ob_type != v->ob_type &&
        w->ob_type->tp_as_number != NULL) {
        slotw = NB_BINOP(w->ob_type->tp_as_number, op_slot);
        if (slotw == slotv)
            slotw = NULL;
    }
    if (slotv) {
        if (slotw && PyType_IsSubtype(w->ob_type, v->ob_type)) {
            x = slotw(v, w);
            if (x != Py_NotImplemented)
                return x;
            Py_DECREF(x); /* can't do it */
            slotw = NULL;
        }
        x = slotv(v, w);
        if (x != Py_NotImplemented)
            return x;
        Py_DECREF(x); /* can't do it */
    }
    if (slotw) {
        x = slotw(v, w);
        if (x != Py_NotImplemented)
            return x;
        Py_DECREF(x); /* can't do it */
    }
    Py_RETURN_NOTIMPLEMENTED;
}
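The [*] case in the comment above, trying the right operand first when its type is a subclass of the left operand's type, is observable from pure Python:

```python
class Right(int):
    """Subclass of int that overrides the reflected add."""
    def __radd__(self, other):
        return "subclass __radd__ ran first"

# Right is a subclass of int and overrides __radd__, so binary_op1
# tries the right operand's slot before the left operand's
print(1 + Right(2))  # subclass __radd__ ran first
```

Without the subclass priority rule, int's own addition would win and the override would never be consulted.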
At first glance I thought this was over-complicated. But then I remembered that Python has the NotImplemented language construct, which allows an expression whose left hand side doesn't implement the operation to not fail (see the documentation page for NotImplemented). If the left hand side doesn't support the operation, it then gets tried by what the right hand side of the expression defines in __radd__. The comment above the implementation in the C source directly references this. This is the way in which Python can handle a binary operation (in this case addition) with only one side implementing it, without failing. In the case of adding some integers, however, we just add them directly. (Note how if the operation can't be done, Py_RETURN_NOTIMPLEMENTED will be returned for us.)
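This NotImplemented-then-__radd__ dance can be demonstrated with a small (hypothetical) wrapper class:

```python
class Meters:
    """Toy wrapper type used only to illustrate the __radd__ fallback."""
    def __init__(self, value):
        self.value = value

    def __radd__(self, other):
        # Called when the left operand's __add__ returned NotImplemented
        return Meters(other + self.value)

# int.__add__ doesn't know about Meters, so it signals NotImplemented...
assert (1).__add__(Meters(2)) is NotImplemented

# ...and the interpreter falls back to Meters.__radd__
total = 1 + Meters(2)
print(total.value)  # 3
```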
So to see how this flows through the code we have to see what if (v->ob_type->tp_as_number != NULL) evaluates to, which requires seeing how PyObject is defined. Every PyObject carries an ob_type pointer to a PyTypeObject, which is a typedef of the _typeobject struct, found in cpython/Include/object.h. Being that this is a rather large struct, we will just put the parts most relevant to what we are doing in here:
typedef struct _typeobject {
    PyObject_VAR_HEAD
    const char *tp_name; /* For printing, in format "<module>.<name>" */
    Py_ssize_t tp_basicsize, tp_itemsize; /* For allocation */

    /* Methods to implement standard operations */
    destructor tp_dealloc;
    printfunc tp_print;
    getattrfunc tp_getattr;
    setattrfunc tp_setattr;
    PyAsyncMethods *tp_as_async; /* formerly known as tp_compare (Python 2)
                                    or tp_reserved (Python 3) */
    reprfunc tp_repr;

    /* Method suites for standard classes */
    PyNumberMethods *tp_as_number;
    PySequenceMethods *tp_as_sequence;
    PyMappingMethods *tp_as_mapping;

    /* more code */
} PyTypeObject;
We see some of the standard things, like a type needing to define its name and its size for allocation purposes. Then we see some "method suites" which define what methods this type supports. So we see that tp_as_number is of type pointer to PyNumberMethods. PyNumberMethods is a struct that contains a whole bunch of things like binary functions and unary functions; essentially it's a table of which functions the type supports. If it's NULL then the type doesn't support number operations at all, and hence doing something like trying to add a number to it will fail and cause a type error. But more about that in a moment.
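That failure mode is easy to trigger: plain object defines no number methods, so its tp_as_number is NULL, binary_op1 finds no slots, and binop_type_error raises:

```python
# object has no numeric slots, so addition has nowhere to dispatch
try:
    object() + 1
except TypeError as exc:
    message = str(exc)

print(message)  # unsupported operand type(s) for +: 'object' and 'int'
```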
So in our case when we go to evaluate:
if (v->ob_type->tp_as_number != NULL)
We find that since v is an integer and the integer type defines tp_as_number, we enter that branch. Because tp_as_number is not NULL, we know that the type actually implements number methods. The PyNumberMethods struct looks like this:
typedef struct {
    /* Number implementations must check *both*
       arguments for proper type and implement the necessary conversions
       in the slot functions themselves. */

    binaryfunc nb_add;
    binaryfunc nb_subtract;
    binaryfunc nb_multiply;
    binaryfunc nb_remainder;
    binaryfunc nb_divmod;
    ternaryfunc nb_power;
    unaryfunc nb_negative;
    unaryfunc nb_positive;
    unaryfunc nb_absolute;
    inquiry nb_bool;
    unaryfunc nb_invert;
    binaryfunc nb_lshift;
    binaryfunc nb_rshift;
    binaryfunc nb_and;
    binaryfunc nb_xor;
    binaryfunc nb_or;
    unaryfunc nb_int;
    void *nb_reserved;  /* the slot formerly known as nb_long */
    unaryfunc nb_float;

    binaryfunc nb_inplace_add;
    binaryfunc nb_inplace_subtract;
    binaryfunc nb_inplace_multiply;
    binaryfunc nb_inplace_remainder;
    ternaryfunc nb_inplace_power;
    binaryfunc nb_inplace_lshift;
    binaryfunc nb_inplace_rshift;
    binaryfunc nb_inplace_and;
    binaryfunc nb_inplace_xor;
    binaryfunc nb_inplace_or;

    binaryfunc nb_floor_divide;
    binaryfunc nb_true_divide;
    binaryfunc nb_inplace_floor_divide;
    binaryfunc nb_inplace_true_divide;

    unaryfunc nb_index;

    binaryfunc nb_matrix_multiply;
    binaryfunc nb_inplace_matrix_multiply;
} PyNumberMethods;
Seeing as we are interested in nb_add, which is of type binaryfunc, let's look at the typedef for that:
typedef PyObject * (*binaryfunc)(PyObject *, PyObject *);
If you found this a bit cryptic don't despair, I had to re-read it a bunch of times too (I tried cdecl first too!). Basically this defines a pointer to a function that takes two parameters of type PyObject pointer and returns a PyObject pointer.
Back in binary_op1, the slot is fetched with:
slotv = NB_BINOP(v->ob_type->tp_as_number, op_slot);
To figure out what this does we have yet another macro to look up:
/* Binary operators */
#define NB_SLOT(x) offsetof(PyNumberMethods, x)
#define NB_BINOP(nb_methods, slot) \
(*(binaryfunc*)(& ((char*)nb_methods)[slot]))
NB_BINOP here is doing something a bit nasty, but it is essentially allowing us to select the function pointer we want from a number that indexes into the struct. Or in other words, because the memory layout of a C struct is deterministic, it's just picking the function out of the struct at the byte offset at which it appears. So it hides all the implementation nastiness from us, which means NB_BINOP is essentially a lookup in the tp_as_number table, picking out the relevant function from the slot passed in as the op_slot parameter. (The offset itself comes from the standard C offsetof macro, which NB_SLOT wraps: offsetof yields the byte offset of a member within a struct.) So if op_slot == 0 we take the member at offset 0 of the PyNumberMethods struct, which in this case sets slotv to be the binary function in the first position for the integer type.
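The offsetof idea (which NB_SLOT wraps) can be sketched from Python with ctypes. The struct below is a hypothetical two-slot stand-in for the start of PyNumberMethods, not the real thing:

```python
import ctypes

class SlotTable(ctypes.Structure):
    """Hypothetical stand-in for the first two PyNumberMethods slots."""
    _fields_ = [
        ("nb_add", ctypes.c_void_p),
        ("nb_subtract", ctypes.c_void_p),
    ]

# offsetof(SlotTable, nb_add) == 0, so an op_slot of 0 selects nb_add
print(SlotTable.nb_add.offset)       # 0
print(SlotTable.nb_subtract.offset)  # 8 on a typical 64-bit platform
```

The byte offset of each member is fixed by the declaration order, which is exactly what makes NB_BINOP's pointer arithmetic safe.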
Moving along to the next line we find this check:
if (w->ob_type != v->ob_type &&
w->ob_type->tp_as_number != NULL)
Because the operand types are the same on each side we skip this branch.
if (slotv) {
Since slotv was successfully fetched above (and is now the nb_add for the integer type) we enter this branch:
if (slotw && PyType_IsSubtype(w->ob_type, v->ob_type)) {
    x = slotw(v, w);
    if (x != Py_NotImplemented)
        return x;
    Py_DECREF(x); /* can't do it */
    slotw = NULL;
}
x = slotv(v, w);
if (x != Py_NotImplemented)
    return x;
Py_DECREF(x); /* can't do it */
However slotw was not set before, so we skip the if statement here, leaving us with:
x = slotv(v, w);
if (x != Py_NotImplemented)
    return x;
Py_DECREF(x); /* can't do it */
Now we call the function that the slotv function pointer refers to, which in our case is whatever was stored at nb_add for the integer type. Given we are using Python 3 here (thankfully!), all the integral numbers are unified under the long type (see the official PEP 237 for more information). This is extremely convenient for the developer, as they don't need to worry about integer overflow and other such considerations when coding (anyone who's dealt with code that overflowed or silently failed because the size of long integers changed from under them on another system will know what I mean).
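The PEP 237 unification means integer arithmetic in Python 3 silently promotes to arbitrary precision instead of overflowing:

```python
# Well past 64 bits; a C long would have overflowed long ago
big = 2**100
print(big + big == 2**101)  # True

# The same + operator and the same nb_add slot handle both cases
big_sum = (2**64 - 1) + 1
print(big_sum)  # 18446744073709551616
```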
Going to Objects/longobject.c we can find the appropriate definition of the PyNumberMethods for that type:
static PyNumberMethods long_as_number = {
    (binaryfunc)long_add,       /*nb_add*/
    (binaryfunc)long_sub,       /*nb_subtract*/
    (binaryfunc)long_mul,       /*nb_multiply*/
    long_mod,                   /*nb_remainder*/
    long_divmod,                /*nb_divmod*/
    long_pow,                   /*nb_power*/
    (unaryfunc)long_neg,        /*nb_negative*/
    (unaryfunc)long_long,       /*tp_positive*/
    (unaryfunc)long_abs,        /*tp_absolute*/
    (inquiry)long_bool,         /*tp_bool*/
    (unaryfunc)long_invert,     /*nb_invert*/
    long_lshift,                /*nb_lshift*/
    (binaryfunc)long_rshift,    /*nb_rshift*/
    long_and,                   /*nb_and*/
    long_xor,                   /*nb_xor*/
    long_or,                    /*nb_or*/
    long_long,                  /*nb_int*/
    0,                          /*nb_reserved*/
    long_float,                 /*nb_float*/
    0,                          /* nb_inplace_add */
    0,                          /* nb_inplace_subtract */
    0,                          /* nb_inplace_multiply */
    0,                          /* nb_inplace_remainder */
    0,                          /* nb_inplace_power */
    0,                          /* nb_inplace_lshift */
    0,                          /* nb_inplace_rshift */
    0,                          /* nb_inplace_and */
    0,                          /* nb_inplace_xor */
    0,                          /* nb_inplace_or */
    long_div,                   /* nb_floor_divide */
    long_true_divide,           /* nb_true_divide */
    0,                          /* nb_inplace_floor_divide */
    0,                          /* nb_inplace_true_divide */
    long_long,                  /* nb_index */
};
Since op_slot == 0 we are using the first function that appears in the struct, long_add, to actually perform the addition, making our code effectively:
return long_add(v, w);
This simplification holds because the operation is implemented, and hence its result will be the return value. The long_add function has now been chosen to perform the actual addition of the longs; the definition of this function is also found in Objects/longobject.c:
static PyObject *
long_add(PyLongObject *a, PyLongObject *b)
{
    PyLongObject *z;

    CHECK_BINOP(a, b);

    if (Py_ABS(Py_SIZE(a)) <= 1 && Py_ABS(Py_SIZE(b)) <= 1) {
        return PyLong_FromLong(MEDIUM_VALUE(a) + MEDIUM_VALUE(b));
    }
    if (Py_SIZE(a) < 0) {
        if (Py_SIZE(b) < 0) {
            z = x_add(a, b);
            if (z != NULL) {
                /* x_add received at least one multiple-digit int,
                   and thus z must be a multiple-digit int.
                   That also means z is not an element of
                   small_ints, so negating it in-place is safe. */
                assert(Py_REFCNT(z) == 1);
                Py_SIZE(z) = -(Py_SIZE(z));
            }
        }
        else
            z = x_sub(b, a);
    }
    else {
        if (Py_SIZE(b) < 0)
            z = x_sub(a, b);
        else
            z = x_add(a, b);
    }
    return (PyObject *)z;
}
We first declare a PyLongObject *z to store the result of the calculation. CHECK_BINOP is a macro that does some sanity checking of the inputs:
#define CHECK_BINOP(v,w)                                \
    do {                                                \
        if (!PyLong_Check(v) || !PyLong_Check(w))       \
            Py_RETURN_NOTIMPLEMENTED;                   \
    } while(0)
So if a non-long type somehow ended up here it would bail out by returning NotImplemented. But seeing as the correct types were passed here, we pass this check and continue on.
Since all integers in Python 3 are unified under the long type, arithmetic on arbitrary-length integers can be performed, so the cases where arbitrary precision arithmetic is needed have to be handled in the implementation. But just because we can do arbitrary precision arithmetic, doing all additions with arbitrary precision would be a massive pessimization, so we first check whether the numbers are small enough to just add directly. If they are, we can do a fixed precision operation, which will be much faster on most hardware. (You can see the code that runs the arbitrary precision arithmetic in the Objects/longobject.c file; it's rather complex, and it's the sort of overhead you don't want when enumerating over a small list of objects.)
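The small-versus-multi-digit split is visible from Python: bigger ints occupy more memory because they carry more digits (exact sizes are CPython implementation details, so treat the numbers as illustrative):

```python
import sys

small = 1        # fits in a single digit, eligible for the fast path
big = 2**100     # needs several 30-bit digits

# The multi-digit int object is strictly larger
print(sys.getsizeof(small) < sys.getsizeof(big))  # True
```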
if (Py_ABS(Py_SIZE(a)) <= 1 && Py_ABS(Py_SIZE(b)) <= 1) {
    return PyLong_FromLong(MEDIUM_VALUE(a) + MEDIUM_VALUE(b));
}
So in this snippet we have a few things going on; let's break it down. First we have to check the size of the parameters with Py_SIZE. The structure and macro for Py_SIZE are found in Include/object.h:
typedef struct {
PyObject ob_base;
Py_ssize_t ob_size; /* Number of items in variable part */
} PyVarObject;
#define Py_SIZE(ob) (((PyVarObject*)(ob))->ob_size)
This is basically accessing the ob_size of the parameter. Py_ssize_t is defined in PEP 353; if, like me, you see this code and ask "why didn't they just use size_t?", you would probably enjoy reading that PEP. For a long, ob_size holds the number of digits in the variable-length part of the object, with its sign matching the sign of the number.
Py_ABS is just an absolute value macro; see Include/pymacro.h for the definition. Since the sign of a long lives in ob_size, Py_ABS(Py_SIZE(x)) <= 1 checks that the value fits in at most a single digit, which is exactly the precondition for the MEDIUM_VALUE fast path used below; anything larger must go through the arbitrary precision routines.
So now we apply the MEDIUM_VALUE macro to each parameter; this macro is found in Objects/longobject.c:
/* convert a PyLong of size 1, 0 or -1 to an sdigit */
#define MEDIUM_VALUE(x) (assert(-1 <= Py_SIZE(x) && Py_SIZE(x) <= 1), \
Py_SIZE(x) < 0 ? -(sdigit)(x)->ob_digit[0] : \
(Py_SIZE(x) == 0 ? (sdigit)0 : \
(sdigit)(x)->ob_digit[0]))
sdigit is a typedef to a platform-specific type that has the width of a signed digit; see Include/longintrepr.h:
#if PYLONG_BITS_IN_DIGIT == 30
typedef uint32_t digit;
typedef int32_t sdigit; /* signed variant of digit */
/* code snipped */
#elif PYLONG_BITS_IN_DIGIT == 15
typedef unsigned short digit;
typedef short sdigit; /* signed variant of digit */
So in this case on my machine sdigit is int32_t, and the casts have been performed so that the addition happens on int32_t values.
When the result of this is obtained, we construct a Python long object to wrap it and start going back up the abstraction layers. But we aren't done just yet; there's more digging down the abstraction layers to come...
The addition is performed!
If you followed above you will have seen all the intermediary steps that need to be taken to properly deal with dynamic types and dispatching the correct functions.
At this point we are performing the C code equivalent of:
int32_t a = 1;
int32_t b = 2;
return a + b;
and it will be effectively this, because in this code path we have already passed all the checks for bad types and overflow, so we know that a + b will fit within 32 bits.
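For contrast with Python's overflow-free ints, here is a sketch of what fixed-width 32-bit semantics look like, modelling two's-complement wraparound by hand (note that signed overflow in C is technically undefined behaviour; the wraparound below is just the behaviour most hardware exhibits):

```python
def add_int32(a, b):
    """Add with 32-bit two's-complement wraparound, mimicking what
    raw int32_t arithmetic typically does on real hardware."""
    return ((a + b + 2**31) % 2**32) - 2**31

print(add_int32(1, 2))          # 3
print(add_int32(2**31 - 1, 1))  # -2147483648 (wraps around)
```

This is exactly the class of surprise that CPython's size check above shields Python code from.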
A really useful tool for investigating how code maps to assembly is the compiler-explorer project. This tool has a super useful web interface, https://gcc.godbolt.org/, which we are using here.
Let's see how the following C++ code disassembles (I know this isn't C, but compiler-explorer does not support C on the website; this will have a very similar if not identical assembly output):
#include <cstdint>

int32_t add_two_numbers(int32_t a, int32_t b) {
    return a + b;
}
Assembly in Intel syntax from the above code:
add_two_numbers(int, int):
    push rbp
    mov rbp, rsp
    mov DWORD PTR [rbp-4], edi
    mov DWORD PTR [rbp-8], esi
    mov edx, DWORD PTR [rbp-4]
    mov eax, DWORD PTR [rbp-8]
    add eax, edx
    pop rbp
    ret
(source https://godbolt.org/g/5aQTAJ)
We are dealing with a stack here (much like how the Python bytecode works), manipulated through the underlying stack pointers. This is unoptimized output, so the function still sets up a full stack frame: the push rbp / mov rbp, rsp pair is the standard prologue that saves the caller's base pointer and starts a new frame. The two arguments arrive in the edi and esi registers, get spilled to the stack, and are then loaded back into edx and eax. The add eax, edx instruction stores the sum in eax, which is the register used for the return value, and pop rbp restores the caller's frame pointer before ret returns.
We have given the machine some instructions to add some numbers, but how exactly is this addition with the add instruction actually performed?
See the next article in the series to find out!
This post is part 6 of the "PageRequestsToTheMetal" series:
- Page requests to the metal - introduction
- Page requests to the metal - frontend
- Page requests to the metal - Network stack - From frontend to backend
- Page requests to the metal - Backend - What happens on the server
- Page requests to the metal - Backend web framework
- Page requests to the metal - Backend implementation *
- Page requests to the metal - hardware level