# Page requests to the metal - Backend implementation

This post is part of the page requests to the metal series where we cover all the steps from a web request down to the electrons that move in the CPU, see the intro/overview page to see where this fits in.

## Language implementation/execution

From this point on we have left the realm of "web frameworks" and are now looking at the Python language implementation details that underpin many different paradigms and programs.

### Python code execution

Python is a language specification which doesn't specify exact implementation details. So depending on the implementation of Python being used the execution of the code can vary. For the sake of this example we are using the CPython implementation because it's the reference implementation but also because the way in which CPython executes the code it's easier to trace exactly what's happening. (Figuring out exactly what is happening in an implementation such as PyPy would be substantially more difficult because it would require a discussion about the tracing JIT approach PyPy uses. Specifically because of the JIT strategy that PyPy uses exact execution details can vary from invocation to invocation of the same function.)

As the name hints, the CPython interpreter is written in C and since the CPython repo is now on GitHub you can have a look at the code yourself! From here on we will deal with some of internals of CPython.

### Adding the numbers: Python bytecode

Whenever you are using CPython you might have been away of the presence of those *.pyc and *.pyo files. These are the files that contain the Python bytecode of your programs.

One thing every production-ready Python implementation does is a pre-compilation step where the source code is first parsed from text into some intermediate representation that can run faster. In the case of CPython the source is first parsed and then compiled into a cross platform bytecode so that no slow text parsing has to occur at run time. (It is at this step where syntax errors are raised and is the reason why you cannot catch that exception from within your code because this step runs before your code is executed).

When we are adding the 2 numbers together this is accomplished via interpreted Python bytecode. This is implementation dependent and in the case of CPython we can see how in the CPython source code. CPython doesn't take many liberties with how it executes the bytecode, the design is based on being correct and reliable, not fast. This is a fairly simple stack based approach which, conveniently, makes it really easy to follow how the execution of the bytecode happens. (If you want faster execution check out PyPy or ways to incorporate compiled code)

We can use the dis module to see the bytecode from the function:

```
def add_two_numbers(first, second):
"""Add two numbers together
:first: first number
:second: second number
"""
result = first + second
return result
import dis
print(dis.dis(add_two_numbers))
```

Bytecode:

```
11 0 LOAD_FAST 0 (first)
3 LOAD_FAST 1 (second)
6 BINARY_ADD
7 STORE_FAST 2 (result)
12 10 LOAD_FAST 2 (result)
13 RETURN_VALUE
```

So we see on line 11 we load the 2 variables with `LOAD_FAST`

then we
perform a `BINARY_ADD`

then store those into the right hand side of the
expression on line 11 with `STORE_FAST`

. Then on line 12 we fetch the
result from the addition with `LOAD_FAST`

and `RETURN_VALUE`

to return
that from the function.

### Execution of the CPython code:

When the CPython interpreter is run it generates the machine code instructions that carry out the addition. This is done via compiled code that constitutes the core of the Python implementation. In practice this step goes from the C source to machine code via some internal state of the compiler. In the case of using GCC the source code is first compiled into GNU Intermediate Language then output with the appropriate backend into the needed machine instructions. In this case we are on x86 so the backend used by GCC is the x86 one and the CPython interpreter has compiled such that we are running x86 machine code.

Specifically on the first run through the code the CPython interpreter compiles the source code down to bytecode. This bytecode is then executed via the interpreter at runtime. Understanding how the bytecode is interpreted is therefore the next step in our journey.

Going into the CPython 3.7 source we find the execution of bytecode occurs in ceval.c within which there's a switch statement that starts with:

```
switch (opcode) {
```

The opcode in this case is the opcode for the Python bytecode currently
being encountered. In our case we are dealing with the bytecode to add
two entities on the top of the bytecode stack. After falling though the
many different preceding cases we get to the case handled by
`BINARY_ADD`

(note that `TARGET`

is a macro containing the case
statement) which we enter:

```
TARGET(BINARY_ADD) {
PyObject *right = POP();
PyObject *left = TOP();
PyObject *sum;
/* NOTE(haypo): Please don't try to micro-optimize int+int on
CPython using bytecode, it is simply worthless.
See http://bugs.python.org/issue21955 and
http://bugs.python.org/issue10044 for the discussion. In short,
no patch shown any impact on a realistic benchmark, only a minor
speedup on microbenchmarks. */
if (PyUnicode_CheckExact(left) &&
PyUnicode_CheckExact(right)) {
sum = unicode_concatenate(left, right, f, next_instr);
/* unicode_concatenate consumed the ref to left */
}
else {
sum = PyNumber_Add(left, right);
Py_DECREF(left);
}
Py_DECREF(right);
SET_TOP(sum);
if (sum == NULL)
goto error;
DISPATCH();
}
```

First we retrieve the different numbers off the stack, note that these
are `PyObject`

's and not integers so some more work must be done before
being able to evaluate. We first follow the check to see if they types
are strings since the plus operator is overloaded. Because we are not
dealing with unicode
strings we follow the
else branch:

```
sum = PyNumber_Add(left, right);
Py_DECREF(left);
```

This calls `PyNumberAdd`

to perform the actual addition. Once
`PyNumberAdd`

is finished we decrease the reference count of the left
and right with:

```
Py_DECREF(left);
}
Py_DECREF(right);
```

Before setting the top of the bytecode execution stack with `SET_TOP`

which is a macro defined by:

```
#define SET_TOP(v) (stack_pointer[-1] = (v))
```

At this point the sum of the numbers is on the top of the stack ready to be used.

### How we get from Python objects to adding the underlying integers (PyNumber_Add and down)

Warning rabbit-hole alert!!! skip past this if you aren't interested in gory Python implementation details.

In
abstract.c
we see the definition of `PyNumber_Add`

:

```
PyObject *
PyNumber_Add(PyObject *v, PyObject *w)
{
PyObject *result = binary_op1(v, w, NB_SLOT(nb_add));
if (result == Py_NotImplemented) {
PySequenceMethods *m = v->ob_type->tp_as_sequence;
Py_DECREF(result);
if (m && m->sq_concat) {
return (*m->sq_concat)(v, w);
}
result = binop_type_error(v, w, "+");
}
return result;
}
```

So in this case we called this with 2 parameters of python type `int`

.
(Note that this is Python's type system, not C's, so a Python `int`

is
stored in a C `PyObject`

)

We have to then apply `binary_op1`

(which is also found in
abstract.c)
to evaluate:

```
/*
Calling scheme used for binary operations:
Order operations are tried until either a valid result or error:
w.op(v,w)[*], v.op(v,w), w.op(v,w)
[*] only when v->ob_type != w->ob_type && w->ob_type is a subclass of
v->ob_type
*/
static PyObject *
binary_op1(PyObject *v, PyObject *w, const int op_slot)
{
PyObject *x;
binaryfunc slotv = NULL;
binaryfunc slotw = NULL;
if (v->ob_type->tp_as_number != NULL)
slotv = NB_BINOP(v->ob_type->tp_as_number, op_slot);
if (w->ob_type != v->ob_type &&
w->ob_type->tp_as_number != NULL) {
slotw = NB_BINOP(w->ob_type->tp_as_number, op_slot);
if (slotw == slotv)
slotw = NULL;
}
if (slotv) {
if (slotw && PyType_IsSubtype(w->ob_type, v->ob_type)) {
x = slotw(v, w);
if (x != Py_NotImplemented)
return x;
Py_DECREF(x); /* can't do it */
slotw = NULL;
}
x = slotv(v, w);
if (x != Py_NotImplemented)
return x;
Py_DECREF(x); /* can't do it */
}
if (slotw) {
x = slotw(v, w);
if (x != Py_NotImplemented)
return x;
Py_DECREF(x); /* can't do it */
}
Py_RETURN_NOTIMPLEMENTED;
}
```

At first glance I thought this was over-complicated. But I then
remembered that Python has the `NotImplemented`

language construct which allows an
expression with the left hand side not implementing the operation to not
fail (Documentation page for
NotImplemented).
If the left hand side doesn't support the operation it then gets tried
by what the right hand side of the expression defines in `__radd__`

. The
comment above the implementation in the C source directly references
this. This is the way in which Python can handle a binary operation (in
this case addition) with only one side being implemented without
failing. In the case of adding some integers however we just add them
directly. (Note how if the operation can't be done
`Py_RETURN_NOTIMPLEMENTED`

will be returned for us.)

So to see how this flows through the code we have to see what
`if (v->ob_type->tp_as_number != NULL)`

evaluates to which requires seeing how how
`PyObject`

is defined. `PyObject`

is a typedef that refers to the
`_typeobect`

struct. This is found in
cpython/Include/object.h

Being that this is a rather large struct we will just put the most relevant parts to what we are doing in here:

```
typedef struct _typeobject {
PyObject_VAR_HEAD
const char *tp_name; /* For printing, in format "<module>.<name>" */
Py_ssize_t tp_basicsize, tp_itemsize; /* For allocation */
/* Methods to implement standard operations */
destructor tp_dealloc;
printfunc tp_print;
getattrfunc tp_getattr;
setattrfunc tp_setattr;
PyAsyncMethods *tp_as_async; /* formerly known as tp_compare (Python 2)
or tp_reserved (Python 3) */
reprfunc tp_repr;
/* Method suites for standard classes */
PyNumberMethods *tp_as_number;
PySequenceMethods *tp_as_sequence;
PyMappingMethods *tp_as_mapping;
/* more code */
} PyTypeObject;
```

We see some of the standard things like a type needing to define it's
name and it's size for allocation purposes. Then we see some "method
suites" which define what methods this type supports. So we see that
`tp_as_number`

is of type pointer to `PyNumberMethods`

.
`PyNumberMethods`

is a struct that contains a whole bunch of things like
binary functions and unary functions, essentially it's a table of which
functions the type supports. If it's `NULL`

then the type doesn't
support number operations at all and hence doing something like trying
to add a number to it will fail and cause a type error. But more about
that in a moment.

So in our case when we go to evaluate:

```
if (v->ob_type->tp_as_number != NULL)
```

We find that since `v`

is of type integer and the integer type defines
`tp_as_number`

so we enter into that branch.

So if this is not `NULL`

in the struct we know that the type actually
implements the required methods.

```
typedef struct {
/* Number implementations must check *both*
arguments for proper type and implement the necessary conversions
in the slot functions themselves. */
binaryfunc nb_add;
binaryfunc nb_subtract;
binaryfunc nb_multiply;
binaryfunc nb_remainder;
binaryfunc nb_divmod;
ternaryfunc nb_power;
unaryfunc nb_negative;
unaryfunc nb_positive;
unaryfunc nb_absolute;
inquiry nb_bool;
unaryfunc nb_invert;
binaryfunc nb_lshift;
binaryfunc nb_rshift;
binaryfunc nb_and;
binaryfunc nb_xor;
binaryfunc nb_or;
unaryfunc nb_int;
void *nb_reserved; /* the slot formerly known as nb_long */
unaryfunc nb_float;
binaryfunc nb_inplace_add;
binaryfunc nb_inplace_subtract;
binaryfunc nb_inplace_multiply;
binaryfunc nb_inplace_remainder;
ternaryfunc nb_inplace_power;
binaryfunc nb_inplace_lshift;
binaryfunc nb_inplace_rshift;
binaryfunc nb_inplace_and;
binaryfunc nb_inplace_xor;
binaryfunc nb_inplace_or;
binaryfunc nb_floor_divide;
binaryfunc nb_true_divide;
binaryfunc nb_inplace_floor_divide;
binaryfunc nb_inplace_true_divide;
unaryfunc nb_index;
binaryfunc nb_matrix_multiply;
binaryfunc nb_inplace_matrix_multiply;
} PyNumberMethods;
```

Seeing as we are interested in `nb_add`

which is of type `binaryfunc`

so
lets look at the typedef for that:

```
typedef PyObject * (*binaryfunc)(PyObject *, PyObject *);
```

If you found this is a bit cryptic don't despair, I had to re-read it a
bunch of times too (I tried cdecl first too!).
Basically this defines a pointer to a function that takes two parameters
of type `PyObject`

pointer and returns a `PyObject`

pointer.

```
slotv = NB_BINOP(v->ob_type->tp_as_number, op_slot);
```

To figure out what this does we have yet another macro to look up:

```
/* Binary operators */
#define NB_SLOT(x) offsetof(PyNumberMethods, x)
#define NB_BINOP(nb_methods, slot) \
(*(binaryfunc*)(& ((char*)nb_methods)[slot]))
```

`NB_BINOP`

here is doing something a bit nasty, but is essentially
allowing us to select the function pointer we want from a number that
indexes into the struct. Or in other words because the memory layout in
C is deterministic it's just picking the function out of the struct in
the order in which it appears. So it hides us all the implementation
nastiness here which means `NB_BINOP`

is essentially a lookup from the
`tp_as_number table`

which picks out the relevant function from the slot
passed in as the `op_slot`

parameter. (There is as far as I know no
language feature in C that will tell you in which order the members of a
struct are defined, if I'm wrong about this please let me know so I can correct this)

So if `op_slot == 0`

we
would take the 0th member of the `PyNumberMethods`

struct. Which in this
case now sets `slotv`

to
be the binary function that was at position 0 for the integer type.

Moving along to the next line we find this check:

```
if (w->ob_type != v->ob_type &&
w->ob_type->tp_as_number != NULL)
```

Because the operand types are the same on each side we skip this branch.

```
if (slotv) {
```

Since `slotv`

was
successfully fetched above (and is the `nb_add`

for the numbers now) we
enter this branch:

```
if (slotw && PyType_IsSubtype(w->ob_type, v->ob_type)) {
x = slotw(v, w);
if (x != Py_NotImplemented)
return x;
Py_DECREF(x); /* can't do it */
slotw = NULL;
}
x = slotv(v, w);
if (x != Py_NotImplemented)
return x;
Py_DECREF(x); /* can't do it */
```

However `slotw`

was **not**
set before so we skip the if statement here. Leaving us with:

```
x = slotv(v, w);
if (x != Py_NotImplemented)
return x;
Py_DECREF(x); /* can't do it */
```

Now we try to perform the function that the slotv function pointer is
referring to. So this effectively in our case is now whatever was stored
at `nb_add`

for the integers. Given we are using Python3 here
(thankfully!!!) all the integral numbers are unified under the long type
(see the official PEP 237
for more information). This is extremely convenient for the developer as
they don't need to worry about integer overflow and other such
considerations when coding (anyone who's dealt with code that
overflowed or silently failed because the size of long
integers
changed from under them on another system will know what I mean).

Going to
Objects/longobject.c
we can find the appropriate definition for the `PyNumberMethods`

for
that type:

```
static PyNumberMethods long_as_number = {
(binaryfunc)long_add, /*nb_add*/
(binaryfunc)long_sub, /*nb_subtract*/
(binaryfunc)long_mul, /*nb_multiply*/
long_mod, /*nb_remainder*/
long_divmod, /*nb_divmod*/
long_pow, /*nb_power*/
(unaryfunc)long_neg, /*nb_negative*/
(unaryfunc)long_long, /*tp_positive*/
(unaryfunc)long_abs, /*tp_absolute*/
(inquiry)long_bool, /*tp_bool*/
(unaryfunc)long_invert, /*nb_invert*/
long_lshift, /*nb_lshift*/
(binaryfunc)long_rshift, /*nb_rshift*/
long_and, /*nb_and*/
long_xor, /*nb_xor*/
long_or, /*nb_or*/
long_long, /*nb_int*/
0, /*nb_reserved*/
long_float, /*nb_float*/
0, /* nb_inplace_add */
0, /* nb_inplace_subtract */
0, /* nb_inplace_multiply */
0, /* nb_inplace_remainder */
0, /* nb_inplace_power */
0, /* nb_inplace_lshift */
0, /* nb_inplace_rshift */
0, /* nb_inplace_and */
0, /* nb_inplace_xor */
0, /* nb_inplace_or */
long_div, /* nb_floor_divide */
long_true_divide, /* nb_true_divide */
0, /* nb_inplace_floor_divide */
0, /* nb_inplace_true_divide */
long_long, /* nb_index */
};
```

Since `op_slot == 0`

we are
using the first function that appears in the struct,
`long_add`

, to actually
perform the addition. Making our code effectively:

```
return long_add(v, w);
```

This logical simplification is because this is implemented and hence
will be the return value. The `long_add`

function has now correctly been chosen to
perform the actual addition of the longs now, the definition of this
function is also found in
Objects/longobject.c
:

```
static PyObject *
long_add(PyLongObject *a, PyLongObject *b)
{
PyLongObject *z;
CHECK_BINOP(a, b);
if (Py_ABS(Py_SIZE(a)) <= 1 && Py_ABS(Py_SIZE(b)) <= 1) {
return PyLong_FromLong(MEDIUM_VALUE(a) + MEDIUM_VALUE(b));
}
if (Py_SIZE(a) < 0) {
if (Py_SIZE(b) < 0) {
z = x_add(a, b);
if (z != NULL) {
/* x_add received at least one multiple-digit int,
and thus z must be a multiple-digit int.
That also means z is not an element of
small_ints, so negating it in-place is safe. */
assert(Py_REFCNT(z) == 1);
Py_SIZE(z) = -(Py_SIZE(z));
}
}
else
z = x_sub(b, a);
}
else {
if (Py_SIZE(b) < 0)
z = x_sub(a, b);
else
z = x_add(a, b);
}
return (PyObject *)z;
}
```

We are first creating a `PyLongObject *z`

to store the result of the calculation.

`CHECK_BINOP`

is a macro that does some sanity checking of inputs:

```
#define CHECK_BINOP(v,w) \
do { \
if (!PyLong_Check(v) || !PyLong_Check(w)) \
Py_RETURN_NOTIMPLEMENTED; \
} while(0)
```

So if a non-long type somehow ended up here it would bail out by
returning a `NotImplemented`

. But seeing as we have the correct types
passed here we pass this check and continue on.

Since all integers are unified in Python3 with the long type arithmetic
on arbitrary length integers can be performed. So cases where arbitrary
precision arithmetic needs to be performed has to be handled in the
implementation. But just because we *can* do arbitrary precision
arithmetic
having to do all additions with arbitrary precision would be a massive
pessimization, so we first have a check for if the numbers are small
enough to just add directly. If they are we can just do a fixed
precision operation which will be **much** faster on most hardware. (You
can see the code that runs the arbitrary precision arithmetic in the
Objects/longobject.c
file, it's rather complex, and this is the sort of overhead you don't
want when enumerating over a small list of objects.)

```
if (Py_ABS(Py_SIZE(a)) <= 1 && Py_ABS(Py_SIZE(b)) <= 1) {
return PyLong_FromLong(MEDIUM_VALUE(a) + MEDIUM_VALUE(b));
}
```

So in this snippet we have a few things going on, so let's break it
down. First we have to check the size of the parameters with `Py_SIZE`

The structure and Macro for `Py_SIZE`

are found in
Include/object.h:

```
typedef struct {
PyObject ob_base;
Py_ssize_t ob_size; /* Number of items in variable part */
} PyVarObject;
#define Py_SIZE(ob) (((PyVarObject*)(ob))->ob_size)
```

Which is basically accessing the `ob_size`

of the parameter. Now
`Py_ssize_t`

is defined in PEP
353 if like me you see this
code and ask "why didn't they just use `size_t`

?" you probably would
enjoy reading that PEP. `ob_size`

is the size of the object.

`Py_ABS`

is just an absolute size macro, see
Include/pymacro.h
for the definition. While I know the benefits of coding defensively in
these situations I must admit I don't entirely understand what
situation occurs that requires this check to be performed, if you know
please post a comment so we can improve this article.

So now we have to apply the `MEDIUM_VALUE`

macro to the parameter, this
macro is found in
Objects/longobject.c:

```
/* convert a PyLong of size 1, 0 or -1 to an sdigit */
#define MEDIUM_VALUE(x) (assert(-1 <= Py_SIZE(x) && Py_SIZE(x) <= 1), \
Py_SIZE(x) < 0 ? -(sdigit)(x)->ob_digit[0] : \
(Py_SIZE(x) == 0 ? (sdigit)0 : \
(sdigit)(x)->ob_digit[0]))
```

`sdigit`

is a typedef to a platform specific type that has the width of
a signed digit see
Include/longintrepr.h:

```
#if PYLONG_BITS_IN_DIGIT == 30
typedef uint32_t digit;
typedef int32_t sdigit; /* signed variant of digit */
/* code snipped */
#elif PYLONG_BITS_IN_DIGIT == 15
typedef unsigned short digit;
typedef short sdigit; /* signed variant of digit */
```

So in this case on my machine `sdigit`

is `int32_t`

and now we have
performed the casts to add as though the data types with `int32_t`

's.

When the result of this is obtained we construct a Python long object to wrap this and we start going back up the abstraction layers. However we aren't done just yet, there's more digging down the abstraction layers yet to come...

TODO: verify in this case if it's just adding directly via a fast-path

### The addition is performed!

If you followed above you will have seen all the intermediary steps that need to be taken to properly deal with dynamic types and dispatching the correct functions.

At this point we are preforming the C code equivalent of:

```
int32_t a = 1;
int32_t b = 2;
return a + b;
```

and it will be effectively this because in this code path we passed all
the checks for bad types and overflow already. So we know that
`a+b`

will fit within 32 bits.

A really useful tool for investigating how code maps to assembly is the compiler-explorer project. This tool has a super useful web interface https://gcc.godbolt.org/ which we are using here.

Let's see how the following C++ code disassembles (I know this isn't C but compiler-explorer does not support C on the website but this is going to have a very similar if not same assembly output):

```
#include <cstdint>
int32_t add_two_numbers(int32_t a, int32_t b) {
return a + b;
}
```

Assembly in Intel syntax from the above code:

```
add_two_numbers(int, int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov DWORD PTR [rbp-8], esi
mov edx, DWORD PTR [rbp-4]
mov eax, DWORD PTR [rbp-8]
add eax, edx
pop rbp
ret
```

(source https://godbolt.org/g/5aQTAJ)

We are dealing with a stack here (much like how the Python bytecode works) that has to deal with the underlying stack pointers.

We first push `rbp`

on to the stack to store the result of the addition.
After we have loaded the inputs into registers we execute the `add`

command which will stores the result in `rbp`

and finally we pop that
value before returning.

We have given the machine some instructions to add some numbers but how exactly is this is this addition with the `add`

actually performed?

See the next article in the series to find out!

This post is part 5 of the "PageRequestsToTheMetal" series:

- Page requests to the metal - introduction
- Page requests to the metal - frontend
- Page requests to the metal - Network stack - From frontend to backend
- Page requests to the metal - Backend - What happens on the server
- Page requests to the metal - Backend implementation *
- Page requests to the metal - hardware level