Code for this tutorial found here: https://github.com/shuttle1987/pybind11_tutorial_code
Crossposted here: http://www.jaggedverge.com/2016/09/generating-python-bindings-from-c11-with-pybind11/ (please ask questions or leave comments over there)

Recently I was involved in some discussion sparked by this blog post where "rdrewd" was talking about benchmarking a simple program that generated prime numbers in Python.
The thing about this that I like is that it's a very good example of a situation where making pragmatic decisions about implementation is highly relevant.
I like that the author is focusing on the pragmatics of solving the underlying problem.

There's a page on the python wiki about performance in python that is a good summary of some key issues.
One passage I think is highly relevant (emphasis mine):

Some people are inappropriately obsessed with speed and think that just because C can provide better performance for certain types of problem,
it must therefore be a better language for all purposes.
Other people think that speed of development is far more important, and choose Python even for those applications where it will run slower.
Often, they are surprised to find Python code can run at quite acceptable speeds, and in some cases even faster than what they could get from C/C++ **with a similar amount of development time invested**.

Usually it is not the absolute speed that is important, you should think about what would be an acceptable speed of execution.
Optimizations beyond achieving this acceptable speed are wasteful of resources (usually: your time. And thus: money.).

The general philosophy of cPython is to write the bottleneck code in a faster language.
The original blog post got my attention because it's a very good example of a time where calling out to a faster language makes sense.
Personally if I'm faced with a situation where the code is running in cPython I'll try PyPy first because the speedups due to the tracing JIT are quite remarkable.
Many times in my career switching to PyPy was sufficient to make the implementation of a project have feasible run times.
The times it has not been enough I've used various FFI techniques to get the benefit of fast development in Python while running performance critical code in a faster language.

If there's one thing I'd like people to take away from this is that choosing implementations is not a binary choice, sometimes mixing and matching is best.

## Generation of prime numbers

I mostly want to talk about programming implementation details but first I think it's good to go over some algorithms.

Even the best implementation in the world can't overcome a bad algorithm choice in a problem like this.
Efficiently generating prime numbers is actually quite involved.

Naive algorithms like the Sieve of Eratosthenes are conceptually easy to understand but allow for almost no locality of reference.
Modern CPUs are very sophisticated and are often explicitly designed to leverage locality of reference via memory caching.
Efficiently using the CPU cache can have huge impacts on the performance of otherwise similar looking code.

The best algorithms for generating primes allow for good locality of reference.
For example the Sieve of Atkin has a better asymptotic complexity than the Sieve of Eratosthenes, but it also has much better locality of reference.
A good implementation of this algorithm that takes locality into account will outperform naive implementations
because the algorithm allows much better utilization of cache-backed CPU hardware.

All this applies for any language, when it matters good algorithms choice makes a huge difference.

## Practical implementation

Lets say we have chosen the right algorithm and that algorithm intrinsically takes some serious computing power to run.
How do we actually get that into code in a way we can easily use in our applications?

The reference code that sparked this off looks like this in Python3:

primes=[2]
cand=3
while len(primes) < 20:
for factor in primes:
if cand % factor == 0:
# Not a prime
break
else:
primes.append(cand)
cand += 2
print(primes)

Original Python2 code on pastebin: http://pastebin.com/DHQtzN0s

For the sake of the article lets assume that this algorithm is the one we have decided to use.
The blog post talked of the relative merits of using different languages, a C implementation was made and
people benchmarked various different implementations of the same algorithm in different languages.
Many people in the comments tried various languages and the inevitable discussions of the relative merits
of languages came up.
The author expressed regret nobody wrote a c++ implementation, so here's a naive one I threw together today:

#include <iostream>
#include <vector>
/**
* Generates the a sequence of prime numbers
* @param how_many how many primes to generate
*/
std::vector<int> generate_primes(unsigned int how_many){
std::vector<int> primes;
primes.reserve(how_many);
primes.push_back(2); //2 is the first prime number
unsigned int candidate = 3;
bool potential_prime = true;
while(primes.size() < how_many){
potential_prime = true;
for(unsigned int prime: primes){
if(candidate % prime == 0){
candidate += 2;
potential_prime = false;
break;
}
}
if(potential_prime){
primes.push_back(candidate);
}
}
return primes;
}
int main(){
std::vector<int> first_20_primes = generate_primes(20);
for(int prime : first_20_primes){
std::cout << prime << std::endl;
}
return 0;
}

This C++11 solution is fairly close to the original python code.

Comparatively speaking you can see that Python makes it really easy to implement *something* quickly.
There's more that can be done with c++ as well but this is about the limit of quickly putting something together.
As is fairly obvious C++ is just a more verbose language.

## Benchmarking

Decisions about performance benefit enormously from benchmarking.
Because the impact of performance is very much non-uniform finding *where* the performance matters is a crucial step.
If you are ever trying to decide if you need to use a foreign function interface out or any other means of writing
code in a different language due to performance reasons you really need to be benchmarking your code to find out where the *real* bottlenecks are.

Some benchmarks were provided in the blog post that were on extremely small sets of data.
The problem size was small enough where the IO completely eclipsed the run time.
*But that was the authors entire point*. In a throwaway project this small the run time difference did not actually matter.
In the particular case of generating 20 primes, just write it in Python, you will have the solution with the least amount of developer hours spent that way.
The computational time is going to be small enough that the trade-off of time that would be required to write it in another language is just not worth it.

However what happens when you want to generate the first thousand primes? How about the first 10 thousand?
What happens if you put this code on a popular web server where you are getting a lot of hits and you are charged for the amount of CPU time used?

We should seriously consider another implementation when the balance shifts away from ease of implementation being the main factor.
Don't make the decision lightly, if you are writing high quality code to begin with then using higher performance
code that complicates the building of your project involves a real trade-off in the form of introducing additional complexity.

First we would benchmark and see where the bottlenecks are as this enables us to make a judgment call about the costs of improving the performance of the code by re-implementing those bottlenecks.

## FFI and python

So lets say you hve benchmarked and you have found that your python implementation of this function is such a bottleneck that you can't satisfy your requirements,
what do you do next?

One of the big things that has changed in recently is how much easier it is to make python bindings now.
You don't even need to know the python C-api, you can use something like pybind11 and just use that to generate your python modules.

Here's how we would expose that C++ function to Python:

#include <vector>
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
namespace py = pybind11;
/**
* Generates the a sequence of prime numbers
* @param how_many how many primes to generate
*/
std::vector<int> generate_primes(unsigned int how_many){
std::vector<int> primes;
primes.reserve(how_many);
primes.push_back(2); //2 is the first prime number
unsigned int candidate = 3;
bool potential_prime = true;
while(primes.size() < how_many){
potential_prime = true;
for(unsigned int prime: primes){
if(candidate % prime == 0){
candidate += 2;
potential_prime = false;
break;
}
}
if(potential_prime){
primes.push_back(candidate);
}
}
return primes;
}
/**
* Generate the python bindings for this C++ function
*/
PYBIND11_PLUGIN(primes_py3) {
py::module m("example", "Generating primes in c++ with python bindings using pybind11");
m.def("generate_primes", &generate_primes, "A function which generates a list of primes. The length of this list is specified by the user");
return m.ptr();
}

Overall the amount of boilerplate is remarkably small.
Also the pybind11 library is header only, which is great for keeing builds simple as we don't need to link to it.
However the build is still complex enough that making a script is worthwhile:

#!/bin/bash
g++ -O3 -fPIC -shared -std=c++11 -I /home/janis/code_snippets/prime_gen/pybind11-1.8.1/include `python3-config --cflags --ldflags` prime_gen_python3_bindings.cpp -o primes_py3.so

This compiles the C++ into a form that can be imported from python directly:

janis@janis-laptop:~/code_snippets/prime_gen (master)$ python3
Python 3.4.3+ (default, Oct 14 2015, 16:03:50)
[GCC 5.2.1 20151010] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import primes_py3
>>> primes_py3.__doc__
'Generating primes in c++ with python bindings using pybind11'
>>> primes_py3.generate_primes.__doc__
'generate_primes(int) -> list<int>\n\nA function which generates a list of primes. The length of the list is specified by the user\n'
>>> primes_py3.generate_primes(20)
[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71]

As you can see from the python side this is really easy to use, you don't even know that the C++ is there.

When working on a team this can be a very powerful technique to have in your arsenal.
By making performance critical modules in this way you can keep the ease of use for the consumers of the code while getting performance improvements.

## Summary

Ease of development in smaller projects is one of the biggest wins that Python delivers, this is because *speed of development is one of Python's strongest points*.
C++ has very made very different decisions because the philosophy of the language is fundamentally different.
C++ is designed to be powerful, it isn't designed for ease of use in smaller projects, though it has got a *lot* easier to use in the more modern revisions of the language.
Where C++ really shines is in the certain situations where the power it gives you is actually important, such as high performance requirements situations.
As this tutorial shows when there's some performance bottlenecks you have options other than complete re-writes.
Specifically you can just deal with the parts your profiling/benchmarking have shown to be problematic.

If you are writing your performance critical code in c++ using pybind11 is a convenient way to consume that code from Python.
It is by no means the only way, you could use subprocesses or message queues like ZeroMQ, each of these has their own set of trade-offs.
If you have C code then I'd recommend looking into CFFI, it might make your life a lot easier. I used it in my MMAP backed array library and saved myself a ton of hassles.

Essentially you want to use the right tool for the job, and sometimes this involves using more than one language.