Pyrobuf is an alternative to Google’s
Python Protobuf library that is written in Cython and that offers better
performance (roughly 1.5-2x faster), Python 3 support, and simple
serialization/deserialization to JSON and native Python dictionaries.
Since Pyrobuf’s only installation requirements are Cython, Jinja2, and setuptools,
it’s also much easier to install than Google’s library, and should work as a drop-in
replacement. Pyrobuf parses the same
.proto specs as the Google library and
generates Python modules (in
Pyrobuf started as just one part of a larger Python “port” of a C library that we use for serializing and deserializing data between a variety of formats. We plan to open-source this library in the future, but in the meantime we saw such good results with the Protobuf portion of the library that it seemed worthwhile to release it on its own.
Why we decided to reinvent the wheel
We use Google’s Protobuf library extensively at AppNexus – the data produced by each of the 1.2 million auctions per second that we run on our platform ultimately flow downstream through our data pipeline as Protobuf serialized messages. We also make extensive use, however, of our own custom serialization model which trades some of Protobuf’s space efficiency and safety for speed. We have our own library (open source version to come) which allows us to serialize data in any of a number of formats (e.g. our custom serialization format, Protobuf, JSON, tab delimited) and deserialize on the other end in any of the other formats.
Our serialization library gives us incredible flexibility in creating and defining producers and consumers of data. Until recently, however, this library was limited to the C language. Downstream data consumers working in Java or Python (two other popular languages at AppNexus) were limited to working with Protobuf messages. We wanted to allow other users to work with our in-house serialization format, but for backwards compatibility wanted users to easily be able to transition between formats, so we set out to duplicate our existing C library in Java and Python.
The naive approach to handling Python serialization/deserialization of Protobuf messages would be to copy our data to a class generated by Google’s existing Python Protobuf library, perform the serialization/deserialization, and then copy back. That approach would be pretty inefficient, however, and at AppNexus we’re all about merciless efficiency. Besides, writing data serializers is fun!
Why we used Cython
According to cython.org, “Cython is an optimising static compiler for both the Python programming language and the extended Cython programming language.” Cython compiles code written in a dialect of Python to C code, which is then compiled to a Python module. Cython code is often orders of magnitude faster than similar pure Python code and, additionally, it makes it easier to work with binary data.
For comparison, the Python struct library
provides fairly simple tools for packing and unpacking data to and from binary data,
but it’s much slower than Cython. For example, compare the following code for serializing
the 32-bit integer
42 to a binary data string written with the struct library:

    import struct

    def ser1():
        return struct.pack('i', 42)

to this code written in Cython:

    def ser2():
        cdef int x = 42
        # View the int's memory as raw bytes and copy out sizeof(int) of them.
        return (<char *>&x)[:sizeof(int)]

Running timeit on both, we see that Cython is much faster:

    ser1(): 1000000 loops, best of 3: 283 ns per loop
    ser2(): 10000000 loops, best of 3: 62.7 ns per loop

Similarly for this deserializer written with the struct library:

    def des1():
        # A bytes literal is required on Python 3; unpack returns a 1-tuple.
        return struct.unpack('i', b'*\x00\x00\x00')

and this one written in Cython:
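
    def des2():
        cdef char *y = b'*\x00\x00\x00'
        # Reinterpret the buffer as an int pointer and read the first element;
        # a plain <int>y would cast the pointer value itself, not the data.
        return (<int *>y)[0]
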
Cython wins again:

    des1(): 1000000 loops, best of 3: 232 ns per loop
    des2(): 10000000 loops, best of 3: 45.7 ns per loop

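For readers who want to reproduce the struct half of this comparison without a Cython toolchain, here is a minimal pure-Python sketch. It assumes Python 3 and uses the explicit '<i' format, which pins the result to a 4-byte little-endian integer on any platform (the native 'i' used above follows the platform's C int). The helper names pack_i32, unpack_i32, and ns_per_call are illustrative and not part of Pyrobuf, and absolute timings will differ from the numbers quoted in this post.

```python
import struct
import timeit


def pack_i32(value):
    """Serialize a signed 32-bit integer as 4 little-endian bytes."""
    return struct.pack('<i', value)


def unpack_i32(data):
    """Deserialize 4 little-endian bytes; struct.unpack returns a 1-tuple."""
    return struct.unpack('<i', data)[0]


def ns_per_call(func, *args, number=100000):
    """Average wall-clock nanoseconds per call over `number` calls."""
    seconds = timeit.timeit(lambda: func(*args), number=number)
    return seconds / number * 1e9


encoded = pack_i32(42)         # b'*\x00\x00\x00' (0x2a is ASCII '*')
decoded = unpack_i32(encoded)  # 42

if __name__ == "__main__":
    print("pack:   %.1f ns per call" % ns_per_call(pack_i32, 42))
    print("unpack: %.1f ns per call" % ns_per_call(unpack_i32, encoded))
```

The lambda wrapper inside ns_per_call adds a small constant overhead to every call, so treat the output as a rough comparison rather than a precise measurement.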
The Cython code is a little more complicated, but I actually find it easier to
work with – partly, no doubt, because I spend most of my days working in C, but
also because the
struct library does not include the
stdint types, e.g.
Why we used Jinja2
To paraphrase Winston Churchill, “Jinja2 is the worst Python templating library, except for all the others.”
At a higher level, templating allows us to translate a Protobuf message spec into fast Cython code. The templates for our Cython code are themselves relatively readable, which makes for easier development. What I find somewhat amusing is that we “compile” Protobuf specs to Cython code, which then gets compiled to C code, which is finally compiled to a usable module.
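The “spec in, source code out” step can be sketched in a few lines. This is not Pyrobuf's actual template or generator: to stay dependency-free it uses the standard library's string.Template as a stand-in for Jinja2, and the type mapping, the example message, and the render_message helper are all hypothetical. It only illustrates the idea of filling a Cython-flavoured template from a parsed message description.

```python
from string import Template

# Hypothetical templates for a Cython extension class and its fields.
CLASS_TEMPLATE = Template(
    "cdef class $name:\n"
    "$fields\n"
)
FIELD_TEMPLATE = Template("    cdef public $ctype $field")

# Assumed mapping from Protobuf scalar types to C/Cython types.
CTYPES = {
    "int32": "int",
    "int64": "long long",
    "double": "double",
    "bool": "bint",
}


def render_message(name, fields):
    """Render source text for one message from (field_name, proto_type) pairs."""
    lines = [
        FIELD_TEMPLATE.substitute(ctype=CTYPES[proto_type], field=field)
        for field, proto_type in fields
    ]
    return CLASS_TEMPLATE.substitute(name=name, fields="\n".join(lines))


# A made-up two-field message, standing in for a parsed .proto spec.
source = render_message("Test", [("id", "int32"), ("score", "double")])
print(source)
```

With Jinja2 the per-field loop would typically live inside the template itself (a `{% for %}` block) rather than in the calling Python code, which is a large part of what keeps such templates readable.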
Performance Comparison and Features
The included script
tests/perf_test.py creates a new
Test message that
covers pretty much all possible Protobuf data types, including sub messages and
imported messages, with both the Google Protobuf library and Pyrobuf. The script
then performs and times 100,000 serializations and deserializations of the
message with each library. On my development machine (Ubuntu 14.04, 2 x 6 core
Intel(R) Xeon(R) CPU E5-2630L v2 @ 2.40GHz with hyperthreading turned off),
Pyrobuf is the clear winner. Using the C++ backend for the Google library,
Pyrobuf is roughly 1.5-2x faster:

    Google took 1.045476 seconds to serialize
    Pyrobuf took 0.648426 seconds to serialize
    Google took 0.736501 seconds to deserialize
    Pyrobuf took 0.415871 seconds to deserialize

Using the default Python backend for the Google library, Pyrobuf is over 20x faster:

    Google took 14.044911 seconds to serialize
    Pyrobuf took 0.607058 seconds to serialize
    Google took 16.879840 seconds to deserialize
    Pyrobuf took 0.433617 seconds to deserialize

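To make the “1.5-2x” and “over 20x” figures concrete, the short sketch below computes the speedup ratios directly from the timings reported above. The numbers are copied from that output; the TIMINGS layout and the speedup helper are just illustrative names.

```python
# Timings in seconds, copied from the perf_test.py output quoted above.
# Each pair is (Google, Pyrobuf).
TIMINGS = {
    "C++ backend": {
        "serialize": (1.045476, 0.648426),
        "deserialize": (0.736501, 0.415871),
    },
    "Python backend": {
        "serialize": (14.044911, 0.607058),
        "deserialize": (16.879840, 0.433617),
    },
}


def speedup(google_seconds, pyrobuf_seconds):
    """How many times faster Pyrobuf was than the Google library."""
    return google_seconds / pyrobuf_seconds


ratios = {
    (backend, operation): speedup(*pair)
    for backend, operations in TIMINGS.items()
    for operation, pair in operations.items()
}

for (backend, operation), ratio in ratios.items():
    print("%s, %s: %.1fx" % (backend, operation, ratio))
```

Against the C++ backend this works out to roughly 1.6x for serialization and 1.8x for deserialization; against the pure-Python backend it is roughly 23x and 39x respectively.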
As an added bonus, Pyrobuf is Python 3 compatible (tested with Python 3.4). Further,
methods make it trivial to convert JSON formatted messages or native Python dictionaries
to Protobuf serialized messages and vice-versa.