Python is notorious for being slow. Why? The answer is simple: it was never intended for computationally heavy use cases in the first place. But there is a simple rule of life that we engineers don't seem to remember: if something is easy to use, it will naturally attract more and more users, and then demands for lots of things suddenly emerge, performance being one of them.
What makes Python slow
In this article, all references to Python mean the CPython implementation. It's the most popular implementation as well as the reference one, and if you don't know what CPython is, chances are you're using it. Be aware that some other implementations differ from CPython in fundamental ways, hence cannot be treated the same.
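If you're not sure which implementation you're running, the standard library can tell you:

import platform

# Prints the implementation name, e.g. "CPython" or "PyPy".
print(platform.python_implementation())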
Python is an interpreted language. That means an interpreter (a piece of software) is needed to read the source code and execute it. In contrast, languages such as C/C++, Rust, etc. compile the source code to machine code, which the CPU can execute directly.
Yes, most of the popular and decent interpreted languages nowadays have a compilation step that is supposed to speed things up a bit. Python does the same thing (that's why you sometimes see a bunch of .pyc files lying around). But it only does some early steps of the compilation process in advance, such as lexical analysis and parsing, and produces a machine-friendly intermediate representation (IR) of the program: the bytecode. An interpreter is still needed later to execute that IR.
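You can actually peek at this IR yourself with the standard dis module, which disassembles Python code into the bytecode instructions the interpreter executes (the exact instruction names vary between Python versions):

import dis

def add(a, b):
    return a + b

# Disassembles the function into CPython bytecode, e.g. LOAD_FAST,
# BINARY_OP (BINARY_ADD on older versions) and RETURN_VALUE.
dis.dis(add)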
Python is extremely dynamic. There is no static typing (type hinting plays absolutely no role functionally), hence every operation needs some sort of checking and fiddling around before the correct action can be determined and executed. Calling a function? Are you sure that's a function? That name can easily be rebound to something else, like a number, while the program is running. Adding two numbers? Oh hang on, we first need to make sure both are actually numbers and convert them to a common numeric type.
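A tiny, made-up example of that dynamism in action:

def greet():
    return "hello"

print(greet())  # hello

greet = 42      # The same name is rebound to an int. Perfectly legal.
# greet()       # Would now raise TypeError: 'int' object is not callable.

# Even a simple addition involves runtime checks and conversion:
print(1 + 2.5)  # The int is converted to float first. Prints 3.5.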
All the dynamic, flexible and friendly aspects of Python come with a non-trivial cost in terms of performance and memory usage.
The part that is fast
The core of Python is written in C, and those parts are supposed to be fast. Imagine Python as a company with a large group of experts who can do things very accurately and efficiently, but with a big boss who reads the instructions, understands them and dispatches one task at a time to the suitable expert. If the instructions consist of a long list of tasks so simple that each can be done in the blink of an eye (in technical terms, lots of tasks so simple they take only one or a few clock cycles), then most of the overhead comes from the big boss.
So what is the trick? Give the experts bigger jobs, so that the execution time is dominated by the time they spend actually working.
Vectorization
Doing things in bulk
Vectorization is a popular and extremely efficient way of optimizing performance, not just for Python but for a huge part of the digital world. It does sound cool and all, yeah, "vectorization", but it can be understood simply as processing things in bulk. Remember the part about the big boss who slows down the whole thing? Yeah, let's keep their nosy ass out of our work.
NumPy is probably the first thing anyone using Python for scientific computing, data analysis, machine learning, etc. would use.
import numpy as np
has become a habit.
And that's for a very good reason.
NumPy provides a simple way to do computation on multi-dimensional arrays and matrices. So instead of doing this:
# a, b and c are pre-allocated arrays of the same length.
for i in range(len(a)):
    c[i] = a[i] + b[i]
We can just write this:
c = a + b
And NumPy will take care of it by calling a single, highly optimized function written in C, hence it's much, much faster. A benchmark using Python 3.11 on an Apple M1 shows about a 42x performance uplift from NumPy. That's a significant gain for even less code. Win-win.
| Implementation | Execution time | Speed up |
| --- | --- | --- |
| For loop | 27.775 ns/element | 1.0x |
| NumPy | 0.662 ns/element | 42.0x |
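For reference, here is a minimal sketch of such a benchmark (the array size and repetition count are arbitrary choices here, and the absolute numbers will differ on your machine):

import timeit

import numpy as np

n = 1_000_000
a = np.random.rand(n)
b = np.random.rand(n)

def loop_add():
    c = np.empty(n)
    for i in range(n):
        c[i] = a[i] + b[i]
    return c

def numpy_add():
    return a + b

# Average one run over 10 repetitions for each implementation.
loop_time = timeit.timeit(loop_add, number=10) / 10
numpy_time = timeit.timeit(numpy_add, number=10) / 10
print(f"For loop: {loop_time / n * 1e9:.3f} ns/element")
print(f"NumPy:    {numpy_time / n * 1e9:.3f} ns/element")
print(f"Speed up: {loop_time / numpy_time:.1f}x")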
People with a software background are normally more comfortable with the explicit for-loop version. It used to be the best way to write such code, especially in native languages like C/C++ in the 1990s. Yes, it's quite a surprise that even in C/C++, processing each element individually like that stopped being fast many years ago.
People with less of a software background are more likely to choose the NumPy approach, simply because it looks more natural, just like math. It's both a happy coincidence and a bizarre situation at the same time.
But hang on: if Python is slow because of the nosy big boss, why is a for loop in native languages like C/C++ no longer fast? Surely C/C++ don't have that big boss.
SIMD - Single Instruction Multiple Data
SIMD is a feature of modern CPUs (well, modern relative to 1980s CPUs) that provides a way to process multiple data elements per instruction.
Put simply, instead of using 4 ADD instructions like this:
c[0] = a[0] + b[0]
c[1] = a[1] + b[1]
c[2] = a[2] + b[2]
c[3] = a[3] + b[3]
Or, if you fancy a bit of assembly language, which is the language that maps 1:1 to the CPU's "native" machine code, this is the exact sequence of instructions the CPU will run:
LDR w8, [x0] ; Loads a[0] from memory to register w8 | w8 = a[0]
LDR w9, [x1] ; Loads b[0] from memory to register w9 | w9 = b[0]
ADD w8, w9, w8 ; Adds w8 and w9 and writes result to w8 | w8 = w8 + w9
STR w8, [x2] ; Stores w8 to c[0] in memory | c[0] = w8
LDR w8, [x0, #0x04] ; Loads a[1] from memory to register w8 | w8 = a[1]
LDR w9, [x1, #0x04] ; Loads b[1] from memory to register w9 | w9 = b[1]
ADD w8, w9, w8 ; Adds w8 and w9 and writes result to w8 | w8 = w8 + w9
STR w8, [x2, #0x04] ; Stores w8 to c[1] in memory | c[1] = w8
LDR w8, [x0, #0x08] ; Loads a[2] from memory to register w8 | w8 = a[2]
LDR w9, [x1, #0x08] ; Loads b[2] from memory to register w9 | w9 = b[2]
ADD w8, w9, w8 ; Adds w8 and w9 and writes result to w8 | w8 = w8 + w9
STR w8, [x2, #0x08] ; Stores w8 to c[2] in memory | c[2] = w8
LDR w8, [x0, #0x0C] ; Loads a[3] from memory to register w8 | w8 = a[3]
LDR w9, [x1, #0x0C] ; Loads b[3] from memory to register w9 | w9 = b[3]
ADD w8, w9, w8 ; Adds w8 and w9 and writes result to w8 | w8 = w8 + w9
STR w8, [x2, #0x0C] ; Stores w8 to c[3] in memory | c[3] = w8
That's a lot of instructions, isn't it? On most modern CPUs, we only need a single instruction to add 4 32-bit integers. Here is the C/C++ code to do it using Arm® Neon™ intrinsics (wrapped in a small function here so the snippet is self-contained):
#include <arm_neon.h>

void add4(const int32_t* a, const int32_t* b, int32_t* c) {
    const int32x4_t va = vld1q_s32(a);      // Loads 4 elements from a[0] to a[3].
    const int32x4_t vb = vld1q_s32(b);      // Loads 4 elements from b[0] to b[3].
    const int32x4_t vc = vaddq_s32(va, vb); // Adds the two vectors together.
    vst1q_s32(c, vc);                       // Stores the result to c[0] to c[3].
}
It doesn't look particularly efficient at first glance, right? Let's look at what the body of that C++ function compiles to:
LD1 {v0.4S}, [x0] ; Loads a[0] to a[3] to register v0
LD1 {v1.4S}, [x1] ; Loads b[0] to b[3] to register v1
ADD v0.4S, v1.4S, v0.4S ; Adds v0 and v1 and writes the result to v0
ST1 {v0.4S}, [x2] ; Stores v0 to c[0] to c[3]
It obviously has significantly fewer instructions. The CPU, even though extremely fast, benefits greatly from decoding, scheduling and dispatching fewer instructions (just like the big boss above), as well as from a more predictable memory access pattern. The SIMD engine in the CPU also gets the opportunity to perform 4 or even more arithmetic operations in parallel. That is what's inside NumPy and many other highly optimized libraries: SIMD and, more recently, vector instructions are used in data processing routines to achieve much higher computational throughput.
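As a quick sanity check, recent NumPy versions can report the SIMD extensions they were built with and detected on your CPU (the exact output format depends on your NumPy version):

import numpy as np

# Prints build and CPU information; on recent versions this includes
# the supported SIMD extensions, e.g. NEON on Arm or AVX2/AVX512 on x86.
np.show_config()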
But does SIMD really make adding two arrays faster? In other words, do fewer instructions really mean faster execution? Let's address that further in the next article.
A few takeaways
- Python is slow, and it will remain constrained by its dynamic, flexible and friendly nature.
- Keeping the code short and simple normally helps Python run faster (but, as with most other rules in life, it's not 100% guaranteed).
- Use popular, performant Python libraries as much as possible: they normally utilize SIMD and many other techniques to achieve performance far beyond what a plain hand-written loop can reach.