I’ve been overwhelmed by security work lately, so let me switch to something lighter to relax my mind.
Python 3.14 officially introduces a new mechanism called the Tail Call Interpreter (contributed by Ken Jin), undoubtedly another major milestone that lays groundwork for the future.
Main Content
Before discussing Python 3.14's Tail Call Interpreter, we need to talk about one of the most basic constructs in C: switch-case.
```c
switch (x) {
    case 1:
        // do something
        break;
    case 2:
        // do something else
        break;
    default:
        // do default thing
        break;
}
```
What does the final assembly look like for code like this? Most people's first reaction is probably a cmp followed by a je, falling through to the next comparison on a mismatch. Let's compile a version and see.
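For a small 3-case switch like the one above, the output looks roughly like this (an illustrative sketch, not an exact listing):

```asm
small_switch:
    cmp    $0x1,%edi        # x == 1 ?
    je     .Lcase1
    cmp    $0x2,%edi        # x == 2 ?
    je     .Lcase2
    cmp    $0x3,%edi        # x == 3 ?
    je     .Lcase3
    jmp    .Ldefault        # no match, go to default
```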
We can see that overall it's just what we expected: a chain of cmp instructions, each followed by a je. Now, what's the complexity here? O(n) in the number of cases, straightforward enough.
Damn, for Python, with its enormous switch-case, wouldn't that be O(n) on every single dispatch? Wouldn't that be a performance disaster?
Actually, no. Compilers usually choose different strategies for a switch-case depending on the type, count, and distribution of the case values. Take char_switch, a switch over the consecutive characters 'a' to 'e', as an example:
```asm
0000000000001580 <char_switch>:
    1580: 83 ef 61              sub    $0x61,%edi        # Subtract ASCII value of 'a'
    1583: 40 80 ff 04           cmp    $0x4,%dil         # Check if ≤ 4 (a-e)
    1587: 77 63                 ja     15ec
    1589: 48 8d 15 4c 0c 00 00  lea    0xc4c(%rip),%rdx
    1590: 40 0f b6 ff           movzbl %dil,%edi         # Zero-extend to 32 bits
    1594: 48 63 04 ba           movslq (%rdx,%rdi,4),%rax
```
So the compiler lowers a switch-case differently depending on its shape. Let me summarize this with a table:
| Switch Type | Case Count | Distribution | Compiler Strategy | Time Complexity |
|---|---|---|---|---|
| small_switch | 3 | Consecutive (1, 2, 3) | Linear comparison | O(n) |
| dense_switch | 8 | Consecutive (10-17) | Offset jump table | O(1) |
| sparse_switch | 4 | Sparse (1, 100, 1000, 10000) | Binary search | O(log n) |
| large_dense_switch | 20 | Consecutive (1-20) | Standard jump table | O(1) |
| mixed_switch | 7 | Partially consecutive | Nested comparison | O(log n) |
| char_switch | 5 | Consecutive ('a'-'e') | Character offset table | O(1) |
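If you want to reproduce this, here's a sketch of what two of these test functions might look like (my reconstruction; the original test code isn't shown). Compile with `gcc -O2 -S` and compare the output:

```c
// Consecutive cases 10-17: compilers typically emit an offset jump table.
int dense_switch(int x) {
    switch (x) {
        case 10: return 1;  case 11: return 2;
        case 12: return 3;  case 13: return 4;
        case 14: return 5;  case 15: return 6;
        case 16: return 7;  case 17: return 8;
        default: return 0;
    }
}

// Sparse cases: a jump table would be almost all holes, so compilers
// usually fall back to a binary search over the case values instead.
int sparse_switch(int x) {
    switch (x) {
        case 1:     return 1;
        case 100:   return 2;
        case 1000:  return 3;
        case 10000: return 4;
        default:    return 0;
    }
}
```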
OK, so the final lowering of a switch-case varies with the shape of its cases, which makes the performance of the generated code hard to predict. Do we have a way around this? The answer is yes: Computed Goto.
```c
#include <stdio.h>

void basic_computed_goto(int operation) {
    // Jump table of label addresses (the GCC/Clang "labels as values" extension)
    static void* jump_table[] = {
        &&op_add, &&op_sub, &&op_mul,
        &&op_div, &&op_mod, &&op_default
    };
    int a = 10, b = 3;
    int result;

    if (operation < 0 || operation > 4) {
        operation = 5;
    }
    printf("Operation %d: a=%d, b=%d -> ", operation, a, b);
    goto *jump_table[operation];

op_add:
    result = a + b;
    printf("ADD: %d\n", result);
    return;
op_sub:
    result = a - b;
    printf("SUB: %d\n", result);
    return;
op_mul:
    result = a * b;
    printf("MUL: %d\n", result);
    return;
op_div:
    if (b != 0) {
        result = a / b;
        printf("DIV: %d\n", result);
    } else {
        printf("DIV: Error (division by zero)\n");
    }
    return;
op_mod:
    if (b != 0) {
        result = a % b;
        printf("MOD: %d\n", result);
    } else {
        printf("MOD: Error (division by zero)\n");
    }
    return;
op_default:
    printf("Unknown operation\n");
    return;
}
```
We can see that the core trick is to turn each case of the switch into a label, then use a jump_table of label addresses to jump directly to the matching label. Let's look at the assembly of the most critical part, the goto *jump_table[operation] line:
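The exact output depends on the compiler, but the dispatch boils down to a single indirect jump, roughly like this (an illustrative sketch):

```asm
    lea    jump_table(%rip),%rdx     # load the address of the label table
    movslq %edi,%rax                 # sign-extend the operation index
    jmp    *(%rdx,%rax,8)            # one indirect jump, straight to the label
```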
To summarize, Computed Goto has the following advantages over a traditional switch-case:

- Reduces the cost of branch mispredictions, since each dispatch site gets its own entry in the branch predictor
- Better instruction cache locality
- Reduces the number and overhead of cmp instructions
So how much faster can it be? You can refer to the test results from when Computed Goto was introduced into CPython, which showed an overall improvement of about 15%.
So is Computed Goto perfect? Not quite. CPython's interpreter loop in ceval.c, currently home to one of the largest switch-cases anywhere, still has several typical problems:
1. Computed Goto is a nonstandard extension of Clang and GCC, so other platforms and compilers see limited benefit.
2. The optimization is not mature: different versions of the same compiler may mishandle it in different ways.
3. The switch-case is so extremely large that compilers fail to optimize it well.
4. We cannot use perf to precisely quantify per-opcode overhead across the whole interpreter loop, which is a big problem in the context of making Python faster.
Points 1, 3, and 4 are easy to understand. Let’s look at an example of point 2 (thanks to Ken Jin for providing the example).
For example, GCC 11 would generate code like the following in certain parts of the switch-case:
We can see that additional registers were introduced. Computer Architecture 101: additional registers mean additional overhead. Registers are expensive!
So, can we iterate on this extremely large switch-case? Some readers might be thinking: since the switch-case is so large, why not split it into many small functions? That way the compiler has enough local context to optimize each one, and perf can precisely attribute overhead to each function. Wouldn't that be great?
But others object: a function call emits a call instruction, which brings the extra overhead of pushing a return address and saving and restoring registers. Won't that make things slower again?
So can we have the best of both worlds? The answer is yes, and many readers will have guessed it already: tail calls.
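As a concrete example (my reconstruction; the addresses are illustrative), consider a function f whose last action is calling another function g:

```c
int g(int x);

int f(int x) {
    return g(x + 1);   // the last thing f does is call g: a tail call
}
```

Compiled without tail-call optimization, f looks roughly like this:

```asm
0000000000001150 <f>:
    1150: add    $0x1,%edi
    1153: call   1140 <g>    # push a return address, then jump to g
    1158: ret                # return g's result to our own caller
```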
The call 1140 <g> instruction is very prominent, and it is an important source of function call overhead.
In modern compilers, there's a special optimization called tail-call optimization: when the last step of a function is a call to another function, the compiler can optimize away the overhead of that call.
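With the optimization enabled (for GCC, -foptimize-sibling-calls, which -O2 turns on), the same f becomes something like:

```asm
0000000000001150 <f>:
    1150: add    $0x1,%edi
    1153: jmp    1140 <g>    # tail call: g returns directly to f's caller
```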
We can see that the last step of f is now jmp 1140 <g>, not call 1140 <g>. When we transfer to g this way, we skip the extra work a traditional call brings: no return address is pushed and no new stack frame is set up; g simply returns straight to f's caller.
Some readers might have realized that after this optimization, such a call can be viewed entirely as a high-performance goto.
Bingo, that's exactly the idea. Guy Steele's 1977 paper “Debunking the ‘Expensive Procedure Call’ Myth, or, Procedure Call Implementations Considered Harmful, or, Lambda: The Ultimate GOTO” argued that an efficient procedure call can perform close to a goto while being much cleaner to implement.
In Python 3.14, the Tail Call Interpreter is built on exactly this idea: the macro that dispatches opcodes now performs a tail call into the next opcode's handler function:
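CPython's real macros are more involved, but the core shape can be sketched like this (a minimal sketch with illustrative names, not CPython's actual code; musttail is supported by Clang, with recent GCC gaining an equivalent):

```c
#include <stdio.h>

// Each opcode gets its own handler function; dispatch is a tail call
// through a function-pointer table, so every handler stays small enough
// for the compiler to optimize and shows up as its own symbol in perf.
typedef struct {
    const unsigned char *ip;   // instruction pointer
    long acc;                  // a single accumulator "register"
} VM;

typedef long (*handler)(VM *vm);
static handler table[3];

// musttail forces the compiler to emit this return as a jmp, not a call,
// so dispatching to the next handler costs no stack growth at all.
#define DISPATCH(vm) \
    __attribute__((musttail)) return table[*(vm)->ip++](vm)

static long op_inc(VM *vm)  { vm->acc += 1; DISPATCH(vm); }
static long op_dbl(VM *vm)  { vm->acc *= 2; DISPATCH(vm); }
static long op_halt(VM *vm) { return vm->acc; }

int main(void) {
    table[0] = op_inc; table[1] = op_dbl; table[2] = op_halt;
    const unsigned char code[] = {0, 1, 1, 2};   // inc, dbl, dbl, halt
    VM vm = { code, 0 };
    printf("%ld\n", table[*vm.ip++](&vm));       // (0+1)*2*2 -> prints 4
    return 0;
}
```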
So while ensuring our baseline performance is as good as or even better than Computed Goto, we can get the following benefits:
- Broader platform support
- After the cases are split into separate functions, compilers are much less likely to mis-optimize, and overall performance is far more predictable
- A happy perf: every opcode handler is now its own function, so profilers can attribute cost to individual opcodes
- And we can do more cool things with tools like eBPF
Summary
That's pretty much it for this article. Although it claims to only introduce Python 3.14's Tail Call Interpreter, it actually walks through the entire evolution of the thinking behind it.
This also leaves me with an insight: predictability really is a very important property in many situations.
This and remote debugging (PEP 768) are the two features I like most in 3.14. Long live observability!