Interning in CPython

Perhaps not every Python developer knows an interesting property of the CPython that makes newcomers go crazy. An example speaks louder than words:

>>> x = 255
>>> y = 255
>>> x == y
True
>>> x is y
True

Whaaat?

What is an assignment in Python

The general syntax of a simple assignment in Python looks as follows:

<expression left> = <expression right>

For example:

x = "Dummy string"

So now, we can use the x variable to use our string in another part of the program.

If you use CPython, we can even explicitly check the memory address where the x variable is located using the built-in id() function.

>>> id(x)
140400709562064

This should be interpreted as — the variable x points to a value that is stored in system memory at address 140400709562064. In C language terms, it is termed as a pointer. If the pointers of two variables are equal, it means they point to the same address in memory. The critical word here is "points" and not "contains" or "equal because a variable is just a name that points to an object in memory, and there can be many such names but the actual value lies in the same place in memory.

For example, we can assign a value stored at 140400709562064 to two more variables:

>>> z = y = x
>>> id(x), id(y), id(z)
(140400709562064, 140400709562064, 140400709562064)

We see that the object in memory has not changed at all, it is the same object and three variables are pointing to it.

Assignment

So the assignment operation in Python does not copy the object, it only creates a reference or a pointer to the actual object stored somewhere in the RAM.

>>> x += "!"  # this statement creates a new independent object
>>> x
'Dummy string!'
>>> y  # the old one remains unharmed
'Dummy string'
>>> id(x), id(y)
(140534150379248, 140400709562064)

Now, if we change the original object, a new object will be created in another place in memory, and our new variable will point to it. The original object remains untouched. This is because strings in Python are immutable and each change creates a new object.

Immutable object assignment

With immutable types, everything is simple — they are almost always recreated in memory. But let's imagine that the expression on the right is a mutable object like a list:

>>> x = [1, 2, 3]
[1, 2, 3]

If we try to change the original mutable object, a very confusing situation arises:

>>> y = x  # both variables point to the same object
>>> y
[1, 2, 3]
>>> id(x), id(y)
(140534150306864, 140534150306864)
>>> x.append(4)
>>> x
[1, 2, 3, 4]
>>> y
[1, 2, 3, 4]
>>> id(x), id(y)
(140534150306864, 140534150306864)

So when we change a variable pointing to a mutable object, we are not creating a copy of it like with immutable objects — here we are changing the original object the variable pointing to.

All right, now with the basics in mind, let's go to our example at the beginning of the post.

Integer interning

Going back to our example at the beginning of the post, we can see that we created two separate objects with the value equal to 255 expecting them to reference different objects in memory:

>>> x = 255
>>> y = 255
>>> x == y
True
>>> x is y
True

Here we are comparing the values of the objects using the == operator. And is operator compares the identity of two objects. The is operator evaluates to True if the variables on either side of the operator point to the same object and false otherwise.

But how are they literally the same?

There is actually an optimization in CPython regarding small integers (-5 to 256 inclusive). These objects are loaded into the interpreter's memory when the interpreter starts up. This results in a small internal cache. Every time we try to create an integer object in this range, CPython automatically refers to these objects in memory instead of creating new integer objects. Because of this, variables with the same values point to the same object, and the result is true.

This is called interning — reusing the objects on-demand instead of creating new objects. As you can see from the above code, both x and y refer to the same object in memory. This is because CPython does not create a new object but instead refers to a memory section of a. All this happens because of integer interning.

A similar example for number more than 256 works as expected:

>>> x = 257
>>> y = 257
>>> x == y
True
>>> x is y
False

The reason for this optimization strategy is simple: integers between -5 and 256 are used more often, so it makes sense to store them in the main memory. CPython preloads them into memory on startup to optimize speed and memory.

String interning

The same principle is also applied for strings:

>>> a = "the"
>>> b = "the"
>>> a == b
True
>>> a is b
True

String interning is a very useful mechanism in CPython that allows you to compare strings much faster. Unfortunately, newcomers can sometimes be confused by unexpected results which can be caused by this.

As we have already discussed, the rules for implicit string interning can be different. Relying on the rules can lead to unexpected errors. After all, rules can change so quickly as language is constantly evolving, but developers still need to ensure the stability and robustness of their programs. Not to mention that remembering the rules creates an unnecessary burden for developers.

For example, before Python 3.7, peephole optimization was used to intern strings, and all strings longer than 20 characters were not interned. However, then the algorithm was changed to AST optimizer, and the threshold length became 4096 instead of 20.

# version used: Python 3.6.8
>>> x = "K"*20
>>> y = "K"*20
>>> x == y, x is y
(True, True)
>>> x2 = "K"*21
>>> y2 = "K"*21
>>> x2 == y2, x2 is y2
(True, False)

# version used: Python 3.9.5
>>> x = "K"*4096
>>> y = "K"*4096
>>> x == y, x is y
(True, True)
>>> x2 = "K"*4097
>>> y2 = "K"*4097
>>> x2 == y2, x2 is y2
(True, False)

So, if for some reason we need to intern a string, there is a built-in function we can use to intern the string explicitly — intern() function in module sys.

Let's see how it works with a simple example:

>>> import sys
>>> x = sys.intern("K"*4097)
>>> y = sys.intern("K"*4097)
>>> x == y, x is y
(True, True)

As shown above, using intern() we can make Python intern strings no matter what the implicit rules are.

In practice, if we need to compare several long strings and the same value may appear many times, using the intern() function explicitly is the recommended way to speed up string comparison.

Why do developers need interning?

String interning in CPython is a mechanism for storing only one copy of a string value in memory. If there are multiple string variables whose values are the same, they will be interned by CPython implicitly and will refer to the same object in memory.

There are several advantages to this mechanism:

Saving memory space.
Fast comparisons
Fast dictionary lookups.

The first advantage is obvious because it takes less space to store only one copy than to store all copies. The second and third are because if two strings refer to the same object, they are unambiguously equal and there is no need to compare their characters one by one.

Nothing is perfect, and interning and maintaining a pool of cached objects also takes time. If a string is very long and will never be compared to others, it will just take an extra time to intern it. So use this functionality wisely.

Interning mechanism is a hidden gem in CPython. CPython will intern some strings implicitly. However, developers can use it explicitly and if used properly, it will make a huge difference in increasing the speed of applications.

P.S.

There is more to it.

You can access this memory if you want and change the value. Say, change literal 4 to value 5. I'll probably be damned by many people for this post, but here is a sample code:

>>> import ctypes
>>> ctypes.memmove(id(4) + 24, id(5) + 24, 8)
>>> print(2 * 2) # 5

You can also replace 0 with 1. Have fun debugging!