
Asynchronous programming. Python 3.5+

This is a practical post in the series on asynchronous programming.


In this post we will map the concepts we have covered so far onto the Python stack: from the simplest tools, threads and processes, up to the asyncio library.

Asynchronous programming in Python has become more and more popular lately. There are many different libraries for doing asynchronous programming in Python. One of them is asyncio, a standard library added in Python 3.4. In Python 3.5 we got the async/await syntax. asyncio is part of the reason asynchronous programming is becoming more popular in Python. This article explains what asynchronous programming is and compares some of these libraries.


Quick Recap

What we have realized so far from the previous posts:

  • Sync: Blocking operations.

  • Async: Non-blocking operations.

  • Concurrency: Making progress together.

  • Parallelism: Making progress in parallel.

  • Parallelism implies Concurrency. But Concurrency doesn’t always mean Parallelism.


Python code can now execute in one of two worlds, synchronous or asynchronous. You should think of them as separate worlds with different libraries and styles of calls, but sharing variables and syntax.

In the synchronous Python world, which has existed for several decades, you call functions directly, and everything is processed sequentially, exactly as you wrote your code. There are, however, options for running code concurrently.

Synchronous world

In this post we will be comparing different implementations of the same code. We will execute two functions. The first one raises a number to a power:

def cpu_bound(a, b):
    return a ** b

We will do it N times:

def simple_1(N, a, b):
    for i in range(N):
        cpu_bound(a, b)

And the second one downloads data from the web:

from urllib.request import urlopen

def io_bound(urls):
    data = []
    for url in urls:
        data.append(urlopen(url).read())
    return data

def simple_2(N, urls):
    for i in range(N):
        io_bound(urls)

To measure how long each function runs, we implement a simple combined decorator/context manager:

import functools
import time

class timeit(object):
    def __call__(self, f):
        @functools.wraps(f)
        def decorated(*args, **kwds):
            with self:
                return f(*args, **kwds)
        return decorated

    def __enter__(self):
        self.start_time = time.time()

    def __exit__(self, *args, **kw):
        elapsed = time.time() - self.start_time
        print("{:.3} sec".format(elapsed))

Now let's put it all together and run it to see how long my machine takes to execute this code:

import time
import functools
from urllib.request import urlopen


class timeit(object):
    def __call__(self, f):
        @functools.wraps(f)
        def decorated(*args, **kwds):
            with self:
                return f(*args, **kwds)
        return decorated

    def __enter__(self):
        self.start_time = time.time()

    def __exit__(self, *args, **kw):
        elapsed = time.time() - self.start_time
        print("{:.3} sec".format(elapsed))


def cpu_bound(a, b):
    return a ** b


def io_bound(urls):
    data = []
    for url in urls:
        data.append(urlopen(url).read())
    return data


@timeit()
def simple_1(N, a, b):
    for i in range(N):
        cpu_bound(a, b)


@timeit()
def simple_2(N, urls):
    for i in range(N):
        io_bound(urls)


if __name__ == '__main__':
    a = 7777
    b = 200000
    urls = [
        "http://google.com",
        "http://yahoo.com",
        "http://linkedin.com",
        "http://facebook.com"
    ]
    simple_1(10, a, b)
    simple_2(10, urls)

We executed the same functions N times sequentially.

On my hardware, the cpu_bound function took 2.18 sec, io_bound — 31.4 sec.

So, we have our baseline performance. Let's move on to threads.

Threads


A thread is the smallest unit of processing that can be performed in an OS.

Threads of a process can share the memory of global variables. If a global variable is changed in one thread, this change is valid for all threads.
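
Here is a minimal sketch (my own illustration, not part of the benchmarks) of two threads updating the same global counter, with a Lock guarding the shared update:

import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:  # without the lock, concurrent updates could race
            counter += 1

t1 = threading.Thread(target=increment, args=(100000,))
t2 = threading.Thread(target=increment, args=(100000,))
t1.start(); t2.start()
t1.join(); t2.join()
print(counter)  # 200000: both threads saw and updated the same global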

In simple words, a thread is a sequence of such operations within a program that can be executed independently of other code.

Threads execute concurrently, and they may also execute in parallel, depending on the system they are running on.

Python threads are implemented using OS threads in all implementations I know (CPython, PyPy and Jython). For each Python thread, there is an underlying OS thread.

At any given time, one thread runs on one processor core. It runs until it consumes its time slice (the default is 100 ms) or until it gives up control to the next thread by making a system call.

Let's try to implement our example functionality using threads:

from threading import Thread

@timeit()
def threaded(n_threads, func, *args):
    jobs = []
    for i in range(n_threads):
        thread = Thread(target=func, args=args)
        jobs.append(thread) 

    # start the threads
    for j in jobs:
        j.start() 

    # ensure all of the threads have finished
    for j in jobs:
        j.join()

if __name__ == '__main__':
    ...
    threaded(10, cpu_bound, a, b)
    threaded(10, io_bound, urls)

On my hardware, cpu_bound took 2.47 sec, io_bound — 7.9 sec.

The I/O-bound function executed about 4 times faster because we download the data in parallel in separate threads. But why does the CPU-bound operation run slower?

In the reference implementation of Python, CPython, there is the infamous GIL (Global Interpreter Lock). And so we slowly arrive at its own section...

Global Interpreter Lock (GIL)

First of all, the GIL is a lock that must be held before any access to Python internals (and this means not just executing Python code but also calling the Python C API). In essence, the GIL is a global semaphore that does not allow more than one thread to work within the interpreter at the same time.

Strictly speaking, the only call available to a thread that does not hold the GIL is the one that acquires it. Violating this rule leads to an instant crash (the best case) or a delayed crash of the program (much worse and harder to debug).

How does it work?

When a thread starts, it acquires the GIL. After a while, the scheduler decides that the current thread has done enough and gives control to the next thread. Thread #2 sees that the GIL is held, so it does not continue working; it puts itself to sleep, yielding the processor to thread #1.

But a thread cannot hold the GIL indefinitely. Prior to Python 3.2, the GIL was switched every 100 bytecode instructions. In later versions, a thread can hold the GIL for no longer than 5 ms. The GIL is also released when the thread makes a system call or works with the disk or network (I/O-bound operations).
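
The switch interval is exposed in the standard library, so we can inspect and tune it; a quick sketch:

import sys

# How long a thread may hold the GIL before the interpreter
# asks it to give other threads a chance.
print(sys.getswitchinterval())   # 0.005 (5 ms) by default

# It can be tuned, e.g. to force more frequent switches:
sys.setswitchinterval(0.001)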

In effect, the GIL in Python makes the idea of using threads for parallelism in computational problems (CPU-bound operations) useless. Threads will run sequentially even on a multiprocessor system. On CPU-bound tasks the program will not speed up, but will actually slow down, because the threads now have to share processor time and switch contexts. I/O operations, on the other hand, are not slowed down by the GIL, since a thread releases it before making a system call.

It is clear that the GIL slows down the execution of our program due to the extra work of creating and coordinating threads, acquiring and releasing the semaphore itself, and preserving context. But it should be noted that the GIL does not forbid parallel execution entirely: code that releases it, such as I/O or certain C extensions, can still run in parallel.

The GIL is not part of the language specification and does not exist in every implementation: CPython (and PyPy) have it, while Jython and IronPython, for example, do not.

So why the heck does it even exist?

The GIL protects the interpreter's internal data structures from concurrent-access problems. For example, it prevents race conditions when an object's reference count is updated. The GIL also makes it easy to integrate non-thread-safe C libraries; thanks to it, we have so many fast modules and bindings for almost everything.
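
These reference counts are visible from Python, which gives a feel for the bookkeeping the GIL protects; a tiny sketch:

import sys

x = []
y = x                        # a second reference to the same list
# getrefcount reports one extra reference: its own argument.
print(sys.getrefcount(x))    # 3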

There are exceptions: the GIL control mechanism is available to C extensions. NumPy, for example, releases it during long operations. And with the numba package, the programmer can control releasing the lock themselves.
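
With numba, releasing the lock is a one-flag change; a minimal sketch, assuming numba and NumPy are installed (heavy_sum is a made-up function for illustration):

import numpy as np
from numba import jit

# nogil=True releases the GIL while the compiled function runs,
# so several threads can execute it truly in parallel.
@jit(nopython=True, nogil=True)
def heavy_sum(arr):
    total = 0.0
    for x in arr:
        total += x
    return total

print(heavy_sum(np.arange(1000000, dtype=np.float64)))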

On this sad note, we can conclude that threads are enough for parallelizing I/O-bound tasks, while computational tasks should run in separate processes.

Processes

From the OS point of view, a process is a data structure that holds a memory area and some other resources, for example, files opened by it. Often the process has one thread, called main, but the program can create any number of threads. At the start, the thread is not allocated individual resources, instead, it uses the memory and resources of the process that spawned it. Due to this, the threads can quickly start and stop.

Multitasking is handled by the scheduler — part of the OS kernel, which in turn loads the execution threads into the central processor.

Like threads, processes always execute concurrently, and they may also execute in parallel, depending on the hardware.

from multiprocessing import Process

@timeit()
def multiprocessed(n_processes, func, *args):
    processes = []
    for i in range(n_processes):
        p = Process(target=func, args=args)
        processes.append(p)

    # start the processes
    for p in processes:
        p.start()

    # ensure all processes have finished execution
    for p in processes:
        p.join()

if __name__ == '__main__':
    ...
    multiprocessed(10, cpu_bound, a, b)
    multiprocessed(10, io_bound, urls)

On my hardware, cpu_bound took 1.12 sec, io_bound — 7.22 sec.

So, the calculation ran faster than the threaded implementation because we are no longer stuck acquiring the GIL, while the I/O-bound function took roughly the same time as the threaded version: processes are more heavyweight than threads, so their extra overhead eats up the gains here.
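
As an aside, the standard library's concurrent.futures wraps both kinds of pools behind one interface; a minimal sketch (pooled is a hypothetical helper in the spirit of threaded()/multiprocessed() above, not code from the original benchmarks):

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

@timeit()
def pooled(executor_cls, n_workers, func, *args):
    with executor_cls(max_workers=n_workers) as executor:
        futures = [executor.submit(func, *args) for _ in range(n_workers)]
        for f in futures:
            f.result()  # blocks until done; re-raises worker exceptions

if __name__ == '__main__':
    ...
    pooled(ProcessPoolExecutor, 10, cpu_bound, a, b)  # CPU-bound -> processes
    pooled(ThreadPoolExecutor, 10, io_bound, urls)    # I/O-bound -> threads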

Asynchronous world


In the asynchronous world, things change a bit. Everything runs in a central event loop, a small piece of code that lets several coroutines run at once (an important term: think of coroutines as lightweight threads, except that they are not managed by the OS and multitask cooperatively rather than preemptively). Coroutines work synchronously until they reach an expected wait, and then they pause, transfer control to the event loop, and something else can happen.

Green threads


Green threads are the most primitive level of asynchronous programming. A green thread is like a regular thread, except that switching between threads is done in application code (at the user level) rather than by the OS. At its core are non-blocking operations: switching between threads happens only on I/O, so a thread that never does I/O will hold control forever.

Gevent is a well-known Python library that combines green threads with non-blocking I/O. gevent.monkey modifies the behavior of the standard Python libraries so that their blocking I/O operations become non-blocking.


Let's see how performance changes if we start using green threads using gevent library in Python:

import gevent
import gevent.monkey

# patch any other imported module that has blocking code in it
# to make it asynchronous
gevent.monkey.patch_all()

@timeit()
def green_threaded(n_threads, func, *args):
    jobs = []
    for i in range(n_threads):
        jobs.append(gevent.spawn(func, *args))
    # ensure all jobs have finished execution
    gevent.wait(jobs)

if __name__ == '__main__':
    ...
    green_threaded(10, cpu_bound, a, b)
    green_threaded(10, io_bound, urls)

Results are: cpu_bound — 2.23 sec, io_bound — 6.85 sec.

Slower for CPU-bound function, and faster for I/O-bound function. As expected.

Asyncio

The asyncio package is described in the Python documentation as a library for writing concurrent code. However, asyncio is neither multithreading nor multiprocessing, and it is not built on top of either.

While Gevent and Twisted aim to be higher-level frameworks, asyncio aims to be a lower-level implementation of an asynchronous event loop, with the intention that higher-level frameworks like Twisted, Gevent, or Tornado will build on top of it. However, it is perfectly usable as a framework on its own.

In fact, asyncio is single-threaded and single-process: it uses cooperative multitasking. asyncio allows us to write asynchronous concurrent programs running in the same thread, using an event loop to schedule tasks and to multiplex I/O over sockets (and other resources).

asyncio provides us with an event loop along with other goodies. The event loop tracks different I/O events, switches to tasks that are ready, and pauses the ones waiting on I/O. Thus we don't waste time on tasks that are not ready to run right now.
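
You can see the scheduling side of the loop even without coroutines; a minimal sketch using plain callbacks:

import asyncio

loop = asyncio.get_event_loop()
loop.call_soon(print, "runs on the next loop iteration")
loop.call_later(0.5, print, "runs after ~0.5 s")
loop.run_until_complete(asyncio.sleep(1))  # keep the loop alive for a while
loop.close()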

How it works

Synchronous and asynchronous functions/callables are different types; you can't just mix them. If you block a coroutine synchronously, say by calling time.sleep(10) rather than await asyncio.sleep(10), you don't return control to the event loop, and the entire process will block.

You should think of your codebase as made up of pieces of either sync code or async code: anything inside an async def is async code, anything else (including the main body of a Python file or class) is synchronous code.
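
To see why this matters, here is a minimal sketch (my own illustration) contrasting a coroutine that blocks the loop with one that yields to it:

import asyncio
import time

async def bad():
    time.sleep(1)            # blocks the whole event loop for 1 s
    print("bad done")

async def good():
    await asyncio.sleep(1)   # yields to the loop; others can run meanwhile
    print("good done")

loop = asyncio.get_event_loop()
# Two good() coroutines finish in ~1 s because their sleeps overlap;
# two bad() coroutines would take ~2 s, since each blocks the loop.
loop.run_until_complete(asyncio.gather(good(), good()))
loop.close()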


The idea is very simple. There is an event loop, and we have asynchronous functions (coroutines). In Python you declare one with async def, which changes how calling it behaves: the call immediately returns a coroutine object, which basically says, "I can run the coroutine and return a result when you await me."

We give those functions to the event loop and ask it to run them for us. The event loop gives us back a Future object, which is like a promise that we will get something back in the future. We hold on to the promise, check from time to time whether it has a value (when we feel impatient), and finally, when the future has a value, we use it in other operations.
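
A minimal sketch (add is a made-up coroutine for illustration) showing that calling a coroutine runs nothing by itself, and how the loop wraps it into a Future:

import asyncio

async def add(a, b):
    return a + b

coro = add(1, 2)   # nothing runs yet; we just get a coroutine object
print(coro)        # <coroutine object add at 0x...>

loop = asyncio.get_event_loop()
future = asyncio.ensure_future(coro, loop=loop)  # schedule it as a Task (a Future)
print(loop.run_until_complete(future))           # 3
loop.close()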

When you call await, the function is suspended while whatever you asked to wait for happens; when that finishes, the event loop wakes the function up again, resumes it from the await call, and passes any result out. Example:

import asyncio

async def say(what, when):
    await asyncio.sleep(when)
    print(what)

loop = asyncio.get_event_loop()
loop.run_until_complete(say('hello world', 1))
loop.close()

In the example here, say() pauses at await asyncio.sleep() and gives control back to the event loop, which sets a marker to resume it in one second. Once it resumes, say() prints the message and completes, and run_until_complete() returns.

This is how async code can have so many things happening at once: anything that would block calls await and gets put onto the event loop's list of paused coroutines so something else can run. Everything that is paused has an associated callback that will wake it up again; some are time-based, some are I/O-based, and most of them, like the example above, are waiting for a result from another coroutine.

Let's return to our example. We have two blocking functions, cpu_bound and io_bound. As I said, we cannot mix synchronous and asynchronous operations; we must make all of them asynchronous. Naturally, asynchronous libraries do not exist for everything, so some code remains blocking and must somehow be run without blocking our event loop. For this there is the handy run_in_executor() method, which runs whatever we pass to it in one of the threads of a built-in pool, without blocking the main thread that runs the event loop. We will use this for our CPU-bound function. The I/O-bound function we will rewrite completely, so that it awaits at the moments where we are waiting for an event.

import asyncio
import aiohttp

async def async_func(N, func, *args):
    coros = [func(*args) for _ in range(N)]
    # run awaitable objects concurrently
    await asyncio.gather(*coros)


async def a_cpu_bound(a, b):
    # run the blocking function in the default thread pool,
    # without blocking the event loop (`loop` is set in __main__ below)
    result = await loop.run_in_executor(None, cpu_bound, a, b)
    return result


async def a_io_bound(urls):
    # create a coroutine function where we will download from individual url
    async def download_coroutine(session, url):
        async with session.get(url, timeout=10) as response:
            await response.text()

    # set an aiohttp session and download all our urls
    async with aiohttp.ClientSession(loop=loop) as session:
        for url in urls:
            await download_coroutine(session, url)


if __name__ == '__main__':
    ...
    loop = asyncio.get_event_loop()
    with timeit():
        loop.run_until_complete(async_func(10, a_cpu_bound, a, b))

    with timeit():
        loop.run_until_complete(async_func(10, a_io_bound, urls))

Results are: cpu_bound — 2.23 sec, io_bound — 4.37 sec.

Slower for CPU-bound function, and almost twice as fast as the threaded example for I/O-bound function.

Making the Right Choice

  • CPU-bound -> multiprocessing
  • I/O-bound, fast I/O, Limited Number of Connections -> multithreading
  • I/O-bound, slow I/O, many connections -> asyncio

Conclusion

Threads are the easier choice if you have a typical web application that does not depend on external services and serves a relatively limited number of users, for whom the response time will be predictably short.

Async is suitable if the application spends most of its time reading and writing data rather than processing it: for example, when you have a lot of slow requests (websockets, long polling) or slow external synchronous backends whose response times are unpredictable.

Synchronous programming is where the development of most applications begins, with commands executed sequentially.

Even with conditional branching, loops, and function calls, we think about the code in terms of performing one step at a time; after the current step completes, the program proceeds to the next.

An asynchronous application behaves differently. It still runs one step at a time, but the difference is that the system keeps moving forward; it does not wait for the current execution step to complete. As a result, we end up with event-driven programming.

asyncio is a great library, and it's cool that it was included in the Python standard library. An ecosystem (aiohttp, asyncpg, etc.) has already begun to grow around asyncio for application development. There are other event loop implementations (uvloop, dabeaz/curio, python-trio/trio), and I think asyncio will evolve into an even more powerful tool.
