Python pytest hangs forever


The problem

Tests are great. Our Python project has hundreds of them. And all of these the tests run successfully. Every time:

$ pytest
[..]
======================= 291 passed, 9 skipped in 20.68s ========================

However, somehow pytest doesn't move along. It just hangs and hangs, for hours on end. This typically happens on the CI server, impeding other projects' builds:

$ ssh jenkins@builder.example.com
$ ps auxww
jenkins  29472  0.2  0.9 1031840 75556 ?       Sl   12:34   0:04
/home/jenkins/.local/share/virtualenvs/src.1-g-rtNnkp/bin/python3.6m
/home/jenkins/.local/share/virtualenvs/src.1-g-rtNnkp/bin/pytest

To make it even more interesting, it doesn't do this every time. Sometimes pytest hangs the 10th time, sometimes the 43th time and yet other times the 130th time pytest runs. Very fascinating.

Getting the Linux call stack

At some point, Python calls the Linux APIs, so seeing the OS call stack can sometimes give a hunch of what's failing:

# cat /proc/29472/stack
[<0>] futex_wait_queue_me+0xc4/0x120
[<0>] futex_wait+0x10a/0x250
[<0>] do_futex+0x325/0x500
[<0>] SyS_futex+0x13b/0x180
[<0>] do_syscall_64+0x73/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<0>] 0xffffffffffffffff

Reading the documentation on futex - fast user-space locking revealed that Python was waiting for a lock to proceed. Ok, that's nice, but no way enough to figure out what I must fix in my Python app or tests to make the eternal hangs go away.

Connect gdb to the running Python process - take 1

The next step was then attach gdb to the running process, however, Python wasn't too talkative:

$ pipenv shell
$ gdb python -p <pid>
(gdb) bt
#0  0x00007f0d5b920896 in ?? ()
#1  0x0000000000503fd0 in PyImport_Cleanup () at ../Python/import.c:436
#2  0x0000000000000001 in ?? ()
#3  0x00007f0d30000d50 in ?? ()
#4  0x0000000000000000 in ?? ()

Install Python with debugging symbols

To make Python tell more about what it was doing, I had to use a version of the Python interpreter with debugging symbols:

# apt-get install python3-dbg

Set up your virtual env to use this Python interpreter

$ pipenv --python /usr/bin/python3.7-dbg

Connect gdb to the running Python process - take 2

You should now see a more details trace of what's going on:

(gdb) bt
#0  0x00007f3669a91896 in do_futex_wait.constprop () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f3669a91988 in __new_sem_wait_slow.constprop.0 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00000000005279d2 in PyThread_acquire_lock_timed (lock=lock@entry=0x7f3644002080, microseconds=<optimized out>, microseconds@entry=-1000000, intr_flag=intr_flag@entry=1) at ../Python/thread_pthread.h:372
#3  0x00000000005827fb in acquire_timed (lock=0x7f3644002080, timeout=-1000000000) at ../Modules/_threadmodule.c:61
#4  0x0000000000582992 in lock_PyThread_acquire_lock (self=self@entry=0x7f36636582a8, args=args@entry=0x7f3667d100c8, kwds=kwds@entry=0x0) at ../Modules/_threadmodule.c:144
#5  0x00000000004370cb in _PyMethodDef_RawFastCallKeywords (method=<optimized out>, self=self@entry=0x7f36636582a8, args=args@entry=0x7f364c0054a0, nargs=nargs@entry=2, kwnames=kwnames@entry=0x0) at ../Objects/call.c:690
#6  0x000000000061a8bf in _PyMethodDescr_FastCallKeywords (descrobj=descrobj@entry=0x7f366915bdf0, args=args@entry=0x7f364c005498, nargs=nargs@entry=3, kwnames=kwnames@entry=0x0) at ../Objects/descrobject.c:288
#7  0x00000000004e8787 in call_function (kwnames=0x0, oparg=3, pp_stack=<synthetic pointer>) at ../Python/ceval.c:4593
#8  _PyEval_EvalFrameDefault (f=0x7f364c0052f8, throwflag=<optimized out>) at ../Python/ceval.c:3110
#9  0x00000000004dda6a in PyEval_EvalFrameEx (f=f@entry=0x7f364c0052f8, throwflag=throwflag@entry=0) at ../Python/ceval.c:547
#10 0x00000000004de488 in _PyEval_EvalCodeWithName (_co=0x7f3668ee1940, globals=<optimized out>, locals=locals@entry=0x0, args=args@entry=0x7f365c077298, argcount=1, kwnames=0x0, kwargs=0x7f365c0772a0, kwcount=0, kwstep=1, defs=0x7f3668ef6be0, defcount=2, kwdefs=0x0, closure=0x0, name=0x7f3668ef4400, qualname=0x7f3668ef35c0) at ../Python/ceval.c:3930
#11 0x0000000000436a21 in _PyFunction_FastCallKeywords (func=func@entry=0x7f3668e25f70, stack=stack@entry=0x7f365c077298, nargs=nargs@entry=1, kwnames=kwnames@entry=0x0) at ../Objects/call.c:433
#12 0x00000000004e8828 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at ../Python/ceval.c:4616
#13 _PyEval_EvalFrameDefault (f=0x7f365c077108, throwflag=<optimized out>) at ../Python/ceval.c:3110
#14 0x00000000004dda6a in PyEval_EvalFrameEx (f=f@entry=0x7f365c077108, throwflag=throwflag@entry=0) at ../Python/ceval.c:547
#15 0x00000000004de488 in _PyEval_EvalCodeWithName (_co=0x7f3668ee1880, globals=<optimized out>, locals=locals@entry=0x0, args=args@entry=0x7f3667cd87e8, argcount=1, kwnames=0x0, kwargs=0x7f3667cd87f0, kwcount=0, kwstep=1, defs=0x7f3668ef5f88, defcount=1, kwdefs=0x0, closure=0x0, name=0x7f36691be930, qualname=0x7f3668ef6200) at ../Python/ceval.c:3930
#16 0x0000000000436a21 in _PyFunction_FastCallKeywords (func=func@entry=0x7f3668e25eb8, stack=stack@entry=0x7f3667cd87e8, nargs=nargs@entry=1, kwnames=kwnames@entry=0x0) at ../Objects/call.c:433
#17 0x00000000004e8828 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at ../Python/ceval.c:4616
#18 _PyEval_EvalFrameDefault (f=0x7f3667cd8658, throwflag=<optimized out>) at ../Python/ceval.c:3110
#19 0x00000000004dda6a in PyEval_EvalFrameEx (f=f@entry=0x7f3667cd8658, throwflag=throwflag@entry=0) at ../Python/ceval.c:547
#20 0x00000000004360b5 in function_code_fastcall (co=co@entry=0x7f3668efd040, args=args@entry=0x0, nargs=nargs@entry=0, globals=globals@entry=0x7f3668f539b8) at ../Objects/call.c:283
#21 0x00000000004367e0 in _PyFunction_FastCallDict (func=func@entry=0x7f3668e2a058, args=args@entry=0x0, nargs=nargs@entry=0, kwargs=kwargs@entry=0x0) at ../Objects/call.c:322
#22 0x0000000000437e78 in _PyObject_FastCallDict (callable=callable@entry=0x7f3668e2a058, args=args@entry=0x0, nargs=nargs@entry=0, kwargs=kwargs@entry=0x0) at ../Objects/call.c:98
#23 0x0000000000438143 in _PyObject_CallFunctionVa (callable=callable@entry=0x7f3668e2a058, format=format@entry=0x0, va=va@entry=0x7ffddeaca568, is_size_t=is_size_t@entry=0) at ../Objects/call.c:931
#24 0x00000000004383e3 in callmethod (callable=callable@entry=0x7f3668e2a058, format=format@entry=0x0, va=va@entry=0x7ffddeaca568, is_size_t=is_size_t@entry=0) at ../Objects/call.c:1027
#25 0x00000000004386eb in _PyObject_CallMethodId (obj=obj@entry=0x7f3668f57d58, name=name@entry=0x830fc0 <PyId__shutdown.17745>, format=format@entry=0x0) at ../Objects/call.c:1096
#26 0x00000000005104e8 in wait_for_thread_shutdown () at ../Python/pylifecycle.c:2246
#27 0x0000000000511a74 in Py_FinalizeEx () at ../Python/pylifecycle.c:1129
#28 0x0000000000511d9d in Py_Exit (sts=sts@entry=0) at ../Python/pylifecycle.c:2278
#29 0x000000000051a057 in handle_system_exit () at ../Python/pythonrun.c:636
#30 0x000000000051b601 in PyErr_PrintEx (set_sys_last_vars=set_sys_last_vars@entry=1) at ../Python/pythonrun.c:646
#31 0x000000000051b94e in PyErr_Print () at ../Python/pythonrun.c:542
#32 0x000000000051cac8 in PyRun_SimpleFileExFlags (fp=fp@entry=0x198eec0, filename=<optimized out>, filename@entry=0x7f3668e67750 "/home/torstein/.local/share/virtualenvs/myapp-EYf4qmw2/bin/pytest", closeit=closeit@entry=1, flags=flags@entry=0x7ffddeaca82c) at ../Python/pythonrun.c:435
#33 0x000000000051cbdf in PyRun_AnyFileExFlags (fp=fp@entry=0x198eec0, filename=0x7f3668e67750 "/home/torstein/.local/share/virtualenvs/myapp-EYf4qmw2/bin/pytest", closeit=closeit@entry=1, flags=flags@entry=0x7ffddeaca82c) at ../Python/pythonrun.c:84
#34 0x0000000000425daa in pymain_run_file (fp=0x198eec0, filename=0x1986a90 L"/home/torstein/.local/share/virtualenvs/myapp-EYf4qmw2/bin/pytest", p_cf=p_cf@entry=0x7ffddeaca82c) at ../Modules/main.c:427
#35 0x0000000000425e6a in pymain_run_filename (pymain=pymain@entry=0x7ffddeaca850, cf=cf@entry=0x7ffddeaca82c) at ../Modules/main.c:1627
#36 0x0000000000425f77 in pymain_run_python (pymain=pymain@entry=0x7ffddeaca850) at ../Modules/main.c:2877
#37 0x0000000000429571 in pymain_main (pymain=pymain@entry=0x7ffddeaca850) at ../Modules/main.c:3038
#38 0x0000000000429654 in _Py_UnixMain (argc=<optimized out>, argv=<optimized out>) at ../Modules/main.c:3073
#39 0x0000000000422e5b in main (argc=<optimized out>, argv=<optimized out>) at ../Programs/python.c:15

But that still didn't give me a clue as to what I should fix in my program, or if it is at all possible to fix it (could it be a bug in pytest or python?)

Ensure your gdb version supports Python scripting

$ gdb
(gdb) python print("Hello world!")
Python scripting is not supported in this copy of GDB.

If this is the case, you probably have gdb-minimal instead of gdb installed. This is easily fixed with:

# apt-get install gdb

Get the py extensions for gdb

gdb must be able to source a Python file called python3.7-dbg-gdb.py (changes according to the major.minor version of the python binary).

I found it the easiest to put /usr/share/gdb/auto-load/usr/bin/python3.7-dbg-gdb.py inside my pipenv next to where the python symlink is:

$ pipenv shell
$ cd $(dirname $(which python))
$ ln -s /usr/share/gdb/auto-load/usr/bin/python3.7-dbg-gdb.py

Now, when I started up gdb with:

$ pipenv shell
$ gdb -p 17709 python 

I saw:

Reading symbols from python...done.
Attaching to program: /home/torstein/.local/share/virtualenvs/myapp-EpEGAS3U/bin/python, process 17709

This also gave me a number of new gdb commands to play with, including py-bt.

Get pytest to hang

One of the challenges with this problem, was that there was no way of making it fail at will, it just failed sometimes. Every 50s run or every 150s run.

To make it fail at will, I had bash do it for me:

$ start=$(date +%s);
  for i in {1..161}; do
    echo "pytest run #${i}, elapsed $(( $(date +%s) - start )) seconds";
    pipenv run pytest;
  done

Connect gdb to the running Python process - take 3

$ pipenv shell
$ gdb python -p <pid>
(gdb) py-bt
Traceback (most recent call first):
  File "/usr/lib/python3.7/threading.py", line 1048, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
  File "/usr/lib/python3.7/threading.py", line 1032, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.7/threading.py", line 1281, in _shutdown
    t.join()

There, finally, I got somewhere: it hangs in threading.py and it's a call to join() without any timeout that makes it stay there forever.

Bonus for Emacs users

As an alternative to running gdb on the command line like below, you can run gdb through Emacs with M-x gdb. This gives you auto completion on the debugger commands among other things:

Emacs running GDB against the Python process

Analysis

The call to t.join() is called from _shutdown() in threading.py which again is called from Thread#shutdown(). So what in my code is creating a thread and calls shutdown() on that object?

$ grep -rn 'threading.Thread' --include=*.py src                      
src/skybert/concurrent/wire.py:289:        self._thread = threading.Thread(
src/skybert/concurrent/player.py:145:            self._thread = threading.Thread(
src/skybert/horse_whisperer.py:61:class Whisperer(threading.Thread):

I also searched for import threading to be on the safe side, but since Python's own package names are so short, we tend to use the full path so there were no additional cases.

Now I knew which places in my code which potentially directly or indirectly called this troublesome code in Python's threading.py.

The fix

Two of our app's threads were not daemonic, so Python would wait around forever for until they exited: As none of them were writing precious information to a database, I set them to daemonic. That way, they'd be shut down whenever the python program wanted to quit (that's what you normally want, right?):

In the words of the Python documentation:

The significance of this flag is that the entire Python program exits when only daemon threads are left.

As often is, once you know it, the remedy is simple:

-t = threading.Thread(name="My thread")
+t = threading.Thread(name="My thread", daemon=True)

With that wee change to my application, the tests have at the time of writing run 152 times successfully without any hang.

Happy testing!


Licensed under CC BY Creative Commons License ~ ✉ torstein.k.johansen @ gmail ~ 🐦 @torsteinkrause ~