> The finer grained synchronization primitives are (a) already available in Python, and (b) necessary even with the GIL for reasons I've given elsewhere in this discussion.
The finer-grained synchronization primitives I mean are not already available in Python, and they should not be visible to Python code at all. What I'm talking about is the internal implementation of, e.g., PyDict. From the Python bytecode side, setitem on a dict is already not guaranteed to be atomic, but the interpreter does guarantee that it won't segfault when two Python threads manipulate the same dict object concurrently. That guarantee is provided by the GIL today, and it is what has to be replaced.
It's the same problem you mentioned above as "problems in C extension". But no, nogil is hard not only because of compatibility issues. People (especially those who insist that their workload is inherently embarrassingly parallel) do NOT accept any regression in single-thread performance. If you only ever want to optimize for a single thread, one global lock is the optimal solution.
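To make the PyDict point concrete, here is a rough sketch of the two levels I'm distinguishing. The function names (`hammer_dict`, `locked_increment`, `run`) are made up for illustration. Several threads store into one shared dict with no lock at the Python level, and the interpreter keeps the dict's internal structure intact. A read-modify-write on the same dict is a different matter: that is application-level correctness, so it still takes an explicit `threading.Lock`.

```python
import threading


def hammer_dict(shared, worker_id, n):
    # Plain stores into one shared dict, with no Python-level lock.
    # Keeping the dict's internals consistent is the interpreter's job.
    for i in range(n):
        shared[(worker_id, i)] = i


def locked_increment(shared, lock, n):
    # A read-modify-write spans several bytecodes, so the program
    # itself has to serialize it with an explicit lock.
    for _ in range(n):
        with lock:
            shared["count"] = shared.get("count", 0) + 1


def run(workers=4, n=10_000):
    shared = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=hammer_dict, args=(shared, w, n))
        for w in range(workers)
    ]
    threads += [
        threading.Thread(target=locked_increment, args=(shared, lock, n))
        for _ in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared


if __name__ == "__main__":
    result = run()
    # One entry per (worker_id, i) pair, plus the "count" key.
    print(len(result), result["count"])
```

The `Lock` here is the kind of primitive that is already visible to Python code. Whatever keeps `hammer_dict` from corrupting the dict is not, and that is the part a GIL-free interpreter has to supply internally.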