Commit 62a45fa

gh-135898: Add section to free-threading howto about memory usage (#143279)
Co-authored-by: Kumar Aditya <kumaraditya@python.org>
1 parent 6d7a19e commit 62a45fa

1 file changed

Lines changed: 129 additions & 0 deletions

Doc/howto/free-threading-python.rst

@@ -165,3 +165,132 @@ to false. If the flag is true then the :class:`warnings.catch_warnings`
context manager uses a context variable for warning filters. If the flag is
false then :class:`~warnings.catch_warnings` modifies the global filters list,
which is not thread-safe. See the :mod:`warnings` module for more details.


Increased memory usage
----------------------

The free-threaded build typically uses more memory than the default build.
There are several reasons for this, most of them consequences of design
decisions.
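When comparing memory usage, it helps to confirm which build is actually running. The following is a minimal sketch; ``describe_build`` is a hypothetical helper name, and it assumes that ``sys._is_gil_enabled()`` may be missing on older Python versions, in which case the GIL is treated as enabled.

```python
import sys
import sysconfig


def describe_build():
    """Return (free_threaded_build, gil_enabled) for the running interpreter."""
    # Py_GIL_DISABLED is 1 for a free-threaded build; it is 0 or None otherwise.
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))

    # sys._is_gil_enabled() is not available on older versions, so look it up
    # defensively and assume the GIL is enabled when it is missing.
    is_gil_enabled = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = is_gil_enabled() if is_gil_enabled is not None else True
    return free_threaded_build, gil_enabled


free_threaded_build, gil_enabled = describe_build()
print(f"free-threaded build: {free_threaded_build}, GIL enabled: {gil_enabled}")
```

The memory effects described in this section apply to the free-threaded build, so measurements taken on a default build will not show them.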


All interned strings are immortal
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Since Python 2.3, interning a string (for example, with :func:`sys.intern`)
does not make it immortal in the default build. Instead, when the last
reference to the string disappears, it is removed from the interned string
table. The free-threaded build behaves differently: every interned string
becomes immortal and survives until interpreter shutdown.
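A rough way to observe this is sketched below. ``intern_demo`` is a hypothetical helper, and the sketch assumes that ``sys._is_immortal()`` may not exist on the running interpreter, in which case it reports ``None`` for immortality rather than guessing.

```python
import sys


def intern_demo():
    """Intern two equal strings and report identity and immortality."""
    # Build the strings at run time so they are not one shared constant.
    first = sys.intern("-".join(["free", "threading", "demo"]))
    second = sys.intern("-".join(["free", "threading", "demo"]))

    # Interning guarantees that equal strings map to a single object.
    same_object = first is second

    # sys._is_immortal() is looked up defensively; None means "unknown".
    is_immortal = getattr(sys, "_is_immortal", None)
    immortal = is_immortal(first) if is_immortal is not None else None
    return same_object, immortal


same_object, immortal = intern_demo()
print(f"same object: {same_object}, immortal: {immortal}")
```

Programs that intern many short-lived, dynamically generated strings are the ones most likely to notice the difference, because those strings are never removed from the table in the free-threaded build.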
185+
186+
187+
Non-GC objects have a larger object header
188+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
189+
190+
The free-threaded build uses a different :c:type:`PyObject` structure. Instead
191+
of having the GC related information allocated before the :c:type:`PyObject`
192+
structure, like in the default build, the GC related info is part of the normal
193+
object header. For example, on the AMD64 platform, ``None`` uses 32 bytes on
194+
the free-threaded build vs 16 bytes for the default build. GC objects (such as
195+
dicts and lists) are the same size for both builds since the free-threaded
196+
build does not use additional space for the GC info.
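The size difference can be inspected with :func:`sys.getsizeof`. This is a small sketch, with ``header_sizes`` as a hypothetical helper; the reported numbers depend on the platform, the Python version, and the build, so no expected values are shown.

```python
import sys


def header_sizes():
    """Return sys.getsizeof() for a few small non-GC and GC objects."""
    samples = {
        "None": None,          # non-GC object
        "object()": object(),  # non-GC object
        "float": 1.0,          # non-GC object
        "empty list": [],      # GC object
        "empty dict": {},      # GC object
    }
    return {label: sys.getsizeof(obj) for label, obj in samples.items()}


for label, size in header_sizes().items():
    print(f"{label:12} {size} bytes")
```

Running the same script under both builds and comparing the output side by side shows which object kinds grow and which stay the same size.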
197+
198+
199+
QSBR can delay freeing of memory
200+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
201+
202+
In order to safely implement lock-free data structures, a safe memory
203+
reclamation (SMR) scheme is used, known as quiescent state-based reclamation
204+
(QSBR). This means that the memory backing data structures allowing lock-free
205+
access will use QSBR, which defers the free operation, rather than immediately
206+
freeing the memory. Two examples of these data structures are the list object
207+
and the dictionary keys object. See ``InternalDocs/qsbr.md`` in the CPython
208+
source tree for more details on how QSBR is implemented. Running
209+
:func:`gc.collect` should cause all memory being held by QSBR to be actually
210+
freed. Note that even when QSBR frees the memory, the underlying memory
211+
allocator may not immediately return that memory to the OS and so the resident
212+
set size (RSS) of the process might not decrease.
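The effect can be explored by sampling the RSS before and after :func:`gc.collect`. The sketch below is illustrative only: ``current_rss_bytes`` and ``measure_release`` are hypothetical helpers, and the RSS reading assumes a Linux-style ``/proc/self/statm`` file (on other platforms it returns ``None``).

```python
import gc
import os


def current_rss_bytes():
    """Return the resident set size in bytes, or None if it cannot be read.

    Assumption: a Linux-style /proc/self/statm is available.
    """
    try:
        with open("/proc/self/statm") as statm:
            resident_pages = int(statm.read().split()[1])
        return resident_pages * os.sysconf("SC_PAGE_SIZE")
    except (OSError, ValueError, IndexError, AttributeError):
        return None


def measure_release(count=200_000):
    """Allocate many small dicts in a list, drop them, and sample the RSS."""
    data = [{i: str(i)} for i in range(count)]
    with_data = current_rss_bytes()

    # Drop the only reference; some backing memory may be held by QSBR.
    del data
    after_del = current_rss_bytes()

    # A collection should let memory held by QSBR actually be freed.
    gc.collect()
    after_collect = current_rss_bytes()
    return with_data, after_del, after_collect


print(measure_release())
```

As noted above, the RSS might not drop even after the collection, because the allocator can keep freed memory instead of returning it to the OS.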
213+
214+
215+
mimalloc allocator vs pymalloc
216+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
217+
218+
The default build will normally use the "pymalloc" memory allocator for small
219+
allocations (512 bytes or smaller). The free-threaded build does not use
220+
pymalloc and allocates all Python objects using the "mimalloc" allocator. The
221+
pymalloc allocator has the following properties that help keep memory usage
222+
low: small per-allocated-block overhead, effective memory fragmentation
223+
prevention, and quick return of free memory to the operating system. The
224+
mimalloc allocator does quite well in these respects as well but can have some
225+
more overhead.
226+
227+
In the free-threaded build, mimalloc manages memory in a number of separate
228+
heaps (currently four). For example, all GC supporting objects are allocated
229+
from their own heap. Using separate heaps means that free memory in one heap
230+
cannot be used for an allocation that uses another heap. Also, some heaps are
231+
configured to use QSBR (quiescent-state based reclamation) when freeing the
232+
memory that backs up the heap (known as "pages" in mimalloc terminology). The
233+
use of QSBR creates a delay between all memory blocks for a page being freed
234+
and the memory page being released, either for new allocations or back to the
235+
OS.
236+
237+
The mimalloc allocator also defers returning freed memory back to the OS. You
238+
can reduce that delay by setting the environment variable
239+
:envvar:`!MIMALLOC_PURGE_DELAY` to ``0``. Note that this will likely reduce
240+
the performance of the allocator.
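Because the variable is read by the allocator, it is safest to set it before the interpreter starts. One way to do that from Python is to launch a child interpreter with a modified environment, as in this sketch; ``run_with_purge_delay_zero`` is a hypothetical helper, and the child program here only echoes the variable to confirm that it was passed through.

```python
import os
import subprocess
import sys


def run_with_purge_delay_zero(code):
    """Run `code` in a child interpreter with MIMALLOC_PURGE_DELAY=0."""
    # Copy the current environment and add the mimalloc setting.
    env = dict(os.environ, MIMALLOC_PURGE_DELAY="0")
    return subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )


result = run_with_purge_delay_zero(
    "import os; print(os.environ['MIMALLOC_PURGE_DELAY'])"
)
print(result.stdout.strip())
```

In practice the variable would more often be exported in the shell or service configuration that starts Python; the subprocess form is convenient for comparing runs with and without the setting.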
241+
242+
243+
Free-threaded reference counting can cause objects to live longer
244+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
245+
246+
In the default build, when an object's reference count reaches zero, it is
247+
normally deallocated. The free-threaded build uses "biased reference
248+
counting", with a fast-path for objects "owned" by the current thread and a
249+
slow path for other objects. See :pep:`703` for additional details. Any time
250+
an object's reference count ends up in a "queued" state, deallocation can be
251+
deferred. The queued state is cleared from the "eval breaker" section of the
252+
bytecode evaluator.
253+
254+
The free-threaded build also allows a different mode of reference counting,
255+
known as "deferred reference counting". This mode is enabled by setting a flag
256+
on a per-object basis. Deferred reference counting is enabled for the
257+
following types:
258+
259+
* module objects
260+
* module top-level functions
261+
* class methods defined in the class scope
262+
* descriptor objects
263+
* thread-local objects, created by :class:`threading.local`
264+
265+
When deferred reference counting is enabled, references from Python function
266+
stacks are not added to the reference count. This scheme reduces the overhead
267+
of reference counting, especially for objects used from multiple threads.
268+
Because the stack references are not counted, objects with deferred reference
269+
counting are not immediately freed when their internal reference count goes to
270+
zero. Instead, they are examined by the next GC run and, if no stack
271+
references to them are found, they are freed. This means these objects are
272+
freed by the GC and not when their reference count goes to zero, as is typical.
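The delayed freeing of a top-level function can be observed with a weak reference, as in the sketch below. It is meant to run as top-level script code; ``top_level_function`` and the variable names are illustrative, and whether the function is still alive before the collection depends on the build and Python version, so no expected output is shown for that value.

```python
import gc
import weakref


def top_level_function():
    """Defined at the top level, so it may use deferred reference counting."""


# A weak reference lets us observe the function without keeping it alive.
function_ref = weakref.ref(top_level_function)

# Drop the only strong reference (the name bound by the def statement).
del top_level_function
alive_before_collect = function_ref() is not None

# A GC run frees the function if it was not already freed by the del above.
gc.collect()
alive_after_collect = function_ref() is not None

print(f"alive before gc.collect(): {alive_before_collect}")
print(f"alive after gc.collect():  {alive_after_collect}")
```

On the default build the function is expected to be gone as soon as the name is deleted; with deferred reference counting it may only disappear after the collection.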
273+
274+
275+
Per-thread reference counting can delay freeing objects
276+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
277+
278+
To avoid contention on the reference count fields of frequently shared
279+
objects, the free-threaded build also uses "per-thread reference counting"
280+
for a few selected object types. Rather than updating a single shared
281+
reference count, each thread maintains its own local reference count array,
282+
indexed by a unique id assigned to the object. The true reference count is
283+
only computed by summing the per-thread counts when the object's local
284+
count drops to zero. Per-thread reference counting is currently used for:
285+
286+
* heap type objects (classes created in Python)
287+
* code objects
288+
* the ``__dict__`` of module objects
289+
290+
Because the per-thread counts must be merged back to the object before it
291+
can be deallocated, objects using per-thread reference counting are
292+
typically freed later than they would be in the default build. In
293+
particular, such an object is usually not freed until the thread that
294+
referenced it reaches a safe point (for example, in the "eval breaker"
295+
section of the bytecode evaluator) or exits. Running :func:`gc.collect`
296+
will merge the per-thread counts and allow these objects to be freed.
