Commit 670f982

[3.15] gh-134837: Correct and improve base85 documentation for base64 and binascii modules (GH-145843) (GH-149742)
(cherry picked from commit e667d62) Co-authored-by: David Huggins-Daines <dhd@ecolingui.ca>
1 parent 564902e commit 670f982

5 files changed

Lines changed: 112 additions & 64 deletions

Doc/library/base64.rst

Lines changed: 58 additions & 31 deletions
@@ -16,8 +16,10 @@
1616
This module provides functions for encoding binary data to printable
1717
ASCII characters and decoding such encodings back to binary data.
1818
This includes the :ref:`encodings specified in <base64-rfc-4648>`
19-
:rfc:`4648` (Base64, Base32 and Base16)
20-
and the non-standard :ref:`Base85 encodings <base64-base-85>`.
19+
:rfc:`4648` (Base64, Base32 and Base16), the :ref:`Base85 encoding
20+
<base64-base-85>` specified in `PDF 2.0
21+
<https://pdfa.org/resource/iso-32000-2/>`_, and non-standard variants
22+
of Base85 used elsewhere.
2123

2224
There are two interfaces provided by this module. The modern interface
2325
supports encoding :term:`bytes-like objects <bytes-like object>` to ASCII
@@ -284,19 +286,28 @@ POST request.
284286
Base85 Encodings
285287
-----------------
286288

287-
Base85 encoding is not formally specified but rather a de facto standard,
288-
thus different systems perform the encoding differently.
289+
Base85 encoding is a family of algorithms which represent four bytes
290+
using five ASCII characters. Originally implemented in the Unix
291+
``btoa(1)`` utility, a version of it was later adopted by Adobe in the
292+
PostScript language and is standardized in PDF 2.0 (ISO 32000-2).
293+
This version, in both its ``btoa`` and PDF variants, is implemented by
294+
:func:`a85encode`.
289295

290-
The :func:`a85encode` and :func:`b85encode` functions in this module are two implementations of
291-
the de facto standard. You should call the function with the Base85
292-
implementation used by the software you intend to work with.
296+
A separate version, using a different output character set, was
297+
defined as an April Fool's joke in :rfc:`1924` but is now used by Git
298+
and other software. This version is implemented by :func:`b85encode`.
293299

294-
The two functions present in this module differ in how they handle the following:
300+
Finally, a third version, using yet another output character set
301+
designed for safe inclusion in programming language strings, is
302+
defined by ZeroMQ and implemented here by :func:`z85encode`.
295303

296-
* Whether to include enclosing ``<~`` and ``~>`` markers
297-
* Whether to include newline characters
298-
* The set of ASCII characters used for encoding
299-
* Handling of null bytes
304+
The functions present in this module differ in how they handle the following:
305+
306+
* Whether to include and expect enclosing ``<~`` and ``~>`` markers.
307+
* Whether to fold the input into multiple lines.
308+
* The set of ASCII characters used for encoding.
309+
* Compact encodings of sequences of spaces and null bytes.
310+
* The encoding of zero-padding bytes applied to the input.
300311

301312
Refer to the documentation of the individual functions for more information.
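As an illustration of the overview above, the following sketch encodes the same 8 bytes with each variant. It only assumes the public `base64` functions named in this section; `z85encode` is looked up with `getattr` because it is not available in older Python versions, and the sample data is arbitrary.

```python
import base64

data = b"\x01\x02\x03\x04\x05\x06\x07\x08"  # 8 bytes, i.e. two 4-byte groups

a85 = base64.a85encode(data)
b85 = base64.b85encode(data)

# Every 4 input bytes become 5 output characters in both variants.
print(len(a85), len(b85))

# The output character sets differ, so the encodings are not interchangeable.
print(a85 != b85)

# Each decoder recovers the original data from its own encoding.
print(base64.a85decode(a85) == data, base64.b85decode(b85) == data)

# z85encode only exists in newer Python versions, so guard the call.
z85encode = getattr(base64, "z85encode", None)
z85 = z85encode(data) if z85encode is not None else None
if z85 is not None:
    print(len(z85), base64.z85decode(z85) == data)
```

The choice between the three functions should follow the software on the other end: Ascii85 for PostScript/PDF and `btoa`, `b85encode` for Git-style data, and `z85encode` for ZeroMQ.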
302313

@@ -307,18 +318,22 @@ Refer to the documentation of the individual functions for more information.
307318

308319
*foldspaces* is an optional flag that uses the special short sequence 'y'
309320
instead of 4 consecutive spaces (ASCII 0x20) as supported by 'btoa'. This
310-
feature is not supported by the "standard" Ascii85 encoding.
321+
feature is not supported by the standard encoding used in PDF.
311322

312323
If *wrapcol* is non-zero, insert a newline (``b'\n'``) character
313324
after at most every *wrapcol* characters.
314325
If *wrapcol* is zero (default), do not insert any newlines.
315326

316-
If *pad* is true, the input is padded with ``b'\0'`` so its length is a
317-
multiple of 4 bytes before encoding.
318-
Note that the ``btoa`` implementation always pads.
327+
*pad* controls whether zero-padding applied to the end of the input
328+
is fully retained in the output encoding, as done by ``btoa``,
329+
producing an exact multiple of 5 bytes of output. This is not part
330+
of the standard encoding used in PDF, as it does not preserve the
331+
length of the data.
319332

320-
*adobe* controls whether the encoded byte sequence is framed with ``<~``
321-
and ``~>``, which is used by the Adobe implementation.
333+
*adobe* controls whether the encoded byte sequence is framed with
334+
``<~`` and ``~>``, as in a PostScript base-85 string literal. Note
335+
that while ASCII85Decode streams in PDF documents *must* be
336+
terminated with ``~>``, they *must not* use a leading ``<~``.
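A minimal sketch of how the *foldspaces*, *pad* and *adobe* options described above change the output of `a85encode`. The inputs are arbitrary examples; only keyword arguments documented in this section are used.

```python
import base64

# foldspaces=True: a group of four spaces is written as the single byte 'y'.
folded = base64.a85encode(b"    ", foldspaces=True)
unfolded = base64.a85encode(b"    ")

# pad=False (default): a 2-byte input yields a partial group of 3 characters.
short = base64.a85encode(b"ab")
# pad=True: the zero-padded group is kept whole, giving 5 characters.
padded = base64.a85encode(b"ab", pad=True)

# adobe=True: the output is framed with <~ and ~>.
framed = base64.a85encode(b"ab", adobe=True)

print(folded, len(unfolded))
print(len(short), len(padded))
print(framed)

# Decoding the padded form returns the padding bytes as part of the data,
# which is why pad=True does not preserve the original length.
print(base64.a85decode(padded))
```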
322337

323338
.. versionadded:: 3.4
324339

@@ -330,10 +345,12 @@ Refer to the documentation of the individual functions for more information.
330345

331346
*foldspaces* is a flag that specifies whether the 'y' short sequence
332347
should be accepted as shorthand for 4 consecutive spaces (ASCII 0x20).
333-
This feature is not supported by the "standard" Ascii85 encoding.
348+
This feature is not supported by the standard Ascii85 encoding used in
349+
PDF and PostScript.
334350

335-
*adobe* controls whether the input sequence is in Adobe Ascii85 format
336-
(i.e. is framed with <~ and ~>).
351+
*adobe* controls whether the ``<~`` and ``~>`` markers are
352+
present. While the leading ``<~`` is not required, the input must
353+
end with ``~>``, or a :exc:`ValueError` is raised.
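The marker handling described above can be checked with a short sketch. The sample input is arbitrary; the slices simply drop the leading or trailing marker from an encoding produced with `adobe=True`.

```python
import base64

encoded = base64.a85encode(b"hello", adobe=True)  # framed as <~...~>

# Both markers present: decodes normally.
with_both = base64.a85decode(encoded, adobe=True)

# Leading <~ dropped (as in a PDF ASCII85Decode stream): still accepted.
without_lead = base64.a85decode(encoded[2:], adobe=True)

# Trailing ~> missing: rejected with ValueError.
try:
    base64.a85decode(encoded[:-2], adobe=True)
    rejected = False
except ValueError:
    rejected = True

print(with_both, without_lead, rejected)
```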
337354

338355
*ignorechars* should be a :term:`bytes-like object` containing characters
339356
to ignore from the input.
@@ -356,8 +373,11 @@ Refer to the documentation of the individual functions for more information.
356373
Encode the :term:`bytes-like object` *b* using base85 (as used in e.g.
357374
git-style binary diffs) and return the encoded :class:`bytes`.
358375

359-
If *pad* is true, the input is padded with ``b'\0'`` so its length is a
360-
multiple of 4 bytes before encoding.
376+
The input is padded with ``b'\0'`` so its length is a multiple of 4
377+
bytes before encoding. If *pad* is true, all the resulting
378+
characters are retained in the output, which will always be a
379+
multiple of 5 bytes, and thus the length of the data may not be
380+
preserved on decoding.
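To make the length caveat concrete, here is a small sketch comparing the default and padded outputs of `b85encode` for a 3-byte input. The input value is an arbitrary example.

```python
import base64

data = b"abc"  # 3 bytes, not a multiple of 4

plain = base64.b85encode(data)             # partial final group: 4 characters
padded = base64.b85encode(data, pad=True)  # full final group: 5 characters

print(len(plain), len(padded))
print(base64.b85decode(plain))   # original length preserved
print(base64.b85decode(padded))  # one extra zero byte is returned
```

A caller that uses `pad=True` therefore needs some other way to record the original length if it matters.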
361381

362382
If *wrapcol* is non-zero, insert a newline (``b'\n'``) character
363383
after at most every *wrapcol* characters.
@@ -372,8 +392,7 @@ Refer to the documentation of the individual functions for more information.
372392
.. function:: b85decode(b, *, ignorechars=b'', canonical=False)
373393

374394
Decode the base85-encoded :term:`bytes-like object` or ASCII string *b* and
375-
return the decoded :class:`bytes`. Padding is implicitly removed, if
376-
necessary.
395+
return the decoded :class:`bytes`.
377396

378397
*ignorechars* should be a :term:`bytes-like object` containing characters
379398
to ignore from the input.
@@ -392,11 +411,12 @@ Refer to the documentation of the individual functions for more information.
392411
.. function:: z85encode(s, pad=False, *, wrapcol=0)
393412

394413
Encode the :term:`bytes-like object` *s* using Z85 (as used in ZeroMQ)
395-
and return the encoded :class:`bytes`. See `Z85 specification
396-
<https://rfc.zeromq.org/spec/32/>`_ for more information.
414+
and return the encoded :class:`bytes`.
397415

398-
If *pad* is true, the input is padded with ``b'\0'`` so its length is a
399-
multiple of 4 bytes before encoding.
416+
The input is padded with ``b'\0'`` so its length is a multiple of 4
417+
bytes before encoding. If *pad* is true, all the resulting
418+
characters are retained in the output, which will always be a
419+
multiple of 5 bytes, as required by the ZeroMQ standard.
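As a sketch of the 4-to-5 ratio that Z85 expects, the test vector from the ZeroMQ Z85 specification (8 bytes that encode to `HelloWorld`) can be tried as follows. `z85encode` is looked up with `getattr` because it is missing from older Python versions; in that case the example leaves `encoded` as `None`.

```python
import base64

# Test vector from the Z85 specification: 8 bytes -> 10 characters.
frame = bytes([0x86, 0x4F, 0xD2, 0x6F, 0xB5, 0x59, 0xF7, 0x5B])

# z85encode/z85decode only exist in newer Python versions, so guard the calls.
z85encode = getattr(base64, "z85encode", None)
z85decode = getattr(base64, "z85decode", None)

encoded = z85encode(frame) if z85encode is not None else None
if encoded is not None:
    print(encoded, len(frame), len(encoded))
    print(z85decode(encoded) == frame)
```

Because the input length is already a multiple of 4 here, no padding is involved and the result is the same with or without *pad*.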
400420

401421
If *wrapcol* is non-zero, insert a newline (``b'\n'``) character
402422
after at most every *wrapcol* characters.
@@ -414,8 +434,7 @@ Refer to the documentation of the individual functions for more information.
414434
.. function:: z85decode(s, *, ignorechars=b'', canonical=False)
415435

416436
Decode the Z85-encoded :term:`bytes-like object` or ASCII string *s* and
417-
return the decoded :class:`bytes`. See `Z85 specification
418-
<https://rfc.zeromq.org/spec/32/>`_ for more information.
437+
return the decoded :class:`bytes`.
419438

420439
*ignorechars* should be a :term:`bytes-like object` containing characters
421440
to ignore from the input.
@@ -499,3 +518,11 @@ recommended to review the security section for any code deployed to production.
499518
Section 5.2, "Base64 Content-Transfer-Encoding," provides the definition of the
500519
base64 encoding.
501520

521+
`ISO 32000-2 Portable document format - Part 2: PDF 2.0 <https://pdfa.org/resource/iso-32000-2/>`_
522+
Section 7.4.3, "ASCII85Decode Filter," provides the definition
523+
of the Ascii85 encoding used in PDF and PostScript, including
524+
the output character set and the details of data length preservation
525+
using zero-padding and partial output groups.
526+
527+
`ZeroMQ RFC 32/Z85 <https://rfc.zeromq.org/spec/32/>`_
528+
The "Formal Specification" section provides the character set used in Z85.

Doc/library/binascii.rst

Lines changed: 18 additions & 9 deletions
@@ -133,8 +133,11 @@ The :mod:`!binascii` module defines the following functions:
133133
should be accepted as shorthand for 4 consecutive spaces (ASCII 0x20).
134134
This feature is not supported by the "standard" Ascii85 encoding.
135135

136-
*adobe* controls whether the input sequence is in Adobe Ascii85 format
137-
(i.e. is framed with <~ and ~>).
136+
*adobe* controls whether the encoded byte sequence is framed with
137+
``<~`` and ``~>``, as in a PostScript base-85 string literal. If
138+
*adobe* is true, a leading ``<~`` is optionally accepted, while a
139+
trailing ``~>`` is *required*, and :exc:`binascii.Error` is raised
140+
if it is not found.
138141

139142
*ignorechars* should be a :term:`bytes-like object` containing characters
140143
to ignore from the input.
@@ -164,12 +167,16 @@ The :mod:`!binascii` module defines the following functions:
164167
after at most every *wrapcol* characters.
165168
If *wrapcol* is zero (default), do not insert any newlines.
166169

167-
If *pad* is true, the input is padded with ``b'\0'`` so its length is a
168-
multiple of 4 bytes before encoding.
169-
Note that the ``btoa`` implementation always pads.
170+
If *pad* is true, the zero-padding applied to the end of the input
171+
is fully retained in the output encoding, as done by ``btoa``,
172+
producing an exact multiple of 5 bytes of output. This is not part
173+
of the standard encoding used in PDF, as it does not preserve the
174+
length of the data.
170175

171-
*adobe* controls whether the encoded byte sequence is framed with ``<~``
172-
and ``~>``, which is used by the Adobe implementation.
176+
*adobe* controls whether the encoded byte sequence is framed with
177+
``<~`` and ``~>``, as in a PostScript base-85 string literal. Note
178+
that while ASCII85Decode streams in PDF documents *must* be
179+
terminated with ``~>``, they *must not* use a leading ``<~``.
173180

174181
.. versionadded:: 3.15
175182

@@ -213,8 +220,10 @@ The :mod:`!binascii` module defines the following functions:
213220
after at most every *wrapcol* characters.
214221
If *wrapcol* is zero (default), do not insert any newlines.
215222

216-
If *pad* is true, the input is padded with ``b'\0'`` so its length is a
217-
multiple of 4 bytes before encoding.
223+
If *pad* is true, the zero-padding applied to the end of the input
224+
is retained in the output, which will always be a multiple of 5
225+
bytes, and thus the length of the data may not be preserved on
226+
decoding.
218227

219228
.. versionadded:: 3.15
220229

Lib/base64.py

Lines changed: 24 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -315,16 +315,20 @@ def a85encode(b, *, foldspaces=False, wrapcol=0, pad=False, adobe=False):
315315
316316
foldspaces is an optional flag that uses the special short sequence 'y'
317317
instead of 4 consecutive spaces (ASCII 0x20) as supported by 'btoa'. This
318-
feature is not supported by the "standard" Adobe encoding.
318+
feature is not supported by the standard encoding used in PDF.
319319
320320
If wrapcol is non-zero, insert a newline (b'\\n') character after at most
321321
every wrapcol characters.
322322
323-
pad controls whether the input is padded to a multiple of 4 before
324-
encoding. Note that the btoa implementation always pads.
323+
pad controls whether zero-padding applied to the end of the input
324+
is fully retained in the output encoding, as done by btoa,
325+
producing an exact multiple of 5 bytes of output.
326+
327+
adobe controls whether the encoded byte sequence is framed with <~
328+
and ~>, as in a PostScript base-85 string literal. Note that
329+
while ASCII85Decode streams in PDF documents must be terminated
330+
with ~>, they must not use a leading <~.
325331
326-
adobe controls whether the encoded byte sequence is framed with <~ and ~>,
327-
which is used by the Adobe implementation.
328332
"""
329333
return binascii.b2a_ascii85(b, foldspaces=foldspaces,
330334
adobe=adobe, wrapcol=wrapcol, pad=pad)
@@ -333,12 +337,14 @@ def a85decode(b, *, foldspaces=False, adobe=False, ignorechars=b' \t\n\r\v',
333337
canonical=False):
334338
"""Decode the Ascii85 encoded bytes-like object or ASCII string b.
335339
336-
foldspaces is a flag that specifies whether the 'y' short sequence should be
337-
accepted as shorthand for 4 consecutive spaces (ASCII 0x20). This feature is
338-
not supported by the "standard" Adobe encoding.
340+
foldspaces is a flag that specifies whether the 'y' short sequence
341+
should be accepted as shorthand for 4 consecutive spaces (ASCII
342+
0x20). This feature is not supported by the standard Ascii85
343+
encoding used in PDF and PostScript.
339344
340-
adobe controls whether the input sequence is in Adobe Ascii85 format (i.e.
341-
is framed with <~ and ~>).
345+
adobe controls whether the <~ and ~> markers are present. While
346+
the leading <~ is not required, the input must end with ~>, or a
347+
ValueError is raised.
342348
343349
ignorechars should be a byte string containing characters to ignore from the
344350
input. This should only contain whitespace characters, and by default
@@ -358,8 +364,10 @@ def b85encode(b, pad=False, *, wrapcol=0):
358364
If wrapcol is non-zero, insert a newline (b'\\n') character after at most
359365
every wrapcol characters.
360366
361-
If pad is true, the input is padded with b'\\0' so its length is a multiple of
362-
4 bytes before encoding.
367+
The input is padded with b'\\0' so its length is a multiple of 4
368+
bytes before encoding. If pad is true, all the resulting
369+
characters are retained in the output, which will always be a
370+
multiple of 5 bytes.
363371
"""
364372
return binascii.b2a_base85(b, wrapcol=wrapcol, pad=pad)
365373

@@ -379,8 +387,10 @@ def z85encode(s, pad=False, *, wrapcol=0):
379387
If wrapcol is non-zero, insert a newline (b'\\n') character after at most
380388
every wrapcol characters.
381389
382-
If pad is true, the input is padded with b'\\0' so its length is a multiple of
383-
4 bytes before encoding.
390+
The input is padded with b'\\0' so its length is a multiple of 4
391+
bytes before encoding. If pad is true, all the resulting
392+
characters are retained in the output, which will always be a
393+
multiple of 5 bytes, as required by the ZeroMQ standard.
384394
"""
385395
return binascii.b2a_base85(s, wrapcol=wrapcol, pad=pad,
386396
alphabet=binascii.Z85_ALPHABET)

Modules/binascii.c

Lines changed: 7 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -1057,7 +1057,8 @@ binascii.a2b_ascii85
10571057
foldspaces: bool = False
10581058
Allow 'y' as a short form encoding four spaces.
10591059
adobe: bool = False
1060-
Expect data to be wrapped in '<~' and '~>' as in Adobe Ascii85.
1060+
Expect data to be terminated with '~>' as in Adobe Ascii85, and
1061+
optionally accept leading '<~'.
10611062
ignorechars: Py_buffer = b''
10621063
A byte string containing characters to ignore from the input.
10631064
canonical: bool = False
@@ -1069,7 +1070,7 @@ Decode Ascii85 data.
10691070
static PyObject *
10701071
binascii_a2b_ascii85_impl(PyObject *module, Py_buffer *data, int foldspaces,
10711072
int adobe, Py_buffer *ignorechars, int canonical)
1072-
/*[clinic end generated code: output=09b35f1eac531357 input=dd050604ed30199e]*/
1073+
/*[clinic end generated code: output=09b35f1eac531357 input=08eab2e53c62f1a8]*/
10731074
{
10741075
const unsigned char *ascii_data = data->buf;
10751076
Py_ssize_t ascii_len = data->len;
@@ -1264,7 +1265,7 @@ binascii.b2a_ascii85
12641265
wrapcol: size_t = 0
12651266
Split result into lines of provided width.
12661267
pad: bool = False
1267-
Pad input to a multiple of 4 before encoding.
1268+
Retain zero-padding bytes at end of output.
12681269
adobe: bool = False
12691270
Wrap result in '<~' and '~>' as in Adobe Ascii85.
12701271
@@ -1274,7 +1275,7 @@ Ascii85-encode data.
12741275
static PyObject *
12751276
binascii_b2a_ascii85_impl(PyObject *module, Py_buffer *data, int foldspaces,
12761277
size_t wrapcol, int pad, int adobe)
1277-
/*[clinic end generated code: output=5ce8fdee843073f4 input=791da754508c7d17]*/
1278+
/*[clinic end generated code: output=5ce8fdee843073f4 input=a77e31d63517bf19]*/
12781279
{
12791280
const unsigned char *bin_data = data->buf;
12801281
Py_ssize_t bin_len = data->len;
@@ -1539,7 +1540,7 @@ binascii.b2a_base85
15391540
/
15401541
*
15411542
pad: bool = False
1542-
Pad input to a multiple of 4 before encoding.
1543+
Retain zero-padding bytes at end of output.
15431544
wrapcol: size_t = 0
15441545
alphabet: Py_buffer(c_default="{NULL, NULL}") = BASE85_ALPHABET
15451546
@@ -1549,7 +1550,7 @@ Base85-code line of data.
15491550
static PyObject *
15501551
binascii_b2a_base85_impl(PyObject *module, Py_buffer *data, int pad,
15511552
size_t wrapcol, Py_buffer *alphabet)
1552-
/*[clinic end generated code: output=98b962ed52c776a4 input=1b20b0bd6572691b]*/
1553+
/*[clinic end generated code: output=98b962ed52c776a4 input=54886d05128d41a8]*/
15531554
{
15541555
const unsigned char *bin_data = data->buf;
15551556
Py_ssize_t bin_len = data->len;

Modules/clinic/binascii.c.h

Lines changed: 5 additions & 4 deletions
Some generated files are not rendered by default.
