gh-149468: Add option to validate ElementTree during serialization #149469
serhiy-storchaka wants to merge 3 commits into
This PR also fixes some bugs in serialization to HTML. I am going to extract them into a separate issue before merging this PR, but for now it is more convenient to review them together.
| @@ -0,0 +1,3 @@ | |||
| Add the *validate* option to :mod:`xml.etree.ElementTree` serialization | |||
| functions, which allows to validate the element or element tree before | |||
Suggestion on wording: "...which validates that the element or element tree names and values contain only allowed/escaped characters prior to serialization"
* The content of comments, processing instructions and the elements "xmp", "iframe", "noembed", "noframes", and "plaintext" is no longer escaped.
* The "plaintext" element no longer has a closing tag.
* Add support for empty attributes (with value None).
73e1b24 to a134c0b
Rebased it onto #149490.
| xml.etree.ElementTree | ||
| --------------------- |
You should now move this entry to Doc/whatsnew/3.16.rst.
ca19970 to c81fe70
| generate a Unicode string (otherwise, a bytestring is generated). *method* | ||
| is either ``"xml"``, ``"html"`` or ``"text"`` (default is ``"xml"``). | ||
| *xml_declaration*, *default_namespace* and *short_empty_elements* has the same | ||
| *xml_declaration*, *default_namespace*, *validate* and *short_empty_elements* has the same |
| *xml_declaration*, *default_namespace*, *validate* and *short_empty_elements* has the same | |
| *xml_declaration*, *default_namespace*, *validate* and *short_empty_elements* have the same |
| .. versionchanged:: next | ||
| Added the *validate* parameter. | ||
|
|
||
| .. versionchanged:: 3.8 |
| .. versionchanged:: next | |
| Added the *validate* parameter. | |
| .. versionchanged:: 3.8 | |
| .. versionchanged:: 3.8 |
This is duplicated below.
| If *validate* is true, check that all characters are legal XML or HTML | ||
| characters, depending on *method*, element and attribute names are | ||
| valid, and the content of comments, processing instructions and | ||
| HTML elements like ``<script>`` do not contain illegal sequences, | ||
| and raise :exc:`ValueError` otherwise. | ||
| By default, no validation is performed. |
This sentence is a bit of a mouthful, but it's a nice compromise between conciseness and clarity.
It could be improved with a couple of minor tweaks:
| If *validate* is true, check that all characters are legal XML or HTML | |
| characters, depending on *method*, element and attribute names are | |
| valid, and the content of comments, processing instructions and | |
| HTML elements like ``<script>`` do not contain illegal sequences, | |
| and raise :exc:`ValueError` otherwise. | |
| By default, no validation is performed. | |
| If *validate* is true, check that all characters are legal XML or HTML | |
| characters (depending on *method*), that element and attribute names are | |
| valid, and that the content of comments, processing instructions and | |
| HTML elements like ``<script>`` do not contain illegal sequences. | |
| Raise :exc:`ValueError` if any check fails. | |
| By default, or if *method* is ``"text"``, no validation is performed. |
or (if the method applies to all checks):
| If *validate* is true, check that all characters are legal XML or HTML | |
| characters, depending on *method*, element and attribute names are | |
| valid, and the content of comments, processing instructions and | |
| HTML elements like ``<script>`` do not contain illegal sequences, | |
| and raise :exc:`ValueError` otherwise. | |
| By default, no validation is performed. | |
| If *validate* is true, check that all characters are legal, | |
| that element and attribute names are valid, and that the content | |
| of comments, processing instructions and HTML elements | |
| like ``<script>`` do not contain illegal sequences according | |
| to the selected *method* (``"xml"`` or ``"html"``). | |
| Raise :exc:`ValueError` if any check fails. | |
| By default, or if *method* is ``"text"``, no validation is performed. |
Another option is to list the checks:
| If *validate* is true, check that all characters are legal XML or HTML | |
| characters, depending on *method*, element and attribute names are | |
| valid, and the content of comments, processing instructions and | |
| HTML elements like ``<script>`` do not contain illegal sequences, | |
| and raise :exc:`ValueError` otherwise. | |
| By default, no validation is performed. | |
| If *validate* is true and *method* is either ``"xml"`` or ``"html"``, | |
| check that: | |
| * all characters are legal XML or HTML characters | |
| * element and attribute names are valid | |
| * the content of comments, processing instructions and HTML elements | |
| like ``<script>`` do not contain illegal sequences | |
| and raise :exc:`ValueError` otherwise. | |
| By default, or if *method* is ``"text"``, no validation is performed. |
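To make the first check in these wordings concrete ("all characters are legal XML ... characters"), here is a minimal standalone sketch for the `"xml"` case. It is not the PR's implementation: `find_illegal_xml_char` and `check_xml_text` are hypothetical helpers, and the character ranges are assumed from the XML 1.0 `Char` production. The rules for `"html"` differ and are not covered.

```python
import re

# Anything outside the XML 1.0 "Char" production is treated as illegal:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# This excludes NUL, lone surrogates, and the noncharacters U+FFFE/U+FFFF.
_ILLEGAL_XML_CHAR = re.compile(
    r'[^\t\n\r\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]')


def find_illegal_xml_char(text):
    """Return (index, char) for the first illegal character, or None."""
    m = _ILLEGAL_XML_CHAR.search(text)
    return (m.start(), m.group()) if m else None


def check_xml_text(text):
    """Hypothetical helper: raise ValueError if *text* is not legal XML text."""
    if find_illegal_xml_char(text) is not None:
        raise ValueError('illegal XML character')
```

The three values used in the PR's tests (`'\x00'`, `'\ud8ff'`, `'\ufffe'`) all fall outside these ranges, so this sketch rejects each of them.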
| self.check(ET.Element(ET.QName('\ufffe', 'tag'))) | ||
|
|
||
| class HTMLValidationTest(unittest.TestCase): |
| self.check(ET.Element(ET.QName('\ufffe', 'tag'))) | |
| class HTMLValidationTest(unittest.TestCase): | |
| self.check(ET.Element(ET.QName('\ufffe', 'tag'))) | |
| class HTMLValidationTest(unittest.TestCase): |
| def test_invalid_attr_value(self): | ||
| self.check(ET.Element('tag', attrib={'key': '\x00'})) | ||
| self.check(ET.Element('tag', attrib={'key': '\ud8ff'})) | ||
| self.check(ET.Element('tag', attrib={'key': '\ufffe'})) | ||
| self.check(ET.Element('tag', attrib={'key': ET.QName('\x00')})) | ||
| self.check(ET.Element('tag', attrib={'key': ET.QName('\ud8ff')})) | ||
| self.check(ET.Element('tag', attrib={'key': ET.QName('\ufffe')})) |
This and several other methods could use subTests if you think it's an improvement, e.g.:
| def test_invalid_attr_value(self): | |
| self.check(ET.Element('tag', attrib={'key': '\x00'})) | |
| self.check(ET.Element('tag', attrib={'key': '\ud8ff'})) | |
| self.check(ET.Element('tag', attrib={'key': '\ufffe'})) | |
| self.check(ET.Element('tag', attrib={'key': ET.QName('\x00')})) | |
| self.check(ET.Element('tag', attrib={'key': ET.QName('\ud8ff')})) | |
| self.check(ET.Element('tag', attrib={'key': ET.QName('\ufffe')})) | |
| @support.subTests('value', ('\x00', '\ud8ff', '\ufffe')) | |
| def test_invalid_attr_value(self, value): | |
| self.check(ET.Element('tag', attrib={'key': value})) | |
| self.check(ET.Element('tag', attrib={'key': ET.QName(value)})) |
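For comparison, the same parametrization can be written with the public `unittest.TestCase.subTest` context manager instead of the decorator. This is only a sketch: `validate_attr_value` is a hypothetical stand-in for the PR's `self.check(...)` helper (whose body is not shown in this thread), and its character ranges are assumed from XML 1.0.

```python
import re
import unittest

# Assumed XML 1.0 legal-character ranges; anything else is rejected.
_ILLEGAL = re.compile(
    r'[^\t\n\r\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]')


def validate_attr_value(value):
    """Hypothetical stand-in for the validation performed by the PR."""
    if _ILLEGAL.search(value):
        raise ValueError('invalid attribute value')


class InvalidAttrValueTest(unittest.TestCase):
    def test_invalid_attr_value(self):
        # Each value is reported as its own sub-test, so one failure
        # does not hide the results for the remaining values.
        for value in ('\x00', '\ud8ff', '\ufffe'):
            with self.subTest(value=value):
                with self.assertRaises(ValueError):
                    validate_attr_value(value)
```

Either form keeps the list of bad values in one place; the decorator is more compact, while the context manager needs no test-support helpers.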
| self.check(elem) | ||
|
|
||
| class IterparseTest(unittest.TestCase): |
| self.check(elem) | |
| class IterparseTest(unittest.TestCase): | |
| self.check(elem) | |
| class IterparseTest(unittest.TestCase): |
| write("<!--%s-->" % text) | ||
| elif tag is ProcessingInstruction: | ||
| if validate: | ||
| m = re.search('[ \t\r\n]', text) |
| if text: | ||
| if validate: | ||
| if '\0' in text: | ||
| raise ValueError('invalid characters') |
| raise ValueError('invalid characters') | |
| raise ValueError('invalid character ("\\0")') |
Similarly the other error messages could specify what is invalid.
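One way the more specific messages could look, as a rough sketch only: `check_no_nul` and its `what` parameter are hypothetical names, not part of the PR, and the exact message format is an assumption.

```python
def check_no_nul(text, what='text'):
    """Hypothetical helper: reject NUL and say where it was found."""
    index = text.find('\0')
    if index >= 0:
        # Name the offending character, its position, and the kind of
        # content being serialized, so the caller can locate the problem.
        raise ValueError(
            f'invalid character {text[index]!r} at index {index} in {what}')
```

Called as `check_no_nul('ab\0c', 'attribute value')`, this raises a `ValueError` whose message includes the character's repr, the index 2, and the string "attribute value".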