bpo-6686: Fix Lib.xml.sax.expatreader.GetProperty to return a string object#9715
Conversation
(xml.sax.handler.property_xml_string) returns bytes
|
Hello, and thanks for your contribution! I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA). Unfortunately our records indicate you have not signed the CLA. For legal reasons we need you to sign this before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue. You can check yourself to see if the CLA has been received. Thanks again for your contribution, we look forward to reviewing it! |
There was a problem hiding this comment.
This is a great start!
In addition to my inline comments, IMO checking that the type of the .getProperty() output is not enough. We should check that the output is precisely what we expect given the input. One way to implement this is to have the handler keep all of the outputs, e.g. in a list, and then inspect the output after the parsing is done.
Finally, while I think the name SaxPropertyTest is good, it currently only tests one type of property. I feel that we should include a comment mentioning that it currently doesn't test other property types.
| self.test_harness = test_harness | ||
| self.reader = reader | ||
|
|
||
| def startElement(self, name, attr): # @UnusedVariable |
There was a problem hiding this comment.
Please remove the # @UnusedVariable comment, we don't use these in this codebase.
| if self._parser: | ||
| if hasattr(self._parser, "GetInputContext"): | ||
| return self._parser.GetInputContext() | ||
| return self._parser.GetInputContext().decode() |
There was a problem hiding this comment.
.decode() uses the UTF-8 encoding by default. Shouldn't this be using the encoding used by the parser?
There was a problem hiding this comment.
I missed this one. Will have a new commit in a little while.
There was a problem hiding this comment.
This turns out to be much more complicated than appears on the surface. Here is a description of what happens:
- The encoding is stored in a xmlreader.InputSource instance.
- This instance is not saved in any Python class instance,
- It is passed unmodified right through to the Expat parser.
- It is permitted for the encoding to not be specified by a calling application.
- If the encoding is not specified, Expat uses the encoding specified in the XML prolog statement.
- Standard Expat only supports US-ASCII, UTF-8, UTF-16 and ISO 8859 (Latin 1)
- If the XML prolog does not specify an encoding it uses UTF-8.
- The final encoding used is not passed back to Python and Expat converts the input to UTF-8 no matter what the external coding was, losing the information about the original encoding.
- This is according to the documentation for Expat and the Python ans CPython code.
This is what will happen in various situations:
- If no encoding is specified anywhere and the document is in ISO-8859 or UTF-16, Expat will usually encounter indigestion early and will tell us. It is possible that short pathological documents might slip through.
- If no encoding is specified anywhere and the document is in US-ASCII it will be accepted correctly by Expat. If the document is in UTF-8 it will obviously be accepted.
- If the encoding is contained in the prolog of the XML document the situation is the same since Expat will reject any unsupported encodings and if the document is mis-coded it will fail in Expat as described above.
- I have not yet found where the property string which is actually the raw source line gets converted back to bytes within Python or Expat or whether Python is able to get a copy of the line before it is converted to UTF-8 by Expat for its' internal processing. Do you think I should continue researching this?.
My recommendation is to leave our code as it is as we have no way to determine what the original encoding was. Expat will handle most of the error situations itself and this feature seems to be little used by Python users based on the lack of any followup in the last 9 years to this issue.
What do you think?
There was a problem hiding this comment.
If we're going to fix this, we should fix it properly.
If expat converts the data internally to UTF-8, then the existing code here should just work, and all that will be needed is to add a comment explaining this.
If expat only supports the four encodings mentioned, then IMO we don't need to address what happens with other encodings.
I suggest adding a tests with non-ASCII characters in the XML, checking that the "XML string" property is exactly as expected, i.e. expat properly decoded it from its original encoding and then encoded into UTF-8. Do this for all encodings supported by expat (though for ASCII include only ASCII characters). This will ensure that everything works as expected.
|
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
|
I have made the requested changes; please review again. |
|
Thanks for making the requested changes! @taleinat: please review the changes made to this pull request. |
| def test_xml_str_str(self): | ||
| reader = create_parser() | ||
| reader.setContentHandler(PropertyContentHandler(self, reader)) | ||
| input_\ |
There was a problem hiding this comment.
Please format as so (here and below):
input_ = \
'<...>'
There was a problem hiding this comment.
why on two lines? PEP8 the column could be between 80 and 100 (https://www.python.org/dev/peps/pep-0008/#id19)
There was a problem hiding this comment.
This is the style used in the rest of this codebase (I just did a search to make sure).
ISTM that PEP8 refers to non-assignment operators in the section that you quote. Perhaps it could use some clarification.
| input_\ | ||
| = '<?xml version="1.0" encoding="UTF-8"?>\n<test>Hi there</test>' | ||
| reader.parse(StringIO(input_)) | ||
| self.assertEqual(input_, self.result.getvalue()) |
There was a problem hiding this comment.
Does this test the result of the .getProperty(property_xml_string) calls? That's what I'd like to see tested for more than its return type.
There was a problem hiding this comment.
By using XMLGenerator in the HandleContent hierarchy, the oiginal source is reconstructed. This gives an indirect verification.
There was a problem hiding this comment.
Related to line breaks - I am following the recommendation in Pep8. It is recommended as the new approach.
There was a problem hiding this comment.
I am waiting for your reaction to my comments before I push the latest changes.
There was a problem hiding this comment.
By using XMLGenerator in the HandleContent hierarchy, the oiginal source is reconstructed. This gives an indirect verification.
Yes, but the original source is not reconstructed via the .getProperty() method that we are testing (unless I'm missing something).
|
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
| self.assertEqual(attrs.getQNameByName((ns_uri, "attr")), "ns:attr") | ||
|
|
||
|
|
||
| # =========================================================================== |
There was a problem hiding this comment.
Hi,
Could you remove these comments, they don't need to be in the source code.
Thank you
There was a problem hiding this comment.
If you look at the rest of this test module, you'll see that it is in keeping with its existing style. IMO this is fine.
| def test_xml_str_str(self): | ||
| reader = create_parser() | ||
| reader.setContentHandler(PropertyContentHandler(self, reader)) | ||
| input_\ |
There was a problem hiding this comment.
why on two lines? PEP8 the column could be between 80 and 100 (https://www.python.org/dev/peps/pep-0008/#id19)
| reader = create_parser() | ||
| reader.setContentHandler(PropertyContentHandler(self, reader)) | ||
| input_\ | ||
| = b'<?xml version="1.0" encoding="UTF-8"?>\n<test>Hi there</test>' |
There was a problem hiding this comment.
My reading of Pep8 was that external teams could decide to allow 100 colums but that for Python development itself ony 80 columns should be used. The comments you are talking about were there before I created modifications to test_sax.py and I was asked by the reviewer to only touch my additions.
I did sign the CLA yesterday but it needs to be reviewed by the PSF first before they accept it.
There was a problem hiding this comment.
In this codebase, code lines should be at most 79 characters long.
|
And there is an other point, you didn't sign the CLA. it's a requirement for the review. Thank you |
| # | ||
| # Sax parser property tests | ||
| # | ||
| # Currently only tests the condition reported in issue bpo-6686 |
There was a problem hiding this comment.
This note (the second line) shouldn't be in this comment block though, it should be in the class itself.
There was a problem hiding this comment.
Can you point me at something that discusses this idiom? I don't understand the rationale.
|
I have finally got to the bottom of this issue. The situation is as follows:
Given all this this is my assessment of what should be done:
What do you think? |
|
@jfgossage
This is great work! Please make sure to add such tests to this PR at some point.
I think that XML is inherently a string-based format rather than a bytes-based format. Therefore, returning the "XML string" property as bytes seems wrong to me, especially considering that users have no good way of retrieving the appropriate encoding with which to decode it.
Can we rectify this situation? It seems like adding such support should be relatively simple. |
|
Sorry to take so long but I needed to do some more research. Let me summarize the streaming XML support situation in Python. There are two streaming XML parsers provided by Python, the SAX parser and the Expat parser If you are going to write new software in Python using only standard Python support you are almost certainly going to use the Expat parser. If you have a free choice you will probably use Lxml which is even better and more performant. The SAX parser exists basically to support legacy applications and the documentation for the InputContext property clearly states that you get the data in the same encoding as you originally supplied it. Also, even with the Expat parser, it would be up to the user to implement the code needed to determine the encoding that was actually used. The parser defines an interface for the needed handler, but it is up to the user to provide an implementation for the handler. So if we are to implement your suggestion we would need to do the following:
My personal feeling is that this is far too much work to support what is ultimately a flawed construct. I think that if we are to proceed along this path we need a wider consensus than we two can provide and that the content of this PR should be examined by one or more senior members of the Python community. The basic position is that we are now looking at an enhancement, not simply a bug fix. |
|
@jfgossage
No need to apologize, this is great work.
I agree. How about a different path, then? What do you think about changing This too would require review from a more senior core dev, but that's fine IMO to get a good solution in place. |
|
I think we have to accept that the code is working as the developer if not the documenter envisaged it. This makes your proposal an enhancement and I have one big problem with it. Changing the return type from bytes to a string is a breaking change. Anyone who is actually using this is accustomed to receiving a bytes object and their code will probably break when they try to use the new string object. What I propose is the following:
Doing things this way ensures that no breaking changes are introduced, that changes are held to a minimum and gives the community the opportunity for normal discussion of the proposal. |
|
I was also bothered by the clunkiness of the solution I proposed. The biggest problem I had was that handlers in Expat are designed to allow users to receive data from Expat, not for internal use within Expat. Expat already supports a handler that makes this information available to the user, the Using the The SAX parser does provide a mechanism, I don't think that the Python Expat support should decode the In summary, the If you accept this solution than I feel we should treat the original issue as a documentation issue and bring the documentation in harmony with the implementation. We can then create a new issue to cover the exposure of the |
|
One more reason why |
|
@jfgossage, after taking some time to think this over and get a fresh perspective, I completely agree with your latest comments. Let's close this PR and fix the documentation. I'm not even sure an enhancement proposal is necessary: This hasn't been requested so far, so I'm not sure it is worth the effort to implement for purely theoretical use cases. |
|
I'm going to close this as it is from an unknown repository. Based on the last comment from Tal, it probably needs to be changed to a documentation change so the current PR should not be replaced as it currently is. |
https://bugs.python.org/issue6686