gh-81074: Allow non-ASCII addr_spec in email.headerregistry.Address#122477
gh-81074: Allow non-ASCII addr_spec in email.headerregistry.Address#122477medmunds wants to merge 9 commits intopython:mainfrom
Conversation
The email.headerregistry.Address constructor raised an error if addr_spec contained a non-ASCII character. (But it fully supports non-ASCII in the separate username and domain args.) This change removes the error for a non-ASCII addr_spec.
|
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
The email.headerregistry.Address constructor raised an error if addr_spec contained a non-ASCII character. (But it fully supports non-ASCII in the separate username and domain args.) This change removes the error for a non-ASCII addr_spec.
|
@bitdancer, as the email expert, do you want to review this? |
|
Yes, I'll add it to my list. It will take me quite some time to get through the backlog, unfortunately. |
|
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
When parsing email messages from Unicode strings (but not bytes), get_local_part() recorded a NonASCIILocalPartDefect for non-ASCII characters. RFC 5322 permits such addresses. This change: - removes the parse-time detection for a non-ASCII local-part (and a related test) - adds tests for passing a non-ASCII addr_spec to email.headerregistry.Address.__init__() - marks the (undocumented) email.errors.NonASCIILocalPartDefect as unused and deprecated This affected parsing email messages from Unicode strings (but not from bytes), and also prevented
|
I have made the requested changes; please review again |
|
Thanks for making the requested changes! @bitdancer: please review the changes made to this pull request. |
|
This PR is stale because it has been open for 30 days with no activity. |
|
@github-actions that doesn't seem like a good way to encourage contributions from volunteers. |
|
It did remind me that I hadn't reviewed the PR, which I guess is the point? I've been ignoring a lot of these pings on issues I'm not ready to deal with yet, though, so you also have a point ;) Now, unfortunately (or fortunately, depending on how you look at it) in the time since you made the change I have discovered, while doing work on the parser, that there actually is a place where the non-ascii local part defect is generated and is appropriate during non-utf8 processing: it happens when the local part is an encoded word that decodes to something containing non-ascii. That's not RFC compliant, of course, but that's why it's a defect ;) I'm not quite sure where to go with this. On the one hand that's clearly a defect, on the other hand if we took a message like that and refolded it for sending via SMTPUTF8, it would be valid, just as it would be valid if it came in through an Address. So I guess I'm leaning toward still getting rid of the defect, and making sure there is a defect for an encoded word in the local part. The latter is, unfortunately, not currently the case. I'm in the midst of re-writing the parser, so there are two ways we could go here: go ahead and fix the code now to add the encoded word defect, or just ignore the issue until I've rewritten the parser where I will add that fix. The problem with the latter is that currently the non-ascii defect causes the mailbox to get marked as invalid. The problem with the former is that with the current code it would require scanning the local-part after it is created to see if to contains any encoded words. That's not a huge overhead, so maybe it is worth it. On the third hand, currently an address like A third approach would be to keep the current code but replace the non-ascii defect with an encoded word defect, which would preserve the current invalid-mailbox behavior (both right and wrong) except that the defect type would change. That might be the safest option. Thoughts? Opinions? |
|
Ah. Another problem. My comment above said "now that we are doing the serialization check", but as far as I can tell we are not doing that. A non-ascii local part results in our writing an RFC invalid email with the local part rfc2047 encoded :( :( Sigh. I'm not sure what to do with that, because if we fix it, we'll probably break peoples code. Allowing Address to create such addresses will make this worse, so should we add something to generator that will deprecate this behavior? If so we should probably open a new issue and not commit this PR until after the deprecation is in place. |
I think PR #122540 was meant to address that?
I'd argue code relying on that is already broken: it's generating undeliverable email. Raising an error instead seems like an improvement. (I arrived in these issues via Django, which had interpreted Python's support for rfc2047-encoded local-parts as spec compliance. We took some steps to correct that in recent releases.) I'll need a bit to think on your other questions, but would the more detailed errors floated in #122540 (comment) help any? |
|
Ah, OK, that makes more sense: I have just lost the context :( Sorry about the delay in dealing with these issues, I'll see if I can drive them through to completion. I'll have to re-load my brain with the full context first. |
|
OK, so #122540 still looks good and I'll get it merged before the beta ;). For this one, I think we should convert the non-ascii defect into a regular InvalidHeaderDefect talking about the invalid encoded word. That will be a little more backward and forward compatible; I'll clean it up more fully in my parser rewrite. |
|
And having written that I'm having second thoughts. The whole problem is the existence of a defect, and just changing the type would not change the existence of the defect. We'd have to go with the "look for an encoded word" pattern, which would only catch some cases. So I take it back, let's just go with what you have here. |
The email.headerregistry.Address constructor raised an error if addr_spec contained a non-ASCII character. (But it fully supports non-ASCII in the separate username and domain args.) This change removes the error for a non-ASCII addr_spec.
This PR implements @bitdancer's suggested fix from #81074 (comment):
(The other bugs discussed in that comment are reported separately as #83938 and #122476.)
Fixes gh-81074