Skip to content

gh-81074: Allow non-ASCII addr_spec in email.headerregistry.Address#122477

Open
medmunds wants to merge 9 commits intopython:mainfrom
medmunds:fix-issue-81074
Open

gh-81074: Allow non-ASCII addr_spec in email.headerregistry.Address#122477
medmunds wants to merge 9 commits intopython:mainfrom
medmunds:fix-issue-81074

Conversation

@medmunds
Copy link
Copy Markdown
Contributor

@medmunds medmunds commented Jul 30, 2024

The email.headerregistry.Address constructor raised an error if addr_spec contained a non-ASCII character. (But it fully supports non-ASCII in the separate username and domain args.) This change removes the error for a non-ASCII addr_spec.

This PR implements @bitdancer's suggested fix from #81074 (comment):

… But with RFC6532 support, it should be valid to have a local part that has non-ascii in an Address, and the error, as I noted above, should be raised only at serialization time and when we don't have an original source string. So that raise should be modified to explicitly ignore the NonASCIILocalPartDefect.

(The other bugs discussed in that comment are reported separately as #83938 and #122476.)

Fixes gh-81074

The email.headerregistry.Address constructor raised an error if
addr_spec contained a non-ASCII character. (But it fully supports
non-ASCII in the separate username and domain args.) This change
removes the error for a non-ASCII addr_spec.
@medmunds medmunds requested a review from a team as a code owner July 30, 2024 19:16
@ghost
Copy link
Copy Markdown

ghost commented Jul 30, 2024

All commit authors signed the Contributor License Agreement.
CLA signed

@bedevere-app
Copy link
Copy Markdown

bedevere-app Bot commented Jul 30, 2024

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

medmunds and others added 4 commits September 9, 2024 14:06
The email.headerregistry.Address constructor raised an error if
addr_spec contained a non-ASCII character. (But it fully supports
non-ASCII in the separate username and domain args.) This change
removes the error for a non-ASCII addr_spec.
@encukou
Copy link
Copy Markdown
Member

encukou commented Mar 19, 2025

@bitdancer, as the email expert, do you want to review this?

@bitdancer
Copy link
Copy Markdown
Member

Yes, I'll add it to my list. It will take me quite some time to get through the backlog, unfortunately.

Comment thread Lib/email/headerregistry.py Outdated
@bedevere-app
Copy link
Copy Markdown

bedevere-app Bot commented May 18, 2025

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

medmunds added 3 commits May 26, 2025 14:48
When parsing email messages from Unicode strings (but not bytes),
get_local_part() recorded a NonASCIILocalPartDefect for non-ASCII
characters. RFC 5322 permits such addresses.

This change:
- removes the parse-time detection for a non-ASCII local-part
  (and a related test)
- adds tests for passing a non-ASCII addr_spec to
  email.headerregistry.Address.__init__()
- marks the (undocumented) email.errors.NonASCIILocalPartDefect
  as unused and deprecated

 This affected parsing email messages from Unicode strings
(but not from bytes), and also prevented
@medmunds
Copy link
Copy Markdown
Contributor Author

I have made the requested changes; please review again

@bedevere-app
Copy link
Copy Markdown

bedevere-app Bot commented May 26, 2025

Thanks for making the requested changes!

@bitdancer: please review the changes made to this pull request.

@bedevere-app bedevere-app Bot requested a review from bitdancer May 26, 2025 23:19
@github-actions
Copy link
Copy Markdown

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions Bot added the stale Stale PR or inactive for long period of time. label Apr 17, 2026
@medmunds
Copy link
Copy Markdown
Contributor Author

@github-actions that doesn't seem like a good way to encourage contributions from volunteers.

@bitdancer
Copy link
Copy Markdown
Member

It did remind me that I hadn't reviewed the PR, which I guess is the point? I've been ignoring a lot of these pings on issues I'm not ready to deal with yet, though, so you also have a point ;)

Now, unfortunately (or fortunately, depending on how you look at it) in the time since you made the change I have discovered, while doing work on the parser, that there actually is a place where the non-ascii local part defect is generated and is appropriate during non-utf8 processing: it happens when the local part is an encoded word that decodes to something containing non-ascii. That's not RFC compliant, of course, but that's why it's a defect ;)

I'm not quite sure where to go with this. On the one hand that's clearly a defect, on the other hand if we took a message like that and refolded it for sending via SMTPUTF8, it would be valid, just as it would be valid if it came in through an Address. So I guess I'm leaning toward still getting rid of the defect, and making sure there is a defect for an encoded word in the local part.

The latter is, unfortunately, not currently the case. I'm in the midst of re-writing the parser, so there are two ways we could go here: go ahead and fix the code now to add the encoded word defect, or just ignore the issue until I've rewritten the parser where I will add that fix. The problem with the latter is that currently the non-ascii defect causes the mailbox to get marked as invalid. The problem with the former is that with the current code it would require scanning the local-part after it is created to see if to contains any encoded words. That's not a huge overhead, so maybe it is worth it. On the third hand, currently an address like =?utf-8?q?foo?= @abc.com will not show as having any defects and will therefore be marked as a valid mailbox. Which is a bug ;) On the fourth hand, nothing is currently done by the rest of the code with mailboxes marked invalid, so it will only matter to people actually digging in to the parse tree, and there are probably very few of those. (There are other problems with the whole 'invalid mailbox' idea, but that's a whole separate set of issues I'm not ready to deal with.)

A third approach would be to keep the current code but replace the non-ascii defect with an encoded word defect, which would preserve the current invalid-mailbox behavior (both right and wrong) except that the defect type would change. That might be the safest option.

Thoughts? Opinions?

@bitdancer
Copy link
Copy Markdown
Member

Ah. Another problem. My comment above said "now that we are doing the serialization check", but as far as I can tell we are not doing that. A non-ascii local part results in our writing an RFC invalid email with the local part rfc2047 encoded :( :(

Sigh. I'm not sure what to do with that, because if we fix it, we'll probably break peoples code. Allowing Address to create such addresses will make this worse, so should we add something to generator that will deprecate this behavior? If so we should probably open a new issue and not commit this PR until after the deprecation is in place.

@medmunds
Copy link
Copy Markdown
Contributor Author

Ah. Another problem. My comment above said "now that we are doing the serialization check", but as far as I can tell we are not doing that. A non-ascii local part results in our writing an RFC invalid email with the local part rfc2047 encoded :( :(

I think PR #122540 was meant to address that?

Sigh. I'm not sure what to do with that, because if we fix it, we'll probably break peoples code. [...]

I'd argue code relying on that is already broken: it's generating undeliverable email. Raising an error instead seems like an improvement. (I arrived in these issues via Django, which had interpreted Python's support for rfc2047-encoded local-parts as spec compliance. We took some steps to correct that in recent releases.)

I'll need a bit to think on your other questions, but would the more detailed errors floated in #122540 (comment) help any?

@bitdancer
Copy link
Copy Markdown
Member

Ah, OK, that makes more sense: I have just lost the context :( Sorry about the delay in dealing with these issues, I'll see if I can drive them through to completion. I'll have to re-load my brain with the full context first.

@bitdancer
Copy link
Copy Markdown
Member

OK, so #122540 still looks good and I'll get it merged before the beta ;). For this one, I think we should convert the non-ascii defect into a regular InvalidHeaderDefect talking about the invalid encoded word. That will be a little more backward and forward compatible; I'll clean it up more fully in my parser rewrite.

@bitdancer
Copy link
Copy Markdown
Member

And having written that I'm having second thoughts. The whole problem is the existence of a defect, and just changing the type would not change the existence of the defect. We'd have to go with the "look for an encoded word" pattern, which would only catch some cases. So I take it back, let's just go with what you have here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

awaiting merge stale Stale PR or inactive for long period of time. topic-email

Projects

None yet

Development

Successfully merging this pull request may close these issues.

email.headerregistry.Address blocks Unicode local part addr_spec accepted elsewhere

5 participants