gh-104400: pygettext: use an AST parser instead of a tokenizer #104402

tomasr8 · 2023-05-11T21:17:57Z

This PR replaces the token-based message extraction with one that uses the AST parser instead.
See the issue or the forum discussion for more info.

This change fixes some issues just by virtue of using AST instead of working directly with tokens:

docstrings with leading blank lines are extracted correctly
docstrings like """Hello, {}!""".format('world') are no longer extracted
docstrings are cleaned with inspect.cleandoc() via ast.get_docstring()

This is now correctly extracted:

def test(x=_('param')):
    pass
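
To illustrate why the AST gives these fixes almost for free, here is a minimal sketch of AST-based extraction. It is not the code from this PR: the function name extract_messages and the SOURCE sample are made up for the example, and it only handles plain _('literal') calls and docstrings.

```python
import ast

# Hypothetical sample input, not taken from the PR's test data.
SOURCE = '''
def test(x=_('param')):
    pass

def greet():
    """

    Hello, docstring!
    """

def skipped():
    """Hello, {}!""".format('world')
'''


def extract_messages(source):
    """Return sorted (lineno, msgid) pairs found in the given source text."""
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        # Calls of the form _('literal'); ast.walk also visits default
        # argument values, so def test(x=_('param')) is covered.
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == '_'
                and len(node.args) == 1
                and isinstance(node.args[0], ast.Constant)
                and isinstance(node.args[0].value, str)):
            found.append((node.lineno, node.args[0].value))
        # Docstrings: ast.get_docstring() only accepts a bare string
        # literal as the first statement and cleans it with
        # inspect.cleandoc() by default.
        if isinstance(node, (ast.Module, ast.ClassDef,
                             ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc is not None:
                # Simplification: report the line of the definition
                # (modules have no lineno, so fall back to 1).
                found.append((getattr(node, 'lineno', 1), doc))
    return sorted(found)


messages = extract_messages(SOURCE)
```

In this sketch the default argument and the docstring with leading blank lines are picked up, while the """Hello, {}!""".format('world') expression is skipped because it is a call, not a string literal.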

I added a CLI argument --charset (same as in pybabel, and like --from-code in xgettext) to force a file encoding, e.g. --charset=utf-8 opens the source files with UTF-8 encoding. This is useful because we currently rely on the system default, which is error-prone. For example, on Windows, open() in my locale defaults to cp1250, which mangles UTF-8 files and vice versa (with some UnicodeDecodeErrors in between).
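
As a rough sketch of the idea (not the actual option handling in pygettext.py), the option value can simply be passed to open() as the encoding. The use of argparse and the helper names parse_args and read_source are assumptions made for this example.

```python
import argparse
import os
import tempfile


def parse_args(argv=None):
    """Parse a hypothetical command line with a --charset option."""
    parser = argparse.ArgumentParser(description='charset option sketch')
    parser.add_argument('--charset', default=None,
                        help='encoding used to read the source files')
    parser.add_argument('files', nargs='*')
    return parser.parse_args(argv)


def read_source(path, charset=None):
    """Read a source file with an explicit encoding.

    With charset=None, open() falls back to the locale-dependent
    default, which is the error-prone behaviour described above.
    """
    with open(path, encoding=charset) as fp:
        return fp.read()


# Demo: write UTF-8 bytes, then read them back with --charset=utf-8.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.py')
    with open(path, 'wb') as fp:
        fp.write("_('café')\n".encode('utf-8'))
    args = parse_args(['--charset=utf-8', path])
    text = read_source(args.files[0], args.charset)
```

Reading the same file without an explicit encoding on a cp1250 system would decode the two UTF-8 bytes of "é" as two unrelated characters, which is the kind of mangling mentioned above.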

This PR has lots more tests to make sure we don't regress on anything. The tests now compare the script output to a .po file rather than just comparing the msgids (basically snapshot tests). This ensures that we also catch issues with formatting, line locations or anything else.
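
A small sketch of why comparing the whole output catches more than comparing msgids. The helpers msgids and snapshot_diff and the two .po fragments are invented for this example and are not the test code in this PR.

```python
import difflib


def msgids(po_text):
    """Collect only the msgid lines, as a msgid-only comparison would."""
    return [line for line in po_text.splitlines()
            if line.startswith('msgid ')]


def snapshot_diff(expected, actual):
    """Return a unified diff of the full .po text; empty when they match."""
    return list(difflib.unified_diff(
        expected.splitlines(), actual.splitlines(),
        fromfile='expected.po', tofile='actual.po', lineterm=''))


# Hypothetical snapshot and output that differ only in the line location.
EXPECTED = '#: example.py:1\nmsgid "param"\nmsgstr ""\n'
ACTUAL = '#: example.py:2\nmsgid "param"\nmsgstr ""\n'

same_msgids = msgids(EXPECTED) == msgids(ACTUAL)
diff = snapshot_diff(EXPECTED, ACTUAL)
```

Here the msgid-only comparison passes, while the snapshot comparison reports the changed location comment.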

I moved the test files into a separate folder, so the diff of test_i18n.py is not that useful. Check out just the last commit to see a better diff.

@warsaw if you feel like having a look (or anyone else ;))

Issue: pygettext: use an AST parser instead of a tokenizer #104400

Tools/i18n/pygettext.py

tomasr8 · 2023-08-03T08:17:11Z

@ambv This is what I talked to you about at EuroPython. If you have time I'd be very happy if you could have a look :)

The TL;DR is pygettext has a couple of bugs which stem from it using a tokenizer-based extraction (and overall the code needs modernizing). I fix those bugs in this PR by switching to a parser. Otherwise I try to keep the functionality as close as possible. I also added lots more tests which compare the entire output and not just the messages as it was previously.

There are also lots of features missing in pygettext: handling of ngettext, pgettext and others, format flags, etc.
Once this is done, I will submit patches for those missing features as well - I didn't want to put everything in one giant PR as it's already pretty big.

Thank you!

AA-Turner · 2023-08-08T06:25:54Z

@tomasr8 would it be possible to break this PR up into several chunks / stages? You may have more luck with reviewers & progress -- I'm happy to help if wanted.

A

tomasr8 · 2023-08-08T07:08:49Z

> @tomasr8 would it be possible to break this PR up into several chunks / stages? You may have more luck with reviewers & progress -- I'm happy to help if wanted.

I can definitely give it a try! I think it'll be difficult to separate the actual change from tokens to AST, since that's kind of an all-or-nothing change but I could start with improving the tests first in a separate PR. That should be an added value regardless of whether the rest gets merged or not. I'll see if I can get a separate PR for the tests in the coming days.

Any help/review is greatly appreciated of course! :)

tomasr8 · 2023-08-20T17:13:20Z

@AA-Turner I opened a separate PR just adding extra tests, if you wanna have a look ;)

Wulian233 · 2024-10-19T07:50:48Z

A recent issue about f-string support, #113604, mentioned this PR. An AST makes this feature easier to add.

https://github.com/python/cpython/pull/108173/files already contains the required tests. Could you remove them from this PR so that the diff is smaller and easier to review?

tomasr8 · 2024-10-19T20:21:40Z

> A recent issue about f-string support, #113604, mentioned this PR. An AST makes this feature easier to add.
>
> https://github.com/python/cpython/pull/108173/files already contains the required tests. Could you remove them from this PR so that the diff is smaller and easier to review?

Don't waste your time reviewing this PR just yet; we should get the tests merged before moving on with this :) Actually, it's been a while since I opened the tests PR, so it might need updating first.

tomasr8 added 2 commits May 11, 2023 00:52

Move test_i18n into a separate folder

ce99920

Switch to AST-based message extraction

4234c0b

bedevere-bot added the awaiting review label May 11, 2023

bedevere-bot mentioned this pull request May 11, 2023

pygettext: use an AST parser instead of a tokenizer #104400

Open

tomasr8 added 2 commits May 11, 2023 23:32

Add news entry

d431951

Merge branch 'main' into better-pygettext

5721857

arhadthedev reviewed Jun 12, 2023

Tools/i18n/pygettext.py (review comment, outdated and resolved)

tomasr8 added 5 commits June 13, 2023 20:25

Fix comment

42277e8

Merge branch 'main' into better-pygettext

03f698f

Merge branch 'main' into better-pygettext

705b608

Merge branch 'main' into better-pygettext

a16274f

Merge branch 'main' into better-pygettext

f291862

tomasr8 mentioned this pull request Aug 20, 2023

gh-104400: Add more tests to pygettext #108173

Merged

erlend-aasland marked this pull request as draft October 21, 2024 09:55

bedevere-app bot removed the awaiting review label Oct 21, 2024

serhiy-storchaka self-requested a review October 28, 2024 08:56
