Automatically detect character encoding of YAML files and ignore files #630

Jayman2000 · 2024-01-03T23:53:14Z

This PR makes sure that yamllint never uses open()’s default encoding. Specifically, it uses the character encoding detection algorithm specified in chapter 5.2 of the YAML spec when reading both YAML files and files that are on the ignore-from-file list.

There are two other PRs that are similar to this one. Here’s how this PR compares to those two:

This PR doesn’t have any merge conflicts.
This PR has a cleaner commit history. You can run the tests and flake8 on each commit in this PR, and they’ll report no errors. I don’t think that you can do that with Detect encoding per yaml spec (fix #238) #240.
This PR has longer commit messages. I really tried to explain why I think that my changes make sense.
This PR detects the encoding of files being linted, config files, and files on the ignore-from-file list. Those two PRs only detect the encoding of files being linted.
Detect encoding per yaml spec (fix #238) #240 adds a dependency on chardet. This PR doesn’t add any dependencies.
This PR only supports UTF-8, UTF-16 and UTF-32. Both of those PRs support additional encodings.
Unicode yaml #581 adds support for running tests on Windows. This PR doesn’t.
The code that this PR adds to the yamllint package is simpler.
The code that this PR adds to the test package is much more complicated, but hopefully it tests things more thoroughly.

Fixes #218. Fixes #238. Fixes #347.
Closes #240. Closes #581.

coveralls · 2024-01-03T23:54:02Z

coverage: 99.835% (+0.01%) from 99.825%
when pulling aeabade on Jayman2000:auto-detect-encoding
into f0c0c75 on adrienverge:master.

Jayman2000 · 2024-02-13T13:57:56Z

I just noticed that one of the checks for this PR is failing. The coverage for yamllint/config.py went down, but that’s just because the total number of relevant lines went down. There are only two lines that aren’t covered, but those same two lines aren’t covered in the master branch. Is there anything that I need to do here?

adrienverge · 2024-02-15T08:54:04Z

“Is there anything that I need to do here?”

At the moment, no. I'm sorry, please excuse the delay. This is a big change with a lot of impact, and I need a large time slot to review it, which I haven't been able to find yet.

Before this change, build_temp_workspace() would always encode a path using UTF-8 and the strict error handler [1]. Most of the time, this is fine, but systems do not necessarily use UTF-8 and the strict error handler for paths [2].

[1]: <https://docs.python.org/3.12/library/stdtypes.html#str.encode>
[2]: <https://docs.python.org/3.12/glossary.html#term-filesystem-encoding-and-error-handler>
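A minimal sketch of the difference described above, assuming the fix is to let Python apply the filesystem encoding and error handler through os.fsencode(). The commit's actual code is not shown in this thread, and encode_path is a hypothetical helper name.

```python
import os


def encode_path(path):
    """Encode a str path the way the running system encodes paths.

    Hypothetical helper: os.fsencode() applies the filesystem encoding and
    error handler, whereas path.encode('utf_8') always uses UTF-8 and the
    strict error handler.
    """
    return os.fsencode(path)


# os.fsdecode() is the inverse operation, so a path survives a round trip.
original = os.path.join('workspace', 'a.yaml')
round_tripped = os.fsdecode(encode_path(original))
```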

Before this commit, test_run_default_format_output_in_tty() changed the values of sys.stdout and sys.stderr, but it would never change them back. This commit makes sure that they get changed back. At the moment, this commit doesn’t make a user-visible difference. A future commit will add a new test named test_ignored_from_file_with_multiple_encodings(). That new test requires stdout and stderr to be restored, or else it will fail.
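One way to guarantee that the streams are restored is to swap them inside a context manager. The sketch below uses contextlib.redirect_stdout() and contextlib.redirect_stderr() from the standard library; call_with_captured_output is a hypothetical helper, and the commit itself may restore the streams differently (for example with try/finally).

```python
import contextlib
import io
import sys


def call_with_captured_output(func):
    """Run func() with stdout and stderr captured, then restore both."""
    stdout, stderr = io.StringIO(), io.StringIO()
    # Both context managers put the previous streams back on exit, even if
    # func() raises an exception.
    with contextlib.redirect_stdout(stdout), \
            contextlib.redirect_stderr(stderr):
        func()
    return stdout.getvalue(), stderr.getvalue()


original_stdout, original_stderr = sys.stdout, sys.stderr
out, err = call_with_captured_output(lambda: print('1:1 error'))
```

After the call, sys.stdout and sys.stderr are the same objects as before, so later tests see the real streams.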

Before this change, yamllint would open YAML files using open()’s default encoding. As long as UTF-8 mode isn’t enabled, open() defaults to using the system’s locale encoding [1][2]. Most of the time, the locale encoding on Linux systems is UTF-8 [3][4], but it doesn’t have to be [5]. Additionally, the locale encoding on Windows systems is the system’s ANSI code page [6]. As a result, you would have to either enable UTF-8 mode, give Python a custom manifest or enable a beta feature in Windows settings in order to lint UTF-8 YAML files on Windows [2][7].

Finally, using open()’s default encoding is a violation of the YAML spec. Chapter 5.2 says:

“On input, a YAML processor must support the UTF-8 and UTF-16 character encodings. For JSON compatibility, the UTF-32 encodings must also be supported. If a character stream begins with a byte order mark, the character encoding will be taken to be as indicated by the byte order mark. Otherwise, the stream must begin with an ASCII character. This allows the encoding to be deduced by the pattern of null (x00) characters.” [8]

This change fixes all of those problems by implementing the YAML spec’s character encoding detection algorithm. Now, as long as a YAML file begins with either a byte order mark or an ASCII character, yamllint will automatically detect it as being UTF-8, UTF-16 or UTF-32. Other character encodings are not supported at the moment.

Fixes adrienverge#218. Fixes adrienverge#238. Fixes adrienverge#347.

[1]: <https://docs.python.org/3.12/library/functions.html#open>
[2]: <https://docs.python.org/3.12/library/os.html#utf8-mode>
[3]: <https://sourceware.org/glibc/manual/html_node/Extended-Char-Intro.html>
[4]: <https://wiki.musl-libc.org/functional-differences-from-glibc.html#Character-sets-and-locale>
[5]: <https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/SUPPORTED;h=c8b63cc2fe2b4547f2fb1bff6193da68d70bd563;hb=36f2487f13e3540be9ee0fb51876b1da72176d3f>
[6]: <https://docs.python.org/3.12/glossary.html#term-locale-encoding>
[7]: <https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page>
[8]: <https://yaml.org/spec/1.2.2/#52-character-encodings>
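The detection rule quoted from chapter 5.2 can be sketched roughly as follows. This is an illustration, not the code in yamllint/decoder.py: sketch_detect_encoding and sketch_auto_decode are hypothetical names, and the checks simply follow the order of the byte-pattern table in the spec (byte order marks and null-byte patterns, with UTF-8 as the fallback).

```python
import codecs


def sketch_detect_encoding(stream_data):
    """Guess a Python codec name from the first bytes of a YAML stream."""
    # UTF-32 must be tested before UTF-16: the UTF-32 LE byte order mark
    # starts with the same two bytes as the UTF-16 LE byte order mark.
    if stream_data.startswith(codecs.BOM_UTF32_BE):
        return 'utf_32'  # this codec consumes the BOM and picks endianness
    if stream_data.startswith(b'\x00\x00\x00') and len(stream_data) >= 4:
        return 'utf_32_be'
    if stream_data.startswith(codecs.BOM_UTF32_LE):
        return 'utf_32'
    if stream_data[1:4] == b'\x00\x00\x00':
        return 'utf_32_le'
    if stream_data.startswith(codecs.BOM_UTF16_BE):
        return 'utf_16'
    if stream_data.startswith(b'\x00') and len(stream_data) >= 2:
        return 'utf_16_be'
    if stream_data.startswith(codecs.BOM_UTF16_LE):
        return 'utf_16'
    if stream_data[1:2] == b'\x00':
        return 'utf_16_le'
    if stream_data.startswith(codecs.BOM_UTF8):
        return 'utf_8_sig'  # strips the UTF-8 byte order mark
    return 'utf_8'


def sketch_auto_decode(stream_data):
    """Decode bytes using the codec guessed from their leading bytes."""
    return stream_data.decode(sketch_detect_encoding(stream_data))


text = 'key: value\n'
decoded = {
    codec: sketch_auto_decode(text.encode(codec))
    for codec in ('utf_8', 'utf_8_sig', 'utf_16', 'utf_16_be', 'utf_16_le',
                  'utf_32', 'utf_32_be', 'utf_32_le')
}
```

Every entry in decoded equals the original text, because each encoded stream begins with either a byte order mark or the ASCII character "k".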

Before this change, yamllint would decode files on the ignore-from-file list using open()’s default encoding [1][2]. This can cause decoding to fail on some systems and succeed on other systems (see the previous commit message for details).

This change makes yamllint automatically detect the encoding for files on the ignore-from-file list. It uses the same algorithm that it uses for detecting the encoding of YAML files, so the same limitations apply: files must use UTF-8, UTF-16 or UTF-32 and they must begin with either a byte order mark or an ASCII character.

[1]: <https://docs.python.org/3.12/library/fileinput.html#fileinput.input>
[2]: <https://docs.python.org/3.12/library/fileinput.html#fileinput.FileInput>

In general, using open()’s default encoding is a mistake [1]. This change makes sure that every time open() is called, the encoding parameter is specified. Specifically, it makes it so that all tests succeed when run like this:

    python -X warn_default_encoding -W error::EncodingWarning -m unittest discover

[1]: <https://peps.python.org/pep-0597/#using-the-default-encoding-is-a-common-mistake>
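To see what those flags catch, the sketch below runs two one-line snippets in a child interpreter started with the same -X and -W options. runs_cleanly is a hypothetical helper, and the extra -S flag is an assumption added here so that the site module cannot trigger the warning itself. EncodingWarning only exists on Python 3.10 and later.

```python
import os
import subprocess
import sys
import tempfile

# Flags from the command quoted above, plus -S (skip the site module).
STRICT_FLAGS = ['-S', '-X', 'warn_default_encoding',
                '-W', 'error::EncodingWarning']


def runs_cleanly(snippet, path):
    """Return True if snippet exits with status 0 under the strict flags."""
    result = subprocess.run(
        [sys.executable, *STRICT_FLAGS, '-c', snippet, path],
        capture_output=True,
    )
    return result.returncode == 0


with tempfile.NamedTemporaryFile('w', encoding='utf_8', suffix='.yaml',
                                 delete=False) as f:
    f.write('key: value\n')

try:
    # Relies on open()'s default encoding, so the child raises
    # EncodingWarning as an error on Python 3.10 and later.
    implicit_ok = runs_cleanly('import sys; open(sys.argv[1]).close()',
                               f.name)
    # Names the encoding explicitly, so no EncodingWarning is raised.
    explicit_ok = runs_cleanly(
        "import sys; open(sys.argv[1], encoding='utf_8').close()", f.name)
finally:
    os.unlink(f.name)
```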

The previous few commits have removed all calls to open() that use its default encoding. That being said, it’s still possible that code added in the future will contain that same mistake. This commit makes it so that the CI test job will fail if that mistake is made again.

Unfortunately, it doesn’t look like coverage.py allows you to specify -X options [1] or warning filters [2] when running your tests [3]. As a result, the CI test job will also fail if coverage.py uses open()’s default encoding. Hopefully, coverage.py won’t do that. If it does, then we can always temporarily revert this commit.

[1]: <https://docs.python.org/3.12/using/cmdline.html#cmdoption-X>
[2]: <https://docs.python.org/3.12/using/cmdline.html#cmdoption-W>
[3]: <https://coverage.readthedocs.io/en/7.4.0/cmd.html#execution-coverage-run>

adrienverge

Hello Jason, please excuse the very long delay for reviewing this... This was a big piece and I needed time. I apologize.

The 6 commits are well split, well explained, and make the review much easier. Thanks a lot!

In my opinion this PR is good to go. I suspect it can solve problems in several cases (including the issues you pointed out), but I also see a small risk of breakage on exotic systems the day it's released. If this happens, will you be around to help find a solution?

A few notes:

I notice that you used encoding names with underscores (e.g. utf_8 vs. utf-8). I just read on https://docs.python.org/fr/3/library/codecs.html#standard-encodings that not only are they valid, but they also seem to be the "right" notation:

“Notice that spelling alternatives that only differ in case or use a hyphen instead of an underscore are also valid aliases; therefore, e.g. 'utf-8' is a valid alias for the 'utf_8' codec.”
I feared that using open().decode() would put the whole contents of files in memory before even linting them, and affect performance. But this is already what yamllint does currently.

adrienverge · 2024-10-08T16:20:47Z

tests/common.py

 import contextlib
 from io import StringIO
 import os
 import shutil
 import sys
 import tempfile
 import unittest
+import warnings
+from codecs import CodecInfo as CI


When possible I prefer keeping original names, to avoid confusion (especially when "CI" has other meanings). Would you be OK with using CodecInfo in the 4 places where it's used?

adrienverge · 2024-10-08T16:21:27Z

tests/common.py

+# Workspace related stuff:
+Blob = collections.namedtuple('Blob', ('text', 'encoding'))


It's smart but not very common, maybe hard to understand for beginners. We could also use something more explicit like:

class Blob:
    def __init__(self, text, encoding):
        self.text = text
        self.encoding = encoding

This is just an idea, please choose what you prefer.
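For comparison, here is a small sketch of how the two options behave. The namedtuple line is the one from the diff; ExplicitBlob is a hypothetical name used only so that both variants can live side by side in this example.

```python
import collections

# Variant from the diff: an immutable tuple with named fields.
Blob = collections.namedtuple('Blob', ('text', 'encoding'))


# Variant from the review comment, renamed here to avoid a name clash.
class ExplicitBlob:
    def __init__(self, text, encoding):
        self.text = text
        self.encoding = encoding


blob = Blob('key: value\n', 'utf_16_le')
explicit = ExplicitBlob('key: value\n', 'utf_16_le')

# Both expose the same attributes...
data = blob.text.encode(blob.encoding)
same_data = explicit.text.encode(explicit.encoding)

# ...but only the namedtuple supports unpacking and value equality for free.
text, encoding = blob
equal_by_value = blob == Blob('key: value\n', 'utf_16_le')
```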

adrienverge · 2024-10-08T16:23:00Z

tests/common.py

+        shutil.rmtree(wd)
+
+
+def ws_with_files_in_many_codecs(path_template, text):


For the sake of explicitness and readability, can you change it to:

Suggested change

-def ws_with_files_in_many_codecs(path_template, text):
+def temp_workspace_with_files_in_many_codecs(path_template, text):

or:

Suggested change

-def ws_with_files_in_many_codecs(path_template, text):
+def workspace_with_files_in_many_codecs(path_template, text):

(if you choose the second one, we should also remove temp_ from the other function temp_workspace())

adrienverge · 2024-10-08T16:23:32Z

tests/common.py

+def utf_codecs():
+    for chunk_size in ('32', '16'):
+        for endianness in ('be', 'le'):
+            for sig in ('', '_sig'):
+                yield f'utf_{chunk_size}_{endianness}{sig}'
+    yield 'utf_8_sig'
+    yield 'utf_8'


Smarter than a simple hardcoded list 😄 but maybe a list would be more readable?
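For reference, this is what the hardcoded alternative would look like next to the generator from the diff. UTF_CODECS is a hypothetical name. Note that the `_sig` variants for UTF-16 and UTF-32 are not codecs that ship with Python, so this list presumably only makes sense alongside the custom codecs that the test package registers (an assumption based on the CodecInfo import above).

```python
def utf_codecs():
    """Generator from the diff, reproduced for comparison."""
    for chunk_size in ('32', '16'):
        for endianness in ('be', 'le'):
            for sig in ('', '_sig'):
                yield f'utf_{chunk_size}_{endianness}{sig}'
    yield 'utf_8_sig'
    yield 'utf_8'


# Hardcoded equivalent, in the same order that the generator yields.
UTF_CODECS = [
    'utf_32_be',
    'utf_32_be_sig',
    'utf_32_le',
    'utf_32_le_sig',
    'utf_16_be',
    'utf_16_be_sig',
    'utf_16_le',
    'utf_16_le_sig',
    'utf_8_sig',
    'utf_8',
]
```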

adrienverge · 2024-10-08T16:25:17Z

tests/test_decoder.py

+        self.assertTrue(encoding_detectable('wn', 'utf_8_sig'))
+
+
+class DecoderTestCase(unittest.TestCase):


In addition to testing various encoded test_strings, could you add a bunch of byte arrays (e.g. b'\xc3\xa7a va ?', b'\xef\xbb\xbf\xc3\xa7a va ?', b'\xe7\x00a\x00 \x00v\x00a\x00 \x00?\x00' etc.) hardcoded inside our test files?
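A sketch of what such a test could look like, using only the three byte strings from the comment. The codec paired with each sample is an assumption inferred from the bytes themselves, HardcodedBytesTestCase is a hypothetical name, and a real test would presumably call the PR's decoder rather than bytes.decode().

```python
import unittest

# (raw bytes, codec they appear to be written in)
HARDCODED_SAMPLES = (
    (b'\xc3\xa7a va ?', 'utf_8'),
    (b'\xef\xbb\xbf\xc3\xa7a va ?', 'utf_8_sig'),
    (b'\xe7\x00a\x00 \x00v\x00a\x00 \x00?\x00', 'utf_16_le'),
)


class HardcodedBytesTestCase(unittest.TestCase):
    def test_samples_decode_to_same_text(self):
        for data, codec in HARDCODED_SAMPLES:
            with self.subTest(codec=codec):
                self.assertEqual(data.decode(codec), 'ça va ?')
```

Hardcoded bytes make the expected on-disk representation visible in the test file, instead of deriving it from str.encode() at run time.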

adrienverge · 2024-10-08T16:26:14Z

yamllint/decoder.py

+        return 'utf_8'
+
+
+def auto_decode(stream_data, errors='strict'):


The argument errors='strict' isn't used anywhere, except in tests. Can we get rid of it?

Jayman2000 force-pushed the auto-detect-encoding branch 4 times, most recently from 8cedbee to 3fa4c57 Compare January 10, 2024 12:40

Jayman2000 force-pushed the auto-detect-encoding branch from 3fa4c57 to 75b2889 Compare January 13, 2024 11:46

Jayman2000 force-pushed the auto-detect-encoding branch from 75b2889 to be0cc85 Compare January 20, 2024 12:04

Jayman2000 force-pushed the auto-detect-encoding branch 2 times, most recently from fd2c72d to bb8dc2b Compare February 8, 2024 16:01

Jayman2000 force-pushed the auto-detect-encoding branch from bb8dc2b to 13be50b Compare February 15, 2024 14:06

Jayman2000 force-pushed the auto-detect-encoding branch from 13be50b to d569de6 Compare February 25, 2024 14:41

Jayman2000 force-pushed the auto-detect-encoding branch from d569de6 to d562b6b Compare July 19, 2024 15:16

Jayman2000 added 6 commits September 20, 2024 08:05

Jayman2000 force-pushed the auto-detect-encoding branch from d562b6b to aeabade Compare September 20, 2024 12:27

adrienverge reviewed Oct 8, 2024

BaseMax mentioned this pull request Nov 3, 2024

Update cli.py: encoding='utf-8' #696

Open
