Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow path-like objects in fnmatch.fnmatchcase #123215

Closed
picnixz opened this issue Aug 22, 2024 · 17 comments
Closed

Allow path-like objects in fnmatch.fnmatchcase #123215

picnixz opened this issue Aug 22, 2024 · 17 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@picnixz
Copy link
Contributor

picnixz commented Aug 22, 2024

Feature or enhancement

Proposal:

We should call os.fspath in fnmatchcase so that all functions in fnmatch have path-like support:

 def fnmatchcase(name, pat):
     """Test whether FILENAME matches PATTERN, including case. 

     This is a version of fnmatch() which doesn't case-normalize
     its arguments.
     """
     match = _compile_pattern(pat)
-    return match(name) is not None
+    return match(os.fspath(name)) is not None

See #123122 (comment) and #123122 (comment)

Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

Links to previous discussion of this feature:

No response

Linked PRs

@picnixz picnixz added type-feature A feature request or enhancement stdlib Python modules in the Lib dir labels Aug 22, 2024
@picnixz picnixz self-assigned this Aug 22, 2024
@JelleZijlstra
Copy link
Member

We talked about this in typeshed recently too: python/typeshed#12522 (comment) and later comments. I don't feel too strongly but it might be clearer to keep fnmatch limited to strings instead.

@picnixz
Copy link
Contributor Author

picnixz commented Aug 22, 2024

Unfortunately, they are not limited to strings at runtime since they are calling os.path.normcase sometimes and sometimes not (a bit inconsistenly I would say...). I don't mind either path, namely, either path-like support, or string-support only. I can add an isinstance(name, (str, bytes)) check in filter (see #123122).

I'm marking the other as DO-NOT-MERGE until we decide whether to accept or not (what's important is that the module is consistent on different platforms).

@AlexWaygood
Copy link
Member

We talked about this in typeshed recently too: python/typeshed#12522 (comment) and later comments.

Earlier comments too ;) python/typeshed#12522 (comment)

@picnixz
Copy link
Contributor Author

picnixz commented Aug 23, 2024

By reading the discussion, I feel we should explicitly reject path-like objects in fnmatch (it appears it could even cause bugs in Black). In addition the docs say "Test whether the filename string name matches the pattern string pat" (emphasis mine), so maybe we could have an explicit check.

@barneygale Let's discuss this matter here. Should we rather explicitly reject path-like objects (even on Windows platforms where os.path.normcase is called and thus os.fspath as well implicitly) or should we allow path-like objects? If we were to allow it, there are some subtleties that could be un-desired (see the tests for #123216) where passing a path object with a non-path pattern may cause divergences on platforms.

In order to keep the current behaviour, possibly restricting it at runtime but making it in sync with typeshed, I think we should warn if a non-str/bytes filename or pattern is given (at least on platforms that implicitly call os.fspath due to os.path.normcase) and then raise in Python 3.16 (directly raise in 3.14?)

@barneygale
Copy link
Contributor

That all makes sense to me! I suppose fnmatch concerns itself with filenames, and there's no os.FileNameLike :)

@picnixz
Copy link
Contributor Author

picnixz commented Aug 24, 2024

It seems we reached a consensus. Unfortunately, I'm conflicted about eagerly checking the types in filter and fnmatchcase. Having

for name in names:
	if not isinstance(name, (str, bytes)):  # this may be costly
		raise_or_warn()
	do_the_rest()

So, I think we could also keep the status quo by not changing anything for now? I can make the docs a bit more precise and eager so that users really know not to pass path-like objects. How does it sound?

@AlexWaygood
Copy link
Member

AlexWaygood commented Aug 24, 2024

I think adding a strict type check would also require a deprecation period. (Although passing PathLike objects is bug-prone, that doesn't mean that people aren't depending on it — and the people depending on it might not be suffering from any of the possible bugs.)

So I would also vote for leaving things as they are. I don't think a deprecation period would be worth it, and the standard library generally does not do strict pre-emptive type checks along these lines.

I'd be happy with a docs clarification, but I also don't think it necessarily deserves that much space in the docs. I think the language in the docs where it talks about "file names" is already reasonably clear, in my opinion :-)

@picnixz
Copy link
Contributor Author

picnixz commented Aug 24, 2024

I'll just pass through the docs to see whether they require better clarification or not. But let's keep the status quo for now. At least, with this issue, we will have a trace of why we did not consider (yet) path-like objects.

@picnixz
Copy link
Contributor Author

picnixz commented Aug 25, 2024

Before opening a PR, when we say "filename string" do we mean a string in the sense of a str object or do we mean a string in the sense "str or bytes object"? The docs say:

Test whether the filename string name matches the pattern string pat

The overloaded signatures are:

def fnmatch(name: str, pat: str) -> bool: ...
def fnmatch(name: bytes, pat: bytes) -> bool: ...

but where pat is a latin-1 (ISO-8859-1) encodable string (this one is important). We don't allow arbitrary bytes objects since we do

@functools.lru_cache(maxsize=32768, typed=True)
def _compile_pattern(pat):
    if isinstance(pat, bytes):
        pat_str = str(pat, 'ISO-8859-1')
        res_str = translate(pat_str)
        res = bytes(res_str, 'ISO-8859-1')
    else:
        res = translate(pat)
    return re.compile(res).match

I'm actually wondering why the pattern must be encoded in latin-1 and why we can't assume it's UTF-8 encoded (AFAIK, re.match works with UTF-8 characters, right?). So I would instead assume that the pattern should be decoded using os.fsdecode, then translated, and then encoded back using os.fsencode instead of assuming latin-1 encoding.

If this is the case, I'll file (yet again) another issue.


So, here is what I suggest:

Plan 1 is the safest one but we could have people starting to wonder why we did not allow path-like objects (as we saw on typeshed). So I think plan 2 would be better (the fact that bytes patterns are accepted should be mentioned I think, this is not just an implementation detail IMO). Plan 3 would change the runtime with less undesirable effects than just accepting plain path-like objects but this would still be system-dependent since it relies on os.fsdecode.

So my personal recommendation is plan 2. What do you think?

@AlexWaygood
Copy link
Member

Plan 2 SGTM.

@picnixz
Copy link
Contributor Author

picnixz commented Aug 26, 2024

I have a PR and an issue ready for plan 2. @JelleZijlstra @barneygale I'm planning to go for plan 2; do you have any objection?

@JelleZijlstra
Copy link
Member

No strong opinion; in general I prefer the plan that makes the fewest runtime behavior changes, since such changes could break people's code.

@picnixz
Copy link
Contributor Author

picnixz commented Aug 26, 2024

Following the above discussion, I'm closing this specific issue as not planned (even if we were to go for plan 1, I would have closed it).

@picnixz picnixz closed this as not planned Won't fix, can't repro, duplicate, stale Aug 26, 2024
@barneygale
Copy link
Contributor

barneygale commented Aug 26, 2024

I vote for plan 1. Reading the discussion on GH-47437, it's not clear that the latin-1 encoding was a confident choice or a holdover from a time before sys.getfilesystemencoding() / os.fsdecode(), and so plan 2 could be documenting a fixable bug.

@picnixz
Copy link
Contributor Author

picnixz commented Aug 26, 2024

Oh, I wasn't aware of that. I thought it was something that was decided. I can also live with plan 1. Since you're the pathlib expert, it's up to you! Plan 2 was essentially here to avoid surprises as on typeshed, but maybe with this issue, people would understand it? (technically, "filename string" should be sufficient to indicate a "str" so I'm also happy with plan 1).

@barneygale
Copy link
Contributor

My opinion isn't strong, so please do whatever you think is best @picnixz. I'm no fnmatch expert!

@picnixz
Copy link
Contributor Author

picnixz commented Aug 26, 2024

If you want to review #123345, I'd be happy. The idea is to avoid users having surprises and/or not knowing why something works on Windows but not on POSIX.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

4 participants