Certain globs ending with "non-word" characters fail to match #18

pe8ter · 2018-11-18T06:35:06Z

Please describe the minimum necessary steps to reproduce this issue:

Run this Node.js script:

const nanomatch = require('nanomatch');
const reg = nanomatch.makeRe('é/**/*');
console.log(reg.test('é/foo.txt'));

What is happening (but shouldn't):

Output is false because the RegExp test fails.

What should be happening instead?

Output is true because the RegExp test succeeds.

What's happening

Here is the RegExp produced by nanomatch:

/^(?:(?:\.[\\\/](?=.))?é[\\\/]?\b(?!\.)(?:(?!(?:[\\\/]|^)\.).)*?[\\\/](?!\.)(?=.)[^\\\/]*?(?:[\\\/]|$))$/
                               **

The word boundary matcher (starred) is the culprit. This matcher requires that the end of the first part of the glob é is a word boundary. There are two problems with the matcher:

1. According to ECMA-262, the set of characters that count as word characters (\w) is quite small: ASCII letters, digits, and underscore. é is not one of them, so the position between é and the path separator that follows it is not a word boundary. One solution is to add the Unicode flag u to the end of the RegExp. This is only a partial solution because...
2. Directory names can end in other non-word characters, such as #. If you replace the é in this example with #, the test fails even with the Unicode flag.
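
To make the first point concrete, here is a minimal sketch that uses only the built-in RegExp (no nanomatch). The helper names isWordChar and boundaryBeforeSlash are made up for illustration. It also checks the u flag on its own: by my reading of ECMA-262, that flag only widens \w when combined with i, so it is worth confirming in your own runtime before relying on it.

```javascript
const eAcute = '\u00e9'; // é, written as an escape to avoid encoding surprises

// True when the single character `ch` matches \w under the given flags.
function isWordChar(ch, flags = '') {
  return new RegExp('^\\w$', flags).test(ch);
}

// True when a word boundary sits directly before the first "/" in `path`.
function boundaryBeforeSlash(path, flags = '') {
  return new RegExp('^[^/]*\\b/', flags).test(path);
}

console.log(isWordChar('a'));         // true
console.log(isWordChar(eAcute));      // false
console.log(isWordChar('#'));         // false
console.log(isWordChar(eAcute, 'u')); // false: `u` alone does not extend \w

console.log(boundaryBeforeSlash('a/foo.txt'));         // true
console.log(boundaryBeforeSlash(eAcute + '/foo.txt')); // false
console.log(boundaryBeforeSlash('#/foo.txt'));         // false
```

A boundary only exists where exactly one of the two neighbouring characters is a word character, which is why a directory name ending in é or # followed by a separator never satisfies \b.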

Another odd behavior with this RegExp is that the first test here fails but the second test passes:

reg.test('é/foo.txt'); // false
reg.test('é/a/foo.txt'); // true
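
The difference between these two results comes from where \b ends up being evaluated. The optional separator [\\\/]? can consume the first "/", which puts \b between "/" and the next character (a real boundary), but then the pattern still requires a second separator later on. 'é/a/foo.txt' has one and 'é/foo.txt' does not. If the optional separator matches nothing instead, \b sits between é and "/", two non-word characters, and fails. The sketch below replays this with the RegExp quoted above, plus two variants derived from it purely as an experiment (one with \b removed, one with é swapped for e). Neither variant is a proposed fix.

```javascript
// The RegExp produced by nanomatch for 'é/**/*', copied from this issue.
const original = /^(?:(?:\.[\\\/](?=.))?é[\\\/]?\b(?!\.)(?:(?!(?:[\\\/]|^)\.).)*?[\\\/](?!\.)(?=.)[^\\\/]*?(?:[\\\/]|$))$/;

// Experimental variant: same source with the first (and only) \b removed.
const withoutBoundary = new RegExp(original.source.replace('\\b', ''));

// Control variant: same source with é replaced by an ASCII word character.
const asciiControl = new RegExp(original.source.replace('é', 'e'));

console.log(original.test('é/foo.txt'));          // false
console.log(original.test('é/a/foo.txt'));        // true
console.log(withoutBoundary.test('é/foo.txt'));   // true
console.log(withoutBoundary.test('é/a/foo.txt')); // true
console.log(asciiControl.test('e/foo.txt'));      // true
```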

The Unicode flag would be a good addition to un-break certain consumers of this library (see gulpjs/gulp#2153), but given the odd behavior above and problem (2), some further consideration seems necessary.


jonschlinkert · 2018-11-18T06:36:24Z

Thank you for looking into this and figuring out the issue! I've been working on getting these matching libs updated for the past few days, so I'll get this fixed.

thanks!

pe8ter · 2018-11-19T19:29:42Z

Are you thinking about adding a "Unicode" option to makeRe, or will all of the generated RegExps have the Unicode u flag by default? Or are you doing something entirely different?

I wanted to get a sense of your thoughts because, depending on the solution, all the packages in the dependency chain from Gulp down to nanomatch may need to be updated.

jonschlinkert · 2018-11-19T23:08:37Z

Honestly I’m not sure yet. I’m open to suggestions


pe8ter · 2018-11-20T00:20:05Z

Could you give me a high-level explanation of why there are so many globbing packages: anymatch, micromatch, nanomatch, picomatch, etc.? What's different about them?

jonschlinkert · 2018-11-20T01:04:49Z

No, I don't have time to do that. But the projects each have descriptions that answer your question, and there are readme documents that took a long time to write and were created for that purpose.

pe8ter · 2018-11-20T03:08:09Z

Fair enough.

Silic0nS0ldier · 2019-01-20T02:30:14Z

Using a Unicode-aware regex polyfill, modified to include the exceptions noted, seems like a feasible solution for this.

Given that the fix will greatly expand the allowed characters, locking this behind a configuration switch or semver major release would be a good idea (particularly given the widespread usage of nanomatch). Semver major would prevent accumulation of technical debt, however dependent libraries might be better off with the switch option (lots of semver major version bumps otherwise).
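
As a rough illustration of that idea, and not a patch for nanomatch, a Unicode-aware stand-in for \b can be built from lookarounds and Unicode property escapes. Everything here is an assumption: the helper name unicodeBoundary, the choice of \p{L}, \p{M}, \p{N} and underscore as "word" characters, and the extra parameter for opting in additional characters such as #. It needs a runtime with lookbehind and \p{...} support and the u flag.

```javascript
// Hypothetical helper: returns the source of a Unicode-aware boundary
// assertion. `extra` lists additional characters to treat as word characters.
function unicodeBoundary(extra = '') {
  // Escape the characters that are special inside a character class.
  const escaped = extra.replace(/[\\\]^-]/g, '\\$&');
  const word = `[\\p{L}\\p{M}\\p{N}_${escaped}]`;
  // A boundary is a position with a word character on exactly one side.
  return `(?:(?<!${word})(?=${word})|(?<=${word})(?!${word}))`;
}

// True when `name`, followed by "/", ends on a boundary under this definition.
function endsOnBoundary(name, extra = '') {
  return new RegExp(`${unicodeBoundary(extra)}/`, 'u').test(name + '/');
}

console.log(endsOnBoundary('docs'));    // true
console.log(endsOnBoundary('é'));       // true
console.log(endsOnBoundary('日本語'));   // true
console.log(endsOnBoundary('#'));       // false
console.log(endsOnBoundary('#', '#'));  // true once '#' is opted in
```

Passing the extra characters explicitly mirrors the suggestion of a configuration switch: the default stays narrow, and callers widen it deliberately.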

jonschlinkert added the bug label Nov 18, 2018

This was referenced Nov 18, 2018

gulp.watch not works if watched glob contains Japanese characters folder name gulpjs/gulp#2153

Closed

watch not works if watched glob contains Japanese characters folder name paulmillr/chokidar#716

Closed
