-
-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make {eol_}comments_re read-only and non-init arguments in `ParserCon… #353
Conversation
…fig` (#352) * [buffering] drop forced multiline match for string patterns Previously, when scanning for matches to a regex, if the type of the pattern was `str`, the pattern was always compiled with `re.MULTILINE`. Recent changes to `ParserConfig` [0] changed the type used for regex matches in generated code from `str` to `re.Pattern` which could lead to a difference in behavior from previous versions where a defined comments or eol_comments may have been implicitly relying on the `re.MULTILINE` flag. After discussion [1], it has been determined that usage of `re` flags within TatSu should be deprecated in favor of users specifying the necessary flags within patterns. As such, drop the `re.MULTILINE` flag for strings compiled on the fly. [0]: #338 [1]: #351 (comment) * [grammar] make eol_comments multiline match Make the default eol_comments regex use multiline matching. Recent changes to `ParserConfig` [0] now use a precompiled regex (an `re.Pattern`) instead of compiling the `str` regex on the fly. The `Tokenizer` previously assumed `str` type regexes should all be `re.MULTILINE` regardless of options defined in the regex itself when compiling the pattern. This behavior has since changed to no longer automatically apply and thus requires configurations to specify the option in the pattern. [0]: #338 * [infos] make {eol_}comments_re read-only attributes Previously, the `eol_comments_re` and `comments_re` attributes were public init arguments, were modifiable, and could thus become out of sync with the `eol_comments` and `comments` attributes. Also, with recent changes to `ParserConfig` [0], there were two ways to initialize the regex values for comments and eol_comments directives; either via the constructor using the *_re variables or by using the sister string arguments and relying on `__post_init__` to compile the values which trumped the explicit *_re argument values. Now, the constructor interface has been simplified to not take either `eol_comments_re` or `comments_re` as arguments. Callers may only use `eol_comments` and `comments`. The `eol_comments_re` and `comments_re` attributes are still public, but are read-only so they are always a reflection of their sister string values passed into the constructor. [0]: #200 * [codegen] migrate to {eol_}comments * [ngcodegen] migrate to {eol_}comments * [bootstrap] migrate to {eol_}comments * [lint] resolve errors * [docs] note {eol_}comments directive behavior changes * [docs] update syntax to reflect {eol_}comments arguments * [test] fix test_parse_hash to use eol_comments * [test] explicitly use multiline match in test_patterns_with_newlines
I guess I should add myself to the contributors list? Or is that something you do? Otherwise the only fixups I'd make are the reference markers in the description and commit message but those aren't super important |
@vfazio, Please see the review comments. It's not possible to merge this PR as is. |
Oh, sorry. I don't see any comments, but you put this in draft so I imagine you're still adding them. Thanks for your help. |
@vfazio, I managed to upgrade one of my parsers to this branch, and it wasn't much work. But we can't just go breaking who knows how many parsers by removing What could be done is :
(the same for If you can make the above changes, I'd be glad to merge. If not, I'll take care of the changes myself soon. |
@apalala I'll think on this tomorrow. I have a feeling it's going to be complex to deal with What we can maybe do is perform a translation in Naive impl def _sanitize_deprecated_options(self, **settings: Any) -> dict[str, Any]:
for option in (("comments_re", "comments"), ("eol_comments_re", "eol_comments")):
if not option[0] in settings:
continue
deprecated_option = settings[option[0]]
del settings[option[0]]
if deprecated_option is not None:
warnings.warn(f"{option[0]} is deprecated in favor of {option[1]}", DeprecationWarning, 4)
if option[1] in settings:
raise ValueError(f"Cannot specify {option[0]} and {option[1]} simultaneously")
if isinstance(deprecated_option, re.Pattern):
settings[option[1]] = deprecated_option.pattern
else:
settings[option[1]] = deprecated_option
return settings
def replace(self, **settings: Any) -> ParserConfig:
overrides = self._find_common(**self._sanitize_deprecated_options(**settings))
overrides = self._filter_non_init_fields(**overrides)
result = dataclasses.replace(self, **overrides)
Changing bootstrap:
Otherwise, does this also mean reverting the code generation behavior? Or leave that using the new |
Continue to allow the old *_re arguments but issue a DeprecationWarning when they are encountered and translate them into the new arguments.
@apalala when you get a chance, can you look this over and retest that things work as you expect? |
6150d8e shouldn't be necessary. The arguments should hopefully be accepted as is if the unit test is correct. The arguments will always be set if you make these public because setting My last two commits should fix this and have tests that pass showing it. I'm not sure they need to be actual attributes on |
I forgot to commit my latest changes yesterday. For some reason I haven't understood yet, Backwards compatibility is important. It has been there since the origins of TatSu in Grako, over 10 years ago. Some very weird code still present in TatSu is is to secure that. The way to deprecate is to:
Note that We're almost done with this pull request. |
I think we should maybe have a discussion about that before we keep stepping on each other's implementations. When a >>> import tatsu.infos
>>> z = tatsu.infos.ParserConfig(comments="\\d")
>>> z.comments
'\\d'
>>> z.comments_re
re.compile('\\d') The When we're performing a >>> y = dataclasses.replace(z, **{"comments": "\\w"})
>>> hex(id(y))
'0x766bddfbcb90'
>>> hex(id(z))
'0x766bde9fc650'
>>> z == y
False
>>> y.comments_re
re.compile('\\w')
>>> z.comments_re
re.compile('\\d') When it does so, it takes all fields marked as We have the added problem that
My implementation prior to the latest commits should maintain backwards compatibility. I'm not sure if you tested it and found that it did not? I was hoping the following commit would have just worked since the unittest I wrote seemed to imply it both raised deprecation warnings when appropriate and raised errors when appropriate while all other unit tests continued to pass. The goal of the commit and the design up to this point is to do the following:
Since these attributes on the If nothing outside of Tatsu should be touching or manipulating attributes on By dropping these attributes completely, we don't have to worry about sanitizing |
Dropping them all together and updating (venv) vfazio@vfazio4 ~/development/TatSu $ git diff
diff --git a/tatsu/buffering.py b/tatsu/buffering.py
index a64f448..30688bd 100644
--- a/tatsu/buffering.py
+++ b/tatsu/buffering.py
@@ -48,6 +48,8 @@ class Buffer(Tokenizer):
self.text = self.original_text = text
self.whitespace_re = self.build_whitespace_re(config.whitespace)
+ self.comments_re = None if not config.comments else re.compile(config.comments)
+ self.eol_comments_re = None if not config.eol_comments else re.compile(config.eol_comments)
self.nameguard = (
config.nameguard
if config.nameguard is not None
@@ -268,11 +270,11 @@ class Buffer(Tokenizer):
return self._eat_regex(self.whitespace_re)
def eat_comments(self):
- comments = self._eat_regex_list(self.config.comments_re)
+ comments = self._eat_regex_list(self.comments_re)
self._index_comments(comments, lambda x: x.inline)
def eat_eol_comments(self):
- comments = self._eat_regex_list(self.config.eol_comments_re)
+ comments = self._eat_regex_list(self.eol_comments_re)
self._index_comments(comments, lambda x: x.eol)
def next_token(self):
diff --git a/tatsu/infos.py b/tatsu/infos.py
index b822fc5..6fbc9d6 100644
--- a/tatsu/infos.py
+++ b/tatsu/infos.py
@@ -31,9 +31,6 @@ class ParserConfig:
start_rule: str | None = None # FIXME
rule_name: str | None = None # Backward compatibility
- comments_re: re.Pattern | str | None = None
- eol_comments_re: re.Pattern | str | None = None
-
tokenizercls: type[Tokenizer] | None = None # FIXME
semantics: type | None = None
@@ -65,27 +62,6 @@ class ParserConfig:
if self.ignorecase:
self.keywords = [k.upper() for k in self.keywords]
- # FIXME: this check should work, but somehow both attributes seem to be always set
- # if self.comments and self.comments_re:
- # raise AttributeError(
- # f'Both `comments` and `comments_re` defined: {self.comments!r} {self.comments_re!r}')
- if self.comments:
- self.comments_re = re.compile(self.comments)
- elif self.comments_re:
- if not isinstance(self.comments_re, re.Pattern):
- self.comments_re = re.compile(self.comments_re)
- self.comments = self.comments_re.pattern
-
- # FIXME: this check should work, but somehow both attributes seem to be always set
- # if self.eol_comments and self.eol_comments_re:
- # raise AttributeError('Both `eol_comments` and `eol_comments_re` defined')
- if self.eol_comments:
- self.eol_comments_re = re.compile(self.eol_comments)
- elif self.eol_comments_re:
- if not isinstance(self.eol_comments_re, re.Pattern):
- self.eol_comments_re = re.compile(self.eol_comments_re)
- self.eol_comments = self.eol_comments_re.pattern
-
@classmethod
def new(
cls,
@@ -119,18 +95,6 @@ class ParserConfig:
else:
return self.replace(**vars(other))
- # non-init fields cannot be used as arguments in `replace`, however
- # they are values returned by `vars` and `dataclass.asdict` so they
- # must be filtered out.
- # If the `ParserConfig` dataclass drops these fields, then this filter can be removed
- def _filter_non_init_fields(self, settings: MutableMapping[str, Any]) -> MutableMapping[str, Any]:
- for field in [
- field.name for field in dataclasses.fields(self) if not field.init
- ]:
- if field in settings:
- del settings[field]
- return settings
-
def _sanitize_deprecated_options(self, settings: MutableMapping[str, Any]) -> None:
for deprecated_option, new_option in (
("comments_re", "comments"),
@@ -145,8 +109,8 @@ class ParserConfig:
DeprecationWarning,
4,
)
- # if new_option in settings:
- # raise ValueError(f"Cannot specify {deprecated_option} and {new_option} simultaneously")
+ if new_option in settings:
+ raise ValueError(f"Cannot specify {deprecated_option} and {new_option} simultaneously")
if isinstance(deprecated_value, re.Pattern):
settings[new_option] = deprecated_value.pattern
elif isinstance(deprecated_value, str):
@@ -156,7 +120,7 @@ class ParserConfig:
def replace(self, **settings: Any) -> ParserConfig:
self._sanitize_deprecated_options(settings)
- overrides = self._filter_non_init_fields(self._find_common(**settings))
+ overrides = self._find_common(**settings)
result = dataclasses.replace(self, **overrides)
if 'grammar' in overrides:
result.name = result.grammar
diff --git a/test/grammar/syntax_test.py b/test/grammar/syntax_test.py
index e0badd2..1cde590 100644
--- a/test/grammar/syntax_test.py
+++ b/test/grammar/syntax_test.py
@@ -426,7 +426,6 @@ def test_deprecated_comments_override(comment, option):
tool.parse(grammar, text, **option)
-@pytest.mark.skip
@pytest.mark.parametrize(
"comment,option",
[
|
I don't understand... |
We can drop the attributes completely from These are the only two places that use the #338 was the PR that decided these needed to be compiled. Up until this PR, the values of There's no reason why We don't document return types on If these have to be available off the class, which I'm not sure they do but I'll defer to you, then we can mark those as deprecated properties: @property
def comments_re(self) -> str | None:
warnings.warn(f"{self.__class__.__name__}.comments_re is deprecated", DeprecationWarning, 1)
return self.comments
@property
def eol_comments_re(self) -> str | None:
warnings.warn(f"{self.__class__.__name__}.eol_comments_re is deprecated", DeprecationWarning, 1)
return self.eol_comments This still avoids them being direct attributes on the |
I thought about it, and I agree that we should just drop We could keep them on Some parsers will break, but the fix is extremely easy. I'll make the changes now for your review. |
I'm on holiday until next week. I had planned on summarizing this issue so we could discuss a path forward. The challenge here is I'm an outsider and am not familiar with how this library is used. I want to make sure we're both talking about the same thing. From what I understand, you want to continue to allow things like I think this fine and 100% doable. It does not require having the attributes on We can continue to support those settings arguments because the interface is not strictly typed. That's what my commit did, it just translates those arguments to the new attributes. My commit could be simplified further by dropping the legacy attributes (the diff I posted) since, at least from my reading of the code, they aren't necessary anymore. The unit tests showed this at least and it passed the test I added which checks for conversion of the legacy arguments. I can either submit my suggested fixups on Monday, or you can proceed as you see fit and I will review them, though I'm not sure how useful my input will be. |
@vfazio Take a look at the latest commits when you have time. What I've done is leave We can retake this issue with a new pull request, in time. |
…fig` (#352)
Previously, when scanning for matches to a regex, if the type of the pattern was
str
, the pattern was always compiled withre.MULTILINE
.Recent changes to
ParserConfig
0 changed the type used for regex matches in generated code fromstr
tore.Pattern
which could lead to a difference in behavior from previous versions where a defined comments or eol_comments may have been implicitly relying on there.MULTILINE
flag.After discussion 1, it has been determined that usage of
re
flags within TatSu should be deprecated in favor of users specifying the necessary flags within patterns.As such, drop the
re.MULTILINE
flag for strings compiled on the fly.Make the default eol_comments regex use multiline matching.
Recent changes to
ParserConfig
0 now use a precompiled regex (anre.Pattern
) instead of compiling thestr
regex on the fly.The
Tokenizer
previously assumedstr
type regexes should all bere.MULTILINE
regardless of options defined in the regex itself when compiling the pattern. This behavior has since changed to no longer automatically apply and thus requires configurations to specify the option in the pattern.Previously, the
eol_comments_re
andcomments_re
attributes were public init arguments, were modifiable, and could thus become out of sync with theeol_comments
andcomments
attributes.Also, with recent changes to
ParserConfig
0, there were two ways to initialize the regex values for comments and eol_comments directives; either via the constructor using the *_re variables or by using the sister string arguments and relying on__post_init__
to compile the values which trumped the explicit *_re argument values.Now, the constructor interface has been simplified to not take either
eol_comments_re
orcomments_re
as arguments. Callers may only useeol_comments
andcomments
.The
eol_comments_re
andcomments_re
attributes are still public, but are read-only so they are always a reflection of their sister string values passed into the constructor.[codegen] migrate to {eol_}comments
[ngcodegen] migrate to {eol_}comments
[bootstrap] migrate to {eol_}comments
[lint] resolve errors
[docs] note {eol_}comments directive behavior changes
[docs] update syntax to reflect {eol_}comments arguments
[test] fix test_parse_hash to use eol_comments
[test] explicitly use multiline match in test_patterns_with_newlines