Fix indentation tokens lengths #1708

aabounegm · 2024-10-02T11:23:43Z

Removed the duplicate emission of indent tokens as whitespace tokens, and ensured that dedent tokens are 0-width (empty string).
Previously, a dedent token could contain the matched whitespace of the following indentation (when that indentation matched an earlier level and therefore didn't introduce a new indent token).

For example, in this snippet:

if True:
    if False:
        print('impossible')
    else:
        print('makes sense')

the dedent token after the first print statement contained the whitespace before the else keyword.
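
To make the intended token shapes concrete, here is a minimal sketch of a toy line-based scanner. It is not Langium's `IndentationAwareTokenBuilder`; the `IndentToken` interface and `scanIndentation` function are hypothetical names used only for illustration. It assumes spaces-only indentation and does not emit the remaining dedents at end of input.

```typescript
// Illustrative only: a toy scanner, not Langium's implementation.
interface IndentToken {
    type: 'INDENT' | 'DEDENT';
    image: string;       // text covered by the token
    startOffset: number; // position of the token in the input
}

function scanIndentation(text: string): IndentToken[] {
    const tokens: IndentToken[] = [];
    const stack = [0]; // indentation levels that are currently open
    let offset = 0;    // offset of the start of the current line
    for (const line of text.split('\n')) {
        const ws = /^[ \t]*/.exec(line)![0];
        // Blank and whitespace-only lines do not change the level in this toy.
        if (ws.length < line.length) {
            if (ws.length > stack[stack.length - 1]) {
                stack.push(ws.length);
                // The INDENT token covers the leading whitespace.
                tokens.push({ type: 'INDENT', image: ws, startOffset: offset });
            } else {
                while (ws.length < stack[stack.length - 1]) {
                    stack.pop();
                    // DEDENT is zero-width: its image is empty, so it does not
                    // swallow the whitespace in front of `else`.
                    tokens.push({ type: 'DEDENT', image: '', startOffset: offset });
                }
            }
        }
        offset += line.length + 1; // +1 for the '\n' removed by split
    }
    return tokens;
}

const snippet = [
    "if True:",
    "    if False:",
    "        print('impossible')",
    "    else:",
    "        print('makes sense')",
].join('\n');

const indentTokens = scanIndentation(snippet);
```

In this sketch the single DEDENT before `else` has an empty image, and the four spaces that follow it stay outside the token.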

Additionally, fixed a minor issue (mostly an inconvenience for folding) where the dedent tokens were placed after all newline tokens. For example, in the snippet:

if True:
    print('yes')


else:
    print('no')

the dedent was placed on the 5th line, right before the else keyword. This PR moves it such that it is on the 3rd line.
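
A rough sketch of the offset arithmetic, for illustration. Only the regular expression is taken from this PR's diff; the helpers `dedentOffset` and `lineOf` are hypothetical, and the subtraction is an assumption about how the count is applied.

```typescript
// Hypothetical helper: given the offset where the dedented line starts,
// return the offset at which a zero-width DEDENT would be placed.
function dedentOffset(text: string, offset: number): number {
    // Number of consecutive line-break characters directly before `offset`
    // (note: a "\r\n" pair counts as two characters here).
    const newlinesBeforeDedent = text.substring(0, offset).match(/[\r\n]+$/)?.[0].length ?? 1;
    // Assumption: step back so the DEDENT sits right after the first of them.
    return offset - (newlinesBeforeDedent - 1);
}

// 1-based line number of an offset, counting '\n' characters only.
function lineOf(text: string, offset: number): number {
    return text.substring(0, offset).split('\n').length;
}

const sample = "if True:\n    print('yes')\n\n\nelse:\n    print('no')\n";
const elseOffset = sample.indexOf('else');

console.log(lineOf(sample, elseOffset));                       // 5
console.log(lineOf(sample, dedentOffset(sample, elseOffset))); // 3
```

With `\n` line endings this reproduces the move from the 5th line to the 3rd line described above.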

When the dedent had some whitespace after it (not an empty line), that whitespace was considered part of the dedent token, causing the CST node range to be off.

msujew · 2024-10-09T13:09:26Z

packages/langium/src/parser/indentation-aware.ts

@@ -321,19 +312,20 @@ export class IndentationAwareTokenBuilder<Terminals extends string = string, Key
        }

        const numberOfDedents = this.indentationStack.length - matchIndentIndex - 1;
+        const newlinesBeforeDedent = text.substring(0, offset).match(/[\r\n]+$/)?.[0].length ?? 1;


I think this should not only backtrack through all the newline characters, but also whitespace. For example, if we dedent later:

{
    value
    <-- whitespace
    <-- whitespace
} <-- Dedent token appears at start of this line

This seems pretty weird, as if we remove the whitespace, the dedent already appears in the 3rd line. Since this PR is there to fix behavior such as the folding, I would've thought it should keep the DEDENT token in the 3rd line.

But if we emit the DEDENT token at the beginning of the 3rd line, then the whitespace after it will be detected as an INDENT, naturally followed by another DEDENT on the next line (and once more, for this example). These extra INDENT/DEDENT tokens are unexpected and will cause parsing errors.

I generally think that trailing whitespace is problematic and should be marked as a parsing error, especially in indentation-sensitive languages.
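
For illustration, a simplified model of why whitespace-only lines are hazardous. This is a toy, not Langium's tokenizer, and `indentationEvents` is a hypothetical name. It compares a naive scanner that treats the whitespace on such lines as indentation with one that skips them.

```typescript
// Toy model: lists the INDENT/DEDENT events produced for an input.
function indentationEvents(text: string, skipWhitespaceOnlyLines: boolean): string[] {
    const events: string[] = [];
    const stack = [0]; // indentation levels that are currently open
    for (const line of text.split('\n')) {
        const level = /^[ \t]*/.exec(line)![0].length;
        const whitespaceOnly = level === line.length;
        if (whitespaceOnly && skipWhitespaceOnlyLines) continue;
        if (level > stack[stack.length - 1]) {
            stack.push(level);
            events.push('INDENT');
        }
        while (level < stack[stack.length - 1]) {
            stack.pop();
            events.push('DEDENT');
        }
    }
    return events;
}

// The third line contains eight spaces and nothing else.
const block = ['{', '    value', '        ', '}'].join('\n');

console.log(indentationEvents(block, false)); // [ 'INDENT', 'INDENT', 'DEDENT', 'DEDENT' ]
console.log(indentationEvents(block, true));  // [ 'INDENT', 'DEDENT' ]
```

The naive variant produces an extra INDENT/DEDENT pair that a grammar would not expect, which is the kind of spurious token described above.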

I see. I guess that makes sense. Thanks for the explanation 👍

msujew

Makes sense to me, thanks 👍

aabounegm added 3 commits October 2, 2024 14:08

Fix indentation duplication as a whitespace token

3994c08

Fix dedent tokens being nonempty

1919efa

When the dedent had some whitespace after it (not an empty line), that whitespace was considered part of the dedent token, causing the CST node range to be off

Place the dedent after first new line

048d76d

msujew reviewed Oct 9, 2024

msujew approved these changes Oct 14, 2024

msujew merged commit d0522c1 into eclipse-langium:main Oct 14, 2024
4 checks passed

aabounegm deleted the indentation-tokens-length branch October 14, 2024 09:17
