Parsing C-style identifiers #711

maartenflippo · 2024-12-30T10:10:13Z

I am trying to write a lexer for a C-style language, and I came across the following (to me) unexpected behavior. I am using version 1.0.0-alpha.7.

I have the following setup (simplified):

let ident = text::ident().map(Token::Ident);
let constant = text::int(10).map(Token::Constant); // Token::Constant takes a &str and not an integer type

let token = ident.or(constant).padded().repeated();

Let's say we are trying to parse the input "1foo" as a token. The ident parser fails, as expected, because identifiers cannot start with a digit. However, constant succeeds in parsing the "1", and then the ident parser succeeds in parsing "foo". As a result, the input "1foo" yields two tokens: [Constant("1"), Ident("foo")].

Ideally, I would expect the token parser to fail. If I were to use regex, I would use a word boundary as part of the pattern, which gives my desired behavior. My question is: is there an 'idiomatic' way to achieve what I want? One possible solution is to change the ident parser to accept inputs starting with a digit, and then use .try_map to reject those identifiers that start with a digit. It works, but it feels a bit like a hack.
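For illustration, the word-boundary behavior described above can be sketched in plain Rust, without any parser-combinator crate. The helper name `constant_with_boundary` is hypothetical and not part of any library; it only shows the check of "digits, then no identifier character directly after".

```rust
/// Hypothetical helper: match a decimal constant at the start of `input`,
/// but only if it is not immediately followed by an identifier character.
/// Returns the matched digits and the remaining input on success.
fn constant_with_boundary(input: &str) -> Option<(&str, &str)> {
    // Find the end of the leading run of ASCII digits.
    let end = input
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(input.len());
    if end == 0 {
        return None; // no digits at all
    }
    let (digits, rest) = input.split_at(end);
    // Emulate a regex word boundary: reject if a letter, digit, or `_` follows.
    match rest.chars().next() {
        Some(c) if c.is_alphanumeric() || c == '_' => None,
        _ => Some((digits, rest)),
    }
}
```

With this check, "1foo" is rejected outright, while "12 foo" still yields the constant "12" followed by the rest of the input.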

In my search for a solution, I found that the nano_rust example has the exact same behavior. The lexer succeeds, and it is left to the parsing stage to identify the problem.

zesterer · 2024-12-31T10:52:19Z

Maintainer

The usual approach is to change your definitions slightly:

Identifiers: one alphabetic character or underscore, followed by any number of alphanumeric characters or underscores
Numbers: one numeric character, followed by any number of alphanumeric characters or underscores

Of course, not all numbers necessarily correspond to a semantically valid number, but that's usually something caught by a later pass. 1foo is such a case: it would be parsed as a number, but later rejected as an invalid numeric literal. This approach allows things like binary, octal and hex numbers to be parsed more easily too.
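As a rough sketch of this approach, here is a hand-written lexer in plain Rust rather than the parser-combinator API from the question. The `Token` enum mirrors the one in the question; `lex`, `validate`, and the two character-class helpers are hypothetical names, and the validation pass assumes that only decimal literals are valid.

```rust
/// Token kinds for this sketch; both variants borrow their text from the input.
#[derive(Debug, PartialEq)]
enum Token<'a> {
    Ident(&'a str),
    Constant(&'a str),
}

fn is_ident_start(c: char) -> bool {
    c.is_ascii_alphabetic() || c == '_'
}

fn is_ident_continue(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Lex whitespace-separated tokens using the two definitions above.
fn lex(input: &str) -> Result<Vec<Token<'_>>, String> {
    let mut tokens = Vec::new();
    let mut rest = input.trim_start();
    while let Some(first) = rest.chars().next() {
        // Both token classes greedily consume the same continuation characters,
        // so "1foo" becomes a single number-shaped token.
        let end = rest
            .find(|c: char| !is_ident_continue(c))
            .unwrap_or(rest.len());
        let (word, tail) = rest.split_at(end);
        if is_ident_start(first) {
            tokens.push(Token::Ident(word));
        } else if first.is_ascii_digit() {
            tokens.push(Token::Constant(word));
        } else {
            return Err(format!("unexpected character {first:?}"));
        }
        rest = tail.trim_start();
    }
    Ok(tokens)
}

/// Later pass: reject number-shaped tokens that are not valid decimal literals.
/// (Assumption: a real language would also accept hex, octal, etc. here.)
fn validate(tokens: &[Token<'_>]) -> Result<(), String> {
    for token in tokens {
        if let Token::Constant(text) = token {
            if !text.chars().all(|c| c.is_ascii_digit()) {
                return Err(format!("invalid numeric literal `{text}`"));
            }
        }
    }
    Ok(())
}
```

Here `lex("1foo")` produces the single token `Constant("1foo")`, and `validate` then reports it as an invalid numeric literal, so the error surfaces with the whole offending word attached instead of as two unrelated tokens.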
