Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a new lexer (and split Token and related code out of experimental/ast) #358

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

mcy
Copy link
Member

@mcy mcy commented Oct 25, 2024

This PR implements a new Protobuf lexer under experimental/parser. It is a fairly lenient lexer that parses many things protoc does not, and diagnoses them.

Along with this I have separated the token parts of experimental/ast into its own package. This was actually a really good idea, and makes the token code (arguably the fussiest part of the AST library) easier to follow. It's a good sign when three files become eight.

I also did some work in experimental/report to make it so there is Only One True Span Type, report.Span. This side-steps a lot of the nastiness around carrying line and column information around; it is now truly only computed on-demand when diagnostics are rendered. Also, hooray for the massive experimental/report test suite.

@mcy mcy requested a review from jhump October 25, 2024 19:25
@mcy
Copy link
Member Author

mcy commented Oct 25, 2024

I would like to note that it is really nice to feel like I am "iterating" on existing code now XD

@mcy mcy force-pushed the mcy/lexer2 branch 2 times, most recently from 7306a37 to ca721ad Compare October 25, 2024 21:28
experimental/internal/with.go Outdated Show resolved Hide resolved
// This is helpful for immediately panicking on function entry.
func PanicIfNil[C comparable](with *With[C]) {
if with.Nil() {
panic(fmt.Errorf("use of zero value: %p", with))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Printing the address of with doesn't seem helpful for users that get this error. The message seems a little vague, too. Maybe instead:

Suggested change
panic(fmt.Errorf("use of zero value: %p", with))
var zero C
panic(fmt.Errorf("value has nil context %T", zero))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooh, I always forget about %T.

experimental/report/renderer.go Outdated Show resolved Hide resolved
experimental/token/kind.go Outdated Show resolved Hide resolved
experimental/token/raw.go Outdated Show resolved Hide resolved
digits:
for i := 0; i < digits; i++ {
if l.Done() {
break escapeLoop
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, let's not use labels -- this form of break is basically a goto. You could use a couple of local var flags to get the right control flow. But even better is likely to factor some of this out to a function, where you can use break vs return to control the level at which the loops are resumed. That would also help with readability since this function is kind of a monster.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This particular label is gone.

Personally I am kind of allergic to using functions in lieu of labels. However, this function did benefit from some chopping up regardless.

case r >= 'A' && r <= 'F':
value |= uint32(r) - 'A' + 10
default:
break digits
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not correct for \u and \U escapes, which must be exactly four or eight. Only \x can be short.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Their length is validated elsewhere.

experimental/parser/lexer.go Outdated Show resolved Hide resolved
experimental/parser/lexer.go Outdated Show resolved Hide resolved
})
return
}
text = text[utf8.RuneLen(r):]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will utf8.RuneLen always be correct? Just want to make sure whether it's possible for an encoder to marshal a rune in a "non-canonical" way, where we'd really need the original length from utf8.DecodeRune.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UTF-8 does not permit over-long encodings, which DecodeRune rejects.

@mcy mcy requested a review from jhump November 13, 2024 23:35
mcy added a commit that referenced this pull request Nov 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants