Implement a new lexer (and split `Token` and related code out of `experimental/ast`) #358

mcy · 2024-10-25T19:25:10Z

This PR implements a new Protobuf lexer under experimental/parser. It is a fairly lenient lexer that parses many things protoc does not, and diagnoses them.

Along with this I have separated the token parts of experimental/ast into their own package. This was actually a really good idea, and makes the token code (arguably the fussiest part of the AST library) easier to follow. It's a good sign when three files become eight.

I also did some work in experimental/report to make it so there is Only One True Span Type, report.Span. This side-steps a lot of the nastiness around carrying line and column information around; it is now truly only computed on-demand when diagnostics are rendered. Also, hooray for the massive experimental/report test suite.
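As a rough illustration of computing positions on demand, a span can carry only byte offsets and derive line and column when something needs to render them. This is a sketch under assumptions: `offsetSpan` and `lineColumn` are hypothetical names, not the actual `report.Span` API, and columns here are counted in bytes.

```go
package main

import (
	"fmt"
	"strings"
)

// offsetSpan is a hypothetical stand-in, not the real report.Span.
// It stores only byte offsets into the file text.
type offsetSpan struct {
	text       string
	start, end int
}

// lineColumn computes the 1-based line and byte column of offset by
// scanning the text before it. offset must be within text.
func lineColumn(text string, offset int) (line, col int) {
	prefix := text[:offset]
	line = strings.Count(prefix, "\n") + 1
	col = offset - (strings.LastIndex(prefix, "\n") + 1) + 1
	return line, col
}

// startLineColumn derives the position of the span's start only when asked.
func (s offsetSpan) startLineColumn() (line, col int) {
	return lineColumn(s.text, s.start)
}

func main() {
	text := "syntax = \"proto3\";\nmessage M {}\n"
	s := offsetSpan{text: text, start: 19, end: 26}
	line, col := s.startLineColumn()
	fmt.Printf("%d:%d %q\n", line, col, text[s.start:s.end])
}
```

Nothing is precomputed or stored per token in this sketch; the cost of the scan is paid only for spans that end up in a rendered diagnostic.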

mcy · 2024-10-25T19:25:28Z

I would like to note that it is really nice to feel like I am "iterating" on existing code now XD

jhump · 2024-10-29T15:56:46Z

experimental/internal/with.go

+// This is helpful for immediately panicking on function entry.
+func PanicIfNil[C comparable](with *With[C]) {
+	if with.Nil() {
+		panic(fmt.Errorf("use of zero value: %p", with))


Printing the address of with doesn't seem helpful for users that get this error. The message seems a little vague, too. Maybe instead:

Suggested change

-		panic(fmt.Errorf("use of zero value: %p", with))
+		var zero C
+		panic(fmt.Errorf("value has nil context %T", zero))

mcy · 2024-11-11

Ooh, I always forget about %T.
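For reference, a minimal sketch of what the `%T` form prints. `fakeContext` and `nilContextMessage` are hypothetical names made up for this illustration; the real `With[C]` type is not reproduced here.

```go
package main

import "fmt"

// fakeContext is a hypothetical context type used only for this illustration.
type fakeContext struct{}

// nilContextMessage builds the message from the suggestion for a context
// type C, using the zero value of C only to name its type.
func nilContextMessage[C comparable]() string {
	var zero C
	return fmt.Sprintf("value has nil context %T", zero)
}

func main() {
	fmt.Println(nilContextMessage[*fakeContext]())
	// value has nil context *main.fakeContext
	fmt.Println(nilContextMessage[int]())
	// value has nil context int
}
```

One caveat: if C were instantiated with an interface type, the zero value would be a nil interface and `%T` would print `<nil>` rather than a type name.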

jhump · 2024-11-06T16:20:54Z

experimental/parser/lexer.go

+		digits:
+			for i := 0; i < digits; i++ {
+				if l.Done() {
+					break escapeLoop


In general, let's not use labels -- this form of break is basically a goto. You could use a couple of local var flags to get the right control flow. But even better is likely to factor some of this out to a function, where you can use break vs return to control the level at which the loops are resumed. That would also help with readability since this function is kind of a monster.

mcy · 2024-11-13

This particular label is gone.

Personally I am kind of allergic to using functions in lieu of labels. However, this function did benefit from some chopping up regardless.
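To make the trade-off concrete, here is a small sketch of the "factor it out" approach: the digit loop moves into a helper, so an early `return` does the job of a labeled break. All names are hypothetical and the logic is deliberately simplified (only `\x` escapes); it is not the code from this PR.

```go
package main

import "fmt"

// hexDigit returns the value of a single hexadecimal digit.
func hexDigit(b byte) (uint32, bool) {
	switch {
	case b >= '0' && b <= '9':
		return uint32(b - '0'), true
	case b >= 'a' && b <= 'f':
		return uint32(b-'a') + 10, true
	case b >= 'A' && b <= 'F':
		return uint32(b-'A') + 10, true
	}
	return 0, false
}

// readHexDigits reads at most limit hex digits from the front of text and
// reports their value and how many were read. The early return stands in
// for a labeled break out of the digit loop.
func readHexDigits(text string, limit int) (value uint32, n int) {
	for n < limit && n < len(text) {
		d, ok := hexDigit(text[n])
		if !ok {
			return value, n
		}
		value = value<<4 | d
		n++
	}
	return value, n
}

// countHexEscapes counts `\x` escapes in text, stopping at the first
// malformed escape. The outer loop needs no label because the inner loop
// lives in readHexDigits.
func countHexEscapes(text string) (count int, ok bool) {
	for len(text) > 0 {
		if text[0] != '\\' {
			text = text[1:]
			continue
		}
		if len(text) < 2 || text[1] != 'x' {
			return count, false
		}
		_, n := readHexDigits(text[2:], 2)
		if n == 0 {
			return count, false
		}
		count++
		text = text[2+n:]
	}
	return count, true
}

func main() {
	fmt.Println(countHexEscapes(`a\x41b\x4`))
	fmt.Println(countHexEscapes(`\xZ`))
}
```

Whether this reads better than a label is a matter of taste, as the thread shows; the helper does make the digit loop testable on its own.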

jhump · 2024-11-06T16:24:19Z

experimental/parser/lexer.go

+				case r >= 'A' && r <= 'F':
+					value |= uint32(r) - 'A' + 10
+				default:
+					break digits


This is not correct for \u and \U escapes, which must have exactly four and eight hex digits, respectively. Only \x can be short.

mcy · 2024-11-13

Their length is validated elsewhere.
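Where that validation lives is not shown in this thread. As a sketch of the rule being discussed, a separate length check could look like the following. `validEscapeLength` is a hypothetical helper, and the upper bound of two digits for `\x` is an assumption not stated above.

```go
package main

import "fmt"

// validEscapeLength reports whether n hex digits is an acceptable count
// after the escape introducer kind ('x', 'u', or 'U').
// Assumption: \x takes one or two digits; \u and \U take exactly 4 and 8.
func validEscapeLength(kind byte, n int) bool {
	switch kind {
	case 'x':
		return n >= 1 && n <= 2
	case 'u':
		return n == 4
	case 'U':
		return n == 8
	}
	return false
}

func main() {
	fmt.Println(validEscapeLength('x', 1)) // true: \x may be short
	fmt.Println(validEscapeLength('u', 3)) // false: \u needs exactly four
	fmt.Println(validEscapeLength('U', 8)) // true
}
```

Splitting the check out this way lets the digit loop stop early on any non-hex character while still diagnosing a short `\u` or `\U`.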

jhump · 2024-11-06T16:26:33Z

experimental/parser/lexer.go

+			})
+			return
+		}
+		text = text[utf8.RuneLen(r):]


Will utf8.RuneLen always be correct? I want to check whether an encoder could marshal a rune in a "non-canonical" way, in which case we'd really need the original length returned by utf8.DecodeRune.

mcy · 2024-11-11

UTF-8 does not permit over-long encodings, which DecodeRune rejects.

mcy requested a review from jhump October 25, 2024 19:25

mcy force-pushed the mcy/lexer2 branch 2 times, most recently from 7306a37 to ca721ad Compare October 25, 2024 21:28

jhump reviewed Nov 6, 2024

mcy force-pushed the mcy/lexer2 branch from 2139085 to 401302a Compare November 13, 2024 23:35

mcy requested a review from jhump November 13, 2024 23:35

mcy added a commit that referenced this pull request Nov 19, 2024

Contents of #358

958692c

mcy added 6 commits November 18, 2024 17:02

rip all the token stuff out of ast

2a86d93

implement a lexer (tests tbd)

6cc540c

handle non-printable characters in diagnostics gracefully

b19bf73

start writing lexer tests; fix some unicode issues

683be9d

tests

ad1ab53

windows fix

009d06c

mcy force-pushed the mcy/lexer2 branch from 401302a to 009d06c Compare November 19, 2024 01:04
