Start, end of tokens in sanitized text #69

zzwx · 2020-11-30T13:12:04Z

This update keeps track of each token's location in the provided text. However, I couldn't figure out a better way to handle the sanitizing step inside Tokenize (which produces the clean text) without breaking the API here:

	clean, white := t.sanitizer.Replace(text), false
	length := len(clean)

As a result, Start and End index into clean, not into the original text.

Possible solution: leave sanitizing up to the caller, so that they have both the original string and the locations in it.
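To make the drift concrete, here is a minimal sketch. The replacer pairs (curly quotes to straight quotes) and the helper offsetIn are assumptions for illustration only, not the library's actual sanitizer; they were picked because the replacement changes the byte length.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizer is a stand-in for the tokenizer's internal replacer. The pairs
// below are assumptions: each 3-byte curly quote becomes a 1-byte quote.
var sanitizer = strings.NewReplacer("\u201c", `"`, "\u201d", `"`)

// offsetIn is a hypothetical helper: the byte offset of word in s, or -1.
func offsetIn(s, word string) int {
	return strings.Index(s, word)
}

func main() {
	original := "\u201cHi\u201d there"
	clean := sanitizer.Replace(original)

	// The same word sits at different byte offsets in the two strings.
	fmt.Println(offsetIn(original, "there")) // 9
	fmt.Println(offsetIn(clean, "there"))    // 5
}
```

Any Start or End computed against clean would therefore point at the wrong bytes if applied to the original string.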

nicolasassi · 2021-01-15T11:56:56Z

types.go

+	Start int    // The token's start in bytes in sanitized text.
+	End   int    // The token's end in bytes in sanitized text.


The comments are confusing. Please make them simpler, as the concept of "sanitized" is outside this struct.

nicolasassi · 2021-01-15T12:18:57Z

utilities.go

@@ -52,24 +52,24 @@ func getDiskAsset(path string) *gob.Decoder {
 	return gob.NewDecoder(f)
 }

-func hasAnyPrefix(s string, prefixes []string) bool {
+func hasAnyPrefix(s string, prefixes []string) int {


Please add a comment explaining that -1 is the same as false.
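For illustration, a documented version could look roughly like the sketch below. The body is an assumption (the diff only shows the signature change): it returns the index of the first matching prefix, with -1 standing in for the old false.

```go
package main

import (
	"fmt"
	"strings"
)

// hasAnyPrefix reports which entry of prefixes s starts with.
// It returns the index of the first matching prefix, or -1 when no prefix
// matches; -1 is the equivalent of the former boolean false.
// NOTE: the index-returning semantics are assumed, not taken from the PR.
func hasAnyPrefix(s string, prefixes []string) int {
	for i, p := range prefixes {
		if strings.HasPrefix(s, p) {
			return i
		}
	}
	return -1
}

func main() {
	prefixes := []string{"[", "("}
	fmt.Println(hasAnyPrefix("(hello", prefixes)) // 1
	fmt.Println(hasAnyPrefix("hello", prefixes))  // -1
}
```

Callers that only need the old boolean can compare the result against 0, e.g. `hasAnyPrefix(s, prefixes) >= 0`.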

nicolasassi · 2021-01-15T12:21:49Z

I did not understand how this change is supposed to break the API. Could you please explain further?
Your changes seem promising!

zzwx · 2021-01-15T12:39:28Z

Thank you for reviewing.

The expectation is that Start and End refer to the original source string. However, the text is sanitized right inside Tokenize, so all offsets are computed against a possibly modified string, and that expectation no longer holds. (That's why I commented them with a reference to sanitized text, as a reminder.) The breaking change would be to remove sanitizing from Tokenize and leave it up to the caller, who would then hold a reference to the correct source string.

The fear though is that the library depends on this sanitizing step.
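A rough sketch of what the caller-side approach could look like. Token here mirrors only the fields discussed in this thread, and tokenize is a hypothetical whitespace splitter, not the library's Tokenize; the point is only that offsets index into whatever string the caller passes in.

```go
package main

import (
	"fmt"
	"unicode"
)

// Token mirrors only the fields under discussion (assumed shape).
type Token struct {
	Text  string
	Start int // byte offset where the token starts in the input string
	End   int // byte offset just past the token in the input string
}

// tokenize is a hypothetical whitespace splitter that does no sanitizing,
// so Start and End always refer to the string the caller supplied.
func tokenize(text string) []Token {
	var tokens []Token
	start := -1
	for i, r := range text {
		if unicode.IsSpace(r) {
			if start >= 0 {
				tokens = append(tokens, Token{Text: text[start:i], Start: start, End: i})
				start = -1
			}
		} else if start < 0 {
			start = i
		}
	}
	if start >= 0 {
		tokens = append(tokens, Token{Text: text[start:], Start: start, End: len(text)})
	}
	return tokens
}

func main() {
	// The caller decides whether to sanitize first and keeps the string
	// that the offsets refer to.
	source := "caf\u00e9 ok"
	for _, tok := range tokenize(source) {
		fmt.Println(tok.Text, tok.Start, tok.End, source[tok.Start:tok.End] == tok.Text)
	}
}
```

Whether the rest of the library still works on unsanitized input is exactly the open question above; this sketch does not address it.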

Start, end of tokens in sanitized text

f7164bc

nicolasassi reviewed Jan 15, 2021


nicolasassi added Status: Revision Needed Type: Enhancement labels Jan 15, 2021

nicolasassi self-assigned this Jan 15, 2021
