Skip to content
This repository has been archived by the owner on May 14, 2023. It is now read-only.

Start, end of tokens in sanitized text #69

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

zzwx
Copy link

@zzwx zzwx commented Nov 30, 2020

This update is to keep track of original locations in the provided text, however I couldn't figure out how better to deal with sanitized (clean) text step inside Tokenize without breaking API here:

	clean, white := t.sanitizer.Replace(text), false
	length := len(clean)

Obviously clean becomes the actual source for Start and End, and not the original text.

Possible solution: Leave sanitizing up to caller so that they can have both the original string & the locations in it.

Comment on lines +9 to +10
Start int // The token's start in bytes in sanitized text.
End int // The token's end in bytes in sanitized text.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

confusing comments, please make it simplier as the concept of "sanitized" is outside this struct

@@ -52,24 +52,24 @@ func getDiskAsset(path string) *gob.Decoder {
return gob.NewDecoder(f)
}

func hasAnyPrefix(s string, prefixes []string) bool {
func hasAnyPrefix(s string, prefixes []string) int {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add some comment explaning that -1 is the same as false

@nicolasassi
Copy link
Collaborator

I did not understand how this change is supposed to break the API. Could you please explain better?
Your changes seems promising!

@zzwx
Copy link
Author

zzwx commented Jan 15, 2021

Thank you for reviewing.

See, what is expected is that Start and End would refer to the original source string. However it is being sanitized right inside Tokenize and all calculations are now based on a possibly modified source, and the expectation here is not true anymore. (That's why I commented them with a reference to sanitized text, to remind of that). The break would be to remove sanitizing and leave it up to the caller to choose. Then he will have a reference to the correct source string.

The fear though is that the library depends on this sanitizing step.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants