Can we expose token offsets? #3
After normalization the original text is not the same as the input text. If you are returning token offsets, you also have to return the normalized text.
Hm... actually no, I need to know the original offsets. Is there any way to achieve that?
It is difficult to achieve the original offsets. The normalization code needs to be rewritten with offsets in mind.
If I fork this and create a version without normalization, how much impact does it have on accuracy?
You can omit the normalization if your input doesn't have control or other special characters.
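That condition can be checked up front. The guard below is a sketch of the maintainer's advice, assuming control characters are the relevant special characters: if none are present, the normalization pass can be skipped and original offsets stay valid.

```go
package main

import (
	"fmt"
	"unicode"
)

// needsNormalization reports whether the input contains control characters,
// the assumed class of "special characters" that normalization handles.
// If it returns false, normalization can be safely skipped.
func needsNormalization(text string) bool {
	for _, r := range text {
		if unicode.IsControl(r) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(needsNormalization("plain ascii text")) // prints: false
	fmt.Println(needsNormalization("tab\tinside"))      // prints: true — '\t' is a control rune
}
```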
Great, is it possible for you to grant me write permission to this repository, so I can send out code changes to add such an option? Or do you prefer that I fork this repository and make my own version?
I think it is better to fork this repo and make your own change.
Sure. https://github.com/susanhuhu/go-sentencepiece-encoder is the fork, with my changes already applied. How should I handle attribution, please?
I have a scenario that would benefit from knowing the offsets for each token. Can we expose a function like the one below? I tried to send a PR but got a 403 error.
// TokenizeToOffsets tokenizes text into pieces and returns tokens with their offsets
func (s *Sentencepiece) TokenizeToOffsets(text string) []TokenOffset {
	text = normalize(text)
	if s.lowercase {
		text = strings.ToLower(text)
	}
	runes := torunes(text)
	replaceWhiteSpace(runes)
	slices := s.decodeForwardToken(runes)
	slices = s.decodeBackwards(slices)
	// sliceToTokenOffsets is a hypothetical offset-aware counterpart of the
	// existing sliceToTokens, which returns tokens without offsets and so
	// cannot satisfy the []TokenOffset return type as-is.
	return s.sliceToTokenOffsets(slices)
}
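The snippet references a `TokenOffset` type without defining it. Below is an assumed shape for it, plus a sketch of the bookkeeping an offset-aware `sliceToTokens` variant would do: given pieces that concatenate to the normalized text, accumulate rune lengths into `[Start, End)` ranges. Both the struct fields and `offsetsFromPieces` are illustrative assumptions, and the offsets here are into the normalized, lowercased text, not the original input.

```go
package main

import "fmt"

// TokenOffset is an assumed shape for the proposed return type: the piece
// text plus the [Start, End) rune range it covers in the normalized text.
// Mapping back to the original input would additionally need an offset
// table produced during normalization.
type TokenOffset struct {
	Piece string
	Start int // inclusive rune offset into the normalized text
	End   int // exclusive rune offset
}

// offsetsFromPieces sketches the bookkeeping: pieces are assumed to
// concatenate to the normalized text, so running rune lengths give offsets.
func offsetsFromPieces(pieces []string) []TokenOffset {
	out := make([]TokenOffset, 0, len(pieces))
	pos := 0
	for _, p := range pieces {
		n := len([]rune(p))
		out = append(out, TokenOffset{Piece: p, Start: pos, End: pos + n})
		pos += n
	}
	return out
}

func main() {
	for _, t := range offsetsFromPieces([]string{"▁he", "llo"}) {
		fmt.Printf("%q %d..%d\n", t.Piece, t.Start, t.End)
	}
	// prints: "▁he" 0..3 then "llo" 3..6
}
```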