Can we expose token offsets? #3

Open
susanhuhu opened this issue Apr 19, 2023 · 8 comments

Comments

@susanhuhu

I have a scenario that would benefit from knowing the offsets of each token. Could we expose a function like the one below? I tried to send a PR but got a 403 error.

// TokenizeToOffsets tokenizes text into pieces and returns the tokens with their offsets.
// Note: the offsets refer to the normalized (and optionally lowercased) text, not the raw input.
func (s *Sentencepiece) TokenizeToOffsets(text string) []TokenOffset {
	text = normalize(text)
	if s.lowercase {
		text = strings.ToLower(text)
	}
	runes := torunes(text)
	replaceWhiteSpace(runes)
	slices := s.decodeForwardToken(runes)
	slices = s.decodeBackwards(slices)
	return s.sliceToTokens(slices)
}

@vikesh-raj
Owner

After normalization, the text is no longer the same as the input text,
so the token offsets would not line up with the original input.

If you return token offsets, you also have to return the normalized text.
Is this OK for you?

@susanhuhu
Author

Hm... actually no, I need to know the original offsets. Is there any way to achieve that?

@vikesh-raj
Owner

Recovering the original offsets is difficult. The normalization code would need to be rewritten with offsets in mind.
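
As a rough sketch of what "with offsets in mind" could look like: the normalizer records, for every rune it emits, the index of the source rune in the original text. `normalizeWithOffsets` and `originalSpan` are hypothetical names, and the normalization here (drop control characters, lowercase rune by rune) is a simplification. A real normalizer that expands or merges characters would need a richer mapping.

```go
package main

import (
	"fmt"
	"unicode"
)

// normalizeWithOffsets is a hypothetical, simplified normalizer. It drops
// control characters and lowercases each rune. For every rune it keeps, it
// records the index of the source rune in the original text.
func normalizeWithOffsets(text string) (out []rune, origIndex []int) {
	for i, r := range []rune(text) {
		if unicode.IsControl(r) {
			continue
		}
		out = append(out, unicode.ToLower(r))
		origIndex = append(origIndex, i)
	}
	return out, origIndex
}

// originalSpan maps a non-empty half-open span [start, end) over the
// normalized runes back to a half-open rune span in the original text.
func originalSpan(origIndex []int, start, end int) (int, int) {
	return origIndex[start], origIndex[end-1] + 1
}

func main() {
	original := "Hi\u0007 There"
	normalized, origIndex := normalizeWithOffsets(original)
	fmt.Println(string(normalized)) // prints: hi there

	// Suppose a tokenizer reported the piece at normalized runes [3, 8).
	s, e := originalSpan(origIndex, 3, 8)
	fmt.Println(s, e, string([]rune(original)[s:e])) // prints: 4 9 There
}
```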

@susanhuhu
Author

If I fork this and create a version without normalization, how much would that affect accuracy?

@vikesh-raj
Owner

You can omit the normalization if your input doesn't contain control characters or other special characters.
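
One way to guard that assumption is to check the input before skipping normalization. This is only a sketch: `hasControlChars` is a hypothetical helper, and it covers control characters only, since the thread does not spell out which "other special characters" matter.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// hasControlChars reports whether text contains any Unicode control
// character. Note that tab and newline also count as control characters.
func hasControlChars(text string) bool {
	return strings.IndexFunc(text, unicode.IsControl) >= 0
}

func main() {
	for _, input := range []string{"hello world", "hello\u0000world", "line one\nline two"} {
		if hasControlChars(input) {
			fmt.Printf("%q: contains control characters, keep normalization\n", input)
		} else {
			fmt.Printf("%q: no control characters found\n", input)
		}
	}
}
```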

@susanhuhu
Author

Great. Could you grant me write permission to this repository so I can send code changes that add such an option? Or would you prefer that I fork the repository and maintain my own version?

@vikesh-raj
Owner

I think it is better to fork this repo and make your own changes.
A little attribution would be great.

@susanhuhu
Author

Sure. https://github.com/susanhuhu/go-sentencepiece-encoder is the fork, and it already has my changes. How should I do the attribution?
