Can we expose token offsets? #3

susanhuhu · 2023-04-19T00:19:12Z

I have a scenario that would benefit from knowing the offsets of each token. Can we expose a function like the one below? I tried to send a PR but got a 403 error.

// TokenizeToOffsets tokenizes text into pieces and returns the tokens with their offsets.
func (s *Sentencepiece) TokenizeToOffsets(text string) []TokenOffset {
	text = normalize(text)
	if s.lowercase {
		text = strings.ToLower(text)
	}
	runes := torunes(text)
	replaceWhiteSpace(runes)
	slices := s.decodeForwardToken(runes)
	slices = s.decodeBackwards(slices)
	return s.sliceToTokens(slices)
}

vikesh-raj · 2023-04-20T08:04:29Z

After normalization, the text is no longer the same as the input text,
so the token offsets would not line up with the original input.

If you return token offsets, you also have to return the normalized text.
Is this OK for you?

susanhuhu · 2023-04-20T16:11:41Z

Hm... actually no, I need to know the original offsets. Is there any way to achieve that?

vikesh-raj · 2023-04-21T12:31:03Z

It is difficult to recover the original offsets. The normalization code would need to be rewritten with offsets in mind.
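
As a rough idea of what "with offsets in mind" could look like, the sketch below keeps, for every rune that survives normalization, the byte range it came from in the original text. It is not the library's code: alignedRune, normalizeWithOffsets and originalSpan are hypothetical names, and the only normalization performed is dropping control characters. A real normalizer that also replaces or expands characters would need a more involved mapping.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// alignedRune is a hypothetical type pairing a normalized rune with the
// byte range it occupied in the original text.
type alignedRune struct {
	r     rune
	start int // byte offset of the rune in the original text
	end   int // byte offset just past the rune in the original text
}

// normalizeWithOffsets drops control characters and remembers where each
// kept rune came from. This is an assumed, simplified normalization.
func normalizeWithOffsets(text string) []alignedRune {
	out := make([]alignedRune, 0, len(text))
	for i := 0; i < len(text); {
		r, size := utf8.DecodeRuneInString(text[i:])
		if !unicode.IsControl(r) {
			out = append(out, alignedRune{r: r, start: i, end: i + size})
		}
		i += size
	}
	return out
}

// originalSpan maps a non-empty half-open rune span [from, to) of the
// normalized text back to a byte span in the original text.
// The caller must ensure 0 <= from < to <= len(aligned).
func originalSpan(aligned []alignedRune, from, to int) (int, int) {
	return aligned[from].start, aligned[to-1].end
}

func main() {
	text := "a\u0000é\tbc"
	aligned := normalizeWithOffsets(text)

	// Pretend the tokenizer produced a piece covering normalized runes 1..2.
	start, end := originalSpan(aligned, 1, 3)
	fmt.Printf("%d %d %q\n", start, end, text[start:end])
}
```

Note that the recovered span includes any dropped characters that sat between the first and last rune of the piece, which is usually what highlighting or alignment code wants.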

susanhuhu · 2023-04-21T17:04:18Z

If I fork this and create a version without normalization, how much would that affect accuracy?

vikesh-raj · 2023-04-21T17:53:52Z

You can omit the normalization if your input doesn't contain control characters or other special characters.
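
One way to check that condition before skipping normalization is sketched below. hasControlChars is a hypothetical helper and covers only Unicode control characters; which "other special characters" matter is not spelled out here, so that part is left out.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// hasControlChars reports whether text contains any Unicode control
// character (category Cc), such as NUL, tab, or newline.
// It is a hypothetical helper, not part of the library.
func hasControlChars(text string) bool {
	return strings.IndexFunc(text, unicode.IsControl) >= 0
}

func main() {
	for _, s := range []string{"plain text", "tab\tseparated"} {
		if hasControlChars(s) {
			fmt.Printf("%q: keep normalization\n", s)
		} else {
			fmt.Printf("%q: normalization could be skipped\n", s)
		}
	}
}
```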

susanhuhu · 2023-04-21T19:10:14Z

Great. Could you grant me write permission to this repository so I can send code changes that add such an option? Or would you prefer that I fork this repository and make my own version?

vikesh-raj · 2023-04-24T08:10:24Z

I think it is better to fork this repo and make your own changes.
A little bit of attribution would be great.

susanhuhu · 2023-04-24T17:03:12Z

Sure. The fork is at https://github.com/susanhuhu/go-sentencepiece-encoder and already contains my changes. How should I do the attribution?
