Describe the bug
I'm trying to understand training on raw text and have reproduced a section of the training module code locally in a test app. It looks like the cut_chunk_for_newline function can throw away the text at the very beginning of the input if the first newline falls within the first newline_favor_len characters.
Is this by design? I think it's generally OK to discard text at a chunk boundary, because the overlap with the previous chunk covers it, but at the beginning of the input there is nothing to overlap with.
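For reference, the trimming behavior described above can be sketched roughly as follows. This is a reconstruction based on my local test app, not a verbatim copy of the project's source, so the exact conditions in the body are an assumption. Only the function name comes from the training module.

```python
def cut_chunk_for_newline(chunk: str, max_length: int) -> str:
    # Assumed reconstruction of the behavior, not the project's verbatim code.
    if "\n" not in chunk:
        return chunk

    # Leading trim: drop everything up to and including the first newline
    # when that newline falls within the first max_length characters.
    first_newline = chunk.index("\n")
    if first_newline < max_length:
        chunk = chunk[first_newline + 1:]

    if "\n" not in chunk:
        return chunk

    # Trailing trim: drop the partial line after the last newline when that
    # tail is shorter than max_length characters.
    last_newline = chunk.rindex("\n")
    if len(chunk) - last_newline < max_length:
        chunk = chunk[:last_newline]

    return chunk


# A short first line is removed entirely, leaving only the text after it.
print(repr(cut_chunk_for_newline("short first line\n" + "x" * 200, 128)))
```

The leading trim has no way of knowing whether the chunk is the first one in the file, which seems to be where the text gets lost.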
Is there an existing issue for this?
I have searched the existing issues
Reproduction
With the following text and newline_favor_len = 128, the first sentence ("Lacrosse is a contact team sport played with a lacrosse stick and a lacrosse ball") is discarded:
"Lacrosse is a contact team sport played with a lacrosse stick and a lacrosse ball.
It is the oldest organized sport in North America, with its origins with the indigenous people of North America as early as the 12th century. The game was extensively modified by European colonists, reducing the violence, to create its current collegiate and professional form."
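A minimal script that shows the effect on the text above, assuming the leading trim works the way I described. The helper `trim_leading` and its `is_first_chunk` flag are hypothetical names I made up for illustration. They are not part of the training module, and the flag is only one possible way to keep the start of the input.

```python
TEXT = (
    "Lacrosse is a contact team sport played with a lacrosse stick and a "
    "lacrosse ball.\n"
    "It is the oldest organized sport in North America, with its origins with "
    "the indigenous people of North America as early as the 12th century. The "
    "game was extensively modified by European colonists, reducing the "
    "violence, to create its current collegiate and professional form."
)


def trim_leading(chunk: str, newline_favor_len: int,
                 is_first_chunk: bool = False) -> str:
    """Leading trim only. `is_first_chunk` is a hypothetical guard."""
    # The first chunk has no preceding overlap, so keep it untouched.
    if is_first_chunk or "\n" not in chunk:
        return chunk
    first_newline = chunk.index("\n")
    # Drop the partial first line if the newline comes early enough.
    if first_newline < newline_favor_len:
        return chunk[first_newline + 1:]
    return chunk


current = trim_leading(TEXT, 128)
guarded = trim_leading(TEXT, 128, is_first_chunk=True)

print(current.startswith("Lacrosse"))  # False: the first sentence is gone
print(guarded.startswith("Lacrosse"))  # True: the first sentence is kept
```

The first newline sits at index 82, which is below 128, so the whole first sentence is cut unless the chunk is treated as the start of the input.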
Screenshot
No response
Logs
N/A
System Info
N/A
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.