cut_chunk_for_newline can discard beginning text #3675

Closed
1 task done
cthompson165 opened this issue Aug 24, 2023 · 3 comments
Labels
bug (Something isn't working), stale

Comments

@cthompson165

Describe the bug

I'm trying to understand training on raw text and have reproduced a section of the training module code locally in a test app. It looks like the cut_chunk_for_newline function can discard the text at the start of a chunk when the first newline falls within the first newline_favor_len characters.

Is this by design? Discarding text is generally fine because consecutive chunks overlap, but the very first chunk has nothing before it to overlap with, so its opening text is lost.

Is there an existing issue for this?

  • I have searched the existing issues

Reproduction

With the following text and newline_favor_len = 128, the first sentence ("Lacrosse is a contact team sport played with a lacrosse stick and a lacrosse ball") is discarded:

"Lacrosse is a contact team sport played with a lacrosse stick and a lacrosse ball.

It is the oldest organized sport in North America, with its origins with the indigenous people of North America as early as the 12th century. The game was extensively modified by European colonists, reducing the violence, to create its current collegiate and professional form."
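The behavior described above can be reproduced with a minimal sketch. The function name cut_chunk_sketch and its body are assumptions reconstructed from the description in this issue (trim up to the first newline when it falls within the limit, trim after the last newline when it falls within the limit of the end); it is not copied from the project and may differ from the real cut_chunk_for_newline.

```python
def cut_chunk_sketch(chunk: str, max_length: int) -> str:
    """Hypothetical stand-in for cut_chunk_for_newline (assumed logic)."""
    if "\n" not in chunk:
        return chunk

    # Drop everything up to and including the first newline when that
    # newline falls within the first max_length characters.
    first_newline = chunk.index("\n")
    if first_newline < max_length:
        chunk = chunk[first_newline + 1:]

    if "\n" not in chunk:
        return chunk

    # Drop everything from the last newline onward when that newline
    # falls within the last max_length characters.
    last_newline = chunk.rindex("\n")
    if len(chunk) - last_newline < max_length:
        chunk = chunk[:last_newline]

    return chunk


text = (
    "Lacrosse is a contact team sport played with a lacrosse stick "
    "and a lacrosse ball.\n"
    "\n"
    "It is the oldest organized sport in North America, with its origins "
    "with the indigenous people of North America as early as the 12th "
    "century. The game was extensively modified by European colonists, "
    "reducing the violence, to create its current collegiate and "
    "professional form."
)

result = cut_chunk_sketch(text, 128)

# The opening sentence ends well before character 128, so it is trimmed.
print("Lacrosse is a contact" in result)
```

Under these assumptions the first sentence is missing from result, even though this is the first chunk and no earlier chunk could have covered it.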

Screenshot

No response

Logs

N/A

System Info

N/A
cthompson165 added the bug (Something isn't working) label Aug 24, 2023
@oobabooga
Owner

Could you test #3476?

@cthompson165
Author

> Could you test #3476?

Yep - looks good

github-actions bot added the stale label Oct 5, 2023
@github-actions

github-actions bot commented Oct 5, 2023

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.

github-actions bot closed this as completed Oct 5, 2023