More sentence examples #10
Comments
Thanks for the issue! I'll look into these examples and see what I can do about fixing the ones that are parsed incorrectly.
I've added a branch with all the golden rules: https://github.com/neurosnap/sentences/tree/golden-rule
I excluded 7 tests because I felt they were unfair (breaking up lists is not the job of a sentence tokenizer, imo). So this library passed 27 of 44 tests and failed 17. It's not great, but I think most of these could be solved in the
go test ./...
ok github.com/neurosnap/sentences 0.174s
? github.com/neurosnap/sentences/cmd/sentences [no test files]
? github.com/neurosnap/sentences/data [no test files]
--- FAIL: TestGoldenRules (0.00s)
golden_rules_test.go:11: 10. Two letter (prepositive) abbreviations
golden_rules_test.go:12: Actual: [<Sentence [0:13] 'I can see Mt.'> <Sentence [13:29] ' Fuji from here.'>]
golden_rules_test.go:13: Actual: 2, Expected: 1
golden_rules_test.go:14: ===
golden_rules_test.go:11: 12. Possesive two letter abbreviations
golden_rules_test.go:12: Actual: [<Sentence [0:17] 'That is JFK Jr.'s'> <Sentence [17:23] ' book.'>]
golden_rules_test.go:13: Actual: 2, Expected: 1
golden_rules_test.go:14: ===
golden_rules_test.go:11: 14. Multi-period abbreviations at the end of a sentence
golden_rules_test.go:12: Actual: [<Sentence [0:33] 'I live in the E.U. How about you?'>]
golden_rules_test.go:13: Actual: 1, Expected: 2
golden_rules_test.go:14: ===
golden_rules_test.go:11: 15. U.S. as sentence boundary
golden_rules_test.go:12: Actual: [<Sentence [0:33] 'I live in the U.S. How about you?'>]
golden_rules_test.go:13: Actual: 1, Expected: 2
golden_rules_test.go:14: ===
golden_rules_test.go:11: 18. A.M. / P.M. as non sentence boundary and sentence boundary
golden_rules_test.go:12: Actual: [<Sentence [0:37] 'At 5 a.m. Mr. Smith went to the bank.'> <Sentence [37:98] ' He left the bank at 6 P.M. Mr. Smith then went to the store.'>]
golden_rules_test.go:13: Actual: 2, Expected: 3
golden_rules_test.go:14: ===
golden_rules_test.go:11: 21. Parenthetical inside sentence
golden_rules_test.go:12: Actual: [<Sentence [0:69] 'He teaches science (He previously worked for 5 years as an engineer.)'> <Sentence [69:94] ' at the local University.'>]
golden_rules_test.go:13: Actual: 2, Expected: 1
golden_rules_test.go:14: ===
golden_rules_test.go:11: 24. Single quotations inside sentence
golden_rules_test.go:12: Actual: [<Sentence [0:35] 'She turned to him, 'This is great.''> <Sentence [35:45] ' she said.'>]
golden_rules_test.go:13: Actual: 2, Expected: 1
golden_rules_test.go:14: ===
golden_rules_test.go:11: 25. Double quotations inside sentence
golden_rules_test.go:12: Actual: [<Sentence [0:35] 'She turned to him, "This is great."'> <Sentence [35:45] ' she said.'>]
golden_rules_test.go:13: Actual: 2, Expected: 1
golden_rules_test.go:14: ===
golden_rules_test.go:11: 32. List (period followed by parens and period to end item)
golden_rules_test.go:12: Actual: [<Sentence [0:3] '1.)'> <Sentence [3:19] ' The first item.'> <Sentence [19:23] ' 2.)'> <Sentence [23:40] ' The second item.'>]
golden_rules_test.go:13: Actual: 4, Expected: 2
golden_rules_test.go:14: ===
golden_rules_test.go:11: 36. List (period to mark list and period to end item)
golden_rules_test.go:12: Actual: [<Sentence [0:2] '1.'> <Sentence [2:18] ' The first item.'> <Sentence [18:21] ' 2.'> <Sentence [21:38] ' The second item.'>]
golden_rules_test.go:13: Actual: 4, Expected: 2
golden_rules_test.go:14: ===
golden_rules_test.go:20: 43. Geo Coordinates
golden_rules_test.go:21: Actual: [You can find it at N°.] Expected: [You can find it at N°. 1026.253.553.]
golden_rules_test.go:22: ===
golden_rules_test.go:11: 44. Named entities with an exclamation point
golden_rules_test.go:12: Actual: [<Sentence [0:19] 'She works at Yahoo!'> <Sentence [19:49] ' in the accounting department.'>]
golden_rules_test.go:13: Actual: 2, Expected: 1
golden_rules_test.go:14: ===
golden_rules_test.go:11: 46. Ellipsis at end of quotation
golden_rules_test.go:12: Actual: [<Sentence [0:102] 'Thoreau argues that by simplifying one’s life, “the laws of the universe will appear less complex.'> <Sentence [102:104] ' .'> <Sentence [104:106] ' .'> <Sentence [106:111] ' .”'>]
golden_rules_test.go:13: Actual: 4, Expected: 1
golden_rules_test.go:14: ===
golden_rules_test.go:11: 48. Ellipsis as sentence boundary (standard ellipsis rules)
golden_rules_test.go:12: Actual: [<Sentence [0:215] 'If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (preceded and followed by a space) and then indicate the end of the sentence with a period .'> <Sentence [215:217] ' .'> <Sentence [217:219] ' .'> <Sentence [219:221] ' .'> <Sentence [221:236] ' Next sentence.'>]
golden_rules_test.go:13: Actual: 5, Expected: 2
golden_rules_test.go:14: ===
golden_rules_test.go:11: 49. Ellipsis as sentence boundary (non-standard ellipsis rules)
golden_rules_test.go:12: Actual: [<Sentence [0:42] 'I never meant that.... She left the store.'>]
golden_rules_test.go:13: Actual: 1, Expected: 2
golden_rules_test.go:14: ===
golden_rules_test.go:11: 50. Ellipsis as non sentence boundary
golden_rules_test.go:12: Actual: [<Sentence [0:47] 'I wasn’t really ... well, what I mean...see .'> <Sentence [47:49] ' .'> <Sentence [49:51] ' .'> <Sentence [51:83] ' what I'm saying, the thing is .'> <Sentence [83:85] ' .'> <Sentence [85:87] ' .'> <Sentence [87:107] ' I didn’t mean it.'>]
golden_rules_test.go:13: Actual: 7, Expected: 1
golden_rules_test.go:14: ===
golden_rules_test.go:11: 51. 4-dot ellipsis
golden_rules_test.go:12: Actual: [<Sentence [0:47] 'One further habit which was somewhat weakened .'> <Sentence [47:49] ' .'> <Sentence [49:51] ' .'> <Sentence [51:113] ' was that of combining words into self-interpreting compounds.'> <Sentence [113:115] ' .'> <Sentence [115:117] ' .'> <Sentence [117:119] ' .'> <Sentence [119:151] ' The practice was not abandoned.'> <Sentence [151:153] ' .'> <Sentence [153:155] ' .'> <Sentence [155:157] ' .'>]
golden_rules_test.go:13: Actual: 11, Expected: 2
golden_rules_test.go:14: ===
FAIL
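Several of the failures in the log above (21, 24, 25 and 44) share one pattern: the tokenizer breaks after closing punctuation, and the next fragment starts with a lowercase letter. The following is only a minimal sketch of a post-processing pass that rejoins such fragments. It is not how neurosnap/sentences works internally; the function name mergeLowercaseContinuations is hypothetical, and the fragments are plain strings rather than the library's Sentence type.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// mergeLowercaseContinuations (hypothetical helper) appends a fragment to the
// previous one when its first non-space rune is a lowercase letter.
func mergeLowercaseContinuations(frags []string) []string {
	var out []string
	for _, f := range frags {
		trimmed := strings.TrimLeftFunc(f, unicode.IsSpace)
		r, _ := utf8.DecodeRuneInString(trimmed)
		if len(out) > 0 && trimmed != "" && unicode.IsLower(r) {
			// Looks like a continuation: glue it onto the previous fragment.
			out[len(out)-1] += f
			continue
		}
		out = append(out, f)
	}
	return out
}

func main() {
	// Fragments taken from golden rule 44 in the log above.
	frags := []string{"She works at Yahoo!", " in the accounting department."}
	for _, s := range mergeLowercaseContinuations(frags) {
		fmt.Printf("%q\n", s)
	}
}
```

The heuristic is deliberately crude: it would also glue together genuine sentences whose first word is lowercase (a brand name such as "iPhone", for instance), so it would need guarding before being used for real.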
Hi @neurosnap, can I try to fix the failures related to quotations inside sentences, if they are still relevant?
Absolutely! The English package has some ad-hoc fixes, best to start there. Happy to help any way I can (it has been a while since I was in this codebase)
The word "she" is not in the SentStarters in english.json, although "i", "he", "it", "they" are in this list. Together with other changes, this could fix some sentences. However, my changes to the file english.json don't seem to be read by the program. Does it make sense to change SentStarters in english.json manually, and how do I do it?
After you make a change to
I debug with a breakpoint in the
Did you run
Thanks! Running
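To make the SentStarters question above more concrete, here is a rough sketch of how a sentence-starter list can feed a boundary decision. It assumes, as in Punkt-style tokenizers generally, that a period after an abbreviation is more likely a sentence break when the next word is a frequent sentence starter. The map contents, the function name breaksAfterAbbrev and the punctuation cutset are all illustrative assumptions and do not reflect the actual contents or loading of english.json.

```go
package main

import (
	"fmt"
	"strings"
)

// breaksAfterAbbrev (hypothetical helper) reports whether the word following
// an abbreviation's period is a known sentence starter. Surrounding quotes and
// punctuation are stripped and the lookup is case-insensitive.
func breaksAfterAbbrev(nextWord string, starters map[string]bool) bool {
	w := strings.ToLower(strings.Trim(nextWord, `"'“”‘’(),.!?`))
	return starters[w]
}

func main() {
	// The four starters mentioned in the thread; "she" is missing on purpose.
	starters := map[string]bool{"i": true, "he": true, "it": true, "they": true}
	fmt.Println(breaksAfterAbbrev("She", starters))

	// Adding "she" by hand changes the decision for the same input.
	starters["she"] = true
	fmt.Println(breaksAfterAbbrev("She", starters))
}
```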
Can I work on a PR to fix some other sentences? I would like to try to fix the list examples.
Yep! Happy to review any PRs |
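For the list examples (golden rules 32 and 36), the failing output splits each marker such as "1." or "2.)" into its own sentence. One possible approach, sketched here under the assumption that a fix is applied after tokenization, is to attach a fragment that consists only of a list marker to the fragment that follows it. The regular expression and the name mergeListMarkers are hypothetical and cover only numeric markers.

```go
package main

import (
	"fmt"
	"regexp"
)

// listMarker matches a fragment that is nothing but a numeric list marker,
// e.g. "1.", " 2." or " 2.)".
var listMarker = regexp.MustCompile(`^\s*\d+\.\)?$`)

// mergeListMarkers (hypothetical helper) prepends marker-only fragments to
// the fragment that follows them.
func mergeListMarkers(frags []string) []string {
	var out []string
	pending := ""
	for _, f := range frags {
		if listMarker.MatchString(f) {
			pending += f // hold the marker until its item text arrives
			continue
		}
		out = append(out, pending+f)
		pending = ""
	}
	if pending != "" {
		// A trailing marker with no item text is kept as its own fragment.
		out = append(out, pending)
	}
	return out
}

func main() {
	// Fragments taken from golden rule 36 in the test log.
	frags := []string{"1.", " The first item.", " 2.", " The second item."}
	for _, s := range mergeListMarkers(frags) {
		fmt.Printf("%q\n", s)
	}
}
```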
Hi @neurosnap,
Sure thanks! |
The 50th sentence is not intuitive for me:
Fine with me! Thanks |
Can I work on a PR to fix some other sentences? I would like to try to add rules for Geo Coordinates.
Great, thanks! |
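In the Geo Coordinates failure (golden rule 43), the tokenizer stops at "N°." and drops the number that follows. A rule for this could start by marking the period in "N°." as protected whenever a digit comes next. The sketch below shows only that detection step; the pattern, the name protectedPeriods and the idea of returning byte offsets are assumptions for illustration, not the rule that ended up in the library.

```go
package main

import (
	"fmt"
	"regexp"
)

// numero matches "N°." followed by optional whitespace and a digit.
var numero = regexp.MustCompile(`N°\.\s*\d`)

// protectedPeriods (hypothetical helper) returns the byte offsets of periods
// that belong to "N°." and therefore should not end a sentence.
func protectedPeriods(text string) map[int]bool {
	protected := make(map[int]bool)
	for _, loc := range numero.FindAllStringIndex(text, -1) {
		// loc[0] points at "N"; the period sits right after "N°",
		// which is 3 bytes long in UTF-8.
		protected[loc[0]+len("N°")] = true
	}
	return protected
}

func main() {
	text := "You can find it at N°. 1026.253.553."
	for offset := range protectedPeriods(text) {
		fmt.Printf("protected period at byte %d: %q\n", offset, text[offset])
	}
}
```

A tokenizer could consult this set before accepting a period as a boundary; the periods inside the number itself would still need their own handling.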
In the last PR #33, formatting needed to be changed. What do you think about adding a format check to the pipeline? I would like to open an issue for that, and then make a PR.
Sounds good! Feel free to open separate issues as well. We don't want to overload this one task |
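As an illustration of what such a format check verifies, the standard library's go/format package can report whether a source file already matches canonical gofmt style. This is only a sketch of the idea with a hypothetical helper name, isGofmtClean; nothing in the thread says how the project's pipeline is set up, and a real CI step would more likely invoke the gofmt tool directly.

```go
package main

import (
	"bytes"
	"fmt"
	"go/format"
)

// isGofmtClean (hypothetical helper) reports whether src is unchanged after
// formatting. An error means src is not syntactically valid Go.
func isGofmtClean(src []byte) (bool, error) {
	formatted, err := format.Source(src)
	if err != nil {
		return false, err
	}
	return bytes.Equal(src, formatted), nil
}

func main() {
	clean := []byte("package main\n\nfunc main() {\n\tprintln(1)\n}\n")
	messy := []byte("package main\n\nfunc main() {\nprintln(1)\n}\n") // missing indentation

	for name, src := range map[string][]byte{"clean": clean, "messy": messy} {
		ok, err := isGofmtClean(src)
		fmt.Println(name, ok, err)
	}
}
```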
The Ruby lib pragmatic_segmenter has a list of 50+ sentence split examples that this lib fails to parse. You can use their list to test this lib.
For example:
Which neurosnap/sentences assumes is one sentence.