More sentence examples #10

Open
xeoncross opened this issue Dec 14, 2016 · 20 comments

Comments

@xeoncross

The Ruby library pragmatic_segmenter has a list of 50+ sentence-splitting examples, some of which this library fails to parse. You can use their list to test this library.

For example:

He left the bank at 6 P.M. Mr. Smith then went to the store.

neurosnap/sentences treats this as a single sentence.

@neurosnap
Owner

Thanks for the issue! I'll look into these examples and see what I can do about fixing some of these incorrectly parsed sentence examples.

@neurosnap neurosnap added the bug label Dec 14, 2016
@neurosnap
Owner

I've added a branch with all the golden rules: https://github.com/neurosnap/sentences/tree/golden-rule

I excluded 7 tests because I felt they were unfair (breaking up lists is not the job of a sentence tokenizer, imo).

So this library passes 27 of the 44 remaining tests and fails 17. It's not great, but I think most of these could be solved in the english package.

go test ./...
ok  	github.com/neurosnap/sentences	0.174s
?   	github.com/neurosnap/sentences/cmd/sentences	[no test files]
?   	github.com/neurosnap/sentences/data	[no test files]
--- FAIL: TestGoldenRules (0.00s)
	golden_rules_test.go:11: 10. Two letter (prepositive) abbreviations
	golden_rules_test.go:12: Actual: [<Sentence [0:13] 'I can see Mt.'> <Sentence [13:29] ' Fuji from here.'>]
	golden_rules_test.go:13: Actual: 2, Expected: 1
	golden_rules_test.go:14: ===
	golden_rules_test.go:11: 12. Possesive two letter abbreviations
	golden_rules_test.go:12: Actual: [<Sentence [0:17] 'That is JFK Jr.'s'> <Sentence [17:23] ' book.'>]
	golden_rules_test.go:13: Actual: 2, Expected: 1
	golden_rules_test.go:14: ===
	golden_rules_test.go:11: 14. Multi-period abbreviations at the end of a sentence
	golden_rules_test.go:12: Actual: [<Sentence [0:33] 'I live in the E.U. How about you?'>]
	golden_rules_test.go:13: Actual: 1, Expected: 2
	golden_rules_test.go:14: ===
	golden_rules_test.go:11: 15. U.S. as sentence boundary
	golden_rules_test.go:12: Actual: [<Sentence [0:33] 'I live in the U.S. How about you?'>]
	golden_rules_test.go:13: Actual: 1, Expected: 2
	golden_rules_test.go:14: ===
	golden_rules_test.go:11: 18. A.M. / P.M. as non sentence boundary and sentence boundary
	golden_rules_test.go:12: Actual: [<Sentence [0:37] 'At 5 a.m. Mr. Smith went to the bank.'> <Sentence [37:98] ' He left the bank at 6 P.M. Mr. Smith then went to the store.'>]
	golden_rules_test.go:13: Actual: 2, Expected: 3
	golden_rules_test.go:14: ===
	golden_rules_test.go:11: 21. Parenthetical inside sentence
	golden_rules_test.go:12: Actual: [<Sentence [0:69] 'He teaches science (He previously worked for 5 years as an engineer.)'> <Sentence [69:94] ' at the local University.'>]
	golden_rules_test.go:13: Actual: 2, Expected: 1
	golden_rules_test.go:14: ===
	golden_rules_test.go:11: 24. Single quotations inside sentence
	golden_rules_test.go:12: Actual: [<Sentence [0:35] 'She turned to him, 'This is great.''> <Sentence [35:45] ' she said.'>]
	golden_rules_test.go:13: Actual: 2, Expected: 1
	golden_rules_test.go:14: ===
	golden_rules_test.go:11: 25. Double quotations inside sentence
	golden_rules_test.go:12: Actual: [<Sentence [0:35] 'She turned to him, "This is great."'> <Sentence [35:45] ' she said.'>]
	golden_rules_test.go:13: Actual: 2, Expected: 1
	golden_rules_test.go:14: ===
	golden_rules_test.go:11: 32. List (period followed by parens and period to end item)
	golden_rules_test.go:12: Actual: [<Sentence [0:3] '1.)'> <Sentence [3:19] ' The first item.'> <Sentence [19:23] ' 2.)'> <Sentence [23:40] ' The second item.'>]
	golden_rules_test.go:13: Actual: 4, Expected: 2
	golden_rules_test.go:14: ===
	golden_rules_test.go:11: 36. List (period to mark list and period to end item)
	golden_rules_test.go:12: Actual: [<Sentence [0:2] '1.'> <Sentence [2:18] ' The first item.'> <Sentence [18:21] ' 2.'> <Sentence [21:38] ' The second item.'>]
	golden_rules_test.go:13: Actual: 4, Expected: 2
	golden_rules_test.go:14: ===
	golden_rules_test.go:20: 43. Geo Coordinates
	golden_rules_test.go:21: Actual: [You can find it at N°.] Expected: [You can find it at N°. 1026.253.553.]
	golden_rules_test.go:22: ===
	golden_rules_test.go:11: 44. Named entities with an exclamation point
	golden_rules_test.go:12: Actual: [<Sentence [0:19] 'She works at Yahoo!'> <Sentence [19:49] ' in the accounting department.'>]
	golden_rules_test.go:13: Actual: 2, Expected: 1
	golden_rules_test.go:14: ===
	golden_rules_test.go:11: 46. Ellipsis at end of quotation
	golden_rules_test.go:12: Actual: [<Sentence [0:102] 'Thoreau argues that by simplifying one’s life, “the laws of the universe will appear less complex.'> <Sentence [102:104] ' .'> <Sentence [104:106] ' .'> <Sentence [106:111] ' .”'>]
	golden_rules_test.go:13: Actual: 4, Expected: 1
	golden_rules_test.go:14: ===
	golden_rules_test.go:11: 48. Ellipsis as sentence boundary (standard ellipsis rules)
	golden_rules_test.go:12: Actual: [<Sentence [0:215] 'If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (preceded and followed by a space) and then indicate the end of the sentence with a period .'> <Sentence [215:217] ' .'> <Sentence [217:219] ' .'> <Sentence [219:221] ' .'> <Sentence [221:236] ' Next sentence.'>]
	golden_rules_test.go:13: Actual: 5, Expected: 2
	golden_rules_test.go:14: ===
	golden_rules_test.go:11: 49. Ellipsis as sentence boundary (non-standard ellipsis rules)
	golden_rules_test.go:12: Actual: [<Sentence [0:42] 'I never meant that.... She left the store.'>]
	golden_rules_test.go:13: Actual: 1, Expected: 2
	golden_rules_test.go:14: ===
	golden_rules_test.go:11: 50. Ellipsis as non sentence boundary
	golden_rules_test.go:12: Actual: [<Sentence [0:47] 'I wasn’t really ... well, what I mean...see .'> <Sentence [47:49] ' .'> <Sentence [49:51] ' .'> <Sentence [51:83] ' what I'm saying, the thing is .'> <Sentence [83:85] ' .'> <Sentence [85:87] ' .'> <Sentence [87:107] ' I didn’t mean it.'>]
	golden_rules_test.go:13: Actual: 7, Expected: 1
	golden_rules_test.go:14: ===
	golden_rules_test.go:11: 51. 4-dot ellipsis
	golden_rules_test.go:12: Actual: [<Sentence [0:47] 'One further habit which was somewhat weakened .'> <Sentence [47:49] ' .'> <Sentence [49:51] ' .'> <Sentence [51:113] ' was that of combining words into self-interpreting compounds.'> <Sentence [113:115] ' .'> <Sentence [115:117] ' .'> <Sentence [117:119] ' .'> <Sentence [119:151] ' The practice was not abandoned.'> <Sentence [151:153] ' .'> <Sentence [153:155] ' .'> <Sentence [155:157] ' .'>]
	golden_rules_test.go:13: Actual: 11, Expected: 2
	golden_rules_test.go:14: ===
FAIL
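
For readers who want to reproduce this kind of count-based check outside the repository, here is a minimal sketch of a golden-rule harness. It is not the code in golden_rules_test.go: the `goldenRule` type, the `failures` helper, and the `naiveSplit` tokenizer are all hypothetical names invented for illustration, and `naiveSplit` is a deliberately crude stand-in, not the algorithm used by neurosnap/sentences. The two rules and their expected counts are taken from the log above.

```go
package main

import (
	"fmt"
	"strings"
)

// goldenRule pairs an input text with the number of sentences a
// correct tokenizer should return. (Hypothetical type for this sketch.)
type goldenRule struct {
	name     string
	text     string
	expected int
}

// naiveSplit is a stand-in tokenizer: it breaks after every '.', '!' or '?'
// that is followed by a space. It knows nothing about abbreviations.
func naiveSplit(text string) []string {
	var out []string
	start := 0
	for i := 0; i < len(text)-1; i++ {
		if strings.ContainsRune(".!?", rune(text[i])) && text[i+1] == ' ' {
			out = append(out, text[start:i+1])
			start = i + 2 // skip the space after the punctuation
		}
	}
	if start < len(text) {
		out = append(out, text[start:])
	}
	return out
}

// failures runs every rule through tokenize and describes each rule whose
// sentence count differs from the expected count.
func failures(rules []goldenRule, tokenize func(string) []string) []string {
	var failed []string
	for _, r := range rules {
		if got := len(tokenize(r.text)); got != r.expected {
			failed = append(failed, fmt.Sprintf("%s: actual %d, expected %d", r.name, got, r.expected))
		}
	}
	return failed
}

// Two of the golden rules quoted in the log above.
var rules = []goldenRule{
	{"Two letter (prepositive) abbreviations", "I can see Mt. Fuji from here.", 1},
	{"U.S. as sentence boundary", "I live in the U.S. How about you?", 2},
}

func main() {
	for _, f := range failures(rules, naiveSplit) {
		fmt.Println(f)
	}
}
```

Passing the tokenizer as a function value keeps the harness independent of any particular library, so a real tokenizer could be swapped in for `naiveSplit` by wrapping it in a function with the same signature.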

@ryzheboka
Contributor

Hi @neurosnap, can I try to fix the failures related to quotations inside sentences, if they are still relevant?

@neurosnap
Owner

Absolutely! The English package has some ad-hoc fixes, so it's best to start there. Happy to help any way I can (it has been a while since I was in this codebase).

@ryzheboka
Contributor

The word "she" is not in the SentStarters in english.json, although "i", "he", "it", "they" are in this list. Together with other changes, this could fix some sentences. However, my changes to the file english.json don't seem to be read by the program. Does it make sense to change SentStarters in english.json manually, and how do I do it?

@neurosnap
Owner

After you make a change to english.json, what do you do to test if it worked or not?

@ryzheboka
Contributor

ryzheboka commented Nov 28, 2022

I debug with a breakpoint in func (a *MultiPunctWordAnnotation) tokenAnnotation(tokOne, tokTwo *sentences.Token) in main.go of the english package. While debugging, I inspect the value of a.SentStarters in the following code (from the same function) using the IDE's debugging tools:

	/*
		[4.1.3. Frequent Sentence Starter Heuristic] If the
		next word is capitalized, and is a member of the
		frequent-sentence-starters list, then label tok as a
		sentence break.
	*/
	if a.TokenParser.FirstUpper(tokTwo) && a.SentStarters[nextTyp] != 0 {
		tokOne.SentBreak = true
		return
	}
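
To make the effect of the missing "she" entry concrete, here is a small self-contained sketch of the heuristic quoted above. It is a simplification, not the library's code: `isSentenceBreak` is a hypothetical helper, the starter set is a plain `map[string]bool` rather than the library's own data structure, and only the four starters named in this thread are included.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// isSentenceBreak is a simplified, hypothetical version of the heuristic:
// the word following a period must be capitalized AND its lowercase form
// must be in the frequent-sentence-starters set.
func isSentenceBreak(next string, starters map[string]bool) bool {
	first, _ := utf8.DecodeRuneInString(next)
	return unicode.IsUpper(first) && starters[strings.ToLower(next)]
}

func main() {
	// Starters mentioned in the thread as already present in english.json.
	starters := map[string]bool{"i": true, "he": true, "it": true, "they": true}

	fmt.Println(isSentenceBreak("He", starters))  // true
	fmt.Println(isSentenceBreak("She", starters)) // false: "she" is not in the set

	// Adding the missing starter changes the decision.
	starters["she"] = true
	fmt.Println(isSentenceBreak("She", starters)) // true
}
```

Under this simplified model, a period followed by "She" is never promoted to a sentence break by this heuristic until "she" is added to the set.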

@neurosnap
Owner

neurosnap commented Nov 28, 2022

Did you run make english first? We embed the JSON data into the binary, so the program won't literally read what's in data/english.json. You first have to regenerate the data/english.go file, which is what we then use in our build.
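
The following sketch illustrates why editing the JSON file alone has no effect when the data is compiled in. It does not reproduce the generated data/english.go file: the `embeddedTraining` constant, the `training` struct, and `loadTraining` are hypothetical, and the JSON shape (a `SentStarters` object with integer values) is an assumption based only on the `a.SentStarters[nextTyp] != 0` check quoted earlier.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// embeddedTraining stands in for generated Go source: the JSON is frozen
// into the binary at build time. (Hypothetical content and shape.)
const embeddedTraining = `{"SentStarters": {"i": 1, "he": 1, "it": 1, "they": 1}}`

// training mirrors only the one field discussed in this thread.
type training struct {
	SentStarters map[string]int
}

// loadTraining decodes the JSON it is handed; it never touches the disk.
func loadTraining(raw string) (*training, error) {
	var t training
	if err := json.Unmarshal([]byte(raw), &t); err != nil {
		return nil, err
	}
	return &t, nil
}

func main() {
	t, err := loadTraining(embeddedTraining)
	if err != nil {
		panic(err)
	}
	// Editing a JSON file on disk cannot change this result; only
	// regenerating the embedded data and rebuilding does.
	fmt.Println(t.SentStarters["she"] != 0) // false
}
```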

@ryzheboka
Contributor

Thanks! Running make english helped.

@ryzheboka
Contributor

Can I work on a PR to fix some other sentences? I would like to try to fix the list examples.

@neurosnap
Owner

Yep! Happy to review any PRs

@ryzheboka
Contributor

Hi @neurosnap,
can I work on fixing the rules regarding ellipses?

@neurosnap
Owner

Sure thanks!

@ryzheboka
Contributor

The 50th sentence is not intuitive for me:
"I wasn’t really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didn’t mean it."
It's not really clear to me that this example is one sentence and not two. "I didn't mean it" could be a sentence on its own, as could the part of the sentence before it. Can I just leave this example out when handling ellipses?

@neurosnap
Owner

Fine with me! Thanks

@ryzheboka
Contributor

Can I work on a PR to fix some other sentences? I would like to try to add rules for Geo Coordinates.

@neurosnap
Owner

Great, thanks!

@ryzheboka
Contributor

In the last PR, #33, the formatting needed to be changed. What do you think about adding a format check to the pipeline? I would like to open an issue for that, and then make a PR.
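
As one possible shape for such a check, here is a minimal sketch that uses the standard library's go/format package to test whether a source buffer is already gofmt-clean. `isFormatted` is a hypothetical helper written for this illustration, not something that exists in the repository; in a CI pipeline the same idea is more commonly expressed by running `gofmt -l .` and failing if it lists any files.

```go
package main

import (
	"bytes"
	"fmt"
	"go/format"
)

// isFormatted reports whether src is already in canonical gofmt style.
// An error means src could not be parsed as Go source at all.
func isFormatted(src []byte) (bool, error) {
	formatted, err := format.Source(src)
	if err != nil {
		return false, err
	}
	return bytes.Equal(src, formatted), nil
}

func main() {
	// Extra spaces and a missing blank line make this differ from gofmt's output.
	messy := []byte("package main\nfunc  main( ) {}\n")
	ok, err := isFormatted(messy)
	fmt.Println(ok, err)
}
```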

@neurosnap
Owner

Sounds good! Feel free to open separate issues as well. We don't want to overload this one task


3 participants