Best way to get sentence spans #8

fhamborg · 2019-12-12T18:38:26Z

Hi, thank you for the awesome library! I don't know what you did, but at least for the data I need to process (mostly news articles), it seems that syntok performs best for sentence splitting. :-)

I was wondering what would be the most efficient way to get char-based sentence spans. Currently, I have:

spans = []
for paragraph in segmenter.analyze(text):
    for sent in paragraph:
        spans.append((sent[0].offset, sent[-1].offset + len(sent[-1].value)))

Do you think this is most efficient or is there a better way? Thanks in advance for your reply!


fnl · 2019-12-18T22:31:29Z

Hi Felix, thank you for your kind words, glad you like syntok.

Well, semantically, there is no easier way to get to these offsets, because they depend on all the methods you called there, plus the length of the last token.
But if you want to submit a function that wraps the above functionality in a simple API, such as an offsets function in the segmenter module, I'd be happy to accept a clean PR, if you think that is useful. The API would probably look like this:

from typing import Iterator, List, Tuple

def offset_iterator(text: str) -> Iterator[Tuple[int, int]]:
    # Yield the (start, end) character offsets of each sentence.
    for paragraph in analyze(text):
        for sent in paragraph:
            yield sent[0].offset, sent[-1].offset + len(sent[-1].value)

def offsets(text: str) -> List[Tuple[int, int]]:
    return list(offset_iterator(text))

It would have at least one other user, see issue #5, too! ;)
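To make the span arithmetic concrete, here is a minimal worked sketch that runs without syntok. The `Token` class and the `sentence_span` helper are hypothetical stand-ins, not part of syntok; they assume only the `offset` and `value` attributes used in the snippets above.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    """Hypothetical stand-in for a syntok token; only the two fields used here."""
    offset: int  # character position of the token in the text
    value: str   # the token's text


def sentence_span(sent: List[Token]) -> Tuple[int, int]:
    # Start of the first token, and one past the last character of the last token.
    return sent[0].offset, sent[-1].offset + len(sent[-1].value)


text = "Hello world. Bye."
sent = [Token(0, "Hello"), Token(6, "world"), Token(11, ".")]
start, end = sentence_span(sent)
print(text[start:end])  # Hello world.
```

The end offset is exclusive, so the span can be used directly as a slice of the original text.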

fnl · 2019-12-18T22:49:14Z

Maybe I should add that if efficiency matters to you, you could forgo the end offset entirely and just generate a list of (start) integers. The end is then either the offset of the next sentence's start or, for the last sentence, the end of your str object, because anything dangling must be spaces that can be str.rstrip-ped, if it matters. But then, maybe you do need to report exact offsets, for UI highlighting or whatever reasons...

As a generator:

from typing import Iterator

def offset_iterator(text: str) -> Iterator[int]:
    # Yield only the start offset of each sentence.
    for paragraph in analyze(text):
        for sent in paragraph:
            yield sent[0].offset

Or, as a comprehension:

from typing import List

def offsets(text: str) -> List[int]:
    return [sent[0].offset for para in analyze(text) for sent in para]
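If exact spans are needed after all, they can be recovered from the start offsets alone, as described above. The following is only a sketch: `spans_from_starts` is a hypothetical helper, not part of syntok, and it assumes that everything between the end of one sentence and the start of the next is whitespace.

```python
from typing import List, Tuple


def spans_from_starts(text: str, starts: List[int]) -> List[Tuple[int, int]]:
    """Turn sentence start offsets into (start, end) character spans."""
    spans = []
    # Each sentence runs up to the next start; the last one runs to the end of the text.
    boundaries = starts[1:] + [len(text)]
    for start, stop in zip(starts, boundaries):
        # Drop trailing whitespace so the end offset sits right after the last token.
        end = start + len(text[start:stop].rstrip())
        spans.append((start, end))
    return spans


text = "Hello world. How are you?  "
print(spans_from_starts(text, [0, 13]))  # [(0, 12), (13, 25)]
```

This trades the per-sentence token lookup for one slice and one `rstrip` call per sentence, so whether it is actually faster would need to be measured.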

fhamborg · 2019-12-19T13:18:59Z

Hi Florian! Cool! Sure, what specifically do you have in mind? To me, your code snippets look like what would be more or less the PR I would send, i.e., they are almost complete already, aren't they?
Cheers,
Felix

fnl · 2019-12-19T15:26:12Z

Correct, I would particularly vouch for the first two functions (with the end offset), because the second two (without) seem almost too trivial to add. The two functions should fit verbatim into the module, but I did not run, much less test, the code...

fnl closed this as completed Dec 18, 2019

fnl reopened this Dec 18, 2019

fnl added the enhancement, help wanted, and question labels Dec 18, 2019

fnl changed the title ~~best way to get sentence spans~~ Best way to get sentence spans Jan 23, 2020
