Regular expression NER #118

francolq · 2016-12-12T20:40:50Z

Described here:
https://groups.google.com/forum/?hl=es-419#!topic/iepy/NqIP0nb0-ic

jmansilla · 2016-12-20T13:29:03Z

iepy/preprocess/ner/regexp.py

+            try:
+                m = next(i)
+                start, end = m.span()
+                # FIXME: do not count from the beggining


What should we do with this FIXME note? Is it already fixed and you just forgot to remove the comment?

francolq · 2016-12-20

Sorry, my mistake. This is not a FIXME; at most it is a TODO, since a small optimization is possible here. I think the comment can be safely removed.
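For reference, the optimization hinted at by that comment could look roughly like the sketch below. It assumes the offsets are turned into token indices by counting `><` separators in the prefix of the raw string, as the diff suggests. The function names and the sample string are hypothetical and not taken from the PR. Instead of recounting from the beginning for every match, it only scans the text between the previous offset and the current one.

```python
import re

SEP = '><'


def spans_naive(raw, pattern):
    # Recount separators from the start of the string for every match.
    for m in re.finditer(pattern, raw):
        start, end = m.span()
        yield raw[:start].count(SEP), raw[:end].count(SEP)


def spans_incremental(raw, pattern):
    pos = 0    # separators inside raw[:pos] are already counted
    seen = 0   # number of separators fully contained in raw[:pos]
    for m in re.finditer(pattern, raw):
        bounds = []
        for offset in m.span():
            # A separator may straddle pos, so rescan from pos - 1.
            seen += raw.count(SEP, max(pos - 1, 0), offset)
            pos = offset
            bounds.append(seen)
        yield tuple(bounds)


raw = 'the><red><car><red><bus'
pattern = r'red><\w+'
print(list(spans_incremental(raw, pattern)))
```

This relies on `re.finditer` returning matches in increasing, non-overlapping order, so the running count never has to move backwards.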

jmansilla · 2016-12-20T13:31:29Z

iepy/preprocess/ner/regexp.py

+    # preprocess the regular expression
+    regexp = re.sub(r'\s', '', regexp)
+    # replace < and > only if not double (<< or >>):
+    # FIXME: avoid matching \< and \>.


Same question about the FIXME here.
Can't it be solved before merging into develop?

francolq · 2016-12-20

Almost the same answer here. It was my mistake to call this a FIXME. Allowing '<' and '>' to be escaped would be an enhancement. The comment can be removed.
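A minimal sketch of what that enhancement might look like, assuming the preprocessing expands token brackets in the same style as NLTK's token searcher. The function name is hypothetical, and the sketch deliberately ignores the doubled `<<` / `>>` form and named groups that the real code handles. A negative lookbehind leaves `\<` and `\>` untouched.

```python
import re


def expand_token_brackets(regexp):
    # Hypothetical helper, not the code in the PR.
    # Drop whitespace so patterns can be written with spaces between tokens.
    regexp = re.sub(r'\s', '', regexp)
    # Expand only brackets that are not preceded by a backslash.
    regexp = re.sub(r'(?<!\\)<', '(?:<(?:', regexp)
    regexp = re.sub(r'(?<!\\)>', ')>)', regexp)
    return regexp


print(expand_token_brackets(r'<a> <\<b\>>'))
```

One known limitation of this approach: a bracket preceded by an escaped backslash (`\\<`) would also be skipped, so a complete solution would need to count the backslashes.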

rafacarrascosa · 2016-12-26T17:28:52Z

iepy/preprocess/ner/regexp.py

+
+    def __init__(self, tokens):
+        # replace < and > inside tokens with \< and \>
+        _raw = '><'.join(w.replace('<', '\<').replace('>', '\>') for w in tokens)


I'm not completely sure, but would it be a problem if there is a token with \< (or \>) inside it?
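To make the concern concrete, here is a small sketch with hypothetical helper names, not code from the PR. If backslashes in tokens are not escaped themselves, a token that ends in a backslash makes the following separator look like an escaped `>`. Escaping the backslash first keeps the encoding unambiguous and reversible. Separately, `'\<'` written as a plain string literal is an invalid escape sequence that recent Python versions warn about, so the sketch spells it `'\\<'`.

```python
import re


def naive_escape(token):
    # Same idea as the diff: escape only the angle brackets.
    return token.replace('<', '\\<').replace('>', '\\>')


def safe_escape(token):
    # Escape the backslash first, then the angle brackets.
    return (token.replace('\\', '\\\\')
                 .replace('<', '\\<')
                 .replace('>', '\\>'))


def unescape(text):
    # Drop one level of backslash escaping.
    return re.sub(r'\\(.)', r'\1', text)


tokens = ['a\\', 'b']  # the first token ends with a backslash
naive_raw = '><'.join(naive_escape(t) for t in tokens)
safe_raw = '><'.join(safe_escape(t) for t in tokens)
print(naive_raw)  # the separator's '>' now follows a backslash
print(safe_raw)   # the backslash is doubled, so the separator stays distinct
```

Whether this matters in practice depends on how the pattern side treats `\>`; if offsets are only ever converted by counting `><`, the naive form may be good enough.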

rafacarrascosa · 2016-12-26T17:30:46Z

iepy/preprocess/ner/regexp.py

+    def run_ner(self, doc):
+        entities = []
+        tokens = doc.tokens
+        searcher = TokenSearcher(tokens)


TokenSearcher is implemented as a class but it is stateless in a practical sense and it is used more like a function than a class.
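As an illustration of the function-style alternative, a generator could replace the class entirely. This is only a sketch under stated assumptions: the name `find_token_spans` is hypothetical, the pattern is taken to be a plain regular expression over the `<tok><tok>` string, and matches are assumed to begin at a `<` and end at a `>`. It does not reproduce the pattern preprocessing done in the PR.

```python
import re


def find_token_spans(tokens, pattern):
    """Yield (token_start, token_end) pairs, end exclusive, for each match."""
    escaped = (t.replace('<', '\\<').replace('>', '\\>') for t in tokens)
    raw = '<' + '><'.join(escaped) + '>'
    for m in re.finditer(pattern, raw):
        start, end = m.span()
        # Separators seen up to and including the opening '<' of the match.
        token_start = raw[:start + 1].count('><')
        # Separators seen before the closing '>' of the match, plus one.
        token_end = raw[:end].count('><') + 1
        yield token_start, token_end


print(list(find_token_spans(['the', 'red', 'car'], r'<red><\w+>')))
```

A caller such as `run_ner` would then iterate over `find_token_spans(doc.tokens, regexp)` directly, with no object to construct.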

rafacarrascosa · 2016-12-26T17:31:31Z

iepy/preprocess/ner/regexp.py

+import re
+import codecs
+
+from nltk.text import TokenSearcher as NLTKTokenSearcher


This import looks unused since (apparently) no method from NLTKTokenSearcher is used.

rafacarrascosa · 2016-12-26T17:33:32Z

iepy/preprocess/ner/regexp.py

+                token_start = self._raw[:start].count('><')
+                token_end = self._raw[:end].count('><')
+                yield MatchObject(m, token_start, token_end)
+            except:


This try...except is dangerous because it silently hides any error that can happen inside the loop.

Why not use for m in i: instead?
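The difference can be shown with a small, self-contained sketch. The function names and the deliberately failing callback are hypothetical. The bare `except` treats any exception as the end of iteration, including a bug raised while processing a match, whereas a `for` loop stops only when the iterator is exhausted and lets real errors propagate.

```python
import re


def collect_with_bare_except(pattern, text, process):
    results = []
    i = re.finditer(pattern, text)
    while True:
        try:
            m = next(i)
            results.append(process(m))
        except:  # swallows StopIteration and every other error
            break
    return results


def collect_with_for(pattern, text, process):
    # The for loop handles StopIteration itself; other errors propagate.
    return [process(m) for m in re.finditer(pattern, text)]


def buggy_process(match):
    raise ValueError('bug in the loop body')


# Returns an empty list and hides the ValueError completely.
print(collect_with_bare_except(r'\w+', 'a b', buggy_process))
```

Calling `collect_with_for` with the same arguments raises the `ValueError`, which is the behaviour one normally wants during development.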

francolq added 5 commits November 30, 2016 12:09

New Regular Expression NER. Based on, and improving, NLTKTokenSearcher.

7d818c6

Basic tests for regexp NER runner.

2e912ee

Test for regexps with named groups.

45cf828

Remove commented code.

a9e5d70

Some minimal documentation.

34ca219

jmansilla reviewed Dec 20, 2016

Remove FIXME comments for things that actually are possible enhancements (as discussed in the pull request).

c817913

rafacarrascosa reviewed Dec 26, 2016
