Broken support for short days / months with periods in them #221

bpotard · 2017-07-26T16:57:32Z

Hello,

It seems that short months / short days are not handled correctly generally speaking - pretty much none of the abbreviations in locales from ICU are being recognised because of this.

For example:

>>> ces = parsedatetime.Constants('es_ES')
>>> print ces.locale.shortMonths
[u'ene.', u'feb.', u'mar.', u'abr.', u'may.', u'jun.', u'jul.', u'ago.', u'sept.', u'oct.', u'nov.', u'dic.']
>>> cal = parsedatetime.Calendar(ces)
>>> cal.parseDT('3 ene. 2014')
(datetime.datetime(2017, 7, 26, 20, 14), 2)
>>> cal.parseDT('3 enero 2014')
(datetime.datetime(2014, 1, 3, 17, 14, 2), 1)

This can be traced to the way the matching regex are built: there is a word boundary imposed at the end of each matching group - which will normally match a period, but not the ". " transition , so I believe the final period in all abbreviations should really be removed from the regexes.

BTW, as a side effect you can create some evil crashes:

>>> cal.parseDT('3 ene.2014')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/blaise/.local/lib/python2.7/site-packages/parsedatetime/__init__.py", line 1771, in parseDT
    version=version)
  File "/home/blaise/.local/lib/python2.7/site-packages/parsedatetime/__init__.py", line 1845, in parse
    retS, retTime, matched = parseMeth(s, sourceTime)
  File "/home/blaise/.local/lib/python2.7/site-packages/parsedatetime/__init__.py", line 1483, in _partialParseDateStr
    sourceTime = self._evalDateStr(parseStr, sourceTime)
  File "/home/blaise/.local/lib/python2.7/site-packages/parsedatetime/__init__.py", line 1113, in _evalDateStr
    return self.parseDateText(s, sourceTime)
  File "/home/blaise/.local/lib/python2.7/site-packages/parsedatetime/__init__.py", line 513, in parseDateText
    mth = m.group('mthname')
AttributeError: 'NoneType' object has no attribute 'group'

I came up with the following patch:

diff --git a/parsedatetime/__init__.py b/parsedatetime/__init__.py
index 990b9ca..0e3b41e 100644
--- a/parsedatetime/__init__.py
+++ b/parsedatetime/__init__.py
@@ -2393,6 +2393,9 @@ class Constants(object):
 
         if self.locale is not None:
 
+            def _sanitize_key(k):
+                return re.sub(".$","", k)
+
             def _getLocaleDataAdjusted(localeData):
                 """
                 If localeData is defined as ["mon|mnd", 'tu|tues'...] then this
@@ -2401,9 +2404,9 @@ class Constants(object):
                 adjusted = []
                 for d in localeData:
                     if '|' in d:
-                        adjusted += d.split("|")
+                        adjusted += map(_sanitize_key, d.split("|"))
                     else:
-                        adjusted.append(d)
+                        adjusted.append(_sanitize_key(d))
                 return adjusted
 
             def re_join(g):
@@ -2446,9 +2449,9 @@ class Constants(object):
                 for key in localeData:
                     if '|' in key:
                         for k in key.split('|'):
-                            offsetDict[k] = o
+                            offsetDict[_sanitize_key(k)] = o
                     else:
-                        offsetDict[key] = o
+                        offsetDict[_sanitize_key(key)] = o
                     o += 1
 
             _buildOffsets(self.locale.WeekdayOffsets,

After applying this patch you can happily do the following:

>>> cal.parseDT('3 ene. 2014')
(datetime.datetime(2014, 1, 3, 17, 49, 21), 1)
>>> cal.parseDT('3 ene 2014')
(datetime.datetime(2014, 1, 3, 17, 50, 9), 1)
>>> cal.parseDT('3 ene.2014')
(datetime.datetime(2018, 1, 3, 17, 50, 41), 1)

I believe this issue should be addressed as this would make ICU supported locales work much more reliably.

idpaterson · 2017-07-27T01:30:15Z

This fix looks good to me, it could also be addressed in the icu locale.

TL;DR for anyone seeing this issue: ICU uses punctuated month abbreviations (e.g. aug.) which wreak havoc on date parsing. The problem does not affect locales built in to parsedatetime because the abbreviations are not punctuated (e.g. aug).

bpotard · 2017-07-27T09:05:12Z

Thanks for the quick response!

You are right, I did not have a look at how the icu locale were built, and it makes more sense to sanitize the abbreviated days / months there rather than in __init__.py

Also, I have just noticed that my sanitization function was completely broken anyway, as it was actually removing the last character from everything not just the period, and did not handle keys with "|" separators. Long story short, it was breaking up pretty much everything...

Here is a new attempt:

diff --git a/parsedatetime/pdt_locales/icu.py b/parsedatetime/pdt_locales/icu.py
index 8bee64b..4479f6b 100644
--- a/parsedatetime/pdt_locales/icu.py
+++ b/parsedatetime/pdt_locales/icu.py
@@ -35,6 +35,11 @@ def merge_weekdays(base_wd, icu_wd):
 
 
 def get_icu(locale):
+
+    def _sanitize_key(k):
+        import re
+        return re.sub("\\.(\\||$)", "\\1", k)
+
     from . import base
     result = dict([(key, getattr(base, key))
                    for key in dir(base) if not key.startswith('_')])
@@ -58,16 +63,16 @@ def get_icu(locale):
 
     # grab ICU list of weekdays, skipping first entry which
     # is always blank
-    wd = [w.lower() for w in symbols.getWeekdays()[1:]]
-    swd = [sw.lower() for sw in symbols.getShortWeekdays()[1:]]
+    wd = [_sanitize_key(w.lower()) for w in symbols.getWeekdays()[1:]]
+    swd = [_sanitize_key(sw.lower()) for sw in symbols.getShortWeekdays()[1:]]
 
     # store them in our list with Monday first (ICU puts Sunday first)
     result['Weekdays'] = merge_weekdays(result['Weekdays'],
                                         wd[1:] + wd[0:1])
     result['shortWeekdays'] = merge_weekdays(result['shortWeekdays'],
                                              swd[1:] + swd[0:1])
-    result['Months'] = [m.lower() for m in symbols.getMonths()]
-    result['shortMonths'] = [sm.lower() for sm in symbols.getShortMonths()]
+    result['Months'] = [_sanitize_key(m.lower()) for m in symbols.getMonths()]
+    result['shortMonths'] = [_sanitize_key(sm.lower()) for sm in symbols.getShortMonths()]
     keys = ['full', 'long', 'medium', 'short']
 
     createDateInstance = pyicu.DateFormat.createDateInstance

idpaterson · 2017-07-28T02:44:23Z

Thanks for updating, I think there are a few more things to look at here. I reviewed some of the locales defined in ICU and it seems that most abbreviations are found in day names, month names, eras (e.g. A.D.), and meridian (e.g. P.M.).

A similar problem could exist for meridian, which for the es locale is defined as a. m. and p. m.. I won't be able to review this for a few days, would you please test whether the meridian needs to be fixed as well? At a glance it doesn't look like the meridian from icu makes it into any regular expressions.

bear · 2019-11-19T02:35:51Z

I merged the PR as the meridian does not get bundled into any regular expressions

bpotard mentioned this issue Jul 27, 2017

Fix to sanitize abbreviated months from icu #222

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Broken support for short days / months with periods in them #221

Broken support for short days / months with periods in them #221

bpotard commented Jul 26, 2017

idpaterson commented Jul 27, 2017 •

edited

Loading

bpotard commented Jul 27, 2017 •

edited

Loading

idpaterson commented Jul 28, 2017

bear commented Nov 19, 2019

Broken support for short days / months with periods in them #221

Broken support for short days / months with periods in them #221

Comments

bpotard commented Jul 26, 2017

idpaterson commented Jul 27, 2017 • edited Loading

bpotard commented Jul 27, 2017 • edited Loading

idpaterson commented Jul 28, 2017

bear commented Nov 19, 2019

idpaterson commented Jul 27, 2017 •

edited

Loading

bpotard commented Jul 27, 2017 •

edited

Loading