Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broken support for short days / months with periods in them #221

Open
bpotard opened this issue Jul 26, 2017 · 4 comments
Open

Broken support for short days / months with periods in them #221

bpotard opened this issue Jul 26, 2017 · 4 comments

Comments

@bpotard
Copy link

bpotard commented Jul 26, 2017

Hello,

It seems that short months / short days are not handled correctly generally speaking - pretty much none of the abbreviations in locales from ICU are being recognised because of this.

For example:

>>> ces = parsedatetime.Constants('es_ES')
>>> print ces.locale.shortMonths
[u'ene.', u'feb.', u'mar.', u'abr.', u'may.', u'jun.', u'jul.', u'ago.', u'sept.', u'oct.', u'nov.', u'dic.']
>>> cal = parsedatetime.Calendar(ces)
>>> cal.parseDT('3 ene. 2014')
(datetime.datetime(2017, 7, 26, 20, 14), 2)
>>> cal.parseDT('3 enero 2014')
(datetime.datetime(2014, 1, 3, 17, 14, 2), 1)

This can be traced to the way the matching regex are built: there is a word boundary imposed at the end of each matching group - which will normally match a period, but not the ". " transition , so I believe the final period in all abbreviations should really be removed from the regexes.

BTW, as a side effect you can create some evil crashes:

>>> cal.parseDT('3 ene.2014')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/blaise/.local/lib/python2.7/site-packages/parsedatetime/__init__.py", line 1771, in parseDT
    version=version)
  File "/home/blaise/.local/lib/python2.7/site-packages/parsedatetime/__init__.py", line 1845, in parse
    retS, retTime, matched = parseMeth(s, sourceTime)
  File "/home/blaise/.local/lib/python2.7/site-packages/parsedatetime/__init__.py", line 1483, in _partialParseDateStr
    sourceTime = self._evalDateStr(parseStr, sourceTime)
  File "/home/blaise/.local/lib/python2.7/site-packages/parsedatetime/__init__.py", line 1113, in _evalDateStr
    return self.parseDateText(s, sourceTime)
  File "/home/blaise/.local/lib/python2.7/site-packages/parsedatetime/__init__.py", line 513, in parseDateText
    mth = m.group('mthname')
AttributeError: 'NoneType' object has no attribute 'group'

I came up with the following patch:

diff --git a/parsedatetime/__init__.py b/parsedatetime/__init__.py
index 990b9ca..0e3b41e 100644
--- a/parsedatetime/__init__.py
+++ b/parsedatetime/__init__.py
@@ -2393,6 +2393,9 @@ class Constants(object):
 
         if self.locale is not None:
 
+            def _sanitize_key(k):
+                return re.sub(".$","", k)
+
             def _getLocaleDataAdjusted(localeData):
                 """
                 If localeData is defined as ["mon|mnd", 'tu|tues'...] then this
@@ -2401,9 +2404,9 @@ class Constants(object):
                 adjusted = []
                 for d in localeData:
                     if '|' in d:
-                        adjusted += d.split("|")
+                        adjusted += map(_sanitize_key, d.split("|"))
                     else:
-                        adjusted.append(d)
+                        adjusted.append(_sanitize_key(d))
                 return adjusted
 
             def re_join(g):
@@ -2446,9 +2449,9 @@ class Constants(object):
                 for key in localeData:
                     if '|' in key:
                         for k in key.split('|'):
-                            offsetDict[k] = o
+                            offsetDict[_sanitize_key(k)] = o
                     else:
-                        offsetDict[key] = o
+                        offsetDict[_sanitize_key(key)] = o
                     o += 1
 
             _buildOffsets(self.locale.WeekdayOffsets,

After applying this patch you can happily do the following:

>>> cal.parseDT('3 ene. 2014')
(datetime.datetime(2014, 1, 3, 17, 49, 21), 1)
>>> cal.parseDT('3 ene 2014')
(datetime.datetime(2014, 1, 3, 17, 50, 9), 1)
>>> cal.parseDT('3 ene.2014')
(datetime.datetime(2018, 1, 3, 17, 50, 41), 1)

I believe this issue should be addressed as this would make ICU supported locales work much more reliably.

@idpaterson
Copy link
Collaborator

idpaterson commented Jul 27, 2017

This fix looks good to me, it could also be addressed in the icu locale.

TL;DR for anyone seeing this issue: ICU uses punctuated month abbreviations (e.g. aug.) which wreak havoc on date parsing. The problem does not affect locales built in to parsedatetime because the abbreviations are not punctuated (e.g. aug).

@bpotard
Copy link
Author

bpotard commented Jul 27, 2017

Thanks for the quick response!

You are right, I did not have a look at how the icu locale were built, and it makes more sense to sanitize the abbreviated days / months there rather than in __init__.py

Also, I have just noticed that my sanitization function was completely broken anyway, as it was actually removing the last character from everything not just the period, and did not handle keys with "|" separators. Long story short, it was breaking up pretty much everything...

Here is a new attempt:

diff --git a/parsedatetime/pdt_locales/icu.py b/parsedatetime/pdt_locales/icu.py
index 8bee64b..4479f6b 100644
--- a/parsedatetime/pdt_locales/icu.py
+++ b/parsedatetime/pdt_locales/icu.py
@@ -35,6 +35,11 @@ def merge_weekdays(base_wd, icu_wd):
 
 
 def get_icu(locale):
+
+    def _sanitize_key(k):
+        import re
+        return re.sub("\\.(\\||$)", "\\1", k)
+
     from . import base
     result = dict([(key, getattr(base, key))
                    for key in dir(base) if not key.startswith('_')])
@@ -58,16 +63,16 @@ def get_icu(locale):
 
     # grab ICU list of weekdays, skipping first entry which
     # is always blank
-    wd = [w.lower() for w in symbols.getWeekdays()[1:]]
-    swd = [sw.lower() for sw in symbols.getShortWeekdays()[1:]]
+    wd = [_sanitize_key(w.lower()) for w in symbols.getWeekdays()[1:]]
+    swd = [_sanitize_key(sw.lower()) for sw in symbols.getShortWeekdays()[1:]]
 
     # store them in our list with Monday first (ICU puts Sunday first)
     result['Weekdays'] = merge_weekdays(result['Weekdays'],
                                         wd[1:] + wd[0:1])
     result['shortWeekdays'] = merge_weekdays(result['shortWeekdays'],
                                              swd[1:] + swd[0:1])
-    result['Months'] = [m.lower() for m in symbols.getMonths()]
-    result['shortMonths'] = [sm.lower() for sm in symbols.getShortMonths()]
+    result['Months'] = [_sanitize_key(m.lower()) for m in symbols.getMonths()]
+    result['shortMonths'] = [_sanitize_key(sm.lower()) for sm in symbols.getShortMonths()]
     keys = ['full', 'long', 'medium', 'short']
 
     createDateInstance = pyicu.DateFormat.createDateInstance

@idpaterson
Copy link
Collaborator

Thanks for updating, I think there are a few more things to look at here. I reviewed some of the locales defined in ICU and it seems that most abbreviations are found in day names, month names, eras (e.g. A.D.), and meridian (e.g. P.M.).

A similar problem could exist for meridian, which for the es locale is defined as a. m. and p. m.. I won't be able to review this for a few days, would you please test whether the meridian needs to be fixed as well? At a glance it doesn't look like the meridian from icu makes it into any regular expressions.

@bear
Copy link
Owner

bear commented Nov 19, 2019

I merged the PR as the meridian does not get bundled into any regular expressions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants