Broken support for short days / months with periods in them #221
Comments
This fix looks good to me; it could also be addressed in the icu locale. TL;DR for anyone seeing this issue: ICU uses punctuated month abbreviations (e.g. aug.), which wreak havoc on date parsing. The problem does not affect the locales built into parsedatetime because their abbreviations are not punctuated (e.g. aug).
Thanks for the quick response! You are right, I did not look at how the icu locale was built, and it makes more sense to sanitize the abbreviated days / months there rather than in
Also, I have just noticed that my sanitization function was completely broken anyway: it was removing the last character from everything, not just the period, and it did not handle keys with "|" separators. Long story short, it was breaking pretty much everything...
Here is a new attempt:
diff --git a/parsedatetime/pdt_locales/icu.py b/parsedatetime/pdt_locales/icu.py
index 8bee64b..4479f6b 100644
--- a/parsedatetime/pdt_locales/icu.py
+++ b/parsedatetime/pdt_locales/icu.py
@@ -35,6 +35,11 @@ def merge_weekdays(base_wd, icu_wd):
 def get_icu(locale):
+
+    def _sanitize_key(k):
+        import re
+        return re.sub("\\.(\\||$)", "\\1", k)
+
     from . import base
     result = dict([(key, getattr(base, key))
                    for key in dir(base) if not key.startswith('_')])
@@ -58,16 +63,16 @@ def get_icu(locale):
     # grab ICU list of weekdays, skipping first entry which
     # is always blank
-    wd = [w.lower() for w in symbols.getWeekdays()[1:]]
-    swd = [sw.lower() for sw in symbols.getShortWeekdays()[1:]]
+    wd = [_sanitize_key(w.lower()) for w in symbols.getWeekdays()[1:]]
+    swd = [_sanitize_key(sw.lower()) for sw in symbols.getShortWeekdays()[1:]]
     # store them in our list with Monday first (ICU puts Sunday first)
     result['Weekdays'] = merge_weekdays(result['Weekdays'],
                                         wd[1:] + wd[0:1])
     result['shortWeekdays'] = merge_weekdays(result['shortWeekdays'],
                                              swd[1:] + swd[0:1])
-    result['Months'] = [m.lower() for m in symbols.getMonths()]
-    result['shortMonths'] = [sm.lower() for sm in symbols.getShortMonths()]
+    result['Months'] = [_sanitize_key(m.lower()) for m in symbols.getMonths()]
+    result['shortMonths'] = [_sanitize_key(sm.lower()) for sm in symbols.getShortMonths()]
     keys = ['full', 'long', 'medium', 'short']
     createDateInstance = pyicu.DateFormat.createDateInstance
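For anyone following along, here is a minimal standalone sketch of what the `_sanitize_key` regex in the diff does, next to the kind of strip-the-last-character mistake described in the comment. The sample keys and the `drop_last_char` helper are made up for illustration and are not taken from parsedatetime or ICU.

```python
import re


def sanitize_key(key):
    # Same substitution as _sanitize_key in the diff: drop a period only
    # when it sits directly before a "|" separator or at the end of the key.
    return re.sub(r"\.(\||$)", r"\1", key)


def drop_last_char(key):
    # Hypothetical reconstruction of the earlier mistake: it removes the
    # final character unconditionally, period or not.
    return key[:-1]


# Illustrative keys only (assumed, not real ICU output).
print(sanitize_key("aug."))      # trailing period removed
print(sanitize_key("august"))    # unpunctuated name left alone
print(sanitize_key("mo.|mon."))  # period removed in each "|" alternative
print(drop_last_char("august"))  # damages names that had no period
```

Periods in the middle of a key are left untouched, since the pattern only matches a period followed by "|" or the end of the string.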
Thanks for updating. I think there are a few more things to look at here. I reviewed some of the locales defined in ICU, and it seems that most abbreviations are found in day names, month names, eras (e.g. A.D.), and meridians (e.g. P.M.). A similar problem could exist for the meridian, which for the es locale is defined as
I merged the PR, as the meridian does not get bundled into any regular expressions.
Hello,
It seems that short months / short days are generally not handled correctly: pretty much none of the abbreviations in the locales coming from ICU are recognised.
For example:
This can be traced to the way the matching regexes are built: a word boundary (`\b`) is imposed at the end of each matching group. That boundary matches the transition between a letter and a period, but not the ". " transition, because a period and a space are both non-word characters. I therefore believe the final period in all abbreviations should be removed from the regexes.
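The boundary behaviour can be checked with plain `re`. This is only a sketch: the pattern shape, the `find_month` helper and the sample month list are assumptions for illustration, not the regexes parsedatetime actually generates.

```python
import re

# Hypothetical short-month list in the punctuated ICU style.
short_months = ["jan.", "feb.", "aug."]


def build_pattern(names):
    # Assumed pattern shape: alternatives wrapped in word boundaries.
    alternatives = "|".join(re.escape(name) for name in names)
    return re.compile(r"\b(?P<mth>%s)\b" % alternatives)


punctuated = build_pattern(short_months)
stripped = build_pattern([name.rstrip(".") for name in short_months])


def find_month(pattern, text):
    # Return the matched abbreviation, or None when nothing matches.
    match = pattern.search(text)
    return match.group("mth") if match else None


# "." followed by " " is not a word boundary, so the punctuated form fails.
print(find_month(punctuated, "aug. 5"))
# With the period stripped, \b sits between "g" and "." and the match works.
print(find_month(stripped, "aug. 5"))
```

The punctuated pattern only matches when the period is immediately followed by a word character (as in "aug.5"), which is rarely how dates are written.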
BTW, as a side effect you can create some evil crashes:
I came up with the following patch:
After applying this patch you can happily do the following:
I believe this issue should be addressed as this would make ICU supported locales work much more reliably.