[WIP] Updating CLDR data #941

Open
wants to merge 23 commits into
base: master

gavishpoddar
Contributor

  • Updating CLDR data to 39.0.0.
  • Fixing CLDR download error.
  • Updating CLDR data URL: https://github.com/unicode-cldr/cldr-dates-full (archived) -> https://github.com/unicode-org/cldr-json.

TODO :

  • Fixing tests

Fixes issue #940

@gavishpoddar gavishpoddar changed the title WIP : Updating CLDR Data & [WIP] : Updating CLDR Data & Jul 3, 2021
@gavishpoddar
Contributor Author

Many tests seem to be wrong. For example, in tests/test_languages.py:805, for Zulu (zu):

The current data translates son 23 umasingana 1996 to sunday 23 january 1996,
but according to Google Translate, sunday 23 january 1996 is isonto 23 Januwari 1996.

Additionally, languages such as Assamese (as) are poorly translated.

This PR fixes those issues, but the tests have not been updated yet.

@noviluni, please advise whether I should update the tests accordingly.

A review would be helpful.

Thanks

Note: This PR breaks 39 tests.

@gavishpoddar gavishpoddar changed the title [WIP] : Updating CLDR Data & [WIP] : Updating CLDR data Jul 3, 2021
@noviluni
Collaborator

noviluni commented Jul 4, 2021

Hi @gavishpoddar,
I created a "guide" to handle this (CLDR updates), but we never started doing it. It would be nice if you read it to see if you missed anything: #826

My initial idea was to update version by version, but it's OK to update directly to the latest version as you did. After that, we will need to check file by file to see whether we are removing things that could cause "breaking changes" (and possibly add them to our own data). Before starting the review, though, I would like to understand why you removed the "version".

It is really important to point to a specific version and not directly to master, so that we can easily tell which version we are pointing to and update easily in the future (master could be "incomplete" or "wrong"). In the past we had no way to know this, so we didn't know which version we were using or how outdated we were. I would therefore like you to reconsider adding back the cldr_version and the repo.git.co(cldr_version) statements. We need to keep this. If it doesn't work because the versions are tags instead of branches, etc., you may need to change the step, but as I mentioned, we need to point to a specific version.
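
As a rough illustration of pinning to a release rather than master, the sketch below builds a `git clone` command for a fixed version. It assumes the release is published under a tag or branch named exactly like the version string (here `39.0.0`), which should be checked against the cldr-json repository. `clone_command` and `fetch_cldr` are hypothetical helpers, not part of dateparser's scripts. `git clone --branch` accepts tag names as well as branch names, so this approach does not depend on which of the two the upstream repository uses.

```python
import subprocess
from typing import List

CLDR_JSON_URL = "https://github.com/unicode-org/cldr-json"
# Assumption: the release is available as a tag or branch with this exact name.
CLDR_VERSION = "39.0.0"


def clone_command(repo_url: str, version: str, dest: str) -> List[str]:
    """Build a shallow clone command pinned to one tag or branch."""
    return ["git", "clone", "--depth", "1", "--branch", version, repo_url, dest]


def fetch_cldr(dest: str, version: str = CLDR_VERSION) -> None:
    """Clone the pinned CLDR release into `dest`; raises if git fails."""
    subprocess.run(clone_command(CLDR_JSON_URL, version, dest), check=True)
```

Keeping the version in a single constant means a future update is a one-line change, and the version in use is always visible in the script.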

thanks! :)
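
The file-by-file check for removed entries mentioned in this comment could be partly automated. The following is a minimal sketch, assuming each locale's data can be loaded as a mapping from an English key to a list of accepted strings. The `removed_entries` helper and the sample data are hypothetical and only illustrate the idea; they do not reflect dateparser's actual data layout.

```python
from typing import Dict, List

LocaleData = Dict[str, List[str]]


def removed_entries(old: LocaleData, new: LocaleData) -> LocaleData:
    """Return, per key, the strings present in `old` but missing from `new`."""
    removed: LocaleData = {}
    for key, old_values in old.items():
        new_values = set(new.get(key, []))
        missing = [value for value in old_values if value not in new_values]
        if missing:
            removed[key] = missing
    return removed


# Illustrative data only, loosely modelled on the Zulu change in this PR.
old_zu = {"january": ["umasingana"], "sunday": ["son"]}
new_zu = {"january": ["januwari"], "sunday": ["son", "isonto"]}

print(removed_entries(old_zu, new_zu))  # {'january': ['umasingana']}
```

Strings reported this way are candidates for being kept in the project's own supplementary data, so that inputs which parsed before the update keep parsing.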

Contributor Author

@gavishpoddar gavishpoddar left a comment

Hi, I am updating the CLDR data, which breaks multiple tests, so I am highlighting a few of the changes I have made along with the reasoning.

Please check and suggest.

@@ -802,7 +802,7 @@ def setUp(self):

# zu
param('zu', "3 mashi 2007 ulwesibili 10:08", "3 march 2007 tuesday 10:08"),
-param('zu', "son 23 umasingana 1996", "sunday 23 january 1996"),
+param('zu', "isonto 23 Januwari 1996", "sunday 23 january 1996"),
Contributor Author

This was incorrectly translated; verified via Google Translate.

@@ -573,7 +573,7 @@ def setUp(self):
param('mn', "12 9-р сар 2019 пүрэв", "12 september 2019 thursday"),

# mr
-param('mr', "16 फेब्रुवारी 1908 गुरु 02:03 मउ", "16 february 1908 thursday 02:03 pm"),
+param('mr', "16 फेब्रुवारी 1908 गुरु 02:03", "16 february 1908 thursday 02:03"),
Contributor Author

CLDR 39 has removed the मउ (pm) -> pm mapping. Either way, both of them are wrong.

@@ -210,7 +210,7 @@ def setUp(self):

# as
param('as', '17 জানুৱাৰী 1885', '17 january 1885'),
-param('as', 'বৃহষ্পতিবাৰ 1 জুলাই 2009', 'thursday 1 july 2009'),
+param('as', 'বৃহস্পতিবাৰ 1 জুলাই 2009', 'thursday 1 july 2009'),
Contributor Author

The previous test was incorrect; I know the language.

@@ -270,7 +270,7 @@ def setUp(self):

# bs-Latn
param('bs-Latn', "23 septembar 1879, petak", "23 september 1879 friday"),
-param('bs-Latn', "subota 1 avg 2009 02:27 popodne", "saturday 1 august 2009 02:27 pm"),
+param('bs-Latn', "subota 1 aug 2009 02:27 popodne", "saturday 1 august 2009 02:27 pm"),
Contributor Author

Can't be verified but new CLDR 39 data updates this.

-param('ce', "6 январь 1987 пӏераскан де", "6 january 1987 friday"),
-param('ce', "оршотан де 3 июль 1890", "monday 3 july 1890"),
+param('ce', "6 январь 1987 пӏераска", "6 january 1987 friday"),
+param('ce', "оршот де 3 июль 1890", "monday 3 july 1890"),
Contributor Author

Chechen Language: Can't be verified but new CLDR 39 data updates this.

-param('kl', "2 martsi 2001 ataasinngorneq", "2 march 2001 monday"),
-param('kl', "pin 1 oktoberi 1901", "wednesday 1 october 1901"),
+param('kl', "2 marsi 2001 ataasinngorneq", "2 march 2001 monday"),
+param('kl', "pin 1 oktobari 1901", "wednesday 1 october 1901"),
Contributor Author

Kalaallisut; Greenlandic: Can't be verified but new CLDR 39 data updates this.


# kln
param('kln', "3 ng'atyaato koang'wan 10:09 kooskoliny", "3 february thursday 10:09 pm"),
param('kln', "kipsuunde nebo aeng' 14 2009 kos", "december 14 2009 wednesday"),

# kok
-param('kok', "1 नोव्हेंबर 2000 आदित्यवार 01:19 मनं", "1 november 2000 sunday 01:19 pm"),
+param('kok', "1 नोव्हेंबर 2000 आयतार 01:19", "1 november 2000 sunday 01:19"),
Contributor Author

Konkani (Indian language): Can't be verified but new CLDR 39 data updates this.

-param('qu', "5 pauqar waray 1878 miércoles", "5 march 1878 wednesday"),
-param('qu', "6 int 2009 domingo", "6 june 2009 sunday"),
+param('qu', "5 marzo 1878 miércoles", "5 march 1878 wednesday"),
+param('qu', "6 jun 2009 domingo", "6 june 2009 sunday"),
Contributor Author

Quechua : Can't be verified but new CLDR 39 data updates this.

-param('so', "sab 5 bisha saddexaad 1765 11:08 gn", "saturday 5 march 1765 11:08 pm"),
-param('so', "16 lit 2008 axd", "16 december 2008 sunday"),
+param('so', "sabti 5 bisha saddexaad 1765 11:08 gd", "saturday 5 march 1765 11:08 pm"),
+param('so', "16 desembar 2008 axd", "16 december 2008 sunday"),
Contributor Author

Somali: Verified via Google Translate.

@@ -741,7 +741,7 @@ def setUp(self):
param('sv', "onsdag 16 mars 08:15 eftermiddag", "wednesday 16 march 08:15 pm"),

# sw
-param('sw', "5 mei 1994 jumapili 10:17 asubuhi", "5 may 1994 sunday 10:17 am"),
+param('sw', "5 mei 1994 jumapili 10:17", "5 may 1994 sunday 10:17"),
Contributor Author

Swahili: asubuhi means "in the morning". Verified via Google Translate.

Contributor Author

@gavishpoddar gavishpoddar left a comment

Fixing tests

@@ -1159,7 +1159,7 @@ def test_translation(self, shortname, datetime_string, expected_translation):
param('dav', "15 juma", "15 week"),
# de
param('de', "nächstes jahr", "in 1 year"),
-param('de', "letzte woche 04:25 nachm", "1 week ago 04:25 pm"),
+param('de', "vor einer Woche 04:25 nachm", "1 week ago 04:25 pm"),
Contributor Author

German: Verified via google translate

@@ -1139,7 +1139,7 @@ def test_translation(self, shortname, datetime_string, expected_translation):
param('cgg', "5 omwaka", "5 year"),
# chr
param('chr', "ᎯᎠ ᎢᏯᏔᏬᏍᏔᏅ", "0 minute ago"),
-param('chr', "ᎾᎿ 8 ᎧᎸᎢ ᏥᎨᏒ", "8 month ago"),
+param('chr', "8 ꭷꮈ ꮵꭸꮢ", "8 month ago"),
Contributor Author

Cherokee: Can't be verified but new CLDR 39 data updates this.

@@ -1197,7 +1197,7 @@ def test_translation(self, shortname, datetime_string, expected_translation):
param('et', "1 a pärast", "in 1 year"),
param('et', "4 tunni eest", "4 hour ago"),
# eu
-param('eu', "aurreko hilabetea", "1 month ago"),
+param('eu', "aurreko hilabetean", "1 month ago"),
Contributor Author

Basque: Verified via google translate.

@@ -1266,7 +1266,7 @@ def test_translation(self, shortname, datetime_string, expected_translation):
param('id', "dalam 43 menit", "in 43 minute"),
param('id', "dlm 23 dtk", "in 23 second"),
# ig
-param('ig', "nnyaafụ", "1 day ago"),
+param('ig', "ụnyaahụ", "1 day ago"),
Contributor Author

Igbo: Partially correct. Verified via google translate. Better than previous.

@@ -1240,7 +1240,7 @@ def test_translation(self, shortname, datetime_string, expected_translation):
param('gsw', "moorn", "in 1 day"),
param('gsw', "geschter", "1 day ago"),
# gu
-param('gu', "2 વર્ષ પહેલા", "2 year ago"),
+param('gu', "2 વર્ષ પહેલાં", "2 year ago"),
Contributor Author

Gujarati: Same output via google translate for પહેલા or પહેલાં

@@ -1420,7 +1420,7 @@ def test_translation(self, shortname, datetime_string, expected_translation):
param('ms', "bulan depan", "in 1 month"),
# mt
param('mt', "ix-xahar li għadda", "1 month ago"),
-param('mt', "2 sena ilu", "2 year ago"),
+param('mt', "2 snin ilu", "2 year ago"),
Contributor Author

Maltese: Same output via google translate for sena or snin

-param('nn', "for 5 minutter siden", "5 minute ago"),
-param('nn', "om 3 uker", "in 3 week"),
+param('nn', "for 5 min sidan", "5 minute ago"),
+param('nn', "om 3 veke", "in 3 week"),
Contributor Author

Norwegian Nynorsk: Can't be verified, but the new CLDR 39 data updates this. Google Translate has no support for Nynorsk.

@@ -608,8 +608,7 @@ def test_splitting_of_not_parsed(self, shortname, string, expected, settings=Non

# Hindi
param('hi',
-'जुलाई 1937 में, मार्को-पोलो ब्रिज हादसे का बहाना लेकर जापान ने चीन पर हमला कर दिया और चीनी साम्राज्य '
-'की राजधानी बीजिंग पर कब्जा कर लिया,'),
+'जुलाई 1937 में, मार्को-पोलो ब्रिज हादसे का बहाना की राजधानी बीजिंग पर कब्जा कर लिया. '),
Contributor Author

Hindi: Test failed because of incorrect language detection.

@gavishpoddar
Contributor Author

At this point, 7 tests are failing in tests/test_freshness_date_parser.py.

I am unable to fix them; please help.

@noviluni

@lopuhin
Member

lopuhin commented Jul 8, 2021

@gavishpoddar the builds for this PR were not enabled (it's a newish GitHub feature), sorry about that. I have just enabled them.

@gavishpoddar gavishpoddar changed the title [WIP] : Updating CLDR data [WIP] Updating CLDR data Jul 20, 2021
@gavishpoddar gavishpoddar mentioned this pull request Aug 9, 2021
21 tasks
@codecov

codecov bot commented Oct 9, 2021

Codecov Report

Merging #941 (4580337) into master (507dc6d) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #941   +/-   ##
=======================================
  Coverage   98.29%   98.29%           
=======================================
  Files         234      234           
  Lines        2694     2700    +6     
=======================================
+ Hits         2648     2654    +6     
  Misses         46       46           
Impacted Files Coverage Δ
dateparser/languages/locale.py 98.71% <100.00%> (+0.02%) ⬆️

