
Arabic definite article correction may not be current standard. #12

Open
SamGerber-zz opened this issue Jun 27, 2017 · 0 comments


Hello, @tenderlove! Thank you for porting this library!

While using this library, I received feedback that the way it handles the Arabic definite article 'al-' is not quite right.

Currently the rule looks like:

localstring.gsub!(/\bAl(?=\s+\w)/, 'al')  # al Arabic or forename Al.

The corresponding test asserts al Fahd is the accepted standard.

It seems common to hyphenate the definite article (al-Fahd).

"Al-" and its variants (ash-, ad-, ar-, etc.) are always written in lower case (unless beginning a sentence), and a hyphen separates it from the following word.

https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Arabic#Definite_article

Looking closer at the regular expressions that fix "son (daughter) of" and similar prefixes, they seem to use different terminators depending on whether the prefix can also be a forename (Ben, Al, or Van). In those cases the pattern ends with (?=\s+\w) rather than \b. If \b(?=.+\w) were used instead, I think it would fix the Arabic issue.
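To make the difference concrete, here is a rough sketch comparing the two terminators. It is not the library's actual code path: `lowercase_article` and the constant names are made up for illustration, and `PROPOSED_RULE` is only my reading of the suggestion above with the `Al` prefix prepended.

```ruby
# CURRENT_RULE is the pattern quoted earlier in this issue.
CURRENT_RULE  = /\bAl(?=\s+\w)/
# PROPOSED_RULE is an assumption: the suggested \b(?=.+\w) terminator after "Al".
PROPOSED_RULE = /\bAl\b(?=.+\w)/

# Hypothetical helper: lowercase the article wherever the given rule matches.
def lowercase_article(name, rule)
  name.gsub(rule, 'al')
end

puts lowercase_article('Al Fahd', CURRENT_RULE)   # al Fahd
puts lowercase_article('Al-Fahd', CURRENT_RULE)   # Al-Fahd (hyphenated form is not matched)
puts lowercase_article('Al-Fahd', PROPOSED_RULE)  # al-Fahd
puts lowercase_article('Al Fahd', PROPOSED_RULE)  # al Fahd
```

Note that in this sketch the proposed terminator only lowercases an article that is already hyphenated in the input. It does not insert a hyphen into 'Al Fahd', so the existing test expectation would need a separate decision.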

The Hebrew handling also seems to be out of step with the current standard: the test case 'ben Gurion' is more commonly written as Ben-Gurion:
[screenshot: screen shot 2017-06-27 at 00 00 08]
