tschmidtb51/language-tag-abnf-regex

language-tag-abnf-regex

Test repository for the ABNF translation of language tag codes into a regular expression for CSAF.

Base information from:

[BCP47] Phillips, A. and M. Davis, "Matching of Language Tags", BCP 47, RFC 4647, September 2006. 
        Phillips, A., Ed., and M. Davis, Ed., "Tags for Identifying Languages", BCP 47, RFC 5646, September 2009.
        https://www.rfc-editor.org/info/bcp47

Direct link: https://www.rfc-editor.org/rfc/bcp/bcp47.txt

Status

Experimental.

Transformation

Here is the regex suggested initially for CSAF:

^(([A-Za-z]{2,3}(-[A-Za-z]{3}(-[A-Za-z]{3}){0,2})?|[A-Za-z]{4,8})(-[A-Za-z]{4})?(-([A-Za-z]{2}|[0-9]{3}))?(-([A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*(-[A-WY-Za-wy-z0-9](-[A-Za-z0-9]{2,8})+)*(-x(-[A-Za-z0-9]{1,8})+)?|x(-[A-Za-z0-9]{1,8})+|i-default|i-mingo)$

It is assembled from the following building blocks:

Language-Tag = langtag | privateuse | grandfathered

langtag = language(-script)?(-region)?(-variant)*(-extension)*(-privateuse)?

language = ([A-Za-z]{2,3}(-[A-Za-z]{3}(-[A-Za-z]{3}){0,2})?|[A-Za-z]{4,8})
script = [A-Za-z]{4}
region = ([A-Za-z]{2}|[0-9]{3})
variant = ([A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})
extension = [A-WY-Za-wy-z0-9](-[A-Za-z0-9]{2,8})+

privateuse = x(-[A-Za-z0-9]{1,8})+
grandfathered = i-default|i-mingo

Source: @tschmidtb51 in oasis-tcs/csaf#71
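The assembly of the building blocks above can be sketched in Python. This is a minimal illustration, not code from the repository: the variable names simply mirror the grammar names, and the sample tags are chosen here for demonstration.

```python
import re

# Building blocks, copied from the derivation above.
language = r"([A-Za-z]{2,3}(-[A-Za-z]{3}(-[A-Za-z]{3}){0,2})?|[A-Za-z]{4,8})"
script = r"[A-Za-z]{4}"
region = r"([A-Za-z]{2}|[0-9]{3})"
variant = r"([A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})"
extension = r"[A-WY-Za-wy-z0-9](-[A-Za-z0-9]{2,8})+"
privateuse = r"x(-[A-Za-z0-9]{1,8})+"
grandfathered = r"i-default|i-mingo"

# langtag = language(-script)?(-region)?(-variant)*(-extension)*(-privateuse)?
langtag = (
    f"{language}(-{script})?(-{region})?(-{variant})*"
    f"(-{extension})*(-{privateuse})?"
)

# Language-Tag = langtag | privateuse | grandfathered, anchored at both ends.
language_tag = f"^({langtag}|{privateuse}|{grandfathered})$"
pattern = re.compile(language_tag)

# Sample tags (illustrative only, not taken from the test corpus).
for tag in ["en", "de-DE", "zh-Hant-TW", "x-private", "i-default",
            "X-private", "i-klingon"]:
    print(tag, bool(pattern.fullmatch(tag)))
```

The resulting `language_tag` string is character-for-character the initial regex shown above. The last two sample tags are rejected: this initial pattern accepts only a lowercase `x` for private use and only two grandfathered tags.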

A regex that matches all present test cases derived from the ABNF grammar of BCP 47 is:

^(([A-Za-z]{2,3}(-[A-Za-z]{3}(-[A-Za-z]{3}){0,2})?|[A-Za-z]{4,8})(-[A-Za-z]{4})?(-([A-Za-z]{2}|[0-9]{3}))?(-([A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*(-[A-WY-Za-wy-z0-9](-[A-Za-z0-9]{2,8})+)*(-[xX](-[A-Za-z0-9]{1,8})+)?|[xX](-[A-Za-z0-9]{1,8})+|[eE][nN]-[gG][bB]-[oO][eE][dD]|[iI]-[aA][mM][iI]|[iI]-[bB][nN][nN]|[iI]-[dD][eE][fF][aA][uU][lL][tT]|[iI]-[eE][nN][oO][cC][hH][iI][aA][nN]|[iI]-[hH][aA][kK]|[iI]-[kK][lL][iI][nN][gG][oO][nN]|[iI]-[lL][uU][xX]|[iI]-[mM][iI][nN][gG][oO]|[iI]-[nN][aA][vV][aA][jJ][oO]|[iI]-[pP][wW][nN]|[iI]-[tT][aA][oO]|[iI]-[tT][aA][yY]|[iI]-[tT][sS][uU]|[sS][gG][nN]-[bB][eE]-[fF][rR]|[sS][gG][nN]-[bB][eE]-[nN][lL]|[sS][gG][nN]-[cC][hH]-[dD][eE])$
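The long tail of this regex is the list of irregular grandfathered tags with both letter cases spelled out for every character. As a rough sketch of how that tail can be generated rather than typed by hand (the helper `spell_out_case` and the constant names are hypothetical, not part of the repository):

```python
import re

# Irregular grandfathered tags, in the order they appear in the regex above.
IRREGULAR = [
    "en-GB-oed", "i-ami", "i-bnn", "i-default", "i-enochian", "i-hak",
    "i-klingon", "i-lux", "i-mingo", "i-navajo", "i-pwn", "i-tao",
    "i-tay", "i-tsu", "sgn-BE-FR", "sgn-BE-NL", "sgn-CH-DE",
]


def spell_out_case(tag: str) -> str:
    """Turn 'i-ami' into '[iI]-[aA][mM][iI]'; hyphens stay literal."""
    return "".join(
        f"[{ch.lower()}{ch.upper()}]" if ch.isalpha() else ch for ch in tag
    )


grandfathered = "|".join(spell_out_case(tag) for tag in IRREGULAR)

# langtag and privateuse as in the regex above (note [xX] instead of x).
LANGTAG = (
    r"([A-Za-z]{2,3}(-[A-Za-z]{3}(-[A-Za-z]{3}){0,2})?|[A-Za-z]{4,8})"
    r"(-[A-Za-z]{4})?"
    r"(-([A-Za-z]{2}|[0-9]{3}))?"
    r"(-([A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*"
    r"(-[A-WY-Za-wy-z0-9](-[A-Za-z0-9]{2,8})+)*"
    r"(-[xX](-[A-Za-z0-9]{1,8})+)?"
)
PRIVATEUSE = r"[xX](-[A-Za-z0-9]{1,8})+"

pattern = re.compile(f"^({LANGTAG}|{PRIVATEUSE}|{grandfathered})$")

# Illustrative checks, not taken from the test corpus.
print(bool(pattern.fullmatch("I-KLINGON")))  # mixed case is accepted
print(bool(pattern.fullmatch("X-private")))  # uppercase X is accepted
print(bool(pattern.fullmatch("en_US")))      # underscore is not a separator
```

Spelling out both cases keeps the pattern usable in contexts where no case-insensitivity flag can be passed alongside the expression, which is presumably the motivation here. In Python alone, compiling with `re.IGNORECASE | re.ASCII` would be a shorter alternative.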

Report

The current regular expression candidate passes all tests from the corpus (1014 collected items).

Details

collected 1014 items

tests/test_regex.py ................................................................................................................................................................................................................................................... [ 23%]
....................................................................................................................................................................................................................................................................... [ 49%]
....................................................................................................................................................................................................................................................................... [ 75%]
.....................................................................................................................................................................................................................................................                   [100%]

============================================================================================================================ 1014 passed in 2.17s =============================================================================================================================
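The contents of `tests/test_regex.py` are not shown here. As a hedged sketch of what a corpus check of this kind boils down to, the following uses a hypothetical helper `failing_tags`, a tiny made-up corpus, and only the private-use fragment of the regex as a stand-in for the full pattern:

```python
import re
from typing import Iterable, List


def failing_tags(pattern: "re.Pattern[str]", tags: Iterable[str]) -> List[str]:
    """Return every tag the pattern does not match in full."""
    return [tag for tag in tags if pattern.fullmatch(tag) is None]


# Stand-in pattern: only the private-use alternative of the full regex.
privateuse = re.compile(r"^[xX](-[A-Za-z0-9]{1,8})+$")

# Made-up sample data, not the repository's corpus.
valid = ["x-private", "X-a-b", "x-12345678"]
invalid = ["x-", "x-toolongsubtag9", "y-private"]

print(failing_tags(privateuse, valid))    # tags wrongly rejected
print(failing_tags(privateuse, invalid))  # tags correctly rejected
```

A passing run corresponds to an empty first list, and the second list should contain every deliberately invalid tag.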

Other Reports

The first 100 test cases were also checked with a different, external validator by Christoph Schneegans:

https://schneegans.de/lv/?tags=X-Z...

A report of these first 100 test cases in JSON format, as generated by the schneegans.de validator, is temporarily stored here.

Note: The default branch is default.
