Test repository for translating the BCP 47 ABNF grammar for language tags into a regular expression for CSAF.
Base information from:
[BCP47] Phillips, A. and M. Davis, "Matching of Language Tags", BCP 47, RFC 4647, September 2006.
Phillips, A., Ed., and M. Davis, Ed., "Tags for Identifying Languages", BCP 47, RFC 5646, September 2009.
https://www.rfc-editor.org/info/bcp47
Direct link: https://www.rfc-editor.org/rfc/bcp/bcp47.txt
Experimental.
Here is the regex suggested initially for CSAF:
^(([A-Za-z]{2,3}(-[A-Za-z]{3}(-[A-Za-z]{3}){0,2})?|[A-Za-z]{4,8})(-[A-Za-z]{4})?(-([A-Za-z]{2}|[0-9]{3}))?(-([A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*(-[A-WY-Za-wy-z0-9](-[A-Za-z0-9]{2,8})+)*(-x(-[A-Za-z0-9]{1,8})+)?|x(-[A-Za-z0-9]{1,8})+|i-default|i-mingo)$
It was assembled from the following parts:
Language-Tag = langtag | privateuse | grandfathered
langtag = language(-script)?(-region)?(-variant)*(-extension)*(-privateuse)?
language = ([A-Za-z]{2,3}(-[A-Za-z]{3}(-[A-Za-z]{3}){0,2})?|[A-Za-z]{4,8})
script = [A-Za-z]{4}
region = ([A-Za-z]{2}|[0-9]{3})
variant = ([A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})
extension = [A-WY-Za-wy-z0-9](-[A-Za-z0-9]{2,8})+
privateuse = x(-[A-Za-z0-9]{1,8})+
grandfathered = i-default|i-mingo
Source: @tschmidtb51 in oasis-tcs/csaf#71
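The assembly of the initial regex from its parts can be replayed in a few lines of Python. This is only a sketch: the variable names (`language`, `langtag`, `initial_regex`, and so on) are chosen here to mirror the rule names above and are not taken from the repository's code.

```python
import re

# Building blocks, copied verbatim from the derivation above.
language = r"([A-Za-z]{2,3}(-[A-Za-z]{3}(-[A-Za-z]{3}){0,2})?|[A-Za-z]{4,8})"
script = r"[A-Za-z]{4}"
region = r"([A-Za-z]{2}|[0-9]{3})"
variant = r"([A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3})"
extension = r"[A-WY-Za-wy-z0-9](-[A-Za-z0-9]{2,8})+"
privateuse = r"x(-[A-Za-z0-9]{1,8})+"
grandfathered = r"i-default|i-mingo"

# langtag = language(-script)?(-region)?(-variant)*(-extension)*(-privateuse)?
langtag = (
    f"{language}(-{script})?(-{region})?(-{variant})*"
    f"(-{extension})*(-{privateuse})?"
)

# Language-Tag = langtag | privateuse | grandfathered, anchored at both ends.
initial_regex = f"^({langtag}|{privateuse}|{grandfathered})$"

pattern = re.compile(initial_regex)
for tag in ("de", "en-US", "x-private", "i-klingon"):
    print(tag, bool(pattern.match(tag)))
```

With only `i-default` and `i-mingo` listed as grandfathered, a tag such as `i-klingon` is not matched by this initial version, and the lowercase-only `x` rejects `X-private`.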
A regex that matches all present test cases derived from the ABNF grammar of BCP 47 is:
^(([A-Za-z]{2,3}(-[A-Za-z]{3}(-[A-Za-z]{3}){0,2})?|[A-Za-z]{4,8})(-[A-Za-z]{4})?(-([A-Za-z]{2}|[0-9]{3}))?(-([A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*(-[A-WY-Za-wy-z0-9](-[A-Za-z0-9]{2,8})+)*(-[xX](-[A-Za-z0-9]{1,8})+)?|[xX](-[A-Za-z0-9]{1,8})+|[eE][nN]-[gG][bB]-[oO][eE][dD]|[iI]-[aA][mM][iI]|[iI]-[bB][nN][nN]|[iI]-[dD][eE][fF][aA][uU][lL][tT]|[iI]-[eE][nN][oO][cC][hH][iI][aA][nN]|[iI]-[hH][aA][kK]|[iI]-[kK][lL][iI][nN][gG][oO][nN]|[iI]-[lL][uU][xX]|[iI]-[mM][iI][nN][gG][oO]|[iI]-[nN][aA][vV][aA][jJ][oO]|[iI]-[pP][wW][nN]|[iI]-[tT][aA][oO]|[iI]-[tT][aA][yY]|[iI]-[tT][sS][uU]|[sS][gG][nN]-[bB][eE]-[fF][rR]|[sS][gG][nN]-[bB][eE]-[nN][lL]|[sS][gG][nN]-[cC][hH]-[dD][eE])$
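The long tail of this regex is the list of irregular grandfathered tags from RFC 5646, with every letter spelled out as a two-character class so that matching is case-insensitive without a regex flag. The sketch below rebuilds that tail from the plain tag list and tries a few tags. The names `LANGTAG`, `PRIVATEUSE`, `IRREGULAR`, `spell_out_case` and `BCP47` are hypothetical helpers for this illustration, not part of the repository.

```python
import re

# langtag and privateuse parts as they appear in the regex above.
LANGTAG = (
    r"([A-Za-z]{2,3}(-[A-Za-z]{3}(-[A-Za-z]{3}){0,2})?|[A-Za-z]{4,8})"
    r"(-[A-Za-z]{4})?"
    r"(-([A-Za-z]{2}|[0-9]{3}))?"
    r"(-([A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*"
    r"(-[A-WY-Za-wy-z0-9](-[A-Za-z0-9]{2,8})+)*"
    r"(-[xX](-[A-Za-z0-9]{1,8})+)?"
)
PRIVATEUSE = r"[xX](-[A-Za-z0-9]{1,8})+"

# Irregular grandfathered tags, in the order used in the regex above.
IRREGULAR = [
    "en-GB-oed", "i-ami", "i-bnn", "i-default", "i-enochian", "i-hak",
    "i-klingon", "i-lux", "i-mingo", "i-navajo", "i-pwn", "i-tao",
    "i-tay", "i-tsu", "sgn-BE-FR", "sgn-BE-NL", "sgn-CH-DE",
]


def spell_out_case(tag: str) -> str:
    """Turn 'i-ami' into '[iI]-[aA][mM][iI]'; hyphens are kept as-is."""
    return "".join(
        f"[{c.lower()}{c.upper()}]" if c.isalpha() else c for c in tag
    )


irregular = "|".join(spell_out_case(tag) for tag in IRREGULAR)
BCP47 = re.compile(f"^({LANGTAG}|{PRIVATEUSE}|{irregular})$")

for tag in ("de-DE", "EN-gb-OED", "i-klingon", "zh-min-nan", "de-419-DE", "en-"):
    print(tag, bool(BCP47.match(tag)))
```

The regular grandfathered tags (for example `zh-min-nan`) need no entry of their own because they already fit the `langtag` syntax. Note that the pattern checks syntax only; it does not check subtags against the IANA registry.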
The current regular expression candidate passes all tests from the corpus (about 1000 cases; the pytest run below reports 1014 items):
collected 1014 items
tests/test_regex.py ................................................................................................................................................................................................................................................... [ 23%]
....................................................................................................................................................................................................................................................................... [ 49%]
....................................................................................................................................................................................................................................................................... [ 75%]
..................................................................................................................................................................................................................................................... [100%]
============================================================================================================================ 1014 passed in 2.17s =============================================================================================================================
The first 100 test cases can also be checked with a different, external validator by Christoph Schneegans:
https://schneegans.de/lv/?tags=X-Z...
A report on these first 100 test cases in JSON format, as generated by the schneegans.de validator, is temporarily stored here.
Note: The default branch is default.