request for a new field, to specify a decimal separator #394
Comments
Thanks for this report, @mobb. While I agree we don't explicitly name a field for this LOCALE information, it is standard practice (e.g., in database systems, operating systems, and MIME types) to include LOCALE info along with the character encoding for text files. For example, here's a table IBM maintains with these values: https://www.ibm.com/docs/en/aix/7.2?topic=globalization-supported-languages-locales The standard syntax for it is LOCALE.ENCODING, where the values for LOCALE and ENCODING come from the standard vocabularies maintained in the Unicode Common Locale Data Repository and the ISO encoding values, with the specific list of supported locale values maintained on GitHub: https://github.com/unicode-org/cldr/tree/main/common/main. For example, to indicate British, Canadian, and US locales for UTF-8 encoded files, the character encoding would be set to ...
<physical>
...
<characterEncoding>EN_GB.UTF-8</characterEncoding>
</physical>
... We frequently omit |
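As a rough illustration of the LOCALE.ENCODING convention described above, the following sketch reads a characterEncoding value and splits it into its two parts. The helper name split_locale_encoding is hypothetical and is not part of EML or any EML tooling; the XML fragment is a trimmed copy of the example above.

```python
import xml.etree.ElementTree as ET


def split_locale_encoding(value):
    """Split 'LOCALE.ENCODING' into (locale, encoding).

    Hypothetical helper: if no locale prefix is present (e.g. 'UTF-8'),
    the locale part is returned as None.
    """
    value = value.strip()
    if "." in value:
        locale_part, encoding = value.split(".", 1)
        return locale_part, encoding
    return None, value


# Trimmed version of the <physical> fragment shown in the comment above.
physical = ET.fromstring(
    "<physical><characterEncoding>EN_GB.UTF-8</characterEncoding></physical>"
)
locale_part, encoding = split_locale_encoding(physical.findtext("characterEncoding"))
print(locale_part, encoding)
```

A consumer could then pass the encoding part to its file reader and use the locale part to decide how numbers and dates should be interpreted.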
LTER IMs and EDI are currently updating EML best practices. If including |
I think that would be a great addition to the best practices to always use Within the US, I think assuming |
@mbjones I looked at a sample list of encodings. I didn't see EN_US on there. ASCII and UTF-8 are on there, and those two are also listed as examples in the EML spec. Would we satisfactorily alleviate headaches if we recommend UTF-8 in the EML best practice document we are authoring? |
Here's the best practice text I'm proposing. The physical tree (/eml:eml/dataset/[entity]/physical) further describes the physical format of the data. Within physical, we recommend populating the characterEncoding element if you can determine the encoding. For most U.S. data, an encoding of UTF-8 is typically correct, with ASCII being another typical encoding. Whatever you choose, if you do provide an encoding, please be sure it is not an incorrect one, e.g., do not choose ASCII if your data include extended Latin characters. |
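One way to avoid recording an incorrect encoding is to test candidate encodings against the raw bytes before filling in characterEncoding. This is a minimal sketch, not a full detector: the function name plausible_encodings and the candidate list are assumptions made for illustration, and a successful decode only shows that an encoding is possible, not that it is the one intended.

```python
def plausible_encodings(data, candidates=("ascii", "utf-8")):
    """Return the candidate encodings that decode `data` without error."""
    accepted = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes as given
        accepted.append(name)
    return accepted


# Plain ASCII text is valid under both candidates.
print(plausible_encodings(b"temperature"))
# Extended Latin characters rule out ASCII.
print(plausible_encodings("Zürich".encode("utf-8")))
```

If the list comes back without ascii, that value should not be entered as the encoding.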
Hey @twhiteaker -- yeah, that list appears to be a list of character encodings, and not locales. Both are needed to properly interpret a file. Most computing systems assume the LOCALE of the local computer applies unless otherwise specified. On Mac and Linux machines, you can often use the locale command:
❯ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
Note that the Nevertheless, I think people interpreting the EML |
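The same locale settings can be queried from Python's standard locale module, which reports the decimal point a given locale implies. The sketch below is written under assumptions: the helper name decimal_point_for is hypothetical, and which locale names are available depends on what is installed on the machine, so the function returns None when a name is rejected.

```python
import locale


def decimal_point_for(locale_name):
    """Return the LC_NUMERIC decimal point for `locale_name`, or None if unavailable."""
    saved = locale.setlocale(locale.LC_NUMERIC)  # remember the current setting
    try:
        locale.setlocale(locale.LC_NUMERIC, locale_name)
        return locale.localeconv()["decimal_point"]
    except locale.Error:
        return None  # the system does not provide this locale
    finally:
        locale.setlocale(locale.LC_NUMERIC, saved)  # restore the original setting


# The "C" locale is always available and uses a period.
print(decimal_point_for("C"))
# Result depends on whether this locale is installed on the machine.
print(decimal_point_for("de_DE.UTF-8"))
```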
We have encountered data files from Europe where a comma is used as the decimal separator (rather than a period which is common in the US). We have not found a place in the EML schema to record this, so this issue records that request. Commas are common in other parts of the world. https://i.redd.it/omgfapht3qn51.png
A comma decimal separator is not always correctly interpreted automatically by packages (e.g., pandas), although most have a mechanism for specifying this in the import statement (e.g.,
pd.read_csv(file_name, sep=';', decimal=',')
). EML metadata can be used to aid importing data tables, and so could populate that statement. Most likely, an optional field named decimalSeparator
would suffice. We agree that it would be almost impossible to interpret a table that used commas as both the field separator and the decimal separator without differentiating them somehow. Therefore, it's likely that a best practice would be to not construct a table this way. We have not explored the effect of using the literalCharacter field, for example ‘
2021-03-28; 20\,27
’.
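To make the request concrete, here is a small standard-library sketch of how an importer might use a decimal separator taken from metadata. It assumes a two-column, semicolon-delimited table; the function name read_decimal_table, its parameters, and the sample rows are all made up for illustration. The decimal_sep argument stands in for the proposed decimalSeparator field, which does not exist in EML today.

```python
import csv
import io


def read_decimal_table(text, field_sep=";", decimal_sep=","):
    """Parse two-column rows (date, value), converting the value to float.

    `decimal_sep` plays the role of the proposed decimalSeparator field.
    """
    rows = []
    for record in csv.reader(io.StringIO(text), delimiter=field_sep):
        date, raw_value = (cell.strip() for cell in record)
        # Normalise the decimal separator before converting to float.
        rows.append((date, float(raw_value.replace(decimal_sep, "."))))
    return rows


# Made-up sample data using ';' between fields and ',' as the decimal separator.
sample = "2021-03-28; 20,27\n2021-03-29; 19,5\n"
rows = read_decimal_table(sample)
print(rows)
```

This works only because the field separator and the decimal separator differ, which matches the point above about not using commas for both.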