Message-ID: <[email protected]>
Date: Wed, 16 Sep 2009 00:06:25 +0100
From: Peter Knaggs <[email protected]>
Newsgroups: comp.lang.forth
Subject: RfD: Internationalisation
ANS Forth Internationalisation
==============================
2009-09-16  Remove SUBSTITUTE, now subject to a separate proposal.
2009-09-02  Converted into text file.
2007-06-26  Updated rationale section, LOCALE@, and minor wordsmithing
2005-10-23  Added GET-ENCODING and SET-ENCODING,
            Changed stack action of SUBSTITUTE.
2001-04-25  Added GET-ESCAPE to provide restoration capabilities.
2001-03-25  Minor text changes
1999-06-21  Wordsmithed at ANS meeting
1999-06-20  Tightened up some wording
            Added references to more standards.
1999-06-14  Added an ambiguous condition to SUBSTITUTE.
            Changed COUNTRY and LANGUAGE to SET-COUNTRY and
            SET-LANGUAGE returning an ior.
1999-05-30  Derived from parallel discussion document
Problem
=======
Forth applications designed to run in many countries and languages
cannot yet make enough assumptions about strings and character sets
to be portable. The LOCALE word set is designed to provide words for
portable internationalisation of application programs. The proposal
does not attempt to cover text processing in general, but only to
permit conversion of a limited set of application-defined text for
internationalisation. The proposal is based on techniques used in
large Forth applications for many years.
In practice, many applications are not localised by the software
developer, but by their agents in other countries. The LOCALE word
set permits the software developer to provide tools that will produce
text files that can be edited and converted to another language
locally without dependency on computer language or operating system
specific tools such as resource compilers and managers. At the same
time, the proposed word set does not inhibit the use of sets of
statically compiled strings for each language; it simply does not
define the mechanism.
The basis of the LOCALE word set is that all strings for
internationalisation are compiled as LOCALE structures, and all
access to the strings is through these structures. The following
word set appears adequate as a starting point. It is designed to
cope with character sets whose character size differs from that of
the native set.
The word set is split into a base set and an extension set to
indicate which factors need to be language sensitive. It is also
likely that all LOCALE structures will need to be linked in case
reindexing of hash tables or other internal structures is necessary.
The word L" is proposed for language sensitive strings, and behaves
in a similar way to the ANS word C", but returns a string identifier
known as a locale string identifier (lsid) from which the required
language string can be extracted. The reason for this is so that text
information in the native development language is still available in
the source, making source maintenance easier because the intention of
the string is still available to the developer. In addition, the
Forth compiler can be extended to produce a text file containing the
native strings.
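As a rough illustration of how L" might read in application source,
consider the sketch below. The two stand-in definitions are
hypothetical and exist only so that the sketch loads on a system
without the LOCALE word set: they assume an lsid is simply the address
of a counted string, which a real implementation need not do. The name
greeting is likewise invented for the example.

```forth
\ Stand-ins (hypothetical): an lsid is just a counted-string address,
\ so L" can borrow C". A real system would build a LOCALE structure.
: L" ( "ccc<quote>" -- )  POSTPONE C" ; IMMEDIATE
: NATIVE@ ( lsid -- c-addr len )  COUNT ;

\ Application source: the native text stays readable where it is used.
: greeting ( -- lsid )  L" Good morning" ;

\ Display the development-language text behind the lsid.
greeting NATIVE@ TYPE CR
```

The point of the sketch is only that the developer still sees the
intended text at the point of use, while the application passes
around an lsid rather than the text itself.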
The number of items to be displayed which are, or may be, language
sensitive is large. Not all applications may need to deal with all
of them. In addition, many applications need to be able to perform
text substitution, for example:
Your balance at <time> on <date> is <currency-value>.
This is the subject of a separate RfD (Substitute).
Terminology and assumptions
===========================
LOCALE
------
We use the word locale to mean the mixture of country, language,
font, date/time formatting and so on in use when an application
program runs.
Character sets
--------------
The language and character set encoding used by the Forth system at
development time is referred to as the Development Character Set
(DCS). The development character set is assumed never to change.
It is furthermore assumed that character manipulation in the Forth
system is defined in terms of the DCS, and that the action of
character operations such as CMOVE is locked to the DCS.
The language and character set encoding used by any underlying
operating system is referred to as the Operating Character Set
(OCS). The OCS may or may not be the same as the DCS.
The language and character set encoding used at application run time
is referred to as the Application Character Set (ACS). It is assumed
that the largest character in an ACS fits in the native cell of the
development Forth system. The ACS may or may not be the same as the
OCS.
The DCS is usually seven or eight bit ASCII in the majority of
today's Forth systems, but we will see Unicode systems in the near
future. The OCS is defined by the host machine, and the ACS by the
user of the application. Thus, an application written in a Forth
designed for ISO-Latin1 may be running on an O/S with a Chinese OCS,
and a visitor may switch the application into yet another ACS, such
as Russian. Such scenarios are rare within the US and Europe, but
are common elsewhere in the world. Some countries, such as South
Africa with its 11 official languages, are multilingual, and some
languages, such as Portuguese and English, are spoken in many
different countries.
LOCALE structures
-----------------
We do not wish to constrain or influence implementation techniques
in any way. A specific string for internationalisation needs to be
referred to by a single parameter, which we call the "locale string
identifier", or lsid. This is an opaque type: the programmer should
make no assumptions about what it means, except that different
strings have different lsids. In many cases, an lsid may well be an
address or a hash code.
LOCALE strings
--------------
At application run time, locale strings need to be manipulated.
Locale strings are described in terms of address units.
Country and language constants
------------------------------
There are a number of standardisation efforts for country and
language codes. Since the objective of this proposal is to provide
for source portability of applications, we do not need to mandate
numeric or string values, but only to define language and country
source names that can be used as Forth words.
Assuming that text processing is mostly affected by language
selection, and that formatting is heavily influenced by both country
and corporate standards, we suggest that country be defined by the
ISO3166:1998 two letter country codes (Alpha-2). For this standard
an algorithm has been defined to produce unique numeric codes for
each country. A set of language codes (ISO639:1998) also exists.
Octets and Bytes
----------------
Since the vast majority of character sets are defined in terms of 8
bit units commonly referred to as bytes or octets, it is likely that
the implementation of any internationalisation code will require the
presence of byte/octet access words, regardless of the underlying
DCS character size.
The presence and definition of an octet/byte access mechanism is
outside the scope of this proposal.
The optional LOCALE word set
============================
Environmental queries
---------------------
Append the table below to table xxx
String value   Data type   Constant?   Meaning
LOCALE         flag        no          LOCALE word set present
LOCALE-EXT     flag        no          LOCALE extension word set present
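A program might test for the word set along the following lines. This
is a minimal sketch using the standard word ENVIRONMENT? with the
query string from the table above; the name .locale-support is
invented for the example.

```forth
\ Report whether the system claims LOCALE support.
\ ENVIRONMENT? returns false for an unknown query; for a known flag
\ query it returns the flag with true on top.
: .locale-support ( -- )
   S" LOCALE" ENVIRONMENT? IF
      IF   ." LOCALE word set present"
      ELSE ." LOCALE word set absent"
      THEN
   ELSE
      ." LOCALE query not recognised"
   THEN CR ;

.locale-support
```

A system that predates this proposal would be expected to take the
"not recognised" branch.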
Additional documentation requirements
-------------------------------------
Implementation-defined options
- Default encoding setting (SET-ENCODING)
- Default language setting (SET-LANGUAGE)
- Default country setting (SET-COUNTRY)
Ambiguous conditions
- use of an invalid locale string identifier (lsid)
- a locale string is too big for a destination buffer
LOCALE words
------------
SET-ENCODING ( encoding -- ior )
Sets the character encoding for the LOCALE system. The value of
encoding is implementation defined, with zero denoting the default.
The ior is zero if the operation succeeds; otherwise it is a non-zero
implementation-dependent value. If the operation does not succeed,
the current encoding remains unchanged.
GET-ENCODING ( -- encoding )
Returns the encoding last set by SET-ENCODING. The default encoding
is implementation defined.
SET-LANGUAGE ( lang -- ior )
Sets the current language for the LOCALE system to lang. The ior
is zero if the operation succeeds; otherwise it is a non-zero
implementation-dependent value. If the operation does not succeed,
the current language remains unchanged.
GET-LANGUAGE ( -- lang )
Returns the language code last set by SET-LANGUAGE. The default
language is implementation defined.
SET-COUNTRY ( country -- ior )
Sets the current country for the LOCALE system to country. The ior
is zero if the operation succeeds; otherwise it is a non-zero
implementation-dependent value. If the operation does not succeed,
the current country remains unchanged.
GET-COUNTRY ( -- country )
Returns the country code last set by SET-COUNTRY. The default
country is implementation defined.
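The GET-/SET- pairs above allow a setting to be changed temporarily
and then restored. The following sketch shows one possible pattern for
the language setting. The stand-in SET-LANGUAGE and GET-LANGUAGE, the
constants lang-en and lang-fr, their numeric values and the ior value
-1 are all assumptions made so that the sketch loads on a system
without the LOCALE word set; the proposal does not define them.

```forth
\ Stand-ins (hypothetical): two invented language codes and a toy
\ SET-LANGUAGE that rejects any other value.
1 CONSTANT lang-en
2 CONSTANT lang-fr
VARIABLE cur-lang  lang-en cur-lang !

: GET-LANGUAGE ( -- lang )  cur-lang @ ;
: SET-LANGUAGE ( lang -- ior )
   DUP lang-en =  OVER lang-fr =  OR
   IF  cur-lang !  0  ELSE  DROP -1  THEN ;

\ Run xt with the language switched to lang, then restore the
\ previous language. On failure xt is not run and the ior is returned.
: with-language ( i*x xt lang -- j*x ior )
   GET-LANGUAGE >R
   SET-LANGUAGE ?DUP IF  NIP  R> DROP  EXIT  THEN
   EXECUTE
   R> SET-LANGUAGE ;

: .lang ( -- )  GET-LANGUAGE . ;

' .lang lang-fr with-language DROP   \ .lang runs with lang-fr selected
' .lang 99 with-language .           \ unknown code: ior is displayed
```

Because a failed SET-LANGUAGE leaves the current language unchanged,
the failure path needs no restoration step.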
L" "L-Quote"
Interpretation:
The interpretation semantics for this word are undefined.
Compilation: ( "ccc<quote>" -- )
Parse ccc delimited by a " (double-quote) and append the run-time
semantics given below to the current definition.
Runtime: ( -- lsid )
Return lsid, an identifier for a locale string. Other words use
lsid to extract language specific information.
LOCALE@ ( lsid -- c-addr len(au) )
Return the address and length in address units of the string (in
the current language) that corresponds to the native string
identified by lsid. The format of the string at c-addr is
implementation dependent. The length of the string is returned in
address units so that it may be copied by MOVE without knowledge
of the character set width.
The returned string is valid until the next use of LOCALE@,
SET-COUNTRY, SET-LANGUAGE or SET-ENCODING.
Ambiguous conditions occur if
1) the lsid is invalid,
2) a lifetime condition has been exceeded,
3) an underlying mass storage system fails.
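Since the string returned by LOCALE@ has a limited lifetime, an
application that needs it for longer would copy it first. The sketch
below shows one way to do so with MOVE, refusing strings that do not
fit rather than running into the ambiguous condition. The stand-in
L" and LOCALE@ are hypothetical: they treat an lsid as a
counted-string address and perform no translation. The names /msg,
msg-buf, msg-len, save-locale and farewell are invented for the
example.

```forth
\ Stand-ins (hypothetical), so that the sketch loads without the
\ LOCALE word set. LOCALE@ here just returns the native text.
: L" ( "ccc<quote>" -- )  POSTPONE C" ; IMMEDIATE
: LOCALE@ ( lsid -- c-addr len )  COUNT CHARS ;

80 CHARS CONSTANT /msg        \ buffer size in address units
CREATE msg-buf  /msg ALLOT
VARIABLE msg-len              \ saved length in address units

\ Copy a locale string into msg-buf before a later LOCALE@ can
\ invalidate it. Returns true if the string fitted, false otherwise.
: save-locale ( lsid -- flag )
   LOCALE@                                ( c-addr len )
   DUP /msg > IF  2DROP FALSE EXIT  THEN
   DUP msg-len !
   msg-buf SWAP MOVE  TRUE ;

: farewell ( -- lsid )  L" Goodbye" ;
farewell save-locale DROP
```

Both the buffer size and the saved length are kept in address units,
so the copy does not depend on the width of the character set.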
LOCALE extension words
----------------------
These words are provided here to give portability of implementation
techniques. They are building blocks for a practical implementation.
LOCALE-INDEX ( lsid -- )
Updates the internal data structure. Useful if structures are
added and changes to internal structures are required.
LOCALE-NEXT ( lsid1 -- lsid2 )
Given the lsid of one LOCALE structure, returns the lsid of the
next. A return value of zero indicates that there is no next lsid.
LOCALE-START ( -- lsid )
Returns an lsid from which all others can be found using
LOCALE-NEXT. A return value of zero indicates that no locale
strings have been defined.
LOCALE-TYPE ( c-addr len -- )
Displays the LOCALE string whose address and length in address
units are given.
NATIVE@ ( lsid -- c-addr len )
Given the lsid of a LOCALE structure, returns the address and length
of the corresponding DCS native string that was compiled by L".
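The extension words can be combined to visit every locale string, for
example to list the native text for translators. The following is a
sketch only. The stand-in definitions are a toy model, not the
authors' implementation: they assume each LOCALE structure is an
ALLOCATEd block holding a link cell, a length cell and the native
text, and they require the Memory-Allocation and Exception word sets.
The names locale-head, .native-strings, msg-open and msg-close are
invented for the example.

```forth
\ Stand-ins (hypothetical): a toy linked list of LOCALE structures.
VARIABLE locale-head  0 locale-head !   \ newest structure, 0 if none

: L" ( "ccc<quote>" -- )  ( run-time: -- lsid )
   [CHAR] " PARSE                        ( c-addr u )
   DUP CHARS 2 CELLS + ALLOCATE THROW    ( c-addr u node )
   locale-head @ OVER !                  \ cell 0: link to older node
   DUP locale-head !
   2DUP CELL+ !                          \ cell 1: length in characters
   DUP >R 2 CELLS + SWAP CHARS MOVE      \ native text follows
   R> POSTPONE LITERAL ; IMMEDIATE

: LOCALE-START ( -- lsid )          locale-head @ ;
: LOCALE-NEXT  ( lsid1 -- lsid2 )   @ ;
: NATIVE@ ( lsid -- c-addr len )    DUP 2 CELLS + SWAP CELL+ @ ;

\ Display the native text of every locale string, one per line.
: .native-strings ( -- )
   LOCALE-START
   BEGIN  DUP  WHILE
      DUP NATIVE@ TYPE CR
      LOCALE-NEXT
   REPEAT DROP ;

: msg-open  ( -- lsid )  L" Open file" ;
: msg-close ( -- lsid )  L" Close file" ;

.native-strings
```

In this toy model the list is walked from the newest string to the
oldest; the proposal itself does not specify an order.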
Remarks
=======
It is expected that this proposal will be combined with the XChars
proposal to form a new Globalisation word set.
Authors
=======
Stephen Pelc, MicroProcessor Engineering, [email protected]
Peter Knaggs, University of Exeter, [email protected]
Contributions from:
Willem Botha, Construction Computer Software
Nick Nelson, Micross Electronics, [email protected]
Greg Bailey, Athena Programming, [email protected]