{smcl}
{* *! version 1.1.3 31oct2024}{...}
{viewerjumpto "Syntax" "usort##syntax"}{...}
{viewerjumpto "Description" "usort##description"}{...}
{viewerjumpto "Methods and formulas" "usort##methods"}{...}
{viewerjumpto "Examples" "usort##examples"}{...}
{p2colset 1 15 17 2}{...}
{title:Title}
{phang}
{bf:usort} {hline 2} byable locale-based ascending and descending sort that
supports conditional statements, observation ranges, and user-defined handling
of substrings and missing values
{marker syntax}{...}
{title:Syntax}
{p 8 17 2}
{cmdab:usort}
[{cmd:+}|{cmd:-}]
{varname}
[[{cmd:+}|{cmd:-}]
{varname} {it:...}]
{ifin}
[{cmd:,} {it:options}]
{synoptset 35 tabbed}{...}
{synopthdr}
{synoptline}
{syntab:Substrings and missing values}
{synopt:{opth f:irst(strings:string [, pos|rpos|regex])}} sort the words, i.e.,
substrings (not) enclosed in double quotes and separated by
whitespace(s), within the {help strings:{it:string}}: first (for
[+]{varname}) or last (for -{varname}) in the specified order (word 1
is first, word 2 is second, etc.). Sorting is done by: a)
comparing {varname} with each word (without options), b) using
{help f_ustrpos:ustrpos({it:varname}, word)} (for {it:pos}), c) using
{help f_ustrrpos:ustrrpos({it:varname}, word)} (for {it:rpos}), or
d) using {help f_ustrregexm:ustrregexm({it:varname}, word)} (for
{it:regex}). {break} {bf:Note:} System numerical and string missing
values are coded as {bf:.}, while non-system missing values are coded as
{bf:.a}, {bf:.b}, etc. in non-regex representation. For regular
expressions, you can use {bf:^[.]$}, {bf:^[.]a$}, {bf:^[.]b$}, etc. in
{bf:first(}{help strings:{it:string}}, {it:regex}{bf:)}.{p_end}
{synopt:{opth l:ast(strings:string [, pos|rpos|regex])}} sort the words, i.e.,
substrings not enclosed in double quotes and separated by
whitespace(s), within the {help strings:{it:string}}: last (for
[+]{varname}) or first (for -{varname}) in the specified order (word 1
is last, word 2 is second to last, etc.). Sorting is performed by: a)
comparing {varname} with each word (without options), b) using
{help f_ustrpos:ustrpos({it:varname}, word)} (for {it:pos}), c) using
{help f_ustrrpos:ustrrpos({it:varname}, word)} (for {it:rpos}), or
d) using {help f_ustrregexm:ustrregexm({it:varname}, word)} (for
{it:regex}). {break} {bf:Note:} System numerical and string missing
values are coded as {bf:.}, while non-system missing values are coded as
{bf:.a}, {bf:.b}, etc. in non-regex representation. For regular
expressions, you can use {bf:^[.]$}, {bf:^[.]a$}, {bf:^[.]b$}, etc. in
{bf:last(}{help strings:{it:string}}, {it:regex}{bf:)}.{p_end}
{synopt:{opt ignorec}}ignore case in {bf:first()} and {bf:last()}.
{p_end}
{synopt:{opt mf:irst}}sort missing values first (for [+]{varname}) or last (for
-{varname}).{p_end}
{synopt:{opt ml:ast}}sort missing values last (for [+]{varname}) or first (for
-{varname}).{p_end}
{synopt:{opt ignorem}}ignore missing values when using {bf:first()},
{bf:last()}, {bf:mfirst}, and {bf:mlast}.{p_end}
{syntab:Locale}
{synopt:{opth loc:ale(string)}}locale code from the {stata unicode locale list};
the default is {bf:c(locale_functions)}.{p_end}
{synopt:{opth st(#)}}argument {it:st} in
{help f_ustrsortkeyex:ustrsortkeyex()}, with a default value of {bf:-1}.
{p_end}
{synopt:{opth case(#)}}argument {it:case} in
{help f_ustrsortkeyex:ustrsortkeyex()}, with a default value of {bf:-1}.
{p_end}
{synopt:{opth cslv(#)}}argument {it:cslv} in
{help f_ustrsortkeyex:ustrsortkeyex()}, with a default value of {bf:-1}.
{p_end}
{synopt:{opth norm(#)}}argument {it:norm} in
{help f_ustrsortkeyex:ustrsortkeyex()}, with a default value of {bf:-1}.
{p_end}
{synopt:{opth num(#)}}argument {it:num} in
{help f_ustrsortkeyex:ustrsortkeyex()}, with a default value of {bf:-1}.
{p_end}
{synopt:{opth alt(#)}}argument {it:alt} in
{help f_ustrsortkeyex:ustrsortkeyex()}, with a default value of {bf:-1}.
{p_end}
{synopt:{opth fr(#)}}argument {it:fr} in
{help f_ustrsortkeyex:ustrsortkeyex()}, with a default value of {bf:-1}.
{p_end}
{syntab:Miscellaneous}
{synopt:{opth format(%fmt)}}format for converting numerical sort variables into
strings (sorting is performed on string values only). The default format
is {bf:%32.16f}.{p_end}
{synopt:{opth codepoint(#)}}code point of the character, taken from the bottom
of the UTF-8 table, that is substituted for {bf:last()} substrings so that
they sort last. The default value is {bf:129769}.{p_end}
{synoptline}
{p2colreset}{...}
{p 4 6 2}
{opt by} is allowed; see {help by}.{p_end}
{marker weight}{...}
{p 4 6 2}
{opt weight}s are not allowed; see {help weights}.
{p_end}
{marker description}{...}
{title:Description}
{pstd}
This program is a byable sort command, which allows for: a) custom first and
last substrings, including system missing values ({bf:.}) and all other missing
values, b) {helpb gsort}-like syntax for sorting in ascending or descending
order, and c) conditional sorting using [{help if:{it:if}}] or range-based
sorting using [{help in:{it:in}}]. The program is built around Stata's
{helpb sort} command and will mark the dataset as sorted (sorted by) if all rows
are selected. If a subset of rows is selected, it applies {help mata:Mata}'s
{help mf_sort:_collate()}.
{pstd}
Sorting large datasets may be more taxing on machine CPU, memory, and/or disk
space than {helpb sort} and {helpb gsort}.
{marker methods}{...}
{title:Methods and formulas}
{pstd}
Sorting occurs in {bf:two steps}:
{pstd}
{bf:1.} Generating a permutation vector in {help mata:Mata} from the sort
variables under {helpb preserve}. Since non-numeric sorting values cannot be
'destringed', the sort variable type must be {help data_types:{it:str#/strL}}
to allow sorting them as a single matrix using {help mata:Mata}'s
{help mf_sort:sort()} function. The precision for sorting 'tostringed' numeric
values is determined by the {help format:{it:%fmt}} (either default or
user-specified) in {bf:format()}.
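{pmore}
As a rough illustration (not the program's internal code), Stata's
{cmd:string()} function shows how a numeric value is rendered under the default
format and under a shorter, purely hypothetical one. Values that differ only
beyond the displayed decimals would become identical strings and would
therefore tie in the sort. The last call assumes the {cmd:auto} dataset is in
memory:
{phang2}{cmd:. display string(1/3, "%32.16f")}
{phang2}{cmd:. display string(1/3, "%12.4f")}
{phang2}{cmd:. usort gear_ratio, format(%12.4f)}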
{pmore}
To ensure that substrings specified by the {bf:first()} option are sorted first,
they are replaced within the sort variables by {bf:" #"}, where {bf:" "} is a
string of whitespaces (a Unicode character from the top of the UTF-8 table) with
a length of max(strlen(sort variable)). This step is skipped for already
'tostringed' missing values ({bf:.}, {bf:.a}, ..., {bf:.z}) if {bf:ignorem} is
specified.
{pmore}
To ensure that substrings specified by the {bf:last()} option are sorted last,
they are replaced within the sort variables by {bf:"©#"}, where {bf:"©"} is a
string of identical Unicode characters from the bottom of the UTF-8 table. The
code point for this character (either default or user-specified) is set by
{bf:codepoint()}, and the length is again max(strlen(sort variable)). This step
is also skipped for 'tostringed' missing values if {bf:ignorem} is specified.
{pmore}
For natural sorting, leading zeros are prepended to the integer parts of
'tostringed' numeric values.
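{pmore}
A minimal sketch of why the zero-padding matters, using plain Stata string
comparison for illustration only (the program itself collates by locale): as
strings, {cmd:"10"} sorts before {cmd:"9"}, whereas the zero-padded {cmd:"09"}
sorts before {cmd:"10"}. Both comparisons below display 1 (true):
{phang2}{cmd:. display ("10" < "9")}
{phang2}{cmd:. display ("09" < "10")}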
{pstd}
{bf:2.} Collating the data on the permutation vector: either all rows, in which
case the data-sorted flag (sorted by) is set, or a subset of rows, in which case
it is not.
{pmore}
The flag is created by preserving the original string and numeric values of the
sort variables in two ancillary matrices in {help mata:Mata}, replacing them
with the permutation vector, performing the regular Stata {helpb sort} (i.e.,
reordering and collating), and then restoring the original sort variable values,
now collated on the permutation vector, using {help mata:Mata}'s
{help mf_sort:_collate()}.
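{pmore}
The idea of collating on a permutation vector can be sketched with a toy
{help mata:Mata} session. This is an illustration only, with a hypothetical
three-element vector {cmd:x}, and not the program's actual code:
{help mf_sort:order()} returns the permutation that sorts {cmd:x}, and
{help mf_sort:_collate()} reorders {cmd:x} in place according to it. Here
{cmd:p} holds the row indices 3, 1, 2, and {cmd:x} ends up as a, b, c:
{phang2}{cmd:. mata: x = ("b" \ "c" \ "a")}
{phang2}{cmd:. mata: p = order(x, 1)}
{phang2}{cmd:. mata: _collate(x, p)}
{phang2}{cmd:. mata: x}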
{pmore}
The program sets a {bf:data-changed flag} when variable rows are collated.
{pstd}
{bf:Note:} The {helpb by} prefix is processed using {helpb egen} {it:group},
in conjunction with {helpb preserve}, {helpb append}, and {helpb save}, which
store interim results in a temporary file.
{marker examples}{...}
{title:Examples}
{pstd}Setup:{p_end}
{phang2}{cmd:. sysuse auto}
{pstd}Sort observations in ascending order by {cmd:price}:{p_end}
{phang2}{cmd:. usort price}
{pstd}Sort observations in ascending order by {cmd:rep78}, missing first:{p_end}
{phang2}{cmd:. usort rep78, mfirst}
{pstd}Sort observations in ascending order by {cmd:make} in Czech, grouped by
{cmd:foreign}, with VW models placed at the top:{p_end}
{phang2}{cmd:. bysort foreign: usort make, first(VW, pos) loc(cs_CZ)}
{pstd}Sort observations in descending order by {cmd:mpg} and {cmd:price}:{p_end}
{phang2}{cmd:. usort -mpg -price}
{pstd}Sort observations in descending order by {cmd:mpg} and {cmd:price} for
domestic cars only:{p_end}
{phang2}{cmd:. usort -mpg -price if ! foreign}
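{pstd}The remaining calls are sketches derived from the option descriptions
above rather than verified runs; they assume the {cmd:auto} dataset is in
memory.{p_end}
{pstd}Sort observations in ascending order by {cmd:make}, with makes containing
"Buick" placed at the bottom:{p_end}
{phang2}{cmd:. usort make, last(Buick, pos)}
{pstd}Sort observations in ascending order by {cmd:make}, placing makes that
begin with "VW" first and makes that begin with "Volvo" second, using regular
expressions:{p_end}
{phang2}{cmd:. usort make, first(^VW ^Volvo, regex)}
{pstd}Sort observations in descending order by {cmd:rep78}, missing last:{p_end}
{phang2}{cmd:. usort -rep78, mfirst}
{pstd}Sort only the first 20 observations in ascending order by
{cmd:price}:{p_end}
{phang2}{cmd:. usort price in 1/20}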
{title:Acknowledgements}
{pstd}
A special thanks to Leonardo Guizzetti for requesting and testing this program.
{title:Author}
{pstd}
{bf:Ilya Bolotov}
{break}Prague University of Economics and Business
{break}Prague, Czech Republic
{break}{browse "mailto:[email protected]":[email protected]}
{pstd}
Thanks for citing this software and my works on the topic:
{p 8 8 2}
Bolotov, I. (2024). USORT: Stata module to perform locale-based ascending
and descending sort that supports conditional statements, observation
ranges, and user-defined handling of substrings and missing values.
Available from {browse "https://ideas.repec.org/c/boc/bocode/s459385.html"}.