usort.sthlp

{smcl}
{* *! version 1.1.3  31oct2024}{...}
{viewerjumpto "Syntax" "usort##syntax"}{...}
{viewerjumpto "Description" "usort##description"}{...}
{viewerjumpto "Methods and formulas" "usort##methods"}{...}
{viewerjumpto "Examples" "usort##examples"}{...}
{p2colset 1 15 17 2}{...}

{title:Title}
{phang}
{bf:usort} {hline 2} byable locale-based ascending and descending sort that
supports conditional statements, observation ranges, and user-defined handling
of substrings and missing values

{marker syntax}{...}
{title:Syntax}

{p 8 17 2}
{cmdab:usort}
[{cmd:+}|{cmd:-}]
{varname}
[[{cmd:+}|{cmd:-}]
{varname} {it:...}]
{ifin}
[{cmd:,} {it:options}]

{synoptset 35 tabbed}{...}
{synopthdr}
{synoptline}
{syntab:Substrings and missing values}
{synopt:{opth f:irst(strings:string [, pos|rpos|regex]])}} sort the words, i.e.,
        substrings (not) enclosed in double quotes and separated by
        whitespace(s), within the {help strings:{it:string}}: first (for
        [+]{varname}) or last (for -{varname}) in the specified order (word 1
        is first, word 2 is second, etc.). Sorting is done by: a)
        comparing {varname} with each word (without options), b) using
        {help f_ustrpos:ustrpos({it:varname}, word)} (for {it:pos}), c) using
        {help f_ustrrpos:ustrrpos({it:varname}, word)} (for {it:rpos}), or
        d) using {help f_ustrregexm:ustrregexm({it:varname}, word)} (for
        {it:regex}). {break} {bf:Note:} System numerical and string missing
        values are coded as {bf:.}, while non-system missing values are coded as
        {bf:.a}, {bf:.b}, etc. in non-regex representation. For regular
        expressions, you can use {bf:^[.]$}, {bf:^[.]a$}, {bf:^[.]b$}, etc. in
        {bf:first(}{help strings:{it:string}}, {it:regex}{bf:)}.{p_end}
{synopt:{opth l:ast(strings:string [, pos|rpos|regex]])}}  sort the words, i.e.,
        substrings not enclosed in double quotes and separated by
        whitespace(s), within the {help strings:{it:string}}: last (for
        [+]{varname}) or first (for -{varname}) in the specified order (word 1
        is last, word 2 is second to last, etc.). Sorting is performed by: a)
        comparing {varname} with each word (without options), b) using
        {help f_ustrpos:ustrpos({it:varname}, word)} (for {it:pos}), c) using
        {help f_ustrrpos:ustrrpos({it:varname}, word)} (for {it:rpos}), or
        d) using {help f_ustrregexm:ustrregexm({it:varname}, word)} (for
        {it:regex}). {break} {bf:Note:} System numerical and string missing
        values are coded as {bf:.}, while non-system missing values are coded as
        {bf:.a}, {bf:.b}, etc. in non-regex representation. For regular
        expressions, you can use {bf:^[.]$}, {bf:^[.]a$}, {bf:^[.]b$}, etc. in
        {bf:first(}{help strings:{it:string}}, {it:regex}{bf:)}.{p_end}
{synopt:{opt ignorec}}ignore case sensitivity in {bf:first()} and {bf:last()}.
        {p_end}
{synopt:{opt mf:irst}}sort missing values first (for [+]{varname}) or last (for
        -{varname}).{p_end}
{synopt:{opt ml:ast}}sort missing values last (for [+]{varname}) or first (for
        -{varname}).{p_end}
{synopt:{opt ignorem}}ignore missing values when using {bf:first()},
        {bf:last()}, {bf:mfirst}, and {bf:mlast}.{p_end}

{syntab:Locale}
{synopt:{opth loc:ale(string)}}locale code from the {stata unicode locale list}
        or {bf:c(locale_functions)} by default.{p_end}
{synopt:{opth st(#)}}argument {it:st} in
        {help f_ustrsortkeyex:ustrsortkeyex()}, with a default value of {bf:-1}.
        {p_end}
{synopt:{opth case(#)}}argument {it:case} in
        {help f_ustrsortkeyex:ustrsortkeyex()}, with a default value of {bf:-1}.
        {p_end}
{synopt:{opth cslv(#)}}argument {it:cslv} in
        {help f_ustrsortkeyex:ustrsortkeyex()}, with a default value of {bf:-1}.
        {p_end}
{synopt:{opth norm(#)}}argument {it:norm} in
        {help f_ustrsortkeyex:ustrsortkeyex()}, with a default value of {bf:-1}.
        {p_end}
{synopt:{opth num(#)}}argument {it:num} in
        {help f_ustrsortkeyex:ustrsortkeyex()}, with a default value of {bf:-1}.
        {p_end}
{synopt:{opth alt(#)}}argument {it:alt} in
        {help f_ustrsortkeyex:ustrsortkeyex()}, with a default value of {bf:-1}.
        {p_end}
{synopt:{opth fr(#)}}argument {it:fr} in
        {help f_ustrsortkeyex:ustrsortkeyex()}, with a default value of {bf:-1}.
        {p_end}

{syntab:Miscellaneous}
{synopt:{opth format(%fmt)}}format for converting numerical sort variables into
        strings (sorting is performed on string values only). The default format
        is {bf:%32.16f}.{p_end}
{synopt:{opth codepoint(#)}}code point location of a symbol from the bottom of
        the UTF-8 table used to make {bf:last()} work. The default value is
        {bf:129769}.{p_end}
{synoptline}
{p2colreset}{...}
{p 4 6 2}
{opt by} is allowed; see {help by}.{p_end}
{marker weight}{...}
{p 4 6 2}
{opt weight}s are not allowed; see {help weights}.
{p_end}

{marker description}{...}
{title:Description}

{pstd}
This program is a byable sort command, which allows for: a) custom first and
last substrings, including system missing values ({bf:.}) and all other missing
values, b) {helpb gsort}-like syntax for sorting in ascending or descending
order, and c) conditional sorting using [{help if:{it:if}}] or range-based
sorting using [{help in:{it:in}}]. The program is built around Stata's
{helpb sort} command and will mark the dataset as sorted (sorted by) if all rows
are selected. If a subset of rows is selected, it applies {help mata:Mata}'s
{help mf_sort:_collate()}.

{pstd}
Sorting large datasets may be more taxing on machine CPU, memory, and/or disk
space as compared to {helpb sort} and {helpb gsort}.

{marker methods}{...}
{title:Methods and Formulas}
{pstd}
Sorting occurs in {bf:two steps}:

{pstd}
{bf:1.} Generating a permutation vector in {help mata:Mata} from the sort
variables under {helpb preserve}. Since non-numeric sorting values cannot be
'destringed', the sort variable type must be {help data_types:{it:str#/strL}}
to  allow sorting them as a single matrix using {help mata:Mata}'s
{help mf_sort:sort()} function. The precision for sorting 'tostringed' numeric
values is determined by the {help format:{it:%fmt}} (either default or
user-specified) in {bf:format()}.

{pmore}
To ensure that substrings specified by the {bf:first()} option are sorted first,
they are replaced within the sort variables by {bf:" #"}, where {bf:" "} is a
string of whitespaces (a Unicode character from the top of the UTF-8 table) with
a length of max(strlen(sort variable)). This step is skipped for already
'tostringed' missing values ({bf:.}, {bf:.a}, ..., {bf:.z}) if {bf:ignorem} is
specified.

{pmore}
To ensure that substrings specified by the {bf:last()} option are sorted last,
they are replaced within the sort variables by {bf:"©#"}, where {bf:"©"} is a
string of identical Unicode characters from the bottom of the UTF-8 table. The
code point for this character (either default or user-specified) is set by
{bf:codepoint()}, and the length is again max(strlen(sort variable)). This step
is also skipped for 'tostringed' missing values if {bf:ignorem} is specified.

{pmore}
For natural sorting, leading zeros are appended to the integer parts of
'tostringed' numeric values.

{pstd}
{bf:2.} Collating all rows with or a subset without adding the data-sorted flag
(sorted by) using the permutation vector.

{pmore}
The flag is created by preserving the original string and numeric values of the
sort variables in two ancillary matrices in {help mata:Mata}, replacing them
with the permutation vector, performing the regular Stata {helpb sort} (i.e.,
reordering and collating), and then restoring the original sort variable values,
now collated on the permutation vector, using {help mata:Mata}'s
{help mf_sort:_collate()}.

{pmore}
The program sets a {bf:data-changed flag} when variable rows are collated.

{pstd}
{bf:Note:} The {helpb by} prefix is processed using {helpb egen} {it:group},
in conjunction with {helpb preserve}, {helpb append}, and {helpb save}, which
store interim results in a temporary file.

{marker examples}{...}
{title:Examples}
{pstd}Setup:{p_end}
{phang2}{cmd:. sysuse auto}

{pstd}Sort observations in ascending order by {cmd:price}:{p_end}
{phang2}{cmd:. usort price}

{pstd}Sort observations in ascending order by {cmd:rep78}, missing first:{p_end}
{phang2}{cmd:. usort rep78, mfirst}

{pstd}Sort observations in ascending order by {cmd:make} in Czech, grouped by
{cmd:foreign}, with VW models placed at the top:{p_end}
{phang2}{cmd:. bysort foreign: usort make, first(VW, pos) loc(cs_CS)}

{pstd}Sort observations in descending order by {cmd:mpg} and {cmd:price}:{p_end}
{phang2}{cmd:. usort -mpg -price}

{pstd}Sort observations in descending order by {cmd:price} for domestic cars
only:{p_end}
{phang2}{cmd:. usort -mpg -price if ! foreign}

{title:Acknowledgements}
{pstd}
A special thanks to Leonardo Guizzetti for requesting and testing this program.

{title:Author}

{pstd}
{bf:Ilya Bolotov}
{break}Prague University of Economics and Business
{break}Prague, Czech Republic
{break}{browse "mailto:ilya.bolotov@vse.cz":ilya.bolotov@vse.cz}

{pstd}
Thanks for citing this software and my works on the topic:

{p 8 8 2}
    Bolotov, I. (2024). USORT: Stata module to perform locale-based ascending
    and descending sort that supports conditional statements, observation
    ranges, and user-defined handling of substrings and missing values.
    Available from {browse "https://ideas.repec.org/c/boc/bocode/s459385.html"}.