
Handling non-UTF-8 names

Martin Pool edited this page Feb 9, 2020 · 2 revisions

On Linux (and perhaps on Unix generally, other than macOS), the OS constraint on filenames is just that they are byte sequences not containing NUL or slash.

By convention, they are normally UTF-8. (Or at least in some well-defined encoding consistent across the filesystem, but typically UTF-8.) But this is not required: it's possible to have files with names in arbitrary different encodings.

Rust handles this with the OsString API, which avoids implicit conversions. But the application still needs to say how it wants to convert.
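A minimal sketch of the decision the application faces: `OsStr::to_str` succeeds only for valid UTF-8, while `to_string_lossy` always succeeds but may substitute U+FFFD replacement characters. (`describe` is a hypothetical helper, just for illustration.)

```rust
use std::ffi::OsStr;

// Explicitly choose how to convert an OS filename to Unicode:
// strict (`to_str`) or lossy (`to_string_lossy`).
fn describe(name: &OsStr) -> String {
    match name.to_str() {
        Some(utf8) => format!("valid UTF-8: {}", utf8),
        None => format!("not UTF-8; lossy form: {}", name.to_string_lossy()),
    }
}

fn main() {
    println!("{}", describe(OsStr::new("café.txt")));
}
```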

Why even bother?

  • Some users (although not me) may have extensive filesystems intentionally stored in a national encoding
  • Some filesystems may accidentally contain a few anomalous files, perhaps from old data or perhaps from filesystem test cases. But those users would probably be better advised to just fix the names with something like convmv, since many other tools, even ls, will have trouble.
  • If it can be done at reasonable cost (in complexity, effort, and bugs), it's nice to support everything the underlying OS supports. But this is arguably a historical accident in the underlying OS, forbidden on macOS, and softly deprecated on Linux.

In some cases the user will know what the encoding is and in some cases perhaps not.

Conserve's internal filenames are Unicode, and this seems right:

  • Clearly defined
  • Can sort etc
  • Clear how to print them
  • Stable across platforms
  • Can serialize into UTF-8 JSON
  • Localize handling of other encodings near the boundary
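The last point can be sketched as a single conversion at the boundary: names are turned into the internal Unicode form once, where they enter from the OS, and everything downstream works with plain `String`. (`to_internal_name` is a hypothetical helper, not Conserve's actual API.)

```rust
use std::ffi::OsStr;

// Convert an OS filename to the internal Unicode form at the boundary,
// returning None for non-UTF-8 names so the caller can skip or report them.
fn to_internal_name(name: &OsStr) -> Option<String> {
    name.to_str().map(str::to_owned)
}

fn main() {
    assert_eq!(to_internal_name(OsStr::new("a.txt")), Some("a.txt".to_string()));
    // Past the boundary, names sort, print, and serialize without surprises.
    let mut names = vec!["b.txt".to_string(), "a.txt".to_string()];
    names.sort();
    assert_eq!(names, ["a.txt", "b.txt"]);
}
```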

Options

  1. Just skip these files. 0.6.2, at least, does this. They're rare, and becoming rarer. It's probably the right answer.

  2. Lossy conversion to placeholder characters. The data is accessible, but this might make filenames non-unique.

  3. User specifies the filesystem encoding: either instead of UTF-8, or as a fallback if the name is not UTF-8.

  4. As above but configured per directory?
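The non-uniqueness risk in option 2 can be demonstrated directly: two distinct byte sequences can collapse to the same lossy string. This is a Unix-only sketch, since building an `OsString` from raw bytes requires `OsStringExt`.

```rust
use std::ffi::OsString;
#[cfg(unix)]
use std::os::unix::ffi::OsStringExt;

#[cfg(unix)]
fn main() {
    // Two distinct names containing different invalid bytes...
    let a = OsString::from_vec(vec![b'f', 0xff, b'o']);
    let b = OsString::from_vec(vec![b'f', 0xfe, b'o']);
    assert_ne!(a, b);
    // ...map to the same lossy string, "f\u{FFFD}o".
    assert_eq!(a.to_string_lossy(), b.to_string_lossy());
}

#[cfg(not(unix))]
fn main() {}
```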
