
Handling non-UTF-8 names

Martin Pool edited this page Feb 9, 2020 · 2 revisions

On Linux (and perhaps on Unix generally, other than macOS), the OS constraint on filenames is just that they are byte sequences not containing NUL or slash.

By convention, they are normally UTF-8. (Or at least in some well-defined encoding consistent across the filesystem, but typically UTF-8.) But this is not required: it's possible to have files with names in arbitrary different encodings.

Rust handles this with the OsString API, which avoids implicit conversions. But the application still needs to say how it wants to convert.
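A minimal sketch of the decision the application faces: `OsStr::to_str` succeeds only for valid UTF-8, while `to_string_lossy` always succeeds but may substitute U+FFFD replacement characters. (`describe` is a hypothetical helper, just for illustration.)

```rust
use std::ffi::OsStr;

// Explicitly choose how to convert an OS filename to Unicode:
// strict (`to_str`) or lossy (`to_string_lossy`).
fn describe(name: &OsStr) -> String {
    match name.to_str() {
        Some(utf8) => format!("valid UTF-8: {}", utf8),
        None => format!("not UTF-8; lossy form: {}", name.to_string_lossy()),
    }
}

fn main() {
    println!("{}", describe(OsStr::new("café.txt")));
}
```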

Why even bother?

  • Some users (although not me) may have extensive filesystems intentionally stored in a national encoding
  • Some filesystems may accidentally contain a few anomalous files, perhaps from old data or perhaps from filesystem test cases. But those users would probably be better advised to just fix the names with something like convmv, since many other tools, even ls, will have trouble.
  • If it can be done at reasonable cost (in complexity, effort, and bugs), it's nice to support everything the underlying OS supports. But this is arguably a historical accident in the underlying OS, forbidden on macOS, and softly deprecated on Linux.

In some cases the user will know what the encoding is and in some cases perhaps not.

Conserve's internal filenames are Unicode, and this seems right:

  • Clearly defined
  • Can sort etc
  • Clear how to print them
  • Stable across platforms
  • Can serialize into UTF-8 JSON
  • Localize handling of other encodings near the boundary
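The last point can be sketched as a single conversion at the boundary: names are turned into the internal Unicode form once, where they enter from the OS, and everything downstream works with plain `String`. (`to_internal_name` is a hypothetical helper, not Conserve's actual API.)

```rust
use std::ffi::OsStr;

// Convert an OS filename to the internal Unicode form at the boundary,
// returning None for non-UTF-8 names so the caller can skip or report them.
fn to_internal_name(name: &OsStr) -> Option<String> {
    name.to_str().map(str::to_owned)
}

fn main() {
    assert_eq!(to_internal_name(OsStr::new("a.txt")), Some("a.txt".to_string()));
    // Past the boundary, names sort, print, and serialize without surprises.
    let mut names = vec!["b.txt".to_string(), "a.txt".to_string()];
    names.sort();
    assert_eq!(names, ["a.txt", "b.txt"]);
}
```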

Options

  1. Just skip these files. 0.6.2, at least, does this. They're rare, and becoming rarer. It's probably the right answer.

  2. Lossy conversion to placeholder characters. The data is accessible, but this might make filenames non-unique.

  3. User specifies the filesystem encoding: either instead of UTF-8, or as a fallback if the name is not UTF-8.

  4. As above but configured per directory?
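The non-uniqueness risk in option 2 can be demonstrated directly: two distinct byte sequences can collapse to the same lossy string. This is a Unix-only sketch, since building an `OsString` from raw bytes requires `OsStringExt`.

```rust
use std::ffi::OsString;
#[cfg(unix)]
use std::os::unix::ffi::OsStringExt;

#[cfg(unix)]
fn main() {
    // Two distinct names containing different invalid bytes...
    let a = OsString::from_vec(vec![b'f', 0xff, b'o']);
    let b = OsString::from_vec(vec![b'f', 0xfe, b'o']);
    assert_ne!(a, b);
    // ...map to the same lossy string, "f\u{FFFD}o".
    assert_eq!(a.to_string_lossy(), b.to_string_lossy());
}

#[cfg(not(unix))]
fn main() {}
```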
