Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

to_json(std::filesystem::path) can create invalid UTF-8 chars on windows #4271

Open
2 tasks done
MHebes opened this issue Jan 19, 2024 · 13 comments
Open
2 tasks done
Labels
kind: bug state: please discuss please discuss the issue or vote for your favorite option

Comments

@MHebes
Copy link

MHebes commented Jan 19, 2024

Description

This conversion function:

https://github.com/nlohmann/json/blob/7efe875495a3ed7d805ddbb01af0c7725f50c88b/include/nlohmann/detail/conversions/to_json.hpp#L416C1-L420C2

template<typename BasicJsonType>
inline void to_json(BasicJsonType& j, const std_fs::path& p)
{
    j = p.string();
}

uses p.string(), which does not give a UTF-8-encoded string on windows (in some cases, maybe?). Trying to dump() the resultant JSON throws a "invalid UTF-8 byte" exception.

Reproduction steps

Convert a std::filesystem::path, which contains a unicode "Right Single Quotation Mark" character (U+2019), to a json implicitly or with to_json.

Inspect the new json (string_t)'s bytes, either by dump()ing, or converting to BSON.

Expected vs. actual results

Expected: "Strings are stored in UTF-8 encoding." per https://json.nlohmann.me/api/basic_json/string_t/

Actual: The string gets converted by std::filesystem::path::string(), which appears to convert it to Windows-1252 encoding. Its bytes end up as \x92 rather than \xe2\x80\x99.

Minimal code example

#include <filesystem>
#include <iostream>
#include <nlohmann/json.hpp>

int main() {
  try {
    wchar_t wide_unicode_right_quote[2] = {0x2019, 0};  // came from a directory_iterator in reality
    nlohmann::json apost = std::filesystem::path(wide_unicode_right_quote);
    std::cout << apost << std::endl;
    return 0;
  } catch (const std::exception& e) {
    std::cerr << e.what() << std::endl;
    return 1;
  }
}

Workaround I'm using is to use WideCharToMultiByte + .native() to get the string in UTF-8 before passing to nlohmann:

inline std::string Narrow(std::wstring_view wstr) {
  if (wstr.empty()) return {};
  int len = ::WideCharToMultiByte(CP_UTF8, 0, &wstr[0], wstr.size(), nullptr, 0, nullptr, nullptr);
  std::string out(len, 0);
  ::WideCharToMultiByte(CP_UTF8, 0, &wstr[0], wstr.size(), &out[0], len, nullptr, nullptr);
  return out;
}

int main() {
  try {
    wchar_t wide_unicode_right_quote[2] = {0x2019, 0};  // came from a directory_iterator in reality
    nlohmann::json apost = Narrow(std::filesystem::path(wide_unicode_right_quote).native());
    std::cout << apost << std::endl;
    return 0;
  } catch (const std::exception& e) {
    std::cerr << e.what() << std::endl;
    return 1;
  }
}

Error messages

"[json.exception.type_error.316] invalid UTF-8 byte at index 0: 0x92

Compiler and operating system

MSVC 2022 Professional, C++ 20

Library version

develop - a259ecc

Validation

@MHebes
Copy link
Author

MHebes commented Feb 16, 2024

I can also workaround this problem by adding a manifest XML that sets my app's code page to CP_UTF8 on supported versions of windows.

In CMake I wrapped this in a function:

# target_add_manifest(<target> <manifest file>)
#
# You probably want to use ${MANIFEST_FILE_UTF8} defined below this function
#
# Adds a manifest file (https://learn.microsoft.com/en-us/windows/win32/sbscs/application-manifests)
# to an EXE
function(target_add_manifest TARGET_NAME MANIFEST_FILE)
  if(NOT TARGET_NAME)
	  message(FATAL_ERROR "You must provide a target")
	endif()
	if(NOT MANIFEST_FILE)
	  message(FATAL_ERROR "You must provide a manifest file")
	endif()
	add_custom_command(
		TARGET ${TARGET_NAME}
		POST_BUILD
		COMMAND "mt.exe" -manifest \"${MANIFEST_FILE}\" \"-updateresource:$<TARGET_FILE:${TARGET_NAME}>\"
	)
endfunction()

which is used like this (probably want to wrap in a platform check):

add_executable(myapp main.cpp)
target_add_manifest(myapp "${CMAKE_CURRENT_SOURCE_DIR}/cmake/utf8.manifest")

with utf8.manifest being:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
  <application>
    <windowsSettings>
      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>

This solves the problem, if the app is running on at least Windows Version 1903. Still a bug but wanted to share this workaround because it's useful for many libraries that have the same issue.

@MHebes
Copy link
Author

MHebes commented Feb 16, 2024

Proposed diff to do the conversion to UTF-8 when targeting windows:

diff --git a/include/nlohmann/detail/conversions/to_json.hpp b/include/nlohmann/detail/conversions/to_json.hpp
index 562089c3..a8b74688 100644
--- a/include/nlohmann/detail/conversions/to_json.hpp
+++ b/include/nlohmann/detail/conversions/to_json.hpp
@@ -413,10 +413,20 @@ inline void to_json(BasicJsonType& j, const T& t)
 }
 
 #if JSON_HAS_FILESYSTEM || JSON_HAS_EXPERIMENTAL_FILESYSTEM
+#if defined(_WIN32)
+#include <windows.h>
+#endif
 template<typename BasicJsonType>
 inline void to_json(BasicJsonType& j, const std_fs::path& p)
 {
+#if defined(_WIN32)
+    int len = ::WideCharToMultiByte(CP_UTF8, 0, &p.native()[0], p.native().size(), nullptr, 0, nullptr, nullptr);
+    std::string as_utf8(len, 0);
+    ::WideCharToMultiByte(CP_UTF8, 0, &p.native()[0], p.native().size(), &narrowed_string[0], len, nullptr, nullptr);
+    j = std::move(as_utf8);
+#else
     j = p.string();
+#endif
 }
 #endif

@zel1b08a
Copy link

zel1b08a commented Jun 1, 2024

path may be represented in some ways (native/generic_string/string/u8string/e.t.c), so, I think it should be decided on client side how to store it before put it to json object.

Just j = p.u8string(); may be should be there to corresponds the docs.

@nlohmann
Copy link
Owner

I am not sure how to proceed here as I am not a Windows user. Any idea?

@nlohmann nlohmann added the state: please discuss please discuss the issue or vote for your favorite option label Nov 18, 2024
@MHebes
Copy link
Author

MHebes commented Nov 18, 2024

I am not sure how to proceed here as I am not a Windows user. Any idea?

I'm happy to make a PR, up to you what solution to use though:

  1. Use OS APIs to convert to UTF-8 explicitly like in the above diff. Adds a Windows.h dependency. May not fix it for some theoretical non-windows but non-utf-8-native platforms (if they exist).
  2. Use std::filesystem::path::u8string() when supported. Use the current (buggy) behavior otherwise.
  3. Deprecate the built in path converters and/or document the deficiency but otherwise make no code changes.

With (1) and (2) I'm not sure the effect on custom StringTypes. The documentation says "Strings are stored in UTF-8 encoding" which implies that StringType has to have a char value type.

@gregmarr
Copy link
Contributor

gregmarr commented Nov 18, 2024

Use std::filesystem::path::u8string() when supported. Use the current (buggy) behavior otherwise.

When is it not supported? I don't see any indication that it's optional. I see that on Windows it can throw if the string can't be represented in utf-8, so that's a consideration.

@MHebes
Copy link
Author

MHebes commented Nov 18, 2024

When is it not supported?

Sorry, read the reference page wrong!

In that case, @zel1b08a's solution seems best by far. Replace with .u8string(), done & easy.

Small chance of changing the behavior of already-broken code I guess (e.g. someone calling path::path(const char*) on windows with UTF8-encoded chars instead of the active code page)? Not sure if that matters or is possible.

@nlohmann
Copy link
Owner

Awesome. Looking forward to a pull request :)

@risa2000
Copy link

The OS "favored" way how to process unicode on Windows is to use std::wstring, which uses UTF-16LE encoding. (Un)fortunately everyone else favors UTF-8. std::filesystem::path can be constructed from std::string, std::wstring and bunch of specialized unicode types, where in the first case it uses "multibyte" encoding (local code page), in the second case it uses UTF-16LE. Similarly, when converting the path back to the string, using .string() will emit local code page "multibyte" encoding, while .wstring() will emit UTF-16LE encoding. No one wants to mess with local code page encoding and/or converting.

In C++17 there were two functions std::filesystem::path::u8path() to create a path from UTF-8 encoded std::string and std::filesystem::path::u8string() to convert the path back to UTF-8 encoded std::string.
This has been changed in C++20 in a way that u8path has been dropped and u8string() now returns std::u8string, which is basically useless.

Since nlohmann::json assumes that strings are implicitly in UTF-8 the only way to handle this C++ discrepancy correctly on Windows would be to convert std::string into an "intermediate format", either std::wstring (UTF-16LE encoding), or std::u8string (with UTF-8), then pass it to std::filesystem::path constructor, and vice versa, i.e. when serializing to JSON, "extract" either of types from std::filesystem::path and then convert to UTF-8 encoded std::string.

Windows "native" way is to use already mentioned ::WideCharToMultiByte() for conversion from std::wstring to UTF-8 encoded std::string, or, which I have not tried, but I imagine, there might be a more straightforward way to convert between std::u8string and std::string (with UTF-8 encoding). Something like a char copy in either direction.

@nlohmann
Copy link
Owner

If it helps: the library contains code to convert UTF-16 or UTF-32 to UTF-8 (see wide_string_input_helper in include/nlohmann/detail/input/input_adapters.hpp. And since the issue of converting to UTF-8 is only relevant to Windows, maybe relying on ::WideCharToMultiByte() could be preferred.

@risa2000
Copy link

risa2000 commented Jan 19, 2025

I was thinking more about using all the power of the standard lib to avoid calling Win32 API and just before starting to write such an implementation I found exactly this already done here: https://github.com/alf-p-steinbach/C---how-to---make-non-English-text-work-in-Windows/blob/main/microlibs/cppm/stdlib_workarounds/fs_path.hpp#L19

I just cleaned it up a bit and did few cosmetic changes: removed external types and dependencies, and replaced string_view with std::string parameter in path_from_u8.
I also had to better define the C++ std detection as __cplusplus macro is only defined if /Zc:__cplusplus compiler option is defined (which is usually not), so therefore _MSVC_LANG is used instead.
I do not know if the library already uses conditionally compiled paths based on __cplusplus, but if it does, the already defined macro should be used instead.

The cleaned up code for an explicit conversion between std::filesystem::path and UTF-8 encoded std::string looks like this:

#if defined(_MSVC_LANG) && !defined(__clang__)
#define NLOHMANN_JSON_CPP_STD _MSVC_LANG
#else
#define NLOHMANN_JSON_CPP_STD __cplusplus
#endif

inline auto path_from_u8(const std::string& path) -> std::filesystem::path
{
#if NLOHMANN_JSON_CPP_STD < 202002 // `<` b/c `u8path` is deprecated in C++20; ⇨ warnings.
    return std::filesystem::u8path(path);
#else
    return std::filesystem::path(std::u8string_view(reinterpret_cast<const char8_t*>(path.data()), path.size()));
#endif
}

inline auto to_u8_string(const std::filesystem::path& p) -> std::string
{
#if NLOHMANN_JSON_CPP_STD < 202002
    return p.u8string(); // Returns a `std::string` in C++17.
#else
    const std::u8string s = p.u8string();
    return std::string(s.begin(), s.end()); // Needless copy except for C++20 nonsense.
#endif
}

I see certain beauty in this solution: The same codepath works on any platform, regardless its locale, and it does not need to use Win32 API call. The price to pay is one useless copy in to_u8_string. Is it worthy - I do not know.

BTW: Are the "other" platforms (e.g. linux and macOS) only allowing UTF-8 encoding in locale nowadays? I remember there were times when LC_LOCALE could be set to ISO page and in this case I believe even there the "default" std::string used by std::filesystem::path will not get UTF-8 encoded?

@risa2000
Copy link

@nlohmann
If you agree with the aforementioned approach I will submit a PR for this.

@nlohmann
Copy link
Owner

Sure, please go ahead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind: bug state: please discuss please discuss the issue or vote for your favorite option
Projects
None yet
Development

No branches or pull requests

5 participants