to_json(std::filesystem::path) can create invalid UTF-8 chars on windows #4271
Comments
I can also work around this problem by adding a manifest XML that sets my app's code page to UTF-8. In CMake I wrapped this in a function:
which is used like this (probably want to wrap in a platform check):
with
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
<application>
<windowsSettings>
<activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
</windowsSettings>
</application>
</assembly>
This solves the problem if the app is running on at least Windows version 1903. It is still a bug, but I wanted to share this workaround because it's useful for many libraries that have the same issue.
Proposed diff to do the conversion to UTF-8 when targeting Windows:
diff --git a/include/nlohmann/detail/conversions/to_json.hpp b/include/nlohmann/detail/conversions/to_json.hpp
index 562089c3..a8b74688 100644
--- a/include/nlohmann/detail/conversions/to_json.hpp
+++ b/include/nlohmann/detail/conversions/to_json.hpp
@@ -413,10 +413,20 @@ inline void to_json(BasicJsonType& j, const T& t)
}
#if JSON_HAS_FILESYSTEM || JSON_HAS_EXPERIMENTAL_FILESYSTEM
+#if defined(_WIN32)
+#include <windows.h>
+#endif
template<typename BasicJsonType>
inline void to_json(BasicJsonType& j, const std_fs::path& p)
{
+#if defined(_WIN32)
+ int len = ::WideCharToMultiByte(CP_UTF8, 0, &p.native()[0], p.native().size(), nullptr, 0, nullptr, nullptr);
+ std::string as_utf8(len, 0);
+ ::WideCharToMultiByte(CP_UTF8, 0, &p.native()[0], p.native().size(), &as_utf8[0], len, nullptr, nullptr);
+ j = std::move(as_utf8);
+#else
j = p.string();
+#endif
}
#endif
A path may be represented in several ways (native, generic_string, string, u8string, etc.), so I think the client side should decide how to store it before putting it into the json object. Just |
I am not sure how to proceed here as I am not a Windows user. Any idea? |
I'm happy to make a PR, up to you what solution to use though:
With (1) and (2) I'm not sure the effect on custom |
When is it not supported? I don't see any indication that it's optional. I see that on Windows it can throw if the string can't be represented in utf-8, so that's a consideration. |
Sorry, read the reference page wrong! In that case, @zel1b08a's solution seems best by far. Replace with Small chance of changing the behavior of already-broken code I guess (e.g. someone calling |
Awesome. Looking forward to a pull request :) |
The OS "favored" way how to process unicode on Windows is to use In Since Windows "native" way is to use already mentioned |
If it helps: the library contains code to convert UTF-16 or UTF-32 to UTF-8 (see |
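To make the idea of a portable UTF-16 to UTF-8 conversion concrete, here is a minimal hand-written sketch. It is not the library's internal code; the names append_utf8, utf16_to_utf8 and utf16_usage_example are hypothetical and exist only for illustration. It decodes surrogate pairs and rejects unpaired surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of nlohmann/json): append one code point as UTF-8.
inline void append_utf8(std::string& out, std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Hypothetical helper: convert UTF-16 to UTF-8, throwing on unpaired surrogates.
inline std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF) {
                throw std::invalid_argument("unpaired high surrogate");
            }
            const std::uint32_t low = in[++i];
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }
        append_utf8(out, cp);
    }
    return out;
}

// Usage: U+2019 (the character from this issue) becomes the bytes E2 80 99.
inline void utf16_usage_example()
{
    assert(utf16_to_utf8(u"\u2019") == "\xe2\x80\x99");
}
```

On Windows, path::native() returns a std::wstring with 16-bit wchar_t, so it would need to be copied into a std::u16string (or the function templated on the character type) before calling this sketch.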
I was thinking more about using all the power of the standard lib to avoid calling I just cleaned it up a bit and did few cosmetic changes: removed external types and dependencies, and replaced The cleaned up code for an explicit conversion between
#if defined(_MSVC_LANG) && !defined(__clang__)
#define NLOHMANN_JSON_CPP_STD _MSVC_LANG
#else
#define NLOHMANN_JSON_CPP_STD __cplusplus
#endif
inline auto path_from_u8(const std::string& path) -> std::filesystem::path
{
#if NLOHMANN_JSON_CPP_STD < 202002 // `<` b/c `u8path` is deprecated in C++20; ⇨ warnings.
return std::filesystem::u8path(path);
#else
return std::filesystem::path(std::u8string_view(reinterpret_cast<const char8_t*>(path.data()), path.size()));
#endif
}
inline auto to_u8_string(const std::filesystem::path& p) -> std::string
{
#if NLOHMANN_JSON_CPP_STD < 202002
return p.u8string(); // Returns a `std::string` in C++17.
#else
const std::u8string s = p.u8string();
return std::string(s.begin(), s.end()); // Needless copy except for C++20 nonsense.
#endif
}
I see a certain beauty in this solution: the same code path works on any platform, regardless of its locale, and it does not need to use BTW: Are the "other" platforms (e.g. linux and macOS) only allowing |
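As a rough illustration of how the two helpers above could be exercised, here is a self-contained round-trip sketch. The helpers are repeated so the block compiles on its own; the macro name EXAMPLE_CPP_STD and the function round_trip_example are hypothetical names chosen for this example, not anything from the library.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <string_view>

// Hypothetical macro name; mirrors the language-standard check from the comment above.
#if defined(_MSVC_LANG) && !defined(__clang__)
    #define EXAMPLE_CPP_STD _MSVC_LANG
#else
    #define EXAMPLE_CPP_STD __cplusplus
#endif

// Build a path from UTF-8 bytes held in a std::string.
inline auto path_from_u8(const std::string& path) -> std::filesystem::path
{
#if EXAMPLE_CPP_STD < 202002
    return std::filesystem::u8path(path);  // Deprecated from C++20 onward.
#else
    return std::filesystem::path(
        std::u8string_view(reinterpret_cast<const char8_t*>(path.data()), path.size()));
#endif
}

// Get the path as UTF-8 bytes in a std::string, regardless of the native encoding.
inline auto to_u8_string(const std::filesystem::path& p) -> std::string
{
#if EXAMPLE_CPP_STD < 202002
    return p.u8string();  // Already a std::string in C++17.
#else
    const std::u8string s = p.u8string();
    return std::string(s.begin(), s.end());  // Copy char8_t units into char.
#endif
}

// Usage: a file name containing U+2019 should survive the round trip byte for byte.
inline void round_trip_example()
{
    const std::string utf8 = "it\xe2\x80\x99s.txt";
    const std::filesystem::path p = path_from_u8(utf8);
    assert(to_u8_string(p) == utf8);
}
```

The point of the sketch is that neither function depends on the active code page, so the string handed to the JSON object should be UTF-8 on every platform.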
@nlohmann |
Sure, please go ahead. |
Description
This conversion function:
https://github.com/nlohmann/json/blob/7efe875495a3ed7d805ddbb01af0c7725f50c88b/include/nlohmann/detail/conversions/to_json.hpp#L416C1-L420C2
uses p.string(), which does not give a UTF-8-encoded string on Windows (in some cases, maybe?). Trying to dump() the resultant JSON throws an "invalid UTF-8 byte" exception.
Reproduction steps
Convert a std::filesystem::path, which contains a Unicode "Right Single Quotation Mark" character (U+2019), to a json implicitly or with to_json.
Inspect the new json (string_t)'s bytes, either by dump()ing, or converting to BSON.
Expected vs. actual results
Expected: "Strings are stored in UTF-8 encoding." per https://json.nlohmann.me/api/basic_json/string_t/
Actual: The string gets converted by std::filesystem::path::string(), which appears to convert it to Windows-1252 encoding. Its bytes end up as \x92 rather than \xe2\x80\x99.
Minimal code example
Workaround I'm using is to use WideCharToMultiByte + .native() to get the string in UTF-8 before passing to nlohmann:
Error messages
"[json.exception.type_error.316] invalid UTF-8 byte at index 0: 0x92"
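To show why a lone 0x92 is rejected, here is a small UTF-8 validity check. It is a sketch written for this thread, not the library's validation code; first_invalid_utf8 and validity_example are hypothetical names, and the index the library reports may differ from this function's in other malformed inputs. The byte 0x92 has the bit pattern 10xxxxxx, which is only legal as a continuation byte, never at the start of a sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: returns the index of the first byte that starts an
// invalid UTF-8 sequence, or std::string::npos if the whole string is valid.
inline std::size_t first_invalid_utf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;
        std::uint32_t min = 0;  // Smallest code point allowed for this length.
        if (b < 0x80) {
            ++i;  // Plain ASCII.
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            len = 2; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            len = 3; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            len = 4; cp = b & 0x07; min = 0x10000;
        } else {
            return i;  // Stray continuation byte (0x80..0xBF) or invalid lead byte.
        }
        if (i + len > s.size()) {
            return i;  // Truncated sequence.
        }
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) {
                return i;  // Expected a continuation byte.
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong encodings, surrogates, and values above U+10FFFF.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return i;
        }
        i += len;
    }
    return std::string::npos;
}

// Usage: the Windows-1252 byte fails at index 0; the UTF-8 encoding of U+2019 passes.
inline void validity_example()
{
    assert(first_invalid_utf8("\x92") == 0);
    assert(first_invalid_utf8("\xe2\x80\x99") == std::string::npos);
}
```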
Compiler and operating system
MSVC 2022 Professional, C++ 20
Library version
develop - a259ecc
Validation
develop branch is used.