From aac17bbe45fee36b6c349095d1a66023d0107bb3 Mon Sep 17 00:00:00 2001
From: Ruben Perez
Date: Thu, 14 Nov 2024 16:52:20 +0100
Subject: [PATCH] Charset fixes

---
 doc/qbk/17_charsets.qbk                     | 39 ++++++++++-----------
 test/integration/test/snippets/charsets.cpp |  4 ++-
 2 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/doc/qbk/17_charsets.qbk b/doc/qbk/17_charsets.qbk
index 8e8d61a59..732407832 100644
--- a/doc/qbk/17_charsets.qbk
+++ b/doc/qbk/17_charsets.qbk
@@ -26,14 +26,16 @@ The [*connection's character set determines the encoding for character strings
 sent to and retrieved from the server]. This includes SQL query strings,
 string fields and column names in metadata.
 The connection's collation is used for string literal comparison.
+The connection's character set and collation can be changed dynamically
+using SQL.
 
 By default, Boost.MySQL connections use `utf8mb4_general_ci`,
-thus [*using UTF-8 for all strings]. We recommend using the default,
-since MySQL character sets are easy to get wrong.
+thus [*using UTF-8 for all strings]. We recommend using this default,
+as MySQL character sets are easy to get wrong.
 
 The connection's character set is not linked to the character set
-specified for databases, tables and columns. For example,
-with the following declaration:
+specified for databases, tables and columns.
+Consider the following declaration:
 
 ```
 CREATE TABLE test_table(
@@ -62,16 +64,16 @@ of what's affected:
 
 * SQL query strings passed to [refmemunq any_connection async_execute] and
   [refmemunq any_connection async_prepare_statement] must be sent using
-  the connection's charset. Otherwise, syntax errors may happen.
+  the connection's character set. Otherwise, server-side parsing errors may happen.
 * SQL templates and string values passed to [reflink with_params]
-  and [reflink format_sql] must be encoded using the connection's charset.
+  and [reflink format_sql] must be encoded using the connection's character set.
   Otherwise, values will be rejected by Boost.MySQL when composing the query.
   Connections [link mysql.charsets.tracking track the character set in use]
   to detect these errors. If you bypass character set tracking (e.g. by using `SET NAMES` instead of
-  [refmemunq async_set_character_set]), you may run into vulnerabilities.
-* Statement string parameters passed to [refmem statement bind] should use the connection's charset.
+  [refmemunq any_connection async_set_character_set]), you may run into vulnerabilities.
+* Statement string parameters passed to [refmem statement bind] should use the connection's character set.
   Otherwise, MySQL may reject the values.
-* String values in rows and metadata retrieved from the server use the connection's charset.
+* String values in rows and metadata retrieved from the server use the connection's character set.
 * Server-supplied diagnostic messages ([refmem diagnostics server_message])
   also use the connection's character set.
 
@@ -92,10 +94,10 @@ stick to the following advice:
   If you need to use a different encoding in your application, convert your data
   to/from UTF-8 when interacting with the server. The default [reflink connect_params]
   ensure that UTF-8 is used, without the need to run any SQL.
-* [*Don't execute SET NAMES] statements or the `character_set_client` and
+* [*Don't execute SET NAMES] statements or change the `character_set_client` and
   `character_set_results` session variables using `async_execute`.
   This breaks character set tracking, which can lead to vulnerabilities.
-* Don't use [refmemunq async_reset_connection] unless you know what you're doing.
+* Don't use [refmemunq any_connection async_reset_connection] unless you know what you're doing.
   If you need to reuse connections, use [reflink connection_pool], instead.
 * Connections obtained from a [reflink connection_pool] always use `utf8mb4`.
   When connections are returned to the pool, their character set is reset to `utf8mb4`.
@@ -113,7 +115,6 @@ There is a number of actions that can change the connection's character set:
 The [include_file boost/mysql/mysql_collations.hpp] and
 [include_file boost/mysql/mariadb_collations.hpp] headers
 contain available collation IDs.
-
 If the server recognizes the passed collation, the connection's character set will be the one associated to the collation.
 If it doesn't, the connection [*will silently fall back to the server's default character set]
 (usually `latin1`, which is not Unicode).
@@ -159,11 +160,11 @@ Following the above points, this is how tracking works:
   sets the current character set to the passed one. The same applies for a successful
   set character set pipeline stage.
 * Calling [refmemunq any_connection async_reset_connection]
-  makes the current character set to unknown.
+  makes the current character set unknown.
 
 [warning
     [*Do not execute `SET NAMES`], `SET CHARACTER SET` or any other SQL statement
-    that modifies `character_set_client` using `execute`. This will make character set
+    that modifies `character_set_client` using `async_execute`. This will make character set
     information stored in the client invalid.
 ]
 
@@ -206,7 +207,7 @@ for a full implementation.
 Setting the connection's character set during connection establishment
 or using [refmemunq any_connection async_set_character_set] has the ultimate
 effect of changing some session variables. This section lists them as
-a reference. We [*strongly encourage not modifying them manually],
+a reference. We [*strongly encourage you not to modify them manually],
 as this will confuse character set tracking.
 
 * [mysqllink server-system-variables.html#sysvar_character_set_client character_set_client]
@@ -214,15 +215,13 @@ as this will confuse character set tracking.
   the SQL strings passed to [refmemunq any_connection async_execute] and
   [refmemunq any_connection async_prepare_statement], and string parameters
   passed to [refmem statement bind].
-
-  Not all character sets are permissible in `character_set_client`. The server will accept setting
-  this variable to any UTF-8 character set, but won't accept UTF-16.
+  Not all character sets are permissible in `character_set_client`.
+  For example, UTF-16 and UTF-32 based character sets won't be accepted.
 * [mysqllink server-system-variables.html#sysvar_character_set_results character_set_results]
   determines the encoding that the server will use to send any kind of result,
   including string fields retrieved by [refmem connection execute], metadata
   like [refmem metadata column_name] and error messages.
-
-  Note that [refmem metadata column_collation] reflects the charset and collation the server
+  Note that [refmem metadata column_collation] reflects the character set and collation the server
   has converted the column to before sending it to the client. In the above
   example, `metadata::column_collation` will be the default collation for UTF16,
   rather than `latin1_swedish_ci`.
diff --git a/test/integration/test/snippets/charsets.cpp b/test/integration/test/snippets/charsets.cpp
index a00f85c36..893035007 100644
--- a/test/integration/test/snippets/charsets.cpp
+++ b/test/integration/test/snippets/charsets.cpp
@@ -13,6 +13,8 @@
 #include
 #include
 
+namespace mysql = boost::mysql;
+
 namespace {
 
 //[charsets_next_char
@@ -67,7 +69,7 @@ BOOST_AUTO_TEST_CASE(section_charsets)
 {
     {
         // Verify that utf8mb4_next_char can be used in a character_set
-        boost::mysql::character_set charset{"utf8mb4", utf8mb4_next_char};
+        mysql::character_set charset{"utf8mb4", utf8mb4_next_char};
 
         // It works for valid input
         unsigned char buff_valid[] = {0xc3, 0xb1, 0x50};