Commit

Charset fixes
anarthal committed Nov 14, 2024
1 parent d3a2981 commit aac17bb
Showing 2 changed files with 22 additions and 21 deletions.
39 changes: 19 additions & 20 deletions doc/qbk/17_charsets.qbk
@@ -26,14 +26,16 @@ The [*connection's character set determines the encoding for character strings
sent to and retrieved from the server].
This includes SQL query strings, string fields and column names in metadata.
The connection's collation is used for string literal comparison.
+The connection's character set and collation can be changed dynamically
+using SQL.

By default, Boost.MySQL connections use `utf8mb4_general_ci`,
-thus [*using UTF-8 for all strings]. We recommend using the default,
-since MySQL character sets are easy to get wrong.
+thus [*using UTF-8 for all strings]. We recommend using this default,
+as MySQL character sets are easy to get wrong.

The connection's character set is not linked to the character set
-specified for databases, tables and columns. For example,
-with the following declaration:
+specified for databases, tables and columns.
+Consider the following declaration:

```
CREATE TABLE test_table(
@@ -62,16 +64,16 @@ of what's affected:

* SQL query strings passed to [refmemunq any_connection async_execute] and
[refmemunq any_connection async_prepare_statement] must be sent using
-the connection's charset. Otherwise, syntax errors may happen.
+the connection's character set. Otherwise, server-side parsing errors may happen.
* SQL templates and string values passed to [reflink with_params]
-and [reflink format_sql] must be encoded using the connection's charset.
+and [reflink format_sql] must be encoded using the connection's character set.
Otherwise, values will be rejected by Boost.MySQL when composing the query.
Connections [link mysql.charsets.tracking track the character set in use] to detect these errors.
If you bypass character set tracking (e.g. by using `SET NAMES` instead of
-[refmemunq async_set_character_set]), you may run into vulnerabilities.
-* Statement string parameters passed to [refmem statement bind] should use the connection's charset.
+[refmemunq any_connection async_set_character_set]), you may run into vulnerabilities.
+* Statement string parameters passed to [refmem statement bind] should use the connection's character set.
Otherwise, MySQL may reject the values.
-* String values in rows and metadata retrieved from the server use the connection's charset.
+* String values in rows and metadata retrieved from the server use the connection's character set.
* Server-supplied diagnostic messages ([refmem diagnostics server_message]) also
use the connection's character set.

@@ -92,10 +94,10 @@ stick to the following advice:
If you need to use a different encoding in your application, convert your data to/from UTF-8
when interacting with the server. The default [reflink connect_params] ensure that UTF-8 is
used, without the need to run any SQL.
-* [*Don't execute SET NAMES] statements or the `character_set_client` and
+* [*Don't execute SET NAMES] statements or change the `character_set_client` and
`character_set_results` session variables using `async_execute`.
This breaks character set tracking, which can lead to vulnerabilities.
-* Don't use [refmemunq async_reset_connection] unless you know what you're doing.
+* Don't use [refmemunq any_connection async_reset_connection] unless you know what you're doing.
If you need to reuse connections, use [reflink connection_pool], instead.
* Connections obtained from a [reflink connection_pool] always use `utf8mb4`.
When connections are returned to the pool, their character set is reset to `utf8mb4`.
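
The advice in this hunk to convert application data to/from UTF-8 when talking to the server could look roughly like the sketch below. It is not part of the commit or of Boost.MySQL: the name `latin1_to_utf8_sketch` is hypothetical, and the code assumes the application holds ISO-8859-1 text, where each byte equals the Unicode code point of the same value. MySQL's own `latin1` is documented as closer to Windows-1252, so bytes in the 0x80-0x9F range would likely need a lookup table instead.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a Boost.MySQL API): converts ISO-8859-1 bytes to UTF-8.
// Assumption: every input byte is the Unicode code point of the same value.
std::string latin1_to_utf8_sketch(const std::string& input)
{
    std::string out;
    out.reserve(input.size() * 2);  // worst case: every byte becomes two bytes
    for (unsigned char c : input)
    {
        if (c < 0x80)
        {
            // ASCII range is encoded identically in UTF-8
            out.push_back(static_cast<char>(c));
        }
        else
        {
            // Code points U+0080..U+00FF take two bytes: 110xxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xc0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3f)));
        }
    }
    return out;
}

// Usage example: 0xF1 (n with tilde in ISO-8859-1) becomes the bytes 0xC3 0xB1
inline void latin1_to_utf8_selfcheck()
{
    assert(latin1_to_utf8_sketch("\xf1") == "\xc3\xb1");
    assert(latin1_to_utf8_sketch("abc") == "abc");
}
```

The converted string can then be passed to the connection as-is, since the default connection character set is `utf8mb4`.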
@@ -113,7 +115,6 @@ There is a number of actions that can change the connection's character set:
The [include_file boost/mysql/mysql_collations.hpp] and
[include_file boost/mysql/mariadb_collations.hpp] headers contain
available collation IDs.

If the server recognizes the passed collation, the connection's character set
will be the one associated to the collation. If it doesn't, the connection
[*will silently fall back to the server's default character set] (usually `latin1`, which is not Unicode).
@@ -159,11 +160,11 @@ Following the above points, this is how tracking works:
sets the current character set to the passed one.
The same applies for a successful set character set pipeline stage.
* Calling [refmemunq any_connection async_reset_connection]
-makes the current character set to unknown.
+makes the current character set unknown.

[warning
[*Do not execute `SET NAMES`], `SET CHARACTER SET` or any other SQL statement
-that modifies `character_set_client` using `execute`. This will make character set
+that modifies `character_set_client` using `async_execute`. This will make character set
information stored in the client invalid.
]

@@ -206,23 +207,21 @@ for a full implementation.
Setting the connection's character set during connection establishment
or using [refmemunq any_connection async_set_character_set] has the ultimate
effect of changing some session variables. This section lists them as
-a reference. We [*strongly encourage not modifying them manually],
+a reference. We [*strongly encourage you not to modify them manually],
as this will confuse character set tracking.

* [mysqllink server-system-variables.html#sysvar_character_set_client character_set_client]
determines the encoding that SQL statements sent to the server should have. This includes
the SQL strings passed to [refmemunq any_connection async_execute] and
[refmemunq any_connection async_prepare_statement], and
string parameters passed to [refmem statement bind].

-Not all character sets are permissible in `character_set_client`. The server will accept setting
-this variable to any UTF-8 character set, but won't accept UTF-16.
+Not all character sets are permissible in `character_set_client`.
+For example, UTF-16 and UTF-32 based character sets won't be accepted.
* [mysqllink server-system-variables.html#sysvar_character_set_results character_set_results]
determines the encoding that the server will use to send any kind of result, including
string fields retrieved by [refmem connection execute], metadata
like [refmem metadata column_name] and error messages.

-Note that [refmem metadata column_collation] reflects the charset and collation the server
+Note that [refmem metadata column_collation] reflects the character set and collation the server
has converted the column to before sending it to the client. In the above example, `metadata::column_collation`
will be the default collation for UTF16, rather than `latin1_swedish_ci`.

4 changes: 3 additions & 1 deletion test/integration/test/snippets/charsets.cpp
@@ -13,6 +13,8 @@
#include <cassert>
#include <cstddef>

+namespace mysql = boost::mysql;
+
namespace {

//[charsets_next_char
@@ -67,7 +69,7 @@ BOOST_AUTO_TEST_CASE(section_charsets)
{
{
// Verify that utf8mb4_next_char can be used in a character_set
-boost::mysql::character_set charset{"utf8mb4", utf8mb4_next_char};
+mysql::character_set charset{"utf8mb4", utf8mb4_next_char};

// It works for valid input
unsigned char buff_valid[] = {0xc3, 0xb1, 0x50};
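
The body of `utf8mb4_next_char` is not visible in this hunk. As a rough idea of what such a function might do, here is a minimal standalone sketch. It assumes the function should return the byte length of the first character, or 0 when the input is empty, truncated or malformed. The name `utf8_next_char_sketch` and the pointer-plus-size signature are hypothetical simplifications, so the actual snippet in the repository may differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch, not the repository's implementation.
// Returns the length in bytes of the first UTF-8 character in [data, data + size),
// or 0 if the input is empty, truncated, or starts with an invalid sequence.
// Simplification: overlong 3/4-byte forms and surrogates are NOT rejected.
std::size_t utf8_next_char_sketch(const unsigned char* data, std::size_t size) noexcept
{
    if (size == 0)
        return 0;

    // The lead byte determines the expected sequence length
    const unsigned char lead = data[0];
    std::size_t len = 0;
    if (lead < 0x80)
        len = 1;  // ASCII
    else if (lead >= 0xc2 && lead <= 0xdf)
        len = 2;  // 0xc0 and 0xc1 would be overlong, so they are excluded
    else if (lead >= 0xe0 && lead <= 0xef)
        len = 3;
    else if (lead >= 0xf0 && lead <= 0xf4)
        len = 4;
    else
        return 0;  // continuation byte or invalid lead byte

    // Not enough bytes for the announced length
    if (size < len)
        return 0;

    // All remaining bytes must be continuation bytes (10xxxxxx)
    for (std::size_t i = 1; i < len; ++i)
    {
        if ((data[i] & 0xc0) != 0x80)
            return 0;
    }
    return len;
}

// Usage example mirroring the valid buffer used in the test above
inline void utf8_next_char_selfcheck()
{
    const unsigned char valid[] = {0xc3, 0xb1, 0x50};
    assert(utf8_next_char_sketch(valid, sizeof(valid)) == 2);
    const unsigned char truncated[] = {0xc3};
    assert(utf8_next_char_sketch(truncated, sizeof(truncated)) == 0);
}
```

Returning 0 for any malformed input keeps the caller's loop simple: it can stop at the first zero instead of distinguishing error kinds.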