Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVRO-4026: [c++] Add new methods to CustomAttributes to allow non-string custom attributes in schemas #3266

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

jhump
Copy link

@jhump jhump commented Dec 16, 2024

This adds new methods to CustomAttributes to allow setting non-string values. These other methods work with JSON-encoded strings.

Unlike #3064 and #3069, this change attempts to be backwards compatible. However, from reading more comments in pull requests, it looks like the "fix" I added (to escape the keys and values in custom attributes when printing to JSON) may actually be a compatibility issue since it seems that users were expected to have to escape string values if they contained any characters that would be escaped in JSON (including quotes). That seems like a really terrible API, and it also meant that the values would not round-trip correctly: reading a data file would not create custom attributes with these strings properly escaped, so later writing out data with the same schema would generate an invalid schema JSON document.

In any event, this uses strings as the values even though it would be ideal if we could pass some sort of structured data as the value type. The ideal types (json::Entity and its accompanying json::Object and json::Array types) are defined in json/JsonDom.hh. But that header files is not part of the Avro include files distribution, which means we cannot #include it from CustomAttributes.hh, so it's a no-go. From a little history spelunking, I see that they indeed used to use a structured form which was simplified to strings in #1821, purely because these JSON header files aren't available to users in the Avro distribution.

Alternatives that I considered for using JSON-encoded strings:

  • Include the JSON types in the Avro distribution. I think this would be as simple as moving the header files out of lang/c++/impl and into lang/c++/include/avro. But then we expand the public API of too much. This approach was already tried and rejected in AVRO-3601: C++ API header contains breaking include #1820.
  • Use a wrapper like std::any as the value type. This can be #included in CustomAttributes.hh but eliminates type safety in the signature. The only concrete accepted value would likely be json::Entity -- though we could make it more sophisticated and also allow the various value types sans wrapper: std::string, bool, int64_t, double, json::Array (aka std::vector<json::Entity>), and json::Object (aka std::map<std::string, json::Entity>). But this isn't really usable by external/user code, at least not for any composite values, since they aren't able to include the JSON headers and then produce valid values of json::Entity.
  • Convert from json::Entity to some other structured type that is defined in CustomAttributes.hh. This could possibly be a concrete std::variant that allows the various options (and could use std::monostate to represent JSON null values). This introduces non-trivial conversion code. From a performance perspective, it could likely be better than converting to/from strings, but it's a non-trivial amount of new-new code to maintain, which didn't feel right.

What is the purpose of the change

Fixes exception from C++ library when compiling schemas with non-string custom attributes.

Verifying this change

This change added tests and can be verified as follows:

  • Updates existing unit test cases to use the new method and use non-string custom attributes.

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? not documented

@github-actions github-actions bot added the C++ Pull Requests for C++ binding label Dec 16, 2024
…deprecated versions (except one case, testing the deprecated ones); add test cases specifically for deprecated methods
@martin-g
Copy link
Member

@wgtmac Do you have an opinion about these changes ?
There are two more PR about the same (#3064 and #3069). We try to find the best one of them.

@wgtmac
Copy link
Member

wgtmac commented Dec 19, 2024

I'm not fully familiar with the context (and the code) yet. I agree that keeping backward compatibility is the bottom line. However, from the JIRA issue it seems that non-string attributes were supported once but broken after fixing other issues. Shouldn't we just fix it to revert the unexpected behavior? Except the alternatives list above, is it possible to add an option to CustomAttributes to control its behavior of dealing with non-string values?

@jhump
Copy link
Author

jhump commented Dec 19, 2024

Except the alternatives list above, is it possible to add an option to CustomAttributes to control its behavior of dealing with non-string values?

Most certainly. I hadn't considered it, but it would be easy to change to that. Would like you me to apply that suggestion in this PR? For back-compat, the "default" mode would have to be the awkward (arguably broken) string-only mode, that just wraps the input in double-quotes (with no other escaping). But a new constructor could be added to enable a mode where the API expects well-formed JSON-encoded data (so string values must be quoted and escaped).

@jhump
Copy link
Author

jhump commented Jan 6, 2025

@wgtmac, I've updated this branch so that instead of deprecating the old methods and adding new ones, there's now a constructor flag to indicate whether values are strings or arbitrary JSON values (in which case string values must be quoted).

In updating tests, I realized that my code wasn't correctly validating that string values were correct -- if it had a botched/unescaped quote, it wouldn't be flagged as an invalid value. So I added some stuff in the json folder so that loadEntity makes sure that it exhausts all input and fails if there appears to be a subsequent value in the stream. This also caught an existing issue in a schema string literal in SchemaTests.cc.

@jhump
Copy link
Author

jhump commented Jan 6, 2025

@wgtmac, if you don't like the look of this API -- using a separate constructor with a bool flag -- let me know. I'm happy to roll back that commit or continue to iterate on this.

@wgtmac
Copy link
Member

wgtmac commented Jan 7, 2025

@jhump I'm not a maintainer so my opinion might not be thorough.

@martin-g @thiru-mg WDYT?

@martin-g
Copy link
Member

martin-g commented Jan 7, 2025

I am not C++ dev/user, so my opinion is even more irrelevant :-)
My 2c:

  • the PR becomes bigger and since there are not many C++ users reviewing/commenting on it the chances to be merged become smaller :-/
  • in Rust and Java I prefer to use a simple enum instead of boolean parameter. 1) it is more clear to read xyz(Enum.Variant1) instead of xyz(false); 2) it is easy to add more logic by adding more variants.

@jhump
Copy link
Author

jhump commented Jan 7, 2025

I am not C++ dev/user, so my opinion is even more irrelevant :-)

So who is the right person that we should tag, to review and approve this?

I prefer to use a simple enum instead of boolean parameter.

Sure, I can do that.

@martin-g
Copy link
Member

martin-g commented Jan 7, 2025

So who is the right person that we should tag, to review and approve this?

No one! The Avro team is notified anyway.

If no one from the Avro team merges the PR for some reasonable time then I could help with the merge only after at least two approvals from the Avro community.

@wgtmac
Copy link
Member

wgtmac commented Jan 8, 2025

I can help review it :)

@jhump
Copy link
Author

jhump commented Jan 8, 2025

@wgtmac, what do you think of the current approach? If I just replace the bool constructor parameter with an enum (as mentioned above), would that be pretty close to an acceptable patch? Any other concerns or feedback?

@wgtmac
Copy link
Member

wgtmac commented Jan 9, 2025

@jhump Yes, I think so.

@jhump
Copy link
Author

jhump commented Jan 9, 2025

@wgtmac, I've updated the PR with that suggestion. Please take a look. Thanks in advance!

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the change! I did a quick scan on this PR and left some comments about the expected behavior. I need to spend more time on the json impl in this code base.

@@ -19,27 +19,76 @@
#ifndef avro_CustomAttributes_hh__
#define avro_CustomAttributes_hh__

#include "Config.hh"
#include <iostream>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is it possible to remove this line? It looks weird that a public header exposes iostream.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. One of the methods of CustomAttributes uses it:

void printJson(std::ostream &os, const std::string &name) const;

Comment on lines 37 to 45
enum ValueMode {
// When a CustomAttributes is created using this mode, all values are strings.
// The value should not be quoted, but any interior quotes and special
// characters (such as newlines) must be escaped.
string,
// When a CustomAttributes is created using this mode, all values are JSON
// values. String values must be quoted and escaped.
json
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
enum ValueMode {
// When a CustomAttributes is created using this mode, all values are strings.
// The value should not be quoted, but any interior quotes and special
// characters (such as newlines) must be escaped.
string,
// When a CustomAttributes is created using this mode, all values are JSON
// values. String values must be quoted and escaped.
json
};
enum class ValueMode : uint8_t {
// When a CustomAttributes is created using this mode, all values are expected to be string.
// The value should not be quoted, but any interior quotes and special
// characters (such as newlines) must be escaped.
STRING,
// When a CustomAttributes is created using this mode, all values are standard JSON
// values. String values must be quoted and escaped.
JSON
};

I found that the styles of enum are not consistent. Personally I prefer upper-case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the docstring is user-facing, is it better to separate the expected behavior of write and read path? In a more complicated case, what is the expected behavior if ValueModes are different when reading and writing the same CustomAttributes?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the expected behavior if ValueModes are different when reading and writing the same CustomAttributes?

This isn't possible. The ValueMode is a property of the CustomAttributes, specified at construction time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't valueMode_ not serialized in the JSON string? Upon creating the CustomAttributes for deserialization, we don't know the ValueMode used to serialize it and only set ValueMode we choose to the constructor.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value mode simply determines how this CustomAttributes interprets values provided via setAttribute and in turn how they are returned from getAttributes and the attributes() map accessor. It's an entirely runtime property.

We don't need to know the value mode used to serialize a JSON schema. The JSON string is known to be valid JSON at this point., since the compiler will have already parsed it into an entity. It may have only string custom attributes or it might not. If it was serialized using "json" value mode, we can still process it with "string" value mode, as long as all custom attributes are strings. And we can always process any input in "json" value mode (the new non-deprecated mode).

The current logic (prior to this PR) cannot handle custom attributes unless they are strings. While this library cannot currently generate schemas that have non-string custom attributes (at least as of v1.11.2), all other Avro libraries can generate that sort of file. That's the bug we're trying to fix in this PR: this library currently throws an exception if an Avro file has a schema with non-string custom attributes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation! It sounds good to me.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the enum to use enum class and upper-case constant names in 2fd25e1


// Creates a new CustomAttributes object.
//
// If the given mode is string, all values must be strings. All values passed to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the ValueMode enum has good documentation, we don't have to repeat them here. Otherwise, it is easy to be out of sync.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trimmed away redundant comments in 0f0efd2.

// WILL be escaped and other special characters MAY be escaped.
//
// To support non-string values, use CustomAttributes(CustomAttributes::json) instead.
CustomAttributes() : CustomAttributes(string) {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know the default value is to keep backward-compatibility. However, we have to swap the default value in the future because the current behavior is wrong for non-string values. (I'm not well-versed in this codebase yet) Is it possible to figure out if the avro file is created by avro-cpp as well as its writer version? If yes, perhaps we can choose the correct default ValueMode when deserializing the CustomAttributes based on its writer version and use ValueMode::string by default for the writer. In this way, we won't create more problematic files after this release.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We always use ValueMode::json when we read an Avro file -- it doesn't matter how the file was created because the file always contains valid JSON in the schema. So if it was written with an older version, then it will only ever have string values in custom attributes. But that is fine to later load and read with ValueMode::json since that handles any kind of attribute value.

This is here for backwards compatibility of users of this API since this is a public type defined in a public header. So even though internal usages of CustomAttributes always should use ValueMode::json, that's not necessarily the case for external usages.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, that makes sense.

@@ -277,7 +277,7 @@ static void getCustomAttributes(const Object &m, CustomAttributes &customAttribu
const std::unordered_set<std::string> &kKnownFields = getKnownFields();
for (const auto &entry : m) {
if (kKnownFields.find(entry.first) == kKnownFields.end()) {
customAttributes.addAttribute(entry.first, entry.second.stringValue());
customAttributes.addAttribute(entry.first, entry.second.toString());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we change this? Is it a breaking change to downstream users who are calling stringValue?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is a method on json::Entity which I did not touch. This was the source of the exception: if there was a non-string value for a custom attribute, the stringValue method would throw an exception. But now that we support all kinds of values, we use its toString method, which will return properly formatted JSON string to represent whatever is the Entity's value.

// WILL be escaped and other special characters MAY be escaped.
//
// To support non-string values, use CustomAttributes(CustomAttributes::json) instead.
CustomAttributes() : CustomAttributes(string) {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we mark it as [[deprecated]] because we want discourage users to use this one?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's a very good idea. We might even want to deprecate the ValueMode::string option, too, because it's highly error-prone and is only present for backwards-compatibility with the old string-only behavior.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deprecated these elements in 0f0efd2.

I am testing using clang on OS X. I don't see any other deprecations in this C++ codebase, so I can't find any other examples of muting the compiler warnings that result from using the deprecated elements in the implementation and the tests. Let me know how I should handle this. For now, I've added clang-specific #pragma statements to eliminate the noisy warnings.

void checkCustomAttributes_getAttribute() {
CustomAttributes cf;
cf.addAttribute("field1", std::string("1"));
void checkCustomAttributes_addAndGetAttributeJson() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO these cases are not sufficient. We need to cover the expected behavior of the following cases (even exceptions are expected in some of them):

Read \ Write string json
string ? ?
json ? ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the only thing missing from the matrix was the failure tests for passing a "string" value mode style value when mode is actually "json" and vice versa.

I updated the tests in e95f691, and they caught an issue: we were only handling TooManyExceptions when value mode was "string", but it could be triggered in either value mode. So I've fixed that in the same commit.

@@ -177,6 +177,12 @@ AVRO_DECL Entity loadEntity(const uint8_t *text, size_t len);

void writeEntity(JsonGenerator<JsonNullFormatter> &g, const Entity &n);

class AVRO_DECL TooManyValuesException : public virtual std::runtime_error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any other exception is defined except the one in the Exception.hh. Should we simply stick to Exception to be consistent?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, we catch this type of exception. If we just threw Exception, would the catch code be expected to parse the string message to assess the specific error condition? That seems like an anti-pattern.

@@ -92,7 +92,11 @@ Entity loadEntity(const char *text) {
Entity loadEntity(InputStream &in) {
JsonParser p;
p.init(in);
return readEntity(p);
Entity e = readEntity(p);
if (p.hasMore()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not checking it in the readEntity function?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just from the names and signatures, it looked like readValue could potentially be used to read multiple JSON entities from a stream. In that case, the caller has a reference to the JsonParser and might re-use it for subsequent data in the input.

But for this method, the caller can't safely do anything with the input after it's called, so seemed more intuitive that it should enforce that the entire stream is consumed.

None of these have doc comments, so it wasn't clear the intent everywhere, so I left things flexible and only changed this function, which appears safe to change without breaking any caller assumptions.

Copy link
Author

@jhump jhump left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wgtmac, thanks for the review! Sorry for the delay in this reply. I haven't pushed any changes, but I will address some of the comments in my next push, later today.

@@ -277,7 +277,7 @@ static void getCustomAttributes(const Object &m, CustomAttributes &customAttribu
const std::unordered_set<std::string> &kKnownFields = getKnownFields();
for (const auto &entry : m) {
if (kKnownFields.find(entry.first) == kKnownFields.end()) {
customAttributes.addAttribute(entry.first, entry.second.stringValue());
customAttributes.addAttribute(entry.first, entry.second.toString());
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is a method on json::Entity which I did not touch. This was the source of the exception: if there was a non-string value for a custom attribute, the stringValue method would throw an exception. But now that we support all kinds of values, we use its toString method, which will return properly formatted JSON string to represent whatever is the Entity's value.

@@ -92,7 +92,11 @@ Entity loadEntity(const char *text) {
Entity loadEntity(InputStream &in) {
JsonParser p;
p.init(in);
return readEntity(p);
Entity e = readEntity(p);
if (p.hasMore()) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just from the names and signatures, it looked like readValue could potentially be used to read multiple JSON entities from a stream. In that case, the caller has a reference to the JsonParser and might re-use it for subsequent data in the input.

But for this method, the caller can't safely do anything with the input after it's called, so seemed more intuitive that it should enforce that the entire stream is consumed.

None of these have doc comments, so it wasn't clear the intent everywhere, so I left things flexible and only changed this function, which appears safe to change without breaking any caller assumptions.

@@ -177,6 +177,12 @@ AVRO_DECL Entity loadEntity(const uint8_t *text, size_t len);

void writeEntity(JsonGenerator<JsonNullFormatter> &g, const Entity &n);

class AVRO_DECL TooManyValuesException : public virtual std::runtime_error {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, we catch this type of exception. If we just threw Exception, would the catch code be expected to parse the string message to assess the specific error condition? That seems like an anti-pattern.

Comment on lines 37 to 45
enum ValueMode {
// When a CustomAttributes is created using this mode, all values are strings.
// The value should not be quoted, but any interior quotes and special
// characters (such as newlines) must be escaped.
string,
// When a CustomAttributes is created using this mode, all values are JSON
// values. String values must be quoted and escaped.
json
};
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the expected behavior if ValueModes are different when reading and writing the same CustomAttributes?

This isn't possible. The ValueMode is a property of the CustomAttributes, specified at construction time.

// WILL be escaped and other special characters MAY be escaped.
//
// To support non-string values, use CustomAttributes(CustomAttributes::json) instead.
CustomAttributes() : CustomAttributes(string) {}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We always use ValueMode::json when we read an Avro file -- it doesn't matter how the file was created because the file always contains valid JSON in the schema. So if it was written with an older version, then it will only ever have string values in custom attributes. But that is fine to later load and read with ValueMode::json since that handles any kind of attribute value.

This is here for backwards compatibility of users of this API since this is a public type defined in a public header. So even though internal usages of CustomAttributes always should use ValueMode::json, that's not necessarily the case for external usages.

// WILL be escaped and other special characters MAY be escaped.
//
// To support non-string values, use CustomAttributes(CustomAttributes::json) instead.
CustomAttributes() : CustomAttributes(string) {}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's a very good idea. We might even want to deprecate the ValueMode::string option, too, because it's highly error-prone and is only present for backwards-compatibility with the old string-only behavior.

@jhump
Copy link
Author

jhump commented Jan 24, 2025

@wgtmac, I just pushed some revisions that I think address all of your feedback and include all the changes you requested. Please take another look.

@wgtmac
Copy link
Member

wgtmac commented Jan 25, 2025

Thanks for the update! I will take a look after I'm back from vacation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C++ Pull Requests for C++ binding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants