-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AVRO-4026: [c++] Add new methods to CustomAttributes to allow non-string custom attributes in schemas #3266
base: main
Are you sure you want to change the base?
Conversation
…deprecated versions (except one case, testing the deprecated ones); add test cases specifically for deprecated methods
I'm not fully familiar with the context (and the code) yet. I agree that keeping backward compatibility is the bottom line. However, from the JIRA issue it seems that non-string attributes were supported once but broken after fixing other issues. Shouldn't we just fix it to revert the unexpected behavior? Except the alternatives list above, is it possible to add an option to |
Most certainly. I hadn't considered it, but it would be easy to change to that. Would like you me to apply that suggestion in this PR? For back-compat, the "default" mode would have to be the awkward (arguably broken) string-only mode, that just wraps the input in double-quotes (with no other escaping). But a new constructor could be added to enable a mode where the API expects well-formed JSON-encoded data (so string values must be quoted and escaped). |
@wgtmac, I've updated this branch so that instead of deprecating the old methods and adding new ones, there's now a constructor flag to indicate whether values are strings or arbitrary JSON values (in which case string values must be quoted). In updating tests, I realized that my code wasn't correctly validating that string values were correct -- if it had a botched/unescaped quote, it wouldn't be flagged as an invalid value. So I added some stuff in the json folder so that |
@wgtmac, if you don't like the look of this API -- using a separate constructor with a bool flag -- let me know. I'm happy to roll back that commit or continue to iterate on this. |
I am not C++ dev/user, so my opinion is even more irrelevant :-)
|
So who is the right person that we should tag, to review and approve this?
Sure, I can do that. |
No one! The Avro team is notified anyway. If no one from the Avro team merges the PR for some reasonable time then I could help with the merge only after at least two approvals from the Avro community. |
I can help review it :) |
@wgtmac, what do you think of the current approach? If I just replace the bool constructor parameter with an enum (as mentioned above), would that be pretty close to an acceptable patch? Any other concerns or feedback? |
@jhump Yes, I think so. |
@wgtmac, I've updated the PR with that suggestion. Please take a look. Thanks in advance! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the change! I did a quick scan on this PR and left some comments about the expected behavior. I need to spend more time on the json impl in this code base.
@@ -19,27 +19,76 @@ | |||
#ifndef avro_CustomAttributes_hh__ | |||
#define avro_CustomAttributes_hh__ | |||
|
|||
#include "Config.hh" | |||
#include <iostream> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: is it possible to remove this line? It looks weird that a public header exposes iostream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. One of the methods of CustomAttributes
uses it:
void printJson(std::ostream &os, const std::string &name) const;
enum ValueMode { | ||
// When a CustomAttributes is created using this mode, all values are strings. | ||
// The value should not be quoted, but any interior quotes and special | ||
// characters (such as newlines) must be escaped. | ||
string, | ||
// When a CustomAttributes is created using this mode, all values are JSON | ||
// values. String values must be quoted and escaped. | ||
json | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
enum ValueMode { | |
// When a CustomAttributes is created using this mode, all values are strings. | |
// The value should not be quoted, but any interior quotes and special | |
// characters (such as newlines) must be escaped. | |
string, | |
// When a CustomAttributes is created using this mode, all values are JSON | |
// values. String values must be quoted and escaped. | |
json | |
}; | |
enum class ValueMode : uint8_t { | |
// When a CustomAttributes is created using this mode, all values are expected to be string. | |
// The value should not be quoted, but any interior quotes and special | |
// characters (such as newlines) must be escaped. | |
STRING, | |
// When a CustomAttributes is created using this mode, all values are standard JSON | |
// values. String values must be quoted and escaped. | |
JSON | |
}; |
I found that the styles of enum are not consistent. Personally I prefer upper-case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the docstring is user-facing, is it better to separate the expected behavior of write and read path? In a more complicated case, what is the expected behavior if ValueMode
s are different when reading and writing the same CustomAttributes
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the expected behavior if
ValueMode
s are different when reading and writing the sameCustomAttributes
?
This isn't possible. The ValueMode
is a property of the CustomAttributes
, specified at construction time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't valueMode_
not serialized in the JSON string? Upon creating the CustomAttributes
for deserialization, we don't know the ValueMode
used to serialize it and only set ValueMode
we choose to the constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The value mode simply determines how this CustomAttributes
interprets values provided via setAttribute
and in turn how they are returned from getAttributes
and the attributes()
map accessor. It's an entirely runtime property.
We don't need to know the value mode used to serialize a JSON schema. The JSON string is known to be valid JSON at this point., since the compiler will have already parsed it into an entity. It may have only string custom attributes or it might not. If it was serialized using "json" value mode, we can still process it with "string" value mode, as long as all custom attributes are strings. And we can always process any input in "json" value mode (the new non-deprecated mode).
The current logic (prior to this PR) cannot handle custom attributes unless they are strings. While this library cannot currently generate schemas that have non-string custom attributes (at least as of v1.11.2), all other Avro libraries can generate that sort of file. That's the bug we're trying to fix in this PR: this library currently throws an exception if an Avro file has a schema with non-string custom attributes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation! It sounds good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the enum to use enum class
and upper-case constant names in 2fd25e1
|
||
// Creates a new CustomAttributes object. | ||
// | ||
// If the given mode is string, all values must be strings. All values passed to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the ValueMode
enum has good documentation, we don't have to repeat them here. Otherwise, it is easy to be out of sync.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trimmed away redundant comments in 0f0efd2.
// WILL be escaped and other special characters MAY be escaped. | ||
// | ||
// To support non-string values, use CustomAttributes(CustomAttributes::json) instead. | ||
CustomAttributes() : CustomAttributes(string) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know the default value is to keep backward-compatibility. However, we have to swap the default value in the future because the current behavior is wrong for non-string values. (I'm not well-versed in this codebase yet) Is it possible to figure out if the avro file is created by avro-cpp as well as its writer version? If yes, perhaps we can choose the correct default ValueMode
when deserializing the CustomAttributes based on its writer version and use ValueMode::string
by default for the writer. In this way, we won't create more problematic files after this release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We always use ValueMode::json
when we read an Avro file -- it doesn't matter how the file was created because the file always contains valid JSON in the schema. So if it was written with an older version, then it will only ever have string values in custom attributes. But that is fine to later load and read with ValueMode::json
since that handles any kind of attribute value.
This is here for backwards compatibility of users of this API since this is a public type defined in a public header. So even though internal usages of CustomAttributes
always should use ValueMode::json
, that's not necessarily the case for external usages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, that makes sense.
@@ -277,7 +277,7 @@ static void getCustomAttributes(const Object &m, CustomAttributes &customAttribu | |||
const std::unordered_set<std::string> &kKnownFields = getKnownFields(); | |||
for (const auto &entry : m) { | |||
if (kKnownFields.find(entry.first) == kKnownFields.end()) { | |||
customAttributes.addAttribute(entry.first, entry.second.stringValue()); | |||
customAttributes.addAttribute(entry.first, entry.second.toString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we change this? Is it a breaking change to downstream users who are calling stringValue
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this is a method on json::Entity
which I did not touch. This was the source of the exception: if there was a non-string value for a custom attribute, the stringValue
method would throw an exception. But now that we support all kinds of values, we use its toString
method, which will return properly formatted JSON string to represent whatever is the Entity
's value.
// WILL be escaped and other special characters MAY be escaped. | ||
// | ||
// To support non-string values, use CustomAttributes(CustomAttributes::json) instead. | ||
CustomAttributes() : CustomAttributes(string) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we mark it as [[deprecated]]
because we want discourage users to use this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's a very good idea. We might even want to deprecate the ValueMode::string
option, too, because it's highly error-prone and is only present for backwards-compatibility with the old string-only behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deprecated these elements in 0f0efd2.
I am testing using clang on OS X. I don't see any other deprecations in this C++ codebase, so I can't find any other examples of muting the compiler warnings that result from using the deprecated elements in the implementation and the tests. Let me know how I should handle this. For now, I've added clang-specific #pragma
statements to eliminate the noisy warnings.
void checkCustomAttributes_getAttribute() { | ||
CustomAttributes cf; | ||
cf.addAttribute("field1", std::string("1")); | ||
void checkCustomAttributes_addAndGetAttributeJson() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO these cases are not sufficient. We need to cover the expected behavior of the following cases (even exceptions are expected in some of them):
Read \ Write | string | json |
---|---|---|
string | ? | ? |
json | ? | ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the only thing missing from the matrix was the failure tests for passing a "string" value mode style value when mode is actually "json" and vice versa.
I updated the tests in e95f691, and they caught an issue: we were only handling TooManyExceptions
when value mode was "string", but it could be triggered in either value mode. So I've fixed that in the same commit.
@@ -177,6 +177,12 @@ AVRO_DECL Entity loadEntity(const uint8_t *text, size_t len); | |||
|
|||
void writeEntity(JsonGenerator<JsonNullFormatter> &g, const Entity &n); | |||
|
|||
class AVRO_DECL TooManyValuesException : public virtual std::runtime_error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any other exception is defined except the one in the Exception.hh
. Should we simply stick to Exception
to be consistent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, we catch this type of exception. If we just threw Exception
, would the catch code be expected to parse the string message to assess the specific error condition? That seems like an anti-pattern.
@@ -92,7 +92,11 @@ Entity loadEntity(const char *text) { | |||
Entity loadEntity(InputStream &in) { | |||
JsonParser p; | |||
p.init(in); | |||
return readEntity(p); | |||
Entity e = readEntity(p); | |||
if (p.hasMore()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not checking it in the readEntity
function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just from the names and signatures, it looked like readValue
could potentially be used to read multiple JSON entities from a stream. In that case, the caller has a reference to the JsonParser
and might re-use it for subsequent data in the input.
But for this method, the caller can't safely do anything with the input after it's called, so seemed more intuitive that it should enforce that the entire stream is consumed.
None of these have doc comments, so it wasn't clear the intent everywhere, so I left things flexible and only changed this function, which appears safe to change without breaking any caller assumptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wgtmac, thanks for the review! Sorry for the delay in this reply. I haven't pushed any changes, but I will address some of the comments in my next push, later today.
@@ -277,7 +277,7 @@ static void getCustomAttributes(const Object &m, CustomAttributes &customAttribu | |||
const std::unordered_set<std::string> &kKnownFields = getKnownFields(); | |||
for (const auto &entry : m) { | |||
if (kKnownFields.find(entry.first) == kKnownFields.end()) { | |||
customAttributes.addAttribute(entry.first, entry.second.stringValue()); | |||
customAttributes.addAttribute(entry.first, entry.second.toString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this is a method on json::Entity
which I did not touch. This was the source of the exception: if there was a non-string value for a custom attribute, the stringValue
method would throw an exception. But now that we support all kinds of values, we use its toString
method, which will return properly formatted JSON string to represent whatever is the Entity
's value.
@@ -92,7 +92,11 @@ Entity loadEntity(const char *text) { | |||
Entity loadEntity(InputStream &in) { | |||
JsonParser p; | |||
p.init(in); | |||
return readEntity(p); | |||
Entity e = readEntity(p); | |||
if (p.hasMore()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just from the names and signatures, it looked like readValue
could potentially be used to read multiple JSON entities from a stream. In that case, the caller has a reference to the JsonParser
and might re-use it for subsequent data in the input.
But for this method, the caller can't safely do anything with the input after it's called, so seemed more intuitive that it should enforce that the entire stream is consumed.
None of these have doc comments, so it wasn't clear the intent everywhere, so I left things flexible and only changed this function, which appears safe to change without breaking any caller assumptions.
@@ -177,6 +177,12 @@ AVRO_DECL Entity loadEntity(const uint8_t *text, size_t len); | |||
|
|||
void writeEntity(JsonGenerator<JsonNullFormatter> &g, const Entity &n); | |||
|
|||
class AVRO_DECL TooManyValuesException : public virtual std::runtime_error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, we catch this type of exception. If we just threw Exception
, would the catch code be expected to parse the string message to assess the specific error condition? That seems like an anti-pattern.
enum ValueMode { | ||
// When a CustomAttributes is created using this mode, all values are strings. | ||
// The value should not be quoted, but any interior quotes and special | ||
// characters (such as newlines) must be escaped. | ||
string, | ||
// When a CustomAttributes is created using this mode, all values are JSON | ||
// values. String values must be quoted and escaped. | ||
json | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the expected behavior if
ValueMode
s are different when reading and writing the sameCustomAttributes
?
This isn't possible. The ValueMode
is a property of the CustomAttributes
, specified at construction time.
// WILL be escaped and other special characters MAY be escaped. | ||
// | ||
// To support non-string values, use CustomAttributes(CustomAttributes::json) instead. | ||
CustomAttributes() : CustomAttributes(string) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We always use ValueMode::json
when we read an Avro file -- it doesn't matter how the file was created because the file always contains valid JSON in the schema. So if it was written with an older version, then it will only ever have string values in custom attributes. But that is fine to later load and read with ValueMode::json
since that handles any kind of attribute value.
This is here for backwards compatibility of users of this API since this is a public type defined in a public header. So even though internal usages of CustomAttributes
always should use ValueMode::json
, that's not necessarily the case for external usages.
// WILL be escaped and other special characters MAY be escaped. | ||
// | ||
// To support non-string values, use CustomAttributes(CustomAttributes::json) instead. | ||
CustomAttributes() : CustomAttributes(string) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's a very good idea. We might even want to deprecate the ValueMode::string
option, too, because it's highly error-prone and is only present for backwards-compatibility with the old string-only behavior.
…; revise comments
…ManyValuesException could occur with either value mode
@wgtmac, I just pushed some revisions that I think address all of your feedback and include all the changes you requested. Please take another look. |
Thanks for the update! I will take a look after I'm back from vacation. |
This adds new methods to
CustomAttributes
to allow setting non-string values. These other methods work with JSON-encoded strings.Unlike #3064 and #3069, this change attempts to be backwards compatible. However, from reading more comments in pull requests, it looks like the "fix" I added (to escape the keys and values in custom attributes when printing to JSON) may actually be a compatibility issue since it seems that users were expected to have to escape string values if they contained any characters that would be escaped in JSON (including quotes). That seems like a really terrible API, and it also meant that the values would not round-trip correctly: reading a data file would not create custom attributes with these strings properly escaped, so later writing out data with the same schema would generate an invalid schema JSON document.
In any event, this uses strings as the values even though it would be ideal if we could pass some sort of structured data as the value type. The ideal types (
json::Entity
and its accompanyingjson::Object
andjson::Array
types) are defined injson/JsonDom.hh
. But that header files is not part of the Avro include files distribution, which means we cannot#include
it fromCustomAttributes.hh
, so it's a no-go. From a little history spelunking, I see that they indeed used to use a structured form which was simplified to strings in #1821, purely because these JSON header files aren't available to users in the Avro distribution.Alternatives that I considered for using JSON-encoded strings:
lang/c++/impl
and intolang/c++/include/avro
. But then we expand the public API of too much. This approach was already tried and rejected in AVRO-3601: C++ API header contains breaking include #1820.std::any
as the value type. This can be#include
d inCustomAttributes.hh
but eliminates type safety in the signature. The only concrete accepted value would likely bejson::Entity
-- though we could make it more sophisticated and also allow the various value types sans wrapper:std::string
,bool
,int64_t
,double
,json::Array
(akastd::vector<json::Entity>
), andjson::Object
(akastd::map<std::string, json::Entity>
). But this isn't really usable by external/user code, at least not for any composite values, since they aren't able to include the JSON headers and then produce valid values ofjson::Entity
.json::Entity
to some other structured type that is defined inCustomAttributes.hh
. This could possibly be a concretestd::variant
that allows the various options (and could usestd::monostate
to represent JSON null values). This introduces non-trivial conversion code. From a performance perspective, it could likely be better than converting to/from strings, but it's a non-trivial amount of new-new code to maintain, which didn't feel right.What is the purpose of the change
Fixes exception from C++ library when compiling schemas with non-string custom attributes.
Verifying this change
This change added tests and can be verified as follows:
Documentation