AVRO-4026: [c++] Add new methods to CustomAttributes to allow non-string custom attributes in schemas #3266

Open
wants to merge 8 commits into
base: main
6 changes: 3 additions & 3 deletions lang/c++/impl/Compiler.cc
@@ -277,7 +277,7 @@ static void getCustomAttributes(const Object &m, CustomAttributes &customAttribu
const std::unordered_set<std::string> &kKnownFields = getKnownFields();
for (const auto &entry : m) {
if (kKnownFields.find(entry.first) == kKnownFields.end()) {
- customAttributes.addAttribute(entry.first, entry.second.stringValue());
+ customAttributes.addAttribute(entry.first, entry.second.toString());
Member:
Why do we change this? Is it a breaking change to downstream users who are calling stringValue?

Author:
No, this is a method on json::Entity, which I did not touch. This was the source of the exception: if a custom attribute had a non-string value, the stringValue method would throw. Now that we support all kinds of values, we use its toString method, which returns a properly formatted JSON string representing whatever the Entity's value is.
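
The difference between a strict accessor and a serializer that accepts any value can be sketched in isolation. This is a minimal illustration and not the real json::Entity: `MiniEntity`, `ofBool`, `ofString`, and `demoMiniEntity` are hypothetical names, only booleans and strings are modeled, and the escaping is deliberately simplified.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>
#include <variant>

// Hypothetical stand-in for a JSON value; this is NOT avro's json::Entity.
class MiniEntity {
public:
    static MiniEntity ofBool(bool b) { return MiniEntity(Value(b)); }
    static MiniEntity ofString(std::string s) { return MiniEntity(Value(std::move(s))); }

    // Strict accessor: throws unless the held value is a string.
    const std::string &stringValue() const {
        if (const auto *s = std::get_if<std::string>(&value_)) {
            return *s;
        }
        throw std::runtime_error("value is not a string");
    }

    // Total serializer: renders whatever is held as JSON text.
    std::string toString() const {
        if (const auto *b = std::get_if<bool>(&value_)) {
            return *b ? "true" : "false";
        }
        std::string out = "\"";
        for (char c : std::get<std::string>(value_)) {
            if (c == '"' || c == '\\') {
                out += '\\'; // escape quotes and backslashes only (simplified)
            }
            out += c;
        }
        return out + "\"";
    }

private:
    using Value = std::variant<bool, std::string>;
    explicit MiniEntity(Value v) : value_(std::move(v)) {}
    Value value_;
};

// Usage: a boolean serializes fine but cannot be read as a string.
void demoMiniEntity() {
    MiniEntity flag = MiniEntity::ofBool(true);
    assert(flag.toString() == "true");
    bool threw = false;
    try {
        flag.stringValue();
    } catch (const std::runtime_error &) {
        threw = true;
    }
    assert(threw);
}
```

Under these assumptions, a caller that only needs text to store or print can use the serializer for every value kind, while the strict accessor remains available where a string is genuinely required.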

}
}
}
@@ -300,7 +300,7 @@ static Field makeField(const Entity &e, SymbolTable &st, const string &ns) {
}
GenericDatum d = (it2 == m.end()) ? GenericDatum() : makeGenericDatum(node, it2->second, st);
// Get custom attributes
- CustomAttributes customAttributes;
+ CustomAttributes customAttributes(CustomAttributes::json);
getCustomAttributes(m, customAttributes);
return Field(std::move(n), std::move(aliases), node, d, customAttributes);
}
@@ -424,7 +424,7 @@ static NodePtr makeArrayNode(const Entity &e, const Object &m,
if (containsField(m, "doc")) {
node->setDoc(getDocField(e, m));
}
- CustomAttributes customAttributes;
+ CustomAttributes customAttributes(CustomAttributes::json);
getCustomAttributes(m, customAttributes);
node->addCustomAttributesForField(customAttributes);
return node;
57 changes: 47 additions & 10 deletions lang/c++/impl/CustomAttributes.cc
@@ -16,26 +16,57 @@
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "CustomAttributes.hh"
#include "Exception.hh"
#include <map>
#include <memory>

#include "CustomAttributes.hh"
#include "Exception.hh"

#include "json/JsonDom.hh"

namespace avro {

CustomAttributes::CustomAttributes(ValueMode valueMode) {
switch (valueMode) {
case CustomAttributes::string:
case CustomAttributes::json:
valueMode_ = valueMode;
break;
default:
throw Exception("invalid ValueMode: " + std::to_string(valueMode));
}
}

std::optional<std::string> CustomAttributes::getAttribute(const std::string &name) const {
std::optional<std::string> result;
std::map<std::string, std::string>::const_iterator iter =
attributes_.find(name);
auto iter = attributes_.find(name);
if (iter == attributes_.end()) {
return result;
return {};
}
result = iter->second;
return result;
return iter->second;
}

void CustomAttributes::addAttribute(const std::string &name,
const std::string &value) {
// Validate the incoming value.
//
// NOTE: This is a bit annoying that we accept the data as a string instead of
// as an Entity. That means the compiler must convert the value to a string only
// for this method to convert it back. But we can't directly refer to the
// json::Entity type in the signatures for this class (and thus cannot accept
// that type directly as a parameter) because then it would need to be included
// from a header file: CustomAttributes.hh. But the json header files are not
// part of the Avro distribution, so CustomAttributes.hh cannot #include any of
// the json header files.
if (valueMode_ == CustomAttributes::string) {
try {
json::loadEntity(("\"" + value + "\"").c_str());
} catch (json::TooManyValuesException e) {
throw Exception("string has malformed or missing escapes");
}
} else {
json::loadEntity(value.c_str());
}

auto iter_and_find =
attributes_.insert(std::pair<std::string, std::string>(name, value));
if (!iter_and_find.second) {
@@ -45,9 +76,15 @@ void CustomAttributes::addAttribute(const std::string &name,

void CustomAttributes::printJson(std::ostream &os,
const std::string &name) const {
if (attributes().find(name) == attributes().end()) {
auto iter = attributes_.find(name);
if (iter == attributes_.end()) {
throw Exception(name + " doesn't exist");
}
os << "\"" << name << "\": \"" << attributes().at(name) << "\"";
os << json::Entity(std::make_shared<std::string>(name)).toString() << ": ";
if (valueMode_ == CustomAttributes::string) {
os << "\"" << iter->second << "\"";
} else {
os << iter->second;
}
}
} // namespace avro
6 changes: 5 additions & 1 deletion lang/c++/impl/json/JsonDom.cc
@@ -92,7 +92,11 @@ Entity loadEntity(const char *text) {
Entity loadEntity(InputStream &in) {
JsonParser p;
p.init(in);
return readEntity(p);
Entity e = readEntity(p);
if (p.hasMore()) {
Member:
Why not check it in the readEntity function?

Author:
Just from the names and signatures, it looked like readValue could potentially be used to read multiple JSON entities from a stream. In that case, the caller holds a reference to the JsonParser and might reuse it for subsequent data in the input.

But with this function, the caller can't safely do anything with the input after the call, so it seemed more intuitive for it to enforce that the entire stream is consumed.

None of these functions have doc comments, so the intent wasn't always clear. I left things flexible and changed only this function, which appears safe to change without breaking any caller assumptions.
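
The "read one value, then require that only whitespace remains" rule can be sketched without the real parser. This is a toy under stated assumptions: a "value" here is just one run of non-whitespace characters, and `hasNonWhitespaceAfter` and `loadSingleWord` are hypothetical helpers, not functions from JsonIO or JsonDom.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// True if anything other than whitespace remains at or after `pos`.
bool hasNonWhitespaceAfter(const std::string &text, std::size_t pos) {
    for (std::size_t i = pos; i < text.size(); ++i) {
        if (!std::isspace(static_cast<unsigned char>(text[i]))) {
            return true;
        }
    }
    return false;
}

// Toy loader: a "value" is one run of non-whitespace characters.
// Trailing whitespace is accepted; any further content is an error.
std::string loadSingleWord(const std::string &text) {
    std::size_t i = 0;
    while (i < text.size() && std::isspace(static_cast<unsigned char>(text[i]))) {
        ++i; // skip leading whitespace
    }
    const std::size_t start = i;
    while (i < text.size() && !std::isspace(static_cast<unsigned char>(text[i]))) {
        ++i; // consume the value itself
    }
    if (hasNonWhitespaceAfter(text, i)) {
        throw std::runtime_error("expected a single value but found more input");
    }
    return text.substr(start, i - start);
}

// Usage: trailing whitespace is fine, a second value is rejected.
void demoLoadSingleWord() {
    assert(loadSingleWord("  true \n") == "true");
    bool threw = false;
    try {
        loadSingleWord("true false");
    } catch (const std::runtime_error &) {
        threw = true;
    }
    assert(threw);
}
```

Keeping the end-of-input check in the top-level loader, as in this sketch, leaves the lower-level reader free to be called repeatedly on one stream.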

throw TooManyValuesException();
}
return e;
}

Entity loadEntity(const uint8_t *text, size_t len) {
6 changes: 6 additions & 0 deletions lang/c++/impl/json/JsonDom.hh
@@ -177,6 +177,12 @@ AVRO_DECL Entity loadEntity(const uint8_t *text, size_t len);

void writeEntity(JsonGenerator<JsonNullFormatter> &g, const Entity &n);

class AVRO_DECL TooManyValuesException : public virtual std::runtime_error {
Member:
I don't see any other exception defined apart from the one in Exception.hh. Should we simply stick to Exception for consistency?

Author:
Well, we catch this type of exception. If we just threw Exception, would the catch code be expected to parse the string message to assess the specific error condition? That seems like an anti-pattern.
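
The argument for a dedicated exception type can be sketched as follows. All names here (`TooManyValuesError`, `validateSingle`, `describeInput`) are hypothetical and only illustrate the pattern: the handler selects the error condition by type, never by inspecting the message, and unrelated errors keep propagating. The sketch catches by const reference, which is the usual C++ idiom.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical dedicated error type for one specific condition.
struct TooManyValuesError : std::runtime_error {
    TooManyValuesError() : std::runtime_error("more than one value") {}
};

// Hypothetical validator with two distinct failure modes.
void validateSingle(const std::string &text) {
    if (text.empty()) {
        throw std::invalid_argument("empty input");
    }
    if (text.find(' ') != std::string::npos) {
        throw TooManyValuesError();
    }
}

// Translates only the condition it understands; anything else propagates.
std::string describeInput(const std::string &text) {
    try {
        validateSingle(text);
        return "ok";
    } catch (const TooManyValuesError &) {
        return "malformed input: more than one value";
    }
}

// Usage: the specific error is handled, the unrelated one escapes.
void demoDescribeInput() {
    assert(describeInput("abc") == "ok");
    assert(describeInput("a b") == "malformed input: more than one value");
    bool propagated = false;
    try {
        describeInput("");
    } catch (const std::invalid_argument &) {
        propagated = true;
    }
    assert(propagated);
}
```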

public:
explicit TooManyValuesException() :
std::runtime_error("invalid JSON document: expecting a single JSON value but found more than one") {}
};

} // namespace json
} // namespace avro

22 changes: 22 additions & 0 deletions lang/c++/impl/json/JsonIO.cc
@@ -49,6 +49,28 @@ char JsonParser::next() {
return ch;
}

bool JsonParser::hasMore() {
if (peeked || hasNext && !isspace(nextChar)) {
Member:
Suggested change
- if (peeked || hasNext && !isspace(nextChar)) {
+ if (peeked || (hasNext && !isspace(nextChar))) {
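
In C++, && binds more tightly than ||, so the suggested parentheses do not change behavior; they make the grouping explicit, and some compilers warn about the unparenthesized form. The sketch below uses hypothetical helper names to show that the other possible grouping would give a different answer, which is why spelling it out helps readers.

```cpp
#include <cassert>

// The grouping the language already applies, written out explicitly.
constexpr bool groupedAsSuggested(bool peeked, bool hasNext, bool isSpace) {
    return peeked || (hasNext && !isSpace);
}

// The grouping a reader might mistakenly assume.
constexpr bool groupedOtherWay(bool peeked, bool hasNext, bool isSpace) {
    return (peeked || hasNext) && !isSpace;
}

// With a peeked token and a whitespace lookahead, the two readings differ.
static_assert(groupedAsSuggested(true, false, true), "a peeked token is enough");
static_assert(!groupedOtherWay(true, false, true), "the other grouping disagrees");

// Usage: both readings agree when a non-space character is pending.
void demoGrouping() {
    assert(groupedAsSuggested(false, true, false));
    assert(groupedOtherWay(false, true, false));
}
```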

return true;
}
if (!hasNext) {
nextChar = ' ';
}
// We return true if there are any more tokens; we ignore
// any trailing whitespace.
while (isspace(nextChar)) {
if (nextChar == '\n') {
line_++;
}
if (!in_.hasMore()) {
return false;
}
nextChar = in_.read();
}
hasNext = true;
return true;
}

void JsonParser::expectToken(Token tk) {
if (advance() != tk) {
if (tk == Token::Double) {
2 changes: 2 additions & 0 deletions lang/c++/impl/json/JsonIO.hh
@@ -116,6 +116,8 @@ public:
return curToken;
}

bool hasMore();
Member:
Is it possible to avoid adding hasMore() by reusing peek() above?


void expectToken(Token tk);

bool boolValue() const {
58 changes: 54 additions & 4 deletions lang/c++/include/avro/CustomAttributes.hh
@@ -19,27 +19,76 @@
#ifndef avro_CustomAttributes_hh__
#define avro_CustomAttributes_hh__

#include "Config.hh"
#include <iostream>
Member:
nit: is it possible to remove this line? It looks weird that a public header exposes iostream.

Author:
I don't think so. One of the methods of CustomAttributes uses it:

void printJson(std::ostream &os, const std::string &name) const;

#include <map>
#include <optional>
#include <string>

#include "Config.hh"

namespace avro {

// CustomAttributes class stores avro custom attributes.
// Each attribute is represented by a unique name and value.
// User is supposed to create CustomAttributes object and then add it to Schema.
class AVRO_DECL CustomAttributes {

public:
// Retrieves the custom attribute json entity for that attributeName, returns an
// null if the attribute doesn't exist.
enum ValueMode {
// When a CustomAttributes is created using this mode, all values are strings.
// The value should not be quoted, but any interior quotes and special
// characters (such as newlines) must be escaped.
string,
// When a CustomAttributes is created using this mode, all values are JSON
// values. String values must be quoted and escaped.
json
};
Member:
Suggested change
enum ValueMode {
// When a CustomAttributes is created using this mode, all values are strings.
// The value should not be quoted, but any interior quotes and special
// characters (such as newlines) must be escaped.
string,
// When a CustomAttributes is created using this mode, all values are JSON
// values. String values must be quoted and escaped.
json
};
enum class ValueMode : uint8_t {
// When a CustomAttributes is created using this mode, all values are expected to be string.
// The value should not be quoted, but any interior quotes and special
// characters (such as newlines) must be escaped.
STRING,
// When a CustomAttributes is created using this mode, all values are standard JSON
// values. String values must be quoted and escaped.
JSON
};

I found that the styles of enum are not consistent. Personally I prefer upper-case.
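
One practical consequence of the suggested `enum class` is that the values no longer convert implicitly to integers, so an expression like `std::to_string(valueMode)` needs an explicit cast. A minimal sketch, assuming the upper-case enumerator names from the suggestion; `describeMode` and `demoDescribeMode` are hypothetical helpers, not part of the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Scoped enum with a fixed underlying type, as in the suggestion.
enum class ValueMode : std::uint8_t { STRING, JSON };

// Hypothetical helper: names a mode, or reports an out-of-range value.
std::string describeMode(ValueMode mode) {
    switch (mode) {
    case ValueMode::STRING:
        return "STRING";
    case ValueMode::JSON:
        return "JSON";
    }
    // A scoped enum does not convert implicitly, so cast before formatting.
    return "invalid ValueMode: " + std::to_string(static_cast<unsigned>(mode));
}

// Usage: enumerators must be qualified with the enum name.
void demoDescribeMode() {
    assert(describeMode(ValueMode::STRING) == "STRING");
    assert(describeMode(ValueMode::JSON) == "JSON");
}
```

Scoping also keeps common words such as `string` and `json` out of the enclosing class scope, which is a frequent reason to prefer `enum class` in public headers.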

Member:
As the docstring is user-facing, would it be better to describe the expected behavior of the write path and the read path separately? In a more complicated case, what is the expected behavior if the ValueModes differ when reading and writing the same CustomAttributes?

Author:
what is the expected behavior if ValueModes are different when reading and writing the same CustomAttributes?

This isn't possible. The ValueMode is a property of the CustomAttributes, specified at construction time.

Member:
But valueMode_ isn't serialized in the JSON string, is it? When we create the CustomAttributes for deserialization, we don't know the ValueMode that was used to serialize it; we only pass the ValueMode we choose to the constructor.

Author:
The value mode simply determines how this CustomAttributes interprets values provided via addAttribute and, in turn, how they are returned from getAttribute and the attributes() map accessor. It's an entirely runtime property.

We don't need to know the value mode used to serialize a JSON schema. The JSON string is known to be valid JSON at this point, since the compiler will have already parsed it into an entity. It may have only string custom attributes or it might not. If it was serialized using "json" value mode, we can still process it with "string" value mode, as long as all custom attributes are strings. And we can always process any input in "json" value mode (the new non-deprecated mode).

The current logic (prior to this PR) cannot handle custom attributes unless they are strings. While this library cannot currently generate schemas that have non-string custom attributes (at least as of v1.11.2), all other Avro libraries can generate that sort of file. That's the bug we're trying to fix in this PR: this library currently throws an exception if an Avro file has a schema with non-string custom attributes.
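
The runtime role of the value mode can be sketched with a small stand-in class. This is an illustration of the behavior described above, not the real CustomAttributes: `MiniAttributes` and `Mode` are hypothetical, no validation of the stored text is performed, and the attribute name is not escaped.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

enum class Mode { STRING, JSON };

// Hypothetical stand-in: stores raw text and interprets it by mode on output.
class MiniAttributes {
public:
    explicit MiniAttributes(Mode mode) : mode_(mode) {}

    void add(const std::string &name, const std::string &value) {
        if (!attrs_.emplace(name, value).second) {
            throw std::runtime_error(name + " already exists");
        }
    }

    // STRING mode wraps the stored text in quotes; JSON mode emits it as-is.
    std::string printJson(const std::string &name) const {
        const auto it = attrs_.find(name);
        if (it == attrs_.end()) {
            throw std::runtime_error(name + " doesn't exist");
        }
        std::ostringstream os;
        os << '"' << name << "\": ";
        if (mode_ == Mode::STRING) {
            os << '"' << it->second << '"';
        } else {
            os << it->second;
        }
        return os.str();
    }

private:
    Mode mode_;
    std::map<std::string, std::string> attrs_;
};

// Usage: the same stored text yields a JSON string or a JSON boolean.
void demoMiniAttributes() {
    MiniAttributes asString(Mode::STRING);
    asString.add("flag", "true");
    assert(asString.printJson("flag") == "\"flag\": \"true\"");

    MiniAttributes asJson(Mode::JSON);
    asJson.add("flag", "true");
    assert(asJson.printJson("flag") == "\"flag\": true");
}
```

In this sketch the mode never appears in the output, which matches the point that it is a property of the object and not of the serialized schema.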

Member:
Thanks for the explanation! It sounds good to me.

Author:
Updated the enum to use enum class and upper-case constant names in 2fd25e1


// Creates a new CustomAttributes object where all values are strings.
// All values passed to addAttribute() and returned from getAttribute() or the
// attributes() map will not be enclosed in quotes. However, any internal quotes
// WILL be escaped and other special characters MAY be escaped.
//
// To support non-string values, use CustomAttributes(CustomAttributes::json) instead.
CustomAttributes() : CustomAttributes(string) {}
Member:
I know the default value is there to keep backward compatibility. However, we will have to swap the default in the future, because the current behavior is wrong for non-string values. (I'm not well-versed in this codebase yet.) Is it possible to figure out whether an Avro file was created by avro-cpp, and with which writer version? If so, perhaps we could choose the correct default ValueMode when deserializing the CustomAttributes based on the writer version, and use ValueMode::string by default for the writer. That way, we won't create more problematic files after this release.

Author:
We always use ValueMode::json when we read an Avro file -- it doesn't matter how the file was created because the file always contains valid JSON in the schema. So if it was written with an older version, then it will only ever have string values in custom attributes. But that is fine to later load and read with ValueMode::json since that handles any kind of attribute value.

This is here for backwards compatibility of users of this API since this is a public type defined in a public header. So even though internal usages of CustomAttributes always should use ValueMode::json, that's not necessarily the case for external usages.

Member:
OK, that makes sense.

Member:
Should we mark it as [[deprecated]], since we want to discourage users from using this one?

Author:
Yes, that's a very good idea. We might even want to deprecate the ValueMode::string option, too, because it's highly error-prone and is only present for backwards-compatibility with the old string-only behavior.

Author:
Deprecated these elements in 0f0efd2.

I am testing using clang on OS X. I don't see any other deprecations in this C++ codebase, so I can't find any other examples of muting the compiler warnings that result from using the deprecated elements in the implementation and the tests. Let me know how I should handle this. For now, I've added clang-specific #pragma statements to eliminate the noisy warnings.
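
One possible way to silence deprecation warnings around intentional internal uses, without tying the code to a single compiler, is sketched below. This is an assumption-laden illustration and not what the PR does: `Settings` and `legacyDefaultLevel` are hypothetical, the `#pragma GCC diagnostic` form is understood by both GCC and Clang, and the MSVC branch relies on warning C4996 being the deprecation warning.

```cpp
#include <cassert>

// Hypothetical class with a deprecated default constructor kept for
// backward compatibility.
class Settings {
public:
    [[deprecated("use Settings(int) instead")]]
    Settings() : Settings(0) {}
    explicit Settings(int level) : level_(level) {}
    int level() const { return level_; }

private:
    int level_;
};

// Suppress the deprecation warning only around the intentional use.
#if defined(__clang__) || defined(__GNUC__)
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
#elif defined(_MSC_VER)
#pragma warning(push)
#pragma warning(disable : 4996)
#endif
int legacyDefaultLevel() {
    return Settings().level(); // deliberate use of the deprecated constructor
}
#if defined(__clang__) || defined(__GNUC__)
#pragma GCC diagnostic pop
#elif defined(_MSC_VER)
#pragma warning(pop)
#endif

// Usage: new code calls the non-deprecated constructor directly.
void demoSettings() {
    assert(legacyDefaultLevel() == 0);
    assert(Settings(5).level() == 5);
}
```

The push/pop pair keeps the suppression local, so accidental uses of the deprecated constructor elsewhere in the same file still produce a warning.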


// Creates a new CustomAttributes object.
//
// If the given mode is string, all values must be strings. All values passed to
Member:
If the ValueMode enum is well documented, we don't have to repeat that documentation here; otherwise the two can easily get out of sync.

Author:
Trimmed away redundant comments in 0f0efd2.

// addAttribute() and returned from getAttribute() or the attributes() map will not
// be enclosed in quotes. However, any internal quotes and newlines WILL be escaped
// and other special characters MAY be escaped.
//
// If the given mode is json, the values support any valid JSON type. In this mode,
// all values passed to addAttribute() and returned from getAttribute() or the
// attributes() map must be valid JSON values; string values must be quoted and
// escaped.
CustomAttributes(ValueMode valueMode);

// Retrieves the custom attribute string for the given name. Returns an empty
// value if the attribute doesn't exist.
//
// If this CustomAttributes was created in json mode, the returned string will
// be a JSON document (and string values will be quoted and escaped). Otherwise,
// the returned string will be a string value, though interior quotes and newlines
// will be escaped and other special characters may be escaped.
std::optional<std::string> getAttribute(const std::string &name) const;

// Adds a custom attribute. If the attribute already exists, throw an exception.
// Adds a custom attribute.
//
// If this CustomAttributes was created in json mode, the given value string must
// be a valid JSON document. So if the value is a string, it must be quoted and
// escaped. Otherwise, the given value string is an unquoted string value (though
// interior quotes and newlines must still be escaped).
//
// If the attribute already exists or if the given value is not a valid JSON
// document or not a correctly escaped string, throw an exception.
void addAttribute(const std::string &name, const std::string &value);

// Provides a way to iterate over the custom attributes or check attribute size.
// The values in this map are the same as those returned from getAttribute. So
// the value may be a JSON document or an unquoted string, depending on whether
// this CustomAttributes was created in json or string mode.
const std::map<std::string, std::string> &attributes() const {
return attributes_;
}
@@ -48,6 +97,7 @@ public:
void printJson(std::ostream &os, const std::string &name) const;

private:
ValueMode valueMode_;
std::map<std::string, std::string> attributes_;
};

2 changes: 1 addition & 1 deletion lang/c++/test/SchemaTests.cc
@@ -495,7 +495,7 @@ static void testLogicalTypes() {
const char *durationType = R"({"type": "fixed","size": 12,"name": "durationType","logicalType": "duration"})";
const char *uuidType = R"({"type": "string","logicalType": "uuid"})";
// AVRO-2923 Union with LogicalType
- const char *unionType = R"([{"type":"string", "logicalType":"uuid"},"null"]})";
+ const char *unionType = R"([{"type":"string", "logicalType":"uuid"},"null"])";
{
BOOST_TEST_CHECKPOINT(bytesDecimalType);
ValidSchema schema1 = compileJsonSchemaFromString(bytesDecimalType);
57 changes: 37 additions & 20 deletions lang/c++/test/unittest.cc
@@ -440,29 +440,31 @@ struct TestSchema {
std::vector<GenericDatum> defaultValues;
concepts::MultiAttribute<CustomAttributes> customAttributes;

CustomAttributes cf;
cf.addAttribute("stringField", std::string("\\\"field value with \\\"double quotes\\\"\\\""));
cf.addAttribute("booleanField", std::string("true"));
cf.addAttribute("numberField", std::string("1.23"));
cf.addAttribute("nullField", std::string("null"));
cf.addAttribute("arrayField", std::string("[1]"));
cf.addAttribute("mapField", std::string("{\\\"key1\\\":\\\"value1\\\", \\\"key2\\\":\\\"value2\\\"}"));
CustomAttributes ca(CustomAttributes::json);
ca.addAttribute("stringField", std::string("\"foobar\""));
ca.addAttribute("stringFieldComplex", std::string("\"\\\" a field value with \\\"double quotes\\\" \\\"\""));
ca.addAttribute("booleanField", std::string("true"));
ca.addAttribute("numberField", std::string("1.23"));
ca.addAttribute("nullField", std::string("null"));
ca.addAttribute("arrayField", std::string("[1]"));
ca.addAttribute("mapField", std::string("{\"key1\":\"value1\", \"key2\":\"value2\"}"));
fieldNames.add("f1");
fieldValues.add(NodePtr(new NodePrimitive(Type::AVRO_LONG)));
customAttributes.add(cf);
customAttributes.add(ca);

NodeRecord nodeRecordWithCustomAttribute(nameConcept, fieldValues,
fieldNames, fieldAliases, defaultValues,
customAttributes);
std::string expectedJsonWithCustomAttribute =
"{\"type\": \"record\", \"name\": \"Test\",\"fields\": "
"[{\"name\": \"f1\", \"type\": \"long\", "
"\"arrayField\": \"[1]\", "
"\"booleanField\": \"true\", "
"\"mapField\": \"{\\\"key1\\\":\\\"value1\\\", \\\"key2\\\":\\\"value2\\\"}\", "
"\"nullField\": \"null\", "
"\"numberField\": \"1.23\", "
"\"stringField\": \"\\\"field value with \\\"double quotes\\\"\\\"\""
"\"arrayField\": [1], "
"\"booleanField\": true, "
"\"mapField\": {\"key1\":\"value1\", \"key2\":\"value2\"}, "
"\"nullField\": null, "
"\"numberField\": 1.23, "
"\"stringField\": \"foobar\", "
"\"stringFieldComplex\": \"\\\" a field value with \\\"double quotes\\\" \\\"\""
"}]}";
testNodeRecord(nodeRecordWithCustomAttribute,
expectedJsonWithCustomAttribute);
@@ -489,12 +491,26 @@ struct TestSchema {
expectedJsonWithoutCustomAttribute);
}

void checkCustomAttributes_getAttribute() {
CustomAttributes cf;
cf.addAttribute("field1", std::string("1"));
void checkCustomAttributes_addAndGetAttributeJson() {
Member:
IMHO these cases are not sufficient. We need to cover the expected behavior of the following cases (even exceptions are expected in some of them):

| Read \ Write | string | json |
|--------------|--------|------|
| string       | ?      | ?    |
| json         | ?      | ?    |

Author:
I think the only thing missing from the matrix was the failure tests for passing a "string" value mode style value when mode is actually "json" and vice versa.

I updated the tests in e95f691, and they caught an issue: we were only handling TooManyValuesException when the value mode was "string", but it could be triggered in either value mode. So I've fixed that in the same commit.
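
The kind of mismatch these failure tests target (a value written in one mode's style but checked under the other) can be illustrated with a direct scan of a string-mode value. Note that this is a different technique from the PR, which wraps the value in quotes and parses it with json::loadEntity; `isEscapedStringBody` is a hypothetical helper that only checks JSON string escaping rules.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical check: is `s` usable as the body of a JSON string literal,
// i.e. unquoted, but with interior quotes and special characters escaped?
bool isEscapedStringBody(const std::string &s) {
    static const std::string kSimpleEscapes = "\"\\/bfnrt";
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c == '"' || c < 0x20) {
            return false; // bare quote or raw control character
        }
        if (c != '\\') {
            continue;
        }
        if (++i == s.size()) {
            return false; // dangling backslash
        }
        if (s[i] == 'u') {
            if (s.size() - i < 5) {
                return false; // need four hex digits after 'u'
            }
            for (int k = 1; k <= 4; ++k) {
                if (!std::isxdigit(static_cast<unsigned char>(s[i + k]))) {
                    return false;
                }
            }
            i += 4;
        } else if (kSimpleEscapes.find(s[i]) == std::string::npos) {
            return false; // unknown escape sequence
        }
    }
    return true;
}

// Usage: a json-style quoted value is not a valid string-mode value.
void demoEscapedStringBody() {
    assert(isEscapedStringBody("value with \\\"quotes\\\""));
    assert(!isEscapedStringBody("\"quoted like json mode\""));
}
```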

CustomAttributes ca(CustomAttributes::json);
ca.addAttribute("field1", std::string("true"));

BOOST_CHECK_EQUAL(std::string("1"), *cf.getAttribute("field1"));
BOOST_CHECK_EQUAL(false, cf.getAttribute("not_existing").has_value());
BOOST_CHECK_EQUAL(std::string("true"), *ca.getAttribute("field1"));
BOOST_CHECK_EQUAL(false, ca.getAttribute("not_existing").has_value());
}

void checkCustomAttributes_addAndGetAttributeString() {
CustomAttributes ca;
ca.addAttribute("field1", std::string("true"));
ca.addAttribute("field2", std::string("value with \\\"quotes\\\""));

BOOST_CHECK_EQUAL(std::string("true"), *ca.getAttribute("field1"));
BOOST_CHECK_EQUAL(std::string("value with \\\"quotes\\\""), *ca.getAttribute("field2"));
BOOST_CHECK_EQUAL(false, ca.getAttribute("not_existing").has_value());

std::ostringstream oss;
ca.printJson(oss, "field2");
BOOST_CHECK_EQUAL(std::string("\"field2\": \"value with \\\"quotes\\\"\""), oss.str());
}

void test() {
@@ -521,7 +537,8 @@ struct TestSchema {

checkNodeRecordWithoutCustomAttribute();
checkNodeRecordWithCustomAttribute();
checkCustomAttributes_getAttribute();
checkCustomAttributes_addAndGetAttributeJson();
checkCustomAttributes_addAndGetAttributeString();
}

ValidSchema schema_;