AVRO-4026: [c++] Add new methods to CustomAttributes to allow non-string custom attributes in schemas #3266

Open
wants to merge 8 commits into
base: main
6 changes: 3 additions & 3 deletions lang/c++/impl/Compiler.cc
@@ -277,7 +277,7 @@ static void getCustomAttributes(const Object &m, CustomAttributes &customAttribu
const std::unordered_set<std::string> &kKnownFields = getKnownFields();
for (const auto &entry : m) {
if (kKnownFields.find(entry.first) == kKnownFields.end()) {
customAttributes.addAttribute(entry.first, entry.second.stringValue());
customAttributes.addAttribute(entry.first, entry.second.toString());
Member

Why was this changed? Is it a breaking change for downstream users who call stringValue?

Author

No. stringValue is a method on json::Entity, which I did not touch. That call was the source of the exception: stringValue throws if a custom attribute has a non-string value. Now that all kinds of values are supported, we call toString instead, which returns a properly formatted JSON string representing whatever value the Entity holds.
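
The difference between the two accessors can be sketched with a small stand-in type. MiniEntity below is hypothetical and is not the real json::Entity; it only models a string-or-integer value, and its escaping is deliberately simplified.

```cpp
#include <stdexcept>
#include <string>
#include <utility>
#include <variant>

// Hypothetical stand-in for json::Entity: holds either a string or an integer.
class MiniEntity {
public:
    explicit MiniEntity(std::string s) : value_(std::move(s)) {}
    explicit MiniEntity(long n) : value_(n) {}

    // Like stringValue(): only valid when the entity holds a string.
    const std::string &stringValue() const {
        if (const auto *s = std::get_if<std::string>(&value_)) {
            return *s;
        }
        throw std::runtime_error("entity is not a string");
    }

    // Like toString(): renders whatever is held as JSON text.
    // Assumption: only quotes and backslashes are escaped in this sketch.
    std::string toString() const {
        if (const auto *n = std::get_if<long>(&value_)) {
            return std::to_string(*n);
        }
        std::string out = "\"";
        for (char c : std::get<std::string>(value_)) {
            if (c == '"' || c == '\\') {
                out += '\\';
            }
            out += c;
        }
        out += '"';
        return out;
    }

private:
    std::variant<std::string, long> value_;
};
```

With this model, a numeric attribute makes stringValue() throw, while toString() yields text that can be stored and later written back out as JSON.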

}
}
}
@@ -300,7 +300,7 @@ static Field makeField(const Entity &e, SymbolTable &st, const string &ns) {
}
GenericDatum d = (it2 == m.end()) ? GenericDatum() : makeGenericDatum(node, it2->second, st);
// Get custom attributes
CustomAttributes customAttributes;
CustomAttributes customAttributes(CustomAttributes::ValueMode::JSON);
getCustomAttributes(m, customAttributes);
return Field(std::move(n), std::move(aliases), node, d, customAttributes);
}
@@ -424,7 +424,7 @@ static NodePtr makeArrayNode(const Entity &e, const Object &m,
if (containsField(m, "doc")) {
node->setDoc(getDocField(e, m));
}
CustomAttributes customAttributes;
CustomAttributes customAttributes(CustomAttributes::ValueMode::JSON);
getCustomAttributes(m, customAttributes);
node->addCustomAttributesForField(customAttributes);
return node;
63 changes: 53 additions & 10 deletions lang/c++/impl/CustomAttributes.cc
@@ -16,26 +16,63 @@
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "CustomAttributes.hh"
#include "Exception.hh"

#if defined(__clang__)
// Even though CustomAttributes::ValueMode::STRING is deprecated, we still have to
// handle/implement it.
#pragma clang diagnostic ignored "-Wdeprecated-declarations"
Member

Would it be better to wrap only #include "CustomAttributes.hh" in #pragma clang diagnostic push and pop?
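
A scoped suppression of the kind asked about might look like the sketch below. oldName and callLegacy are hypothetical stand-ins, not names from this PR, and the GCC branch is an added assumption so the sketch also builds quietly there. The deprecation diagnostic is reported where a deprecated name is used, so the push/pop region in this sketch surrounds a use site.

```cpp
#include <string>

// Hypothetical deprecated API, standing in for ValueMode::STRING.
[[deprecated("use a newer API instead")]]
inline std::string oldName() { return "legacy"; }

inline std::string callLegacy() {
#if defined(__clang__)
#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Wdeprecated-declarations"
#elif defined(__GNUC__)
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
#endif
    // The deprecation warning is suppressed for this call only.
    std::string result = oldName();
#if defined(__clang__)
#pragma clang diagnostic pop
#elif defined(__GNUC__)
#pragma GCC diagnostic pop
#endif
    return result;
}
```

Code after the pop gets the normal diagnostics again, which is the main advantage over a file-wide ignore.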

#endif

#include <map>
#include <memory>

#include "CustomAttributes.hh"
#include "Exception.hh"

#include "json/JsonDom.hh"

namespace avro {

CustomAttributes::CustomAttributes(ValueMode valueMode) {
switch (valueMode) {
case ValueMode::STRING:
case ValueMode::JSON:
valueMode_ = valueMode;
break;
default:
throw Exception("invalid ValueMode: " + std::to_string(static_cast<int>(valueMode)));
}
}

std::optional<std::string> CustomAttributes::getAttribute(const std::string &name) const {
std::optional<std::string> result;
std::map<std::string, std::string>::const_iterator iter =
attributes_.find(name);
auto iter = attributes_.find(name);
if (iter == attributes_.end()) {
return result;
return {};
}
result = iter->second;
return result;
return iter->second;
}

void CustomAttributes::addAttribute(const std::string &name,
const std::string &value) {
// Validate the incoming value.
//
// NOTE: This is a bit annoying that we accept the data as a string instead of
// as an Entity. That means the compiler must convert the value to a string only
// for this method to convert it back. But we can't directly refer to the
// json::Entity type in the signatures for this class (and thus cannot accept
// that type directly as a parameter) because then it would need to be included
// from a header file: CustomAttributes.hh. But the json header files are not
// part of the Avro distribution (intentionally), so CustomAttributes.hh cannot
// #include any of the json header files.
const std::string &jsonVal = (valueMode_ == ValueMode::STRING)
? std::move("\"" + value + "\"")
: value;
try {
Member

Is it expensive to validate the input? If we don't validate, what happens with malformed input?

Author

If we don't validate it, we could end up generating invalid Avro files.

I suppose we could also add a private method that skips the validation and declare the Compiler as a friend that can invoke it. Values coming from the compiler were already validated when the JSON schema was originally parsed, so skipping validation would be fine in those cases.

The existing "string only" usage is error-prone enough: all it does is add a leading and trailing " to turn the value into a JSON value, so interior quotes and newlines must still be escaped by the caller. I figured it would be even more error-prone once arbitrary JSON elements are allowed (including complicated values like arrays and objects), and more so now that the class has a mode that is fixed at construction time. Failing fast seemed like a much better developer experience than ignoring such issues and letting the library generate an invalid file.
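
The hazard with the quote-wrapping approach can be illustrated with a small sketch. wrapInQuotes mirrors what STRING mode does in the diff; isJsonStringLiteral is a hypothetical, simplified check written for this illustration and is not the validation the PR performs (the PR calls json::loadEntity).

```cpp
#include <cstddef>
#include <string>

// Mirrors STRING mode in the diff: wrap the raw value in quotes, nothing more.
inline std::string wrapInQuotes(const std::string &value) {
    return "\"" + value + "\"";
}

// Hypothetical, simplified check: is `text` one well-formed JSON string literal?
// Assumption: \uXXXX is accepted without checking the hex digits.
inline bool isJsonStringLiteral(const std::string &text) {
    if (text.size() < 2 || text.front() != '"' || text.back() != '"') {
        return false;
    }
    const std::string allowedEscapes = "\"\\/bfnrtu";
    for (std::size_t i = 1; i + 1 < text.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(text[i]);
        if (c < 0x20 || c == '"') {
            return false; // raw control character or unescaped quote
        }
        if (c == '\\') {
            ++i; // move to the escaped character
            if (i + 1 >= text.size()) {
                return false; // the backslash would escape the closing quote
            }
            if (allowedEscapes.find(text[i]) == std::string::npos) {
                return false; // unknown escape sequence
            }
        }
    }
    return true;
}
```

A plain value such as `plain` survives wrapping, while a value containing a raw quote or a raw newline produces text that is no longer a single JSON string, which is the kind of input that failing fast is meant to catch.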

json::loadEntity(jsonVal.c_str());
} catch (json::TooManyValuesException &e) {
throw Exception("string has malformed or missing escapes");
}

auto iter_and_find =
attributes_.insert(std::pair<std::string, std::string>(name, value));
if (!iter_and_find.second) {
@@ -45,9 +82,15 @@ void CustomAttributes::addAttribute(const std::string &name,

void CustomAttributes::printJson(std::ostream &os,
const std::string &name) const {
if (attributes().find(name) == attributes().end()) {
auto iter = attributes_.find(name);
if (iter == attributes_.end()) {
throw Exception(name + " doesn't exist");
}
os << "\"" << name << "\": \"" << attributes().at(name) << "\"";
os << json::Entity(std::make_shared<std::string>(name)).toString() << ": ";
if (valueMode_ == ValueMode::STRING) {
os << "\"" << iter->second << "\"";
} else {
os << iter->second;
}
}
} // namespace avro
2 changes: 1 addition & 1 deletion lang/c++/impl/Schema.cc
@@ -28,7 +28,7 @@ RecordSchema::RecordSchema(const std::string &name) : Schema(new NodeRecord) {
}

void RecordSchema::addField(const std::string &name, const Schema &fieldSchema) {
const CustomAttributes emptyCustomAttribute;
const CustomAttributes emptyCustomAttribute(CustomAttributes::ValueMode::JSON);
addField(name, fieldSchema, emptyCustomAttribute);
}

6 changes: 5 additions & 1 deletion lang/c++/impl/json/JsonDom.cc
@@ -92,7 +92,11 @@ Entity loadEntity(const char *text) {
Entity loadEntity(InputStream &in) {
JsonParser p;
p.init(in);
return readEntity(p);
Entity e = readEntity(p);
if (p.hasMore()) {
Member

Why not check this in the readEntity function?

Author

Just from the names and signatures, it looked like readValue could potentially be used to read multiple JSON entities from a stream. In that case, the caller holds a reference to the JsonParser and might reuse it for subsequent data in the input.

For this method, though, the caller can't safely do anything with the input after the call, so it seemed more intuitive for it to enforce that the entire stream is consumed.

None of these functions have doc comments, so the intent wasn't clear everywhere. I left things flexible and changed only this function, which appears safe to change without breaking any caller assumptions.
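
The split described above, a low-level reader that leaves the stream reusable and a top-level loader that insists on consuming everything, can be sketched over std::istream. All names here (readOne, hasMore, loadOne, TrailingDataError) are hypothetical, and integers stand in for JSON values to keep the sketch short.

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical exception type, playing the role of TooManyValuesException.
struct TrailingDataError : std::runtime_error {
    TrailingDataError() : std::runtime_error("expected a single value but found more") {}
};

// Low-level reader: consumes one integer and leaves the stream positioned
// after it, so a caller holding the stream could keep reading.
inline long readOne(std::istream &in) {
    long v = 0;
    if (!(in >> v)) {
        throw std::runtime_error("expected an integer");
    }
    return v;
}

// True if anything other than whitespace remains in the stream.
inline bool hasMore(std::istream &in) {
    int c;
    while ((c = in.peek()) != std::char_traits<char>::eof()) {
        if (!std::isspace(static_cast<unsigned char>(c))) {
            return true;
        }
        in.get(); // discard trailing whitespace
    }
    return false;
}

// Top-level loader: it owns the whole input, so it requires that the single
// value is the only content.
inline long loadOne(const std::string &text) {
    std::istringstream in(text);
    const long v = readOne(in);
    if (hasMore(in)) {
        throw TrailingDataError();
    }
    return v;
}
```

Keeping the check in the loader means readOne stays usable for callers that intentionally read several values from one stream.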

throw TooManyValuesException();
}
return e;
}

Entity loadEntity(const uint8_t *text, size_t len) {
6 changes: 6 additions & 0 deletions lang/c++/impl/json/JsonDom.hh
@@ -177,6 +177,12 @@ AVRO_DECL Entity loadEntity(const uint8_t *text, size_t len);

void writeEntity(JsonGenerator<JsonNullFormatter> &g, const Entity &n);

class AVRO_DECL TooManyValuesException : public virtual std::runtime_error {
Member

I don't see any exception defined other than the one in Exception.hh. Should we simply stick to Exception to be consistent?

Author

Well, we catch this type of exception. If we just threw Exception, would the catch code be expected to parse the string message to assess the specific error condition? That seems like an anti-pattern.
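
The point about catching by type can be sketched as follows. parseStub and describe are hypothetical, and TooManyValues is a simplified stand-in for the exception class in the diff. The caller distinguishes the failure by its type and never inspects the message text.

```cpp
#include <stdexcept>
#include <string>

// Simplified stand-in for TooManyValuesException.
struct TooManyValues : std::runtime_error {
    TooManyValues()
        : std::runtime_error("expecting a single JSON value but found more than one") {}
};

// Hypothetical parser stub: different failures throw different types.
inline void parseStub(const std::string &text) {
    if (text.empty()) {
        throw std::runtime_error("unexpected end of input");
    }
    if (text.find(' ') != std::string::npos) {
        throw TooManyValues();
    }
}

// The caller maps one specific failure to its own message; the more derived
// handler must come first so it is matched before the base class.
inline std::string describe(const std::string &text) {
    try {
        parseStub(text);
        return "ok";
    } catch (const TooManyValues &) {
        return "string has malformed or missing escapes";
    } catch (const std::runtime_error &e) {
        return std::string("parse error: ") + e.what();
    }
}
```

With a single shared exception type, the first handler would have to compare e.what() against a known string, which breaks as soon as the message wording changes.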

public:
explicit TooManyValuesException() :
std::runtime_error("invalid JSON document: expecting a single JSON value but found more than one") {}
};

} // namespace json
} // namespace avro

22 changes: 22 additions & 0 deletions lang/c++/impl/json/JsonIO.cc
@@ -49,6 +49,28 @@ char JsonParser::next() {
return ch;
}

bool JsonParser::hasMore() {
if (peeked || hasNext && !isspace(nextChar)) {
Member

Suggested change
if (peeked || hasNext && !isspace(nextChar)) {
if (peeked || (hasNext && !isspace(nextChar))) {
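
As a quick check that the suggested parentheses change readability only and not behavior, here is a small sketch. The function names are hypothetical. Because && binds more tightly than ||, both spellings group the same way; many compilers warn about the unparenthesized form, which is one reason to prefer the suggestion.

```cpp
// && binds more tightly than ||, so `a || b && c` already means `a || (b && c)`.
constexpr bool withoutParens(bool a, bool b, bool c) { return a || b && c; }
constexpr bool withParens(bool a, bool b, bool c) { return a || (b && c); }

// The two spellings agree, including in the case where the other grouping,
// (a || b) && c, would give a different answer.
static_assert(withoutParens(true, false, false) == withParens(true, false, false),
              "same grouping");
static_assert(withoutParens(true, false, false) != ((true || false) && false),
              "differs from (a || b) && c");
```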

return true;
}
if (!hasNext) {
nextChar = ' ';
}
// We return true if there are any more tokens; we ignore
// any trailing whitespace.
while (isspace(nextChar)) {
if (nextChar == '\n') {
line_++;
}
if (!in_.hasMore()) {
return false;
}
nextChar = in_.read();
}
hasNext = true;
return true;
}

void JsonParser::expectToken(Token tk) {
if (advance() != tk) {
if (tk == Token::Double) {
2 changes: 2 additions & 0 deletions lang/c++/impl/json/JsonIO.hh
@@ -116,6 +116,8 @@ public:
return curToken;
}

bool hasMore();
Member

Could we avoid adding hasMore() by reusing peek() above?


void expectToken(Token tk);

bool boolValue() const {
44 changes: 40 additions & 4 deletions lang/c++/include/avro/CustomAttributes.hh
@@ -19,27 +19,62 @@
#ifndef avro_CustomAttributes_hh__
#define avro_CustomAttributes_hh__

#include "Config.hh"
#include <iostream>
Member

nit: is it possible to remove this line? It looks weird that a public header exposes iostream.

Author

I don't think so. One of the methods of CustomAttributes uses it:

void printJson(std::ostream &os, const std::string &name) const;

#include <map>
#include <optional>
#include <string>

#include "Config.hh"

namespace avro {

// CustomAttributes class stores avro custom attributes.
// Each attribute is represented by a unique name and value.
// User is supposed to create CustomAttributes object and then add it to Schema.
class AVRO_DECL CustomAttributes {

public:
// Retrieves the custom attribute json entity for that attributeName, returns an
// null if the attribute doesn't exist.
enum class ValueMode : uint8_t {
// When a CustomAttributes is created using this mode, all values are expected
// to be strings. The value should not be quoted, but any interior quotes and
// special characters (such as newlines) must be escaped.
STRING
[[deprecated("The JSON ValueMode is less error-prone and less limited.")]],
Member

Perhaps add (in the comment) that it has been deprecated since 1.3.0 and will be removed in XXX version?

Author

I have no idea what to put for "XXX" version. Is there a standard support policy for deprecating things and making backwards-breaking changes like this?


// When a CustomAttributes is created using this mode, all values are formatted
// JSON values. So string values must be quoted and escaped.
JSON
};

// Creates a new CustomAttributes object where all values are strings.
// All values passed to addAttribute() and returned from getAttribute() or the
// attributes() map will not be enclosed in quotes. However, any internal quotes
// WILL be escaped and other special characters MAY be escaped.
//
// To support non-string values, one must instead use
// CustomAttributes(CustomAttributes::ValueMode::JSON)
[[deprecated("Use CustomAttributes(ValueMode) instead.")]]
CustomAttributes() : CustomAttributes(ValueMode::STRING) {}

// Creates a new CustomAttributes object with the given value mode.
CustomAttributes(ValueMode valueMode);

// Retrieves the custom attribute string for the given name. Returns an empty
// value if the attribute doesn't exist.
//
// See ValueMode for details on the format of the returned value.
std::optional<std::string> getAttribute(const std::string &name) const;

// Adds a custom attribute. If the attribute already exists, throw an exception.
// Adds a custom attribute.
//
// See ValueMode for details on the required format of the value parameter.
//
// If the attribute already exists or if the given value is not a valid JSON
// document or not a correctly escaped string, throw an exception.
void addAttribute(const std::string &name, const std::string &value);

// Provides a way to iterate over the custom attributes or check attribute size.
// The values in this map are the same as those returned from getAttribute.
const std::map<std::string, std::string> &attributes() const {
return attributes_;
}
@@ -48,6 +83,7 @@ public:
void printJson(std::ostream &os, const std::string &name) const;

private:
ValueMode valueMode_;
std::map<std::string, std::string> attributes_;
};

2 changes: 1 addition & 1 deletion lang/c++/test/SchemaTests.cc
@@ -495,7 +495,7 @@ static void testLogicalTypes() {
const char *durationType = R"({"type": "fixed","size": 12,"name": "durationType","logicalType": "duration"})";
const char *uuidType = R"({"type": "string","logicalType": "uuid"})";
// AVRO-2923 Union with LogicalType
const char *unionType = R"([{"type":"string", "logicalType":"uuid"},"null"]})";
const char *unionType = R"([{"type":"string", "logicalType":"uuid"},"null"])";
{
BOOST_TEST_CHECKPOINT(bytesDecimalType);
ValidSchema schema1 = compileJsonSchemaFromString(bytesDecimalType);