AVRO-3900: [Rust] Make it possible to use custom names validators (apache#2643)

* AVRO-3900: [Rust] Make it possible to use custom validators

* AVRO-3900: Add validator trait for enum symbol names

* AVRO-3900: Introduce validator for RecordField's name

* AVRO-3900: Extract the custom validators tests as IT tests

This way they use only the public APIs of the crate, as a real user application would.

* AVRO-3900: Polish APIs

* AVRO-3900: Split the validator integration tests

Otherwise there are timing issues with the initialization of the
OnceLocks for the different validators.
Parsing a Schema initializes the default validator, and after that
it is impossible to set a custom one.

* AVRO-3900: Simplify generate_interop_data.rs example to not use avro_test_helper crate

For some reason, clearing the log messages in the test's destructor
(#[dtor]) was failing for this example.

Simplify the TestLogger to use the std thread_local!() instead of the
third-party ref_thread_local!()

* AVRO-3900: Merge the validators integration tests in one test

Just make sure the setup of the custom validators is done before the
parsing of any schema to prevent registering the default validator
(SpecificationValidator)

* AVRO-3900: Improve the documentation

Signed-off-by: Martin Tzvetanov Grigorov <[email protected]>
martin-g authored and Ranbir Kumar committed May 13, 2024
1 parent 112bd8f commit 00426d6
Showing 11 changed files with 504 additions and 93 deletions.
7 changes: 0 additions & 7 deletions lang/rust/Cargo.lock

34 changes: 34 additions & 0 deletions lang/rust/avro/README.md
@@ -651,6 +651,40 @@ let writers_schema = Schema::parse_str(r#"{"type": "array", "items":"long"}"#).unwrap();
let readers_schema = Schema::parse_str(r#"{"type": "array", "items":"int"}"#).unwrap();
assert!(SchemaCompatibility::can_read(&writers_schema, &readers_schema).is_err());
```
### Custom names validators

By default, the library follows the naming rules of the
[Avro specification](https://avro.apache.org/docs/1.11.1/specification/#names).

Some of the other Apache Avro language SDKs are not that strict and allow more
characters in names. For interoperability with those SDKs, the library provides
a way to customize the names validation.

```rust
use apache_avro::AvroResult;
use apache_avro::schema::Namespace;
use apache_avro::validator::{SchemaNameValidator, set_schema_name_validator};

struct MyCustomValidator;

impl SchemaNameValidator for MyCustomValidator {
    fn validate(&self, name: &str) -> AvroResult<(String, Namespace)> {
        todo!()
    }
}

// Do not parse any schema before registering the custom validator(s)!

set_schema_name_validator(Box::new(MyCustomValidator));

// ... use the library
```

The same approach works for validating schema namespaces, enum symbols, and record field names.

**Note**: a validator can be set only once per application lifetime.
If the application parses a schema before setting a validator, the default validator is
registered and used instead.
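
The set-once behavior described in the note can be illustrated with a small sketch that uses only `std::sync::OnceLock`. The names `NameValidator`, `StrictValidator`, `LenientValidator` and `Registry` are hypothetical stand-ins invented for this illustration, not the crate's API. The sketch assumes only that the validator slot is filled either by an explicit registration or, lazily, by the first validation.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for a validator trait (not the crate's real trait).
trait NameValidator: Send + Sync {
    fn validate(&self, name: &str) -> Result<(), String>;
}

/// Strict default: accepts names shaped like `[A-Za-z_][A-Za-z0-9_]*`.
struct StrictValidator;

impl NameValidator for StrictValidator {
    fn validate(&self, name: &str) -> Result<(), String> {
        let mut chars = name.chars();
        let first_ok = matches!(chars.next(), Some(c) if c.is_ascii_alphabetic() || c == '_');
        if first_ok && chars.all(|c| c.is_ascii_alphanumeric() || c == '_') {
            Ok(())
        } else {
            Err(format!("invalid name: {name}"))
        }
    }
}

/// Lenient custom validator: accepts any non-empty name.
struct LenientValidator;

impl NameValidator for LenientValidator {
    fn validate(&self, name: &str) -> Result<(), String> {
        if name.is_empty() {
            Err("empty name".to_string())
        } else {
            Ok(())
        }
    }
}

/// Hypothetical registry holding a single, set-once validator slot.
struct Registry {
    slot: OnceLock<Box<dyn NameValidator>>,
}

impl Registry {
    const fn new() -> Self {
        Self { slot: OnceLock::new() }
    }

    /// Succeeds only while the slot is still empty; otherwise hands the validator back.
    fn set(&self, validator: Box<dyn NameValidator>) -> Result<(), Box<dyn NameValidator>> {
        self.slot.set(validator)
    }

    /// The first call fills an empty slot with the strict default.
    fn validate(&self, name: &str) -> Result<(), String> {
        self.slot
            .get_or_init(|| Box::new(StrictValidator) as Box<dyn NameValidator>)
            .validate(name)
    }
}
```

Calling `set` on a fresh `Registry` before the first `validate` succeeds, and the lenient rules then apply. Once `validate` has run, `set` returns `Err` and the strict default stays in place.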

<!-- cargo-rdme end -->

6 changes: 3 additions & 3 deletions lang/rust/avro/examples/generate_interop_data.rs
@@ -20,9 +20,9 @@ use apache_avro::{
types::{Record, Value},
Codec, Writer,
};
use apache_avro_test_helper::TestResult;
use std::{
collections::HashMap,
error::Error,
io::{BufWriter, Write},
};
use strum::IntoEnumIterator;
@@ -75,7 +75,7 @@ fn create_datum(schema: &Schema) -> Record {
datum
}

fn main() -> TestResult {
fn main() -> Result<(), Box<dyn Error>> {
let schema_str = std::fs::read_to_string("../../share/test/schemas/interop.avsc")
.expect("Unable to read the interop Avro schema");
let schema = Schema::parse_str(schema_str.as_str())?;
@@ -105,7 +105,7 @@ fn main() -> TestResult {
Ok(())
}

fn write_user_metadata<W: Write>(writer: &mut Writer<BufWriter<W>>) -> TestResult {
fn write_user_metadata<W: Write>(writer: &mut Writer<BufWriter<W>>) -> Result<(), Box<dyn Error>> {
writer.add_user_metadata("user_metadata".to_string(), b"someByteArray")?;

Ok(())
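
The replacement signature `Result<(), Box<dyn Error>>` relies on the standard library's conversion of any error type into `Box<dyn Error>` through the `?` operator. A minimal sketch of that mechanism follows. `parse_metadata_number` is a hypothetical helper written for this illustration and does not appear in the commit.

```rust
use std::error::Error;

/// Hypothetical helper: decodes a metadata value as UTF-8 and parses it as a number.
/// Two unrelated error types are propagated through the same boxed return type.
fn parse_metadata_number(raw: &[u8]) -> Result<i64, Box<dyn Error>> {
    // `Utf8Error` is converted into `Box<dyn Error>` by `?`.
    let text = std::str::from_utf8(raw)?;
    // `ParseIntError` is converted the same way.
    let number = text.trim().parse::<i64>()?;
    Ok(number)
}
```

A `main` function can use the same return type, which is why the examples no longer need a helper crate for their result alias.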
4 changes: 2 additions & 2 deletions lang/rust/avro/examples/test_interop_data.rs
@@ -16,14 +16,14 @@
// under the License.

use apache_avro::Reader;
use apache_avro_test_helper::TestResult;
use std::{
collections::HashMap,
error::Error,
ffi::OsStr,
io::{BufReader, Read},
};

fn main() -> TestResult {
fn main() -> Result<(), Box<dyn Error>> {
let mut expected_user_metadata: HashMap<String, Vec<u8>> = HashMap::new();
expected_user_metadata.insert("user_metadata".to_string(), b"someByteArray".to_vec());

36 changes: 36 additions & 0 deletions lang/rust/avro/src/lib.rs
@@ -764,6 +764,41 @@
//! let readers_schema = Schema::parse_str(r#"{"type": "array", "items":"int"}"#).unwrap();
//! assert!(SchemaCompatibility::can_read(&writers_schema, &readers_schema).is_err());
//! ```
//! ## Custom names validators
//!
//! By default, the library follows the naming rules of the
//! [Avro specification](https://avro.apache.org/docs/1.11.1/specification/#names).
//!
//! Some of the other Apache Avro language SDKs are not that strict and allow more
//! characters in names. For interoperability with those SDKs, the library provides
//! a way to customize the names validation.
//!
//! ```rust
//! use apache_avro::AvroResult;
//! use apache_avro::schema::Namespace;
//! use apache_avro::validator::{SchemaNameValidator, set_schema_name_validator};
//!
//! struct MyCustomValidator;
//!
//! impl SchemaNameValidator for MyCustomValidator {
//!     fn validate(&self, name: &str) -> AvroResult<(String, Namespace)> {
//!         todo!()
//!     }
//! }
//!
//! // Do not parse any schema before registering the custom validator(s)!
//!
//! set_schema_name_validator(Box::new(MyCustomValidator));
//!
//! // ... use the library
//! ```
//!
//! The same approach works for validating schema namespaces, enum symbols, and record field names.
//!
//! **Note**: a validator can be set only once per application lifetime.
//! If the application parses a schema before setting a validator, the default validator is
//! registered and used instead.
//!

mod bigdecimal;
mod codec;
@@ -782,6 +817,7 @@ pub mod rabin;
pub mod schema;
pub mod schema_compatibility;
pub mod types;
pub mod validator;

pub use codec::Codec;
pub use de::from_value;
84 changes: 14 additions & 70 deletions lang/rust/avro/src/schema.rs
@@ -16,9 +16,17 @@
// under the License.

//! Logic for parsing and interacting with schemas in Avro format.
use crate::{error::Error, types, util::MapHelper, AvroResult};
use crate::{
error::Error,
types,
util::MapHelper,
validator::{
validate_enum_symbol_name, validate_namespace, validate_record_field_name,
validate_schema_name,
},
AvroResult,
};
use digest::Digest;
use regex_lite::Regex;
use serde::{
ser::{SerializeMap, SerializeSeq},
Deserialize, Serialize, Serializer,
@@ -33,37 +41,9 @@ use std::{
hash::Hash,
io::Read,
str::FromStr,
sync::OnceLock,
};
use strum_macros::{EnumDiscriminants, EnumString};

fn enum_symbol_name_r() -> &'static Regex {
static ENUM_SYMBOL_NAME_ONCE: OnceLock<Regex> = OnceLock::new();
ENUM_SYMBOL_NAME_ONCE.get_or_init(|| Regex::new(r"^[A-Za-z_][A-Za-z0-9_]*$").unwrap())
}

// An optional namespace (with optional dots) followed by a name without any dots in it.
fn schema_name_r() -> &'static Regex {
static SCHEMA_NAME_ONCE: OnceLock<Regex> = OnceLock::new();
SCHEMA_NAME_ONCE.get_or_init(|| {
Regex::new(
r"^((?P<namespace>([A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*)?)\.)?(?P<name>[A-Za-z_][A-Za-z0-9_]*)$",
).unwrap()
})
}

fn field_name_r() -> &'static Regex {
static FIELD_NAME_ONCE: OnceLock<Regex> = OnceLock::new();
FIELD_NAME_ONCE.get_or_init(|| Regex::new(r"^[A-Za-z_][A-Za-z0-9_]*$").unwrap())
}

fn namespace_r() -> &'static Regex {
static NAMESPACE_ONCE: OnceLock<Regex> = OnceLock::new();
NAMESPACE_ONCE.get_or_init(|| {
Regex::new(r"^([A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*)?$").unwrap()
})
}

/// Represents an Avro schema fingerprint
/// More information about Avro schema fingerprints can be found in the
/// [Avro Schema Fingerprint documentation](https://avro.apache.org/docs/current/spec.html#schema_fingerprints)
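
To make the intent of the removed `schema_name_r` pattern easier to follow, here is a rough, regex-free approximation of the same rule: an optional dotted namespace followed by a name with no dots. `is_identifier` and `split_full_name` are hypothetical helpers written for this illustration, not functions from the crate. The sketch does not attempt to reproduce every edge case of the regular expression (a leading dot, for example).

```rust
/// Returns true for strings shaped like `[A-Za-z_][A-Za-z0-9_]*`.
fn is_identifier(s: &str) -> bool {
    let mut chars = s.chars();
    matches!(chars.next(), Some(c) if c.is_ascii_alphabetic() || c == '_')
        && chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}

/// Splits a full name into `(name, namespace)` at the last dot.
/// Returns `None` when the name or any namespace segment is not an identifier.
fn split_full_name(full: &str) -> Option<(String, Option<String>)> {
    match full.rsplit_once('.') {
        // A namespace is present: every dot-separated segment must be an identifier.
        Some((namespace, name))
            if is_identifier(name) && namespace.split('.').all(is_identifier) =>
        {
            Some((name.to_string(), Some(namespace.to_string())))
        }
        Some(_) => None,
        // No dot at all: the whole string is the name.
        None if is_identifier(full) => Some((full.to_string(), None)),
        None => None,
    }
}
```

With this sketch, `"ns.0.record1"` and `"ns..record1"` are both rejected, which mirrors the two cases exercised by the removed `test_avro_3897_disallow_invalid_namespaces_in_fully_qualified_name` test.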
@@ -279,13 +259,7 @@ impl Name {
}

fn get_name_and_namespace(name: &str) -> AvroResult<(String, Namespace)> {
let caps = schema_name_r()
.captures(name)
.ok_or_else(|| Error::InvalidSchemaName(name.to_string(), schema_name_r().as_str()))?;
Ok((
caps["name"].to_string(),
caps.name("namespace").map(|s| s.as_str().to_string()),
))
validate_schema_name(name)
}

/// Parse a `serde_json::Value` into a `Name`.
@@ -312,12 +286,7 @@ impl Name {
.filter(|ns| !ns.is_empty());

if let Some(ref ns) = namespace {
if !namespace_r().is_match(ns) {
return Err(Error::InvalidNamespace(
ns.to_string(),
namespace_r().as_str(),
));
}
validate_namespace(ns)?;
}

Ok(Self {
@@ -686,9 +655,7 @@ impl RecordField {
) -> AvroResult<Self> {
let name = field.name().ok_or(Error::GetNameFieldFromRecord)?;

if !field_name_r().is_match(&name) {
return Err(Error::FieldName(name));
}
validate_record_field_name(&name)?;

// TODO: "type" = "<record name>"
let schema = parser.parse_complex(field, &enclosing_record.namespace)?;
@@ -1713,10 +1680,7 @@ impl Parser {

let mut existing_symbols: HashSet<&String> = HashSet::with_capacity(symbols.len());
for symbol in symbols.iter() {
// Ensure enum symbol names match [A-Za-z_][A-Za-z0-9_]*
if !enum_symbol_name_r().is_match(symbol) {
return Err(Error::EnumSymbolName(symbol.to_string()));
}
validate_enum_symbol_name(symbol)?;

// Ensure there are no duplicate symbols
if existing_symbols.contains(&symbol) {
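
The loop in this hunk combines two checks per symbol: a pluggable name validation followed by a duplicate check. A self-contained sketch of that shape follows. `check_symbols`, `spec_symbol` and `lenient_symbol` are hypothetical names, and the validator is passed as a plain function reference rather than through the crate's registration mechanism.

```rust
use std::collections::HashSet;

/// Validates each symbol with the supplied validator, then rejects duplicates.
fn check_symbols<'a>(
    symbols: &'a [String],
    validate: &dyn Fn(&str) -> Result<(), String>,
) -> Result<(), String> {
    let mut seen: HashSet<&'a str> = HashSet::with_capacity(symbols.len());
    for symbol in symbols {
        // Name check first, so an invalid symbol is reported before a duplicate.
        validate(symbol.as_str())?;
        // `insert` returns false when the symbol was already present.
        if !seen.insert(symbol.as_str()) {
            return Err(format!("duplicate enum symbol: {symbol}"));
        }
    }
    Ok(())
}

/// Strict rule matching `[A-Za-z_][A-Za-z0-9_]*`.
fn spec_symbol(symbol: &str) -> Result<(), String> {
    let mut chars = symbol.chars();
    let first_ok = matches!(chars.next(), Some(c) if c.is_ascii_alphabetic() || c == '_');
    if first_ok && chars.all(|c| c.is_ascii_alphanumeric() || c == '_') {
        Ok(())
    } else {
        Err(format!("invalid enum symbol: {symbol}"))
    }
}

/// Lenient rule used only for illustration: accepts any non-empty symbol.
fn lenient_symbol(symbol: &str) -> Result<(), String> {
    if symbol.is_empty() {
        Err("empty enum symbol".to_string())
    } else {
        Ok(())
    }
}
```

Swapping `spec_symbol` for `lenient_symbol` changes which names are accepted, while the duplicate check stays the same.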
@@ -6277,26 +6241,6 @@ mod tests {
Ok(())
}

#[test]
fn test_avro_3897_disallow_invalid_namespaces_in_fully_qualified_name() -> TestResult {
let full_name = "ns.0.record1";
let name = Name::new(full_name);
assert!(name.is_err());
let expected =
Error::InvalidSchemaName(full_name.to_string(), schema_name_r().as_str()).to_string();
let err = name.map_err(|e| e.to_string()).err().unwrap();
assert_eq!(expected, err);

let full_name = "ns..record1";
let name = Name::new(full_name);
assert!(name.is_err());
let expected =
Error::InvalidSchemaName(full_name.to_string(), schema_name_r().as_str()).to_string();
let err = name.map_err(|e| e.to_string()).err().unwrap();
assert_eq!(expected, err);
Ok(())
}

/// A test case showing that names and namespaces can consist
/// entirely of underscores.
#[test]