[BUG] S3 Sink Avro Output issues #3158

Closed
engechas opened this issue Aug 14, 2023 · 5 comments · Fixed by #3196
Labels: bug (Something isn't working)

Comments

engechas (Collaborator) commented Aug 14, 2023

Describe the bug
A few bugs were found in the S3 sink's Avro output codec:

  1. With compression: gzip specified, the files are still uploaded with just a .avro suffix. This should be .avro.gz
  2. The contents of the avro file show null for all fields:
{"version":null,"srcport":null,"dstport":null,"accountId":null,"interfaceId":null,"srcaddr":null,"dstaddr":null,"start":null,"end":null,"protocol":null,"packets":null,"bytes":null,"action":null,"logStatus":null}
{"version":null,"srcport":null,"dstport":null,"accountId":null,"interfaceId":null,"srcaddr":null,"dstaddr":null,"start":null,"end":null,"protocol":null,"packets":null,"bytes":null,"action":null,"logStatus":null}
{"version":null,"srcport":null,"dstport":null,"accountId":null,"interfaceId":null,"srcaddr":null,"dstaddr":null,"start":null,"end":null,"protocol":null,"packets":null,"bytes":null,"action":null,"logStatus":null}
{"version":null,"srcport":null,"dstport":null,"accountId":null,"interfaceId":null,"srcaddr":null,"dstaddr":null,"start":null,"end":null,"protocol":null,"packets":null,"bytes":null,"action":null,"logStatus":null}
  3. The timestamp in the file name is 12 hours behind UTC. I am not sure what the "correct" time is, but UTC seems logical.

To Reproduce
Sink config:

sink:
    - s3:
        aws:
          region: "us-west-2"
          sts_role_arn: "<my role>"
        bucket: "my-sink-bucket"
        object_key:
          path_prefix: "s3-sink"
        threshold:
          event_collect_timeout: 600s
        compression: "gzip"
        codec:
          avro:
            schema: >
              {
                "type" : "record",
                "namespace" : "org.opensearch.dataprepper.examples",
                "name" : "VpcFlowLog",
                "fields" : [
                  { "name" : "version", "type" : ["null", "string"]},
                  { "name" : "srcport", "type": ["null", "int"]},
                  { "name" : "dstport", "type": ["null", "int"]},
                  { "name" : "accountId", "type" : ["null", "string"]},
                  { "name" : "interfaceId", "type" : ["null", "string"]},
                  { "name" : "srcaddr", "type" : ["null", "string"]},
                  { "name" : "dstaddr", "type" : ["null", "string"]},
                  { "name" : "start", "type": ["null", "int"]},
                  { "name" : "end", "type": ["null", "int"]},
                  { "name" : "protocol", "type": ["null", "int"]},
                  { "name" : "packets", "type": ["null", "int"]},
                  { "name" : "bytes", "type": ["null", "int"]},
                  { "name" : "action", "type": ["null", "string"]},
                  { "name" : "logStatus", "type" : ["null", "string"]}
                ]
              }
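
(For reference, a minimal way to inspect the uploaded object locally is to dump it with Avro's GenericDatumReader. This is a sketch, assuming the Apache Avro Java library; the local file name is hypothetical, and the object may need to be decompressed first.)

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

import java.io.File;
import java.io.IOException;

public class DumpAvro {
    public static void main(String[] args) throws IOException {
        // Local copy of the object downloaded from the sink bucket.
        final File avroFile = new File("downloaded-object.avro");
        final GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
        try (DataFileReader<GenericRecord> fileReader = new DataFileReader<>(avroFile, datumReader)) {
            // Each record prints as JSON; with this bug, every field comes back null.
            for (GenericRecord record : fileReader) {
                System.out.println(record);
            }
        }
    }
}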

Screenshots
File name times example:
(screenshot of object key timestamps; image not reproduced here)

engechas added the bug (Something isn't working) and untriaged labels Aug 14, 2023
dlvenable self-assigned this Aug 15, 2023
dlvenable added this to the v2.4 milestone Aug 15, 2023
dlvenable (Member) commented:

The Avro codec does not work with union types, nor is there any support for them in the code:

if (PRIMITIVE_TYPES.contains(fieldType)) {
    switch (fieldType) {
        case "string":
            finalValue = rawValue.toString();
            break;
        case "int":
            finalValue = Integer.parseInt(rawValue.toString());
            break;
        case "float":
            finalValue = Float.parseFloat(rawValue.toString());
            break;
        case "double":
            finalValue = Double.parseDouble(rawValue.toString());
            break;
        case "long":
            finalValue = Long.parseLong(rawValue.toString());
            break;
        case "bytes":
            finalValue = rawValue.toString().getBytes(StandardCharsets.UTF_8);
            break;
        default:
            LOG.error("Unrecognised Field name : '{}' & type : '{}'", field.name(), fieldType);
            break;
    }
} else {
    if (fieldType.equals("record") && rawValue instanceof Map) {
        finalValue = buildAvroRecord(field.schema(), (Map<String, Object>) rawValue);
    } else if (fieldType.equals("array") && rawValue instanceof List) {
        GenericData.Array<String> avroArray =
                new GenericData.Array<>(((List<String>) rawValue).size(), field.schema());
        for (String element : ((List<String>) rawValue)) {
            avroArray.add(element);
        }
        finalValue = avroArray;
    }
}
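
(One way to support a union like ["null", "string"] would be to resolve the non-null branch before converting the value with the primitive-type logic above. The helper below is a hypothetical sketch, not the merged fix.)

import org.apache.avro.Schema;

// Hypothetical helper: for a union field schema, return the first non-null branch
// so its type can be handled by the existing primitive/record/array cases.
static Schema resolveNonNullSchema(final Schema fieldSchema) {
    if (fieldSchema.getType() != Schema.Type.UNION) {
        return fieldSchema;
    }
    for (final Schema branch : fieldSchema.getTypes()) {
        if (branch.getType() != Schema.Type.NULL) {
            return branch;
        }
    }
    // The union contained only "null"; there is nothing to convert.
    return fieldSchema;
}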

engechas (Collaborator, Author) commented:

For the timestamp in the file name, the same behavior is seen with ndjson, so it doesn't look related to Avro; it appears to affect the S3 sink as a whole.
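
(For reference, a minimal sketch of producing a UTC-based timestamp for the object key with java.time; this is illustrative only, not necessarily how the fix is implemented.)

import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Format the current time in UTC so the timestamp embedded in the object key
// is not shifted by a local or otherwise-offset time zone.
final DateTimeFormatter keyFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH-mm-ss'Z'");
final String keyTimestamp = ZonedDateTime.now(ZoneOffset.UTC).format(keyFormatter);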

engechas (Collaborator, Author) commented:

#3171 fixes item 3.

dlvenable added a commit to dlvenable/data-prepper that referenced this issue Aug 18, 2023
…a is null. Auto-generate schemas that are nullable so that null values can be included in these schemas. Resolves part of opensearch-project#3158.

Signed-off-by: David Venable <[email protected]>
dlvenable (Member) commented:

PR #3194 fixes item 2.
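
(For context, auto-generating a nullable schema generally means wrapping each inferred field type in a union with null. The snippet below is an illustrative sketch using Avro's SchemaBuilder, not the code from #3194.)

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

// Build a record schema whose fields are nullable unions (e.g. ["null", "string"])
// so records containing null values can still be written.
final Schema schema = SchemaBuilder.record("VpcFlowLog")
        .namespace("org.opensearch.dataprepper.examples")
        .fields()
        .optionalString("version")   // shorthand for a ["null", "string"] union with null default
        .optionalInt("srcport")
        .optionalInt("dstport")
        .optionalString("accountId")
        .endRecord();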

dlvenable added a commit that referenced this issue Aug 18, 2023
…a is null. Auto-generate schemas that are nullable so that null values can be included in these schemas. Resolves part of #3158. (#3194)

Signed-off-by: David Venable <[email protected]>
dlvenable added a commit to dlvenable/data-prepper that referenced this issue Aug 18, 2023
…f compression is internal, does not utilize. Resolves opensearch-project#3158.

Signed-off-by: David Venable <[email protected]>
dlvenable (Member) commented:

Item 1 should be resolved by #3196.
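
(For context, a minimal sketch of deriving the object suffix from the configured compression; the helper name and structure are hypothetical, not the actual change in #3196.)

// Hypothetical helper: append a compression-specific suffix to the codec's base
// extension unless the codec already compresses internally (as Avro can).
static String buildObjectExtension(final String codecExtension,
                                   final String compression,
                                   final boolean codecCompressesInternally) {
    if (codecCompressesInternally || !"gzip".equals(compression)) {
        return codecExtension;              // e.g. "avro"
    }
    return codecExtension + ".gz";          // e.g. "avro.gz"
}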

dlvenable added a commit that referenced this issue Aug 21, 2023
…f compression is internal, does not utilize. Resolves #3158. (#3196)

Signed-off-by: David Venable <[email protected]>
github-project-automation bot moved this from Unplanned to Done in Data Prepper Tracking Board Aug 21, 2023