
feat(decoding): Implement chunked GELF decoding #20859

Open
wants to merge 67 commits into master
Conversation

@jorgehermo9 jorgehermo9 commented Jul 15, 2024

Closes #20769. This PR is kind of large (900 lines of code, 600 of which are generated docs), so if you prefer to chat via Discord (and to be more agile while merging this), I'm in the Vector community server, username @jorgehermo9.

Implementation is based on Graylog's documentation and Graylog's go-gelf library

In my local environment some tests are failing. Could you please trigger the CI so I can see whether it is a problem with my environment? If not, I can proceed to fix them.

@jorgehermo9 jorgehermo9 requested a review from a team as a code owner July 15, 2024 09:09
@github-actions github-actions bot added the domain: sources Anything related to the Vector's sources label Jul 15, 2024
Cargo.toml Outdated
@@ -139,6 +139,7 @@ serde_json = { version = "1.0.120", default-features = false, features = ["raw_v
serde = { version = "1.0.204", default-features = false, features = ["alloc", "derive", "rc"] }
toml = { version = "0.8.14", default-features = false, features = ["display", "parse"] }
vrl = { version = "0.16.1", features = ["arbitrary", "cli", "test", "test_framework"] }
tokio = { version = "1.38.0", default-features = false, features = ["full"] }
Contributor Author

@jorgehermo9 jorgehermo9 Jul 15, 2024

I needed to use tokio inside the lib/codecs crate in order to implement GELF decoding timeouts with tokio tasks, so I added the dependency as a workspace one. If there is any problem with this, we can find another solution.
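The timeout idea described above can be sketched as follows. The PR uses tokio tasks; the same shape is shown here with a plain std thread so the sketch stays self-contained, and all names are hypothetical, not the PR's actual API:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// Hypothetical sketch: once the timeout elapses, drop whatever chunks are
// still pending for this message id. In the PR this is a tokio task and the
// drop is reported to the caller as a timeout error.
fn spawn_message_timeout(
    state: Arc<Mutex<HashMap<u64, Vec<u8>>>>,
    message_id: u64,
    timeout: Duration,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        thread::sleep(timeout);
        if state.lock().unwrap().remove(&message_id).is_some() {
            // Message never completed within the timeout; its chunks are gone.
        }
    })
}
```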

[const { Bytes::new() }; GELF_MAX_TOTAL_CHUNKS as usize];
const DEFAULT_TIMEOUT_MILLIS: u64 = 5000;
// TODO: ask what would be an appropriate default value for this
const DEFAULT_PENDING_MESSAGES_LIMIT: usize = 1000;
Contributor Author

I don't know what value is appropriate here. Do you have any recommendation?

This limit was enforced so we have a memory-bounded decoder.

The maximum UDP packet size is 65,535 bytes... so with this limit, I think we have roughly a 65 MB memory limit for pending message storage.

However, the framing is agnostic of the transport protocol, so other protocols may not have that per-message size limit, and thus this can be "theoretically unbounded" (for example, when reading raw bytes from a file).

Should we also enforce a per-message limit?
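The pending-messages bound under discussion could look roughly like this sketch (struct and method names are illustrative, not the PR's actual API): chunks for already-known message ids are always accepted, but chunks that would create a new entry are dropped once the map is at capacity.

```rust
use std::collections::HashMap;

// Hypothetical sketch of a memory-bounded pending-chunk store.
struct PendingStore {
    pending: HashMap<u64, Vec<u8>>,
    pending_messages_limit: usize,
}

impl PendingStore {
    /// Returns false (chunk dropped) when a brand-new message id would
    /// push the store past its limit.
    fn try_insert_chunk(&mut self, message_id: u64, chunk: &[u8]) -> bool {
        if !self.pending.contains_key(&message_id)
            && self.pending.len() >= self.pending_messages_limit
        {
            return false;
        }
        self.pending
            .entry(message_id)
            .or_default()
            .extend_from_slice(chunk);
        true
    }
}
```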

Member

I was hoping the reference implementation would serve as prior art here, but https://github.com/Graylog2/graylog2-server/blob/3c7f9df250f7d58d99e9c554d9307dc1eec9fdac/graylog2-server/src/main/java/org/graylog2/inputs/codecs/GelfChunkAggregator.java seems like they have no pending message limit, just the timeout of 5 seconds as you have. I think I'd suggest having this as an option for people that do want to bound the memory, but default to unlimited to match Graylog server behavior.

Contributor Author

Addressed this in 85edb00. Feel free to resolve this thread if the change is what you expected

pub timeout_millis: u64,

/// The maximum number of pending incomplete messages. If this limit is reached, the decoder will start
/// dropping chunks of new messages. This limit ensures the memory usage of the decoder's state is bounded.
Contributor Author

@jorgehermo9 jorgehermo9 Jul 15, 2024

The bound is per total messages, but there is no per-message memory usage limit. We can theoretically have a 100GB single message and it won't be limited by this setting.

As stated before, should we include a per-message limit?

Member

I like the idea of having a configurable bound on the number of pending messages.

The chunked encoding is only used for UDP, yes? Shouldn't that provide a defacto bound on size? How can we have a 100 GB message?

Contributor Author

@jorgehermo9 jorgehermo9 Jul 27, 2024

The chunked encoding is only used for UDP, yes?

Yes, it is intended to be used only for UDP, and therefore it would be limited by the UDP packet limit of 65 KB.
Nevertheless, as chunked_gelf is a framing method, nothing prevents users from using it with other types of sources, for example, with file sources, by explicitly setting the config framing.method="chunked-gelf". Although it really does not make sense to use that framing method outside of UDP socket sources, and no one will use it that way in real environments... So maybe it is ok to leave this as it is.

Member

@jszwedko jszwedko Oct 10, 2024

We could add a max_length option. This would be consistent with other framers: https://vector.dev/docs/reference/configuration/sources/socket/#framing.newline_delimited.max_length

In chunked_gelf's case, I think we'd want to limit the length of the accumulated chunks in addition to each individual chunk.

Contributor Author

Given the nature of GELF messages, as they are just JSON, I don't think it would be fine to simply truncate the input: the JSON message would most likely be broken, and GELF deserialization would fail in nearly all cases after truncating.

Should we instead discard the whole message (including previously stored chunks) if an individual chunk reaches its defined limit or the accumulated-chunks limit is reached? I don't know if it's worth doing this, but I'm open to implementing it if you think it would be worthwhile.

Member

Ah, yes, I think we'd want to discard messages that exceed the limit (in the future we can route them to a "dead letter" output). This is consistent with the other framers (see example:

warn!(
message = "Discarding frame larger than max_length.",
buf_len = buf.len(),
max_length = self.max_length,
internal_log_rate_limit = true
);
).

I think it'd be worth it to remove a DOS risk if it isn't too much effort to add.
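The discard-on-overflow behavior agreed on above could look roughly like this sketch (function and variable names are illustrative, not the PR's actual API): when the accumulated payload would exceed the limit, all buffered chunks of the message are dropped rather than truncated.

```rust
use std::collections::HashMap;

// Hypothetical sketch: reject and fully discard a message whose accumulated
// payload would exceed `max_message_length`, since a truncated GELF payload
// could never deserialize as valid JSON anyway.
fn accumulate_or_discard(
    pending: &mut HashMap<u64, Vec<u8>>,
    message_id: u64,
    chunk: &[u8],
    max_message_length: usize,
) -> Result<(), String> {
    let buffered_len = pending.get(&message_id).map_or(0, |b| b.len());
    if buffered_len + chunk.len() > max_message_length {
        // Drop every chunk buffered so far for this message.
        pending.remove(&message_id);
        return Err(format!(
            "message {message_id} exceeded max_message_length ({max_message_length} bytes)"
        ));
    }
    pending
        .entry(message_id)
        .or_default()
        .extend_from_slice(chunk);
    Ok(())
}
```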

@jorgehermo9 jorgehermo9 requested a review from a team as a code owner October 18, 2024 15:32
&mut self,
mut chunk: Bytes,
) -> Result<Option<Bytes>, ChunkedGelfDecoderError> {
// Encoding scheme:
Contributor Author

I think it would be useful to record a metric tracking the number of currently pending messages. How should we approach this? I think this is usually done at Vector's binary level and not in inner libs, as I see a lot of metrics located at src/internal_events.

Also, it would be useful to record the number of timed out messages

Member

That would be useful telemetry. Unfortunately I don't think there is a precedent for codecs emitting metrics, so this may require a bit of scaffolding if you want to add it. I don't think we want to emit telemetry directly from the codec crate so it'd look something like exposing metrics that then the caller (Vector) can emit.

I think this could mean adding a trait method to Framer,

pub trait Framer:
tokio_util::codec::Decoder<Item = Bytes, Error = BoxedFramingError> + DynClone + Debug + Send + Sync
{
}
/// Default implementation for `Framer`s that implement
/// `tokio_util::codec::Decoder`.
impl<Decoder> Framer for Decoder where
Decoder: tokio_util::codec::Decoder<Item = Bytes, Error = BoxedFramingError>
+ Clone
+ Debug
+ Send
+ Sync
{
}
, that exposes telemetry that the caller then polls and emits; or maybe it takes a callback that is called when the codec wants to emit telemetry (example from the kafka source)
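One possible shape for the "caller polls and emits" idea is sketched below. `FramerTelemetry`, `TelemetrySource`, and the field names are all hypothetical, not the crate's actual API; a default-implemented method means stateless framers need no changes.

```rust
/// Hypothetical telemetry snapshot a stateful framer could expose; the
/// caller (Vector) would poll it and emit the matching internal events.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct FramerTelemetry {
    pub pending_messages: usize,
    pub timed_out_messages: u64,
}

/// Sketch of a default-implemented trait method, so only framers that
/// actually track state (like the chunked GELF decoder) override it.
pub trait TelemetrySource {
    fn telemetry(&self) -> FramerTelemetry {
        FramerTelemetry::default()
    }
}

/// A stateless framer gets empty telemetry for free.
pub struct NoopFramer;
impl TelemetrySource for NoopFramer {}
```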

Member

And, yes, we generally try to have only internal_events actually emit metrics to centralize it all in one place.

Contributor Author

I think emitting metrics from this point would require a refactor that, in my opinion, is slightly out of scope for this PR. Maybe we can create an issue to track metrics in decoders in general.

I don't mind addressing it myself, as I find it interesting, but I prefer to narrow the scope of this PR.

}

#[tokio::test]
async fn decode_shuffled_messages() {
Contributor Author

Included this new test in 2ab9bf3.

@jorgehermo9
Contributor Author

Hi @jszwedko, thank you very much for the review. I've addressed all the comments and I'm ready for another review round.

Things left:

Member

@jszwedko jszwedko left a comment

Thanks @jorgehermo9 ! I left a couple more comments. I think this is looking pretty close.

// This limitation is due to the fact that the GELF format does not specify the length of the
// message, so we have to read all the bytes from the message (datagram)
bytes_decoder: BytesDecoder,
state: Arc<Mutex<HashMap<u64, MessageState>>>,
Member

This would be a performance optimization, but we could think about using dashmap here to reduce lock contention. As it stands, it seems like only one packet can be processed at a time due to the locking, which may limit throughput.

Contributor Author

I think the UDP socket source does not parallelize, and the whole process of reading from the socket -> framing -> decoding is entirely sequential from the UDP socket's perspective.

I may be wrong but I think

recv = socket.recv_from(&mut buf) => {
is purely sequential.

So I think the lock would only compete with the message-timeout acquisitions, but in that case, if there are a lot of concurrent timeouts, you are right that dashmap could help.

Anyway, I think it would be useful for this last case, but I'd like you to confirm the first part about the socket's sequential reads.

Contributor Author

I implemented it with dashmap in jorgehermo9@3e5f57f on another branch (https://github.com/jorgehermo9/vector/tree/feature/chunked-gelf-dashmap), and it seems that my implementation causes deadlocks and several test failures (screenshot of the failing tests omitted).

I'm concerned about the usage of the entry and remove methods, which seem to be the cause of the deadlocks. The message_state reference returned by the entry method lives until the end of the scope, but we need to call remove in some branches before returning from the decode_chunk method, so I think that's the cause of the deadlock.

I don't know how to model what I'm doing with DashMap
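The usual way out of this class of deadlock is to scope the guard so it is dropped before the map is touched again. The sketch below shows the pattern with a std `Mutex<HashMap>` (a stand-in for DashMap's shard lock, since a DashMap `Entry` guard behaves the same way); all names and the "complete" check are illustrative, not the branch's actual code.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical sketch of the guard-scoping pattern: any reference obtained
// from the map must be dropped before locking the map again to remove the
// key, otherwise the second acquisition deadlocks (or, with DashMap's
// entry guard, blocks on its own shard).
fn take_if_complete(map: &Mutex<HashMap<u64, Vec<u8>>>, id: u64) -> Option<Vec<u8>> {
    let is_complete = {
        let guard = map.lock().unwrap();
        // Stand-in for "all chunks received": here, any non-empty buffer.
        guard.get(&id).map_or(false, |chunks| !chunks.is_empty())
    }; // guard dropped here, before the second lock below
    if is_complete {
        map.lock().unwrap().remove(&id)
    } else {
        None
    }
}
```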

#[serde(default, skip_serializing_if = "vector_core::serde::is_default")]
pub pending_messages_limit: Option<usize>,

/// The maximum length of a single GELF chunk, in bytes. Chunks longer than this length will
Contributor Author

Please @jszwedko take a look at the max_chunk_length and max_message_length documentation. I hope it is well explained.

chunk_length: usize,
max_chunk_length: usize,
},
#[snafu(display("Message with id {message_id} has exceeded the maximum message length and it will be dropped: got {message_length} bytes and max message length is {max_message_length} bytes. Discarding all buffered chunks of that message"))]
Contributor Author

@jszwedko Note that the discarding notice to the user is carried inside this error. It will be logged outside of this framer when the MaxMessageLengthExceeded error is returned. This is a bit different from the approach you described:

warn!(
message = "Discarding frame larger than max_length.",
buf_len = buf.len(),
max_length = self.max_length,
internal_log_rate_limit = true
);
where the message is silently discarded and no error is returned to the caller.
What do you prefer: a discarding warning or an error? In my opinion, this is not really an error and fits a warning better, but since data is discarded, you said in previous comments that we should return an error in those cases.

@jorgehermo9
Contributor Author

jorgehermo9 commented Oct 25, 2024

Hi @jszwedko, thank you for your review. I addressed max_chunk_length and max_message_length in 2139ae4. I also refactored the tests' assertions a little; please take a look at that too.

I still have to answer the metric-emitting comment and the DashMap one. I think using DashMap won't hurt and it will be easy to do, but I wonder if we would really see a performance improvement.

@jorgehermo9
Contributor Author

jorgehermo9 commented Oct 25, 2024

I just noticed that I missed adding a max_message_length test. I will add it as soon as I can, but it will be very similar to the max_chunk_length one.

EDIT: addressed in f4630e9

///
/// This limit takes into account only the chunk's payload; the GELF header bytes are excluded from the calculation.
#[serde(skip_serializing_if = "vector_core::serde::is_default")]
pub max_chunk_length: Option<usize>,
Contributor Author

@jorgehermo9 jorgehermo9 Oct 26, 2024

Should we use no limit by default (None), or use the maximum data (payload) size of UDP datagrams, 65,507 bytes (65,535-byte maximum IP datagram − 20-byte IP header − 8-byte UDP header)?

(Actually, we should use 65,507 − 10, because we exclude the 10 bytes of the chunked GELF header from this chunk length count.)
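The arithmetic above, spelled out as constants (the header sizes are standard IPv4/UDP values; the 10-byte chunked GELF header comes from the GELF spec):

```rust
// Standard IPv4 and UDP header sizes plus the chunked GELF header.
const MAX_IP_DATAGRAM: usize = 65_535; // 16-bit total-length field
const IP_HEADER: usize = 20; // IPv4, no options
const UDP_HEADER: usize = 8;
const GELF_CHUNK_HEADER: usize = 10; // 2 magic + 8 message id... per spec

// Largest UDP payload a single IPv4 datagram can carry.
const MAX_UDP_PAYLOAD: usize = MAX_IP_DATAGRAM - IP_HEADER - UDP_HEADER;
// Largest chunk payload once the chunked GELF header is excluded.
const MAX_CHUNK_PAYLOAD: usize = MAX_UDP_PAYLOAD - GELF_CHUNK_HEADER;
```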

/// This limit takes into account only the message's payload; the GELF header bytes are excluded from the calculation.
/// The message's payload is the concatenation of all the chunks' payloads.
#[serde(skip_serializing_if = "vector_core::serde::is_default")]
pub max_message_length: Option<usize>,
Contributor Author

Same as above: should we use None by default, or 65,535 * 128 bytes, which is the max UDP datagram size times the maximum number of chunks?

Both of those defaults would work (theoretically) the same for the UDP socket source.


if let Some(max_chunk_length) = self.max_chunk_length {
let chunk_length = chunk.remaining();
ensure!(
Contributor Author

Should we remove the state for this chunk's message id?

I don't know if it makes sense to keep the buffered chunks if we discard one, as the message would never complete.

Contributor Author

I wonder if we should also do this for the InvalidTotalChunks error, for example, or in any other case?

Labels
domain: external docs (Anything related to Vector's external, public documentation)
domain: sources (Anything related to the Vector's sources)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Chunked GELF Decoding
3 participants