Application is getting stuck with oversized events #209

Open
spenes opened this issue Nov 2, 2021 · 2 comments

spenes (Contributor) commented Nov 2, 2021

In v2.0.0, we realized that the application stops processing data at some point when oversized events are sent. I was able to reproduce this problem by sending events of ~500 KB.

The first suspect for this problem is the elastic4s version bump in v2.0.0. We haven't found the actual reason why it blocks the Kinesis consumer yet; once we do, we can look for a better solution. However, as a quick fix, we could detect oversized events before sending them to Elasticsearch and create bad rows from them. We might even keep this fix permanently, because prior versions of the ES Loader have a similar performance problem with oversized events: they don't halt completely, but they process oversized events very slowly.
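
A minimal sketch of that quick fix, assuming events arrive as serialized JSON strings (the helper name, the `BadRow` shape, and the `maxBytes` threshold are illustrative, not the app's actual API):

```scala
import java.nio.charset.StandardCharsets

final case class BadRow(payload: String, reason: String)

// Partition serialized events into those small enough to index and
// bad rows for the oversized ones. `maxBytes` is the empirically
// chosen threshold discussed below.
def partitionBySize(events: List[String], maxBytes: Int): (List[String], List[BadRow]) = {
  val (ok, oversized) =
    events.partition(_.getBytes(StandardCharsets.UTF_8).length <= maxBytes)
  val bad = oversized.map { e =>
    BadRow(e, s"Event exceeds maximum allowed size of $maxBytes bytes")
  }
  (ok, bad) // index `ok` via elastic4s; route `bad` to the bad stream
}
```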

So, the open question is how we decide whether an event is oversized. My suggestion is to do it empirically: test events of different sizes, find the size at which the problem starts, and use that to set the maximum.

Another potential solution is to truncate the oversized fields in the document sent to Elasticsearch. I don't think this is a good solution, because we would be manipulating incoming data without letting the user know there is a problem with it. It would be better to explicitly create bad rows from such events instead of silently truncating them.

spenes (Contributor, Author) commented Feb 25, 2022

It is specified in the linked reference that Lucene's term byte-length limit is 32766 bytes. Since this is most probably the limit we are hitting, it could be a good value for the field size limit.

istreeter (Contributor) commented

I agree we should be guided by the Lucene byte-length limit. However, be careful about using string length:

scala> "😊".length
res0: Int = 2

scala> "😊".getBytes("UTF-8").length
res1: Int = 4

We should either impose a string character limit of 16383 or a byte limit of 32766.
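
A minimal sketch of the byte-based check, assuming we measure the UTF-8 encoding of each stringified field (the helper names are hypothetical):

```scala
import java.nio.charset.StandardCharsets

// Lucene's maximum term length, per the comment above.
val MaxFieldBytes = 32766

// Count UTF-8 bytes, not UTF-16 chars: "😊".length == 2,
// but it encodes to 4 bytes.
def utf8Length(s: String): Int =
  s.getBytes(StandardCharsets.UTF_8).length

def exceedsLuceneLimit(fieldValue: String): Boolean =
  utf8Length(fieldValue) > MaxFieldBytes
```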

For now, I agree that if an event violates the length limit then we should send it to bad, because that matches the previous behaviour of the app: in old versions, we would try to load the event, it would fail, and then we would send it to bad.

In future, we might choose to implement a feature where we truncate the fields before inserting, but that is a new feature that goes beyond what we need to do here to fix this bug.
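
If such a feature were ever implemented, the truncation would have to respect code point boundaries so that a multi-byte character like the emoji above is never split mid-sequence. A rough sketch of one way to do that (a hypothetical helper, not part of the loader):

```scala
import java.nio.charset.StandardCharsets

// Truncate to at most `maxBytes` of UTF-8 without splitting a code
// point: walk code point by code point and stop before overflowing.
def truncateUtf8(s: String, maxBytes: Int): String = {
  val sb = new StringBuilder
  var bytes = 0
  var i = 0
  while (i < s.length) {
    val chunk = s.substring(i, i + Character.charCount(s.codePointAt(i)))
    val chunkBytes = chunk.getBytes(StandardCharsets.UTF_8).length
    if (bytes + chunkBytes > maxBytes) return sb.toString
    sb.append(chunk)
    bytes += chunkBytes
    i += chunk.length
  }
  sb.toString
}
```

For example, `truncateUtf8("😊" * 10000, 32766)` would keep only whole emoji (8191 of them, 32764 bytes) rather than cutting one in half at the byte boundary.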
