Investigate and fix bad I/O performance #558
Comments
Main issue can for instance be seen in |
Small part of the
|
@ikorennoy here's the bug regarding the bad I/O performance. |
YourKit Snapshots: gradle-worker-classpath.zip |
I believe GSON is already slow to parse the input file. Maybe it's the
/**
 * Create a new {@link JsonReader} instance on a file.
 *
 * @param path the path to the file
 * @return an {@link JsonReader} instance
 */
public static JsonReader createFileReader(final Path path) {
  checkNotNull(path);
  try {
    final var fileReader = new FileReader(path.toFile());
    final var jsonReader = new JsonReader(fileReader);
    jsonReader.setLenient(true);
    return jsonReader;
  } catch (final FileNotFoundException e) {
    throw new UncheckedIOException(e);
  }
} |
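One thing that could be tried here is putting a larger buffer between the file and the JSON parser. The following is only a minimal sketch using the JDK alone: the class name `BufferedFileReaders` and the 64 KiB buffer size are assumptions for illustration, not part of Sirix, and whether this changes parse throughput at all is unmeasured.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Sirix.
final class BufferedFileReaders {
  // Assumed buffer size (in chars); the best value would have to be measured.
  static final int BUFFER_SIZE = 1 << 16;

  /** Opens the file as UTF-8 text behind a BufferedReader with a large buffer. */
  static Reader open(final Path path) {
    try {
      return new BufferedReader(
          new InputStreamReader(Files.newInputStream(path), StandardCharsets.UTF_8),
          BUFFER_SIZE);
    } catch (final IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The returned `Reader` could be handed to `new JsonReader(reader)` in place of the `FileReader`. It also fixes the charset to UTF-8 instead of relying on the platform default that `FileReader(File)` uses.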
It might only be slow because of the Sirix code inserting nodes...
@Test
public void testGson() throws IOException {
  final var jsonPath = JSON.resolve("cityofchicago.json");
  final var reader = JsonShredder.createFileReader(jsonPath);
  // Iterate over all nodes.
  while (reader.peek() != JsonToken.END_DOCUMENT) {
    final var nextToken = reader.peek();
    switch (nextToken) {
      case BEGIN_OBJECT -> reader.beginObject();
      case NAME -> reader.nextName();
      case END_OBJECT -> reader.endObject();
      case BEGIN_ARRAY -> reader.beginArray();
      case END_ARRAY -> reader.endArray();
      case STRING, NUMBER -> reader.nextString();
      case BOOLEAN -> reader.nextBoolean();
      case NULL -> reader.nextNull();
      default -> {
      }
    }
  }
}
This reads and parses the file in approx. 20 seconds on my machine, which is approx. 190 MB/s. That's okay, I guess (not particularly fast, but still). |
We've made some progress here: traversing the Chicago file in preorder now takes around 1:20 min on my machine, and, depending on whether a path summary is created, insertion takes approximately 2:20 to 2:50 min. Before, reading in preorder took around 2:30 min and the import around 4:00 to 4:30 min on my machine. There may still be room for improvement, though. We should also investigate the performance of the coroutine-based descendant axis (which prefetches right siblings during down-traversal in preorder). |
A JFR recording with async profiler (IntelliJ Ultimate) inserting the chicago file 5 times in a row (agent options: |
A sweet spot seems to be to flush a new revision to disk every |
With wall clock time enabled (compression seems to be a huge factor): |
We currently face a memory leak... |
Hello @JohannesLichtenberger, I would like to help. Is the issue still as the title describes, or has the performance been fixed and the memory leak is now the issue? |
I think the main issue was that the page cache max sizes were too high, so the allocation rate of objects was higher than the GC could clean up. Especially with ZGC, as it's not generational yet, I've had bad pause times somehow... Shenandoah was better, but it's also not generational yet. What's strange is that the io_uring implementation is slower than the FileChannel-based version. Maybe the event loop is an issue somehow... |
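To illustrate what a hard cap on cache entries means for the allocation pressure described above, here is a minimal sketch of a size-bounded LRU map built on the JDK's `LinkedHashMap`. The class name `BoundedPageCache` and its constructor parameter are hypothetical; this is not the cache implementation SirixDB actually uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not SirixDB's page cache.
final class BoundedPageCache<K, V> extends LinkedHashMap<K, V> {
  private final int maxEntries;

  BoundedPageCache(final int maxEntries) {
    // accessOrder = true: iteration order runs from least to most recently accessed.
    super(16, 0.75f, true);
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(final Map.Entry<K, V> eldest) {
    // Called after every put; evicts the least recently used entry once over the cap.
    return size() > maxEntries;
  }
}
```

A smaller `maxEntries` keeps fewer pages strongly reachable, so the collector has less live data to trace, at the cost of more cache misses. The right trade-off would have to be measured.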
@Kiriakos1998, would you like to work on this? Also, this might be interesting: #609. We'd also need JMH tests to better measure performance regarding JSON data (import/traverse/query)... |
Sure I can take a look. |
@Kiriakos1998 have you already had some time to look into the issue(s)? |
Yes. However, although I managed to find the index of the iteration (3 or 4) that makes the execution time skyrocket, I was not able to find the exact line of code, which is a little strange. I can provide more details if needed. |
@JohannesLichtenberger To be more specific: as I already mentioned, one iteration makes the execution time skyrocket, meaning that this specific iteration is responsible for more than 95% of the execution time. But when I try to isolate the execution time of each line of code in the loop, none of them reports any significant execution time. |
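For reference, per-iteration timing of the kind described above could be sketched as follows. The names `IterationTimer`, `timeEach`, and `slowest` are hypothetical helpers, and the loop body is a placeholder for whatever is being measured, not the actual Sirix loop.

```java
import java.util.function.IntConsumer;

// Hypothetical helper for locating a single slow loop iteration.
final class IterationTimer {
  /** Runs body for i in [0, iterations) and returns the elapsed nanoseconds of each run. */
  static long[] timeEach(final int iterations, final IntConsumer body) {
    final long[] nanos = new long[iterations];
    for (int i = 0; i < iterations; i++) {
      final long start = System.nanoTime();
      body.accept(i);
      nanos[i] = System.nanoTime() - start;
    }
    return nanos;
  }

  /** Returns the index of the slowest iteration, or -1 if the array is empty. */
  static int slowest(final long[] nanos) {
    int idx = -1;
    for (int i = 0; i < nanos.length; i++) {
      if (idx < 0 || nanos[i] > nanos[idx]) {
        idx = i;
      }
    }
    return idx;
  }
}
```

If one iteration dominates the wall-clock time but no statement inside it does, comparing these numbers with GC logs or a profiler's wall-clock view may help tell apart time spent computing from time spent waiting.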
Maybe it's when reading from the producer / channel kicks in!? |
@Kiriakos1998 would you also like to have a look regarding the io_uring backend? |
Sure. For this I need a Linux environment, correct? |
Yes, correct. Then you can try to run |
You may then use a debugger. For IntelliJ there's a plugin from Johannes Bechberger. If you'd like to keep working on SirixDB, I think I have one or two licenses left for IntelliJ Ultimate and the other JetBrains tools (but they should only be used for SirixDB/Brackit, as we got the licenses for free). |
I have IntelliJ Ultimate, but I think it will be hard for me to work on this issue. I can create a virtual machine for a Linux environment (my PC runs Windows), but I don't think I'll be able to install IntelliJ there, at least not the Ultimate version. Do you think it's absolutely necessary to investigate this issue? |
I may take a look myself again next week :) |
Write speed is only about 10 MB/s, which is rather low and doesn't come close to saturating modern SSDs (or even HDDs). SirixDB should be designed to take advantage of the characteristics of modern flash drives.
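As a baseline to compare such a figure against, raw sequential write throughput of the drive could be measured with a small stand-alone program. This is only a sketch under assumptions: the class name `WriteThroughput`, the chunk size, and the total size are hypothetical choices, and the result says nothing about SirixDB's own write path.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical micro-measurement, independent of SirixDB.
final class WriteThroughput {
  /** Writes totalBytes of zeros in chunks, forces them to disk, and returns MB/s. */
  static double measure(final Path file, final long totalBytes, final int chunkBytes)
      throws IOException {
    final ByteBuffer chunk = ByteBuffer.allocateDirect(chunkBytes);
    final long start = System.nanoTime();
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.CREATE,
        StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
      long written = 0;
      while (written < totalBytes) {
        chunk.clear();
        // Shrink the last chunk so exactly totalBytes are written.
        chunk.limit((int) Math.min(chunkBytes, totalBytes - written));
        while (chunk.hasRemaining()) {
          written += channel.write(chunk);
        }
      }
      // Include the flush to the device in the measured time.
      channel.force(true);
    }
    final double seconds = (System.nanoTime() - start) / 1e9;
    return totalBytes / 1e6 / seconds;
  }
}
```

Calling `WriteThroughput.measure(path, 1L << 30, 1 << 20)` would write 1 GiB in 1 MiB chunks. The gap between that number and the observed import write rate gives a rough idea of how much of the time is spent outside the actual disk I/O.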