FileHistoryCache#store is one big memory hog #3243
Comments
Low hanging fruit: one way to avoid the explosion is to move the deep copy of the history entries right before the tags are assigned in |
Actually, the deep copy looks unnecessary, since the tags are reset in all the cases in opengrok/opengrok-indexer/src/main/java/org/opengrok/indexer/history/FileHistoryCache.java, lines 121 to 140 at 115e69d.
Ultimately, it would help if the history was split into chunks when generating from scratch.
Also, this seems to be something new, perhaps with 1.4.x. It certainly did not happen with 1.3.x.
I should mention this happens with Java 1.8.0_261-b12 (64-bit, obviously). After 3 days or so
I tried forcing GC run via
|
Still, the idea of splitting the history cache generation into chunks stands whether or not we switch to a newer Java.
With OpenJDK ( |
Indexing using JDK11 without the |
Same thing with JDK11 and 1.4.1. There are 7 big repositories that are history indexed - most of them Linux kernel Git repositories. Tracing these
Because the indexer is running with renamed file handling on (actually off for 5 of 7 of these big repositories) there is also bunch of |
For reference, the indexer is run like this:
|
This is not happening with 1.3.16, so something in 1.4.1 might have triggered this sort of behavior.
Re-confirmed: in order to isolate this further, I limited the indexed projects to 5 Git Linux kernel repositories (a mix of mainline and UEK). With 1.3.16 the indexer (with the options above) completes fine within a couple of hours. With 1.4.1 the indexer stalls in history cache generation with the symptoms described earlier. It seems to be progressing, however very slowly - I can see occasional file system lookup (
Again, the read-only configuration has renamed handling turned off for all these repositories.
Did a bit of a bisection:
Note: 1.4.0 was never officially released because the release build failed. 6a2061f is the child of e88d39d. Tried getting the
The difference w.r.t. the number of commits is not so big, however the
The repositories are similar w.r.t. history, so if just the textual representation of it is around 4 GiB, then the complete size would be near 20 GiB, and the size of the objects representing this data could easily hit the 32 GiB heap limit. If these projects were indexed sequentially that might work; in parallel, however, this causes significant memory pressure.
Tried indexing with 1.4.7 (at a00dc57) with 6a2061f and 0241b5b backed out (had to do a minor conflict resolution in copyright comment, otherwise it was clean) and it was able to create the history cache fine in around the same time as 1.3.6:
Observing the write syscalls done by the indexer, the problem is there, just not so visible: there are certainly periods of time when the writes slow down significantly for a bunch of seconds (interleaved with SIGSEGVs as the heap limit is increased and the GC tries to do its job) and then are done rapidly for longer periods of time. The -m just exacerbated the issue. There are multiple solutions:
|
@vladak, let me show you the rest of the patch whence the octopus handling came, where I had refactored the history API to return
While the patch did not succeed in speeding up history, it does keep memory consumption low by using an intermediate, on-disk log-structured key-value store to hold the parsed
(Having had success with the on-disk key-value store, I started a proof-of-concept of a
See #3271
The 1.6.5 release contains a tunable to disable merge commits. This can be used as a workaround.
1.7.0 will have merge commits disabled by default (#3540).
I entertained the following idea to fix this: for VCS types that support retrieval of history for directories (e.g. Git, Mercurial), split the operation of storing the history into multiple steps, where each step handles a limited number of changesets. There are multiple problems with this approach. Firstly, the traversal of commits is done from
Secondly, the history for the top-level directory is cached by default (in the
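To make the chunking idea concrete, here is a minimal sketch of storing history in bounded steps. It is not OpenGrok code: `ChangesetSketch`, `ChunkedStoreSketch` and the `appendToFileCache` callback are hypothetical names, and the sketch assumes the per-file cache can be appended to incrementally. It also sidesteps the problems mentioned above (commit traversal order, the cached top-level directory history).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical stand-in for one changeset; not the OpenGrok HistoryEntry class.
class ChangesetSketch {
    final String revision;
    final List<String> files;

    ChangesetSketch(String revision, List<String> files) {
        this.revision = revision;
        this.files = files;
    }
}

class ChunkedStoreSketch {
    // Consumes changesets lazily and flushes the per-file revisions every
    // chunkSize changesets, so at most one chunk is held in memory at a time.
    // Returns the number of flushes performed.
    static int store(Iterator<ChangesetSketch> changesets, int chunkSize,
                     BiConsumer<String, List<String>> appendToFileCache) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        Map<String, List<String>> pending = new HashMap<>();
        int inChunk = 0;
        int flushes = 0;
        while (changesets.hasNext()) {
            ChangesetSketch c = changesets.next();
            for (String file : c.files) {
                pending.computeIfAbsent(file, k -> new ArrayList<>()).add(c.revision);
            }
            if (++inChunk == chunkSize) {
                pending.forEach(appendToFileCache); // assumed: cache supports appending
                pending.clear();
                inChunk = 0;
                flushes++;
            }
        }
        if (inChunk > 0) { // flush the last, partially filled chunk
            pending.forEach(appendToFileCache);
            flushes++;
        }
        return flushes;
    }

    public static void main(String[] args) {
        List<ChangesetSketch> log = Arrays.asList(
                new ChangesetSketch("r1", Arrays.asList("a.c")),
                new ChangesetSketch("r2", Arrays.asList("b.c")),
                new ChangesetSketch("r3", Arrays.asList("a.c", "b.c")));
        Map<String, List<String>> cache = new HashMap<>();
        int flushes = store(log.iterator(), 2,
                (file, revs) -> cache.computeIfAbsent(file, k -> new ArrayList<>()).addAll(revs));
        System.out.println(flushes + " flushes, a.c -> " + cache.get("a.c"));
    }
}
```

The point of the sketch is only that peak memory is tied to the chunk size rather than to the length of the whole repository history; whether the real cache format and traversal order allow this is exactly what the comment above questions.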
Observing the RSS/CPU of an indexer process that indexes from scratch multiple repositories with heavy history (Linux kernel, FreeBSD, ...), running with 16 threads and a 48 GiB heap (the machine has 32 CPUs and 256 GB RAM), there is clearly something bad going on. The indexing is in the phase of generating history caches for all projects. The indexer process has ~50 GB RSS and is busy on the CPU (say 60%); the usage grows, stays a bit at the maximum (70%) and then quickly falls (to the low 60s). This cycle repeats every couple of seconds (assuming the GC is busy collecting and then is either done or gets stopped because it spent too much time on the CPU) while there are Mercurial/Git `log` processes running, getting the history of the whole repository. This happens with 1.4.15.

I have not done any heap analysis yet, however by looking at `FileHistoryCache#store`, this is just asking for trouble. First, the whole repository history is stored in memory (in the form of `History`/`HistoryEntry` objects), which could be quite sizeable on its own (a sample Linux kernel repo has 500k+ changesets and 50k+ files on disk), and is then converted to the inverted map.

When tagged history is enabled (as is the case for the indexer run I am observing), it gets even worse. In such a case there will be a distinct `HistoryEntry` object for each changeset that touched a given file. Overall, this will lead to explosive growth of `HistoryEntry` objects.
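The growth described above can be illustrated with a small sketch. It is not the actual `FileHistoryCache` code: `EntrySketch` and `InvertedHistorySketch` are hypothetical stand-ins, and the sketch only assumes that the inverted map goes from file path to the list of entries that touched it. With shared entries the object count equals the number of changesets; with a per-file copy it equals the number of (changeset, file) pairs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a changeset; not the OpenGrok HistoryEntry class.
class EntrySketch {
    final String revision;
    final List<String> files;

    EntrySketch(String revision, List<String> files) {
        this.revision = revision;
        this.files = files;
    }

    // Per-file copy that carries the revision but an empty file list.
    EntrySketch copyWithoutFiles() {
        return new EntrySketch(revision, new ArrayList<>());
    }
}

class InvertedHistorySketch {
    // Inverts "changeset -> files" into "file -> changesets".
    // deepCopy == false: all file lists share one object per changeset.
    // deepCopy == true: every (changeset, file) pair gets its own object.
    static Map<String, List<EntrySketch>> invert(List<EntrySketch> history, boolean deepCopy) {
        Map<String, List<EntrySketch>> byFile = new HashMap<>();
        for (EntrySketch e : history) {
            for (String file : e.files) {
                EntrySketch stored = deepCopy ? e.copyWithoutFiles() : e;
                byFile.computeIfAbsent(file, k -> new ArrayList<>()).add(stored);
            }
        }
        return byFile;
    }

    // Counts distinct entry objects in the map by identity, not by equals().
    static int distinctEntries(Map<String, List<EntrySketch>> byFile) {
        Map<EntrySketch, Boolean> seen = new IdentityHashMap<>();
        for (List<EntrySketch> entries : byFile.values()) {
            for (EntrySketch e : entries) {
                seen.put(e, Boolean.TRUE);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<EntrySketch> history = Arrays.asList(
                new EntrySketch("r1", Arrays.asList("a.c", "b.c")),
                new EntrySketch("r2", Arrays.asList("a.c")));
        System.out.println("shared: " + distinctEntries(invert(history, false))); // shared: 2
        System.out.println("copied: " + distinctEntries(invert(history, true)));  // copied: 3
    }
}
```

With 500k+ changesets that each touch several files, the copied variant multiplies the number of live objects by the average number of files per changeset, which is why delaying or avoiding the copy matters.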