Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FileHistoryCache#store is one big memory hog #3243

Closed
vladak opened this issue Sep 24, 2020 · 20 comments
Closed

FileHistoryCache#store is one big memory hog #3243

vladak opened this issue Sep 24, 2020 · 20 comments

Comments

@vladak
Copy link
Member

vladak commented Sep 24, 2020

Observing the RSS/CPU of indexer process that does indexing from scratch of multiple repositories with heavy history (Linux kernel, FreeBSD, ...) running with 16 threads and 48 GiB heap (the machine has 32 CPUs and 256 GB RAM), there is clearly something bad going on. The indexing is in the phase of generating history caches for all projects. The indexer process has ~50 GB RSS, is busy on the CPU (say 60%), the usage grows, stays a bit at the maximum (70%) and then quickly falls down (to low 60's %). This cycle repeats every couple of seconds (assuming the GC is busy collecting and then is either done or gets stopped because it spent too much time on the CPU) while there are Mercurial/Git log processes running, getting the history of the whole repository. This happens with 1.4.15.

I have not done any heap analysis yet, however by looking at FileHistoryCache#store, this is just asking for trouble. First, the whole repository history is stored in memory (in the form of History/HistoryEntry objects) which could be quite sizeable of its own (sample Linux kernel repo has 500k+ changsets and 50k+ files on disk) and is then converted to the inverted map:

HashMap<String, List<HistoryEntry>> map = new HashMap<>();

When tagged history is enabled (as is the case for the indexer run I am observing), it gets even worse:

466                  /*
467                   * We need to do deep copy in order to have different tags
468                   * per each commit.
469                   */
470                  if (env.isTagsEnabled() && repository.hasFileBasedTags()) {
471                      list.add(new HistoryEntry(e));
472                  } else {
473                      list.add(e);
474                  }

In such case there will be distinct HistoryEntry object for each changeset that touched given file. In overall, this will lead to explosive growth of HistoryEntry objects.

@vladak
Copy link
Member Author

vladak commented Sep 24, 2020

Low hanging fruit: one way to avoid the explosion is to move the deep copy of the history entries right before the tags are assigned in doFileHistory() and then maybe set them to null once the storeFile() is done.

@vladak
Copy link
Member Author

vladak commented Sep 24, 2020

Actually, the deep copy looks unnecessary since the tags are actually reset for all the cases in doFileHistory() before they are assigned to the history entries:

if (hist == null) {
hist = new History();
// File based history cache does not store files for individual
// changesets so strip them unless it is history for the repository.
for (HistoryEntry ent : historyEntries) {
if (file.isDirectory() && filename.equals(repository.getDirectoryName())) {
ent.stripTags();
} else {
ent.strip();
}
}
// add all history entries
hist.setHistoryEntries(historyEntries);
} else {
for (HistoryEntry ent : hist.getHistoryEntries()) {
ent.strip();
}
}

@vladak
Copy link
Member Author

vladak commented Sep 25, 2020

Ultimately, it would help if the history was split into chunks when generating from scratch.

@vladak
Copy link
Member Author

vladak commented Sep 29, 2020

Also, this seems to be something new, perhaps with 1.4.x. Certainly did not happen with 1.3.x.

@vladak
Copy link
Member Author

vladak commented Sep 30, 2020

I should mention this happens with Java 1.8.0_261-b12 (64-bit, obviously). After 3 days or so jmap finally finished dumping the heap (stopping the process and getting the dump via gcore/gdb and getting the heap dump this way did not work as the heap could not be extracted from there for some reason) I loaded the resulting 34 GB dump into YourKit analyzer (took a bit of trial and error and involved adding 32 GB of swap space to machine that has 32 GB of RAM. MAT analysis tool loaded the dump very quickly compared to YourKit however did not have such a nice display of memory) and the results are pretty jaw dropping:

Screenshot from 2020-09-30 18-52-36
Screenshot from 2020-09-30 18-52-41
99% of the heap is memory waiting to be garbage collected. This really resembles #3089 which got solved by switching to JDK 11 so I wonder if PR #3098 (interesting coincidence w.r.t. the numbers of the issues) should be accelerated.

I tried forcing GC run via jcmd however that did not work (perhaps the process needs to be started with some extra options):

# /usr/jdk/latest/bin/jcmd 3164 GC.run
3164:
com.sun.tools.attach.AttachNotSupportedException: Unable to open door: target process not responding or HotSpot VM not loaded
        at sun.tools.attach.SolarisVirtualMachine.<init>(SolarisVirtualMachine.java:92)
        at sun.tools.attach.SolarisAttachProvider.attachVirtualMachine(SolarisAttachProvider.java:59)
        at com.sun.tools.attach.VirtualMachine.attach(VirtualMachine.java:208)
        at sun.tools.jcmd.JCmd.executeCommandForPid(JCmd.java:147)
        at sun.tools.jcmd.JCmd.main(JCmd.java:131)

@vladak
Copy link
Member Author

vladak commented Sep 30, 2020

Still, the idea about splitting the history cache generation into chunks stands no matter if we switch to newer Java.

@vladak
Copy link
Member Author

vladak commented Oct 1, 2020

With OpenJDK (11.0.7-internal built from source) the same thing happened. I realized that the indexer is running with -XX:-UseGCOverheadLimit which probably contributes to the problem.

@vladak
Copy link
Member Author

vladak commented Oct 1, 2020

Indexing using JDK11 without the -XX:-UseGCOverheadLimit option leads to the same problem.

@vladak
Copy link
Member Author

vladak commented Oct 2, 2020

Same thing with JDK11 and 1.4.1. There are 7 big repositories that are history indexed - most of them Linux kernel GIt repositories. Tracing these git log / hg log processes they are progressing however very slowly:

# truss -D -p 22015
write(1, 0xF1B4A8884, 5120)     (sleeping...)
 8.503952    write(1, " / s i g p v c / s i g p".., 5120)       = 5120
 0.000911    write(1, " c h o . c\n s y s / n e".., 5120)       = 5120
 0.122504    write(1, " n e t i n e t 6 / u d p".., 5120)       = 5120
write(1, 0xF1B4A8884, 5120)     (sleeping...)
10.227984    write(1, " 9 8 / s y s c o n s _ p".., 5120)       = 5120
 0.001409    write(1, " e x u s . c\n s y s / p".., 5120)       = 5120
 0.051428    write(1, " s m . c\n s y s / s p a".., 5120)       = 5120
 0.014977    write(1, " e a d . c\n s y s / u f".., 5120)       = 5120
write(1, 0xF1B4A8884, 5120)     (sleeping...)
10.822057    write(1, " / R e f / d i s k . c r".., 5120)       = 5120
write(1, 0xF1B4A8884, 5120)     (sleeping...)
10.224109    write(1, " b i n / u u e n c o d e".., 5120)       = 5120
 0.021716    write(1, " n d a r s / c a l e n d".., 5120)       = 5120
 0.014168    write(1, " . c\n u s r . b i n / d".., 5120)       = 5120
 0.036382    write(1, " t . c\n u s r . b i n /".., 5120)       = 5120
 0.002649    write(1, " e / j o b . c\n u s r .".., 5120)       = 5120

Because the indexer is running with renamed file handling on (actually off for 5 of 7 of these big repositories) there is also bunch of hg/git log processes running next to those 7. These seem to be progressing fine. At given point of time there are between 2 and 20+ of these (the system has 32 CPUs and there is no tuning done for the number of threads used for renamed files processing).

@vladak
Copy link
Member Author

vladak commented Oct 2, 2020

For reference, the indexer is run like this:

/opengrok-upgrade/dist/bin/venv/bin/opengrok-indexer -C --java=/jdk/openjdk11/bin/java -J=-Xmx32g -J=-server -J=-Djava.util.logging.config.file=/var/tmp/upgrade-logging.conf --jar /opengrok-upgrade/dist/lib/opengrok.jar -e LC_ALL=en_US.UTF-8 -- --renamedHistory on -r dirbased -G -m 256 --leadingWildCards on -c /usr/local/bin/ctags -o /opengrok/etc/ctags.config -H -P -S -T 16 -R /opengrok-upgrade/etc/readonly_configuration.xml -s /ws-local-upgrade -d /opengrok-upgrade/data -W /opengrok-upgrade/etc/configuration.xml --progress
              13235 /opengrok-upgrade/dist/bin/venv/bin/python3.7 /opengrok-upgrade/dist/bin/venv/bin/opengrok-indexer -C --java=/jdk/openjdk11/bin/java -J=-Xmx32g -J=-server -J=-Djava.util.logging.config.file=/var/tmp/upgrade-logging.conf --jar /opengrok-upgrade/dist/lib/opengrok.jar -e LC_ALL=en_US.UTF-8 -- --renamedHistory on -r dirbased -G -m 256 --leadingWildCards on -c /usr/local/bin/ctags -o /opengrok/etc/ctags.config -H -P -S -T 16 -R /opengrok-upgrade/etc/readonly_configuration.xml -s /ws-local-upgrade -d /opengrok-upgrade/data -W /opengrok-upgrade/etc/configuration.xml --progress

@vladak
Copy link
Member Author

vladak commented Oct 5, 2020

This is not happening with 1.3.16 so there might be something in 1.4.1 that triggered this sort of behavior.

@vladak
Copy link
Member Author

vladak commented Oct 8, 2020

This is not happening with 1.3.16 so there might be something in 1.4.1 that triggered this sort of behavior.

Re-confirmed: in order to isolate this further I limited the indexed projects to 5 Git Linux kernel repositories (mix of mainline, UEK). With 1.3.16 the indexer (with the options above) completes fine within couple of hours. With 1.4.1 the indexer stalls in history cache generation with symptoms described earlier. It seems to be progressing however very very slowly - I can see occasional file system lookup (realpath syscall), the rest is just yields, LWP waiting, mmap calls etc. No writes at all. 2 of the git log processes have completed, there are 3 left:

                 6594 /jdk/openjdk11/bin/java -Djava.util.logging.config.file=/var/tmp/upgrade-logging.conf -Dorg.opengrok.indexer.history.Mercurial=/usr/bin/hg -Dorg.opengrok.indexer.history.SCCS=/usr/bin/sccs -Dorg.opengrok.indexer.history.git=/usr/bin/git -Dorg.opengrok.indexer.history.cvs=/usr/bin/cvs -Dorg.opengrok.indexer.history.Subversion=/usr/bin/svn -Xmx32g -server -jar /opengrok-upgrade/dist/lib/opengrok.jar --renamedHistory on -r dirbased -G -m 256 --leadingWildCards on -c /usr/local/bin/ctags -o /opengrok/etc/ctags.config -H -P -S -T 16 -R /opengrok-upgrade/etc/readonly_configuration.xml -s /ws-local-linux -d /opengrok-upgrade/data -W /opengrok-upgrade/etc/configuration.xml --progress
                   7275 /usr/bin/git log --abbrev-commit --abbrev=8 --name-only --pretty=fuller --date=iso8601-strict -m -- /ws-local-linux/linux-mainline
                   7277 /usr/bin/git log --abbrev-commit --abbrev=8 --name-only --pretty=fuller --date=iso8601-strict -m -- /ws-local-linux/linux-mainline-next
                   7278 /usr/bin/git log --abbrev-commit --abbrev=8 --name-only --pretty=fuller --date=iso8601-strict -m -- /ws-local-linux/linux-SPARC-upstream

Again, the read-only configuration has renamed handling turned off for all these repositories.

@vladak
Copy link
Member Author

vladak commented Oct 8, 2020

Did a bit of a bisection:

Release/changeset status
1.3.6 ok
1.4.0 @ e88d39d ok
1.4.0 @ 6a2061f fail
1.4.1 fail

Note: 1.4.0 was never officially released because the release build failed

6a2061f is the child of e88d39d.

Tried getting the git log with and without the -m option on one of the repositories:

$ du -sh /var/tmp/linux-mainline-next.gitlog*
 638M   /var/tmp/linux-mainline-next.gitlog
 4.4G   /var/tmp/linux-mainline-next.gitlog-m
$ grep ^commit /var/tmp/linux-mainline-next.gitlog |wc -l
795541
$ grep ^commit /var/tmp/linux-mainline-next.gitlog-m |wc -l
854726

The difference w.r.t. number of commits is not so big however the git log with -m contains huge lists of files.

The repositories are similar w.r.t. history, so if just the textual representation of it is around 4 GiB then complete size would be near 20 GiB and size of the objects representing this data could easily hit the 32 GiB heap limit. If these projects were indexed sequentially that might work however in parallel this is causing significant memory pressure.

@vladak
Copy link
Member Author

vladak commented Oct 8, 2020

Tried indexing with 1.4.7 (at a00dc57) with 6a2061f and 0241b5b backed out (had to do a minor conflict resolution in copyright comment, otherwise it was clean) and it was able to create the history cache fine in around the same time as 1.3.6:

00:09:52 INFO: Done scanning for repositories, found 5 repositories (took 47.348 seconds)
00:09:52 INFO: Generating history cache for all repositories ...
00:09:52 INFO: Creating historycache for 5 repositories
00:09:52 INFO: Creating historycache for /ws-local-linux/linux-mainline (GitRepository) without renamed file handling
00:09:52 INFO: Creating historycache for /ws-local-linux/linux-SPARC-upstream (GitRepository) without renamed file handling
00:09:52 INFO: Creating historycache for /ws-local-linux/linux-uek-4.1 (GitRepository) without renamed file handling
00:09:52 INFO: Creating historycache for /ws-local-linux/linux-mainline-next (GitRepository) without renamed file handling
00:09:52 INFO: Creating historycache for /ws-local-linux/linux-uek-SPARC (GitRepository) without renamed file handling
00:39:47 INFO: Done historycache for /ws-local-linux/linux-uek-SPARC (took 0:29:54)
00:42:33 INFO: Done historycache for /ws-local-linux/linux-uek-4.1 (took 0:32:40)
00:55:24 INFO: Done historycache for /ws-local-linux/linux-mainline-next (took 0:45:31)
00:58:06 INFO: Done historycache for /ws-local-linux/linux-SPARC-upstream (took 0:48:14)
00:58:31 INFO: Done historycache for /ws-local-linux/linux-mainline (took 0:48:39)
00:58:31 INFO: Done historycache for all repositories (took 0:48:39)

Observing the write syscalls done by the indexer the problem is there, just not so visible: there are certainly periods of time when the writes slow down significantly for bunch of seconds (interleaved by SIGSEGV's as the heap limit is increased and GC tries to do its job) and then are done rapidly for longer periods of time. The -m just exacerbated the issue.

There are multiple solutions:

  • make indexing git octopus merges tunable (disable by default ?)
  • fix this issue by performing history cache creation in chunks
    • I considered a weaker alternative: unset the history entries from the map right after they are written to disk. This might help a little bit however only in certain scenarios.
  • compress the representation in memory and uncompress just after storing to disk
    • not a proper solution as the indexer will hit the limit, just later

@vladak
Copy link
Member Author

vladak commented Oct 10, 2020

For the record this is how JVM used memory graph looks like:
with -m and without fix
I terminated the indexer before 16:15. The git log processes were still running. For this data set when git log is run without m the history cache is generated in some 45 minutes.

@idodeclare
Copy link
Contributor

idodeclare commented Oct 11, 2020

@vladak , let me show you the rest of the patch whence the octopus handling came where I had refactored the history API to return HistoryEnumeration instead of all History in order to keep the memory consumption low. That patch was intended to speed up history processing but did not succeed because the ultimate history store in OpenGrok is a per-file, XML-serialization with a severe redundancy of commit message text and other commit data. A successful optimization would require IMHO abandoning the per-file history storage model (while still preserving logical per-file history).

While the patch did not succeed to speed up history, it does keep memory consumption low by using an intermediate, on-disk log-structured key-value store to hold the parsed HistoryEntry instances while a Repository stream is being read.

(Having had success with the on-disk key-value store, I started a proof-of-concept of a HistoryCache implementation using it, but that effort stalled.)

@idodeclare
Copy link
Contributor

See #3271

@vladak
Copy link
Member Author

vladak commented Mar 25, 2021

The 1.6.5 release contains a tunable to disable merge commits. This can be used as a workaround.

@vladak
Copy link
Member Author

vladak commented Apr 14, 2021

1.7.0 will have merge commits disabled by default (#3540).

@vladak
Copy link
Member Author

vladak commented Apr 14, 2021

I entertained the following idea to fix this: for VCS types that support retrieval of history for directories (e.g. Git, Mercurial), split the operation of storing the history into multiple steps, where each step will have limited number of changesets. There are multiple problems with this approach.

Firstly, the traversal of commits is done from HEAD towards the history, optionally stopping at given changeset (specified as sinceRevision in getHistory()). This works well for the incremental reindexing where the last indexed changeset is stored (in the OpenGroklatestRev file underneath the historycache directory) and used to restrict the list of changesets to retrieve. Of course, even incremental reindex can suffer from this problem if the amount of incoming changesets is big. Given this traversal limitation (which makes sense as the changesets usually store a "pointer" to parent changeset(s), at least in Git), splitting the operation of storing the history would probably mean traversing from HEAD (till the last indexed changeset or to the first changeset), identifying the boundary changesets (depending on how many to index in one go) and applying that plan for the per partes history cache generation. This assumes that the memory footprint of the traversal is on the low side (which is probably safe to assume as we just need to identify the changeset IDs) and that the history of individual files is much smaller than the history of the repository. This is because for each file the pre-existing history needs to be merged with the newly added history (involves XML deserialization into History object, add newly added history, serialization back to compressed XML). The goal of evading the complete history of the repository being present in the memory during indexing should be achieved with this approach.

Secondly, the history for the top level directory is cached by default (in the OpenGrokDirHist.gz file underneath the historycache directory). This basically undermines this approach given the merging described above so it would have to be disabled/removed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants