FileHistoryCache#store is one big memory hog #3243

vladak · 2020-09-24T14:46:26Z

Observing the RSS/CPU of indexer process that does indexing from scratch of multiple repositories with heavy history (Linux kernel, FreeBSD, ...) running with 16 threads and 48 GiB heap (the machine has 32 CPUs and 256 GB RAM), there is clearly something bad going on. The indexing is in the phase of generating history caches for all projects. The indexer process has ~50 GB RSS, is busy on the CPU (say 60%), the usage grows, stays a bit at the maximum (70%) and then quickly falls down (to low 60's %). This cycle repeats every couple of seconds (assuming the GC is busy collecting and then is either done or gets stopped because it spent too much time on the CPU) while there are Mercurial/Git log processes running, getting the history of the whole repository. This happens with 1.4.15.

I have not done any heap analysis yet, however by looking at FileHistoryCache#store, this is just asking for trouble. First, the whole repository history is stored in memory (in the form of History/HistoryEntry objects) which could be quite sizeable of its own (sample Linux kernel repo has 500k+ changsets and 50k+ files on disk) and is then converted to the inverted map:

HashMap<String, List<HistoryEntry>> map = new HashMap<>();

When tagged history is enabled (as is the case for the indexer run I am observing), it gets even worse:

466                  /*
467                   * We need to do deep copy in order to have different tags
468                   * per each commit.
469                   */
470                  if (env.isTagsEnabled() && repository.hasFileBasedTags()) {
471                      list.add(new HistoryEntry(e));
472                  } else {
473                      list.add(e);
474                  }

In such case there will be distinct HistoryEntry object for each changeset that touched given file. In overall, this will lead to explosive growth of HistoryEntry objects.

The text was updated successfully, but these errors were encountered:

vladak · 2020-09-24T15:25:12Z

Low hanging fruit: one way to avoid the explosion is to move the deep copy of the history entries right before the tags are assigned in doFileHistory() and then maybe set them to null once the storeFile() is done.

vladak · 2020-09-24T22:51:05Z

Actually, the deep copy looks unnecessary since the tags are actually reset for all the cases in doFileHistory() before they are assigned to the history entries:

opengrok/opengrok-indexer/src/main/java/org/opengrok/indexer/history/FileHistoryCache.java

Lines 121 to 140 in 115e69d

    
           if (hist == null) { 
        
               hist = new History(); 
        
               // File based history cache does not store files for individual 
        
               // changesets so strip them unless it is history for the repository. 
        
               for (HistoryEntry ent : historyEntries) { 
        
                   if (file.isDirectory() && filename.equals(repository.getDirectoryName())) { 
        
                       ent.stripTags(); 
        
                   } else { 
        
                       ent.strip(); 
        
                   } 
        
               } 
        
               // add all history entries 
        
               hist.setHistoryEntries(historyEntries); 
        
           } else { 
        
               for (HistoryEntry ent : hist.getHistoryEntries()) { 
        
                   ent.strip(); 
        
               } 
        
           }

vladak · 2020-09-25T07:07:03Z

Ultimately, it would help if the history was split into chunks when generating from scratch.

vladak · 2020-09-29T12:00:35Z

Also, this seems to be something new, perhaps with 1.4.x. Certainly did not happen with 1.3.x.

vladak · 2020-09-30T18:44:50Z

I should mention this happens with Java 1.8.0_261-b12 (64-bit, obviously). After 3 days or so jmap finally finished dumping the heap (stopping the process and getting the dump via gcore/gdb and getting the heap dump this way did not work as the heap could not be extracted from there for some reason) I loaded the resulting 34 GB dump into YourKit analyzer (took a bit of trial and error and involved adding 32 GB of swap space to machine that has 32 GB of RAM. MAT analysis tool loaded the dump very quickly compared to YourKit however did not have such a nice display of memory) and the results are pretty jaw dropping:

99% of the heap is memory waiting to be garbage collected. This really resembles #3089 which got solved by switching to JDK 11 so I wonder if PR #3098 (interesting coincidence w.r.t. the numbers of the issues) should be accelerated.

I tried forcing GC run via jcmd however that did not work (perhaps the process needs to be started with some extra options):

# /usr/jdk/latest/bin/jcmd 3164 GC.run
3164:
com.sun.tools.attach.AttachNotSupportedException: Unable to open door: target process not responding or HotSpot VM not loaded
        at sun.tools.attach.SolarisVirtualMachine.<init>(SolarisVirtualMachine.java:92)
        at sun.tools.attach.SolarisAttachProvider.attachVirtualMachine(SolarisAttachProvider.java:59)
        at com.sun.tools.attach.VirtualMachine.attach(VirtualMachine.java:208)
        at sun.tools.jcmd.JCmd.executeCommandForPid(JCmd.java:147)
        at sun.tools.jcmd.JCmd.main(JCmd.java:131)

vladak · 2020-09-30T18:45:20Z

Still, the idea about splitting the history cache generation into chunks stands no matter if we switch to newer Java.

vladak · 2020-10-01T05:24:29Z

With OpenJDK (11.0.7-internal built from source) the same thing happened. I realized that the indexer is running with -XX:-UseGCOverheadLimit which probably contributes to the problem.

vladak · 2020-10-01T19:08:19Z

Indexing using JDK11 without the -XX:-UseGCOverheadLimit option leads to the same problem.

vladak · 2020-10-02T10:16:48Z

Same thing with JDK11 and 1.4.1. There are 7 big repositories that are history indexed - most of them Linux kernel GIt repositories. Tracing these git log / hg log processes they are progressing however very slowly:

# truss -D -p 22015
write(1, 0xF1B4A8884, 5120)     (sleeping...)
 8.503952    write(1, " / s i g p v c / s i g p".., 5120)       = 5120
 0.000911    write(1, " c h o . c\n s y s / n e".., 5120)       = 5120
 0.122504    write(1, " n e t i n e t 6 / u d p".., 5120)       = 5120
write(1, 0xF1B4A8884, 5120)     (sleeping...)
10.227984    write(1, " 9 8 / s y s c o n s _ p".., 5120)       = 5120
 0.001409    write(1, " e x u s . c\n s y s / p".., 5120)       = 5120
 0.051428    write(1, " s m . c\n s y s / s p a".., 5120)       = 5120
 0.014977    write(1, " e a d . c\n s y s / u f".., 5120)       = 5120
write(1, 0xF1B4A8884, 5120)     (sleeping...)
10.822057    write(1, " / R e f / d i s k . c r".., 5120)       = 5120
write(1, 0xF1B4A8884, 5120)     (sleeping...)
10.224109    write(1, " b i n / u u e n c o d e".., 5120)       = 5120
 0.021716    write(1, " n d a r s / c a l e n d".., 5120)       = 5120
 0.014168    write(1, " . c\n u s r . b i n / d".., 5120)       = 5120
 0.036382    write(1, " t . c\n u s r . b i n /".., 5120)       = 5120
 0.002649    write(1, " e / j o b . c\n u s r .".., 5120)       = 5120

Because the indexer is running with renamed file handling on (actually off for 5 of 7 of these big repositories) there is also bunch of hg/git log processes running next to those 7. These seem to be progressing fine. At given point of time there are between 2 and 20+ of these (the system has 32 CPUs and there is no tuning done for the number of threads used for renamed files processing).

vladak · 2020-10-02T10:19:34Z

For reference, the indexer is run like this:

/opengrok-upgrade/dist/bin/venv/bin/opengrok-indexer -C --java=/jdk/openjdk11/bin/java -J=-Xmx32g -J=-server -J=-Djava.util.logging.config.file=/var/tmp/upgrade-logging.conf --jar /opengrok-upgrade/dist/lib/opengrok.jar -e LC_ALL=en_US.UTF-8 -- --renamedHistory on -r dirbased -G -m 256 --leadingWildCards on -c /usr/local/bin/ctags -o /opengrok/etc/ctags.config -H -P -S -T 16 -R /opengrok-upgrade/etc/readonly_configuration.xml -s /ws-local-upgrade -d /opengrok-upgrade/data -W /opengrok-upgrade/etc/configuration.xml --progress
              13235 /opengrok-upgrade/dist/bin/venv/bin/python3.7 /opengrok-upgrade/dist/bin/venv/bin/opengrok-indexer -C --java=/jdk/openjdk11/bin/java -J=-Xmx32g -J=-server -J=-Djava.util.logging.config.file=/var/tmp/upgrade-logging.conf --jar /opengrok-upgrade/dist/lib/opengrok.jar -e LC_ALL=en_US.UTF-8 -- --renamedHistory on -r dirbased -G -m 256 --leadingWildCards on -c /usr/local/bin/ctags -o /opengrok/etc/ctags.config -H -P -S -T 16 -R /opengrok-upgrade/etc/readonly_configuration.xml -s /ws-local-upgrade -d /opengrok-upgrade/data -W /opengrok-upgrade/etc/configuration.xml --progress

vladak · 2020-10-05T08:57:05Z

This is not happening with 1.3.16 so there might be something in 1.4.1 that triggered this sort of behavior.

vladak · 2020-10-08T18:54:19Z

This is not happening with 1.3.16 so there might be something in 1.4.1 that triggered this sort of behavior.

Re-confirmed: in order to isolate this further I limited the indexed projects to 5 Git Linux kernel repositories (mix of mainline, UEK). With 1.3.16 the indexer (with the options above) completes fine within couple of hours. With 1.4.1 the indexer stalls in history cache generation with symptoms described earlier. It seems to be progressing however very very slowly - I can see occasional file system lookup (realpath syscall), the rest is just yields, LWP waiting, mmap calls etc. No writes at all. 2 of the git log processes have completed, there are 3 left:

                 6594 /jdk/openjdk11/bin/java -Djava.util.logging.config.file=/var/tmp/upgrade-logging.conf -Dorg.opengrok.indexer.history.Mercurial=/usr/bin/hg -Dorg.opengrok.indexer.history.SCCS=/usr/bin/sccs -Dorg.opengrok.indexer.history.git=/usr/bin/git -Dorg.opengrok.indexer.history.cvs=/usr/bin/cvs -Dorg.opengrok.indexer.history.Subversion=/usr/bin/svn -Xmx32g -server -jar /opengrok-upgrade/dist/lib/opengrok.jar --renamedHistory on -r dirbased -G -m 256 --leadingWildCards on -c /usr/local/bin/ctags -o /opengrok/etc/ctags.config -H -P -S -T 16 -R /opengrok-upgrade/etc/readonly_configuration.xml -s /ws-local-linux -d /opengrok-upgrade/data -W /opengrok-upgrade/etc/configuration.xml --progress
                   7275 /usr/bin/git log --abbrev-commit --abbrev=8 --name-only --pretty=fuller --date=iso8601-strict -m -- /ws-local-linux/linux-mainline
                   7277 /usr/bin/git log --abbrev-commit --abbrev=8 --name-only --pretty=fuller --date=iso8601-strict -m -- /ws-local-linux/linux-mainline-next
                   7278 /usr/bin/git log --abbrev-commit --abbrev=8 --name-only --pretty=fuller --date=iso8601-strict -m -- /ws-local-linux/linux-SPARC-upstream

Again, the read-only configuration has renamed handling turned off for all these repositories.

vladak · 2020-10-08T22:11:52Z

Did a bit of a bisection:

Release/changeset	status
1.3.6	ok
1.4.0 @ `e88d39d`	ok
1.4.0 @ `6a2061f`	fail
1.4.1	fail

Note: 1.4.0 was never officially released because the release build failed

6a2061f is the child of e88d39d.

Tried getting the git log with and without the -m option on one of the repositories:

$ du -sh /var/tmp/linux-mainline-next.gitlog*
 638M   /var/tmp/linux-mainline-next.gitlog
 4.4G   /var/tmp/linux-mainline-next.gitlog-m
$ grep ^commit /var/tmp/linux-mainline-next.gitlog |wc -l
795541
$ grep ^commit /var/tmp/linux-mainline-next.gitlog-m |wc -l
854726

The difference w.r.t. number of commits is not so big however the git log with -m contains huge lists of files.

The repositories are similar w.r.t. history, so if just the textual representation of it is around 4 GiB then complete size would be near 20 GiB and size of the objects representing this data could easily hit the 32 GiB heap limit. If these projects were indexed sequentially that might work however in parallel this is causing significant memory pressure.

vladak · 2020-10-08T23:01:41Z

Tried indexing with 1.4.7 (at a00dc57) with 6a2061f and 0241b5b backed out (had to do a minor conflict resolution in copyright comment, otherwise it was clean) and it was able to create the history cache fine in around the same time as 1.3.6:

00:09:52 INFO: Done scanning for repositories, found 5 repositories (took 47.348 seconds)
00:09:52 INFO: Generating history cache for all repositories ...
00:09:52 INFO: Creating historycache for 5 repositories
00:09:52 INFO: Creating historycache for /ws-local-linux/linux-mainline (GitRepository) without renamed file handling
00:09:52 INFO: Creating historycache for /ws-local-linux/linux-SPARC-upstream (GitRepository) without renamed file handling
00:09:52 INFO: Creating historycache for /ws-local-linux/linux-uek-4.1 (GitRepository) without renamed file handling
00:09:52 INFO: Creating historycache for /ws-local-linux/linux-mainline-next (GitRepository) without renamed file handling
00:09:52 INFO: Creating historycache for /ws-local-linux/linux-uek-SPARC (GitRepository) without renamed file handling
00:39:47 INFO: Done historycache for /ws-local-linux/linux-uek-SPARC (took 0:29:54)
00:42:33 INFO: Done historycache for /ws-local-linux/linux-uek-4.1 (took 0:32:40)
00:55:24 INFO: Done historycache for /ws-local-linux/linux-mainline-next (took 0:45:31)
00:58:06 INFO: Done historycache for /ws-local-linux/linux-SPARC-upstream (took 0:48:14)
00:58:31 INFO: Done historycache for /ws-local-linux/linux-mainline (took 0:48:39)
00:58:31 INFO: Done historycache for all repositories (took 0:48:39)

Observing the write syscalls done by the indexer the problem is there, just not so visible: there are certainly periods of time when the writes slow down significantly for bunch of seconds (interleaved by SIGSEGV's as the heap limit is increased and GC tries to do its job) and then are done rapidly for longer periods of time. The -m just exacerbated the issue.

There are multiple solutions:

make indexing git octopus merges tunable (disable by default ?)
fix this issue by performing history cache creation in chunks
- I considered a weaker alternative: unset the history entries from the map right after they are written to disk. This might help a little bit however only in certain scenarios.
compress the representation in memory and uncompress just after storing to disk
- not a proper solution as the indexer will hit the limit, just later

vladak · 2020-10-10T20:05:38Z

For the record this is how JVM used memory graph looks like:

I terminated the indexer before 16:15. The git log processes were still running. For this data set when git log is run without m the history cache is generated in some 45 minutes.

idodeclare · 2020-10-11T15:23:57Z

@vladak , let me show you the rest of the patch whence the octopus handling came where I had refactored the history API to return HistoryEnumeration instead of all History in order to keep the memory consumption low. That patch was intended to speed up history processing but did not succeed because the ultimate history store in OpenGrok is a per-file, XML-serialization with a severe redundancy of commit message text and other commit data. A successful optimization would require IMHO abandoning the per-file history storage model (while still preserving logical per-file history).

While the patch did not succeed to speed up history, it does keep memory consumption low by using an intermediate, on-disk log-structured key-value store to hold the parsed HistoryEntry instances while a Repository stream is being read.

(Having had success with the on-disk key-value store, I started a proof-of-concept of a HistoryCache implementation using it, but that effort stalled.)

idodeclare · 2020-10-11T15:47:14Z

See #3271

vladak · 2021-03-25T14:21:32Z

The 1.6.5 release contains a tunable to disable merge commits. This can be used as a workaround.

vladak · 2021-04-14T09:21:55Z

1.7.0 will have merge commits disabled by default (#3540).

vladak · 2021-04-14T11:03:15Z

I entertained the following idea to fix this: for VCS types that support retrieval of history for directories (e.g. Git, Mercurial), split the operation of storing the history into multiple steps, where each step will have limited number of changesets. There are multiple problems with this approach.

Firstly, the traversal of commits is done from HEAD towards the history, optionally stopping at given changeset (specified as sinceRevision in getHistory()). This works well for the incremental reindexing where the last indexed changeset is stored (in the OpenGroklatestRev file underneath the historycache directory) and used to restrict the list of changesets to retrieve. Of course, even incremental reindex can suffer from this problem if the amount of incoming changesets is big. Given this traversal limitation (which makes sense as the changesets usually store a "pointer" to parent changeset(s), at least in Git), splitting the operation of storing the history would probably mean traversing from HEAD (till the last indexed changeset or to the first changeset), identifying the boundary changesets (depending on how many to index in one go) and applying that plan for the per partes history cache generation. This assumes that the memory footprint of the traversal is on the low side (which is probably safe to assume as we just need to identify the changeset IDs) and that the history of individual files is much smaller than the history of the repository. This is because for each file the pre-existing history needs to be merged with the newly added history (involves XML deserialization into History object, add newly added history, serialization back to compressed XML). The goal of evading the complete history of the repository being present in the memory during indexing should be achieved with this approach.

Secondly, the history for the top level directory is cached by default (in the OpenGrokDirHist.gz file underneath the historycache directory). This basically undermines this approach given the merging described above so it would have to be disabled/removed.

fixes oracle#3243

vladak added indexer bug labels Sep 24, 2020

vladak mentioned this issue Sep 25, 2020

do not deep copy history entries #3248

Merged

vladak mentioned this issue Sep 29, 2020

java.lang.NoClassDefFoundError indexing huge history branch #3252

Closed

vladak mentioned this issue Nov 12, 2020

Git merge support can lead to long index times #3367

Closed

harrisric mentioned this issue Mar 2, 2021

Subversion history is incomplete when a folder has been renamed #3443

Closed

vladak mentioned this issue Mar 23, 2021

Indexer option to run historycache only #3434

Closed

vladak mentioned this issue Apr 13, 2021

disable merge commits by default #3540

Merged

vladak mentioned this issue Apr 14, 2021

History is hogging web app memory #3541

Open

vladak mentioned this issue Apr 14, 2021

parallelize doFileHistory() for regular files in FileHistoryCache#store() #3542

Closed

vladak pushed a commit to vladak/OpenGrok that referenced this issue Apr 19, 2021

proof of concept: split history cache generation into chunks

61a50bc

fixes oracle#3243

vladak mentioned this issue May 4, 2021

public OpenGrok demo - autodeploy on push with successful build and test #1740

Closed

vladak closed this as completed in 1791bb4 May 19, 2021

vladak mentioned this issue Nov 4, 2021

history based incremental reindex for changeset based SCMs #3077

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FileHistoryCache#store is one big memory hog #3243

FileHistoryCache#store is one big memory hog #3243

vladak commented Sep 24, 2020 •

edited

Loading

vladak commented Sep 24, 2020

vladak commented Sep 24, 2020

vladak commented Sep 25, 2020

vladak commented Sep 29, 2020

vladak commented Sep 30, 2020 •

edited

Loading

vladak commented Sep 30, 2020

vladak commented Oct 1, 2020

vladak commented Oct 1, 2020

vladak commented Oct 2, 2020 •

edited

Loading

vladak commented Oct 2, 2020

vladak commented Oct 5, 2020

vladak commented Oct 8, 2020

vladak commented Oct 8, 2020 •

edited

Loading

vladak commented Oct 8, 2020 •

edited

Loading

vladak commented Oct 10, 2020

idodeclare commented Oct 11, 2020 •

edited

Loading

idodeclare commented Oct 11, 2020

vladak commented Mar 25, 2021

vladak commented Apr 14, 2021

vladak commented Apr 14, 2021

FileHistoryCache#store is one big memory hog #3243

FileHistoryCache#store is one big memory hog #3243

Comments

vladak commented Sep 24, 2020 • edited Loading

vladak commented Sep 24, 2020

vladak commented Sep 24, 2020

vladak commented Sep 25, 2020

vladak commented Sep 29, 2020

vladak commented Sep 30, 2020 • edited Loading

vladak commented Sep 30, 2020

vladak commented Oct 1, 2020

vladak commented Oct 1, 2020

vladak commented Oct 2, 2020 • edited Loading

vladak commented Oct 2, 2020

vladak commented Oct 5, 2020

vladak commented Oct 8, 2020

vladak commented Oct 8, 2020 • edited Loading

vladak commented Oct 8, 2020 • edited Loading

vladak commented Oct 10, 2020

idodeclare commented Oct 11, 2020 • edited Loading

idodeclare commented Oct 11, 2020

vladak commented Mar 25, 2021

vladak commented Apr 14, 2021

vladak commented Apr 14, 2021

vladak commented Sep 24, 2020 •

edited

Loading

vladak commented Sep 30, 2020 •

edited

Loading

vladak commented Oct 2, 2020 •

edited

Loading

vladak commented Oct 8, 2020 •

edited

Loading

vladak commented Oct 8, 2020 •

edited

Loading

idodeclare commented Oct 11, 2020 •

edited

Loading