
parallelize doFileHistory() for regular files in FileHistoryCache#store() #3542

Closed
vladak opened this issue Apr 14, 2021 · 2 comments
Comments


vladak commented Apr 14, 2021

Playing with a proof-of-concept fix for #3243, I realized that regular files could be parallelized in the same way as renamed files, i.e. create the directories first and then use a thread pool to perform doFileHistory() for each file. This will cost more memory; the same assumption applies as for the proof-of-concept fix, namely that the history of individual files is of reasonable size.
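A minimal sketch of the idea, assuming a hypothetical per-file doFileHistory() worker (the class and counter here are illustrative, not the actual FileHistoryCache API): submit one task per regular file to a fixed-size thread pool and wait for completion.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelHistorySketch {
    // Counts completed tasks; stands in for the real per-file work.
    static final AtomicInteger processed = new AtomicInteger();

    // Hypothetical stand-in for doFileHistory(); in the indexer this would
    // fetch and store the history of a single file.
    static void doFileHistory(String file) {
        processed.incrementAndGet();
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> files = List.of("a.c", "b.c", "c.c");
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        // Directories are assumed to be pre-created, so the tasks do not
        // race on mkdirs() and can run fully in parallel.
        for (String file : files) {
            pool.submit(() -> doFileHistory(file));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("processed=" + processed.get());
    }
}
```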


vladak commented Apr 15, 2021

Also, the current way the directories are created:

// The directories for the renamed files have to be created before
// the actual files otherwise storeFile() might be racing for
// mkdirs() if there are multiple renamed files from single directory
// handled in parallel.
for (final String file : renamedMap.keySet()) {
    File cache;
    try {
        cache = getCachedFile(new File(env.getSourceRootPath() + file));
    } catch (ForbiddenSymlinkException ex) {
        LOGGER.log(Level.FINER, ex.getMessage());
        continue;
    }
    File dir = cache.getParentFile();
    if (!dir.isDirectory() && !dir.mkdirs()) {
        LOGGER.log(Level.WARNING,
                "Unable to create cache directory ' {0} '.", dir);
    }
}

is sub-optimal: it should really assemble the directories to be created in a set first and then go through the set, calling mkdirs() for each item. As it is done now, isDirectory() is called more often than necessary. Of course, a more intelligent algorithm could call mkdirs() on the longest paths first and drop any path that is a strict prefix of another. Perhaps construct a tree structure representing the directory hierarchy, each node being a path component and the root node being the root directory (this would work fine on Unix systems; it is a question whether it would work on Windows in the indexer context), and once the tree is populated with all the directories to create, traverse the leaf nodes and call mkdirs() on them.
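The simple set-based variant could look like the following sketch. The class name, the helper, and the sample paths are hypothetical; in the indexer the cache files would come from getCachedFile() for each renamed file.

```java
import java.io.File;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DirPrecreateSketch {
    // Collect the unique parent directories of the given cache files,
    // so each directory is considered exactly once.
    static Set<File> collectParentDirs(List<File> cacheFiles) {
        Set<File> dirs = new HashSet<>();
        for (File f : cacheFiles) {
            dirs.add(f.getParentFile());
        }
        return dirs;
    }

    public static void main(String[] args) {
        // Hypothetical cache file paths for illustration only.
        List<File> cacheFiles = List.of(
                new File("cache/dir1/a.gz"),
                new File("cache/dir1/b.gz"),   // same parent as a.gz
                new File("cache/dir2/c.gz"));

        // Create each distinct directory once, avoiding the repeated
        // isDirectory()/mkdirs() calls of the per-file loop.
        for (File dir : collectParentDirs(cacheFiles)) {
            if (!dir.isDirectory() && !dir.mkdirs()) {
                System.err.println("Unable to create " + dir);
            }
        }
    }
}
```

The prefix-dropping and tree-based variants would reduce the number of mkdirs() calls further, at the cost of extra bookkeeping; the set alone already removes the duplicate work for files sharing a directory.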


vladak commented May 18, 2021

One observation made while working on the fix for #3243: when creating the history cache for a single repository with large history (e.g. Linux), the CPU is only lightly utilized, so this change should help boost indexer performance.

Also, this change is almost necessary given that the XML serialization filtering seems to impose an additional non-trivial workload (#3585).

@vladak vladak self-assigned this Jun 15, 2021
vladak pushed a commit to vladak/OpenGrok that referenced this issue Jun 15, 2021
@vladak vladak closed this as completed in c62b3ec Jun 17, 2021