
parallelize doFileHistory() for regular files in FileHistoryCache#store() #3542

Closed
vladak opened this issue Apr 14, 2021 · 2 comments
Comments


vladak commented Apr 14, 2021

Playing with a proof-of-concept fix for #3243, I realized that regular files could be parallelized in the same way as renamed files, i.e. create the directories first and then use a thread pool to perform doFileHistory() for each file. This will cost more memory; the same assumption applies as for the proof-of-concept fix, namely that the history of individual files is of reasonable size.
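A minimal sketch of the idea, assuming a hypothetical per-file doFileHistory() worker (the class and counter here are illustrative, not the actual FileHistoryCache API): submit one task per regular file to a fixed-size thread pool and wait for completion.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelHistorySketch {
    // Counts completed tasks; stands in for the real per-file work.
    static final AtomicInteger processed = new AtomicInteger();

    // Hypothetical stand-in for doFileHistory(); in the indexer this would
    // fetch and store the history of a single file.
    static void doFileHistory(String file) {
        processed.incrementAndGet();
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> files = List.of("a.c", "b.c", "c.c");
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        // Directories are assumed to be pre-created, so the tasks do not
        // race on mkdirs() and can run fully in parallel.
        for (String file : files) {
            pool.submit(() -> doFileHistory(file));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("processed=" + processed.get());
    }
}
```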


vladak commented Apr 15, 2021

Also, the current way the directories are created:

// The directories for the renamed files have to be created before
// the actual files otherwise storeFile() might be racing for
// mkdirs() if there are multiple renamed files from single directory
// handled in parallel.
for (final String file : renamedMap.keySet()) {
    File cache;
    try {
        cache = getCachedFile(new File(env.getSourceRootPath() + file));
    } catch (ForbiddenSymlinkException ex) {
        LOGGER.log(Level.FINER, ex.getMessage());
        continue;
    }
    File dir = cache.getParentFile();
    if (!dir.isDirectory() && !dir.mkdirs()) {
        LOGGER.log(Level.WARNING,
                "Unable to create cache directory ' {0} '.", dir);
    }
}

is sub-optimal: it should really assemble the directories to be created in a set first and then go through the set, calling mkdirs() for each item. As it is done now, isDirectory() is called more often than necessary. Of course, a more intelligent algorithm could call mkdirs() on the longest paths first and drop any path that is a strict prefix of another. Perhaps construct a tree structure representing the directory hierarchy, each node being a path component and the root node being the root directory (this would work fine on Unix systems; it is a question whether it would work on Windows in the indexer context), and once the tree is populated with all the directories to create, traverse the leaf nodes and call mkdirs() on them.
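The simple set-based variant could look like the following sketch. The class name, the helper, and the sample paths are hypothetical; in the indexer the cache files would come from getCachedFile() for each renamed file.

```java
import java.io.File;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DirPrecreateSketch {
    // Collect the unique parent directories of the given cache files,
    // so each directory is considered exactly once.
    static Set<File> collectParentDirs(List<File> cacheFiles) {
        Set<File> dirs = new HashSet<>();
        for (File f : cacheFiles) {
            dirs.add(f.getParentFile());
        }
        return dirs;
    }

    public static void main(String[] args) {
        // Hypothetical cache file paths for illustration only.
        List<File> cacheFiles = List.of(
                new File("cache/dir1/a.gz"),
                new File("cache/dir1/b.gz"),   // same parent as a.gz
                new File("cache/dir2/c.gz"));

        // Create each distinct directory once, avoiding the repeated
        // isDirectory()/mkdirs() calls of the per-file loop.
        for (File dir : collectParentDirs(cacheFiles)) {
            if (!dir.isDirectory() && !dir.mkdirs()) {
                System.err.println("Unable to create " + dir);
            }
        }
    }
}
```

The prefix-dropping and tree-based variants would reduce the number of mkdirs() calls further, at the cost of extra bookkeeping; the set alone already removes the duplicate work for files sharing a directory.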


vladak commented May 18, 2021

One observation made while working on the fix for #3243: when creating the history cache for a single repository with large history (e.g. Linux), the CPU is only lightly utilized, so this change should help boost indexer performance.

Also, this change is almost necessary given that the XML serialization filtering seems to impose an additional non-trivial workload (#3585).

@vladak vladak self-assigned this Jun 15, 2021
vladak pushed a commit to vladak/OpenGrok that referenced this issue Jun 15, 2021
@vladak vladak closed this as completed in c62b3ec Jun 17, 2021