Faster file stat for directories? #484

Open
wlandau opened this issue Dec 18, 2024 · 10 comments

@wlandau

wlandau commented Dec 18, 2024

base::file.info() is sometimes a bottleneck in targets pipelines with many files (cf. ropensci/targets#1403). On a slow NFS drive in a shared RHEL9 cluster, a custom C implementation that uses fts took about 4.5 seconds to stat 50,000 files in one directory, versus about 102.5 seconds for file.info(). Is this an appealing enhancement for ps, either in ps::ps_fs_stat() or elsewhere in the package? If not, I will just implement it for targets. Reprex:

directory <- "files"
if (!file.exists(directory)) {
  dir.create(directory)
}
files <- seq_len(5e4)
random_data <- function() {
  out <- list()
  rows <- sample(seq(from = 800, to = 1200), size = 1)
  for (name in paste0("x", seq_len(32L))) {
    out[[name]] <- runif(rows)
  }
  as.data.frame(out)
}
temp <- lapply(files, \(file) {
  if (!(file %% 100)) print(file)
  saveRDS(
    object = random_data(),
    file = file.path(directory, paste0("test-data-", file, ".rds")),
    compress = FALSE
  )
})

paths <- list.files("files", full.names = TRUE)
system.time(out_base <- file.info(paths, extra_cols = FALSE))
#>   user  system elapsed 
#>  0.033   0.345 102.515 

file_info_fts <- inline::cfunction(
  sig = c(directory = "character"),
  includes = c(
    "#include <R.h>",
    "#include <Rinternals.h>",
    "#include <fts.h>",
    "#include <sys/stat.h>",
    "#include <string.h>",
    "#include <errno.h>"
  ),
  body = "const char *path_argv[] = {CHAR(STRING_ELT(directory, 0)), NULL};
    FTS *fts = fts_open((char* const*) path_argv, FTS_LOGICAL, NULL);
    if (!fts) {
      Rf_error(\"fts_open() failed: %s\", strerror(errno));
    }
    int capacity = 2048;
    int count = 0;
    FTSENT *entry;
    SEXP path;
    SEXP size;
    SEXP mtime;
    PROTECT_INDEX index_path;
    PROTECT_INDEX index_size;
    PROTECT_INDEX index_mtime;
    PROTECT_WITH_INDEX(path = allocVector(STRSXP, capacity), &index_path);
    PROTECT_WITH_INDEX(size = allocVector(REALSXP, capacity), &index_size);
    PROTECT_WITH_INDEX(mtime = allocVector(REALSXP, capacity), &index_mtime);
    while ((entry = fts_read(fts)) != NULL) {
      R_CheckUserInterrupt();
      if (entry->fts_info == FTS_F) {
        if (count == capacity) {
          capacity *= 2;
          REPROTECT(path = Rf_xlengthgets(path, capacity), index_path);
          REPROTECT(size = Rf_xlengthgets(size, capacity), index_size);
          REPROTECT(mtime = Rf_xlengthgets(mtime, capacity), index_mtime);
        }
        SET_STRING_ELT(path, count, mkChar(entry->fts_path));
        REAL(size)[count] = (double) entry->fts_statp->st_size;
        REAL(mtime)[count] = (double) entry->fts_statp->st_mtime;
        count++;
      }
    }
    fts_close(fts);
    REPROTECT(path = Rf_xlengthgets(path, count), index_path);
    REPROTECT(size = Rf_xlengthgets(size, count), index_size);
    REPROTECT(mtime = Rf_xlengthgets(mtime, count), index_mtime);
    SEXP result = PROTECT(allocVector(VECSXP, 3));
    SEXP names = PROTECT(allocVector(STRSXP, 3));
    SET_STRING_ELT(names, 0, mkChar(\"path\"));
    SET_STRING_ELT(names, 1, mkChar(\"size\"));
    SET_STRING_ELT(names, 2, mkChar(\"mtime\"));
    SET_VECTOR_ELT(result, 0, path);
    SET_VECTOR_ELT(result, 1, size);
    SET_VECTOR_ELT(result, 2, mtime);
    setAttrib(result, R_NamesSymbol, names);
    UNPROTECT(5);
    return result;"
)

system.time(out_fts <- file_info_fts("files"))
#>   user  system elapsed 
#>  0.010   0.100   4.497

@wlandau
Author

wlandau commented Dec 19, 2024

Would this be a better fit for fs::dir_info()?

@wlandau
Author

wlandau commented Dec 19, 2024

Hmm, it looks like fts is not in POSIX, but ftw() apparently is (POSIX.1-2008) and seems to perform similarly. EDIT: readdir() looks about equally performant and is available under POSIX.1-2001, which is the version on some M2 Macs.
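
For readers who have not used ftw(), here is a minimal sketch of what a walk that collects sizes and modification times might look like. It is not the code that was benchmarked in this thread. The names tree_summary, record_entry, summarize_tree, and print_summary are hypothetical, and note that POSIX.1-2008 marks ftw() as obsolescent in favor of nftw().

```c
#include <assert.h>
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical summary type for this sketch. */
struct tree_summary {
  int status;     /* return value of ftw(): 0 if the walk ran to completion */
  long files;     /* number of regular files visited */
  double bytes;   /* sum of st_size over those files */
  double newest;  /* largest st_mtime seen, in seconds since the epoch */
};

/* ftw() gives its callback no user-data argument, so the running
   totals live in file-scope state (which is not thread-safe). */
static struct tree_summary walk_state;

/* Callback invoked by ftw() once per entry in the tree. */
static int record_entry(const char *path, const struct stat *st, int type) {
  (void) path;  /* the path is not needed for the totals */
  /* FTW_F means "not a directory"; S_ISREG narrows that to regular files. */
  if (type == FTW_F && S_ISREG(st->st_mode)) {
    walk_state.files++;
    walk_state.bytes += (double) st->st_size;
    if ((double) st->st_mtime > walk_state.newest) {
      walk_state.newest = (double) st->st_mtime;
    }
  }
  return 0;  /* a nonzero return would stop the walk early */
}

/* Walk `root` recursively and return the totals for its regular files. */
struct tree_summary summarize_tree(const char *root) {
  assert(root != NULL);
  walk_state = (struct tree_summary){0, 0, 0.0, 0.0};
  /* 16 is an arbitrary cap on the directory descriptors ftw() may hold open. */
  walk_state.status = ftw(root, record_entry, 16);
  return walk_state;
}

/* Print the totals in a human-readable form. */
void print_summary(const struct tree_summary *s) {
  printf("status=%d files=%ld bytes=%.0f newest_mtime=%.0f\n",
         s->status, s->files, s->bytes, s->newest);
}
```

Calling summarize_tree("files") and passing the result to print_summary() would report the totals for the reprex directory. The stat data arrives in the callback as part of the walk, so no second pass over a vector of paths is needed.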

@gaborcsardi
Member

I believe that ftw() is a wrapper around opendir() and readdir(), e.g. this is the musl implementation: https://git.musl-libc.org/cgit/musl/tree/src/misc/nftw.c

fs already has functions to traverse a directory hierarchy. Can you check if those are also slow, e.g. fs::dir_info(recurse = TRUE)? I suspect that the base R implementation is slower because it opens the same inode multiple times, and ftw() does not. Keeping a lot of files open has its own limitations as well, of course.

@wlandau
Author

wlandau commented Dec 20, 2024

fs::dir_info(recurse = TRUE) took about 80 seconds on a (slightly different) set of test files I am using, which is similar to the computation times I see from file.info().

@wlandau
Author

wlandau commented Dec 20, 2024

That makes sense because fs::dir_info() is pretty much just fs::file_info(fs::dir_ls()). So by the time we reach fs::file_info(), we forget that all the files live in the same directory.

@wlandau
Author

wlandau commented Dec 20, 2024

> I believe that ftw() is a wrapper on opendir() and readdir()

Indeed, I found that a pure opendir()/readdir() implementation is just as fast: https://github.com/wlandau/autometric/blob/5-readdir/src/stat.c
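
The linked stat.c is not reproduced here. As a rough, independent sketch of the general opendir()/readdir() pattern, the loop below lists one directory (non-recursively) and calls stat() on each entry as it goes. The names dir_totals, file_visitor, add_to_totals, and stat_directory are hypothetical, and the 4096-byte path buffer is an assumption. It sticks to opendir(), readdir(), and stat(), which are all in POSIX.1-2001.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical accumulator for this sketch. */
struct dir_totals {
  long files;     /* number of regular files seen */
  double bytes;   /* sum of st_size */
  double newest;  /* largest st_mtime, in seconds since the epoch */
};

/* Hypothetical callback type: called once per regular file. */
typedef void (*file_visitor)(const char *path, const struct stat *st,
                             void *data);

/* Example visitor that adds one file to a struct dir_totals. */
void add_to_totals(const char *path, const struct stat *st, void *data) {
  struct dir_totals *totals = data;
  (void) path;  /* the path is not needed for the totals */
  totals->files++;
  totals->bytes += (double) st->st_size;
  if ((double) st->st_mtime > totals->newest) {
    totals->newest = (double) st->st_mtime;
  }
}

/* Visit every regular file directly inside `directory`.
   Returns the number of files visited, or -1 if opendir() fails. */
long stat_directory(const char *directory, file_visitor visit, void *data) {
  assert(directory != NULL);
  DIR *dir = opendir(directory);
  if (dir == NULL) {
    return -1;
  }
  char path[4096];  /* assumed upper bound on path length */
  long count = 0;
  struct dirent *entry;
  while ((entry = readdir(dir)) != NULL) {
    if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0) {
      continue;
    }
    int n = snprintf(path, sizeof path, "%s/%s", directory, entry->d_name);
    if (n < 0 || (size_t) n >= sizeof path) {
      continue;  /* skip paths that do not fit in the buffer */
    }
    struct stat st;
    /* stat() follows symlinks, similar to FTS_LOGICAL in the fts reprex. */
    if (stat(path, &st) != 0 || !S_ISREG(st.st_mode)) {
      continue;
    }
    if (visit != NULL) {
      visit(path, &st, data);
    }
    count++;
  }
  closedir(dir);
  return count;
}
```

A caller would zero-initialize a struct dir_totals and pass its address as `data` together with add_to_totals. An R wrapper would instead append each path, size, and mtime to growing vectors, as the fts reprex does.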

@gaborcsardi
Member

It would make sense to have a faster directory walker in fs, I think, but then it would use libuv to do the work. Or we could improve the current walker.

@wlandau
Author

wlandau commented Dec 20, 2024

That would be great. Should I open a new issue there?

@gaborcsardi
Member

Let me transfer this one.

@gaborcsardi transferred this issue from r-lib/ps Dec 20, 2024
@wlandau changed the title from "Faster file stat with fts?" to "Faster file stat for directories?" Dec 20, 2024
@wlandau
Author

wlandau commented Dec 20, 2024

For what it's worth, I have a backup implementation in a branch of autometric: https://github.com/wlandau/autometric/blob/5/src/stat.c. But I'm sure whatever lands in fs will be more robust.
