Path deduplication in oc_filecache #42182

Open
taminob opened this issue Dec 12, 2023 · 0 comments
Labels
0. Needs triage Pending check for reproducibility or if it fits our roadmap enhancement feature: filesystem performance 🚀

taminob commented Dec 12, 2023

How to use GitHub

  • Please use the 👍 reaction to show that you are interested in the same feature.
  • Please don't comment if you have no relevant information to add. It's just extra noise for everyone subscribed to this issue.
  • Subscribe to receive notifications on status change and new comments.

Is your feature request related to a problem?
My small-to-medium Nextcloud instance (~80 GB of files) has a 300 MB database, and the oc_filecache table alone accounts for more than 60% of it.
I think the database size can be reduced, which might also improve performance on large instances.

Describe the solution you'd like
While working on #41321, I noticed that oc_filecache stores the full (internal) path of every file and, in addition, the file name.
Deduplicating these paths in the database could shrink the table considerably.

This could be achieved by creating a table oc_directories (or oc_paths) that stores each directory path once and maps it to an id.
oc_filecache would then reference this id instead of storing the raw path string. The full path can be reconstructed by joining oc_filecache with oc_directories and appending the file name to the directory path.
The full directory path would then exist only in oc_directories, so a directory containing a very large number of files would add one path string to the database instead of one per file.
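
To make the proposal above concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not Nextcloud code: the tables `directories` and `filecache` are simplified, hypothetical stand-ins for oc_directories and oc_filecache, and the column `dir_id` and the helpers `create_schema`, `add_file` and `full_path` are names I made up for illustration.

```python
import posixpath
import sqlite3


def create_schema(conn):
    # Hypothetical, heavily simplified stand-ins for oc_directories / oc_filecache.
    conn.executescript("""
        CREATE TABLE directories (
            dir_id INTEGER PRIMARY KEY,
            path   TEXT NOT NULL UNIQUE      -- each directory path is stored once
        );
        CREATE TABLE filecache (
            fileid INTEGER PRIMARY KEY,
            dir_id INTEGER NOT NULL REFERENCES directories(dir_id),
            name   TEXT NOT NULL             -- file name only, no directory part
        );
    """)


def add_file(conn, path):
    # Split "files/photos/a.jpg" into ("files/photos", "a.jpg").
    directory, name = posixpath.split(path)
    # Insert the directory only if it is not known yet.
    conn.execute("INSERT OR IGNORE INTO directories (path) VALUES (?)", (directory,))
    (dir_id,) = conn.execute(
        "SELECT dir_id FROM directories WHERE path = ?", (directory,)
    ).fetchone()
    cur = conn.execute(
        "INSERT INTO filecache (dir_id, name) VALUES (?, ?)", (dir_id, name)
    )
    return cur.lastrowid


def full_path(conn, fileid):
    # Rebuild the full path by joining both tables.
    row = conn.execute(
        """
        SELECT d.path, f.name
        FROM filecache f
        JOIN directories d ON d.dir_id = f.dir_id
        WHERE f.fileid = ?
        """,
        (fileid,),
    ).fetchone()
    if row is None:
        return None
    directory, name = row
    return posixpath.join(directory, name)


conn = sqlite3.connect(":memory:")
create_schema(conn)
first = add_file(conn, "files/photos/a.jpg")
second = add_file(conn, "files/photos/b.jpg")
print(full_path(conn, first))  # files/photos/a.jpg
```

Both files share a single row in `directories`, so the string "files/photos" is stored once regardless of how many files the directory contains. The cost is an extra join on every lookup by path.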

Describe alternatives you've considered

  • A smaller fix would be to drop the file name column, since the name can easily be derived from the path using basename.
  • The table already has a parent column containing the fileid of the parent directory. Following that column recursively until the root entry is reached could replace the path column entirely. However, I'm not sure how this would affect performance for very deeply nested directories, so the solution described above seems like a reasonable compromise.
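
Both alternatives above can be sketched in a few lines of Python with sqlite3. Again, this is only an illustration under assumptions: the `filecache` table is a simplified stand-in, the helpers `name_from_path` and `path_from_parents` are hypothetical, and the sketch assumes the root entry has an empty name and a parent that either matches no row (here -1) or equals its own fileid.

```python
import posixpath
import sqlite3


def name_from_path(path):
    # Alternative 1: the name column is redundant, it is the basename of the path.
    return posixpath.basename(path)


def path_from_parents(conn, fileid):
    # Alternative 2: rebuild the path by walking the parent column up to the root.
    rows = conn.execute(
        """
        WITH RECURSIVE chain(fileid, parent, name, depth) AS (
            SELECT fileid, parent, name, 0
            FROM filecache
            WHERE fileid = ?
            UNION ALL
            SELECT f.fileid, f.parent, f.name, chain.depth + 1
            FROM filecache f
            JOIN chain ON f.fileid = chain.parent
            WHERE chain.parent != chain.fileid   -- stop at a self-referencing root
        )
        SELECT name FROM chain ORDER BY depth DESC
        """,
        (fileid,),
    ).fetchall()
    # The root entry has an empty name (assumption), so it is skipped.
    return "/".join(name for (name,) in rows if name)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE filecache ("
    "fileid INTEGER PRIMARY KEY, parent INTEGER NOT NULL, name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO filecache VALUES (?, ?, ?)",
    [(1, -1, ""), (2, 1, "files"), (3, 2, "photos"), (4, 3, "a.jpg")],
)
print(name_from_path("files/photos/a.jpg"))  # a.jpg
print(path_from_parents(conn, 4))            # files/photos/a.jpg
```

The recursive query performs one lookup per directory level, which is where the concern about very deep nesting comes from. It also makes queries that filter by path prefix harder to express than with a stored path.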

Additional context
I can help implement this, but would appreciate a few pointers on anything I should take into account.

I'm also not sure whether third-party apps use the filecache directly. If they do, this change would have to ship in a major release, since it could break those apps.

@taminob taminob added 0. Needs triage Pending check for reproducibility or if it fits our roadmap enhancement labels Dec 12, 2023