Solr Backups not running #447

Open
maxkadel opened this issue Jan 10, 2025 · 10 comments

@maxkadel
Contributor

Expected behavior

On the Solr 8 servers (e.g. lib-solr-prod7.princeton.edu), the directory /mnt/solr_backup/solr8/production should contain per-day subdirectories, each holding .bk backup files.

Actual behavior

The /mnt/solr_backup/solr8/production directory contains per-day subdirectories, but they are empty.

Steps to replicate

SSH onto a Solr 8 box and ls the subdirectories in /mnt/solr_backup/solr8/production.
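For example (a minimal sketch; the host and the dated directory are illustrative, and deploy is the user used elsewhere in this thread):

# SSH onto one of the Solr 8 boxes.
ssh deploy@lib-solr-prod7.princeton.edu

# Each dated subdirectory should contain .bk files; currently they come back empty.
ls -l /mnt/solr_backup/solr8/production/
ls -l /mnt/solr_backup/solr8/production/20250110/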

Impact of this bug

We cannot restore production data if there is an issue.

Implementation notes, if any

@maxkadel maxkadel added the bug label Jan 10, 2025
@hackartisan
Member

hackartisan commented Jan 13, 2025

If invoked directly, the backup script runs on staging. (It looks like the schedule.rb configuration will never execute for the staging machine, since it's not in the :db group, so staging backups aren't running regularly; I assume this is on purpose.)

One difference I notice is that on staging the directories are group-writable, and on prod they are not. Perhaps that is the issue?
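A quick way to compare the two environments (a sketch; I'm assuming the staging tree sits at the parallel /mnt/solr_backup/solr8/staging path):

# Compare owner, group, and mode of the backup tree on a prod box...
stat -c '%U:%G %a %n' /mnt/solr_backup/solr8/production /mnt/solr_backup/solr8/production/*
# ...and the same on the staging box:
stat -c '%U:%G %a %n' /mnt/solr_backup/solr8/staging /mnt/solr_backup/solr8/staging/*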

I double-checked the refactors I made in #440 and they seem to produce the same API URLs.

@hackartisan
Member

The async request status IDs are written to /tmp/solr-backup.log.

Here is an example of the output I get querying one of those request statuses:

deploy@lib-solr-prod7:~$ curl "http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=figgy-production-202501132037"
{
  "responseHeader":{
    "status":0,
    "QTime":11},
  "Operation backup caused exception:":"java.nio.file.AccessDeniedException:java.nio.file.AccessDeniedException: /mnt/solr_backup/solr8/production/20250113/figgy-production-20250113.bk",
  "exception":{
    "msg":"/mnt/solr_backup/solr8/production/20250113/figgy-production-20250113.bk",
    "rspCode":-1},
  "status":{
    "state":"failed",
    "msg":"found [figgy-production-202501132037] in failed tasks"}}

@hackartisan
Member

More observations:

  • The last successful prod backup was on 2024/12/30; the one on the 31st failed, and every backup since then has failed.

  • On prod boxes 8 and 9 the mount directories are owned by root:root. Maybe these mounts got messed up somehow on the 30th. The Solr backup requires all cloud machines to share the mounted directory, so the backup operation is presumably distributed across nodes, and a permissions error on any one machine could fail the whole thing (see the sketch below).
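A quick way to spot the bad nodes (a sketch, assuming the cluster is prod7 through prod9 and the same deploy user on each box):

# Check who owns the shared backup mount on every node; Solr distributes the
# backup across the cloud, so a root:root mount on any one node can fail it.
for host in lib-solr-prod7 lib-solr-prod8 lib-solr-prod9; do
  echo "== $host =="
  ssh deploy@$host.princeton.edu "stat -c '%U:%G %n' /mnt/solr_backup"
done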

@hackartisan
Member

I checked ansible-alerts for anything run on Dec 30th that could be relevant. The closest thing I found was an Ansible run that updated the operating system packages on lib-solr-prod8.princeton.edu on the 31st. That ran at 6-something am Eastern; the backups would have run at midnight UTC, which is the evening of the 30th Eastern, so they would already have finished by then. This doesn't seem like it caused the failure.

@hackartisan
Member

hackartisan commented Jan 14, 2025

The solr role should set the correct user when adding the mount:

https://github.com/pulibrary/princeton_ansible/blob/00513a948b432be32ddd00b3fafae651bb969b2b/roles/solrcloud/tasks/main.yml#L32

The id looks right on box 8; maybe the role just needs to be re-run for some reason.

@maxkadel
Contributor Author

It looks like we had a PR to fix this permissions issue previously; it's not clear why the permissions would have changed again.

@maxkadel
Contributor Author

maxkadel commented Jan 15, 2025

Running the playbooks on production fixed the mount permissions and allowed us to run the backup script successfully on production.
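For the record, the fix amounted to re-applying the role against production, roughly like this (the playbook name is hypothetical; use whatever entry point the repo defines for the solrcloud role):

# Re-apply the solrcloud role so the mount is recreated with the right owner,
# then run the backup script by hand and poll REQUESTSTATUS to confirm success.
ansible-playbook playbooks/solrcloud_production.yml   # hypothetical playbook name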

Further refinements @hackartisan, @kayiwa, and I discussed were:

  1. Splitting the cleanup and backup tasks into their own cron jobs using the Ruby scripts / whenever gem.
  2. Using a different, more reliable mount than diglib-data for the backups (@kayiwa will make a ticket).
  3. We ran into a non-breaking deprecation in the solrcloud role that we should fix; we should probably switch to ansible.builtin.service here rather than ansible.builtin.command (see the note after the error output below):
TASK [roles/solrcloud : solrcloud | Check for solr service] ************************************************************
fatal: [lib-solr-prod8.princeton.edu]: FAILED! => {"changed": false, "msg": "Unsupported parameters for (ansible.legacy.command) module: warn. Supported parameters include: _raw_params, _uses_shell, argv, chdir, creates, executable, expand_argument_vars, removes, stdin, stdin_add_newline, strip_empty_ends."}
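For context, the rejected warn parameter suggests the control node's ansible-core no longer supports the command module's warn option (removed in ansible-core 2.14, if I remember right), so on newer ansible-core the old task fails outright. A quick check, as a sketch:

# Confirm the control node's ansible-core version; warn was dropped from the
# command module in newer releases.
ansible --version | head -1

# The check the task performs amounts to this. The service name "solr" is an
# assumption; porting the task to ansible.builtin.service would avoid the problem.
systemctl is-active solr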

@escowles
Member

If diglibdata is an issue, then it might make sense to prioritize this share for migrating to TigerData sooner than other shares. Though I expect the TigerData storage to be similar to diglibdata, just newer, so we should make sure it's appropriate for this use case.

@kayiwa
Member

kayiwa commented Jan 15, 2025

Heya @escowles. If you are on the Sensu alerts, you know I would want to really think on this. Also, the fact that this ticket exists makes me feel like another, non-TigerData solution would be my preference.

@escowles
Member

@kayiwa That is completely fine with me. I think there are a ton of reasons why non-TigerData storage is probably a better fit for this data, and I would expect storage we manage to be less of a headache to use, anyway.
