Solr Backups not running #447

Open
maxkadel opened this issue Jan 10, 2025 · 10 comments

@maxkadel
Contributor

Expected behavior

On the Solr 8 servers (e.g. lib-solr-prod7.princeton.edu), the directory /mnt/solr_backup/solr8/production should contain per-day subdirectories, each holding .bk backup files.

Actual behavior

The /mnt/solr_backup/solr8/production directory contains per-day subdirectories, but they are empty.

Steps to replicate

SSH onto a Solr 8 box and ls the subdirectories in /mnt/solr_backup/solr8/production.
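For example (a minimal sketch; the host and the dated directory are illustrative, and deploy is the user used elsewhere in this thread):

# SSH onto one of the Solr 8 boxes.
ssh deploy@lib-solr-prod7.princeton.edu

# Each dated subdirectory should contain .bk files; currently they come back empty.
ls -l /mnt/solr_backup/solr8/production/
ls -l /mnt/solr_backup/solr8/production/20250110/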

Impact of this bug

We cannot restore production data if there is an issue.

Implementation notes, if any

@maxkadel maxkadel added the bug label Jan 10, 2025
@hackartisan
Member

hackartisan commented Jan 13, 2025

If invoked directly, the backup script runs on staging. (It looks like the schedule.rb configuration will never execute for the staging machine, since it's not in the :db group, so staging backups aren't running regularly; I assume this is on purpose.)

One difference I notice is that on staging the directories are group-writable, and on prod they are not. Perhaps that is the issue?
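A quick way to compare the two environments (a sketch; I'm assuming the staging tree sits at the parallel /mnt/solr_backup/solr8/staging path):

# Compare owner, group, and mode of the backup tree on a prod box...
stat -c '%U:%G %a %n' /mnt/solr_backup/solr8/production /mnt/solr_backup/solr8/production/*
# ...and the same on the staging box:
stat -c '%U:%G %a %n' /mnt/solr_backup/solr8/staging /mnt/solr_backup/solr8/staging/*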

I double-checked the refactors I made in #440 and they seem to produce the same API URLs.

@hackartisan
Member

The async request status IDs are written to /tmp/solr-backup.log.

Here is an example of the output I get querying one of those request statuses:

deploy@lib-solr-prod7:~$ curl "http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=figgy-production-202501132037"
{
  "responseHeader":{
    "status":0,
    "QTime":11},
  "Operation backup caused exception:":"java.nio.file.AccessDeniedException:java.nio.file.AccessDeniedException: /mnt/solr_backup/solr8/production/20250113/figgy-production-20250113.bk",
  "exception":{
    "msg":"/mnt/solr_backup/solr8/production/20250113/figgy-production-20250113.bk",
    "rspCode":-1},
  "status":{
    "state":"failed",
    "msg":"found [figgy-production-202501132037] in failed tasks"}}

@hackartisan
Member

More observations:

  • The last successful prod backup was on 2024/12/30; the one on the 31st failed, and every backup since then has failed.

  • On prod boxes 8 and 9 the mount directories are owned by root:root. Maybe these mounts got messed up somehow on the 30th. The Solr backup requires all cloud machines to share the mounted directory, so the backup operation is presumably distributed across nodes, and a permissions error on any one machine could fail the whole thing (see the sketch below).
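A quick way to spot the bad nodes (a sketch, assuming the cluster is prod7 through prod9 and the same deploy user on each box):

# Check who owns the shared backup mount on every node; Solr distributes the
# backup across the cloud, so a root:root mount on any one node can fail it.
for host in lib-solr-prod7 lib-solr-prod8 lib-solr-prod9; do
  echo "== $host =="
  ssh deploy@$host.princeton.edu "stat -c '%U:%G %n' /mnt/solr_backup"
done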

@hackartisan
Member

I checked ansible-alerts for anything run on Dec 30th that could be relevant. The closest thing I found was an Ansible run that updated the operating system packages on lib-solr-prod8.princeton.edu on the 31st. That ran at 6-something am Eastern; the backups would have run at midnight UTC, which is the evening of the 30th Eastern, so they would already have finished by then. This doesn't seem like it caused the failure.

@hackartisan
Member

hackartisan commented Jan 14, 2025

The solr role should set the correct user when adding the mount:

https://github.com/pulibrary/princeton_ansible/blob/00513a948b432be32ddd00b3fafae651bb969b2b/roles/solrcloud/tasks/main.yml#L32

The id looks right on box 8; maybe the role just needs to be re-run for some reason.

@maxkadel
Contributor Author

It looks like we had a PR to fix this permissions issue previously; it's not clear why the permissions would have changed again.

@maxkadel
Contributor Author

maxkadel commented Jan 15, 2025

Running the playbooks on production fixed the mount permissions and allowed us to run the backup script successfully on production.
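For the record, the fix amounted to re-applying the role against production, roughly like this (the playbook name is hypothetical; use whatever entry point the repo defines for the solrcloud role):

# Re-apply the solrcloud role so the mount is recreated with the right owner,
# then run the backup script by hand and poll REQUESTSTATUS to confirm success.
ansible-playbook playbooks/solrcloud_production.yml   # hypothetical playbook name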

Further refinements @hackartisan, @kayiwa, and I discussed were:

  1. Splitting the cleanup and backup tasks into their own cron jobs using the Ruby scripts / whenever gem.
  2. Using a different, more reliable mount than diglib-data for the backups (@kayiwa will make a ticket).
  3. We ran into a non-breaking deprecation in the solrcloud role that we should fix; we should probably switch to ansible.builtin.service here rather than ansible.builtin.command (see the note after the error output below):
TASK [roles/solrcloud : solrcloud | Check for solr service] ************************************************************
fatal: [lib-solr-prod8.princeton.edu]: FAILED! => {"changed": false, "msg": "Unsupported parameters for (ansible.legacy.command) module: warn. Supported parameters include: _raw_params, _uses_shell, argv, chdir, creates, executable, expand_argument_vars, removes, stdin, stdin_add_newline, strip_empty_ends."}
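For context, the rejected warn parameter suggests the control node's ansible-core no longer supports the command module's warn option (removed in ansible-core 2.14, if I remember right), so on newer ansible-core the old task fails outright. A quick check, as a sketch:

# Confirm the control node's ansible-core version; warn was dropped from the
# command module in newer releases.
ansible --version | head -1

# The check the task performs amounts to this. The service name "solr" is an
# assumption; porting the task to ansible.builtin.service would avoid the problem.
systemctl is-active solr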

@escowles
Member

If diglibdata is an issue, then it might make sense to prioritize this share for migrating to TigerData sooner than other shares. Though I expect the TigerData storage to be similar to diglibdata, just newer, so we should make sure it's appropriate for this use case.

@kayiwa
Member

kayiwa commented Jan 15, 2025

Heya @escowles. If you are on the Sensu alerts, you know I would want to really think on this. Also, the fact that this ticket exists makes me feel like another, non-TigerData solution would be my preference.

@escowles
Member

@kayiwa That is completely fine with me. I think there are a ton of reasons why non-TigerData storage is probably a better fit for this data, and I would expect storage we manage to be less of a headache to use, anyway.
