Postmortem: a Mastodon outage, backup restore and preventive maintenance
After reading about Mastodon UI theming options, I decided to follow the directions from the TangerineUI-for-Mastodon project to give my instance a new look and feel. The directions were clear and short, so I went ahead. But something failed during the asset compilation process, and my Mastodon instance got wrecked.
As a personal “challenge”, I decided to write a software postmortem about this event. The end of the document also summarizes the actions taken during the maintenance phase that followed the backup restoration.
Summary
Date | 2023-08-02 13:33 |
Author(s) | Joel C. |
Status | Service restored. |
Summary | The Mastodon Web UI was partially unavailable for about 3 hours during the daytime. |
Impact | No toots could be sent or read from the Web interface. No data were lost. Only user(s) who applied the new theme were impacted. |
Root cause(s) | The compilation of the theme assets failed, but the theme was still applied by the only user of the Mastodon instance. |
Trigger | Compilation errors were treated as warnings and ignored. |
Resolution | Frontend files and configuration were restored from the last nightly backup. |
Detection | Impacted user(s) could not access the Web interface. The Mastodon error message “Something went wrong on our side” was displayed. |
Corrective action(s)
Item | Type | Owner | Bug | Status |
---|---|---|---|---|
Reminder of the rules for Production operations. | mitigate | Joel C. | N/A | DONE |
Set up a non-Production environment for staging purposes. | prevent | Joel C. | #00008 | TODO |
Freeze unplanned feature changes for a month. | process | Joel C. | N/A | DONE |
Set up a daily maintenance script to optimize backups. | improvement | Joel C. | #00009 | DONE |
Lesson(s) learned
What went well
- The daily backups are healthy.
- The backup restoration was rather fast.
- No data were lost.
- Service was still up for API clients.
What went wrong
- The assets compilation failed.
- The backups are quite big.
Where we got lucky
- The database was not impacted.
- Messages are remotely queued during maintenance.
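Since the trigger was build errors being read as warnings, one possible guard is to wrap the build step so that it cannot be ignored. The following is a minimal sketch, not what was run during the incident: `run_checked` and the log path are hypothetical names, the `assets:precompile` command in the comment is an assumption about the build step, and matching the word "error" in the log is a crude heuristic that may need tuning.

```shell
#!/usr/bin/env bash
# run_checked LOGFILE COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND, saves stdout and stderr to LOGFILE, and returns non-zero
# when the command fails or when the log contains the word "error".
run_checked() {
    local log="$1"
    shift
    if ! "$@" >"$log" 2>&1; then
        echo "FAILED (exit status): $*" >&2
        return 1
    fi
    if grep -qi 'error' "$log"; then
        echo "FAILED (errors in log, see $log): $*" >&2
        return 1
    fi
    echo "OK: $*"
}

# Example, assuming the build is run from ~/live as the mastodon user:
# run_checked /tmp/assets.log env RAILS_ENV=production \
#     bundle exec rails assets:precompile || echo "Do not apply the theme."
```

The theme would only be selected in the Web UI once the helper prints `OK`.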
Timeline
2023-08-02 (GMT+2)
- 13:25 Download and install assets. Configure the YAML files.
- 13:30 Start assets compilation.
- 13:32 Apply theme from the Web UI.
- 13:33 The “something went wrong” message is displayed.
- 13:34 Read logs and compilation history output.
- 13:50 Read documentation on how to revert the theme selection from the CLI.
- 14:00 Read documentation on how to revert the theme selection from SQL.
- 14:20 No solution was found. Proceed to backup restoration.
- 14:50 Discover that the backup is 40GB+ and takes ages to transfer back to the server.
- 15:30 Read about cache data that can be ignored during restore.
- 16:00 Minimal backup data is restored.
- 16:01 Start issuing Mastodon maintenance commands using CLI.
- 16:35 Run tests with various accounts.
- 16:39 Service considered up & running again.
Restoring operations
Stop the Mastodon services and remove the failing content:
# systemctl stop mastodon-web mastodon-sidekiq mastodon-streaming
# rm -rf /home/mastodon/live
Transfer the backup data to the Mastodon server. I am using rsnapshot(1), which means that a restore can be done with a simple tar command over an SSH connection, from the backup server to the Mastodon server:
# eval "$(ssh-agent -s)"
# ssh-add /home/backup/.ssh/id_backup
# cd /backup/mastodon/daily.0/home/mastodon
# gtar cpf - --exclude=live/public/system/cache live | \
ssh backup@mastodon "cd /home/mastodon && tar xvpf -"
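Before starting a transfer like the one above, it can help to know how much data the tar stream will carry with and without the cache directory. This is a small sketch under assumptions: `archive_size` is a hypothetical helper, and the `TAR` variable is only there so that `gtar` can be selected on systems whose default tar lacks `--exclude`. It reads every file once, so it is not instant on a large tree.

```shell
#!/usr/bin/env bash
# archive_size DIR [TAR_OPTIONS...]   (hypothetical helper)
# Prints the size in bytes of the tar stream that would be sent for DIR.
# Nothing is written to disk: the archive is piped straight into wc.
archive_size() {
    local dir="$1"
    shift
    # Options such as --exclude must come before the directory name.
    "${TAR:-tar}" cf - "$@" "$dir" 2>/dev/null | wc -c | tr -d ' '
}

# Example, run on the backup server:
# cd /backup/mastodon/daily.0/home/mastodon
# TAR=gtar archive_size live
# TAR=gtar archive_size live --exclude=live/public/system/cache
```

Comparing the two numbers shows how much of the backup is cache that can be skipped during a restore.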
Clear the cache on the Mastodon server:
$ redis-cli -h 192.0.2.10 -n 0 FLUSHDB
$ cd ~/live
$ RAILS_ENV=production ./bin/tootctl cache clear
$ RAILS_ENV=production bin/tootctl preview_cards remove --days 0
14648/14648 |===========================================| Time: 00:01:53
Removed 14648 preview cards (approx. 489 MB)
$ RAILS_ENV=production bin/tootctl media remove --days 0
33553/33553 |===========================================| Time: 00:04:51
Removed 33553 media attachments (approx. 22 GB)
$ RAILS_ENV=production bin/tootctl media remove-orphans
46414/46414 |===========================================| Time: 00:01:57
Removed 53 orphans (approx. 7.49 MB)
$ RAILS_ENV=production bin/tootctl accounts cull
26989/26989 |===========================================| Time: 00:22:30
Visited 26989 accounts, removed 205
Proceed to cache warming for preferred domains:
$ RAILS_ENV=production bin/tootctl accounts refresh --domain bsd.network
$ RAILS_ENV=production bin/tootctl accounts refresh --domain piaille.fr
$ RAILS_ENV=production bin/tootctl accounts refresh --domain fosstodon.org
...
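The repeated refresh commands can also be written as a loop. This is only a sketch: `refresh_domains` is a hypothetical helper and the `TOOTCTL` variable is an assumption added so the path to `tootctl` can be overridden. It is meant to be run from `~/live` as the mastodon user, and the list of domains is whatever you pass to it.

```shell
#!/usr/bin/env bash
# refresh_domains DOMAIN...   (hypothetical helper)
# Runs "tootctl accounts refresh" once per domain and keeps going when
# one domain fails, reporting the failure on stderr.
refresh_domains() {
    local tootctl="${TOOTCTL:-bin/tootctl}" domain
    for domain in "$@"; do
        RAILS_ENV=production "$tootctl" accounts refresh --domain "$domain" \
            || echo "refresh failed for $domain" >&2
    done
}

# Example:
# cd ~/live && refresh_domains bsd.network piaille.fr fosstodon.org
```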
Restart and check Mastodon services status:
# systemctl start mastodon-web mastodon-sidekiq mastodon-streaming
# systemctl status mastodon-web mastodon-sidekiq mastodon-streaming
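For a scripted check, `systemctl is-active --quiet` returns a usable exit status for each unit. The sketch below is an illustration, with `check_services` as a hypothetical name. It only tells whether the units are running, not whether the Web UI renders correctly, so manual tests with real accounts remain necessary.

```shell
#!/usr/bin/env bash
# check_services UNIT...   (hypothetical helper)
# Prints one line per unit and returns non-zero if any unit is not active.
check_services() {
    local failed=0 unit
    for unit in "$@"; do
        if systemctl is-active --quiet "$unit"; then
            echo "up:   $unit"
        else
            echo "DOWN: $unit"
            failed=1
        fi
    done
    return "$failed"
}

# Example:
# check_services mastodon-web mastodon-sidekiq mastodon-streaming
```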
Maintenance operations
Since I configured proxies on my Mastodon instance, I get a lot more traffic and much more cache is used. I realised that far too much data is backed up by the daily process. I decided to tune my rsnapshot(1) configuration and to add a nightly maintenance script that erases “old” cache data. What counts as old is up to you.
The rsnapshot configuration is tweaked to exclude the Mastodon cache directory:
# vi /etc/rsnapshot/mastodon.conf
(...)
backup backup@mastodon:/home/ home/ exclude=mastodon/live/public/system/cache
(...)
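After such a change, it is worth checking both the configuration and the result. rsnapshot expects tabs, not spaces, between the fields of a `backup` line, and its `configtest` command reports syntax problems. The sketch below then checks that the next snapshot really contains no cache files. `cache_excluded` is a hypothetical helper, and the snapshot layout is assumed from the paths used in the restore commands above.

```shell
#!/usr/bin/env bash
# Syntax check of the configuration file (run on the backup server):
# rsnapshot -c /etc/rsnapshot/mastodon.conf configtest

# cache_excluded SNAPSHOT_DIR   (hypothetical helper)
# Succeeds when the snapshot holds no files under the Mastodon cache
# directory, either because the directory is absent or because it is empty.
cache_excluded() {
    local cache="$1/home/mastodon/live/public/system/cache"
    [ ! -d "$cache" ] || [ -z "$(find "$cache" -type f 2>/dev/null | head -n 1)" ]
}

# Example, after the next nightly run:
# cache_excluded /backup/mastodon/daily.0 && echo "cache is not backed up"
```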
A maintenance script runs every night and clears old cached data:
# vi /home/scripts/mastodon_maintenance
#!/usr/bin/env bash
PATH="/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin"
PATH="${PATH}:$HOME/.rbenv/shims:$HOME/.rbenv/bin"
export PATH RAILS_ENV=production LANG=en_US.utf8
cd ~/live || exit 1
sleep $((RANDOM%600))
echo "Starting Mastodon maintenance."
./bin/tootctl media remove --days 7
./bin/tootctl media remove --days 7 --prune-profiles
./bin/tootctl media remove --days 7 --remove-headers
./bin/tootctl media remove-orphans
./bin/tootctl preview_cards remove --days 30
./bin/tootctl statuses remove --days 7
echo "Mastodon maintenance done."
#EOF
# chmod 0755 /home/scripts/mastodon_maintenance
# cat > /etc/cron.d/mastodon_maintenance
@daily mastodon /home/scripts/mastodon_maintenance | mailx -s "Mastodon maintenance" root
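To see in the nightly mail how much space each run actually frees, the script could measure the cache directory before and after the `tootctl` calls. This is an optional sketch, not part of the script above: `cache_size_kb` and `report_freed` are hypothetical helpers, and the cache path is assumed from the rsnapshot exclusion.

```shell
#!/usr/bin/env bash
# cache_size_kb DIR   (hypothetical helper)
# Prints the disk usage of DIR in kilobytes, or 0 if DIR does not exist.
cache_size_kb() {
    if [ -d "$1" ]; then
        du -sk "$1" | awk '{print $1}'
    else
        echo 0
    fi
}

# report_freed BEFORE_KB AFTER_KB   (hypothetical helper)
# Prints a one-line summary suitable for the maintenance mail.
report_freed() {
    echo "Cache: $1 KB before, $2 KB after, $(( $1 - $2 )) KB freed."
}

# Example, inside the maintenance script:
# before=$(cache_size_kb ~/live/public/system/cache)
# ... tootctl commands ...
# after=$(cache_size_kb ~/live/public/system/cache)
# report_freed "$before" "$after"
```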
Storage usage now stays stable at about 30GB. No further Mastodon outage has occurred 😮‍💨