Service Outage Report - 20210717
Summary
A DB failure occurred because replication sync on the Mudfish DB slave server was slow. The slave's relay log kept accumulating until the slave's disk became full. As a result, the binary log on the DB master server also accumulated, and the master's disk became full as well.
Because of this, most login and UI-related features did not work, which caused considerable inconvenience across Mudfish as a whole. We sincerely apologize for the inconvenience. ㅠ.ㅠ
Outage Time
- 2021-07-17 3:30 PM ~ 2021-07-17 6:30 PM (based on KST)
- For about 3 hours, most Mudfish services, such as the web page and login, did not work properly.
Current Status
- The DB slave's replication sync was slow, so the machine was upgraded.
- If the DB slave's replication lag exceeds 600 seconds, an emergency failure alarm is now sent to the administrator.
- If the DB partition on the DB master or the DB slave is more than 80% full, an emergency failure alarm is now sent to the administrator.
- DB space usage has been reduced by shortening the table purge cycle for user RTT information, which occupies the most space in the DB, from 2 days to 6 hours.
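
As a rough illustration of the two alarm conditions above, here is a minimal sketch in Python. It is not the actual Mudfish monitoring code: the function names, the constants, and the handling of an unknown lag value are all assumptions made for this example. Only the 600-second and 80% thresholds come from this report. How the lag value is obtained is left to the caller (on a MySQL-style setup it would typically come from the replica status output, but that is also an assumption).

```python
import shutil
from typing import List, Optional

# Thresholds taken from this report.
LAG_LIMIT_SECONDS = 600
DISK_LIMIT_PERCENT = 80.0


def disk_used_percent(path: str) -> float:
    """Return how full the filesystem containing `path` is, in percent."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100.0


def check_alarms(lag_seconds: Optional[float], used_percent: float) -> List[str]:
    """Return the alarm messages that should be sent (hypothetical helper).

    lag_seconds is the slave's replication lag, or None if it could not
    be read. Treating None as an alarm is an assumption of this sketch.
    """
    alarms = []
    if lag_seconds is None:
        alarms.append("replication lag unknown")
    elif lag_seconds > LAG_LIMIT_SECONDS:
        alarms.append(
            f"replication lag {lag_seconds:.0f}s exceeds {LAG_LIMIT_SECONDS}s"
        )
    if used_percent > DISK_LIMIT_PERCENT:
        alarms.append(
            f"DB partition {used_percent:.1f}% full exceeds {DISK_LIMIT_PERCENT:.0f}%"
        )
    return alarms


if __name__ == "__main__":
    # Example: a hypothetical lag of 720 seconds and the current directory's disk.
    for message in check_alarms(720, disk_used_percent(".")):
        print("ALARM:", message)
```

Both checks use a strict "more than" comparison, matching the wording of the bullets above, so a lag of exactly 600 seconds or a partition at exactly 80% does not trigger an alarm in this sketch.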