Incident Summary:
On Monday, November 6th and Tuesday, November 7th, between approximately 08:00 and 09:15 on each day, the EU platform experienced partial outages, as detailed in the incident updates on our status page.
Root Cause:
We traced the outages to a specific Azure Database Server that had recently been updated to improve resource scaling in our cloud infrastructure. Following that update, system memory reached its maximum capacity at peak activity times. Each time this happened, the system needed approximately an hour to adjust before normal performance resumed.
Resolution and Mitigation:
Once we had identified the root cause, our team implemented and deployed a fix for these unforeseen side effects of the recent infrastructure maintenance. Specifically, we reconfigured the database server with enough memory to handle the increased load during peak activity periods. This adjustment has since stabilized the platform, preventing recurrence of the issue.
Lessons Learned and Future Steps:
This incident underscores the importance of proactive resource management and monitoring, especially following significant updates or maintenance activities. We have taken steps to enhance our monitoring systems to detect similar issues more promptly in the future, ensuring quicker response and mitigation. Additionally, we are reviewing our update and maintenance protocols to incorporate more rigorous testing and validation processes, particularly in scenarios involving significant changes to our infrastructure's resource management.
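As an illustration only, the kind of check that helps catch sustained memory pressure early can be sketched as below. This is a minimal sketch and not our actual monitoring configuration: the function name, the 90% threshold, and the three-sample window are hypothetical values chosen for the example.

```python
from typing import Iterable


def sustained_memory_pressure(
    samples: Iterable[float],
    threshold_pct: float = 90.0,   # hypothetical alert threshold
    min_consecutive: int = 3,      # hypothetical number of samples in a row
) -> bool:
    """Return True if memory utilization stays at or above threshold_pct
    for at least min_consecutive consecutive samples."""
    run = 0
    for pct in samples:
        if pct >= threshold_pct:
            run += 1
            if run >= min_consecutive:
                return True
        else:
            # A single sample below the threshold resets the streak,
            # so brief spikes do not trigger an alert.
            run = 0
    return False


if __name__ == "__main__":
    # Illustrative utilization readings (percent), not real incident data.
    morning_peak = [72.0, 91.5, 95.0, 98.2, 99.0]
    print(sustained_memory_pressure(morning_peak))
```

Requiring several consecutive readings above the threshold, rather than a single one, is a common way to reduce false alarms from short-lived spikes while still flagging the kind of sustained saturation described in this report.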
We apologize for any inconvenience caused and remain committed to continuously improving our systems for better reliability and user experience.