Partial Outage on EU Platform

Incident Report for The AskCody Platform

Postmortem

Incident Summary:

On both Monday, November 6th, and Tuesday, November 7th, between approximately 08:00 and 09:15, we experienced partial outages affecting the EU platform, as detailed in the incident updates on our status page.

Root Cause:

The outages were traced back to an issue with a specific Azure Database Server, which had recently undergone an update to enhance resource scaling in our cloud infrastructure. A pattern emerged where, at peak activity times, the system memory reached its maximum capacity. This resulted in approximately an hour of adjustment time before normal performance resumed.

Resolution and Mitigation:

Upon identifying the root cause, our team swiftly implemented and deployed a fix to address these unforeseen side effects of the recent infrastructure maintenance. The specific remedial action involved reconfiguring the database server with adequate memory allocation to handle the increased load during peak activity periods. This adjustment has since stabilized the platform, preventing recurrence of the issue.

Lessons Learned and Future Steps:

This incident underscores the importance of proactive resource management and monitoring, especially following significant updates or maintenance activities. We have taken steps to enhance our monitoring systems to detect similar issues more promptly in the future, ensuring quicker response and mitigation. Additionally, we are reviewing our update and maintenance protocols to incorporate more rigorous testing and validation processes, particularly in scenarios involving significant changes to our infrastructure's resource management.
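
As an illustration of the kind of check such monitoring could include, the following is a minimal sketch, not our actual monitoring configuration. It assumes memory utilization is sampled at a fixed interval as a percentage; the function name, the 90% threshold, and the three-sample minimum are hypothetical values chosen for the example.

```python
from typing import Iterable, List, Tuple


def find_saturation_windows(
    samples: Iterable[float],
    threshold: float = 90.0,      # hypothetical alert threshold, in percent
    min_consecutive: int = 3,     # hypothetical minimum run length, in samples
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index pairs of runs where memory
    utilization stayed at or above `threshold` for at least
    `min_consecutive` consecutive samples."""
    windows: List[Tuple[int, int]] = []
    run_start = None  # index where the current high-usage run began
    n = 0             # number of samples seen so far
    for i, value in enumerate(samples):
        n = i + 1
        if value >= threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            # The run ended at index i - 1; keep it only if long enough.
            if i - run_start >= min_consecutive:
                windows.append((run_start, i - 1))
            run_start = None
    # Handle a run that is still open when the samples end.
    if run_start is not None and n - run_start >= min_consecutive:
        windows.append((run_start, n - 1))
    return windows


# Example: one sustained run (indices 2 to 5) and one brief spike (index 7).
usage = [40, 55, 95, 97, 99, 96, 60, 92, 50]
print(find_saturation_windows(usage))
```

Requiring several consecutive samples above the threshold means a brief spike is ignored, while sustained saturation like the pattern described in the root cause is flagged.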

We apologize for any inconvenience caused and remain committed to continuously improving our systems for better reliability and user experience.

Posted Nov 13, 2023 - 08:51 CET

Resolved

After closely monitoring the changes we made to ensure continued stability, we can confirm that they worked as intended and that all services are fully operational.

A full postmortem will be posted within 7 days, to conclude the incident.

Posted Nov 08, 2023 - 09:00 CET

Update

The Platform is still fully operational.

Since we implemented the fix earlier today, we have not seen any indications of loading errors or similar issues that would have a noticeable impact on users.

We will continue to monitor, and if nothing changes, the next update will be tomorrow morning, no later than 09:00 CET on 08/11/2023.

Posted Nov 07, 2023 - 13:39 CET

Monitoring

The Platform is currently fully operational, and all services have been available since 09:30 CET.

A recent Azure Database update was performed to improve the way in which resources are scaled in our cloud infrastructure.
However, we are seeing a pattern where, at times of day with higher platform activity, the system memory reaches its maximum capacity and takes about half an hour to readjust and perform as expected.

As mentioned in our previous update, we have taken measures to prevent this from happening again and are monitoring the implemented fix.

The next update will be provided no later than 13:40 CET on 07/11/2023.

Posted Nov 07, 2023 - 11:34 CET

Identified

The Platform is currently fully operational, and all services have been available since 09:30 CET.

The cause of the partial platform outages we have experienced today and yesterday morning has been identified, and we are currently implementing and deploying a fix that will address these unforeseen side effects of the recent infrastructure maintenance.

The next update will be provided no later than 13:00 CET on 07/11/2023.

Posted Nov 07, 2023 - 11:03 CET

Investigating

Affected Users: All users
Region: Outside of North America

Since 08:15 CET, we have been experiencing increased response times and a partial outage of the Management Portal and Outlook add-ins.
Users may see a "Service unavailable" message when trying to access these services.

We are investigating the cause and will update this incident as we progress.

The next update will be provided no later than 11:00 CET on 07/11/2023.

Posted Nov 07, 2023 - 08:36 CET

This incident affected: Visitor Management (Europe) (Outlook Add-in, Visitor Management Portal, Check-in kiosk), Meeting Services (Europe) (Outlook Add-in, Meeting Services Portal, Meeting Services Finance API), Room Booking (Europe) (Outlook Add-in, Room Management Portal, Meeting Dashboards, Room Displays, Workplace Central, Mobile App (iOS), Mobile App (Android)), Workplace Insights (Europe) (Power BI Dashboard, Data Collection), and Infrastructure (Europe) (Exchange Integration, Calendar Notification Service, AskCody Active Directory Forwarding Service, Azure AD Integration).