Status

AI Dungeon

Apps

AI Dungeon Web (Production)
Phoenix Android App (Production)
Phoenix iOS App (Production)

Servers

AI Dungeon Servers

Models

Mixtral
MythoMax
Tiefighter
ChatGPT
GPT-4 Turbo
Dragon
Griffin

Past Incidents

Partial Server Outage (@September 20, 2023 → September 21, 2023)

We had a 12-hour partial server outage that started at 7pm MT (UTC-6) on September 20, 2023. The outage appears to have been caused by a redirect problem with the profile page. Our devs began investigating early in the morning of September 21, 2023 and resolved the issue with both a dyno increase and some bug fixes throughout the day.

Server + Database Instability (@August 21, 2023)

Given the turbulence of the weekend, we temporarily scaled up our server capacity to avoid further issues. This morning, we encountered problems when auto-scale was turned back on, but we worked with Azure to resolve the issue.

Additionally, a small number of users may be experiencing database connection issues. This is a separate problem stemming from our vendor, Timescale. We are still determining the root cause of this issue, but we have a mitigating solution in place for the interim.

Intermittent Server Outages (@August 19, 2023 → August 20, 2023)

AI Dungeon Legacy experienced intermittent server issues over the weekend. We reached out to Azure, our new server provider, and they helped us find the root cause of these problems: we were hitting the limit for SNAT ports available. This bug likely caused a number of outages in the past, and our devs have now determined a long-term fix.

Database Outage (@July 17, 2023)

3:54pm MT: AI Dungeon is currently down due to a database outage. We have contacted the provider and are working with them to restore service.

4:14pm MT: Timescale was able to restore service. We are following up with them to figure out what caused this outage.

Timescale Database Failure (@July 8, 2023)

Our database provider failed during an auto-scale process. The incident resulted in a long outage because the fix required a full database migration to a different infrastructure.

Coreweave Partial Outage (@June 16, 2023)

One of our Coreweave pods that hosts our GPT-J model failed, causing a partial outage for players using our Griffin model. We were able to reset the server and restore service to normal.

Heroku Redis Upgrade (@June 16, 2023)

Starting at 4:08am MT, a Heroku Redis auto-upgrade led to a conflict that prevented the AI Dungeon API from serving requests, causing an outage on production AI Dungeon. At 4:52am MT we identified the cause and deployed a fix.

Timescale Maintenance Issue (@May 30, 2023)

Starting at 12:38am MT, our database provider restarted our database as a part of planned maintenance. While this should have only taken a few minutes, we had a full outage until 1:37am MT when the database came fully back online.

From Timescale:

We were doing maintenance in our clusters to make TSDB 2.11 available to all customers. While we were doing that, we were restarting all customer instances during their preconfigured maintenance window. In addition to that we were upgrading software on the nodes where customer databases are running. We usually keep some spare nodes to accommodate the customer instances after they are restarted. Spare nodes were not created in our clusters in a timely manner this time because AWS couldn’t provide us with enough nodes of the requested size, which caused your database instance to wait longer until it was created. We’ve taken measures to ensure, that spare nodes are available in the cluster and this shouldn’t happen further.

We have a larger upgrade in coordination with our database provider that should be completed this summer, which will ensure these types of issues don't happen in the future.

Heroku Connection Errors (@May 13, 2023)

Starting at 6:00am MDT we saw intermittent connection errors that were causing around 50% of connections to fail on the AI Dungeon API. After attempting our own restart and seeing no progress, we contacted Heroku, who we use to host the Latitude API. They responded and the issue was resolved by 6:57am MDT.

We’ve followed up with the Heroku team since this is tracking with 2 other issues Heroku had in the last 24 hours.

High Database CPU Causing Partial Outage (@April 6, 2023)

We experienced high CPU load today that caused a partial outage. The issue was quickly detected and resolved, and the outage only lasted a few minutes. We’ll continue to monitor for unexpected behavior.

5-minute Outage (@March 20, 2023)

We experienced a brief downtime this morning due to an error on our end. Our team identified and resolved the problem quickly, and AI Dungeon was down for approximately 5 minutes.

Timescale Database Outage (@March 14, 2023)

10:13am MDT—Service Restored

The intervention seems to have resolved the outage we experienced today. All systems appear to be online once again. We’ll continue to monitor for any unexpected behaviors. Please let us know if you experience any issues by emailing us at support@aidungeon.io or on our Discord server.

8:40am MDT—Outage Update

AI Dungeon is currently experiencing a full outage due to issues with our database hosting provider. Our team has identified an intervention that should resolve the issue and is currently working with the vendor support team to apply it. The current estimate for service restoration is approximately 3 hours.

8:13am MDT—Database outage

We’re currently experiencing a partial outage caused by a partial outage at our database hosting provider. The team is working to resolve it. There may be periods of full outage as we work to reset our systems.

Heroku Network Errors Partial Outage (@February 15, 2023)

Our server provider, Heroku, had network issues related to our cluster. They began around 12:17 PM Mountain Time on February 15, 2023. We were alerted about lower traffic on our models at 12:42 PM and restarted the cluster to clear up the issue by 12:57 PM.

We continue to work with Heroku to limit the impact of these outages for our player base. We have increased the sensitivity of our own alerting system so that we find the issue sooner if it does occur again.

Coreweave Partial Outage (Griffin) (@January 17, 2023)

Coreweave (the provider we host Griffin on) is having some performance issues they recently alerted us about. We are monitoring and will update when performance returns to full. Until then some players may have technical issues for some generations.

Heroku Slowdowns (@December 14, 2022)

We are currently looking into some model slowdowns. We are reaching out to Heroku, which seems to be the cause of the issues.

Heroku Outage (@December 8, 2022)

December 8, 2022 7:18pm MST

Heroku service has been restored and AI Dungeon should be back online.

December 8, 2022 5:51pm MST

Heroku has issued an update on the outage. They have identified the cause of the outage and are working on a fix. They said they will provide an update within 30min.

December 8, 2022 5:28pm MST

The outage seems to be caused by a Heroku outage. We use Heroku to host portions of AI Dungeon. We’ll update the community when service is restored.

December 8, 2022 4:46pm MST

We’re experiencing a 90% outage across the app. Our team is aware of the issue and is diagnosing the problem.

Heroku Outage (@November 30, 2022)

11/30/22 4:11 pm MT: We were alerted that AI Dungeon was down for 80% of players for about 20 minutes due to an upstream issue with one of our tech providers. The issue is recovering on its own and should be resolved shortly.

11/30/22 4:16 pm MT: It appears all systems are once again operating at full capacity.

Timescale Database Partial Outage (@November 13, 2022)

11/13/22 4:29 pm MT: We are looking into a partial outage on AI Dungeon. Currently diagnosing increased 500 errors in the Latitude API.

11/13/22 4:37 pm MT: We got the server and database back to good health and are diagnosing the cause of the hiccups. We will continue to monitor.

Heroku Intermittent Outage with AI Dungeon API (@September 16, 2022 → September 22, 2022)

AI Dungeon is healthy moving forward, but the servers running our AI Dungeon API previously had intermittent issues.

September 20, 2022: The Heroku instance running our AI Dungeon API has had intermittent issues since last Friday that keep recurring. The team is digging into why this is happening and working to mitigate it. We have reached out to Heroku for more information about the behavior we’re seeing.

September 21, 2022: We've identified the intermittent issues this past week as a Heroku open connections limit we have been hitting (even using their largest plan). AI Dungeon is now healthy because Heroku is allowing us to bypass their normal limits. I will share more details tomorrow as we finalize the fix.

Action Counts Off + Griffin Outage (@August 25, 2022)

And we had another outage (a combination of database and then Coreweave instances going down). Action counts were off for a bit and ads were behaving oddly. We're awarding 200 actions to any impacted players.

Actions moving forward include a retro with the team and open conversation with Coreweave about how we build more resilience into the pod cluster, even with high traffic volumes.

Coreweave Outage (Griffin) (@August 25, 2022)

Our Griffin Pods had issues that didn’t recover even with a restart. Coreweave was able to help us successfully get the pods back online. We'll be working with them to figure out what caused this and how to avoid similar outages in the future.

Database Performance Issues (@August 25, 2022)

9:00 am: Some users are experiencing lag. We are currently digging in.

9:11 am: We've identified the issue related to database performance and are working on a fix.

9:58 am: A fix has been pushed by rolling back a change we made related to the upcoming gold system and scales changes. We will be adjusting given the performance issues before we push this again.

Heroku Outage (@August 23, 2022)

We had intermittent network issues due to an outage with one of our core providers, Heroku.

AI Art Temporary Outage (@August 16, 2022)

Pixray, Disco Diffusion, and VQgan were temporarily unavailable due to a service outage.

Heroku Outage (@August 15, 2022)

Our hosting provider had an outage that caused about 30 minutes of downtime and 20 minutes of degraded performance.

Coreweave Outage (@August 12, 2022)

One of our AI infrastructure partners, Coreweave, had an outage today that impacted us and other AI experiences. Griffin was unable to generate for 20 minutes because of this outage. We contacted the company and they quickly resolved the issue.

Partial Android App Outage (@August 7, 2022)

Version 153 of the Android App is installing, but the icon isn’t showing on some Android devices. If you haven’t upgraded yet, we recommend holding off. A developer worked on a new build late last night, and it is already waiting on Google's review.

This was caused by a relatively small package upgrade that worked in local testing but caused an issue on certain devices in the production deployment through Google. Frankly, we were surprised by this one, since Google's review is supposed to catch issues like this, which we can’t test without pushing live. We’ve contacted them to expedite this review and have looked into options for rolling back.

Apologies. We will update here once the new version is live. Android players experiencing this issue can play using their mobile browser as an interim solution.

© Latitude 2023