Fire - Notice history

All systems operational

Bot - Operational

100% - uptime
Sep 2021 · 100.0%
Oct 2021 · 100.0%
Nov 2021 · 100.0%

Website - Operational

100% - uptime
Sep 2021 · 100.0%
Oct 2021 · 100.0%
Nov 2021 · 100.0%

Notice history

Nov 2021

Oct 2021

Fire is offline in some guilds, responding twice in others
  • Resolved

    This issue has been resolved. After clusters identify with Aether, they are placed in a queue based on the max concurrency returned by Discord in /gateway/bot, which prevents too many shards from starting at once. Some clusters waited in this queue too long to be assigned an id and shards, so they disconnected and hit a race condition that allowed two of them to be assigned the same id and shards. That caused the double responses and left the shard that should have been assigned stuck offline. I noticed this issue pretty quickly, but I have added more robust monitoring for this exact scenario so that I am alerted almost immediately.

  • Monitoring

    A fix has been implemented and is currently being deployed.

  • Identified

    The issue has been identified and a fix is being made. Unfortunately, I cannot bring cluster 3 online without this fix, as the issue would just occur again.

  • Investigating

    It seems a recent deploy has caused some issues with assigning cluster and shard ids. Cluster 2 is currently responding twice (Fire will say 3/4 if this is your cluster, since the id is zero-indexed whereas the bot's status is not). Shard 3, which should be on cluster 3, is offline because cluster 3 is currently assigned the id 2. I will attempt to resolve the issue without rolling back changes, but a rollback followed by a more permanent fix may be necessary.
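
The duplicate id assignment described in this incident is the kind of race that a single locked check-and-claim step avoids. The following is only a minimal sketch of that idea, not Fire's or Aether's actual code: the `ClusterIdAllocator` class, the connection keys, and the `status_label` helper are all hypothetical names invented for illustration, and shard assignment is not modelled.

```python
import threading
from typing import Optional


class ClusterIdAllocator:
    """Hypothetical allocator: each cluster id is held by at most one connection."""

    def __init__(self, cluster_count: int) -> None:
        self._lock = threading.Lock()
        self._free = list(range(cluster_count))  # zero-indexed ids still available
        self._owners = {}                        # connection key -> cluster id

    def assign(self, connection: str) -> Optional[int]:
        """Return the id for this connection, or None if every id is taken."""
        with self._lock:  # the availability check and the claim are one atomic step
            if connection in self._owners:
                return self._owners[connection]
            if not self._free:
                return None
            cluster_id = self._free.pop(0)
            self._owners[connection] = cluster_id
            return cluster_id

    def release(self, connection: str) -> None:
        """Free the id held by a connection that has disconnected."""
        with self._lock:
            cluster_id = self._owners.pop(connection, None)
            if cluster_id is not None:
                self._free.append(cluster_id)
                self._free.sort()  # hand out the lowest free id first


def status_label(cluster_id: int, cluster_count: int) -> str:
    """Convert a zero-indexed cluster id to the one-indexed form shown in the status."""
    return f"{cluster_id + 1}/{cluster_count}"
```

With four clusters, `status_label(2, 4)` gives `3/4`, which matches the zero-indexed versus displayed numbering mentioned in the Investigating update. An id is only handed out again after `release` has been called for the connection that held it, so a cluster that drops out of the queue and reconnects cannot end up sharing an id with another one.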

Most if not all services unavailable
  • Resolved

    This incident has been resolved.

  • Monitoring

    Everything that was running is once again running (after some pain with poetry). I am now going to check that nothing has broken in the process.

  • Update

    All services listed on the statuspage are up and running! Working on internal services now. These services being down may affect some features in external services so you may encounter some errors.

  • Update

    Some services are back online, continuing to work on the rest!

  • Update

    Restoring from the PM2 dump was not successful, so I will be manually bringing all processes back. This may take a while, but I will try to start important processes first. I apologise for the inconvenience.

  • Update

    While working on restoring, I noticed PM2 was still trying to use the old Node version. It took a couple of minutes, but I have figured out why, so restoring should hopefully not take too much longer!

  • Identified

    While I was switching from NodeSource to nvm for managing Node versions, PM2 was killed and did not restore the process list when restarting. I am working on restoring everything now.

Sep 2021

Fire is offline in some guilds
  • Resolved

    Everything seems to be running smoothly now. It is still unknown when the issues started, so Fire may have been offline for quite a while.

    I have systems in place to alert me if a cluster dies, but due to a misconfiguration the alert was only set to recognize 2 of the 4 clusters, so it did not see the one offline cluster and 1 out of 4 crashing didn't trigger it. This misconfiguration has since been fixed, and any future issues like this should be resolved much sooner. I apologise for the inconvenience.

    I strive to have near perfect uptime for Fire, as poor uptime is one of the issues I've had with other bots. While this issue was caused by something not 100% in my control, I do consider it unacceptable that Fire wasn't able to recover automatically. I will be working to improve detection of issues like this, which alongside the fixed alerting should ensure this doesn't happen again.

    An interesting note on this incident: the ongoing rewrite of the Fire website was able to correctly identify and display the outage status for cluster 3, which means the issue was detected. It is therefore possible that an issue on Discord's end led to the cluster not reconnecting.

  • Monitoring

    Fire should now be online in all servers. If you still see Fire as offline, first try restarting Discord. If that does not resolve the issue, let me know in the #fire-help channel in Fire's Discord server (https://inv.wtf/fire).

  • Update

    While I was attempting to get cluster 3 back online, it was assigned a different cluster id and shard, which indicates another cluster may be having issues. I will instead perform a full clean restart of Fire (taking all clusters offline and then bringing them up one by one to ensure each process gets the correct id), so it will go offline temporarily in all servers.

  • Identified

    It seems that Fire's VPS had some intermittent network issues and that specific cluster did not recover (the process is still alive but not connected to Discord). It should come back online in ~30 seconds.

  • Investigating

    It seems a cluster (specifically cluster 3) has crashed and didn't automatically recover. I am investigating the cause and will get it back online as soon as possible.
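
The alerting gap described in this incident (a monitor that only recognized 2 of the 4 clusters) can be avoided by deriving the expected set of clusters from the total count instead of maintaining a watch list by hand. The sketch below only illustrates that idea and is not the monitoring Fire actually uses: the function names and the idea of passing in a list of currently connected ids are assumptions made for the example.

```python
def expected_cluster_ids(cluster_count):
    """Every zero-indexed cluster id that should be connected."""
    return set(range(cluster_count))


def missing_clusters(cluster_count, connected_ids):
    """Return the sorted ids that are expected but not currently connected."""
    return sorted(expected_cluster_ids(cluster_count) - set(connected_ids))


def should_alert(cluster_count, connected_ids):
    """Alert as soon as any single expected cluster is missing."""
    return len(missing_clusters(cluster_count, connected_ids)) > 0
```

For example, `missing_clusters(4, [0, 1, 2])` reports `[3]`, so one cluster out of four going away is enough to raise an alert. Because the expected set comes from the cluster count, adding a cluster later does not require remembering to update the monitor's configuration.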
