Maintenance for the week of December 23:
· [COMPLETE] NA megaservers for maintenance – December 23, 4:00AM EST (9:00 UTC) - 9:00AM EST (14:00 UTC)
· [COMPLETE] EU megaservers for maintenance – December 23, 9:00 UTC (4:00AM EST) - 14:00 UTC (9:00AM EST)

Emergency Server Downtime Hangout Thread 12/12/2024

  • dk_dunkirk
    dk_dunkirk
    ✭✭✭✭✭
    sarahthes wrote: »
    dk_dunkirk wrote: »
    FireSoul wrote: »
    Hi,

    I'm a Linux Systems Engineer, by profession, and I have been through a colo-wide Emergency Power Off event in my time.

    Let me tell you, it's not as simple as just turning stuff back on...
    1. Our colocation center ITSELF was supposed to be our UPS. There was no separate UPS. If the colo goes out, that's it.
    2. When power was cut off, it didn't take us long to figure out that the colo... disappeared. We basically clown-car'd over to the datacenter and were there for a long time. The power failure had occurred in the early evening on a Friday, and the 5 of us who rushed over spent all night there.
    3. When the power came back, ALL of the machines tried to POST and boot at the same time. I don't know if you've ever heard servers spin up, but their fans scream and everything goes full power for a second. There was a brownout, and 2/3 of the hosts were stuck frozen in POST. Someone had to go around with a crash cart/KVM to check each host's health and force a power cycle, one host at a time. There can be a LOT of hosts in a colo.
    4. Our disaster recovery plan never had a 'cold start' procedure prepared, so we had to make one up on the fly (see the sketch a little further down). The network switches just power on by themselves; everything else has to come up in order. Storage, database, and caching hosts first. Tools and things that talk to the storage hosts next (workhorse hosts, website). Once that's up and healthy, proxies come up last, opening the floodgates to services.
    5. Many of the database hosts had corrupted tables that needed SQL table repair after boot (see the mysqlcheck sketch further down). I saw in another thread that there are indeed MySQL hosts involved, so they have my sympathy there. *thousand-yard stare*
    6. Some hosts were DOA and wouldn't even power on. Sometimes it was a standby for a given role, so we just let it stay dead until we had time for a replacement. Others were primaries, and we had to force emergency failovers and make sure the old dead primaries stayed dead and didn't come back to life to mess things up. That left some things a bit out of sync after revival.

    Anyway, we worked all weekend. We had standby hosts to revive or replace and a lot of cleanup to do to damaged databases that we had to prioritize.
    When we walked in the office door on Monday, the office staff stood up and gave us a standing ovation.

    As someone who helped bring a data center online, I don't understand ANY of this. We had redundant EVERYTHING except main power feeds (because of local zoning). Even redundant ISPs and physical drops. We tested our generators and our UPSes and cooling towers monthly. I don't get it. A colo facility that even HAS an "edge case" scenario is not one I'd trust a million-dollar-a-year business to.

    It sounds to me like the system that cuts everything out to prevent loss due to water damage kicked in. The one where "welp it's better to shut down unexpectedly rather than short out" comes into play. Basically where you don't WANT backup power to kick in.

    Wait. Where's the comment that there was a flood? I've been reading and hadn't seen that.

    And why would a flood at a single facility take out BOTH the NA and EU "megaservers?" That doesn't make sense.

    And "my" data center was purposely built higher than the 100-year flood plain. :-D
  • skellyink
    skellyink
    ✭✭✭
    FireSoul wrote: »
    <snip quoted cold-start story, posted in full above>

    A lot of this went *whew* right over my head, but thanks for this--really helpful for understanding why they can't just turn things on like a light switch. I'm thinking about all the poor souls trying to fix this right now. They should get a raise.
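On FireSoul's point 5 (post-crash table repair), here is a purely illustrative sketch of the kind of check-and-repair pass that gets run after a dirty shutdown, using the stock mysqlcheck utility. The connection details are hypothetical placeholders, and note that REPAIR TABLE really only helps MyISAM/ARIA-style tables; corrupted InnoDB tables usually mean innodb_force_recovery and a restore from backup instead.

```python
# Illustrative sketch only: a post-crash check/auto-repair pass with mysqlcheck.
import subprocess

MYSQL_ARGS = ["--host=127.0.0.1", "--user=admin"]  # hypothetical connection details

def check_and_autorepair() -> None:
    # --all-databases checks every schema; --auto-repair repairs any table the
    # check pass flags as corrupted (effective for MyISAM/ARIA, not InnoDB).
    cmd = ["mysqlcheck", *MYSQL_ARGS, "--all-databases", "--auto-repair"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout)
    if result.returncode != 0:
        # Whatever mysqlcheck couldn't fix gets escalated to a human and to backups.
        raise RuntimeError(f"mysqlcheck reported problems:\n{result.stderr}")

if __name__ == "__main__":
    check_and_autorepair()
```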
  • Danikat
    Danikat
    ✭✭✭✭✭
    ✭✭✭✭✭
    sarahthes wrote: »
    Danikat wrote: »
    I don't understand how one power cut can take out both the NA and EU servers. Isn't the point of having different regional servers that they're in different physical locations, to improve the connection for people on that continent?

    The login server is located in the same data center as the NA servers, and it serves both NA and EU.

    Well that's annoying, but it does explain why EU players so often have trouble logging in but then can play with no problems once we're in. If they're completely different data centers, and the one we only need for logging in is a few thousand miles further away, it's not surprising that's usually where problems occur.
    PC EU player | She/her/hers | PAWS (Positively Against Wrip-off Stuff) - Say No to Crown Crates!

    "Remember in this game we call life that no one said it's fair"
  • Personofsecrets
    Personofsecrets
    ✭✭✭✭✭
    ✭✭✭
    @ZOS_Kevin

    In the event that the data center has some kind of major issue, are there redundancies to ensure that player account data isn't lost?
    Edited by Personofsecrets on December 12, 2024 11:40PM
    My Holiday Wishlist Below - Message me with any questions and Happy Holidays.

    https://forums.elderscrollsonline.com/en/discussion/comment/8227786#Comment_8227786
  • Wolfkeks
    Wolfkeks
    ✭✭✭✭✭
    guar-tiently waiting - and showing off bantam guars in the meantime
    "Sheggorath, you are the Skooma Cat, for what is crazier than a cat on skooma?" - Fadomai
    EU PC 2000+ CP professional mudballer and pie thrower
    Former Emperor, Grand Overlord, vAA hm, vHelRa hm, vSO hm, vMoL hm, vHoF hm, vAS+2, vCR+3, vSS hm, vKA, vRG, Flawless Conquerer, Spirit Slayer
  • sarahthes
    sarahthes
    ✭✭✭✭✭
    ✭✭
    dk_dunkirk wrote: »
    <snip earlier quote chain>

    Wait. Where's the comment that there was a flood? I've been reading and hadn't seen that.

    And why would a flood at a single facility take out BOTH the NA and EU "megaservers?" That doesn't make sense.

    And "my" data center was purposely built higher than the 100-year flood plain. :-D

    They posted that the system that kicks in to totally knock out power in case of fire or flood is what kicked in. I don't think there was an actual flood. More likely a problem with the building's emergency systems.
  • FireSoul
    FireSoul
    ✭✭
    dk_dunkirk wrote: »
    As someone who helped bring a data center online, I don't understand ANY of this. We had redundant EVERYTHING except main power feeds (because of local zoning). Even redundant ISPs and physical drops. We tested our generators and our UPSes and cooling towers monthly. I don't get it. A colo facility that even HAS an "edge case" scenario is not one I'd trust a million-dollar-a-year business to.

    It was an EPO event. "Emergency Power Off". Someone pressed the "big red emergency" switch that is there by LAW.
    It was a security guard. He had knocked the cover off the switch. An alarm went off. He tried to put the cover back on. The alarm did not stop. He panicked.

    Then he pressed the end-of-employment button.

    edit: this was from my EPO experience, not what's going on right now with ZOS's US-based (DFW?) colo.

    Edited by FireSoul on December 13, 2024 12:18AM
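Side note on FireSoul's earlier point 3 (everything POSTing at once and browning out the room): one common mitigation is to stagger chassis power-on from the out-of-band BMCs rather than letting the whole room come back at the same instant. A minimal sketch, with hypothetical BMC addresses and credentials; ipmitool itself is a real utility.

```python
# Sketch of staggered power-on via IPMI so a whole room doesn't POST at once.
# BMC hostnames and credentials are hypothetical placeholders.
import subprocess
import time

BMC_HOSTS = ["bmc-r01-u01", "bmc-r01-u02", "bmc-r01-u03"]
STAGGER_SECONDS = 15  # let each machine's fans/PSUs settle before starting the next

def power_on(bmc: str) -> None:
    subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", bmc,
         "-U", "admin", "-P", "changeme",   # placeholder credentials
         "chassis", "power", "on"],
        check=True,
    )

if __name__ == "__main__":
    for bmc in BMC_HOSTS:
        print(f"Powering on host behind {bmc}")
        power_on(bmc)
        time.sleep(STAGGER_SECONDS)
```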
  • SeaGtGruff
    SeaGtGruff
    ✭✭✭✭✭
    ✭✭✭✭✭
    Iriidius wrote: »
    ZOS should extend the timers for endeavours and login rewards by 24 hours and give tomorrow's login reward and seals of endeavour for free on top.

    It is unavoidable that players can't play when the servers are down, but it's not OK to still expect players to log in for their daily reward and endeavour seals while the servers are down.

    In Germany/Frankfurt, where the EU server is located, it was 6pm when the servers went down, and they will not be back up until after reset at 4am. Most Europeans are not playing before that time.

    Honestly, I don't think they should, as it would just set an expectation for something similar whenever there's an issue.

    Mind you, I didn't play on EU last night after reset, and didn't play on NA this morning after reset, so I'm probably going to lose at least a day of daily login rewards, seals from daily endeavors, rewards from Tales of Tribute, gold from crafting, and having fun as its own reward, on both servers.

    But I've been through local power outages, local internet outages, and game server downtimes before, and survived missing one or more days of game time.

    I just hope the people working to get things fixed ASAP know how much they're appreciated! :+1:
    I've fought mudcrabs more fearsome than me!
  • dk_dunkirk
    dk_dunkirk
    ✭✭✭✭✭
    FireSoul wrote: »
    <snip the EPO security guard story quoted above>

    Then he pressed the end-of-employment button.

    [CITATION NEEDED]
  • RedTalon
    RedTalon
    ✭✭✭✭✭
  • sarahthes
    sarahthes
    ✭✭✭✭✭
    ✭✭
    Kusandru wrote: »
    arena25 wrote: »
    Nightmare scenario over at Zeni right now.

    From ZoSKevin:
    Hi all, just providing an update. We are still hard at work getting systems back online. Based on what we know right now, we believe the Megaservers will most likely be offline longer than the original 12 hour estimation. We hope to provide more clarity on timeframe once we have a little more time to complete more work.

    Regarding the scope of work, this issue we ran into today was an edge-case emergency power outage at the data center that did not trigger standard backup failsafes for multiple tenants affected by the outage. (This type of outage is designed to cut ALL power in the event of a fire/flood scenario.) The outage now requires us to do a full reboot of our hardware while recovering from a full loss of power. Rebuilding piece by piece involves a methodical and lengthy process, including additional verification and testing as we bring the hardware online.

    Hopefully this provides some clarity on the work happening right now. Thanks again for the continued patience.

    The question for ZOS is going to be this... Was it worth cheaping out by not having a backup data center in case of system failure for ANY reason, and foregoing disaster recovery, considering how beotchy gamers can be about rage quitting over things even smaller than this, and with an already shrinking player base? Will the cost of losing all that money outweigh the cost/chance you've taken by foregoing DR? I'll confess to telling my husband about this today, and he can't wait to send a team over to offer DR services.

    It seems likely that the NA data center backs up to the EU data center and vice versa multiple times a day (or to some third-party site, but that would be more expensive). But you can't just flip a switch and make the game run there now; it doesn't work that way.
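To make the "off-site copies are not a warm failover site" distinction concrete, here is a generic sketch of the kind of scheduled dump-and-ship job being speculated about above. It is purely illustrative (made-up paths, hosts, and database name) and says nothing about ZOS's real setup; the point is that restoring such a dump elsewhere still means provisioning hardware, replaying data, and repointing services rather than flipping a switch.

```python
# Generic, illustrative cross-site backup job: dump, compress, ship off-site.
# All names below are placeholders.
import datetime
import subprocess

REMOTE = "backup@eu-vault.example.invalid:/backups/na/"   # hypothetical off-site target

def dump_and_ship(database: str) -> None:
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dump_path = f"/var/backups/{database}-{stamp}.sql.gz"
    # Dump locally and compress...
    with open(dump_path, "wb") as out:
        dump = subprocess.Popen(
            ["mysqldump", "--single-transaction", database], stdout=subprocess.PIPE
        )
        subprocess.run(["gzip", "-c"], stdin=dump.stdout, stdout=out, check=True)
        if dump.wait() != 0:
            raise RuntimeError(f"mysqldump failed for {database}")
    # ...then copy the archive to the other site. Having this copy is disaster
    # recovery in the "data isn't lost" sense, not a second site ready to serve.
    subprocess.run(["scp", dump_path, REMOTE], check=True)

if __name__ == "__main__":
    dump_and_ship("accounts")
```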
  • dk_dunkirk
    dk_dunkirk
    ✭✭✭✭✭
    sarahthes wrote: »
    <snip earlier quote chain>

    They posted that the system that kicks in to totally knock out power in case of fire or flood is what kicked in. I don't think there was an actual flood. More likely a problem with the building's emergency systems.

    Weird. My datacenter had no such system. I don't think we could have purposely cut all power and prevented the backup systems from taking over if we tried.

    Also, it still doesn't explain how ALL servers are impacted, in what we PRESUME are completely different locations and data centers.
  • barney2525
    barney2525
    ✭✭✭✭✭
    ✭✭✭
    Gingaroth wrote: »
    It's also the point at which even more dishes seem to suddenly manifest in the sink out of nowhere.

    How do you know so accurately what happens in my home? That's almost scary!

    oh, btw, you need more toilet paper.


    :#
  • arena25
    arena25
    ✭✭✭✭✭
    dk_dunkirk wrote: »

    Wait. Where's the comment that there was a flood? I've been reading and hadn't seen that.

    And why would a flood at a single facility take out BOTH the NA and EU "megaservers?" That doesn't make sense.

    And "my" data center was purposely built higher than the 100-year flood plain. :-D

    There wasn't a flood, but sometimes there can be false positives that trigger something.

    At the high school I went to, the fire alarm system was designed so that as soon as the building's sprinkler system comes on, it triggers the fire alarms/evacuation system. Unfortunately, one particular day my freshman year, a water pressure error caused the fire evacuation/alarm system to sound, even though there was no fire.

    By the same token, on another lovely day in the same year, when a custodian accidentally dropped a bunch of insulation material in a storage closet, the dust from that material got into the smoke detector and reacted the same way actual smoke would - it sounded, and the fire alarm/evacuation system activated.

    I could provide other examples and go on for a while, but you get the gist - just because there is no fire/flood doesn't mean the failsafes that trigger when a fire/flood happens won't accidentally activate - no system is foolproof/perfect.

    And if I told you uno times, I told you a thousand uno times - the power outage took out the NA megaserver and the login server servicing both the NA and EU megaservers. EU players could keep playing as long as they were already in-game and didn't log out (though I'm not sure anyone is still online by now).
    Edited by arena25 on December 12, 2024 11:45PM
    If you can't handle the heat...stay out of the kitchen!
  • sarahthes
    sarahthes
    ✭✭✭✭✭
    ✭✭
    dk_dunkirk wrote: »
    <snip earlier quote chain>

    Weird. My datacenter had no such system. I don't think we could have purposely cut all power and prevented the backup systems from taking over if we tried.

    Also, it still doesn't explain how ALL servers are impacted, in what we PRESUME are completely different locations and data centers.

    The login server is impacted, as is the one that handles account services and the website. So the EU server is working, but there's no way to get to it.

    I believe all commercial data centers have what is called the "big red button" that can completely cut power for the entire site. There are situations where "all stop" causes less damage than a controlled shutdown, though those are rare.
  • Jimbru
    Jimbru
    ✭✭✭✭
    Feels right...

  • Elldarian
    Elldarian
    I'm gonna go out on a limb and say that, over 6 hours into a complete shutdown, keeping the idiotic "All Systems Operational" message up and claiming all servers are fine is going to cause long-term trust erosion with players.

    [screenshot of the service alerts page reporting "All Systems Operational"]


    Baghdad Bob came out of retirement.

    Elldarian lil Duk-Tak CP 2175

    "Gentlemen, you can't fight in here, this is the War Room!"
  • dk_dunkirk
    dk_dunkirk
    ✭✭✭✭✭
    arena25 wrote: »
    <snip quoted fire-alarm examples above>

    And if I told you uno times, I told you a thousand uno times - the power outage took out the NA megaserver and the login server servicing both the NA and EU megaservers. EU players could keep playing as long as they were already in-game and didn't log out (though I'm not sure anyone is still online by now).

    Well that's a poorly designed data center. I guess our subscription money and Crown store purchases can only afford so much.

    And thank you for the explanation about the login server.
  • Reginald_leBlem
    Reginald_leBlem
    ✭✭✭✭✭
    I'm making chicken fried steak, rice, and roasted veggies with a honey mustard sauce to go with the rice.

    I'll make enough to share with whoever volunteers to do the dishes...
  • arena25
    arena25
    ✭✭✭✭✭
    Kusandru wrote: »

    <snip attempted troll misquote of OP's post>

    Kusandru, mind explaining why you are intentionally misquoting my and other people's posts?
    Edited by arena25 on December 12, 2024 11:57PM
    If you can't handle the heat...stay out of the kitchen!
  • LadyGP
    LadyGP
    ✭✭✭✭✭
    dk_dunkirk wrote: »
    <snip earlier quote chain>

    Weird. My datacenter had no such system. I don't think we could have purposely cut all power and prevented the backup systems from taking over if we tried.

    Also, it still doesn't explain how ALL servers are impacted, in what we PRESUME are completely different locations and data centers.


    My thought is they have redundancies and whatnot, but authentication for ALL servers (and account management) ran through NA, so a single point of failure was unintentionally created.
    Will the real LadyGP please stand up.
  • dk_dunkirk
    dk_dunkirk
    ✭✭✭✭✭
    dk_dunkirk wrote: »
    <snip earlier quotes>

    Well that's a poorly designed data center. I guess our subscription money and Crown store purchases can only afford so much.

    And thank you for the explanation about the login server.

    Scratch that, I guess ESO's much-vaunted TWO BILLION DOLLARS of revenue can only afford so much.
  • sarahthes
    sarahthes
    ✭✭✭✭✭
    ✭✭
    dk_dunkirk wrote: »
    <snip earlier quotes>

    Well that's a poorly designed data center. I guess our subscription money and Crown store purchases can only afford so much.

    And thank you for the explanation about the login server.

    It's actually a pretty standard design: when that system kicks in appropriately, it can save the business a lot of time and money. It's only a problem when it kicks in at the wrong time.

    I remember one time our fire system cut in when a contractor was heat sealing a linoleum repair or something along those lines in a particularly narrow hallway. Tripped the heat sensor.
  • arena25
    arena25
    ✭✭✭✭✭
    dk_dunkirk wrote: »
    <snip earlier quotes>

    Scratch that, I guess ESO's much-vaunted TWO BILLION DOLLARS of revenue can only afford so much.

    You're welcome for the explanation.

    And yeah, businesses try their best to cut costs/corners, and sometimes too much - classic example.
    If you can't handle the heat...stay out of the kitchen!
  • RMW
    RMW
    ✭✭✭✭
    ZOS: "So, who broke it?"

    how I imagine the conversation went

    https://www.youtube.com/watch?v=TUTAL9LDHRc
  • dk_dunkirk
    dk_dunkirk
    ✭✭✭✭✭
    sarahthes wrote: »
    <snip earlier quotes>

    It's actually a pretty standard design: when that system kicks in appropriately, it can save the business a lot of time and money. It's only a problem when it kicks in at the wrong time.

    I remember one time our fire system cut in when a contractor was heat sealing a linoleum repair or something along those lines in a particularly narrow hallway. Tripped the heat sensor.

    And should that have tripped the heat sensor? Sure. Should that have triggered the fire suppression system? No. Someone should have been aware enough to disable that automatic reaction while that work was going on. Part of my responsibilities in bringing up a data center was setting up a system of THOUSANDS of physical sensors and SNMP traps, monitoring and dashboarding everything that was going on, and writing a whole app from scratch for routing all the alerts to the right people at the right time. There are ways to prevent false positives.
    Edited by dk_dunkirk on December 13, 2024 12:06AM
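A minimal sketch of the false-positive guard dk_dunkirk is describing: alerts from sensors that fall inside a scheduled maintenance window get routed to a human instead of triggering the automatic response. Sensor names, windows, and the routing outcomes are all hypothetical.

```python
# Illustrative alert router: suppress automatic responses for sensors that are
# covered by a planned maintenance window; page a human instead.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class MaintenanceWindow:
    sensor_prefix: str   # e.g. every heat sensor in one hallway
    start: datetime
    end: datetime

    def covers(self, sensor: str, when: datetime) -> bool:
        return sensor.startswith(self.sensor_prefix) and self.start <= when <= self.end

# Hypothetical: hot work scheduled in hallway B2 for the next four hours.
WINDOWS = [
    MaintenanceWindow("hall-b2/heat", datetime.now(), datetime.now() + timedelta(hours=4)),
]

def route_alert(sensor: str, reading_c: float, when: datetime) -> str:
    if any(w.covers(sensor, when) for w in WINDOWS):
        return f"PAGE ON-CALL ONLY: {sensor} read {reading_c} C during planned work"
    return f"AUTOMATIC RESPONSE: {sensor} read {reading_c} C -- trigger suppression"

if __name__ == "__main__":
    now = datetime.now()
    print(route_alert("hall-b2/heat-07", 71.5, now))       # routed to a human
    print(route_alert("server-room/heat-01", 71.5, now))   # real automatic trigger
```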
  • Sakiri
    Sakiri
    ✭✭✭✭✭
    ✭✭
    I still can't believe people are bent out of shape over this.

    Stuff happens. It'll be fine.
  • ArchangelIsraphel
    ArchangelIsraphel
    ✭✭✭✭✭
    ✭✭✭✭
    Gingaroth wrote: »
    It's also the point at which even more dishes seem to suddenly manifest in the sink out of nowhere.

    How do you know so accurately what happens in my home? That's almost scary!

    Ah, it's the Universal Law of Dishwashing, I know it well! It was ordained in the distant past by some long forgotten, merciless Daedric Prince of Dishwashing that the one who does the dishes for the day enters into an endless loop of torment that will last until the end of eternity.

    Legend has it that the curse can only be broken by the legendary artifact: Paper Plates
    Legends never die
    They're written down in eternity
    But you'll never see the price it costs
    The scars collected all their lives
    When everything's lost, they pick up their hearts and avenge defeat
    Before it all starts, they suffer through harm just to touch a dream
    Oh, pick yourself up, 'cause
    Legends never die
  • OutLaw_Nynx
    OutLaw_Nynx
    ✭✭✭✭✭
    ✭✭
    Sakiri wrote: »
    I still can't believe people are bent out of shape over this.

    Stuff happens. It'll be fine.

    I’m honestly just kinda bored. I’m not stressing about the daily logins or endeavor points. Just wanted to do some dungeons tonight.
  • Carcamongus
    Carcamongus
    ✭✭✭✭✭
    ✭✭
    Well, I tried doing the reasonable thing and chose another form of entertainment. Watched Gladiator 2. It could have used more Pedro Pascal. Lots more. Naturally, I was hoping to come back to a miraculous recovery, but alas...

    Too bad about the game still being down. On the bright side, we're learning a bit about data centers!
    Imperial DK and Necro tank. PC/NA
    "Nothing is so bad that it can't get any worse." (Brazilian saying)
This discussion has been closed.