I don't understand how one power cut can take out both the NA and EU servers. Isn't the point of having different regional servers that they're in different physical locations, to improve the connection for people on that continent?
MasterSpatula wrote: »So I hear there's this stuff called "grass." I'm headed out to investigate right now.
Be careful, I've heard it can be quite dangerous stuff.
Red banners at night, sailors cower in fright.
Green banners at morning light, sailors delight.
LatentBuzzard wrote: »We now have much more robust systems, but until that catastrophic failure happened, we thought our previous configuration was fine.
That's why responsible companies run regular BCP tests, so that they don't have to wait for a catastrophic failure before they find out that they can't recover.
Exactly. That's why I have two DCs on a 100 Gbps trunk with vSphere HA and SAN Live Volumes as another layer of failover in case the power goes to hell.
Unfortunately, you can't always trust third parties once they're in the mix. I had to cold-start an entire DC last year because facilities conveniently left "check that the diesel genset actually has fuel in it" off their checklists. At least the other DC stayed alive.
CrushDepth wrote: »Well, this seems like a good time to watch RED ONE on Prime.
ZoS should extend the endeavour and login reward timers by 24 hours and give out tomorrow's login reward and seals of endeavour for free as well.
It's unavoidable that players can't play while the servers are down, but it's not OK to still expect them to log in for their daily reward and endeavour seals during the outage.
In Germany/Frankfurt, where the EU server is located, it was 6pm when the server went down, and it won't be back up until after the reset at 4am. Most Europeans aren't playing before that time.
Hi all, just providing an update. We are still hard at work getting systems back online. Based on what we know right now, we believe the Megaservers will most likely be offline longer than the original 12-hour estimate. We hope to provide more clarity on the timeframe once we have a little more time to complete more work.
Regarding the scope of work: the issue we ran into today was an edge-case emergency power outage at the data center that did not trigger the standard backup failsafes for the multiple tenants affected by the outage. (This type of outage is designed to cut ALL power in a fire/flood scenario.) It now requires us to do a full reboot of our hardware while recovering from a complete loss of power. Rebuilding piece by piece is a methodical and lengthy process, including additional verification and testing as we bring the hardware back online.
Hopefully this provides some clarity on the work happening right now. Thanks again for the continued patience.
galbreath34b14_ESO wrote: »I'm gonna go out on a limb and say that, over 6 hours into a complete shutdown, leaving the idiotic "All Systems Operational" message up claiming all servers are fine is going to cause long-term trust erosion with players.
Hi,
I'm a Linux Systems Engineer by profession, and I've been through a colo-wide Emergency Power Off event in my time.
Let me tell you, it's not as simple as just turning stuff back on...
- Our colocation center ITSELF was supposed to be our UPS. There was no UPS of our own. If the colo goes out, that's it.
- When power was cut off, it didn't take us long to figure out that the colo... disappeared. We basically clown-car'd over to the data center and were there for a long time. The power failure occurred in the early evening on a Friday, and we spent all night there. Five of us rushed over.
- When the power came back, ALL of the machines tried to POST and boot at the same time. I don't know if you've ever heard servers start up, but their fans scream and everything goes full power for a second. There was a brownout, and 2/3 of the hosts were stuck in POST, frozen. Someone had to go around with a crash cart/KVM to check each one's health and force a power cycle, one host at a time. There can be a LOT of hosts in a colo.
- Our disaster recovery plan never had a "cold start" procedure prepared, so we had to make one up on the fly. The switches just power on, but everything else needs an order: storage, database, and caching hosts first; tools and anything that talks to the storage hosts next (workhorse hosts, the website). Once that's up and healthy, the proxies come up last, opening the floodgates to services. (A rough sketch of that kind of ordering follows right after this list.)
- Many of the database hosts had corrupted tables that needed SQL table repair after boot (a sketch of that sweep is at the end of this post). I saw in another thread that there are indeed MySQL hosts involved, so they have my sympathy there. *1000-yard stare*
- Some hosts were DOA and wouldn't even power on. Sometimes it was a standby for a given role, so we just let it stay dead until we had time for a replacement. Others were primaries, and we had to force emergency failovers and make sure the old, dead primaries stayed dead and didn't come back to life to mess things up. That left some things a bit out of sync after revival.
Anyway, we worked all weekend. We had standby hosts to revive or replace and a lot of cleanup on damaged databases to prioritize.
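To make that cold-start ordering concrete, here's a minimal sketch (Python, purely illustrative; the tier names, hostnames, ports, and the simple TCP probe are assumptions for the example, not anyone's actual runbook). The idea is just: bring up one tier at a time, stagger the starts so you don't recreate the inrush/brownout problem, and don't touch the next tier until everything in the current one answers a health check.

```python
import socket
import time

# Illustrative tiers in dependency order; each entry is (hostname, service_port).
# A real environment would also drive IPMI/iLO to power hosts on in batches;
# here we just assume hosts are powered and wait for their service ports.
TIERS = [
    ("storage",  [("storage-01", 3260), ("storage-02", 3260)]),
    ("database", [("db-01", 3306), ("db-02", 3306)]),
    ("cache",    [("cache-01", 6379)]),
    ("app",      [("app-01", 8080), ("app-02", 8080)]),
    ("proxy",    [("proxy-01", 443)]),
]

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Cheap liveness probe: can we open a TCP connection to the service port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def wait_for_tier(name, hosts, retry_delay=15.0, max_wait=1800.0):
    """Block until every host in the tier answers, or give up after max_wait seconds."""
    deadline = time.monotonic() + max_wait
    pending = list(hosts)
    while pending and time.monotonic() < deadline:
        pending = [(h, p) for (h, p) in pending if not port_open(h, p)]
        if pending:
            print(f"[{name}] still waiting on: {pending}")
            time.sleep(retry_delay)
    if pending:
        raise RuntimeError(f"tier '{name}' never came healthy: {pending}")
    print(f"[{name}] tier healthy.")

if __name__ == "__main__":
    for tier_name, tier_hosts in TIERS:
        # Small stagger between tiers so every box isn't spinning fans at
        # full power at the same instant (the brownout problem above).
        time.sleep(5)
        wait_for_tier(tier_name, tier_hosts)
    print("All tiers up; safe to open the floodgates (proxies / front door).")
```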
When we walked in the office door on Monday, the office staff stood up and gave us a standing ovation.
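On the MySQL table repair point: the first pass after an unclean power loss is usually just sweeping every table for corruption before any application gets near the data. Here's a rough, hedged sketch of that sweep, driving the stock mysqlcheck client from Python. The host and credential values are placeholders, and repair only helps engines that support REPAIR TABLE (e.g. MyISAM); InnoDB relies on its own crash recovery at server startup.

```python
import subprocess

# Placeholder connection details; real values would come from a config/secrets store.
MYSQL_ARGS = ["--host=db-01", "--user=admin", "--password=REDACTED"]

def run_mysqlcheck(extra_args):
    """Invoke the stock mysqlcheck client and return its text output."""
    result = subprocess.run(
        ["mysqlcheck", *MYSQL_ARGS, "--all-databases", *extra_args],
        capture_output=True, text=True,
    )
    print(result.stdout)
    print(result.stderr)
    return result.stdout

def tables_needing_attention(output: str):
    """mysqlcheck prints one '<db>.<table>  <status>' line per table; this is a
    coarse filter that flags any line whose status doesn't end in 'OK'."""
    return [line for line in output.splitlines()
            if line.strip() and not line.rstrip().endswith("OK")]

if __name__ == "__main__":
    suspect = tables_needing_attention(run_mysqlcheck(["--check"]))
    if suspect:
        print("Suspect tables/lines:")
        print("\n".join(suspect))
        # --auto-repair re-checks and repairs what it can. This only helps
        # engines that support REPAIR TABLE (e.g. MyISAM); InnoDB handles
        # crash recovery itself when the server starts up.
        run_mysqlcheck(["--check", "--auto-repair"])
    else:
        print("All tables reported OK.")
```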
ArchangelIsraphel wrote: »It's also the point at which even more dishes seem to suddenly manifest in the sink out of nowhere.
dk_dunkirk wrote: »I'm a Linux Systems Engineer by profession, and I've been through a colo-wide Emergency Power Off event in my time. Let me tell you, it's not as simple as just turning stuff back on... [snip]
As someone who helped bring a data center online, I don't understand ANY of this. We had redundant EVERYTHING except the main power feeds (because of local zoning). Even redundant ISPs and physical drops. We tested our generators, our UPSes, and our cooling towers monthly. I don't get it. A colo facility that even HAS an "edge case" scenario like this is not one I'd trust a million-dollar-a-year business to.
Rkindaleft wrote: »This is what I ate for dinner