I don't understand how one power cut can take out both the NA and EU servers. Isn't the point of having different regional servers that they're in different physical locations, to improve the connection for people on that continent?
MasterSpatula wrote: »So I hear there's this stuff called "grass." I'm headed out to investigate right now.
Be careful, I've heard it can be quite dangerous stuff.
Red banners at night, sailors cower in fright.
Green banners at morning light, sailors delight.
LatentBuzzard wrote: »We now have much more robust systems, but until that catastrophic failure happened, we thought our previous configuration was fine.
That's why responsible companies run regular BCP tests, so that they don't have to wait for a catastrophic failure before they find out that they can't recover.
Exactly. That's why I have two DCs on a 100 Gbps trunk with vSphere HA and SAN Live Volumes as another layer of failover in case the power goes to hell.
Unfortunately, you can't always trust third parties once they're in the mix. I had to cold-start an entire DC last year because facilities conveniently left "check that the diesel genset actually has fuel in it" off their checklists. At least the other DC stayed alive.
CrushDepth wrote: »Well, this seems like a good time to watch RED ONE on Prime.
ZoS should extend the endeavour and login reward timers by 24 hours and give out tomorrow's login reward and seals of endeavour for free as well.
It's unavoidable that players can't play while the servers are down, but it's not OK to still expect them to log in for their daily reward and endeavour seals during the outage.
In Germany/Frankfurt, where the EU server is located, it was 6pm when the server went down, and it won't be back up until after the reset at 4am. Most Europeans aren't playing before that time.
Hi all, just providing an update. We are still hard at work getting systems back online. Based on what we know right now, we believe the Megaservers will most likely be offline longer than the original 12-hour estimate. We hope to provide more clarity on the timeframe once we have a little more time to complete more work.
Regarding the scope of work: the issue we ran into today was an edge-case emergency power outage at the data center that did not trigger the standard backup failsafes for the multiple tenants affected by the outage. (This type of outage is designed to cut ALL power in a fire/flood scenario.) It now requires us to do a full reboot of our hardware while recovering from a complete loss of power. Rebuilding piece by piece is a methodical and lengthy process, including additional verification and testing as we bring the hardware back online.
Hopefully this provides some clarity on the work happening right now. Thanks again for the continued patience.
galbreath34b14_ESO wrote: »I'm gonna go out on a limb and say that, over 6 hours into a complete shutdown, leaving the idiotic "All Systems Operational" message up claiming all servers are fine is going to cause long-term trust erosion with players.
Hi,
I'm a Linux Systems Engineer by profession, and I've been through a colo-wide Emergency Power Off event in my time.
Let me tell you, it's not as simple as just turning stuff back on...
- Our colocation center ITSELF was supposed to be our UPS. There was no UPS of our own. If the colo goes out, that's it.
- When power was cut off, it didn't take us long to figure out that the colo... disappeared. We basically clown-car'd over to the data center and were there for a long time. The power failure occurred in the early evening on a Friday, and we spent all night there. Five of us rushed over.
- When the power came back, ALL of the machines tried to POST and boot at the same time. I don't know if you've ever heard servers start up, but their fans scream and everything goes full power for a second. There was a brownout, and 2/3 of the hosts were stuck in POST, frozen. Someone had to go around with a crash cart/KVM to check each one's health and force a power cycle, one host at a time. There can be a LOT of hosts in a colo.
- Our disaster recovery plan never had a "cold start" procedure prepared, so we had to make one up on the fly. The switches just power on, but everything else needs an order: storage, database, and caching hosts first; tools and anything that talks to the storage hosts next (workhorse hosts, the website). Once that's up and healthy, the proxies come up last, opening the floodgates to services. (A rough sketch of that kind of ordering follows right after this list.)
- Many of the database hosts had corrupted tables that needed SQL table repair after boot (a sketch of that sweep is at the end of this post). I saw in another thread that there are indeed MySQL hosts involved, so they have my sympathy there. *1000-yard stare*
- Some hosts were DOA and wouldn't even power on. Sometimes it was a standby for a given role, so we just let it stay dead until we had time for a replacement. Others were primaries, and we had to force emergency failovers and make sure the old, dead primaries stayed dead and didn't come back to life to mess things up. That left some things a bit out of sync after revival.
Anyway, we worked all weekend. We had standby hosts to revive or replace and a lot of cleanup on damaged databases to prioritize.
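To make that cold-start ordering concrete, here's a minimal sketch (Python, purely illustrative; the tier names, hostnames, ports, and the simple TCP probe are assumptions for the example, not anyone's actual runbook). The idea is just: bring up one tier at a time, stagger the starts so you don't recreate the inrush/brownout problem, and don't touch the next tier until everything in the current one answers a health check.

```python
import socket
import time

# Illustrative tiers in dependency order; each entry is (hostname, service_port).
# A real environment would also drive IPMI/iLO to power hosts on in batches;
# here we just assume hosts are powered and wait for their service ports.
TIERS = [
    ("storage",  [("storage-01", 3260), ("storage-02", 3260)]),
    ("database", [("db-01", 3306), ("db-02", 3306)]),
    ("cache",    [("cache-01", 6379)]),
    ("app",      [("app-01", 8080), ("app-02", 8080)]),
    ("proxy",    [("proxy-01", 443)]),
]

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Cheap liveness probe: can we open a TCP connection to the service port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def wait_for_tier(name, hosts, retry_delay=15.0, max_wait=1800.0):
    """Block until every host in the tier answers, or give up after max_wait seconds."""
    deadline = time.monotonic() + max_wait
    pending = list(hosts)
    while pending and time.monotonic() < deadline:
        pending = [(h, p) for (h, p) in pending if not port_open(h, p)]
        if pending:
            print(f"[{name}] still waiting on: {pending}")
            time.sleep(retry_delay)
    if pending:
        raise RuntimeError(f"tier '{name}' never came healthy: {pending}")
    print(f"[{name}] tier healthy.")

if __name__ == "__main__":
    for tier_name, tier_hosts in TIERS:
        # Small stagger between tiers so every box isn't spinning fans at
        # full power at the same instant (the brownout problem above).
        time.sleep(5)
        wait_for_tier(tier_name, tier_hosts)
    print("All tiers up; safe to open the floodgates (proxies / front door).")
```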
When we walked in the office door on Monday, the office staff stood up and gave us a standing ovation.
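On the MySQL table repair point: the first pass after an unclean power loss is usually just sweeping every table for corruption before any application gets near the data. Here's a rough, hedged sketch of that sweep, driving the stock mysqlcheck client from Python. The host and credential values are placeholders, and repair only helps engines that support REPAIR TABLE (e.g. MyISAM); InnoDB relies on its own crash recovery at server startup.

```python
import subprocess

# Placeholder connection details; real values would come from a config/secrets store.
MYSQL_ARGS = ["--host=db-01", "--user=admin", "--password=REDACTED"]

def run_mysqlcheck(extra_args):
    """Invoke the stock mysqlcheck client and return its text output."""
    result = subprocess.run(
        ["mysqlcheck", *MYSQL_ARGS, "--all-databases", *extra_args],
        capture_output=True, text=True,
    )
    print(result.stdout)
    print(result.stderr)
    return result.stdout

def tables_needing_attention(output: str):
    """mysqlcheck prints one '<db>.<table>  <status>' line per table; this is a
    coarse filter that flags any line whose status doesn't end in 'OK'."""
    return [line for line in output.splitlines()
            if line.strip() and not line.rstrip().endswith("OK")]

if __name__ == "__main__":
    suspect = tables_needing_attention(run_mysqlcheck(["--check"]))
    if suspect:
        print("Suspect tables/lines:")
        print("\n".join(suspect))
        # --auto-repair re-checks and repairs what it can. This only helps
        # engines that support REPAIR TABLE (e.g. MyISAM); InnoDB handles
        # crash recovery itself when the server starts up.
        run_mysqlcheck(["--check", "--auto-repair"])
    else:
        print("All tables reported OK.")
```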
ArchangelIsraphel wrote: »It's also the point at which even more dishes seem to suddenly manifest in the sink out of nowhere.
dk_dunkirk wrote: »I'm a Linux Systems Engineer by profession, and I've been through a colo-wide Emergency Power Off event in my time. Let me tell you, it's not as simple as just turning stuff back on... [snip]
As someone who helped bring a data center online, I don't understand ANY of this. We had redundant EVERYTHING except the main power feeds (because of local zoning). Even redundant ISPs and physical drops. We tested our generators, our UPSes, and our cooling towers monthly. I don't get it. A colo facility that even HAS an "edge case" scenario like this is not one I'd trust a million-dollar-a-year business to.
Rkindaleft wrote: »This is what I ate for dinner