BelmontDrakul wrote: »
I do not dismiss their efforts, for sure, but we deserve an update, don't you think?
A bit more of a substantial update this time! We are making solid progress, though there is still a great deal of work to be done. We are anticipating work continuing through the night to get all Live realms tested and ready to be brought back online.
With that said, we are currently targeting opening PC NA and EU servers around 10am EST/3pm GMT on Friday, December 13 and will update on the status of console realms shortly after. The PTS will remain offline through the weekend with plans to bring it back online early next week.
We'll provide the latest status of everything in the morning (EST). The ongoing patience from everyone has been greatly appreciated.
thinkaboutit wrote: »This should have been resolved already.
Hi,
I'm a Linux Systems Engineer by profession, and I have been through a colo-wide Emergency Power Off event in my time.
Let me tell you, it's not as simple as just turning stuff back on...
- Our colocation center ITSELF was supposed to be our UPS. There's no UPS. If the colo goes out, that's it.
- When power was cut off, it didn't take us long to figure out that the colo... disappeared. We basically clown-car'd over to the datacenter and were there for a long time. The power failure had occurred in the early evening, on a Friday, and we spent all night there. Five of us staff members rushed over.
- When the power came back, ALL of the machines tried to POST and boot at the same time. I don't know if you've ever heard servers boot, but their fans scream and everything goes to full power for a sec. There was a brownout and 2/3 of the hosts were stuck in POST, frozen. Someone had to go around with a crash cart/KVM to check each host's health and force a power cycle, one host at a time. There can be a LOT of hosts in a colo.
- Our disaster recovery plan never had a 'cold start' procedure prepared, and we had to make one up on the fly. The switches will just power on, but everything else needs to come up in order: storage, database, and caching hosts first; tools and things that talk to storage hosts next (workhorse hosts, website); once that's up and healthy, proxies come up last, opening the floodgates to services. (There's a rough sketch of that ordering just after this list.)
- Many of the database hosts had corrupted tables that needed SQL table repair after boot (there's a small sketch of that further down). I saw in another thread that there are indeed MySQL hosts involved, so they have my sympathy there. *1000-yard stare*
- Some hosts were DOA and wouldn't even power on. Sometimes it was a standby of a given role, so we just let it stay dead till we had time for a replacement. Others were primaries, and we had to force emergency failovers and make sure the old dead primaries stayed dead and didn't just come back to life to mess things up. That led to some things being a bit out of sync after revival.
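To make that ordering concrete, here's a rough sketch of what our improvised cold-start runbook boiled down to. The host names, the IPMI addresses, and the "port 22 answers = healthy" check are all made up for illustration; a real runbook has credentials, retries, and alerting on top:

# Hypothetical sketch of the tiered cold-start order above. Host names, IPMI
# addresses, and the health check are invented for illustration only.
import socket
import subprocess
import time

TIERS = [
    ("storage/db/cache", ["db01", "db02", "cache01"]),
    ("workhorses/web",   ["app01", "app02", "web01"]),
    ("proxies",          ["proxy01", "proxy02"]),
]

def is_up(host, port=22, timeout=3):
    """Crude health check: does the host accept a TCP connection (e.g. SSH)?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for tier_name, hosts in TIERS:
    print(f"powering on tier: {tier_name}")
    for host in hosts:
        # Power on via out-of-band management; the ipmitool invocation is simplified here.
        subprocess.run(["ipmitool", "-H", f"{host}-ipmi", "chassis", "power", "on"])
    # Don't open the floodgates on the next tier until this one is healthy.
    while not all(is_up(h) for h in hosts):
        time.sleep(30)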
Anyway, we worked all weekend. We had standby hosts to revive or replace and a lot of cleanup to do on damaged databases, which we had to prioritize.
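And for the database cleanup, most of the first pass was just finding out which tables were damaged before deciding what to repair and what to restore. A minimal sketch of that triage, assuming MySQL and the stock mysqlcheck client (credentials and the real repair policy omitted):

# Minimal sketch: flag corrupted tables first, then repair the flagged ones.
# Real recovery may mean restoring from backup instead of repairing in place.
import subprocess

# --check walks every table and reports problems without changing anything.
report = subprocess.run(
    ["mysqlcheck", "--all-databases", "--check"],
    capture_output=True, text=True,
).stdout

damaged = []
for line in report.splitlines():
    parts = line.split()
    # Result lines look roughly like "dbname.tablename   OK"; anything not OK gets flagged.
    if len(parts) >= 2 and "." in parts[0] and parts[-1] != "OK":
        damaged.append(parts[0])

for table in damaged:
    db, _, tbl = table.partition(".")
    # --repair only helps some storage engines (e.g. MyISAM); InnoDB needs other tools.
    subprocess.run(["mysqlcheck", "--repair", db, tbl])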
When we walked in the office door on Monday, the office staff stood up and gave us a standing ovation.
BelmontDrakul wrote: »This one feels annoyed. Approximately 24 hours have passed since the incident started and there is no update, nothing.
[snip]
[snip]
SerafinaWaterstar wrote: »Jesus, 24 hours without gaming - real first world problems. Please take a breath & relax.
Look, I do not know whose fault it is, but it seems something is wrong and at least one person should be held responsible. I can see the ZOS crew has built a special bond with gamers, which is a good thing, but wrong is wrong, right is right. This is a professional business.
TwiceBornStar wrote: »I don't think ZOS can be held responsible for any mishaps in any datacenter, and I'm sure everyone is working as fast as they can.
Even if it is not ZOS' fault, then it is the fault of the company whose servers ZOS rents (if rental is the case). And I have never, ever claimed I was an expert.
smallhammer wrote: »Wow, there are some people on here who seem to be experts in everything that has to do with what has happened.
If only you guys were working for ZOS? Eh?
Never any downtime, and log-in servers would be up 10 hours ago? Right?
The uptime on ESO has been very good over the years. What has happened can be read about here: https://forums.elderscrollsonline.com/en/discussion/670288/eso-na-eu-megaservers-offline-dec-12
Yes, we can all be annoyed, but in the end this is of course not ZOS' fault. Cool down. Have a beer or something. It's Friday, after all.
BelmontDrakul wrote: »Look, I do not know whose fault it is, but it seems something is wrong and at least one person should be held responsible.
[snip]
SerafinaWaterstar wrote: »
There have been updates.
And it's more than just 'the power shut off'.
On console, if you don't turn it off properly, such as in a power cut, then the next time you switch it on you have to let the console rebuild/reboot itself to prevent damage & loss of data. (My knowledge of PCs is limited, but I presume they don't like being shut down improperly either.)
SerafinaWaterstar wrote: »Jesus, 24 hours without gaming - real first world problems. Please take a breath & relax.
BelmontDrakul wrote: »The thing I cannot understand is whether there is no backup server besides the one in Texas. Trying to operate an MMORPG without a backup server is a huge risk. AFAIK even some private server operators have backup servers. I don't get it.
Having no backup server is the main fault here.
TwiceBornStar wrote: »Is it your fault if your processor decides to stop working tomorrow? Uh-uh. I don't think so!
BelmontDrakul wrote: »Look, I do not know whose fault it is, but it seems something is wrong and at least one person should be held responsible.
I was talking about a backup server (which holds copies of our data), not backup power (which powers the system when there is a shortage or malfunction).
Pretty sure they did have backup power; the problem is that the fire/flood alarm system was designed to cut that backup power too if said fire/flood alarms were activated - which it sounds like they were (even though there was no fire/flood).
BelmontDrakul wrote: »The thing I cannot understand is whether there is no backup server besides the one in Texas. Trying to operate an MMORPG without a backup server is a huge risk. AFAIK even some private server operators have backup servers. I don't get it.
Jessica has already come out and said that the Crown Store being brought offline and the datacenter power failure are not related - humans do like to look for patterns, but the two have nothing to do with one another in this case - I'll see if I can find the post.
BelmontDrakul wrote: »Even if it is not ZOS' fault, then it is the fault of the company whose servers ZOS rents (if rental is the case). And I have never, ever claimed I was an expert.
smallhammer wrote: »Yes, we can all be annoyed, but in the end this is of course not ZOS' fault. Cool down. Have a beer or something. It's Friday, after all.
I got my toes in the water, my rear in the sand, not a worry in the world, a cold beer in my hand - life is good today, life is good today...
BelmontDrakul wrote: »This one feels annoyed. Approximately 24 hours have passed since the incident started and there is no update, nothing.
[snip]
That's an assumption.
It is. If they do not want me to make assumptions, they should give me more info; not doing so is also a fault, to me. Never let your customers or shareholders stay uninformed for too long. It may backfire very hard.
Providing an additional update here: https://forums.elderscrollsonline.com/en/discussion/comment/8234999/#Comment_8234999
Alinhbo_Tyaka wrote: »BelmontDrakul wrote: »This one feels annoyed. Approximately 24 hours have passed since the incident started and there is no update, nothing.
[snip]
Every machine room I've ever worked in, and there have been many, has an Emergency Power Off (EPO) switch located someplace where it can be activated in the event of an emergency. I've seen them get tripped by people when they should not have. One example happened to me early in my career. Early one morning I was updating some diagnostic software on a customer's system. A systems engineer from my office stopped by to see how things were going and to get a cup of coffee. Believing I had full control of the system, he decided it would be funny to pull the CPU EPO switch when I wasn't looking, and he took down the live system. Needless to say, he was told to never come back. I'm not saying this is what happened here; I just want to point out that it could be something as simple as someone "turning out the lights," so to speak.
I can only base this on my experience in the mainframe business system arena, but an unplanned outage can take many hours to recover from, as it involves more than turning the machines back on. With the state of the system being unknown, databases need to be verified against transaction logs and any errors fixed before restarting. Transaction management system logs need to be reviewed and partial or failed transactions removed before restarting. Jobs that rely upon checkpoints for restart need to be reviewed so they are restarted at the correct job step. All of this relies upon some type of automation, but even then it takes time to go through logs or diagnose a database, and all of it requires IT personnel to decide what needs to be done.
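To give a flavour of what "verified against transaction logs" means, here is a toy sketch of the idea, not any vendor's actual recovery code: after a crash you walk the log, replay only the work that reached COMMIT, and treat everything still in flight as failed. The (txid, op, payload) record format and the datastore.apply() call are invented for illustration.

# Toy crash-recovery pass: redo committed transactions, drop incomplete ones.
# The (txid, op, payload) record format and datastore.apply() are invented here.
def recover(log_records, datastore):
    pending = {}        # txid -> list of writes seen so far
    committed = set()   # txids whose COMMIT record made it to the log

    for txid, op, payload in log_records:
        if op == "BEGIN":
            pending[txid] = []
        elif op == "WRITE":
            pending.setdefault(txid, []).append(payload)
        elif op == "COMMIT":
            committed.add(txid)

    # Only fully committed work is replayed; everything else is treated as failed
    # and removed, which is the manual review step described above.
    for txid in committed:
        for payload in pending.get(txid, []):
            datastore.apply(payload)

    return committed, set(pending) - committed   # (kept, rolled back)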
Ok, explain it to me like I am ****
You've never had fun until you've been in a Sev-0.
Alinhbo_Tyaka wrote: »
BelmontDrakul wrote: »This one feels annoyed. Approximately 24 hours have passed since the incident started and there is no update, nothing.
[snip]
Sev-1 is "the sky is falling" --- Sev-0 is when it actually fell.
redlink1979 wrote: »
Pretty sure they did have backup power; the problem is that the fire/flood alarm system was designed to cut that backup power too if said fire/flood alarms were activated - which it sounds like they were (even though there was no fire/flood).
So, internet, please forgive me in advance, as this is not my field of expertise. But does it not seem like a multi-billion-dollar datacenter should have software at least as sophisticated as the stuff that comes with an $80 PC UPS and, in such a case, instead of just instantly killing the backup power, simply tell the servers to do a "graceful shutdown"?
On my APC backup UPS, I have the option to install free monitoring software that, if the power is down and the battery is about to die, can tell Windows to shut down the PC, thus avoiding a dangerous hard shutdown.
I just have a hard time imagining that a major corporation would run a datacenter without such seemingly basic functionality. I mean, it kind of defeats the whole point of HAVING a backup power system if it isn't able to do a simple safe shutdown.
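For what it's worth, that kind of monitoring software boils down to something conceptually simple: poll the UPS and, once it is on battery and nearly drained, ask the OS to shut down cleanly. A rough sketch assuming apcupsd's apcaccess tool is installed; the field names and thresholds are just typical examples, and a Windows box would call its own shutdown command instead:

# Rough sketch: poll the UPS, and if we're on battery and nearly drained,
# ask the OS for a clean shutdown instead of letting the plug get pulled.
# Assumes apcupsd's `apcaccess status`; field names/thresholds are typical examples.
import subprocess
import time

def ups_state():
    out = subprocess.run(["apcaccess", "status"], capture_output=True, text=True).stdout
    fields = {}
    for line in out.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    status = fields.get("STATUS", "")
    charge = float(fields.get("BCHARGE", "100").split()[0])   # e.g. "42.0 Percent"
    return status, charge

while True:
    status, charge = ups_state()
    if "ONBATT" in status and charge < 10.0:
        # Give the box one minute to flush and stop services, then halt.
        # (A Windows host would use its own shutdown command, e.g. `shutdown /s /t 60`.)
        subprocess.run(["shutdown", "-h", "+1"])
        break
    time.sleep(30)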
KyleTheYounger wrote: »Providing an additional update here: https://forums.elderscrollsonline.com/en/discussion/comment/8234999/#Comment_8234999
So would you also agree it's now time to retire this archaic 2014 hardware and revolutionize the MMORPG realm with ESO II?
But does it not seem like a multi-billion-dollar datacenter should have software at least as sophisticated as the stuff that comes with an $80 PC UPS and, instead of just instantly killing the backup power, simply tell the servers to do a "graceful shutdown"?
In the case of fire and/or flood there *isn't* time to do a graceful shutdown. It goes off NOW to prevent data loss among other things, even though it doesn't always work.
In the case of fire, spraying ANYTHING onto a burning server with the power on is risking electrocution, particularly water which they'd likely be using because that's what they use on building fires.
Even though I know nothing, the only logical solution seems to be cutting the power and trying to move the air outside to create a vacuum environment that extinguishes the fire. Is this how it works in server rooms?
I have NEVER heard of a server room using water as a fire suppression system. That just seems insane lol! xD
In the case of fire and/or flood there *isn't* time to do a graceful shutdown. It goes off NOW to prevent data loss among other things, even though it doesn't always work.
Like I said, not really my field. Although, how long would it take from issuing a shutdown command to the machine actually being off? Also, even at the relatively small corporations where I have managed servers, the standard go-to was always Halon. Halon suppression systems became widely popular precisely because Halon is a low-toxicity, chemically stable compound that does not damage sensitive equipment, documents, and valuable assets.