rip the server

Ardaghion · December 14

Morvan wrote: »

Yeah, seems to be the same thing that happened last week.

It works as normal for a few minutes, then gets completely unresponsive, back and forth.

I'm in Wyoming and connect through Mountain West. My pings aren't that bad, I tried a tracert and my path hits a bunch of networks owned by he.net, looks like some backbone networks. A bunch of those servers are dropping packets or not responding about half the time. That includes the 198.20.200.1 server.

I've tried ICMP packets, which do get rejected by many servers, but my networking tools also include tracert with TCP or UDP packets.

Edit: I take that back about pings, I tried again and every ping failed. It then comes back and shows no drops.
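For anyone who wants to check the "fine for a few minutes, then dead" pattern without relying on ICMP, here is a minimal sketch of a TCP connect probe. It is an illustration only: the host and port in the usage comment are placeholders, not the game's real endpoints, and the names `tcp_probe`, `summarize` and `watch` are made up for this example.

```python
import socket
import time
from typing import List, Optional, Tuple


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> Optional[float]:
    """Return the TCP connect time in milliseconds, or None if it failed."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Covers timeouts, refused connections and unreachable networks.
        return None


def summarize(samples: List[Optional[float]]) -> Tuple[float, Optional[float]]:
    """Return (failure percentage, mean latency in ms of successful probes)."""
    if not samples:
        raise ValueError("no samples to summarize")
    ok = [s for s in samples if s is not None]
    loss = 100.0 * (len(samples) - len(ok)) / len(samples)
    mean = sum(ok) / len(ok) if ok else None
    return loss, mean


def watch(host: str, port: int, count: int = 30,
          interval: float = 1.0) -> Tuple[float, Optional[float]]:
    """Probe `count` times, `interval` seconds apart, and summarize."""
    samples = []
    for _ in range(count):
        samples.append(tcp_probe(host, port))
        time.sleep(interval)
    return summarize(samples)


# Hypothetical usage (placeholder host and port):
#   loss, mean_ms = watch("example.com", 443, count=60)
```

A refused connection is counted as a failure here even though it shows the host is reachable, so this only makes sense against a port that normally accepts connections.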

DewiMorgan · December 14

Morvan wrote: »

Yeah, seems to be the same thing that happened last week.

Oof. I've worked for an MMO, and for an ISP, and either way, router problems were always the worst, because the ones with problems were rarely ours: Especially the Friday-night outages were always a hop or two from our data center, so we had to get in touch with their engineers, get through to someone who would escalate it, and then wait on them for a fix. And of course their IP contact records' contact info is someone who has no clue what the router even IS, and can't put us through to anyone who does.

But of course it would look like OUR problem as far as the users were concerned, rather than them blaming some misconfigured router owned by our upstream ISP.

At least when you're an ISP, and your network peers' routers mess up, you have some kinda business relationship and uptime agreements with those peers. A game doesn't get any of that kinda leverage.

Looking at tracert, it looks like a router 4.71.220.10, four hops before 198.20.200.1, might be flapping? But I can't really tell; tracert is so flaky nowadays. I miss the days when routers didn't deliberately eat ICMP, it made debugging this stuff so much easier.

(hmm.. why's my ISP routing me all the way up to Dallas, if ESO is hosted here in Austin?)

[...]
  6    18 ms    13 ms    13 ms   ae51.edge1.Dallas1.Level3.net [67.72.0.33]
  7     *        *        *     Request timed out.  <- maybe fine, might just eat ICMP?
  8    27 ms    24 ms     *     4.71.220.10 <- maybe flapping?
  9    23 ms     *        *     198.20.200.1
 10     *        *        *     Request timed out.
 11     *        *        *     Request timed out.
 12    31 ms     *       14 ms  198.20.200.1

[...]
  6    28 ms    22 ms    18 ms  ae51.edge1.Dallas1.Level3.net [67.72.0.33]
  7     *        *        *     Request timed out.
  8     *        *       33 ms  4.71.220.10
  9     *        *        *     Request timed out.
 10     *        *        *     Request timed out.
 11     *        *        13 ms  ZENIMAX-MED.ear1.Dallas1.Level3.net [4.71.220.10]
 12    23 ms     *       16 ms  198.20.200.1

In the second tracert I see 4.71.220.10 in lines #8 and #11, which... yeah. But the second time it resolves to "ZENIMAX-MED.ear1.Dallas1.Level3.net" - which means it's one side or the other of a connection between Zenimax and Level3, and is either in Dallas, or is on a connection TO Dallas (unfortunately routers are often named after where they are, but also often after where they connect to).

So it could be a Zenimax router, or a Level3 one. If the latter, it'll be a non-fun Friday night for engineers in both companies.

But this is just speculation. I don't work there or know anything. Could just be someone turning the servers on and off like a scene from Airplane, for all I know.
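The eyeballing done on the traces above can be roughly automated. The sketch below assumes Windows-style tracert output with three probes per hop; it counts timed-out probes per hop and flags any address that answers at more than one hop number. `parse_tracert` and `repeated_ips` are hypothetical helper names, and the sample is the second trace from this post.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Hop number, three probe columns ("27 ms", "<1 ms" or "*"), then the rest.
HOP_RE = re.compile(r"^\s*(\d+)\s+((?:(?:<?\d+\s+ms|\*)\s+){3})(.*)$")
PROBE_RE = re.compile(r"<?\d+\s+ms|\*")
IP_RE = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3})")

Hop = Tuple[int, int, Optional[str]]  # (hop number, lost probes, IP or None)


def parse_tracert(text: str) -> List[Hop]:
    """Extract (hop, lost_probes, ip) from each hop line of tracert output."""
    hops = []
    for line in text.splitlines():
        m = HOP_RE.match(line)
        if not m:
            continue  # skip headers, "[...]" markers and blank lines
        lost = PROBE_RE.findall(m.group(2)).count("*")
        ip_match = IP_RE.search(m.group(3))
        hops.append((int(m.group(1)), lost,
                     ip_match.group(1) if ip_match else None))
    return hops


def repeated_ips(hops: List[Hop]) -> Dict[str, List[int]]:
    """Map each IP seen at more than one hop number to those hop numbers."""
    seen = defaultdict(list)
    for hop, _lost, ip in hops:
        if ip is not None:
            seen[ip].append(hop)
    return {ip: nums for ip, nums in seen.items() if len(nums) > 1}


SAMPLE = """\
  6    28 ms    22 ms    18 ms  ae51.edge1.Dallas1.Level3.net [67.72.0.33]
  7     *        *        *     Request timed out.
  8     *        *       33 ms  4.71.220.10
  9     *        *        *     Request timed out.
 10     *        *        *     Request timed out.
 11     *        *        13 ms  ZENIMAX-MED.ear1.Dallas1.Level3.net [4.71.220.10]
 12    23 ms     *       16 ms  198.20.200.1
"""

if __name__ == "__main__":
    for hop, lost, ip in parse_tracert(SAMPLE):
        print(f"hop {hop:2d}: {lost}/3 probes lost, ip={ip}")
    print("seen at multiple hops:", repeated_ips(parse_tracert(SAMPLE)))
```

Lost probes at an intermediate hop do not prove that hop is dropping forwarded traffic, since many routers rate-limit or ignore these probes. The repeated-address check is the more interesting signal here.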

barney2525 · December 14

TheDuke wrote: »

What is it THIS time

could be the server is a bit miffed about all the abuse that was said yesterday, and just wants to vent a bit.

SeaGtGruff · December 14

trittnerxx wrote: »

all that matters is they were able to focus on the cooking stream instead of the servers that have been on fire

I'm pretty sure Gina and Jessica don't work on the servers in Texas (or wherever the data center is that had the emergency power outage the other day), and that the folks who work on the servers in Texas don't have any connection to the holiday cooking stream.

Ella_Mental · December 14

https://forums.elderscrollsonline.com/en/discussion/comment/8237530/#Comment_8237530

"wanted to confirm the team is investigating these issues which appear to be affecting all NA servers"
"we have alerts that occur for issues like this along with on-call teams that handle the investigations/resolutions. These folks had already been investigating for a bit"

blktauna · December 14

DewiMorgan wrote: »

baconaura wrote: »

no oncalls. been going on for an hour, and still no acknowledgement there is an issue. everyone is just going to give up and call it a night because the game is unplayable.

Engineers do not respond to customer service calls, especially when things are on fire. They focus on diagnosing and resolving the problem.

For any large enterprise, there are essentially always engineers on call. I can essentially guarantee that there are some engineers right now with very unhappy faces, because they are not going to have a fun Friday night.

Tech support is generally a different dept, and in this case maybe even a different org (I suspect ZOS user support is done by MS nowadays), and does not typically have any on-call staff.

trittnerxx wrote: »

all that matters is they were able to focus on the cooking stream instead of the servers that have been on fire

Engineers also don't do cooking streams (but I'm guessing this was just a joke).

Been there and I feel for the Engineers. This is not how they want to be spending a Friday night. I wish them well.

ZOS_Kevin · December 14

Hi All, just wanted to note that we have a team investigating issues right now. But given the time right now, figuring things out will take some time. If we have an update, we will follow up.

TX12001rwb17_ESO · December 14

I do not envy the ZOS employees who have to deal with this one bit. Sure, I could see being an ESO developer as fun, but not when you have to deal with things like this when you would normally be asleep. Yesterday there was a power outage, and now this.

There is a very strong chance it is a hardware issue: something is broken and needs replacing. I hate to say this, but I think ZOS should keep the servers down for a few days and work on it, then compensate everyone with a big Christmas present like 15 free crates or something.

hamgatan · December 14

TX12001rwb17_ESO wrote: »

There is a very strong chance it is a hardware issue: something is broken and needs replacing. I hate to say this, but I think ZOS should keep the servers down for a few days and work on it, then compensate everyone with a big Christmas present like 15 free crates or something.

even so, any properly set up environment has flags for that. id be surprised if there were not alarms in place.

i mean heck i have iDRAC reporting from dozens of Dell hosts/SANs etc reporting the second anything goes *** up along with SNMP trap capture and Nagios NRPE flagging.. why wouldnt there be similar at ZOS's end?

if something breaks.. move the service, throw the host in maintenance mode.
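As a rough illustration of the kind of alarm described above, here is a minimal sketch of a check in the Nagios plugin style, where the exit code carries the state (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The thresholds and the `check_loss` name are assumptions made for the example, not anything known about ZOS's monitoring.

```python
import sys
from typing import Optional, Tuple

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_loss(loss_pct: Optional[float], warn: float = 5.0,
               crit: float = 20.0) -> Tuple[int, str]:
    """Classify a packet-loss percentage against warn/crit thresholds."""
    if loss_pct is None or not 0.0 <= loss_pct <= 100.0:
        return UNKNOWN, "LOSS UNKNOWN - no usable measurement"
    if loss_pct >= crit:
        state = CRITICAL
    elif loss_pct >= warn:
        state = WARNING
    else:
        state = OK
    # Text after the "|" is performance data: label=value;warn;crit
    message = (f"LOSS {LABELS[state]} - packet loss {loss_pct:.1f}% "
               f"| loss={loss_pct:.1f}%;{warn:g};{crit:g}")
    return state, message


if __name__ == "__main__":
    # Hypothetical usage: pass a measured loss percentage on the command line.
    value = float(sys.argv[1]) if len(sys.argv) > 1 else None
    code, text = check_loss(value)
    print(text)
    sys.exit(code)
```

The monitoring server only looks at the exit code and the first line of output, so the measurement itself can come from whatever probe is already in place.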

DewiMorgan · December 14

Possible suggestion of this being a DDoS, which would certainly explain the router flapping so badly.

Not sure how true this is, but hey, rumors are fun. I've verified what I can.

FFXIV is also down, and they're calling it a DDoS. They've been reporting DDoS issues for a few days now. So their outages kinda line up with ours.

Rumor has it they're hosted in the same data center as Zenimax' game servers - but I have not been able to verify this, and I suspect it is not true. What little evidence I can find online suggests the FFXIV NA datacenter is in Sacramento, CA, while Zenimax/ESO's is in Austin, TX (or maybe Dallas?)

Various other websites I know are also being DDoSed - 'tis the season, I guess?

This may all be a response to the recent multinational "Operation PowerOFF" that shut down 27 DDoS sites a few days ago, to try and avert the usual spate of Xmas DDoSes (FBI page; Europol page).

Maybe the DDoSers want to be like "haha, you can't stop us" or something. Or maybe that PowerOFF hurt them bad, so now they're scrambling hard to get blackmail moneys to recover all they lost?

Either way, grrr. Jerks.

Some internet weather sites are showing increased outages. Others showing none.
internet weather map - mostly OK.
Thousand Eyes - lots of outages
Internet Health Report - lots of alarms

Ella_Mental · December 14

Thanks, Dewi, for the links to those "Internet Weather" sites! I didn't even think to check to find pages like that.

baconaura · December 14

DewiMorgan wrote: »

baconaura wrote: »

no oncalls. been going on for an hour, and still no acknowledgement there is an issue. everyone is just going to give up and call it a night because the game is unplayable.

Engineers do not respond to customer service calls, especially when things are on fire. They focus on diagnosing and resolving the problem.

For any large enterprise, there are essentially always engineers on call. I can essentially guarantee that there are some engineers right now with very unhappy faces, because they are not going to have a fun Friday night.

Tech support is generally a different dept, and in this case maybe even a different org (I suspect ZOS user support is done by MS nowadays), and does not typically have any on-call staff.

trittnerxx wrote: »

all that matters is they were able to focus on the cooking stream instead of the servers that have been on fire

Engineers also don't do cooking streams (but I'm guessing this was just a joke).

Unfortunately, this forum's announcement section at the top is the only way ZOS has communicated with us, and it is bottlenecked by requiring customer service/community managers/mods to update the status. If they had a status dashboard like so many companies do, it would streamline the communication process and keep everyone in the loop.

Not to beat a dead horse, but communications could be improved, and providing status dashboards like the ones below, or using a product like Atlassian Statuspage, would be one way to streamline things and make them more transparent.

e.g.

ffxiv statuses: https://na.finalfantasyxiv.com/lodestone/news/
eve online status: https://status.eveonline.com/ (which i feel does a good job because it also shows status for cloud providers)
reddit status: https://www.redditstatus.com/
aws status: https://health.aws.amazon.com/health/status
azure status: https://azure.status.microsoft/en-us/status
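Part of the appeal of a hosted status page is that it is machine-readable as well. The sketch below assumes a Statuspage-style site that serves a JSON summary at `/api/v2/status.json` with `page.name`, `status.indicator` and `status.description` fields; that layout is an assumption to verify against any given page, and `fetch_status` / `summarize_status` are hypothetical names.

```python
import json
import urllib.request
from typing import Any, Dict


def fetch_status(base_url: str, timeout: float = 10.0) -> Dict[str, Any]:
    """Download the JSON status summary from a Statuspage-style site."""
    url = base_url.rstrip("/") + "/api/v2/status.json"  # assumed endpoint
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize_status(payload: Dict[str, Any]) -> str:
    """Turn a status payload into a one-line summary, tolerating missing keys."""
    status = payload.get("status", {})
    indicator = status.get("indicator", "unknown")
    description = status.get("description", "no description")
    name = payload.get("page", {}).get("name", "unknown page")
    return f"{name}: {description} [{indicator}]"


# Hypothetical usage:
#   print(summarize_status(fetch_status("https://www.redditstatus.com")))
```

Keeping the parsing separate from the download makes it easy to point the same summary at several pages, or to test it with a saved payload.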

LadyGP · December 14

DewiMorgan wrote: »
Morvan wrote: »

Yeah, seems to be the same thing that happened last week.

Oof. I've worked for an MMO, and for an ISP, and either way, router problems were always the worst, because the ones with problems were rarely ours: Especially the Friday-night outages were always a hop or two from our data center, so we had to get in touch with their engineers, get through to someone who would escalate it, and then wait on them for a fix. And of course their IP contact records' contact info is someone who has no clue what the router even IS, and can't put us through to anyone who does.

But of course it would look like OUR problem as far as the users were concerned, rather than them blaming some misconfigured router owned by our upstream ISP.

At least when you're an ISP, and your network peers' routers mess up, you have some kinda business relationship and uptime agreements with those peers. A game doesn't get any of that kinda leverage.

Looking at tracert, it looks like a router 4.71.220.10, four hops before 198.20.200.1, might be flapping? But I can't really tell; tracert is so flaky nowadays. I miss the days when routers didn't deliberately eat ICMP, it made debugging this stuff so much easier.

(hmm.. why's my ISP routing me all the way up to Dallas, if ESO is hosted here in Austin?)
[...]
  6    18 ms    13 ms    13 ms   ae51.edge1.Dallas1.Level3.net [67.72.0.33]
  7     *        *        *     Request timed out.  <- maybe fine, might just eat ICMP?
  8    27 ms    24 ms     *     4.71.220.10 <- maybe flapping?
  9    23 ms     *        *     198.20.200.1
 10     *        *        *     Request timed out.
 11     *        *        *     Request timed out.
 12    31 ms     *       14 ms  198.20.200.1
[...]
  6    28 ms    22 ms    18 ms  ae51.edge1.Dallas1.Level3.net [67.72.0.33]
  7     *        *        *     Request timed out.
  8     *        *       33 ms  4.71.220.10
  9     *        *        *     Request timed out.
 10     *        *        *     Request timed out.
 11     *        *        13 ms  ZENIMAX-MED.ear1.Dallas1.Level3.net [4.71.220.10]
 12    23 ms     *       16 ms  198.20.200.1
In the second tracert I see 4.71.220.10 in lines #8 and #11, which... yeah. But the second time it resolves to "ZENIMAX-MED.ear1.Dallas1.Level3.net" - which means it's one side or the other of a connection between Zenimax and Level3, and is either in Dallas, or is on a connection TO Dallas (unfortunately routers are often named after where they are, but also often after where they connect to).

So it could be a Zenimax router, or a Level3 one. If the latter, it'll be a non-fun Friday night for engineers in both companies.

But this is just speculation. I don't work there or know anything. Could just be someone turning the servers on and off like a scene from Airplane, for all I know.

So uh, hi! Sorry to jump into this thread and yoink you away. I'd be curious whether you have any insight into the https://forums.elderscrollsonline.com/en/discussion/658253/zos-massive-spike-in-ping-lag-in-recent-days-what-gives#latest situation, and whether you have any suggestions on things we could run on our side of the house to see if it's an us/ISP issue... or kind of verify it's a ZOS thing and maybe nail down what is happening. Rich posted a big Q&A in there yesterday with some info.