Update 33 PC Launch Postmortem

ZOS_MattFiror
Since Update 33 launched, I think the PC North American megaserver performance problems deserve some explanation. This post outlines what has been going on the last week or so for our North American PC players.

First, last year (which seems like decades ago) we announced a plan to increase ESO’s stability and performance, and we have been diligently performing tasks behind the scenes with every update to implement it. One of the larger items on this list was "Database Sharding", which is a simple concept: take our giant player database (DB) and separate it into two sections for "current characters" and "older characters" so the entire DB doesn't have to be queried when a player logs in. Over time, our character DB (one per Megaserver) has been growing, and about two years ago its sheer size became a bottleneck. This is why the "requesting character load" part of the login process sometimes takes a lot longer than it should.

The DB Sharding process separates our character databases into a "live" DB and a "cold" DB; all accounts that have logged in over the past year are in the Live DB and older ones are in the Cold DB. The plan, once everything is complete, is that active accounts will pull their characters from the smaller Live DB on login, greatly decreasing login time. Older characters will pull from the Cold DB on login, which will take longer, but once an account logs in, its characters are moved over to the Live DB for faster access after the initial login. This character record separation happens the first time an account logs in after sharding has been enabled for that megaserver. The first login may take longer than normal as the copying happens, but every login after that should be much faster.
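
As a rough illustration of the live/cold lookup described above, here is a minimal Python sketch. It is a toy model only: the class name, method name, and account names are hypothetical, in-memory dictionaries stand in for the real databases, and none of it reflects the actual server code.

```python
from dataclasses import dataclass, field


@dataclass
class ShardedCharacterStore:
    """Toy model of a live/cold split; every name here is hypothetical."""
    live: dict = field(default_factory=dict)  # accounts active in the past year
    cold: dict = field(default_factory=dict)  # older accounts

    def load_characters(self, account_id):
        # Fast path: active accounts are served from the smaller live store.
        if account_id in self.live:
            return self.live[account_id]
        # Slow path: on the first login, records move from cold to live.
        if account_id in self.cold:
            characters = self.cold.pop(account_id)
            self.live[account_id] = characters
            return characters
        # Unknown account: nothing to load.
        return []


store = ShardedCharacterStore(
    live={"active_account": ["Character A"]},
    cold={"returning_account": ["Character B"]},
)
first = store.load_characters("returning_account")   # read from cold, then moved
second = store.load_characters("returning_account")  # now read from live
```

After the first call, the returning account exists only in the live store, so the second call takes the fast path.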

The good news here is that we have already done this for most of the live megaservers over the last couple of months; all console megaservers have been upgraded already and login times have greatly decreased.

With that background information, you can now start to understand what happened since Update 33 launched last Monday. The PC character database (especially the North American megaserver) is far, far larger than console as ESO had a big launch year in 2014 (pre-console launch) and all those accounts are still there. In addition, all the Beta accounts (and characters) are still there as well.

So, Update 33 launched last Monday and the plan was to wait until the dust settled, then actually enable sharding on PC NA. On launch day, we tracked the usual in-game bugs and issues that tend to crop up and began work to address them. And there were indeed some problems. There were reports of in-game loading screen timeouts and that the Activity Finder was bogged down. Our first big failure was that we chalked these reports up to normal server startup issues after a big update. We later increased our real-time monitoring, which showed the Activity Finder and other processes were running a bit "hot" – they would spike a bit, then return to normal. We made adjustments both outside of and during primetime hours to try to alleviate queue issues, but this made it difficult to pinpoint whether our adjustments were working or primetime population on the server was easing. So we – and this was our second large error – decided to move ahead with enabling DB Sharding on the PC NA megaserver without addressing the Activity Finder issues.

And all of you who play on the PC NA megaserver know what happened once we flipped the DB Sharding switch: the entire server slowed down even more during primetime. The DB processes got backed up, which meant that all transfers between processes (i.e. zoning) were even slower, as were logouts (where your character's DB record is updated), and the Activity Finder (which accesses your character records) became so bogged down it essentially ceased to function at all.

We had done the math and designed the DB Sharding system to work within normal server performance guidelines, so when we started addressing the slowdown issues, we naturally assumed that we had some bad calculations and started there. We made some changes (hence the downtime on Monday earlier this week) but they didn't help at all; performance was still terrible Monday night. Adding to the situation was that we could only troubleshoot on the live server, and only during primetime, because these problems cropped up mostly when the server was under moderate load. But the system ran slowly again Monday night so we knew it was something else.

On Tuesday, with the understanding that the problem was probably not connected to DB Sharding at all, we traced every log we could find to figure out where the bottleneck was and we finally found it – the issue was actually caused by a bad (as in failing) network port that was unable to process as much bandwidth as it was configured for. It wasn't a software problem at all; it was a hardware failure that, in essence, slowed down the entire megaserver. Tuesday’s maintenance was to take that device out of service and reconfigure a replacement, and once that was up, everything returned to normal and the DB Sharding process ran as intended: behind the scenes and with no player impact.

Obviously, there are no guarantees, but we do believe we have gotten to the root of this issue. The TL;DR is that it wasn't related to Update 33, Account Wide Achievements or DB Sharding at all, even though they all happened around the same time and we spent too much time investigating a red herring because of it.

I know this hasn't been an awesome time for any of you on PC. Many of you were unable to log in to play and take advantage of the Explorer's Celebration as you otherwise might have. You may have lost time and progress, and to acknowledge that, we are going to be giving out five 150% Experience Scrolls on the first day of April through the Daily Login Rewards calendar and will be tripling the number of Weekly Endeavor Seals the week of 4/4 for players on all ESO platforms.

We have so much to look forward to in April with Jester's Festival, the Anniversary Jubilee, and even more we can't wait to share with you. We hope you'll use these Experience Scrolls during the upcoming 100% bonus XP events and catch up to where you might have been, had the game been running as intended.

Thanks so much for bearing with us and for reading this long explanation. Given the circumstances, I think full disclosure was warranted.
Edited by ZOS_Kevin on July 15, 2022 7:45PM
Matt Firor
Studio Director, ZeniMax Online Studios
Staff Post
  • Sylvermynx
    Sylvermynx
    ✭✭✭✭✭
    ✭✭✭✭✭
    Well, no one ever said troubleshooting a huge online process (in this case, a live game) would be easy and fast. Great info Matt, and thanks for explaining.

    And the scrolls plus tripled seals really makes me happy! So double thanks for that!
  • FeedbackOnly
    FeedbackOnly
    ✭✭✭✭✭
    ✭✭
    This actually makes sense. The problems for me started Sunday night before update 33.

    ♥️

    I just hope you checked all the network ports because it's still not like it was. There's an increase of about 30 in my latency numbers.
    Edited by FeedbackOnly on March 25, 2022 2:09PM
  • thejadefalcon
    thejadefalcon
    ✭✭✭✭✭
    "You may have lost time and progress, and to acknowledge that, we are going to be giving out five 150% Experience Scrolls on the first day of April through the Daily Login Rewards calendar and will be tripling the number of Weekly Endeavor Seals the week of 4/4 for players on all ESO platforms."

    In the interests of communication, is this going to be a standard compensation in the future? Because PC EU has problems during events and we've been lucky to maybe get a day's extension before. Xbox NA just had a bunch of character rollbacks which I'm not sure much, if any, compensation was given (I could be wrong, maybe I missed an update). This is a great "sorry for the problems", don't get me wrong, but if it only happens because of problems on a single megaserver, it does not paint a good look.
  • VaxtinTheWolf
    VaxtinTheWolf
    ✭✭✭✭✭
    The PC character database (especially the North American megaserver) is far, far larger than console as ESO had a big launch year in 2014 (pre-console launch) and all those accounts are still there. In addition, all the Beta accounts (and characters) are still there as well.

    I appreciate not nuking old accounts. I know I've abandoned some games because I get an e-mail saying that "YOU MUST LOG IN NOW OR ELSE" when I have no intention of rushing back to this or that game when I'm not in the mood for it, and thus my account and all data on it is purged. At least I know I would be able to return here, assuming the game is still alive far into the future, knowing my memories and progress have been retained if I happened to take a break from the game.
    Edited by VaxtinTheWolf on March 25, 2022 1:55PM
    || AD - Rah'Jiin Lv50 Khajiit Nightblade (Damage) || EP - Generic Argonian Lv50 Argonian Nightblade (Tank) || DC - Zinkotsu Lv50 Breton Nightblade (Healer) ||
    || DC - Ja'Kiro Feral-Heart Lv50 Khajiit Dragonknight (Damage) || EP - VaxtinTheWolf Lv50 Redguard Templar (Tank) || AD - Velik Iranis Lv50 Dark Elf Sorcerer (Tank ) ||
    || EP - Einvarg The Frozen Lv50 Nord Warden (Tank/Healer) || EP - Keem-Ja Lv4 Argonian Necromancer (Healer/Tank) ||
    PC - North American Server (Champion 1300+)
  • skinnycheeks
    skinnycheeks
    ✭✭✭✭✭
    Thanks for the response. It has definitely seemed better since Tuesday night, so glad to hear that the issue seems to be resolved. Appreciate the communication about it and the extra goodies.
  • TheGent
    TheGent
    ✭✭✭
    Thanks for the update. That's good that you guys at least said something. I respect that. B)
    ESO: @The.Gent
    I really need a questing friend. Playing solo is lonely and boring (i am in multiple guilds too)

  • krachall
    krachall
    ✭✭✭✭✭
    I've been playing MMOs since Ultima Online in the late 90s. This may be the best developer post I've ever read.
  • iris56
    iris56
    ✭✭✭
    This was one of the most transparent updates I've ever seen from ZOS. Thank you!
  • MaraxusTheOrc
    MaraxusTheOrc
    ✭✭✭✭✭
    Compensation seems commensurate to the problem from my perspective. Good communication. These are the kind of posts to keep sharing with the player base.
  • ApoAlaia
    ApoAlaia
    ✭✭✭✭✭
    ✭✭✭
    You had me all the way to:

    I know this hasn't been an awesome time for any of you on PC. Many of you were unable to login to play and take advantage of the Explorer's Celebration as you otherwise might have. You may have lost time and progress, and to acknowledge that, we are going to be giving out five 150% Experience Scrolls on the first day of April through the Daily Login Rewards calendar and will be tripling the number of Weekly Endeavor Seals the week of 4/4 for players on all ESO platforms.

    Well played @ZOS_MattFiror, well played.
  • FeedbackOnly
    FeedbackOnly
    ✭✭✭✭✭
    ✭✭
    🙏♥️Please make sure to test group finder next update. 🙏♥️This used to happen every patch before Dark Brotherhood. As in, each patch would break it for a while.

    Group finder did not work. It would only properly queue premade groups or groups with more than 1 player. Getting in as a solo queue was extremely rare, and only because the other half was group queued.

    For example, I would get an instant queue as a healer grouped with a DPS, while tanks would still be queued alone for literally hours.
    Edited by FeedbackOnly on March 25, 2022 2:24PM
  • Destai
    Destai
    ✭✭✭✭✭
    ✭✭✭
    @zos_mattfiror Thanks Matt. Full disclosure is always warranted. Always.

    I think it's only fair we get these reflections following each patch. It feels like each patch breaks things, and given how many efforts you guys have in flight, we're all wanting a little more detail. I get it, it's expected with live software, but getting these postmortems is the best damage control you can do.

    What can we expect with the console release and will these bugs be experienced to the same or lesser degree? I had cancelled my High Isle preorder after seeing this first unfold. Hopefully High Isle is a more workable release so we can all enjoy it!
    Edited by Destai on March 25, 2022 2:28PM
  • LalMirchi
    LalMirchi
    ✭✭✭✭✭
    A postmortem is always useful and making the postmortem a discussion is excellent.

    I wonder if there would be any improvement (performance || cost) by moving the servers to Azure or AWS?
  • skayl
    skayl
    ✭✭
    Wow, thank you so much for providing this level of context and explanation to the community! It had been frustrating seeing all of the performance issues and downtime that we've had on PC/NA after this launch, but this post goes into great detail about all of the work that the team did and makes me feel a lot better about the whole experience. And the bonus rewards and endeavors will be appreciated for sure :)
    PC/NA - cp2000+
  • McTaterskins
    McTaterskins
    ✭✭✭✭
    This.

    These kinds of posts.

    These are what matter.


    Matt - Any updates to follow up on your old re-architecture post?
  • Scaletho
    Scaletho
    ✭✭✭✭✭
    It's very rare for a big gaming company to give its player community such straightforward and honest explanations. So thank you very much for your efforts.
  • Onomog
    Onomog
    ✭✭✭✭
    Thank you for your openness on this. The transparency is refreshing.

    I would love to see the same effort given to explaining why the AwA process was pushed through without seeming to take into consideration everything that it broke.
  • Ashryn
    Ashryn
    ✭✭✭✭✭
    Thank you :)
  • I_CraftwithPntButter
    @ZOS_MattFiror

    Ty for taking the time for explaining the issues surrounding the pc na server :)
  • Oakenaxe
    Oakenaxe
    ✭✭✭✭
    The communication is appreciated, thank you 🙏
    a.k.a. Leo
    non-native English speaker
    200-300 ping and low fps player
  • MasterWarrior
    MasterWarrior
    ✭✭✭
    This is a good response to the issues we had. You laid out in detail why it happened without revealing too much. And I assume you will be making changes to try to avoid these problems in the future.
  • Lostar
    Lostar
    ✭✭✭✭
    "It's very rare for a big gaming company to give its player community such straightforward and honest explanations. So thank you very much for your efforts." It's becoming less and less rare as companies take note that their playerbase has come to expect this from companies that had the foresight to already do it, which is a good thing. One would hope such measures are unnecessary in the future, but should it come to pass, I hope they continue to offer postmortems, and also that, unless it's an 'all hands on deck' scenario, they don't wait until the postmortem. It's been a pretty stressful time for them, I'm sure, and I'm happy with the form of compensation.
    I paint stuff sometimes...
    https://www.instagram.com/artoflostar/
  • redlink1979
    redlink1979
    ✭✭✭✭✭
    ✭✭✭
    Thanks for the insightful explanation @ZOS_MattFiror
    This is the kind of communication we all need and appreciate.
    "Sweet Mother, sweet Mother, send your child unto me, for the sins of the unworthy must be baptized in blood and fear"
    • Sons of the Night Mother [PS5][EU] 2165 CP
    • Daggerfall's Mightiest [PS5][NA] 1910 CP
    • SweetTrolls [PC][EU] 1950 CP
    • Bacon Rats [PC][NA] 1850 CP
  • Anifaas
    Anifaas
    ✭✭✭✭✭
    Thank you for your thoughtful and detailed post. Much respect! ❤️
  • sonwon.1_ESO
    sonwon.1_ESO
    ✭✭✭
    @ZOS_MattFiror

    And when will the bugs from the last 5 years be fixed?
  • Serafen
    Serafen
    Thank you so much!! This shows you care and listen to those of us who love and support this beautiful game. 💗💗
  • mbaranski15
    mbaranski15
    ✭✭
    Thank you! Love the detailed explanation!
  • FeedbackOnly
    FeedbackOnly
    ✭✭✭✭✭
    ✭✭
    @ZOS_MattFiror

    And when will the bugs from the last 5 years be fixed?

    Sugar Skulls, the food item, has literally been bugged for years.

    A simple UI error, but by now it deserves a fix.
    Edited by FeedbackOnly on March 25, 2022 2:43PM
  • Marcusorion1
    Marcusorion1
    ✭✭✭✭
    Thanks for the detailed and open explanation, it is something the players as supporters of your game will greatly appreciate.

    It would be even more awesome to make this a regular occurrence, not just when things have gone south !
  • Arunei
    Arunei
    ✭✭✭✭✭
    ✭✭✭
    I appreciate the honesty here, but I do have a few questions.

    First, why are beta accounts and characters still being held? We don't have access to those accounts or characters, there should be no reason to keep them, at least none I can think of. Unless those are kept on a completely separate, inactive server, why waste database space on them, especially when your answer to database issues was to implement AwA in the way you did?

    As for the 150% Exp Scrolls, that's 10 hours of increased experience. Which is a fraction of time compared to the five days a lot of people lost. You really should be giving people 20 or so of these things, it's not like it would cost you anything, and that would still only be 40 hours of increased experience, not even a full two days. Daily Seals should also be increased for five days, not just the weekly ones.

    @ZOS_MattFiror

    Edit because I was wrong, a friend informed me that the 150% scrolls are only one hour long, not two as I had thought. That's even less time people are being compensated for. We really should be getting a lot more of these if this is how you want to handle the compensation.
    Edited by Arunei on March 25, 2022 3:07PM
    Character List [RP and PvE]:
    Stands-Against-Death: Argonian Magplar Healer - Crafter
    Krisiel: Redguard Stamsorc DPS - Literally crazy Werewolf, no like legit insane. She nuts
    Kiju Veran: Khajiit Stamblade DPS - Ex-Fighters Guild Suthay who likes to punch things, nicknamed Tinykat
    Niralae Elsinal: Altmer Stamsorc DPS - Young Altmer with way too much Magicka
    Sarah Lacroix: Breton Magsorc DPS - Fledgling Vampire who drinks too much water
    Slondor: Nord Tankblade - TESified version of Slenderman
    Marius Vastino: Imperial <insert role here> - Sarah's apathetic sire who likes to monologue
    Delthor Rellenar: Dunmer Magknight DPS - Sarah's ex who's a certified psychopath
    Lirawyn Calatare: Altmer Magplar Healer - Traveling performer and bard who's 101% vanilla bean
    Gondryn Beldeau: Breton Tankplar - Sarah's Mages Guild mentor and certified badass old person
    Gwendolyn Jenelle: Breton Magplar Healer - Friendly healer with a coffee addiction
    Soliril Larethian- Altmer Magblade DPS - Blind alchemist who uses animals to see and brews plagues in his spare time
    Tevril Rallenar: Dunmer Stamcro DPS - Delthor's "special" younger brother who raises small animals as friends
    Celeroth Calatare: Bosmer <insert role here> - Shapeshifting Bosmer with enough sass to fill Valenwood

    PC - NA - EP - CP1000+
    Avid RPer. Hit me up in-game @Ras_Lei if you're interested in getting together for some arr-pee shenanigans!