Maintenance for the week of March 3:
• PC/Mac: No maintenance – March 3
• ESO Store and Account System for maintenance – March 4, 6:00AM EST (11:00 UTC) - 4:00PM EST (21:00 UTC)
• NA megaservers for maintenance – March 5, 4:00AM EST (9:00 UTC) - 11:00AM EST (16:00 UTC)
• EU megaservers for maintenance – March 5, 9:00 UTC (4:00AM EST) - 16:00 UTC (11:00AM EST)

Why do you take servers offline for maintenance?

ahill3780
ahill3780
This is a bit ridiculous in this day and age and pretty hard to justify as anything other than being too cheap to purchase a few extra servers for swapping. And considering this game has a monthly subscription service, the one-time cost of server hardware is negligible. There is NO technical reason why you should have to take the entire service offline to perform maintenance. Unless you're running everything and everyone off a single machine, which, shame on you if so.

Correction: one technical reason: servicing of routing/switching equipment, and even then you should have failover equipment in place to take over in an emergency, so STILL, no technical reason!
Edited by ahill3780 on October 12, 2016 10:56AM
  • Daemons_Bane
    Daemons_Bane
    ✭✭✭✭✭
    ✭✭
    Another player who needs a fix?
  • Reverb
    Reverb
    ✭✭✭✭✭
    ✭✭✭✭✭
    ahill3780 wrote: »
    There is NO technical reason why you should have to take the entire service offline to perform maintenance

    Cite your sources. Because I suspect you have no actual knowledge of the ESO data center or storage array, nor any in-depth knowledge of authentication server functionality.

    Battle not with monsters, lest ye become a monster, and if you gaze into the abyss, the abyss gazes also into you. ~Friedrich Nietzsche
  • kevlarto_ESO
    kevlarto_ESO
    ✭✭✭✭✭
    Is there any online game that patches and updates without some downtime? Even phone apps go down for maintenance; it's part of the online experience. Those of us who have played online for many years wish there were an economical way to do these things without downtime, but until there is a major change in technology I think we are stuck with things as they are. Do you have any idea what a few extra servers cost? They are not cheap, and most gaming companies work as cheaply as they can to maximize profit. ZOS does nothing different from any other MMO company.
  • ahill3780
    ahill3780
    Reverb wrote: »
    ahill3780 wrote: »
    There is NO technical reason why you should have to take the entire service offline to perform maintenance

    Cite your sources. Because I suspect you have no actual knowledge of the ESO data center or storage array, nor any in-depth knowledge of authentication server functionality.

    What sources would you like? I'm a little rusty, having been out of the server administration field for the last two years, but I'm sure I can dig up something.

    The storage array is irrelevant here, since you don't typically EVER have to take a storage array offline for weekly maintenance; disk drives are hot-swappable. Weekly maintenance is another term for software patch installation, with a reservation to swap any failing hardware if needed, which 9 times out of 10 it isn't; it's patching only.

    Authentication services are handled either off-site or by a separate authentication server, and they don't need to be interrupted by weekly software maintenance windows either. I'm most familiar with AD, but I know there are others out there. Regardless, weekly maintenance is rarely, if ever, done on user credential data (which is what authentication servers deal with, not your game data, which lives in your profile on the storage array), so the authentication service has no bearing on my point about 99.9% service uptime.

    The only thing really in question here is pulling down the actual host boxes to install new patches and updates, and that can be handled on a separate pool of host machines: once that pool is updated, the load balancer routes traffic to it, and the old pool can be taken offline for updating next time. At worst it should require asking users to log out and back in to get routed to the new live pool. We did it all the time in our terminal services business when we rolled out software updates and patches, or when one of the hosts had to be brought offline for hardware maintenance.

    The data center is also irrelevant. I would be HIGHLY surprised if they did not use a reputable data center for their servers (such as Peak10) with a guaranteed 99.9% uptime. It's not even a factor for maintenance windows anyway, so I'm not sure why you brought it up.

    Also, hardware cost is almost negligible these days. Yes, servers can run upwards of $10-20k for solid specs capable of hosting hundreds of thousands of concurrent users, but they can be leased or financed with a monthly payment, so the business does not have to shell out the full cost up front.
    Edited by ahill3780 on October 12, 2016 11:53AM
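    For illustration only: a minimal Python sketch of the pool-swap idea described in the post above, assuming a load balancer that can be reconfigured programmatically. The class, host names, and version string are made up and say nothing about ZOS's actual infrastructure.

    import time

    class PoolBalancer:
        """Toy stand-in for a programmable load balancer with two host pools."""

        def __init__(self, blue, green):
            self.pools = {"blue": blue, "green": green}
            self.live = "blue"              # pool currently receiving new logins

        def idle_pool(self):
            return "green" if self.live == "blue" else "blue"

        def switch_live(self, pool):
            self.live = pool                # new logins now route to the patched pool

    def patch_host(host, version):
        # Placeholder for "install the new build on this box and restart it".
        print(f"patching {host} to {version}")
        time.sleep(0.1)

    def update_idle_pool_and_swap(balancer, version):
        idle = balancer.idle_pool()
        for host in balancer.pools[idle]:
            patch_host(host, version)       # update the pool nobody is playing on
        balancer.switch_live(idle)          # then point new sessions at it
        print(f"live pool is now '{idle}'; the old pool drains as players log out")

    if __name__ == "__main__":
        lb = PoolBalancer(blue=["host-a1", "host-a2"], green=["host-b1", "host-b2"])
        update_idle_pool_and_swap(lb, "hypothetical-new-build")

    Whether this works for an MMO depends on whether live game state (zones, campaigns, group instances) can actually be drained or handed over between pools, which is exactly what the rest of the thread argues about.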
  • DPG76
    DPG76
    ✭✭✭
    I'm not a technaut, but I know it's for the best for game performance :smiley:
  • yodased
    yodased
    ✭✭✭✭✭
    ✭✭✭✭✭
    megaserver™
    Tl;dr really weigh the fun you have in game vs the business practices you are supporting.
  • subtlezeroub17_ESO
    subtlezeroub17_ESO
    ✭✭✭✭
    Well, when you have a megaserver with the technology to create instances at will depending on population, you can bet your little hiney that there's going to be regular maintenance to keep that kind of technology running.

    Megaservers aren't an easy technology to maintain, I'm afraid.
  • SantieClaws
    SantieClaws
    ✭✭✭✭✭
    ✭✭✭✭✭
    Because if the servers they were still online then khajiit may unexpectedly experience the sensation of a diagnostic tool being placed somewhere quite uncomfortable.

    This one she would not like this.

    Yours with paws
    Santie Claws
    Shunrr's Skooma Oasis - The Movie. A housing video like no other ...
    Find it here - https://youtube.com/user/wenxue2222

    Clan Claws - now recruiting khajiit and like minded others for parties, fishing and other khajiit stuff. Contact this one for an invite.

    PAWS (Positively Against Wrip-off Stuff) - Say No to Crown Crates!

    https://www.imperialtradingcompany.eu/
  • Zolron
    Zolron
    ✭✭✭
    ahill3780 wrote: »
    The only thing really in question here is pulling down the actual host boxes to install new patches and updates, and that can be handled on a separate pool of host machines: once that pool is updated, the load balancer routes traffic to it, and the old pool can be taken offline for updating next time. At worst it should require asking users to log out and back in to get routed to the new live pool.

    Seriously, what's wrong with people? This guy gives a perfectly good (sounding, at least) explanation of how maintenance could be done without the servers coming down, and the only replies are 'has to be done, needed for performance... blah blah'. Does anyone have a SOLID answer as to why his/her solution of a backup server isn't possible???
    Not trolling, but seriously curious. Is it actually possible to do this???
  • mobicera
    mobicera
    ✭✭✭✭✭
    Zolron wrote: »
    Seriously, what's wrong with people? This guy gives a perfectly good (sounding, at least) explanation of how maintenance could be done without the servers coming down, and the only replies are 'has to be done, needed for performance... blah blah'. Does anyone have a SOLID answer as to why his/her solution of a backup server isn't possible???
    Not trolling, but seriously curious. Is it actually possible to do this???

    $, it really is that simple...

    Some suggested reading for server downtime...
    http://www.m.webmd.com/a-to-z-guides/features/video-game-addiction-no-fun
  • anitajoneb17_ESO
    anitajoneb17_ESO
    ✭✭✭✭✭
    ✭✭✭✭✭
    Zolron wrote: »
    Seriously, what's wrong with people? This guy gives a perfectly good (sounding, at least) explanation of how maintenance could be done without the servers coming down, and the only replies are 'has to be done, needed for performance... blah blah'. Does anyone have a SOLID answer as to why his/her solution of a backup server isn't possible???
    Not trolling, but seriously curious. Is it actually possible to do this???

    Well, that's an easy one. His post is full of assumptions, and the last paragraph, according to which hardware costs are low, is just flat-out wrong. These costs are huge, and that's why the megaservers cannot be duplicated just to spare us downtime.

  • llllADBllll
    llllADBllll
    ✭✭✭
    Have you ever tried changing your underwear while wearing trousers? Metaphorically, that's what you are suggesting.
    Nothing ever gets updated without downtime, from phone software upgrades turning your phone off to an app uninstalling and then reinstalling after an update.
    The fact that this is an MMO with hundreds of thousands of lines of code means editing it while live would probably cause errors with overlays and UI issues.
    If the servers didn't go down, the problems would multiply exponentially.

    ...but I feel your pain, I'm a little bored too.
    CRAFTMASTER - DAGGERFALL EU XBOX ONE

    GAMERTAG - DJANTBOWMAN

    Tamriel Trading Company Guildmaster
  • raglau
    raglau
    ✭✭✭✭✭
    There really is no technical reason why any service, including those vastly larger, far more complex, and with many more customers than a computer game, needs to be taken offline for maintenance. That is why global enterprise services such as Office 365, Azure, AWS, vCloud, etc. all offer out-of-the-box uptimes of 99.95%, and more if desired. If those services went down for patching, those businesses would not be the global heavyweights that they are.

    However, it is all about the money, honey. The more availability you want, the more money you need to invest. This is just a game; no one will die if it goes down for patching, so I guess ZOS do not see the requirement to invest in high availability. It annoys us, but the only way to change that is to vote with our wallets and thereby exert a sufficient penalty to drive a business case for more resilience.

    But without knowing the ZOS profit margin, it is hard to know whether such a business case might also drive an increase in the sub fee and/or an increase in the prevalence of utter dreck in the Crown Store to fund such resilience.
    Edited by raglau on October 12, 2016 1:37PM
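    For reference, here is the arithmetic behind those uptime figures; this is generic math, not a claim about any particular provider's SLA.

    # Allowed downtime per 30-day month for a few common uptime targets.
    MINUTES_PER_MONTH = 30 * 24 * 60       # 43,200 minutes

    for target in (0.999, 0.9995, 0.9999):
        allowed = MINUTES_PER_MONTH * (1 - target)
        print(f"{target:.2%} uptime allows roughly {allowed:.0f} minutes of downtime per month")

    A single multi-hour weekly maintenance window already adds up to several hundred minutes a month, so ESO's schedule sits in a very different availability tier from the 99.95% services named above.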
  • Elsonso
    Elsonso
    ✭✭✭✭✭
    ✭✭✭✭✭
    ahill3780 wrote: »
    There is NO technical reason why you should have to take the entire service offline to perform maintenance. Unless you're running everything and everyone off a single machine, which, shame on you if so.

    Yeah, there pretty much are technical reasons. Cost reasons, too.

    High-availability environments for something like this are going to be expensive and a lot more technically involved. It makes more sense to build a system that you take offline for a few hours when you want to do maintenance. We are talking about a game here, not the Central Bank of Nirn.

    ESO Plus: No
    PC NA/EU: @Elsonso
    XBox EU/NA: @ElsonsoJannus
    X/Twitter: ElsonsoJannus
  • anitajoneb17_ESO
    anitajoneb17_ESO
    ✭✭✭✭✭
    ✭✭✭✭✭
    raglau wrote: »
    There really is no technical reason why any service, including those vastly larger, far more complex, and with many more customers than a computer game, needs to be taken offline for maintenance. That is why global enterprise services such as Office 365, Azure, AWS, vCloud, etc. all offer out-of-the-box uptimes of 99.95%, and more if desired. If those services went down for patching, those businesses would not be the global heavyweights that they are.

    However, it is all about the money, honey. The more availability you want, the more money you need to invest. This is just a game; no one will die if it goes down for patching, so I guess ZOS do not see the requirement to invest in high availability. It annoys us, but the only way to change that is to vote with our wallets and thereby exert a sufficient penalty to drive a business case for more resilience.

    But without knowing the ZOS profit margin, it is hard to know whether such a business case might also drive an increase in the sub fee and/or an increase in the prevalence of utter dreck in the Crown Store to fund such resilience.

    You are right that it comes down to the money.
    You are not right to compare ESO with services provided by big companies. They have, as you say, a VERY LARGE customer base, and therefore a much higher income. They can afford the costs of a duplicate infrastructure - which ZOS cannot.
    If we insisted on 99.99% (virtually 100%) uptime, we would need to pay much more for playing the game than we currently do.
    Conversely, if we insisted on 100% uptime and stopped playing/paying because of the downtimes, then the game probably wouldn't exist because there would be no sustainable business model for it.

    (Disclaimer: I have no insight into ZOS finances or cost structure, but that's how I see it.)

  • raglau
    raglau
    ✭✭✭✭✭
    anitajoneb17_ESO wrote: »
    raglau wrote: »

    But without knowing the ZOS profit margin, it is hard to know whether such a business case might also drive an increase in the sub fee and/or an increase in the prevalence of utter dreck in the Crown Store to fund such resilience.

    You are right that it comes down to the money.
    You are not right to compare ESO with services provided by big companies. They have, as you say, a VERY LARGE customer base, and therefore a much higher income. They can afford the costs of a duplicate infrastructure - which ZOS cannot.
    If we insisted on 99.99% (virtually 100%) uptime, we would need to pay much more for playing the game than we currently do.
    Conversely, if we insisted on 100% uptime and stopped playing/paying because of the downtimes, then the game probably wouldn't exist because there would be no sustainable business model for it.

    I completely agree; that was the thrust of my post. I included the 'big boys' just to show that at a technical level, not only is high availability possible, it is in fact the norm for the enterprise. O365 is about £14 per month per user for 99.95% uptime, but there are nigh on 100 million O365 users now, so MS get economies of scale. Amazon add, PER DAY, the same amount of server capacity that they added in their entire first year of business, just to service growth.

    So yes, it's entirely possible to make any service resilient; you just need to chuck money at it. But as ZOS are small fry in enterprise terms, they do not have that sort of money and would end up passing that cost to us, no doubt by 'virtue' of more Crown Store garbage. Therefore we either accept that the game is run on a shoestring, or we defect to another game, no doubt run on an equally threadbare shoestring.
  • Elsonso
    Elsonso
    ✭✭✭✭✭
    ✭✭✭✭✭
    raglau wrote: »
    So yes, it's entirely possible to make any service resilient; you just need to chuck money at it. But as ZOS are small fry in enterprise terms, they do not have that sort of money and would end up passing that cost to us, no doubt by 'virtue' of more Crown Store garbage. Therefore we either accept that the game is run on a shoestring, or we defect to another game, no doubt run on an equally threadbare shoestring.

    In this case, a "threadbare shoestring" budget might be a little off base. We are talking about pretty large servers in two different data centers. Yes, if you throw enough money at it, you can make all 6 megaservers highly available, but ZOS already has a massive investment in server hardware, and doing this would significantly increase the cost and complexity.

    I gather that part of the deployment process is to "build to Live", which implies that they are doing more than just installing a few programs. This likely includes trusted source database updates that raise the technical difficulty of rolling out an update while copies of those databases are in use.

    It is expensive and a lot of work to do, and to get right, and there really needs to be a good business reason for it. ESO is fine with what they are doing. I don't see that we need 99.99% availability and seamless, transparent updates.
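    As a tiny, self-contained illustration of why database changes and code changes usually have to land together (generic SQLite, nothing to do with ZOS's actual "build to Live" process, which we can only guess at):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE items (id INTEGER, name TEXT)")

    # "Old" code writing rows positionally works fine against the old schema.
    db.execute("INSERT INTO items VALUES (?, ?)", (1, "Example Sword"))

    # The update adds a column while that old code is still running.
    db.execute("ALTER TABLE items ADD COLUMN trait TEXT")

    try:
        db.execute("INSERT INTO items VALUES (?, ?)", (2, "Example Shield"))
    except sqlite3.OperationalError as err:
        # e.g. "table items has 3 columns but 2 values were supplied"
        print("old code vs. new schema:", err)

    Avoiding that kind of breakage without downtime means staged, backward-compatible schema changes and keeping two code versions happy at once, which is exactly the extra cost and complexity being described.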
    ESO Plus: No
    PC NA/EU: @Elsonso
    XBox EU/NA: @ElsonsoJannus
    X/Twitter: ElsonsoJannus
  • Biro123
    Biro123
    ✭✭✭✭✭
    ✭✭✭✭
    It depends what kind of maintenance they are doing, I guess.

    Space-saving? Clearing out temp files, cleaning up memory that gets allocated when players log in... I can understand that that would need everyone logged out. It doesn't explain the length of the downtime, though, unless there are a LOT of small files all over the filesystem to clean up.

    Database runstats/reorgs? Yeah, they'd need to take the databases down for that, which would mean stopping the server. Depending on the size of the database, that could take hours to run.

    Hardware-related? Not an area I know, so I'm not going to guess.

    Obviously patching is different. That DOES take time on bespoke server software.

    So if they do need stopping, the first thing they need to do is back everything up in case they balls it up. You don't want to lose a few days of game progress because they messed up a patch or maintenance job and couldn't roll back, do you? How long does the backup take? Hours?

    Then how long does it take to start everything up again? I'd suspect services are brought up manually, bit by bit, and tested as they go to ensure all is good. If not, restore from the backup, get it started again, and retry the maintenance another day. How long does that restore take?

    What's the alternative? Switch over to another server? Hmm, how would it keep the data up to date (your character progression, etc.)? It would have to be constantly copying it over on the fly so that when it switches, you don't lose any progression. That would be a performance hit on the server, meaning they would need a bigger, more expensive hamster to power it. They probably already have the biggest hamster they can afford. But wait, not only would a bigger hamster be needed, they would need two of them!

    I've worked in software development for global banks for quite a few years. They have money, yet they always take systems down to patch, and it always takes several hours.
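    A rough sketch of the backup-patch-verify-or-restore sequence described above, in Python; every function here is a hypothetical placeholder, not how ZOS actually runs its windows.

    def snapshot_databases():
        print("taking a full database backup...")
        return "backup-id"                   # handle used if we have to roll back

    def apply_patch(version):
        print(f"installing {version} on the game hosts...")

    def smoke_test():
        print("bringing services up one by one and checking them...")
        return True                          # pretend the checks passed

    def restore(backup_id):
        print(f"restoring {backup_id} and rescheduling the maintenance...")

    def maintenance_window(version):
        backup_id = snapshot_databases()     # the slow part: backups take time
        apply_patch(version)
        if not smoke_test():
            restore(backup_id)               # roll back rather than lose player progress
            return False
        print("reopening the megaserver")
        return True

    if __name__ == "__main__":
        maintenance_window("hypothetical-incremental-patch")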

    Minalan owes me a beer.

    PC EU Megaserver
    Minie Mo - Stam/Magblade - DC
    Woody Ron - Stamplar - DC
    Aidee - Magsorc - DC
    Notadorf - Stamsorc - DC
    Khattman Doo - Stamblade - Relegated to Crafter, cos AD.
  • raglau
    raglau
    ✭✭✭✭✭
    Biro123 wrote: »
    I've worked in software development for global banks for quite a few years. They have money, yet they always take systems down to patch, and it always takes several hours.

    Yes, but banks are highly risk-averse and would rather sweat their assets to death than introduce modern technology that would vastly improve the situation. The perverse irony is that they spend hundreds of millions on custom support agreements left, right and centre, which still leave them wide open to every form of threat going. But from their perspective, 100 million a year on a CSA for system X is more palatable than the threat of losing 100 million a second in a trading-floor outage caused by introducing a new solution that contains defects.

    The sheer amount of commercial risk involved, and the number of people therefore involved in signing off any business change, means banks are invariably paralysed by indecision. Hence you see archaic systems like the ones you describe, while the rest of the business world has largely embraced high availability via virtualisation or cloud offerings.
    Edited by raglau on October 12, 2016 3:09PM
  • Cously
    Cously
    ✭✭✭✭✭
    The game has had this system for two years now. If you are not happy, then just move to another game. It's not like everyone has been fooled.
  • cjthibs
    cjthibs
    ✭✭✭✭✭
    Simply put, the examples given above are infrastructure. The issue isn't infrastructure, it's software.

    EVERY single one of my systems needs a reboot when a new kernel is installed, for example. (Linux servers, but this applies to any other platform.) It's entirely unavoidable. You simply cannot run one version of code and then another, magically, without any downtime. Even individual services must be shut down and restarted to pick up the newer code.

    Ever notice how when you update Firefox it asks you to restart Firefox? Or how Windows prompts you to reboot after an update?

    Now, my infrastructure is different, because my software runs on top of it and is generally not dependent on which path or host it's running on. I can move the software from one host to another with no interruption, just as I can swap its network path to another with no interruption. This is what allows uptime guarantees: redundant infrastructure.

    Most admins are thinking of a setup like VMware's vSphere here when they claim that no downtime is required, and for that infrastructure software this is true, because it is designed to run its virtual machines regardless of the host's software version (so long as the version is going up and not down, of course). Patching can be done in a rolling fashion, one host at a time, always leaving systems up and running. (But when you update the software on an individual ESXi host, it still has to be brought down, just like everything else.)

    When it comes to the code itself, there is simply no way to change the running version without stopping/restarting.
    This isn't complicated, it's just reality, and any Systems Admin/Engineer/etc. who doesn't recognize this is either not very versed in the technology or isn't doing anything very complicated.

    There are some very new Linux kernel features for live patching that can apply certain fixes to a running kernel, but even so, a reboot is still required to fully switch over to a new kernel.
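    A minimal sketch of the rolling, one-host-at-a-time patching described above for infrastructure layers like vSphere; the host names and functions are invented, and note that each individual box still goes down even though the pool as a whole stays up.

    # Rolling patch: drain, update, and readmit one host at a time, so the fleet
    # stays available even though every individual host still takes downtime.
    HOSTS = ["hv-01", "hv-02", "hv-03", "hv-04"]

    def drain(host):
        print(f"{host}: migrating workloads to the other hosts")

    def patch_and_reboot(host):
        print(f"{host}: installing the new build and rebooting")

    def readmit(host):
        print(f"{host}: healthy again, accepting workloads")

    def rolling_patch(hosts):
        for host in hosts:                   # never the whole pool at once
            drain(host)
            patch_and_reboot(host)           # this host is down; the rest carry the load
            readmit(host)

    if __name__ == "__main__":
        rolling_patch(HOSTS)

    None of this helps with the game server process itself, which is the point above: when the new game build has to replace the old one, the process running the old code still has to stop and restart.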
  • Sigtric
    Sigtric
    ✭✭✭✭✭
    ✭✭✭✭✭
    I love all the comparisons in these sorts of threads to webmail servers, DCs, office-type client-server software, and the like.

    Because the hundreds of thousands of calculations the server cluster is running at any given second during ESO uptime are comparable to your Exchange server hosting a couple hundred or thousand inboxes. lol

    Stormproof: Vibeke - 50 EP mDragonknight | Savi Dreloth - 50 EP Magsorc | Sadi Dreloth - 50 EP Magblade | Sigtric Stormaxe - 50 EP Stamsorc | Valora Dreloth - 50 EP Magplar | Sigtric the Unbearable 50 EP Stam Warden
    Scrub: Chews-on-Beavers - 50 EP DK Tank | Vera the Wild - 50 EP magicka Warden | Sigtric the Axe - 50 EP Dragonknight Crafter | Sigtric the Blade - 50 EP Lost Nightblade | Sigtric the Savage - 50 EP magicka Templar | Vibeka Shadowblade - 50 Ep Stealthy Ganky Nightblade |

    Show Me Your Dunmer
  • WalksonGraves
    WalksonGraves
    ✭✭✭✭✭
    So the company that won't invest in more servers should buy twice as many solely for use 5 hours a month in the middle of the night?