State of the Game #182: How I DDoSed Myself

In case you are wondering about the title, DDoS stands for Distributed Denial of Service, a popular online attack that knocks services offline for their users. It has been used to bring down Xbox Live, Minecraft was hit with several of them, and so were many, many websites.

I managed to be extra special and pull this attack off on myself! :/

It all started with switching the website over to the new server. While the actual switch from the old site to the new site went fairly well, I did expect some minor issues, like pages throwing 404 errors because they had moved or been renamed. I didn’t expect people to be unable to log in to the game.

Problem #1 – Error 404

Almost immediately after the site switch happened I got reports of people not being able to log in and play the game. I went into full recovery mode, as the game being ‘down’ is a worst-case situation. I had extensively tested the new server and login system, so I was very surprised to hear there were issues. I fired up my editor and tried to log in to the game. It worked! So what was going on? Was it the passwords or an account issue? I created a new account and was able to log in, then tried an old migrated account and was also able to log in. Then I fired up the installed version of the game and suddenly could not log in! What changed? I traced the difference down to the editor using the direct IP address of the login server [I was testing before the site was live] while the released game used the proper URL. As it turns out, the game was requesting files from the www.bombdogstudios.com domain, but the server was returning a 404 Not Found error, as it was set up to only serve the NON-www version of the site. This was a simple fix of implementing the proper redirects on the server.
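For anyone curious what a www to non-www redirect boils down to, here is a minimal sketch in Python. It is only an illustration of the rule, not the actual server configuration: the real fix lived in the web server's redirect settings, and the `canonical_redirect` helper and the `/login.php` path below are made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the non-www host is the canonical one, as described above.
CANONICAL_HOST = "bombdogstudios.com"


def canonical_redirect(url):
    """Return the URL to redirect to, or None if no redirect is needed.

    Simplification: ports and credentials in the URL are ignored.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host == "www." + CANONICAL_HOST:
        # Keep scheme, path, query and fragment; swap only the host.
        return urlunsplit(
            (parts.scheme, CANONICAL_HOST, parts.path, parts.query, parts.fragment)
        )
    return None


# Hypothetical request paths, purely for illustration.
print(canonical_redirect("http://www.bombdogstudios.com/login.php?user=1"))
print(canonical_redirect("http://bombdogstudios.com/login.php"))
```

A server applying this rule would answer the first request with a 301 pointing at the returned URL and serve the second one directly.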

Problem #2 – Out of Memory

After the redirects went in I started getting a lot of reports of the site being down and only showing an error message about the database. On top of this, many users still could not log in! I started digging into the server logs and found that the database system was running out of memory and crashing, which would then bring down the site and the game login system. As a stopgap I set up a 4 GB swap file on the server’s SSD to prevent the out-of-memory crashes. This was a great fix for the server crashes, but when I started measuring the memory usage in real time I sat in horror as I watched a freshly rebooted server eat through all 2 GB of system memory and then chew through about 90% of the swap file [for perspective, the old server only had 512 MB of memory!]. In fact, the only reason the server wasn’t still crashing was that the memory swaps started to take so long that it couldn’t actually use up all the memory!
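Watching memory and swap like this can also be scripted. Below is a rough Python sketch that reads `/proc/meminfo`-style text on Linux and reports how much memory and swap are in use. The sample numbers are invented to roughly match the situation above (2 GB of RAM, 4 GB of swap, about 90% of the swap used); they are not real measurements from the server.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def usage_percent(info):
    """Return (memory percent used, swap percent used)."""
    mem_used = info["MemTotal"] - info["MemAvailable"]
    swap_used = info["SwapTotal"] - info["SwapFree"]
    mem_pct = 100.0 * mem_used / info["MemTotal"]
    # Guard against machines that have no swap configured at all.
    swap_pct = 100.0 * swap_used / info["SwapTotal"] if info["SwapTotal"] else 0.0
    return mem_pct, swap_pct


# Made-up sample, NOT real data from the server.
SAMPLE = """\
MemTotal:        2097152 kB
MemAvailable:      20480 kB
SwapTotal:       4194304 kB
SwapFree:         419430 kB
"""

mem_pct, swap_pct = usage_percent(parse_meminfo(SAMPLE))
print(f"memory {mem_pct:.0f}% used, swap {swap_pct:.0f}% used")
```

On a real Linux box the sample string would be replaced with the contents of `/proc/meminfo`, read once every few seconds.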

Problem #3 – The DDoS Begins

This is what a normal server should look like:

A few Apache services serving up web content to visitors, lots of memory free and a hardly touched swap file. This is what my server looked like:

It’s Apaches all the way down! Notice the PID column, which stands for process ID. PIDs are handed out in order, which means that in the ~45 seconds it took me to grab a screenshot of this fresh reboot, ~20,000 requests were made to the server, each spawning a separate Apache process! This was a full-blown Denial of Service attack! At this point I went into triage mode and discovered there was something about the way the game was requesting login data that was causing this. Since the game hadn’t changed, I thought this must be an error in the configuration of the server.
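To put the PID numbers in perspective, here is the back-of-the-envelope arithmetic as a tiny Python sketch. The `spawn_rate` helper is made up for illustration, and the result is only a ballpark: PIDs are shared by every process on the machine, so not every new PID is necessarily an Apache worker.

```python
def spawn_rate(first_pid, last_pid, seconds):
    """Rough new-processes-per-second estimate from two PIDs seen `seconds` apart."""
    return (last_pid - first_pid) / seconds


# Roughly 20,000 PIDs consumed in roughly 45 seconds, per the numbers above.
rate = spawn_rate(0, 20_000, 45)
print(f"about {rate:.0f} new processes per second")
```

That works out to roughly 444 new processes every second, on a machine with 2 GB of memory.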

Problem #4 – Fixing the Server

I very quickly went into full DDoS prevention mode. This was my first experience dealing with a DDoS, so I was learning a lot on the fly and, frankly, winging it. I implemented page request caching, limited Apache’s process spawning, and limited the number of requests a single IP could make per second. All of this slowly made the site more responsive and even allowed some users to log in to the game! It was not a universal fix, and not a perfect one either. I was way too aggressive with the caching and IP limits. Soon I started to receive reports of strange behavior on the website, people getting forbidden errors when trying to browse the forums, and other artifacts from the server changes. These fixes would have to wait until the DDoS was fully resolved.
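The per-IP request limit is the easiest of these protections to show in code. Here is a minimal sliding-window sketch in Python. It is not how the server was actually configured (that was done in the web server itself); the `PerIpRateLimiter` class, the limit of 5 and the example IP address are all assumptions for illustration.

```python
import time
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Allow at most `limit` requests per `window` seconds for each IP."""

    def __init__(self, limit, window=1.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip):
        now = self.clock()
        recent = self.hits[ip]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False  # a web server would reject the request here
        recent.append(now)
        return True


# Hypothetical limit and a documentation-range IP address.
limiter = PerIpRateLimiter(limit=5)
results = [limiter.allow("203.0.113.7") for _ in range(8)]
print(results)
```

Picking `limit` is the hard part. Set it too low and ordinary visitors who load several pages and images at once get rejected, which is exactly the kind of forbidden error that showed up on the forums.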

Problem #5 – Code Rot

After patching the server together I started to investigate the game, as something had to be wrong for it to be hitting the server that hard. It had to be something with the login system, as every time you fired up the game it would create a secure login connection. I was convinced I had mistakenly put in something stupid like this:

while (!loggedIn) {
    RequestLogin();
}

Interestingly enough, I commented out ALL of the login system and the problem was still there. That led me to the version/updater system. Combing through all the update code, I noticed a strange piece of code that was making a Unity WWW call to a version.txt on the bombdog server. It finally dawned on me that this was an old failsafe updater, from WAY back when the game was first launched, before the patching system was added. I had failed to copy the version.txt file to the new server, so the built-in Unity WWW class was getting a 404 error, which would then trigger the failsafe updater system, which called the server for the version.txt file… you can see where I am going with this. I had found the source of the DDoS! Simple enough: I just removed the outdated version code and released a new snapshot build for testing.
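The actual fix was simply deleting the old updater. For comparison, the usual way to keep a failing request from turning into a flood is to cap the number of retries and wait longer between each one. Here is a minimal Python sketch of that idea (capped retries with exponential backoff). It is a different technique from the fix described above, it is not the game's Unity code, and `fetch_with_backoff` and `always_404` are hypothetical names.

```python
import time


def fetch_with_backoff(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `fetch()` until it returns a non-None value or attempts run out."""
    for attempt in range(max_attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            # Wait 1s, 2s, 4s, ... between attempts instead of retrying at once.
            sleep(base_delay * 2 ** attempt)
    return None  # give up; the caller decides what to do without a version


# Stand-in for the missing version.txt: every request "404s".
calls = []


def always_404():
    calls.append(1)
    return None


# The sleep is stubbed out so the example finishes instantly.
version = fetch_with_backoff(always_404, sleep=lambda seconds: None)
print(version, len(calls))  # prints: None 4
```

With a cap like this, a missing file costs the server four requests per game launch instead of an endless stream.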

Problem #6 – Game Servers

Nearly as soon as the snapshot build was live and I thought my work was done, I got reports that the game servers were no longer working! Back into the code. The only change I had made was to the version check system. Well, it looks like this actually exposed a hidden race condition, as the version check used to have to complete before the game would load into the main menu. With the old system removed and the server running faster, startup happened so fast that the system that determines whether the game was launched as a multiplayer server would not update in time, causing the game to load into the main menu, then decide it was a server and load in all the server resources and UI. I was able to quickly figure out a fix for the race condition and I released another snapshot build.
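The general cure for this kind of race is to make the ordering explicit instead of relying on something else happening to be slow. Here is a small Python sketch of that idea, where the scene loader waits for a "mode is known" signal before choosing what to load. It only illustrates the pattern; the names are invented, and the actual fix in the game's Unity code may look quite different.

```python
import threading

mode_known = threading.Event()  # set once the launch mode has been decided
state = {"is_server": False}
loaded = []


def detect_mode(launched_as_server):
    """Record whether this launch is a multiplayer server, then signal it."""
    state["is_server"] = launched_as_server
    mode_known.set()


def load_first_scene():
    """Wait for the mode decision before choosing what to load."""
    mode_known.wait()
    loaded.append("server" if state["is_server"] else "main_menu")


loader = threading.Thread(target=load_first_scene)
loader.start()          # the loader can start first...
detect_mode(True)       # ...but it still waits for this decision
loader.join()
print(loaded)  # prints: ['server']
```

Because the loader blocks on the signal, it no longer matters how fast or slow the rest of startup is.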

At this point, I realized I had no way of preventing the old versions of the game from continuing to DDoS the site, and I wanted to ease back on the protections to stop the site errors people were seeing. I did all the testing I could on the last snapshot build, then released it as a stable update build, unifying the player base again.

Problem #7 – Lost Users

In the middle of this whole debacle, Leagacyelite84 and ALEXANDRA-MOONWATCH brought to my attention that new users to the site could not log in to the demo. They had full access to the website, but could not play the game. I looked into this issue and found there was a single error that was preventing new users from getting game access. I fixed this issue and reverse merged the missing users so they could gain access to the demo.

And that brings us to today! It’s been a hell of a ride this past week. These are just the major issues I have been working on; there have also been many minor things brought to my attention through Facebook, the forums, email, and the Steam page. While I anticipated issues when switching over the server, I have to be honest and say I thought they would be WAY more minor! While I do feel a bit dumb from this episode, I also learned a lot of good lessons. I got a serious crash course in server management and the Linux console. I also learned that there is quite a bit of code rot hiding in M.A.V., which is likely causing some of the bugs we see in-game. It did feel really good going in and cleaning house a little.

It’s been a hell of a week and I am looking forward to what the next week holds! If it goes according to plan, it will be the start of the polish phase, but who knows what will happen!

Comments 1

  1. The server issue on Problem #6 actually forced me to hard reset the Knolif server when I launched all 3 Rak Sal Industries servers at once and brought the machine to its knees. Good Times.