So here's the follow-up I promised on the recent issue with one specific VPS node that had some trouble. In the end, every client was split off and moved to other nodes.
Backstory: what happened?
This server began triggering alerts on the 8th and standard troubleshooting steps were taken. Load was examined but appeared normal, error logs were checked, and hardware was checked along with the start of local console monitoring. No hardware issues were discovered and no obvious load anomalies were exposed.
The server and all VEs on it restarted normally and continued to operate as they should for over 3 hours before the server became unstable.
Other actions were taken as a suspected attack was investigated and subsequently blocked. The machine remained under constant monitoring, and while the second outage was indeed separate, it was so close to the first that most users believed it to be one continuous outage. Onsite techs and techs on remote console alike were working to fight what was later discovered to be a DDoS. Its repeated attacks caused outages, the outages caused the hardware to degrade, and the degraded hardware caused more outages. Eventually the node could not be repaired and a backup restore began.
The cdpc restore, which takes longer than the old rsync but is far more comprehensive and reliable, was started, and at this point we also realized we should take steps to speed the process up. After all, it had already been a solid day of ups and downs and nobody can afford downtime. We devised a temporary workaround, which is very clever I must confess, that allowed us to keep the node alive while doing the restores, and we did not experience any other outages during the restore time. Of course, restoring the sheer amount of data these 22 VEs had on the node was still time consuming. Once it was complete we had the VEs split up across other nodes so that very few kept the same neighbors. If the DDoS returned, it would at least no longer affect the same clients but rather a much smaller segment of just a few clients. It did not prevail, we did, and the services were all working.
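For anyone curious what "split up so very few keep the same neighbors" looks like in practice, here is a minimal sketch. It is not our actual tooling: the function name, the VE and node names, and the count of 8 target nodes are all made up for illustration. It simply deals the VEs out round-robin so that no target node receives more than its fair share of the original group.

```python
from collections import defaultdict


def split_across_nodes(ves, target_nodes):
    """Deal VEs out round-robin over the target nodes (hypothetical helper)."""
    if not target_nodes:
        raise ValueError("need at least one target node")
    placement = defaultdict(list)
    for i, ve in enumerate(ves):
        # Consecutive VEs land on different nodes, wrapping around at the end.
        placement[target_nodes[i % len(target_nodes)]].append(ve)
    return dict(placement)


# 22 VEs as in the incident; the names and the 8 spare nodes are assumptions.
ves = [f"ve{n:02d}" for n in range(1, 23)]
nodes = [f"node{n}" for n in range(1, 9)]
plan = split_across_nodes(ves, nodes)

for node, assigned in plan.items():
    print(node, assigned)
```

With more spare nodes the groups get smaller, which is the reason for increasing the number of spares.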
What went wrong and how do we prevent it?
More time was spent troubleshooting the initial findings than senior techs should have allowed. After the first sign of trouble, rather than the second or a repeat, we should have begun a quick split of clients to other nodes. We couldn't predict that the hardware would fail as a result of the attack causing repeated rapid reboots, but it should have been a known possibility and eliminated early. We have adopted some new procedures and guidelines to address how and when things are split. We are also increasing the number of spares even further to accommodate any future split. We will also be promoting more of the failover options for all products, especially to anyone who draws revenue or runs professional business on a service like VPS.
The other area that was unsatisfactory was the public and private mailing and forum updates on the issue. We used to see this long ago, and we have established protocols for these things. These were ignored by certain people and this issue has been strongly addressed. A revised set of procedures is in place along with certain new responsibilities. Software is also being updated to support new warning methods, SMS/MMS, and alert notices when updates have not been posted at set intervals.
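The "alert for lack of updates" idea can be sketched as below. This is only an illustration of the check, not the software being updated: the function name and the 30-minute interval are assumptions, and the timestamps in the example are arbitrary.

```python
from datetime import datetime, timedelta


def update_overdue(last_update, now, interval=timedelta(minutes=30)):
    """Return True when no status update has been posted within the interval.

    The 30-minute default is an assumed value, not a stated policy.
    """
    return now - last_update >= interval


# Arbitrary example timestamps.
last_post = datetime(2024, 1, 1, 12, 0)
print(update_overdue(last_post, datetime(2024, 1, 1, 12, 10)))
print(update_overdue(last_post, datetime(2024, 1, 1, 12, 45)))
```

A check like this would run on a timer during an open incident and page whoever is responsible for posting updates.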

