Lessons Learned About Emergency Response Guides
There is no one-size-fits-all emergency response guide, and general templates are useful only as a framework. What actually saves you at a critical moment are the local rules you have accumulated for your own business scenarios, the ones you learned by stepping into the pitfalls yourself.
Seven days before Double 11 last year, our core payment link suddenly went down. We opened the official emergency guide, and its first item read "locate the root cause first to avoid secondary failures." Three developers stared at the logs for 20 minutes and still could not tell where the problem was. Meanwhile more than 3,000 refund requests had piled up in the backend, and the operations staff in the group chat were so anxious that their voices were breaking. In the end, Brother Li, an operations and maintenance veteran with 10 years of experience, ignored the guide and switched directly to the backup payment link. Business was restored in 1 minute and 20 seconds. The post-incident review found that the third-party payment institution had quietly updated its SSL certificate without notifying us, and the old certificate on our side failed verification, which is why payments got stuck.
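The root cause here, a counterparty certificate that changed without notice, is the kind of thing a small scheduled check can surface early. Below is a minimal sketch in Python, assuming you have recorded a SHA-256 fingerprint of the provider's certificate ahead of time. The function names are hypothetical, and this is not what the team in the story actually ran.

```python
import hashlib
import ssl


def fingerprint(der_cert: bytes) -> str:
    """Return the SHA-256 fingerprint of a DER-encoded certificate as hex."""
    return hashlib.sha256(der_cert).hexdigest()


def fetch_der_cert(host: str, port: int = 443) -> bytes:
    """Fetch the server's leaf certificate (PEM) and convert it to DER."""
    pem = ssl.get_server_certificate((host, port))
    return ssl.PEM_cert_to_DER_cert(pem)


def certificate_changed(recorded_fingerprint: str, der_cert: bytes) -> bool:
    """True if the certificate presented now differs from the recorded one."""
    return fingerprint(der_cert) != recorded_fingerprint.lower()
```

Run on a schedule against the provider's host, a `True` result is not necessarily a fault. It is a prompt to check whether anything on your side still depends on the old certificate.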
Interestingly, industry thinking on emergency response has split into two camps that are at loggerheads. One is the root-cause-first camp, favored mostly by companies that provide To B (business-facing) cloud services. They hold that acting before the root cause is found can easily turn a small fault into a major accident. A friend of mine once hit a storage failure and switched traffic over without a clear investigation, which took the backup cluster down as well. The entire site was down for 3 hours, and the company paid its customers tens of millions in liquidated damages. The core logic of this camp is stability: they would rather stay down ten minutes longer than risk the fault spreading. The other is the stop-the-bleeding-first camp, favored mostly by To C (consumer-facing) Internet companies. They hold that every minute of business interruption is real money lost, so restoring first and reviewing afterwards is the right way. If we had waited to find the root cause of our payment failure before acting, the Double 11 warm-up would have been ruined, and the liquidated damages owed to merchants alone would have run into the millions.
When I first entered the industry, I was a firm believer in root causes first. I felt that acting before finding the problem was operating blind. It was not until my boss chewed me out over the payment failure that I slowly realized neither idea is right or wrong. It depends on the RTO (recovery time objective) of your business. If you run a 120 emergency dispatch system, where someone may die if it stops for 10 seconds, do not hesitate: restore service first by whatever means works. If you run an internal attendance system that no one would even notice being down for two hours, it is perfectly fine to dig through the logs slowly until you find the root cause.
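The RTO rule of thumb above can be written down as a tiny decision helper. Below is a minimal sketch in Python, assuming you can put a rough number on how long diagnosis is likely to take. The function name, parameters, and threshold logic are illustrative, not taken from any standard.

```python
from enum import Enum


class Strategy(Enum):
    RESTORE_FIRST = "restore first, review afterwards"
    DIAGNOSE_FIRST = "find the root cause before acting"


def choose_strategy(rto_seconds: float,
                    elapsed_seconds: float,
                    estimated_diagnosis_seconds: float) -> Strategy:
    """Pick a strategy by comparing the remaining RTO budget to diagnosis time.

    Assumption: if diagnosis is expected to take at least as long as the
    time left before the RTO is breached, restoring service comes first.
    """
    remaining = rto_seconds - elapsed_seconds
    if estimated_diagnosis_seconds >= remaining:
        return Strategy.RESTORE_FIRST
    return Strategy.DIAGNOSE_FIRST
```

With a 10-second RTO almost any diagnosis estimate yields `RESTORE_FIRST`, while a two-hour RTO with a 30-minute estimate yields `DIAGNOSE_FIRST`. The value lies less in the code than in forcing the team to agree on the RTO before the incident.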
Many people also think the emergency guide is purely a matter for the technical department, and that it is finished once it is filled with commands and operating steps. This is not true. Not long ago, the cold chain failed at the fresh food warehouse of our community group-buying business. The technical team corrected the inventory in 10 minutes. But customer service did not know how to respond to user complaints, and public relations had not prepared a statement template in advance. In the end, users berated the company on social platforms for two full days, we lost hundreds of thousands of followers, and the damage was more than three times that of the cold chain failure itself. We are now revamping the emergency guide and having people from customer service, public relations, administration, and even the canteen write it together. The customer service response scripts are attached at the back. The first draft of the public relations statement is written in advance, with blanks left for the time and the reason. There is even a note saying "If something goes wrong in the early morning, order a hot drink for the person on duty first. If your hands are not shaking, efficiency will be at least 30% higher." This kind of content seems to have nothing to do with technology at all.
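The pre-written statement with blanks for the time and the reason maps naturally onto a fill-in template. Below is a minimal sketch using Python's standard `string.Template`. The wording of the draft and the placeholder names are made up for illustration.

```python
from string import Template

# Hypothetical draft; the real wording belongs to the public relations team.
STATEMENT_DRAFT = Template(
    "At $time we experienced a service problem caused by $reason. "
    "Service has been restored and affected users will be contacted."
)


def fill_statement(time: str, reason: str) -> str:
    """Fill in the blanks; substitute() raises KeyError if one is missing."""
    return STATEMENT_DRAFT.substitute(time=time, reason=reason)
```

Using `substitute` rather than `safe_substitute` is a deliberate choice in this sketch: a statement that goes out with an unfilled blank is worse than an error at drafting time.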
Now when I read our team's emergency guide, I never start with the high-sounding principles at the front. I go straight to the red annotations at the back, which are the notes everyone has added after each failure, such as "Don't take an operations engineer's word that he definitely didn't delete the database. Checking the operation log first beats anything else." "When you are hit by a DDoS attack, calling the network carrier first is more effective than handling it yourself." "If a user is dissatisfied, giving him a 5-yuan coupon is more effective than spending ten minutes explaining the technical principles." These homespun methods, which appear in no official general guide, are the real lifelines, tested in actual combat.
To put it bluntly, emergency response is not a technical exercise with a perfect answer. It is essentially about making the best available choice with incomplete information, too little time, and maximum pressure. If you simply follow the standard answers, you will easily walk into a trap. Stepping into a few more pitfalls and accumulating more experience is more useful than memorizing ten emergency guides.