Dealing with (Personal) Infrastructure Failure

“If you plan for the worst, all surprises are pleasant.”
Gaul in “The Wheel of Time” by Robert Jordan

I read an interesting article by Ariel Tseitlin that was just published in the “Communications of the ACM”. It’s titled “The Antifragile Organization”. He talks about how they (at Netflix) deliberately induce failure in the infrastructure to test whether the system is resilient enough (due to redundancy and fault tolerance) to deal with it. It’s an interesting text in its own right: the idea of running programs that deliberately but reversibly damage the infrastructure is impressive, e.g., by terminating virtual instances, connections to data centers, or (in the future) even entire regions. They put their infrastructure under stress to see whether it is resilient when real failures occur. And in case something goes wrong(er than expected), they can terminate these programs anytime (in contrast to real failures).

I think this approach is applicable to one’s own personal infrastructure: everything from accidentally deleted files, selective or full loss of network connectivity, and forgotten/lost display adapters in presentations to a hard-disk crash, theft, or the apartment burning down.

These things are (usually) pretty rare, giving us few ‘opportunities’ to test our resilience. Of course, at times we think about worst cases (and probably do backups or that second check whether the video adapter is really in the bag), but these failures are not introduced randomly. And sure, if you try to prepare for everything you won’t have time for anything else (e.g., work) and will probably become pretty paranoid. But a few things should be prepared for.

It would be nice if there were an app that ran in the background and, at times, gave you a simulated scenario (e.g., “the file you are working on at the moment was not saved correctly and is corrupted; it is beyond repair”) and let you deal with it for a moment. If you don’t have the time at that moment, you can ask for it to occur later, but if you take the time you can ask the right questions: Now what? Is that a problem for me? Do I have prior versions saved? How much work is lost since I last created a prior version? Would I be able to finish my work on time if it had really happened? You could then tell the app either that you would have dealt with it or that it would be a non-problem (that scenario would then occur less frequently), or that it would have been a serious problem (which would create a todo task to find a solution in the future). Hmm, I’ll probably include such a function in my Task Manager App (if I ever get around to programming it; the planning stage is almost finished).
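
To make the idea a bit more concrete, here is a minimal sketch of the feedback loop in Python. It is not the planned Task Manager App; the names `Scenario`, `DrillApp`, and the answers "later", "handled", and "serious" are made up for illustration, and halving the weight of a handled scenario is just an assumed way of making it "occur less frequently".

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Scenario:
    text: str
    weight: float = 1.0  # relative chance of being drawn (assumed scheme)


@dataclass
class DrillApp:
    scenarios: List[Scenario]
    todos: List[str] = field(default_factory=list)
    postponed: List[Scenario] = field(default_factory=list)

    def next_drill(self, rng: random.Random) -> Scenario:
        """Return a postponed drill if there is one, else draw by weight."""
        if self.postponed:
            return self.postponed.pop(0)
        weights = [s.weight for s in self.scenarios]
        return rng.choices(self.scenarios, weights=weights, k=1)[0]

    def respond(self, scenario: Scenario, answer: str) -> None:
        """Record how the user would have coped with the scenario."""
        if answer == "later":
            # No time right now: ask again at the next opportunity.
            self.postponed.append(scenario)
        elif answer == "handled":
            # Non-problem: show it less often (factor 0.5 is an assumption).
            scenario.weight *= 0.5
        elif answer == "serious":
            # Serious problem: create a todo to find a solution.
            self.todos.append(f"Find a solution for: {scenario.text}")
        else:
            raise ValueError(f"unknown answer: {answer!r}")


app = DrillApp([
    Scenario("The file you are working on is corrupted beyond repair."),
    Scenario("Your notebook battery dies (no power cable/plug)."),
])
drill = app.next_drill(random.Random())
app.respond(drill, "serious")
print(app.todos)
```

A real app would also need a scheduler that decides when to interrupt you; that part is left out here.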

Even if that event never happens, chances are that over time you become better at solving problems and are more relaxed and competent when real problems arise.

Until then, you can do the following. Take a book, any book that has page numbers. Close your eyes, flip through the pages, and stop at a random page. Depending on the last digit of the page number, select one of the areas below (using the last digit rather than the first should reduce the effect of Benford’s Law). Then do it again to select the concrete case (due to the numbering style, a last digit of 0 counts as 10 here).

  1. Interrupted Morning Routine
    1. Your cellphone/smartphone battery is empty (e.g., did not charge overnight).
    2. Your deodorant is empty (no reserves).
    3. There is no hot water.
    4. There is no water at all (a pipe broke somewhere else).
    5. There is too much water (a pipe broke somewhere near, very near).
    6. You are delayed 15 minutes (e.g., spilled something that has to be cleaned).
    7. You are delayed an hour (e.g., overslept).
    8. You can’t find your keys.
    9. You can’t find clean clothes.
    10. Your bootlace breaks.
  2. Transportation
    1. Your car does not start./Your usual means of transportation is delayed for an hour.
    2. Your car stops working on the road (or the train breaks down).
    3. Your car is gone./Your usual means of transportation is unavailable for the day.
    4. Your car has a smashed-in window (and everything valuable was stolen).
    5. There is a serious traffic jam (or all trains are delayed for an hour).
    6. You get into a car/train crash.
    7. You arrive at your destination, your luggage took the long road (delayed for a day).
    8. You arrive at your destination, your luggage vanished on the way.
    9. Your hand luggage is gone (e.g., stolen, or lost due to an accident).
    10. You arrive in the wrong city (e.g., overslept on the train, plane had to be re-routed).
  3. Physical/Virtual Access
    1. The office door won’t open.
    2. The building cannot be entered for an hour/day (fire alarm).
    3. The building is gone (e.g., fire).
    4. There was a break-in, everything valuable in your office was stolen.
    5. Another room you use is not available for the day.
    6. A whole section of the building and its technology is not available for the day (including servers).
    7. Intranet does not work for an hour.
    8. Intranet does not work for the day.
    9. Internet does not work.
    10. All cloud data is gone/corrupted (data center burned down, hacking, etc.).
  4. Personal Access
    1. Your boss is unexpectedly unavailable for a few hours (despite his/her promises that s/he would be there) and you need his/her signature.
    2. See 1, but for the day. Can be reached virtually (phone, eMail).
    3. See 2, but cannot be reached virtually (e.g., was involved in accident).
    4. Your boss died. [might come across wrong, but it happened to a PhD student I know of; her supervisor died during her PhD]
    5. Someone who works for you is unexpectedly unavailable for a few hours and you need this person’s input.
    6. See 5, but for the day. Can be reached virtually (phone, eMail).
    7. See 6 but cannot be reached virtually (e.g., was involved in accident).
    8. The person who works for you died. [again, might come off wrong, but accidents happen, that’s the point]
    9. A collaboration partner quits suddenly and irrevocably (e.g. fraud case, bankruptcy, etc.).
    10. The one person who knows the workings of the office inside out (usually, the secretary) cannot be reached for the day/week/month or is completely unavailable.
  5. Cellphone/Smartphone
    1. Your battery is suddenly empty during the day.
    2. Your cellphone/smartphone gets misplaced for an hour.
    3. You lose your cellphone (and it will never come back).
    4. Your cellphone is stolen by someone with evil intent.
    5. The data on your cellphone is wiped.
    6. You have no cellphone connection.
    7. You have no Internet connection.
    8. Your Internet connection is excruciatingly slow.
    9. Your battery will be empty in 10 minutes.
    10. The last used App is gone/deleted/corrupted.
  6. Notebook/Computer
    1. Your notebook battery dies (no power cable/plug).
    2. You do not have access for an hour (important systems check/virus scan).
    3. You do not have access for the day (non-memory related technical issue).
    4. The hard disk crashed (all data on the device is lost).
    5. Your notebook is stolen (and won’t come back).
    6. Your notebook is stolen by someone with evil intent [any personal photos or “inappropriate” diary entries? Or data that could help the competition if given/sold to them?].
    7. Your writing software does not work.
    8. Your data is corrupted (and was for the last three backups).
    9. The file you are working on at the moment is suddenly broken.
    10. The notebook switches off suddenly (all unsaved information is lost).
  7. Home
    1. Someone breaks into your home and steals all valuables.
    2. Someone breaks into your home (with evil intent).
    3. Your home is gone (e.g., fire).
    4. Your home is suddenly inaccessible for a day (e.g., environmental issue/natural disaster without time prior to evacuation).
    5. In 30 minutes, your home will become inaccessible for a day (e.g., environmental issue/natural disaster that gives you 30 minutes prior to evacuation).
    6. Your home is suddenly not accessible for a week or month (e.g., environmental issue/natural disaster without time prior to evacuation).
    7. In 30 minutes, your home will become inaccessible for a week or month (e.g., environmental issue/natural disaster that gives you 30 minutes prior to evacuation).
    8. Your house is on fire and you have 2 minutes [smoke detector, anyone? Most people die due to the smoke, which kills you while you sleep. You can get detectors on Amazon, with a magnetic holder that is glued to the ceiling and allows removal; worth a thought].
    9. Your house is on fire and, for 2 minutes, you only have access to the room you are in.
    10. Your house has to be sold/you lose your apartment.
  8. Health/Safety
    1. You have sore muscles and move at half speed.
    2. You have a problem with your back and cannot bend or sit down.
    3. You break your right arm.
    4. You break your left arm.
    5. You break both arms.
    6. You break your leg(s).
    7. You have a splitting headache/migraine.
    8. You are confined to the hospital for the day (telephone only, no cellphones allowed).
    9. You are confined to the hospital for a month (telephone only, no cellphones allowed, you might get your/a laptop, if someone can bring it to you).
    10. You have a serious accident and are unconscious for a week/month (or even die).
  9. Financial
    1. You lose your wallet.
    2. You lose half of your savings.
    3. You lose all of your savings (e.g., due to bank fraud).
    4. You lose access to your credit card.
    5. You lose access to your credit card (while being in another city/abroad).
    6. You lose access to money that is dependent on banks (e.g., due to an accounting error you suddenly have a very bad credit rating).
    7. You need money for an important medical procedure (what would be a problem: $500, $1000, $2000?).
    8. and higher: flip again.
  10. Public Issues (you have a presentation/important meeting soon)
    1. You/Someone spilled coffee over your clothes.
    2. You have a migraine.
    3. The video adapter cable is missing.
    4. You will be 15 minutes late.
    5. You will not be able to attend [could someone else do it for you?]
    6. Someone you depend on is absent.
    7. Someone you depend on is hostile (person or audience).
    8. You have a bladder infection/stomach irritability/etc. [drugs available? scheduling?]
    9. You have no Internet connection (despite being promised one).
    10. You cannot use your slides (projector is broken and cannot be repaired).
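
If no book is at hand, the two flips can be simulated. The following is only a sketch under assumptions: a 300-page book, every page equally likely to come up (a real flip will favor the middle of the book), and made-up names such as `digit_to_index` and `draw`. It also counts first and last digits of the page numbers to show why the last digit is the better choice.

```python
import random
from collections import Counter

# Number of cases per area, in the order of the list above.
# Area 9 (Financial) has only 7 cases, so 8 and higher mean "flip again".
CASES_PER_AREA = [10, 10, 10, 10, 10, 10, 10, 10, 7, 10]


def digit_to_index(page_number: int) -> int:
    """Map a page number to 1..10 via its last digit (0 counts as 10)."""
    last = page_number % 10
    return 10 if last == 0 else last


def draw(flip) -> tuple:
    """Return (area, case); `flip` is any function returning a page number."""
    area = digit_to_index(flip())
    while True:
        case = digit_to_index(flip())
        if case <= CASES_PER_AREA[area - 1]:
            return area, case
        # Case number does not exist in this area: flip again.


# Digit counts for an assumed 300-page book.
pages = range(1, 301)
first = Counter(int(str(p)[0]) for p in pages)
last = Counter(p % 10 for p in pages)
print(sorted(first.items()))  # 1 and 2: 111 each, 3: 12, 4 to 9: 11 each
print(sorted(last.items()))   # every digit 0 to 9: 30 times

area, case = draw(lambda: random.randint(1, 300))
print(f"Area {area}, case {case}")
```

With first digits, areas 1 and 2 would come up about ten times as often as areas 4 to 9; with last digits, every area is equally likely.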

Note that I didn’t do any research; these are just things that came to me while writing (yup, I’m a ‘let’s plan for the worst’ kind of guy who then, mostly, enjoys the situation; something is always running in the back of my mind … anyway). Given the topic, I also went broader and included work-related issues. After all, many creative projects happen in organizations, for example, a PhD in a department.

If a category is not relevant, simply limit the selection to the relevant categories. If there are multiple possible cases (e.g., more than one person works for you), simply write down their names and flip through the book again to choose one (or more). You can also choose more than one accident/problem. After all, they are not always independent of each other.

You could also use this list as a checklist (without any warranty that everything important is covered; likely it is not), if that does not make you paranoid. And sometimes the right reaction is probably just that it can happen, but that you can deal with it.

BTW, if you know of cases that can happen, drop me a comment or an eMail. I’m happy to hear about them (what they were, not that they happened to you ;-)).

The next list will probably consist of useful items and strategies that cover most of these issues. I can think of a few — but many require some thinking and have to be implemented in advance.

Literature that inspired this posting (about technical infrastructure):