Chapter 7. Software Troubleshooting

I'm RARE

This is the reciprocal counterpart to the five golden rules of troubleshooting (Volume 1, page 590). Whereas those rules are for artefact submitters (the internal and external customers of memory dump analysts and complex trace readers), I'm RARE is a set of rules for writing analysis reports, with an easy-to-remember mnemonic:

I'm RARE - Iridium Rules of Analysis Report Excellence

Note about Iridium metal from Wikipedia: "It is one of the rarest elements in the Earth's crust, with annual production and consumption of only three tonnes."

Here are the five rules (subject to change):

  1. Use a template.

  2. Structure a report according to audience technical level and organizational processes.

  3. Use checklists not only for commands and tools but also for things to avoid in reports and things to encourage.

  4. Include all relevant data so that it can be searched later and so that other engineers can reproduce the analysis.

  5. Provide appropriate explanations and narrative in cases where the analysis is inconclusive.

To Bugcheck or Not To Bugcheck

This "Hamlet's Question" of software technical support is often asked, and unfortunately sometimes not asked at all, when troubleshooting and debugging complex enterprise environments. For applications, the question of saving crash dumps is trivial: if a process is not in memory and is not visible in Task Manager, we won't be able to dump it manually. With an OS that is always running, even when hanging, the question often degenerates to "Let's bugcheck and send the crash dump to dump file divers". After that decision, huge amounts of energy are spent collecting, sending and storing gigabytes of data, always with very little or no return. Therefore, here is a preliminary list of symptoms for which manual system dumps are appropriate and for which they are not:

When a manual system dump is appropriate

  • The system hangs visually (no GUI activity possible)

  • No connections or logins are possible

  • Abnormal system metrics (like pool, thread or process number growth)

  • Insufficient system or session memory

When a manual process user dump is more appropriate than a complete memory dump

  • Process hangs visually (other applications work as normal)

  • Error message box appears

  • Abnormal process metrics (like process memory growth or handle leaks)

When manual kernel and complete memory dumps are almost useless (I say "almost" because in rare circumstances they can aid problem resolution, so it is better not to collect them until explicitly asked by a skilled memory dump file diver)

  • Application failures that result in the application disappearing from the list of running processes

  • Functional bugs (dynamic activity that requires historical tracing of events)
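The three lists above amount to a lookup from an observed symptom to the kind of artefact worth collecting. The following is a minimal sketch of that lookup, not a definitive triage tool: the names Symptom, Artefact, RECOMMENDATION and recommend are hypothetical and introduced only for illustration, and the mapping simply restates the bullets above.

```python
from enum import Enum, auto


class Artefact(Enum):
    """What to collect (hypothetical names, for illustration only)."""
    SYSTEM_DUMP = auto()            # manual kernel or complete memory dump
    PROCESS_DUMP = auto()           # manual process user dump
    NO_MANUAL_SYSTEM_DUMP = auto()  # wait until an analyst explicitly asks


class Symptom(Enum):
    """Symptoms taken from the three lists in the text."""
    SYSTEM_HANG = auto()               # no GUI activity possible
    NO_CONNECTIONS_OR_LOGINS = auto()
    ABNORMAL_SYSTEM_METRICS = auto()   # pool, thread or process number growth
    INSUFFICIENT_MEMORY = auto()       # system or session memory
    PROCESS_HANG = auto()              # other applications work as normal
    ERROR_MESSAGE_BOX = auto()
    ABNORMAL_PROCESS_METRICS = auto()  # process memory growth, handle leaks
    PROCESS_DISAPPEARED = auto()       # no longer in the process list
    FUNCTIONAL_BUG = auto()            # needs historical tracing of events


# The mapping restates the lists above; it adds no new rules.
RECOMMENDATION = {
    Symptom.SYSTEM_HANG: Artefact.SYSTEM_DUMP,
    Symptom.NO_CONNECTIONS_OR_LOGINS: Artefact.SYSTEM_DUMP,
    Symptom.ABNORMAL_SYSTEM_METRICS: Artefact.SYSTEM_DUMP,
    Symptom.INSUFFICIENT_MEMORY: Artefact.SYSTEM_DUMP,
    Symptom.PROCESS_HANG: Artefact.PROCESS_DUMP,
    Symptom.ERROR_MESSAGE_BOX: Artefact.PROCESS_DUMP,
    Symptom.ABNORMAL_PROCESS_METRICS: Artefact.PROCESS_DUMP,
    Symptom.PROCESS_DISAPPEARED: Artefact.NO_MANUAL_SYSTEM_DUMP,
    Symptom.FUNCTIONAL_BUG: Artefact.NO_MANUAL_SYSTEM_DUMP,
}


def recommend(symptom: Symptom) -> Artefact:
    """Return the artefact the lists above suggest for a given symptom."""
    return RECOMMENDATION[symptom]
```

For instance, recommend(Symptom.PROCESS_HANG) yields Artefact.PROCESS_DUMP, which spares a complete memory dump when only one application is affected.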

Note: 3rd-party kernel-mode software developers should not have to face this question during the development of their drivers; they should delegate the responsibility for difficult bugcheck or panic decisions to the operating system. Surely Windows core developers face this question too.

T&D Labyrinth

Here is a picture of the troubleshooting and debugging labyrinth. It rests on the notion of universal memory dumps, which are observational snapshots, and it includes both memory and the various traces we collect to resolve problems.

[Figure: T&D Labyrinth]

This picture shows the possible paths by which we arrive at a problem resolution. For example:

[Figure: T&D Labyrinth with an example path]

Efficient vs. Effective: DATA View

DATA (Dump Artefact + Trace Artefact) -> DATA (Dump Analysis + Trace Analysis) examples:

  1. Efficient

    • My 64 GB server bluescreens. I set a complete memory dump option in Control Panel.

    • A user cannot connect. I started tracing yesterday. Stopped today.

    • I analyze all these artefacts every day.

  2. Effective

    • My 64 GB server bluescreens. I set a kernel memory dump option in Control Panel.

    • A user cannot connect. I started tracing, tried to connect, stopped tracing.

    • I analyze all these artefacts every day and write articles to reduce DATA load.
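The difference between the two tracing examples can be sketched as a trace window that is opened immediately before the reproduction attempt and closed immediately after it. This is only an illustrative sketch: capture_scoped_trace and its three callable parameters are hypothetical names, and the real start and stop actions depend on whichever tracing tool is in use.

```python
def capture_scoped_trace(start_trace, reproduce, stop_trace):
    """Start tracing, attempt the reproduction once, then stop tracing.

    All three arguments are caller-supplied callables (hypothetical
    hooks for whatever tracing tool is in use). Tracing is stopped even
    if the reproduction attempt raises, so the trace never keeps
    running until the next day.
    """
    start_trace()
    try:
        # The trace only needs to cover this one attempt,
        # for example a single failed connection.
        return reproduce()
    finally:
        stop_trace()
```

A caller would pass the functions that start and stop the chosen tracing tool, plus a function that performs the failing action, such as trying to connect. The resulting trace covers one attempt instead of a full day of unrelated activity.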
