Errors and Exception Handling

Monitor proactively and analyze logs for potential and actual errors, so that the product recovers gracefully and support teams have the diagnostic information they need.

Why

Regardless of how well you test a product, you must anticipate and prepare for unhandled errors in production. Such errors can stem from unplanned user behavior, missed test scenarios, or data incompatibility. Good products recover from errors and give the team all the information needed to trace the error and help users complete the interrupted task.

There are 3 levels of problems your product should handle:

  1. Validation failures: Issues that can be avoided through validations and checks, so that no error occurs (e.g. user inputs).
  2. Handled Exceptions: Issues that are tackled through coded exception handling routes so the product recovers gracefully (e.g. a third-party service is not available).
  3. Unhandled Errors: Issues that are not anticipated and are not handled by any exception handling route (e.g. an unplanned special character in data causing a data exchange protocol failure).
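
The three levels can be sketched in a few lines of Python. This is only an illustration under assumed names: `ServiceUnavailable`, `validate_quantity`, `fetch_price`, and `order_total` are hypothetical and do not come from any particular library.

```python
class ServiceUnavailable(Exception):
    """Hypothetical error raised when a third-party price service is down."""


def validate_quantity(raw):
    """Level 1: reject bad user input before it can cause an error."""
    try:
        quantity = int(raw)
    except ValueError:
        return None, "Quantity must be a whole number."
    if quantity <= 0:
        return None, "Quantity must be greater than zero."
    return quantity, None


def fetch_price(client, item_id, fallback):
    """Level 2: an anticipated failure with a coded recovery route."""
    try:
        return client(item_id)
    except ServiceUnavailable:
        # Recover gracefully, for example with a cached price.
        return fallback


def order_total(raw_quantity, item_id, client, cached_price):
    quantity, error = validate_quantity(raw_quantity)
    if error:
        return {"ok": False, "message": error}
    price = fetch_price(client, item_id, fallback=cached_price)
    # Level 3: anything not anticipated above (for example, the client
    # returning malformed data) is not caught here and propagates upward.
    return {"ok": True, "total": quantity * price}
```

Only the first two levels are visible in the code. The third is, by definition, whatever the code does not anticipate, which is why it needs separate detection.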

Your product needs the capability to detect type-3 issues (unhandled errors) when they occur. Never rely on already frustrated users to take screenshots or send you error codes. Error handling must be automated, and your team should proactively respond to users.
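
One way to detect unhandled errors automatically in a Python process is a last-resort hook. The sketch below uses the standard `sys.excepthook` and `logging` modules; the logger name `app.unhandled` and the function name `report_unhandled` are assumptions, and a real setup would likely forward the record to an error-tracking tool rather than only logging it.

```python
import logging
import sys

logger = logging.getLogger("app.unhandled")  # hypothetical logger name


def report_unhandled(exc_type, exc_value, exc_traceback):
    """Record any exception that no handler in the application caught."""
    if issubclass(exc_type, KeyboardInterrupt):
        # Let Ctrl+C behave as usual instead of reporting it as an error.
        sys.__excepthook__(exc_type, exc_value, exc_traceback)
        return
    logger.critical(
        "Unhandled error",
        exc_info=(exc_type, exc_value, exc_traceback),
    )


# Install the hook so uncaught exceptions in the main thread are reported.
sys.excepthook = report_unhandled
```

Note that `sys.excepthook` covers only the main thread. Exceptions in other threads go through `threading.excepthook` (Python 3.8 and later), and web frameworks usually provide their own top-level error handlers.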

How

  • Treat production errors as a high priority across the team. Mandate that every error is surfaced and dealt with immediately.
  • Follow a standard process on how errors are captured, communicated, and fixed. Document this process and keep all stakeholders informed so that the customer experience is uniform.
  • Use automation tools to capture, monitor, and report production errors. Increase error visibility within the organization (through dashboards, etc.) to ensure errors don’t go unnoticed.
  • Enable your application to log diagnostic information on product health to an easily accessible location. Make sure that the logs are fresh and relevant. Proactively analyze this information to find any anomalies.
  • Set up your tools so that they capture not only the stack trace but also contextual information such as release version, account id, data status, etc., to make it easier to identify the cause of an error.
  • Proactively respond to users when they have faced an error. Through your customer support organization, help users recover any losses they may have incurred.
  • Audit the information exposed on an error condition. Attackers may use exposed information, such as stack traces, to learn internal details of the system.
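
Two of the points above, capturing context alongside the stack trace and limiting what is exposed to the user, can be combined in a single error handler. The following is a minimal sketch, not a definitive implementation: `RELEASE_VERSION`, `handle_error`, the logger name, and the field names are all hypothetical.

```python
import json
import logging
import traceback
import uuid

logger = logging.getLogger("app.errors")  # hypothetical logger name

RELEASE_VERSION = "1.4.2"  # hypothetical; normally injected at build time


def handle_error(exc, account_id):
    """Log full diagnostics internally; return a safe message for the user."""
    error_id = uuid.uuid4().hex
    record = {
        "error_id": error_id,
        "release": RELEASE_VERSION,
        "account_id": account_id,
        "type": type(exc).__name__,
        "message": str(exc),
        "stack": traceback.format_exception(type(exc), exc, exc.__traceback__),
    }
    # The full record goes to the internal log only.
    logger.error(json.dumps(record))
    # The user sees no stack trace or internal detail, only a reference
    # that support can use to look up the logged record.
    return {
        "message": "Something went wrong. Our team has been notified.",
        "reference": error_id,
    }
```

Returning the same `error_id` that was logged lets the support team find the exact record when they contact the user, without the user needing to send a screenshot.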

References/further readings