Best Practices for Managing Failures in E-commerce Transactions

Written by iSeatz | Feb 19, 2019 3:49:25 PM

As the leading Ancillary Management System (AMS) in the industry, the iSeatz OneView platform sits between, and communicates with, a large number of systems in order to power travel e-commerce 24/7/365. Whether our platform is interacting with Global Distribution Systems (GDS), Online Travel Agencies (OTA), Payment Gateways, or Loyalty Engines, the common glue is found in the web services and Application Programming Interfaces (APIs) used to enable near real-time transactions.

While we’d like to think that the happy path is the only path in these interactions, the reality is that failures are bound to appear when dealing with multiple, disparate systems coordinating with one another to facilitate thousands of transactions per hour. Given the inevitability of failures, it’s critical that the platform powering your business has the necessary logic and functions to handle them as gracefully as possible.

In this post we’ll start with a few key concepts that demonstrate the iSeatz approach to trapping and handling the various failures that commonly occur in the world of high-traffic e-commerce.

Failure vs Errors or Warnings

Let’s start with some basic terminology. We’ll use the term failure to denote "the thing that went wrong." In contrast, the term error is used to mean "a description or representation of the thing that went wrong." In other words, when an application or service returns errors, these errors are not the failures themselves. This distinction is subtle yet important. Errors are representations (or sometimes misrepresentations) of failures, and all failures should be captured as errors.

We can further distinguish between the error and its close cousin, the warning. While an error means "a failure prevented the transaction from being performed", a warning means "the transaction could be performed, but had some problems."

As an example, let’s suppose that a flight booking was successful but the requested seat did not get reserved. Although we don’t want a seat request failure to stop the overall booking, we do want to alert the end user and/or system admin, or provide an opportunity to try again via a different method. A warning is appropriate in this case.
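The seat example can be made concrete with a small sketch. This is a minimal illustration, not the iSeatz response format: the names `Message`, `BookingResponse`, `performed`, and the code `SEAT_NOT_RESERVED` are assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Message:
    """A description of a failure (hypothetical shape), not the failure itself."""
    code: str
    text: str


@dataclass
class BookingResponse:
    """Hypothetical response: errors block the transaction, warnings do not."""
    confirmation_number: str = ""
    errors: List[Message] = field(default_factory=list)
    warnings: List[Message] = field(default_factory=list)

    @property
    def performed(self) -> bool:
        # The transaction counts as performed only if no error was recorded.
        return not self.errors


# The flight was booked but the seat request failed: a warning, not an error.
response = BookingResponse(confirmation_number="ABC123")
response.warnings.append(
    Message("SEAT_NOT_RESERVED", "The requested seat could not be reserved.")
)
```

Keeping errors and warnings in separate lists lets a caller check `performed` first and then decide how to surface any warnings to the end user or system admin.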

Responsibility

Nobody likes a tattle-tale, but in our case pointing an accusing finger is not only warranted but also critical to ensure that the most appropriate corrective action is applied. At iSeatz, we look at three primary sources of errors and warnings, and implement logic that can detect and notify the responsible party.

Client: The client originated or is responsible for the failure. Clients are human or machine end users of iSeatz UIs, APIs, etc. In the case of an API failure, it’s possible that a client system calling the iSeatz API is not following the specifications correctly. Perhaps a required data point was not provided, or the data was formatted incorrectly. In this case, a well-structured error message will enable the programmers responsible for the client application to correct their code to be in compliance with the API specification.
From the iSeatz perspective, we are also able to take proactive steps to make the API specification clearer if we see a high occurrence of these errors. In the case of a human end user, they could have entered an incorrect password or credit card number into a form field. With proper error messaging, the system can alert the end user to their mistake and prompt them to complete the request with the correct information. A high occurrence of these types of errors could also be a signal that the UI / UX needs improvement. The iSeatz Product team takes particular interest in these types of errors.

External: An external system originated or is responsible for the failure. “External” refers to all external systems that iSeatz integrates with through an API, including supplier systems, partner systems, payment gateways, etc. Because the iSeatz platform relies on these systems to drive business value for our clients and end users, we monitor these errors closely and ensure they’re communicated as quickly as possible to the external system contact.

Internal: An iSeatz system originated or is responsible for the failure. Given that iSeatz systems sit between a large number of Client and External systems, we can’t let ourselves off the hook, and we take particular interest in any errors or warnings that originate from the internals of our platform. The iSeatz Operations and Site Reliability teams monitor for the occurrence of these errors on an ongoing basis and respond accordingly.
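As a rough sketch of how responsibility might be attached to an error, the example below tags validation errors as the client's responsibility. The enum `Responsibility`, the function `validate_request`, and the field names in `REQUIRED_FIELDS` are hypothetical and chosen only for illustration.

```python
from enum import Enum
from typing import Dict, List


class Responsibility(Enum):
    CLIENT = "client"
    EXTERNAL = "external"
    INTERNAL = "internal"


# Hypothetical required fields for a hotel booking request.
REQUIRED_FIELDS = ("check_in", "check_out", "guest_name")


def validate_request(payload: Dict[str, str]) -> List[Dict[str, str]]:
    """Return one structured error per missing required field."""
    errors = []
    for name in REQUIRED_FIELDS:
        if name not in payload:
            errors.append({
                # A missing field means the caller did not follow the spec.
                "responsibility": Responsibility.CLIENT.value,
                "field": name,
                "message": f"Missing required field '{name}'.",
            })
    return errors
```

Naming the offending field in each error gives the client's programmers enough detail to fix their request, and counting errors per field could show where a specification needs to be clearer.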

Context

It is also important to understand the context in which the failure occurred. At iSeatz, we generate errors and warnings across three primary contextual categories:

Business Exception: The system is working correctly but there was a valid business exception. These include validation errors as well as errors related to the state of a resource, such as hotel room availability or an insufficient loyalty point balance. Most often these failures are not transient, meaning that the same request will not succeed if retried unless and until there is a change of state in the receiving system.

Environment Failure: These failures and errors are transient by their nature. The same request might succeed if retried, but the time interval before a retry will succeed depends on what system component failed, whether human intervention is required, and the nature of the recovery mechanism. Examples include failures of environment components such as databases, caches, file systems or file system permissions, out-of-memory failures, SSL certificates, and more.

Software Bug: Typically these failures are not transient, and the same request will not work on retry until the code is patched and deployed. Ideally, these types of errors should all be caught and fixed at development or QA time. Examples include null pointer exceptions and foreign key violations.
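One practical use of these categories is deciding whether a retry is worthwhile. The sketch below retries only when a failure is classified as an environment failure. The names `Context`, `may_succeed_on_retry`, and `call_with_retry`, as well as the fixed delay, are assumptions for illustration; a real system would choose the wait based on which component failed.

```python
import time
from enum import Enum


class Context(Enum):
    BUSINESS_EXCEPTION = "business_exception"
    ENVIRONMENT_FAILURE = "environment_failure"
    SOFTWARE_BUG = "software_bug"


def may_succeed_on_retry(context: Context) -> bool:
    # Of the three categories, only environment failures are transient.
    return context is Context.ENVIRONMENT_FAILURE


def call_with_retry(operation, classify, attempts=3, delay_seconds=1.0):
    """Call operation(); retry only failures that classify() marks transient.

    classify is a caller-supplied function mapping an exception to a Context.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:
            last_attempt = attempt == attempts
            if last_attempt or not may_succeed_on_retry(classify(exc)):
                raise  # Not transient, or out of attempts: surface the error.
            time.sleep(delay_seconds)
```

Business exceptions and software bugs are re-raised immediately here, since repeating the same request would not change the outcome.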

"Unsuccessful" or "Unknown"

In general, every application transaction request is one of two types, either read or write. Examples of read requests are search, shop, and retrieve reservation. Examples of write requests are save, book reservation, and cancel reservation.

Read requests say "Give me X".
Write requests say "Do X".

To be successful, read transactions require only one thing: that the requested X is returned. If it is not, then the read transaction is unsuccessful.

By contrast, many write transactions require two things: that the requested X is done and that a confirmation is returned. If a confirmation is not returned, then the client cannot be sure that X really has been done.

Therefore, every error response should indicate the result of the transaction appropriately as "Unsuccessful" or "Unknown", taking into account the transaction type, the stage of processing, and the special case of timeouts.
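The decision described above can be sketched as a small function. This is one possible reading rather than the iSeatz implementation: the parameters `is_write`, `request_sent`, and `response_received` are hypothetical stand-ins for the transaction type, the stage of processing, and whether the downstream system answered at all.

```python
from enum import Enum


class Result(Enum):
    UNSUCCESSFUL = "Unsuccessful"
    UNKNOWN = "Unknown"


def error_result(is_write: bool, request_sent: bool,
                 response_received: bool) -> Result:
    """Pick the result to report in an error response."""
    if not is_write:
        # A read that did not return X is simply unsuccessful.
        return Result.UNSUCCESSFUL
    if not request_sent:
        # The write failed before it left our system, so X was not done.
        return Result.UNSUCCESSFUL
    if response_received:
        # The downstream system explicitly reported that the write failed.
        return Result.UNSUCCESSFUL
    # The write was sent but nothing came back (for example, a timeout):
    # we cannot tell whether X was done.
    return Result.UNKNOWN
```

Only a write that was sent and never answered ends up as "Unknown"; every other error path can safely be reported as "Unsuccessful".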

Timeouts present an especially interesting scenario, particularly in the case of a write request. Consider an example in which the iSeatz platform requests a hotel reservation via a web service request to a supplier system such as a GDS. Because we can’t wait indefinitely for a response, it may be necessary to short-circuit the request after a predefined amount of time if the external system is taking too long to respond. In this case, the hotel reservation would be in an unknown state, as our systems can’t be certain whether the supplier system actually fulfilled the request. Although this represents an edge case, the handling of the customer experience is critical at this juncture. If the reservation succeeded, but the customer thinks it failed, they may attempt to create duplicate reservations. In contrast, if the reservation actually failed, but the customer thinks it succeeded, they could be in for an unpleasant surprise when they arrive at the hotel on the day of their imaginary reservation. In this case, iSeatz could message the customer appropriately, directing them to a customer service agent for further assistance.
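A minimal sketch of short-circuiting a slow supplier call follows, using Python's standard `concurrent.futures` module. The function `book_with_timeout`, the `send_booking` callable, the "Success" label, and the customer message are assumptions for illustration. As a simplification, any exception raised by the supplier call is treated as a definite rejection.

```python
import concurrent.futures


def book_with_timeout(send_booking, timeout_seconds):
    """Run send_booking() and give up waiting after timeout_seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(send_booking)
    try:
        confirmation = future.result(timeout=timeout_seconds)
        return {"result": "Success", "confirmation": confirmation}
    except concurrent.futures.TimeoutError:
        # No answer in time: the supplier may or may not have booked the room.
        return {
            "result": "Unknown",
            "customer_message": (
                "We could not confirm your reservation. Please contact "
                "customer service before booking again."
            ),
        }
    except Exception as exc:
        # Simplification: an explicit failure from the supplier call.
        return {"result": "Unsuccessful", "customer_message": str(exc)}
    finally:
        # Do not block on the still-running call; just stop accepting work.
        executor.shutdown(wait=False)
```

Note that the timed-out call keeps running in its worker thread, which mirrors the real situation: the supplier may still complete the booking after we have stopped waiting, which is why the result is reported as "Unknown" rather than "Unsuccessful".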

Although the concepts presented above are only the tip of the iceberg when it comes to best practices around managing failures in e-commerce transactions, hopefully they provide some insight into the nuances of the unhappy path, and how we approach it at iSeatz.
