Tuesday, July 24, 2018

Your Microservices are asynchronous! Your approach to testing has to change!

Conceptually, the idea of an application being composed of a number of Microservices is quite self-evident and, therefore, quite appealing. A complex solution is broken into a number of self-sufficient modules, each running on its own, each capable of responding to messages, each amenable to being upgraded or replaced without any other module knowing about it: sounds too good to be true! The icing on the cake, of course, is that each such service can run in its own container!

No wonder then that senior stakeholders of software projects or initiatives love to insist that the application be modelled as a bunch of Microservices. The technology team responds favourably, of course, because everyone wants to be seen among the pack, if not ahead of it! Jumping in sounds exciting, even though the swimming may not be, but we don’t really care, do we?

The proverbial devil is in the details. In my experience so far, the term Microservice is thrown about without much thought. The benefits of having a Microservice-based solution come at a price. It is better to be aware of the aspects from which the price accrues and to fold them into the plan, rather than sweeping them under the carpet. One such aspect is the inherent asynchronicity of Microservices and, given this property, the verification (testing) of the behaviour of the application as a whole.

No singleton Microservice

Let’s take one fatuous argument away to save time: Microservices-based applications cannot be monolithic. In all discussions, the word Microservices is used in the plural. One accepts that there will indeed be more than one such service. They will interact among themselves as and when required by the application’s use-cases and logic of implementation. Putting the whole application in a big ball and attaching an HTTP handling layer to it doesn’t constitute a Microservice.

First, the simple synchronous cases
Let’s assume that we are dealing with two Microservices. Let’s call them A and B. A has its upstream callers, who interact solely with A. In order to respond to its callers, A needs to get something done by B. Therefore, A needs to call an API exposed by B and expect something in return. To keep things simple, let’s assume that B offers an HTTP interface. So, A’s call is effectively synchronous.

Case 1: when B is up
Because B is a service, it has to keep running, because it doesn’t know when A may require something from it. At some point in time, A sends a request through an HTTP API and gets a response back when B is done with the request. Testing this is easy.
Case 2: when B is not up
A assumes that B is up but it is not. So, A will have to deal with the possibility that a response never arrives: some kind of timeout handling is obviously necessary. After all, an HTTP call rides on a single TCP connection. If B is not available or reachable, the TCP layer reports error codes like ECONNREFUSED, ENETUNREACH, EHOSTUNREACH etc. (Do you recall the days of living with tomes by the late Richard Stevens? I certainly do 😃). The HTTP layer covers all of them for us, thankfully!
Case 3: B is slow
This is a little tricky! If B is slow to respond, uncertainty creeps in. Should A wait for a little longer for B’s response? If it does, then for how long? Configurable timeout value? That of course, is the most popular way and is most likely to be useful for a large number of cases, but the basic point remains: A is uncertain of receiving a response. We will revisit this a little later.
In all these situations, underlying TCP is helping us. HTTP sets up a single connection. Once the connection is set up (because B could be contacted), A is certain - well, almost certain but let’s not nitpick - that B has received the request. Everything else, including the testing approach, follows this.
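Here is a minimal sketch of Case 3, using only the JDK. ServiceA, askB() and slowB() are hypothetical names of mine, and the Supplier merely stands in for the HTTP call to B; the point is only that A waits up to a configurable timeout and then falls back.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical stand-in for service A; callB represents the HTTP call to B.
final class ServiceA {

    // Waits at most timeoutMillis for B; returns a fallback when B is slow or fails.
    static String askB(Supplier<String> callB, long timeoutMillis) {
        CompletableFuture<String> pending = CompletableFuture.supplyAsync(callB);
        try {
            return pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            pending.cancel(true); // A gives up; B may still finish the work
            return "B timed out";
        } catch (ExecutionException e) {
            return "B failed";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    // Simulates B taking delayMillis to answer.
    static Supplier<String> slowB(long delayMillis) {
        return () -> {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "B says hello";
        };
    }

    public static void main(String[] args) {
        System.out.println(askB(slowB(10), 1000));  // B answers well within the timeout
        System.out.println(askB(slowB(2000), 100)); // B is slower than A is willing to wait
    }
}
```

Even in this synchronous arrangement, a test has to cover three outcomes (answer, timeout, failure) rather than one.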

Microservices are meant to be asynchronous


Asynchronous means just that: an interaction between two parties is based on the understanding that one of the parties may not be in a position to participate in the interaction, at that moment. It may or may not: this is the key understanding. In essence, this is an implementation of the classical Producer-Consumer problem. It is easy to visualize but may not be that straightforward to adopt when it comes to real-life situations.
The basic premise is this: A wants to tell B something but B may or may not be available to sit and listen (it is not unwilling, it is just unavailable). So, A tells an intermediary what it has to tell B. When B is ready, it checks with the intermediary, and is told what A wanted to tell it in person, but could not.

Case 1:
A sends a message to B, but in reality, the message is dropped in the intermediary’s bin. The send() (or post()) operation of A succeeds! A will not know if B ever comes to the intermediary and if it does, when! A can surely implement a timeout logic, but it is difficult for it to know why B has never responded: is it because
  • the intermediary is sluggish, not really B’s fault
  • intermediary has become unavailable after accepting the message from A
  • B is sluggish (the same as B being slow, refer to Case 3 of the synchronous cases)
  • B runs into problems while readying for processing A’s message and throws an exception
  • … and more!
Observe how the number of test-scenarios has increased (not necessarily associated with the application’s functional correctness) simply because one intermediary has been introduced between A and B.
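To make this concrete, here is a small sketch in which two in-memory queues play the intermediary. Everything in it (the Intermediary class, its method names, the message text) is my own assumption for illustration; a real broker will differ, but the asymmetry is the same: send() succeeds whether or not B is around.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical in-memory stand-in for a message broker between A and B.
final class Intermediary {
    final BlockingQueue<String> requests = new LinkedBlockingQueue<>();
    final BlockingQueue<String> replies = new LinkedBlockingQueue<>();

    // A's send(): succeeds as soon as the intermediary accepts the message.
    boolean send(String message) {
        return requests.offer(message);
    }

    // A waits for a reply; null means "no reply yet", with no hint about why.
    String awaitReply(long timeoutMillis) {
        try {
            return replies.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    // B, whenever it becomes available, picks up one message and replies.
    void letBProcessOne() {
        String message = requests.poll();
        if (message != null) {
            replies.offer("processed: " + message);
        }
    }

    public static void main(String[] args) {
        Intermediary broker = new Intermediary();
        System.out.println("send accepted: " + broker.send("order-1")); // B is not running
        System.out.println("reply: " + broker.awaitReply(100));         // A waits in vain
        broker.letBProcessOne();                                         // B shows up later
        System.out.println("reply: " + broker.awaitReply(100));
    }
}
```

When awaitReply() comes back empty, A cannot tell which of the reasons listed above is to blame; a test has to create each of them deliberately.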

Case 2: A needs to interact with B and C
In this case, A needs to hear from both B and C, and then decide what to do with the two responses, before forming its own response. Remember A’s upstream callers; they are expecting something from A. This is rather obvious. All the points listed in the previous case are now doubled, and maybe more. For example, the intermediary and B may have been very quick and co-operative but C has been sluggish!

Case 3: A interacts with B and C, and C needs to tell D about this interaction
This is somewhat orthogonal to the caller->A->(B and C) flow. A’s callers are not interested in what information is shared with D but the application’s overall behaviour necessitates it.
Microservice D - mentioned in the arrangement above - poses an additional problem, when it comes to verification of application’s behaviour.
  • What if A successfully responds to its upstream callers - thereby giving an impression that everything is fine - and yet, C fails to inform D? Should we still assert that the application is behaving as expected?
  • How do we confirm if D has indeed received the correct piece of information from C? Please note that the flow allows C to tell D, but doesn’t mandate D to acknowledge to C. Therefore, inquiring with C will not give us the answer. We need a mechanism to inquire of D, if it has heard from C at all.
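One possible way of inquiring of D is for the test to poll it until a deadline passes. The sketch below assumes that D can be asked what it has received; ServiceD, hasReceived() and holdsWithin() are names I have made up for the purpose, not a recommendation of any particular testing library.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.BooleanSupplier;

final class DeliveryCheck {

    // Hypothetical stand-in for service D, exposing what it has heard so a test can ask it.
    static final class ServiceD {
        private final List<String> received = new CopyOnWriteArrayList<>();

        void onMessage(String message) { received.add(message); }

        boolean hasReceived(String message) { return received.contains(message); }
    }

    // Polls the condition until it holds or timeoutMillis has passed.
    static boolean holdsWithin(long timeoutMillis, BooleanSupplier condition) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (deadline - System.nanoTime() > 0) {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return condition.getAsBoolean(); // one last look at the deadline
    }

    public static void main(String[] args) {
        ServiceD d = new ServiceD();
        // C tells D a little later, on another thread, and expects no acknowledgement.
        new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            d.onMessage("C-to-D #1");
        }).start();
        System.out.println("D heard from C: "
                + holdsWithin(1000, () -> d.hasReceived("C-to-D #1")));
    }
}
```

The timeout passed to holdsWithin() is itself a decision about how much delay is acceptable, which is exactly the kind of question asynchronous interactions force on the test.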


The effect of being asynchronous on testing

If you have been following me along, you must have agreed with me that the interactions between Microservices give rise to an increasing number of test-scenarios.
The important point is that entirely by itself, a Microservice is rather easy to test. It is supposed to be self-sufficient: it has its own data structures, components, databases and storage. We can possibly mock much of these and prove that its behaviour is in line with what is expected. Such a Microservice responds to messages from other Microservices. Even though this brings in additional test-scaffolding (and associated complexity and effort), the situation is manageable. After all, there is a predictability here: send the Microservice a message and check how it reacts.

However, how do we deal with the asynchronous aspect of the interaction while testing: the cases where a message may or may not reach the Microservice and even if it does, we don't know by when? If there is a delay, is that acceptable? That is not all. As I have elucidated above (C telling D), there is a sizeable jump in the number of cases when two Microservices interact to complete a particular use-case.

For example, if the number of messages from A to C is more than 50 in a minute, then D must be notified but D doesn’t need to acknowledge. If D is unavailable now, C will never know. If D is available after 2 minutes, C will never know either.
Think about the cases that we need to handle!
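As an illustration, the counting rule inside C could look like the sketch below. RateRule and its parameters are hypothetical, and timestamps are passed in so that the rule can be exercised deterministically.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical rule inside C: D must be notified once more than `threshold`
// messages from A arrive within `windowMillis`. Timestamps are passed in.
final class RateRule {
    private final int threshold;
    private final long windowMillis;
    private final Deque<Long> arrivals = new ArrayDeque<>();

    RateRule(int threshold, long windowMillis) {
        this.threshold = threshold;
        this.windowMillis = windowMillis;
    }

    // Records one message from A; returns true when D must be notified.
    boolean onMessageFromA(long nowMillis) {
        arrivals.addLast(nowMillis);
        // Drop arrivals that have fallen out of the window.
        while (!arrivals.isEmpty() && nowMillis - arrivals.peekFirst() >= windowMillis) {
            arrivals.removeFirst();
        }
        return arrivals.size() > threshold;
    }

    public static void main(String[] args) {
        RateRule rule = new RateRule(50, 60_000);
        boolean notifyD = false;
        for (int i = 0; i < 51; i++) {
            notifyD = rule.onMessageFromA(i * 1000L); // one message per second
        }
        System.out.println("notify D: " + notifyD);
    }
}
```

Verifying this rule in isolation is the easy part. Whether D actually received the notification, and how late, is not something C can tell us.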

Summary

Microservices bring in a host of benefits for sure, but require us to be ready to pay for them as well. While deciding to go the Microservices way, it is very easy to overlook the complexities that we need to handle, only to be reminded of the price at a later stage of the project, accompanied by unavoidable pain. It is obvious that we must plan for this. Usual ways of testing are going to be insufficient. Protocols must be well thought out. Designs should be done from the ground up. Exposed functionality must take into account operational demands and constraints. Tolerance for failure must be arrived at, following objective analysis of the overall behaviour of the application.

I have not touched upon the distributed nature of Microservice based applications. No, networks are not reliable! I have not touched the whole subject of pre-deployment and post-deployment testing. No, Dockers and Kubes don’t take away the basic fact that interaction between two temporally decoupled processes is fraught with uncertainty. These are topics of a follow-up blog, perhaps.


The usual synchronous testing approach doesn’t work with a bunch of Microservices meant to behave asynchronously. It is important that we bear this in mind.

(This blog is also posted here: I work as a Principal at Swanspeed Consulting)

Sunday, April 9, 2017

Akka's CircuitBreaker: how to incorporate in your application, by passing messages and handling futures

Background

These days, it is quite common to build an application, which makes use of various other services, to augment its own service. In fact, the value of such a service lies in this aggregation: an aggregation, if and when thoughtfully done - and its result is brought to the users - lifts the usefulness of the application by a few notches.
So, in a typical case of such aggregation, a call to our application gives rise to a number of calls to services external to itself. Let's assume that all such calls follow the REST style, over HTTP. Our application makes a series of calls (maybe concurrently), waits for them to return, processes/transforms the values that are returned and possibly, performs an aggregation. Regular stuff, nothing surprising here.
What if one or more of these external calls fail (unreachable host, as an example)?
The call from our application times out eventually, and our fallback logic takes over; perhaps, a default or an exception-indicating value is returned. Again, this is regular stuff too. We all encounter such cases and take requisite actions. This is certainly not a blocker, by itself.
However, if our application is widely popular (why shouldn't it be?), then it is quite likely to be the target of numerous simultaneous calls from its users; perhaps the result of a sudden surge. All these incoming calls are going to result in outbound calls to the external services. If an external service is unavailable, each of these incoming calls (call chains, one may say) is going to waste some time waiting for the responses.

The problem at hand

The point is this: if 2 or 3 successive incoming calls find an outbound call failing, then it is reasonable to assume that a persistent problem exists at the external service's endpoint. Therefore, it is to our application's benefit (and, by extension, to the benefit of its users) that we instruct the subsequent incoming calls to bypass this outbound call. This decision may result in disappointing (or irritating) a few users with missing pieces of information but the overall throughput of our application is not compromised.
A Circuit Breaker is a handy pattern to apply to put this approach in practice.

Experts’ take on CircuitBreaker

I came to learn about the CircuitBreaker pattern, from Martin Fowler's blog on the subject: https://martinfowler.com/bliki/CircuitBreaker.html This still is a very good point to start, in my opinion.
Wikipedia has a detailed treatise on it. I also found this blog, to be quite informative:  

CircuitBreaker in Akka

This blog is about the support that Akka has, for this particular pattern. From whatever I have seen so far, a good example of how a CircuitBreaker works in Akka is hard to come by. Akka's own documentation on the subject, is woefully short of what one looks for. Here's my attempt to fill in the void. Let's jump into that.
Akka provides a CircuitBreaker. If an actor calls a service through a CircuitBreaker, then successful and failed calls are tracked. If a certain number of successive calls fail, the CircuitBreaker prevents any further request from reaching the service (it opens the circuit, as it were) till a certain amount of time elapses. Once this duration is over, the circuit becomes half-open: the first request that arrives is allowed to call the service. If this request is successful, then the CircuitBreaker assumes that the external service is now available and the circuit is closed; if it fails, the circuit becomes open again.
To demonstrate this behaviour, let us implement a small application. It makes use of the information about leading soccer clubs, available at http://clubinfo.com. To retrieve information about a soccer club through this REST API, we form an HTTP GET request:
where 7 represents a particular club (in this case, Hamburger SV)
The response is a JSON string, which looks like this:
[Screenshot of the JSON response: Selection_283.png]

We employ the following actors to implement the flow:
  • Requestor
  • SoccerClubInfoGetter
  • SysAdminConsole and
  • CallWastePreventor
Various actors and messages which our sample application employs

The code is here: https://github.com/nsengupta/Akka-CircuitBreaker-Demo
Requestor is our point of entry; anybody looking for the details of a club, must send a message to this actor. Requestor plays the role of a mediator; it calls either SoccerClubInfoGetter or CallWastePreventor. Let me explain.
SoccerClubInfoGetter encapsulates the actual call made to the external REST endpoint, asynchronously:

CompletableFuture.supplyAsync(
    new Supplier<ClubDetailsFromXternalSource>() {
        @Override
        public ClubDetailsFromXternalSource get() {
            String s = Http.get(clubInfoAskedFor).text();
            return new InteractionProtocol.ClubDetailsFromXternalSource(s, originalSender);
        }
    }),
getContext().system().dispatcher()

I am using a simple REST endpoint accessing tool, named javalite (Http.get()). Any such tool can be used here.
SoccerClubInfoGetter simply pipes the response of javalite's Http.get() call above (available as JSON):
pipe(
    CompletableFuture.supplyAsync(
        // As shown above
    )
).to(getSender());

In our case, Requestor is the recipient of the piped response above, because it is the sender.
In order to emulate a failing call to the external REST endpoint, we feign a longish sleep, so that the caller decides to give up and return, convinced that the external service is unavailable. Because all clubs are identified by a non-zero integer, we take such a step when the identifier of the club is passed as zero!
else { // Emulating a failed call to the external service
    Thread.sleep(2000);
    getSender().tell(
        new InteractionProtocol.UnavailableClubDetails("timed out"),
        getSelf()
    );
}

Take a look at SoccerClubInfoGetter to understand how it enacts the role of a ‘failure generator’, if you will.
Requestor relies on the SoccerClubInfoGetterActor for getting the job done. If SoccerClubInfoGetterActor sleeps for 2 seconds, Requestor's ask() times out (ASK_TIMEOUT value is set to 1 second).

We convert the Object it returns, to a ClubDetailsFromXternalSource message using a Function object. If ask() fails, we absorb the Exception and return a TimedOutClubDetails message. In both of these cases, the Requestor feeds the message to itself, by calling a getSelf().
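The convert-or-fall-back step can be sketched with the JDK alone. This is not the demo's Akka code: the two message classes below are simplified stand-ins that borrow the demo's names, toMessage() is a hypothetical helper, and orTimeout() needs Java 9 or later.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// JDK-only sketch; the message classes are simplified stand-ins for the demo's types.
final class AskWithFallback {

    static final class ClubDetailsFromXternalSource {
        final String json;
        ClubDetailsFromXternalSource(String json) { this.json = json; }
    }

    static final class TimedOutClubDetails { }

    // Converts a reply into ClubDetailsFromXternalSource, or into TimedOutClubDetails
    // when the reply fails or does not arrive within timeoutMillis.
    static CompletableFuture<Object> toMessage(CompletableFuture<Object> reply, long timeoutMillis) {
        return reply
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS) // Java 9+
                .<Object>handle((value, error) -> error == null
                        ? new ClubDetailsFromXternalSource(String.valueOf(value))
                        : new TimedOutClubDetails());
    }

    public static void main(String[] args) {
        CompletableFuture<Object> answered = CompletableFuture.completedFuture("{\"id\": 7}");
        CompletableFuture<Object> silent = new CompletableFuture<>(); // never completes on its own
        System.out.println(toMessage(answered, 1000).join().getClass().getSimpleName());
        System.out.println(toMessage(silent, 100).join().getClass().getSimpleName());
    }
}
```

Either way, the caller ends up with exactly one message to feed to itself, which is what keeps the receive() logic simple.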
Requestor's receive() function is equipped with the logic of handling either of the two messages mentioned in the paragraph above:

Requestor remembers the ActorRef of the actor who has sought the club's information in the first place, which it uses to provide the appropriate response. In case the ask() has timed out, the response is a piece of helpful information ("Service unresponsive, try again later"). In both cases, we are returning a String; so, the compiler is happy too.
Take a look at RESTDriver, to understand the behaviour of Requestor and SoccerClubInfoGetter.

Failure is emulated

Let's revisit the following code snippet from SoccerClubInfoGetter:
It is an emulation of an unpredictable situation: the call to the external API may or may not fail; we don't know. To bring in an element of determinism, we are targeting a clubID of zero to force the logic to follow this path. By making it sleep, we are delaying the thread long enough to cause the ask() from Requestor to time out. In a real-life situation, this determinism is absent; so, it is possible that a series of calls to the external API fails after timing out. This is a waste for sure. Our application's throughput is adversely affected and users are unhappy.

A CircuitBreaker comes as an aid

A CircuitBreaker helps in a situation like this. When we instantiate a CircuitBreaker, we give it a number that represents the tolerance level of failures and a duration for which it remains open. If we set the tolerance to 3, then we are telling the CircuitBreaker that
  1. If 3 successive calls through you fail, consider that target of these calls is in trouble.
  2. Cause the circuit to open and leave it at that for the finite duration you are initialized with
  3. Any call that reaches you during this duration, send it back immediately, citing a failure
  4. Once the duration elapses, be ready for the next call and let it go through.
  5. If this call fails, go back to step 2 above.
To execute this cycle predictably and flawlessly, a CircuitBreaker follows a series of well-defined State Transitions. This page from Akka's documentation provides a clear illustration.
If you follow the diagram, you can see that step 4 above leaves the story a tad incomplete. What actually happens is that at step 4, the CircuitBreaker puts the circuit in the half-open state. If the very next call is successful, then the CircuitBreaker causes the circuit to close; otherwise (step 5 above), it causes the circuit to be open again (step 2 again).
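The cycle of closed, open and half-open states can be traced with a deliberately simplified, single-threaded sketch. TinyBreaker is my own illustration and is not how Akka implements its CircuitBreaker; the current time is passed in as a parameter so that the transitions are easy to follow.

```java
import java.util.function.Supplier;

// Simplified, single-threaded illustration of the state transitions;
// this is not Akka's implementation. Time is passed in to keep it deterministic.
final class TinyBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int maxFailures;
    private final long resetTimeoutMillis;
    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0;

    TinyBreaker(int maxFailures, long resetTimeoutMillis) {
        this.maxFailures = maxFailures;
        this.resetTimeoutMillis = resetTimeoutMillis;
    }

    State state() { return state; }

    // Runs body unless the circuit is open; returns fallback on failure or fast-fail.
    <T> T call(Supplier<T> body, T fallback, long nowMillis) {
        if (state == State.OPEN) {
            if (nowMillis - openedAt < resetTimeoutMillis) {
                return fallback;             // step 3: send the call back immediately
            }
            state = State.HALF_OPEN;         // step 4: let exactly this call through
        }
        try {
            T result = body.get();
            failures = 0;                    // a success resets the count
            state = State.CLOSED;
            return result;
        } catch (RuntimeException e) {
            failures++;
            if (state == State.HALF_OPEN || failures >= maxFailures) {
                state = State.OPEN;          // steps 1-2 and 5: (re)open the circuit
                openedAt = nowMillis;
            }
            return fallback;
        }
    }

    public static void main(String[] args) {
        TinyBreaker breaker = new TinyBreaker(3, 1000);
        Supplier<String> failing = () -> { throw new IllegalStateException("service down"); };
        for (int i = 0; i < 3; i++) {
            breaker.call(failing, "fallback", 0);
        }
        System.out.println(breaker.state());                           // after 3 failures
        System.out.println(breaker.call(() -> "ok", "fallback", 500)); // still within the open duration
        System.out.println(breaker.call(() -> "ok", "fallback", 1500));
        System.out.println(breaker.state());
    }
}
```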
In our application, we implement a CircuitBreaker in the CallWastePreventor actor (code here).
The key aspect of CallWastePreventor's receive() function is a block, which is - unsurprisingly - very similar to SoccerClubInfoGetter:

The workingCallable parameter to the circuit breaker is an instance of the JDK's Callable. As the pipe() call expects a CompletionStage, this Callable has to produce a CompletionStage. Moreover, as the pipe() call has to finally pipe a message of the type ClubDetailsFromXternalSource, the Callable has to produce a CompletionStage of ClubDetailsFromXternalSource:
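The shape of such a Callable can be shown with the JDK alone. In the sketch below, ClubDetailsFromXternalSource is a simplified stand-in for the demo's message type and fetchJson() is a hypothetical placeholder for the REST call; only the nesting of the types is the point.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

final class CallableSketch {

    // Simplified stand-in for the demo's message type.
    static final class ClubDetailsFromXternalSource {
        final String json;
        ClubDetailsFromXternalSource(String json) { this.json = json; }
    }

    // Hypothetical placeholder for the blocking REST call.
    static String fetchJson(int clubId) {
        return "{\"id\": " + clubId + "}";
    }

    // The Callable does not return the message itself; it returns a
    // CompletionStage that will eventually hold the message.
    static Callable<CompletionStage<ClubDetailsFromXternalSource>> workingCallable(int clubId) {
        return () -> CompletableFuture.supplyAsync(
                () -> new ClubDetailsFromXternalSource(fetchJson(clubId)));
    }

    public static void main(String[] args) throws Exception {
        CompletionStage<ClubDetailsFromXternalSource> stage = workingCallable(7).call();
        System.out.println(stage.toCompletableFuture().join().json);
    }
}
```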

nonWorkingCallable is initialized in an equivalent manner.
In its role as the mediator, the Requestor makes no distinction between a SoccerClubInfoGetter and a CallWastePreventor. The mechanism it resorts to is the same: it asks whichever ActorRef it is injected with (the construction parameter circuitBreakerJeeves) for the information about a particular club and deals with the response it eventually receives.
To get an idea of the way the application works, take a look at RESTDriver (uses SoccerClubInfoGetter) and CBDriver (uses CallWastePreventor).

CircuitBreaker’s Transition handlers

A CircuitBreaker provides a facility to register callback functions, namely onOpen, onHalfOpen and onClose, to mark each of the state-transitions mentioned earlier. We employ a SysAdminConsole actor; this actor is notified every time the CircuitBreaker's state changes. Using a specific message called AdminQueryMessage, we get to know the state the CircuitBreaker is in at the moment. We use this to our advantage while testing the application. Take a look at CircuitBreakerTest (in the usual location of src/test/java).

Importance of durations

It is common (and important) knowledge that durations play a major role in the behaviour of an Actor-based application. In our demonstrative application, they assume even greater significance, because the emulation of failure in it is based on the occurrence of a time-out. For the sake of simplicity (and, some laziness too, I admit), some of these durations are initialized using hard-coded values. As a result, other components, whose behaviour depends on the aforementioned durations, also use hard-coded values. This understanding will help when you go through CircuitBreakerTest, CBDriver and RESTDriver.

Let me know if you find this article helpful, or if you think something specific is missing, whose presence would have helped in grasping the behaviour of Akka’s CircuitBreaker.