Distributed Tracing

Mahendra Yadav
February 27, 2019
#devops

Distributed Tracing

  • What is tracing?
  • What is Distributed Tracing?
  • Why do we need tracing?
  • Microservices vs Monolithic Example: E-commerce
  • Why are we not using tracing?
  • What is Jaeger?
  • Tracer
  • Span
  • Showing the OpenTracing Hot R.O.D. Demo
  • Conclusion
  • Req/Res Visualisation
  • Helps in Debugging
  • Root Cause Analysis

Key Takeaways

  • Introducing the concept of tracing in distributed systems.

What is Tracing?

In software engineering, tracing involves a specialized use of logging to record information about a program's execution. - Wikipedia

What is Distributed Tracing?

Distributed tracing is a method used to profile and monitor applications, especially those built from multiple services. It helps pinpoint where failures occur and what causes poor performance. It is also called distributed request tracing.

Why do we need distributed tracing?

Let’s take the example of an E-commerce website.

Given below is the architecture of a monolithic e-commerce app with User management, Order management, Inventory management, etc.

Monolithic E-commerce Application

In a monolithic application, tracing can help to debug and analyze the performance of the application since the request coming to the app is processed in one location.

   

Now, let’s consider refactoring the whole monolithic app into separate microservices.

E-Commerce application in Microservices Architecture

After a while, to increase sales, we need to add a recommendation engine.

E-Commerce application in Microservices Architecture with one new service

Soon after integrating the recommendation engine, the system starts slowing down. At this point, it is tempting to say the culprit is the recommendation engine since it was added recently, but what about the existing services? We do not have a mechanism to identify the bottlenecks. The immediate action would be to collect the request and response metrics in each service.
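As a rough sketch of that immediate action, the snippet below wraps each service handler in a timer and keeps the durations per service. The names `timed`, `request_durations`, `recommend` and `check_stock` are hypothetical and only serve the illustration. Note that such per-service numbers tell us how slow each service is on its own, but not how a single request travelled across services.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory store: service name -> list of call durations in seconds.
request_durations = defaultdict(list)


def timed(service_name):
    """Record how long each call to the wrapped handler takes."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return handler(*args, **kwargs)
            finally:
                # Record the duration even if the handler raises.
                request_durations[service_name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@timed("recommendation")
def recommend(user_id):
    time.sleep(0.01)  # stand-in for real work
    return ["item-1", "item-2"]


@timed("inventory")
def check_stock(item_id):
    return True


def average_latency(service_name):
    """Average recorded duration for a service, or 0.0 if nothing was recorded."""
    samples = request_durations.get(service_name, [])
    return sum(samples) / len(samples) if samples else 0.0
```

Calling `recommend(1)` and `check_stock("sku-1")` a few times and then comparing `average_latency("recommendation")` with `average_latency("inventory")` gives a first hint of which service is slow.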

Why are we not using distributed tracing?

Instrumentation is complex

Implementing distributed tracing in a project is complex and time-consuming.

Vendor Lock-in

Switching from one vendor's distributed tracing implementation to another requires a lot of changes to the code.

Inconsistent APIs

Tracing semantics should not depend on the programming language, yet tracing APIs are often inconsistent from one language to another.

OpenTracing is an API specification that addresses the limitations above and makes it much easier to integrate distributed tracing into microservices. Jaeger is one implementation of OpenTracing.
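To see why a shared API helps with vendor lock-in, here is a minimal sketch of the idea. It is not the OpenTracing API itself: the `Tracer` interface, both implementations and `place_order` are hypothetical names made up for this illustration. The application code depends only on the interface, so the implementation behind it can be swapped without touching the business logic.

```python
from abc import ABC, abstractmethod


class Tracer(ABC):
    """Hypothetical vendor-neutral interface that application code depends on."""

    @abstractmethod
    def record(self, operation_name, duration_ms):
        """Report that an operation took duration_ms milliseconds."""


class InMemoryTracer(Tracer):
    """One possible backend: keeps everything in a list."""

    def __init__(self):
        self.recorded = []

    def record(self, operation_name, duration_ms):
        self.recorded.append((operation_name, duration_ms))


class NoopTracer(Tracer):
    """Another backend: drops everything, for example when tracing is disabled."""

    def record(self, operation_name, duration_ms):
        pass


def place_order(tracer, order_id):
    # The business logic only knows the Tracer interface, not the backend.
    tracer.record("place_order", 12.5)  # the duration is a made-up value
    return f"order {order_id} placed"
```

Passing `InMemoryTracer()` or `NoopTracer()` to `place_order` changes where the data goes, while `place_order` itself stays the same.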

Jaeger

Jaeger was developed in Go at Uber to meet its distributed tracing requirements. It is inspired by Dapper and OpenZipkin, and it provides instrumentation libraries for many popular languages.

Image describing the role of OpenTracing in tracing a distributed system.
Reference: lightstep.com
OpenTracing Instrumentation

Before going forward, let’s discuss two important concepts: span and trace.

Span

A span represents an individual unit of work done in a distributed system. It contains an operation name, start time and end time of the operation.
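A minimal sketch of such a span, assuming nothing beyond the three fields mentioned above. The `Span` class here is an illustrative model and not the class used by Jaeger or any OpenTracing client library.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Illustrative model of a span: one named, timed unit of work."""
    operation_name: str
    start_time: float = field(default_factory=time.time)
    end_time: Optional[float] = None

    def finish(self):
        """Mark the unit of work as done."""
        self.end_time = time.time()

    @property
    def duration(self):
        """Elapsed seconds, or None while the span is still open."""
        if self.end_time is None:
            return None
        return self.end_time - self.start_time
```

A span is created when the operation starts, for example `span = Span("get_customer")`, and `span.finish()` is called when the operation ends, after which `span.duration` holds the elapsed time.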

Trace

A trace represents the data/execution path of a request through the system. It can be thought of as a directed acyclic graph of spans.

Examining a Single Trace containing multiple Spans

In the figure above, the Service 1 span stretches horizontally across the whole request, and beneath it are spans from the same service as well as from other services, each performing a different operation. A trace is the combination of all these spans from the different services.
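The relationship between a trace and its spans can be sketched as follows. This is a simplified model that assumes every span has at most one parent, so the trace forms a tree (a special case of a directed acyclic graph). The `SpanRecord` class, the service names and the timings are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanRecord:
    """Illustrative finished span with a reference to its parent span."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    operation: str
    start_ms: int
    end_ms: int


def render_trace(spans):
    """Return one indented line per span, parents before their children."""
    children = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)

    lines = []

    def visit(span, depth):
        duration = span.end_ms - span.start_ms
        lines.append(f"{'  ' * depth}{span.service}: {span.operation} ({duration} ms)")
        for child in sorted(children[span.span_id], key=lambda s: s.start_ms):
            visit(child, depth + 1)

    for root in sorted(children[None], key=lambda s: s.start_ms):
        visit(root, 0)
    return lines


# Made-up trace: one request handled by three services.
trace = [
    SpanRecord("a", None, "service-1", "handle_request", 0, 100),
    SpanRecord("b", "a", "service-1", "validate", 5, 15),
    SpanRecord("c", "a", "service-2", "fetch_data", 20, 80),
    SpanRecord("d", "c", "service-3", "query_db", 30, 70),
]
```

Printing `"\n".join(render_trace(trace))` shows the root span of service-1 at the top with the spans of the other services nested beneath it, which mirrors the layout a tracing UI uses.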

Example:

Hot R.O.D. - Rides on Demand app

The HotROD application comprises a frontend service, a route service, a customer service and a driver service, and it uses MySQL and Redis as data stores. We can see the dependency graph of the system below.

Empirically constructed service dependency diagram
Jaeger UI - HotROD Dependencies Diagram

Moving forward, let’s request a cab from the frontend.

We have four customers, and clicking one of the four buttons summons a cab to that customer’s location. Once the request for a cab is sent to the backend, it responds with the cab’s license plate number and the expected time of arrival. The app displays this response as shown in the image below.

HotROD UI - Ride requested

Now, you can go to the Jaeger UI and explore how the request made above flows from one service to another.

Jaeger UI - data flow

Let’s zoom in and see what is happening in our MySQL span.

Jaeger UI trace example before exploring mysql span

Jaeger UI trace example - exploring MySQL span

Looking into the span, we can see whether the request comes from the client side or the server side, which query is being executed on the database, and the logs from the system as well. This gives us enough information to analyse and debug our system.
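The kind of information described above can be modelled as tags (key-value pairs describing the span) and logs (timestamped events inside the span). The sketch below uses plain dictionaries rather than a real tracing client. The helper functions, the tag names and the query are illustrative assumptions and not copied from the HotROD source.

```python
import time


def new_span(operation_name):
    """Create an illustrative span that carries tags and logs."""
    return {"operation_name": operation_name, "tags": {}, "logs": []}


def set_tag(span, key, value):
    """Attach a key-value pair that describes the whole span."""
    span["tags"][key] = value


def log_event(span, event, **fields):
    """Append a timestamped event that happened during the span."""
    span["logs"].append({"timestamp": time.time(), "event": event, **fields})


# Hypothetical database span, similar in spirit to the MySQL span in the UI.
mysql_span = new_span("SQL SELECT")
set_tag(mysql_span, "span.kind", "client")    # the caller's side of the request
set_tag(mysql_span, "peer.service", "mysql")  # the service being called
set_tag(mysql_span, "sql.query", "SELECT * FROM customer WHERE customer_id = 123")
log_event(mysql_span, "query finished", rows=1)
```

With this data attached to a span, a question such as "which query ran in this span?" becomes a simple lookup: `mysql_span["tags"]["sql.query"]`.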

Conclusion

Request/Response Visualisation

Tracing provides a UI to visualize a request as it flows through the different services, with details such as the time consumed by each call.

Helps in Debugging

The timing, tags and logs attached to each span help in debugging and improving the system.

Root Cause Analysis

It makes it easier to find the real cause of an issue.
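As a small illustration of root cause analysis on trace data, the sketch below computes for each span the time spent in the span itself, excluding its direct children, and picks the largest. It assumes that the children of a span do not overlap in time. The span names and timings are made up.

```python
from collections import defaultdict

# Hypothetical spans of one trace: (span_id, parent_id, name, start_ms, end_ms).
spans = [
    ("a", None, "frontend: dispatch", 0, 100),
    ("b", "a", "customer: get_customer", 5, 75),
    ("c", "b", "mysql: SQL SELECT", 10, 70),
    ("d", "a", "route: find_route", 80, 95),
]


def self_times(spans):
    """Map span name -> time spent in the span itself, excluding direct children."""
    child_total = defaultdict(int)
    for _, parent_id, _, start, end in spans:
        if parent_id is not None:
            child_total[parent_id] += end - start
    return {
        name: (end - start) - child_total[span_id]
        for span_id, _, name, start, end in spans
    }


def slowest(spans):
    """Name of the span with the largest self time."""
    times = self_times(spans)
    return max(times, key=times.get)
```

In this made-up trace the frontend span is the longest overall, but most of its time is spent waiting on its children, and `slowest(spans)` points at the database query instead.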

References:

https://opentracing.io/

https://www.jaegertracing.io/docs/1.9/

https://docs.lightstep.com/docs/opentracing-instrumentation

https://medium.com/opentracing/take-opentracing-for-a-hotrod-ride-f6e3141f7941