Let me tell you a short story. Once upon a time I had to replace the outdated OpenCensus (OC) tracing demo with a new OpenTelemetry (OT) instrumentation demo application. If you don't know what OC or OT are, you can stop reading; the story will not be very interesting for you.
The work looked easy. I already had code named cloud-trace-demo-opentelemetry. I did what many people would do: I asked someone who knows OT better than me to look into the code and tell me whether it would work or not. After getting their approval, I went to update the README and review the step-by-step execution instructions. And then I tried to execute these instructions myself. And… the demo did not work. I ended up looking into the OT Flask instrumentation (oh, I forgot to mention that the demo was implemented in Python), and then into the OT end-to-end Flask example. Using these reading materials, I instrumented the demo application code like the following:
import flask

from opentelemetry import propagate, trace
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.propagators.cloud_trace_propagator import CloudTraceFormatPropagator
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(CloudTraceSpanExporter()))
trace.set_tracer_provider(provider)
propagate.set_global_textmap(CloudTraceFormatPropagator())

app = flask.Flask(__name__)
FlaskInstrumentor().instrument_app(app)
tracer = trace.get_tracer(__name__)
It worked nicely: each time a Flask handler was called, new trace and span IDs were generated and the traces were ingested into Cloud Trace. The catch was that my demo application deployed three services that were supposed to call one another upon receiving an outside request in the first one. I expected to get a nice waterfall view of the nested traces. Instead I got three unrelated traces, each with a unique trace ID and unrelated span IDs, which defeated the purpose of demonstrating distributed tracing. It did not take long to figure out the missing part: I had to instrument the requests that are sent from within the Flask handlers in addition to instrumenting the handlers themselves. So I added the import instruction and the following line of code to the end of the instrumentation (above):

from opentelemetry.instrumentation.requests import RequestsInstrumentor

RequestsInstrumentor().instrument()
This was the moment when everything was supposed to start working. And it was the moment when everything stopped working at all. My application got stuck (a.k.a. froze, a.k.a. hung) in the first service's request handler. I looked for a solution on the internet but found nothing. And I did what many other people would do: I went back to someone who knows OT better than me. It took them some time, but a solution was found. Apparently it was in those OT code samples from the beginning (here is another example and one more). It is just that no documentation page explained, beyond performance reasons, why one should use BatchSpanProcessor, while other examples show the use of SimpleSpanProcessor too. We found out that using the Flask instrumentation and the requests instrumentation together with SimpleSpanProcessor leads to a kind of deadlock: the processor tries to send a span (SimpleSpanProcessor does this synchronously), and a new span gets created somewhere along the way. Replacing SimpleSpanProcessor with BatchSpanProcessor in the instrumentation code magically unblocked everything.
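To make the failure mode concrete, here is a small standalone sketch of the re-entrancy problem. This is plain Python modeling the shape of the issue, not the actual OT internals: a synchronous export path holds a lock, and the instrumented HTTP call made during export needs that very lock. (The `timeout` is there only so the sketch reports the problem instead of hanging forever, as my demo did.)

```python
import threading

# A non-reentrant lock standing in for whatever the synchronous export
# path holds while it is sending a span (an analogy, not OT internals).
export_lock = threading.Lock()

def send_over_instrumented_http():
    # The instrumented HTTP client tries to start a span for the outgoing
    # request, which needs the lock the exporter is already holding.
    if export_lock.acquire(timeout=0.1):  # a real deadlock would wait forever
        export_lock.release()
        return "sent"
    return "stuck"

def export_span_synchronously():
    # SimpleSpanProcessor-style: export on the caller's thread, under the lock.
    with export_lock:
        return send_over_instrumented_http()

print(export_span_synchronously())  # prints "stuck"
```

The handler thread ends up waiting for itself, which is exactly what a frozen request handler looks like from the outside.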
Today I learned that (at least in Python) the two OT instrumentations, FlaskInstrumentor and RequestsInstrumentor, do not work together unless spans are processed asynchronously.
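The asynchronous lesson can be illustrated with a stdlib-only sketch of the batching idea (this models the concept, not the real BatchSpanProcessor): the request thread only enqueues a finished span and returns immediately, while a dedicated worker thread does the exporting, so any spans created during export can never block a handler.

```python
import queue
import threading

span_queue = queue.Queue()
exported = []

def on_span_end(span_name):
    # BatchSpanProcessor-style: the request thread only enqueues the
    # finished span and returns immediately.
    span_queue.put(span_name)

def export_worker():
    # Export happens on a dedicated thread, off the request path.
    while True:
        span = span_queue.get()
        if span is None:  # shutdown sentinel
            return
        exported.append(span)

worker = threading.Thread(target=export_worker)
worker.start()
for name in ("service-a", "service-b", "service-c"):
    on_span_end(name)
span_queue.put(None)  # tell the worker to stop
worker.join()
print(exported)  # prints ['service-a', 'service-b', 'service-c']
```

Because the queue decouples span creation from span export, the re-entrancy that froze my demo simply cannot occur on the handler's thread.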
P.S. The demo application I was talking about, if you ever want to look at it, is the Cloud Trace demo of OpenTelemetry.