Rework the trace pipeline towards statelessness #121

athre0z · 2024-08-15T16:50:45Z

Problem

Our trace processing pipeline is currently engineered towards a backend that keeps the information it receives around forever in a bunch of places. Once information has been sent, it often won't be sent again until the agent restarts. This is problematic for two reasons:

The new OTLP protocol is stateless and requires us to send completely self-contained packets of data. The OTLP reporter currently works around this by keeping all the info in LRUs. This kind of works, but we run into issues of permanently missing symbols and executable names when the LRU starts evicting. It also costs a lot of memory.
Stateful reporter implementations with backends that phase out old data after a fixed period will run into issues with data that got evicted but will never be resent unless the profiling agent is restarted periodically.
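The failure mode described above can be reproduced in a few lines. The following is a minimal sketch, not the agent's actual code: `boundedCache` is a simplified stand-in for the reporter's LRU (it evicts the oldest inserted entry), and `sendOnceProducer`, `observe` and `simulate` are hypothetical names invented for this illustration.

```go
package main

import "fmt"

// boundedCache is a simplified stand-in for the reporter's LRU: it evicts
// the oldest inserted entry once the capacity is exceeded.
type boundedCache struct {
	capacity int
	order    []uint64
	names    map[uint64]string
}

func newBoundedCache(capacity int) *boundedCache {
	return &boundedCache{capacity: capacity, names: make(map[uint64]string)}
}

func (c *boundedCache) put(id uint64, name string) {
	if _, ok := c.names[id]; !ok {
		c.order = append(c.order, id)
	}
	c.names[id] = name
	if len(c.order) > c.capacity {
		oldest := c.order[0]
		c.order = c.order[1:]
		delete(c.names, oldest)
	}
}

func (c *boundedCache) get(id uint64) (string, bool) {
	name, ok := c.names[id]
	return name, ok
}

// sendOnceProducer mimics a component that reports a symbol only the first
// time it sees a frame (hypothetical, for illustration only).
type sendOnceProducer struct {
	sent map[uint64]bool
}

func (p *sendOnceProducer) observe(id uint64, name string, c *boundedCache) {
	if p.sent[id] {
		return // already sent once, never resent
	}
	p.sent[id] = true
	c.put(id, name)
}

// simulate reports whether frame 1 is still resolvable after the cache has
// been put under eviction pressure.
func simulate() bool {
	cache := newBoundedCache(2)
	producer := &sendOnceProducer{sent: make(map[uint64]bool)}
	producer.observe(1, "main.run", cache)
	producer.observe(2, "main.work", cache)
	producer.observe(3, "main.idle", cache) // evicts frame 1
	producer.observe(1, "main.run", cache)  // suppressed: was sent before
	_, ok := cache.get(1)
	return ok
}

func main() {
	fmt.Println("frame 1 resolvable:", simulate())
}
```

In this sketch frame 1 stays unresolvable until the producer's `sent` state is reset, which mirrors the "until agent restart" behaviour described in the problem statement.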

Affected information

The following information is currently prone to falling out of LRU without a chance of it ever being resent:

Interpreter and kernel frame symbols
- Each interpreter handler currently implements a domain-specific approach to ensuring that frame info is sent just once for the lifetime of the interpreter process (e.g. Python)
- Kernel frame resends are suppressed with an LRU without an expiry
Executable info
- Sent only once when the executable is first seen, never resent

Rough outline of a solution

We need to rework the whole trace pipeline to ensure that all of this information is available all the time. There are two possible paths that we can pursue here:

Make all components resend all information all the time.
Rework the pipeline to be query-based instead: if the reporter needs an executable name, it would go to the process manager and ask for it. Probably nice efficiency-wise, but results in ugly circular dependencies.
Probably more options that I didn't think of when creating this issue

We can probably get rid of tracehandler entirely. The caches that it maintains
will likely go away and the remaining few lines can be merged directly into
Tracer.

Sub-issues

Investigate possible solutions (see comment)
interpreters, reporter: refactor towards statelessness #171 for discussion about reporter and interpreter changes
Make reporting of executable metadata stateless
Make reporting of kernel symbols stateless


rockdaboot · 2024-09-30T11:32:34Z

The most important point of this issue is that some of the symbols are retrieved/reported only once per agent lifetime. This can even be problematic with a stateful backend, if data is removed, manually or via automatic data retention policies.

With a stateless protocol like the OTEL protocol, the issue becomes even more pronounced. The agent core was developed with a stateful protocol/backend in mind, so the switch to the stateless OTEL protocol requires changes with regard to caching (mostly of symbols).

Possibly the most important change is to move the caching of symbols out of the agent core and into the Reporter implementation, which then decides on caching details and resending.

Consequently, the Reporter interface needs to be amended (as well as the agent core).

Possible solutions

1. The agent core always passes frame symbols to the reporter with every stacktrace.
   - The downside would be increased CPU usage for creating arrays of symbols, even if not needed.
2. The agent core provides a function to return symbols, which is called by the reporter if needed.
   - The downside is an ugly call/dependency recursion.
3. The Reporter interface provides a function that allows the agent core to ask whether symbols for a given frame are needed.
   - The downside is that this function needs to be called very often (one call per frame).
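To make the trade-offs between the three options easier to compare, here is a rough sketch of what each interface shape could look like in Go. All type and method signatures below are assumptions made for illustration and are not the agent's real API. Only the second (callback) option is implemented, with a lookup counter that shows the reporter calling back into the core just for frames it does not already know.

```go
package main

import "fmt"

// FrameID and Symbol are simplified, hypothetical types.
type FrameID uint64

type Symbol struct{ Function string }

// Option 1: symbols travel with every stack trace.
type Trace struct {
	Frames  []FrameID
	Symbols []Symbol // always populated, even if the reporter already has them
}

type PushReporter interface {
	ReportTrace(t Trace)
}

// Option 2: the core hands the reporter a lookup function.
type SymbolLookup func(FrameID) (Symbol, bool)

type PullReporter interface {
	ReportFrames(frames []FrameID, lookup SymbolLookup)
}

// Option 3: the core asks the reporter before resolving a frame.
type AskReporter interface {
	FrameNeeded(id FrameID) bool
	FrameMetadata(id FrameID, sym Symbol)
}

// cachingPullReporter implements option 2 and counts callbacks into the core.
type cachingPullReporter struct {
	known   map[FrameID]Symbol
	lookups int
}

func (r *cachingPullReporter) ReportFrames(frames []FrameID, lookup SymbolLookup) {
	for _, id := range frames {
		if _, ok := r.known[id]; ok {
			continue // symbol already cached, no callback needed
		}
		r.lookups++
		if sym, ok := lookup(id); ok {
			r.known[id] = sym
		}
	}
}

// countLookups reports two traces and returns how often the core was queried.
func countLookups() int {
	r := &cachingPullReporter{known: make(map[FrameID]Symbol)}
	lookup := func(id FrameID) (Symbol, bool) {
		return Symbol{Function: fmt.Sprintf("func_%d", id)}, true
	}
	var reporter PullReporter = r
	reporter.ReportFrames([]FrameID{1, 2, 1}, lookup)
	reporter.ReportFrames([]FrameID{1, 2, 3}, lookup)
	return r.lookups
}

func main() {
	fmt.Println("lookups:", countLookups())
}
```

The callback keeps symbol creation lazy, but the reporter now calls back into the core while the core is calling the reporter, which is the call/dependency recursion named as the downside of the second option.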

@fabled is working on a PoC PR that implements point 3, for further discussion and for doing benchmarks.

Additional required work

Kernel modules are recognized only at agent startup. How can we parse their symbols lazily?

Due to legacy reasons, each interpreter kept their own state of which dynamic metadata should be sent to the reporter. Several of these caches would never expire, causing caching issues in the otlp reporter module.

This removes the caching state from all interpreters and pushes it to the reporter module. A new reporter API call FrameNeeded is added to query if a specific Frame is in the cache or not. Not all interpreter modules use the call as all the information might be available with little overhead. FrameMetadata is also updated to use the FrameID type for symmetry.

Improved are:
- reduced memory overhead as per-interpreter caches are removed
- reporter module can now control which frames need resolving
- fixes otlp to get the frames re-symbolized if its internal lru already forgot about the earlier symbolization information

ref open-telemetry#121
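The interaction described in this commit message could look roughly like the sketch below. The names `FrameNeeded`, `FrameMetadata` and `FrameID` come from the text above, but their signatures here are assumptions, and `lruReporter`, `symbolize` and `resolveCount` are hypothetical helpers written only for this example. The handler keeps no cache of its own and resolves a frame whenever the reporter says it is needed.

```go
package main

import (
	"container/list"
	"fmt"
)

type FrameID uint64

type entry struct {
	id   FrameID
	name string
}

// lruReporter owns the only symbol cache (hypothetical, for illustration).
type lruReporter struct {
	capacity int
	order    *list.List // front = most recently used
	entries  map[FrameID]*list.Element
}

func newLRUReporter(capacity int) *lruReporter {
	return &lruReporter{
		capacity: capacity,
		order:    list.New(),
		entries:  make(map[FrameID]*list.Element),
	}
}

// FrameNeeded returns true if the symbol for id is not in the cache.
func (r *lruReporter) FrameNeeded(id FrameID) bool {
	el, ok := r.entries[id]
	if ok {
		r.order.MoveToFront(el)
	}
	return !ok
}

// FrameMetadata stores the symbol and evicts the least recently used entry.
func (r *lruReporter) FrameMetadata(id FrameID, name string) {
	if el, ok := r.entries[id]; ok {
		el.Value.(*entry).name = name
		r.order.MoveToFront(el)
		return
	}
	r.entries[id] = r.order.PushFront(&entry{id: id, name: name})
	if r.order.Len() > r.capacity {
		oldest := r.order.Back()
		r.order.Remove(oldest)
		delete(r.entries, oldest.Value.(*entry).id)
	}
}

// symbolize stands in for an interpreter handler without its own cache.
func symbolize(r *lruReporter, id FrameID, resolve func(FrameID) string) {
	if !r.FrameNeeded(id) {
		return // reporter still has the symbol, skip the expensive work
	}
	r.FrameMetadata(id, resolve(id))
}

// resolveCount returns how many times frames had to be resolved.
func resolveCount() int {
	r := newLRUReporter(2)
	count := 0
	resolve := func(id FrameID) string {
		count++
		return fmt.Sprintf("func_%d", id)
	}
	// Frame 2 is evicted when frame 3 arrives and is resolved again later.
	for _, id := range []FrameID{1, 2, 1, 3, 2} {
		symbolize(r, id, resolve)
	}
	return count
}

func main() {
	fmt.Println("resolutions:", resolveCount())
}
```

Because the reporter is the single owner of the cache, an eviction simply leads to one more resolution the next time the frame shows up, instead of a permanently missing symbol.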

athre0z added the cleanup label Aug 15, 2024

fabled self-assigned this Sep 25, 2024

fabled mentioned this issue Sep 30, 2024

interpreters, reporter: refactor towards statelessness #171

Open
