Rework the trace pipeline towards statelessness #121

Open
1 of 4 tasks
athre0z opened this issue Aug 15, 2024 · 1 comment

@athre0z
Member

athre0z commented Aug 15, 2024

Problem

Our trace processing pipeline is currently engineered towards a backend that keeps the information it receives around forever in a bunch of places. Once information has been sent, it often won't be sent again until the agent restarts. This is problematic for two reasons:

  • The new OTLP protocol is stateless and requires us to send completely self-contained packets of data. The OTLP reporter currently works around this by keeping all the info in LRUs. This kind of works, but we run into issues of permanently missing symbols and executable names when the LRU starts evicting. It also costs a lot of memory.
  • Stateful reporter implementations with backends that phase out old data after a fixed period will run into issues with data that got evicted but will never be resent unless the profiling agent is restarted periodically.
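To make the failure mode concrete, here is a minimal sketch of how a "send once" producer and a bounded reporter cache interact. All names (`frameID`, `boundedCache`, `sendOnceProducer`) are hypothetical and not taken from the agent's code, and the cache uses simple FIFO eviction as a stand-in for the real LRU.

```go
package main

import "fmt"

// frameID is a hypothetical stand-in for the agent's frame identifier.
type frameID uint64

// boundedCache is a deliberately tiny FIFO-evicting cache that stands in
// for the reporter-side LRU. capacity must be at least 1.
type boundedCache struct {
	capacity int
	order    []frameID
	symbols  map[frameID]string
}

func newBoundedCache(capacity int) *boundedCache {
	return &boundedCache{capacity: capacity, symbols: make(map[frameID]string)}
}

// put stores a symbol and evicts the oldest entry once the cache is full.
func (c *boundedCache) put(id frameID, sym string) {
	if _, ok := c.symbols[id]; !ok {
		if len(c.order) == c.capacity {
			oldest := c.order[0]
			c.order = c.order[1:]
			delete(c.symbols, oldest)
		}
		c.order = append(c.order, id)
	}
	c.symbols[id] = sym
}

func (c *boundedCache) get(id frameID) (string, bool) {
	sym, ok := c.symbols[id]
	return sym, ok
}

// sendOnceProducer mimics a handler that remembers forever which frames
// it has already reported and never reports them again.
type sendOnceProducer struct {
	sent map[frameID]bool
}

func (p *sendOnceProducer) observe(id frameID, sym string, c *boundedCache) {
	if p.sent[id] {
		return // already reported once, so the resend is suppressed
	}
	p.sent[id] = true
	c.put(id, sym)
}

func main() {
	cache := newBoundedCache(2)
	prod := &sendOnceProducer{sent: make(map[frameID]bool)}

	prod.observe(1, "main.foo", cache)
	prod.observe(2, "main.bar", cache)
	prod.observe(3, "main.baz", cache) // evicts frame 1 from the cache
	prod.observe(1, "main.foo", cache) // suppressed by the producer

	_, ok := cache.get(1)
	fmt.Println("frame 1 symbolized:", ok) // prints "frame 1 symbolized: false"
}
```

Once frame 1 has been evicted, nothing in the pipeline is able to restore it, because the only component that knows the symbol believes it has already been delivered.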

Affected information

The following information is currently prone to falling out of LRU without a chance of it ever being resent:

  • Interpreter and kernel frame symbols
    • Each interpreter handler currently implements a domain-specific approach to ensuring that frame info is sent just once for the lifetime of the interpreter process (e.g. Python)
    • Kernel frame resends are suppressed with an LRU without an expiry
  • Executable info
    • Sent only once when the executable is first seen, never resent

Rough outline of a solution

We need to rework the whole trace pipeline to ensure that all of this information is available all the time. There are at least two possible paths that we can pursue here:

  • Make all components resend all information all the time.
  • Rework the pipeline to be query-based instead: if the reporter needs an executable name, it would go to the process manager and ask for it. Probably nice efficiency-wise, but results in ugly circular dependencies.
  • Probably more options that I didn't think of when creating this issue
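As a rough illustration of the query-based path, the sketch below has the reporter pull executable names on demand through a small interface instead of receiving them once. Every name here (`fileID`, `executableInfoSource`, `processManager`, `queryingReporter`) is hypothetical. Depending on a narrow interface rather than on the concrete process manager is one common Go way to soften the circular dependency, not something this issue has settled on.

```go
package main

import "fmt"

// fileID is a hypothetical identifier for an executable.
type fileID uint64

// executableInfoSource is the narrow query interface the reporter depends on.
// It is an assumption made for this sketch, not an existing agent API.
type executableInfoSource interface {
	ExecutableName(id fileID) (string, bool)
}

// processManager is a stand-in for the component that knows about executables.
type processManager struct {
	names map[fileID]string
}

func (pm *processManager) ExecutableName(id fileID) (string, bool) {
	name, ok := pm.names[id]
	return name, ok
}

// queryingReporter asks for executable info whenever it builds a packet,
// so it does not need to keep its own long-lived copy.
type queryingReporter struct {
	src executableInfoSource
}

// describe returns a self-contained description of one executable.
func (r *queryingReporter) describe(id fileID) string {
	name, ok := r.src.ExecutableName(id)
	if !ok {
		name = "<unknown>"
	}
	return fmt.Sprintf("file=%#x executable=%s", uint64(id), name)
}

func main() {
	pm := &processManager{names: map[fileID]string{0x2a: "/usr/bin/python3"}}
	rep := &queryingReporter{src: pm}
	fmt.Println(rep.describe(0x2a))
	fmt.Println(rep.describe(0x1))
}
```

The reporter only sees the interface, so the process manager can keep pushing traces to the reporter without the two packages importing each other.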

We can probably get rid of tracehandler entirely. The caches that it maintains
will likely go away and the remaining few lines can be merged directly into
Tracer.

fabled self-assigned this Sep 25, 2024
@rockdaboot
Contributor

The most important point of this issue is that some of the symbols are retrieved/reported only once per agent lifetime. This can even be problematic with a stateful backend, if data is removed, manually or via automatic data retention policies.

With a stateless protocol like the OTEL protocol, the issue becomes even more pronounced. The agent core has been developed with a stateful protocol/backend in mind, so the switch to the stateless OTEL protocol requires changes with regard to caching (mostly of symbols).

Possibly the most important change is to move the caching of symbols out of the agent core and into the Reporter implementation, which then decides on caching details and resending.

Consequently, the Reporter interface needs to be amended (as well as the agent core).

Possible solutions

  1. The agent core always passes frame symbols to the reporter with every stacktrace.
    The downside would be increased CPU usage for creating arrays of symbols, even if not needed.

  2. The agent core provides a function to return symbols, which is called by the reporter if needed.
    The downside is an ugly call/dependency recursion.

  3. The Reporter interface provides a function that allows the agent core to ask whether symbols for a given frame are needed.
    The downside is that this function needs to be called very often (one call per frame).

@fabled is working on a PoC PR that implements option 3, for further discussion and benchmarking.
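For discussion, option 3 could look roughly like the sketch below. The method names `FrameNeeded` and `FrameMetadata` are borrowed from the commit message referenced in this issue, but their signatures here are assumptions, and everything else (`frameID`, `symbolReporter`, `cachingReporter`, `symbolizeIfNeeded`) is hypothetical.

```go
package main

import "fmt"

// frameID is a hypothetical, comparable frame key.
type frameID struct {
	fileID        uint64
	addressOrLine uint64
}

// symbolReporter sketches the amended Reporter interface for option 3.
// The signatures are assumptions made for this example.
type symbolReporter interface {
	// FrameNeeded reports whether the reporter lacks symbols for the frame.
	FrameNeeded(id frameID) bool
	// FrameMetadata hands the resolved symbol to the reporter.
	FrameMetadata(id frameID, functionName string)
}

// cachingReporter owns the symbol cache, so it alone decides about resends.
type cachingReporter struct {
	known map[frameID]string
}

func (r *cachingReporter) FrameNeeded(id frameID) bool {
	_, ok := r.known[id]
	return !ok
}

func (r *cachingReporter) FrameMetadata(id frameID, functionName string) {
	r.known[id] = functionName
}

// forget simulates the reporter's cache evicting an entry.
func (r *cachingReporter) forget(id frameID) {
	delete(r.known, id)
}

// symbolizeIfNeeded is what an interpreter handler might do per frame.
// It returns true if it had to resolve and report the symbol.
func symbolizeIfNeeded(r symbolReporter, id frameID, resolve func(frameID) string) bool {
	if !r.FrameNeeded(id) {
		return false // reporter still has it, so skip the expensive lookup
	}
	r.FrameMetadata(id, resolve(id))
	return true
}

func main() {
	rep := &cachingReporter{known: make(map[frameID]string)}
	id := frameID{fileID: 1, addressOrLine: 42}
	resolve := func(frameID) string { return "handler.run" }

	fmt.Println(symbolizeIfNeeded(rep, id, resolve)) // true: first sighting
	fmt.Println(symbolizeIfNeeded(rep, id, resolve)) // false: still cached
	rep.forget(id)
	fmt.Println(symbolizeIfNeeded(rep, id, resolve)) // true: re-symbolized after eviction
}
```

The per-frame `FrameNeeded` call is the cost named in option 3. In exchange, the interpreter keeps no "already sent" state of its own, so an eviction in the reporter leads to a resend.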

Additional required work

  • Kernel modules are recognized only at agent startup. How can we parse their symbols lazily?

fabled added a commit to fabled/opentelemetry-ebpf-profiler that referenced this issue Sep 30, 2024
Due to legacy reasons, each interpreter kept their own state of
which dynamic metadata should be sent to the reporter. Several
of these caches would never expire, causing caching issues in
the otlp reporter module.

This removes the caching state from all interpreters and pushes
it to the reporter module. A new reporter API call FrameNeeded
is added to query if a specific Frame is in the cache or not.
Not all interpreter modules use the call as all the information
might be available with little overhead. FrameMetadata is also
updated to use the FrameID type for symmetry.

Improved are:
 - reduced memory overhead as per-interpreter caches are removed
 - reporter module can now control which frames need resolving
 - fixes otlp to get the frames re-symbolized if its internal
   lru already forgot about the earlier symbolization information

ref open-telemetry#121