Ftrace: The Linux kernel hacker’s swiss knife

This post was written by eli on April 13, 2020
Posted Under: Linux kernel

Introduction

I ran into ftrace completely by chance, while trying to figure out why the call to usb_submit_urb() took so long time. In fact, it wasn’t. It was pr_info() that delayed the output. And it was ftrace that got me to realize that.

Whether you’re into dissecting existing kernel code, and want to know which function calls which, or if you need a close look on what your own code actually does (and when), ftrace is your companion. And it does a lot more.

And that’s maybe its problem: It offers so many different possibilities, that its documentation gets not so inviting to read. Add some confusing terminology and focus on advanced issues, and one gets the impression that starting to use it is a project in itself.

This is definitely not the case. This post consists of some simple & useful tasks. It’s not much about accuracy, doing it the right way nor showing the whole picture. It’s about about getting stuff done. If you’ll need to nail down something more specific, read the docs. It looks like they got virtually any useful scenario covered.

I’ll divert from keeping things simple in part on events at the end of this post. The concept of events is fairly simple, but the implementation, well, well. But the point is partly to demonstrate exactly that.

Does your kernel support ftrace?

Ftrace is often enabled in compiled kernels, but not always. Look for /sys/kernel/debug/tracing/ (as root) and in particular go

# cat available_tracers

If function_graph and function aren’t listed as available_tracers, the kernel needs to be recompiled with the correct options. Namely, CONFIG_FUNCTION_TRACER, CONFIG_FUNCTION_GRAPH_TRACER, CONFIG_STACK_TRACER and CONFIG_DYNAMIC_FTRACE.

References

These are some good resources. I would usually put them at the end of the post, but noone is expected to get there.

  • A gentle introduction by Alex Dzyoba.
  • /sys/kernel/debug/tracing/README — it’s like the -h flag on user space utilities
  • Detailed documentation from the kernel tree: Documentation/trace/ftrace.rst (and other files in the same directory)
  • A tutorial in LWN, part 1 and part 2.
  • Some good slides on the topic.

Getting around a bit

They’ll tell you that there’s a special tracefs to mount, but it’s easily available under /sys/kernel:

# cd /sys/kernel/debug/tracing/

For tracing invocations of functions, there are two formats: The simple “function” and it’s more sophisticated “function_graph”. It’s easiest to just try them out: Select the function graph tracer

# echo function_graph > current_tracer

Watch the tracing output

# less trace

There’s also trace_pipe with a “tail -f” flavor.

The output isn’t very helpful for now, as it shows every function invoked on all processors. Too much information. We’ll get to filtering just below.

Turn tracing off

# echo 0 > tracing_on

or on

# echo 1 > tracing_on

and clear (empty) the trace:

# echo > trace

Note that turning tracing off turns off everything, including trace_printk() discussed below. It’s a complete halt. To just stop one of function tracers, better go

# echo nop > current_tracer

The trace data is stored in circular buffers, so old entries are overwritten by newer ones if these buffers get full. The problem is that there’s a separate buffer for each CPU, so once this overwriting begins, the overall trace may miss out traces from only some CPUs on those time segments. Therefore I prefer turning off overwriting completely. At least the beginning of the trace reflects what actually happened (and the mess is left to the end):

# echo 0 > options/overwrite

Now to the “function” variant, just which function was invoked by which along with a timestamp. There is however no information on when the function returned (how much time it took), but the absolute timestamp can be matched with dmesg stamp. But this requires selecting ftrace’s clock as global:

# echo function > current_tracer
# echo global > trace_clock

It’s very important to note that printk (and its derivatives) can take up to a few milliseconds, so an apparent mismatch in the timestamps between a printk and a function called immediately after it may be a result of that.

Using trace_printk()

It’s quite common to use printk and its derivatives for debugging kernel code, which is fairly OK if there’s no problem if these slow down the execution — I’ve seen printk taking a few milliseconds (in process context, I should say).

So for quick and time-accurate printk-like debugging, definitely go for trace_printk(). Plus you get some extra info, such as the process / interrupt context and which CPU it ran on. It’s just a matter of adding something like

trace_printk("Submitting buffer ID=%d, len=%d\n", id, count);

in your kernel code, instead of pr_info(), dev_info() or whatever. No extra header file or anything of that sort needed, as far as I can tell.

I should mention that trace_printk() is intended for temporary debug messages, and these can’t be left in production code. If you want debug messages that stay there, go for events. Trickier, but the right way to go.

trace_printk() messages are logged by default, even when current_tracer is set to “nop”. However tracing must be on. In short, a newly booted system will show trace_printk()’s output in “trace”.

These messages will be interleaved with the other traces if current_tracer is in non-nop mode. This can be useful when the situation of interest occurs rarely — for example a code segments takes too long to run. Since the outcome is known only at the end, the simple solution is to run function tracing, and call trace_printk() when the time difference exceeds a threshold. This message will of course appear after the function sequence in the trace, but it’s easily looked up with plain text tools (e.g. “less”).

The system boots up with tracing on and in “nop” mode, but if you’ve been fiddling with these, bring them back with

# echo nop > current_tracer
# echo 1 > tracing_on

And then view the messages just like before (“trace”, “trace_pipe” or the CPU-individual counterparts).

The reason you don’t want to leave trace_printk() in production code is that the first call will leave a message of this form in the kernel log:

[ 1954.872204] **********************************************************
[ 1954.875230] **   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **
[ 1954.878242] **                                                      **
[ 1954.881257] ** trace_printk() being used. Allocating extra memory.  **
[ 1954.884281] **                                                      **
[ 1954.887345] ** This means that this is a DEBUG kernel and it is     **
[ 1954.890397] ** unsafe for production use.                           **
[ 1954.893470] **                                                      **
[ 1954.896496] ** If you see this message and you are not debugging    **
[ 1954.899527] ** the kernel, report this immediately to your vendor!  **
[ 1954.902539] **                                                      **
[ 1954.905598] **   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **
[ 1954.908713] **********************************************************

This is of course nothing to be alarmed about (if you’re the one who made the trace_printk() calls, that is). Seems like the purpose of this message is to gently convince programmers to use events in production code instead.

Tracing a specific function call

Ftrace comes with great filtering capabilities, which can be really complicated. So let’s take a simple usage case. Say that I want to see how much time elapses from the call to usb_submit_urb() in my driver and the callback function. Never mind the details. First, I might want to verify that the function is really traceable — it might not be if it has been optimized away by the compiler or if it’s a #define macro (which definitely isn’t the case for usb_submit_urb(), since it’s an exported function, but anyhow).

So first, look it up in the list of available functions (cwd is /sys/kernel/debug/tracing):

# grep submit_urb available_filter_functions
usb_hcd_submit_urb
usb_submit_urb

Yep, it’s there. So add it to the functions to filter, along with my driver’s callback function, which is bulk_out_completer():

# echo usb_submit_urb bulk_out_completer > set_ftrace_filter
# echo function > current_tracer
# echo > trace

Note that multiple functions can be given to set_ftrace_filter, delimited by plain whitespace. Wildcards can also be used, as shown below.

Now perform the operation (use dd command in my case) that makes the driver active, and harvest the output:

# head -30 trace
# tracer: function
#
# entries-in-buffer/entries-written: 64/64   #P:4
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
              dd-847   [001] ....  1787.067325: usb_submit_urb <-try_queue_bulk_out
              dd-847   [001] ....  1787.067335: usb_submit_urb <-try_queue_bulk_out
              dd-847   [001] ....  1787.067342: usb_submit_urb <-try_queue_bulk_out
              dd-847   [001] ....  1787.067348: usb_submit_urb <-try_queue_bulk_out
              dd-847   [001] ....  1787.067353: usb_submit_urb <-try_queue_bulk_out
              dd-847   [001] ....  1787.067358: usb_submit_urb <-try_queue_bulk_out
              dd-847   [001] ....  1787.067363: usb_submit_urb <-try_queue_bulk_out
              dd-847   [001] ....  1787.067369: usb_submit_urb <-try_queue_bulk_out
          <idle>-0     [002] d.h1  1787.068019: bulk_out_completer <-__usb_hcd_giveback_urb
    kworker/2:1H-614   [002] ....  1787.068080: usb_submit_urb <-try_queue_bulk_out
          <idle>-0     [002] d.h1  1787.068528: bulk_out_completer <-__usb_hcd_giveback_urb
    kworker/2:1H-614   [002] ....  1787.068557: usb_submit_urb <-try_queue_bulk_out
          <idle>-0     [002] d.h1  1787.069039: bulk_out_completer <-__usb_hcd_giveback_urb
    kworker/2:1H-614   [002] ....  1787.069062: usb_submit_urb <-try_queue_bulk_out
          <idle>-0     [002] d.h1  1787.069533: bulk_out_completer <-__usb_hcd_giveback_urb
    kworker/2:1H-614   [002] ....  1787.069556: usb_submit_urb <-try_queue_bulk_out
          <idle>-0     [002] d.h1  1787.070012: bulk_out_completer <-__usb_hcd_giveback_urb
    kworker/2:1H-614   [002] ....  1787.070036: usb_submit_urb <-try_queue_bulk_out
          <idle>-0     [002] d.h1  1787.070506: bulk_out_completer <-__usb_hcd_giveback_urb

Not sure how useful the data above is, but it demonstrates how a sequence of events can be analyzed easily.

Just to close this issue, let’s take a look on wildcards. This is easiest shown with an example:

# echo '*_submit_urb' > set_ftrace_filter
# cat set_ftrace_filter
usb_hcd_submit_urb
usb_submit_urb

So quite evidently, the wildcard was applied when the filter was set, and its usage is quite trivial.

Don’t forget to remove the filter when it’s not necessary anymore. These filters won’t go away otherwise, and tracing may appear to suddenly not work if you go on doing something else. So simply:

# echo > set_ftrace_filter
# cat set_ftrace_filter
#### all functions enabled ####

It’s not a coincidence that I didn’t use function_graph above — the functions that are called by the selected function(s) won’t show anyhow. It does show the duration of the call, which may be helpful. This brings us to

Which functions does function X call?

This is a great way to get an idea of who-calls-what. Let’s stick with usb_submit_urb(). What other functions does it call, specifically in my case?

Simple. Be sure to have removed any previous filters (see above) and then just go:

# echo usb_submit_urb > set_graph_function
# echo function_graph > current_tracer
# echo > trace

Then do whatever causes the function to be called, after which “trace” reads something like:

# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 0)               |  usb_submit_urb() {
 0)   0.979 us    |    usb_urb_ep_type_check();
 0)               |    usb_hcd_submit_urb() {
 0)   0.427 us    |      usb_get_urb();
 0)               |      xhci_map_urb_for_dma [xhci_hcd]() {
 0)               |        usb_hcd_map_urb_for_dma() {
 0)   0.658 us    |          dma_direct_map_page();
 0)   1.739 us    |        }
 0)   2.642 us    |      }

[ ... ]

and it goes on.

This is a good place to remind that if all you wanted was to get the stack trace of calls at a certain point in your code, WARN() and WARN_ONCE() may be handier.

Tracing a segment: quick and dirty

This method doesn’t really trace a specific segment. It’s a bit of a dirty trick: Tracing is disabled at first, and then enabled before the function call, and disabled after it. All function calls that take place on all CPUs during that time are recorded. It may mean additional noise, but this noise may also be the explanation to why something went wrong (if it went wrong).

There is probably a more elegant solution. In particular, ftrace triggers have “traceon” and “traceoff” actions, so this is probably the classier way to go. But let’s do something quick, dirty and simple to understand.

So say that I want to understand what usb_submit_urb() is up to and what other functions it calls along. I could go, in my own driver:

tracing_on();
rc = usb_submit_urb(urb, GFP_KERNEL);
tracing_off();

With this driver loaded into the kernel, then go something like:

# cd /sys/kernel/debug/tracing/
# echo global > trace_clock
# echo 32768 > buffer_size_kb
# echo 0 > tracing_on
# echo function > current_tracer

This assumes that current_tracer was “nop” before this (it’s by default), so the trace buffer begins empty. And when it’s changed to “function” at the last command, nothing happens, because tracing was just turned off. The buffer size is set to 32 MB per CPU (which is quite a lot).

Then run whatever makes the relevant code execute, and then

# less trace

The problem is that trace output from several CPUs is interleaved in this output. But we know which CPU the relevant command ran on from the trace itself, so obtain a filtered version (e.g. for CPU #2):

# less per_cpu/cpu2/trace

Events

The term “events” is somewhat misleading. An event is just a piece of printk-like debug message that can be enabled or disabled in runtime. It’s a convenient way to leave the debugging messages in place even in production code, and make it possible even for end-users to harvest the information without needing to compile anything.

I’ll try to explain this by a (ehm-ehm) simple example. Let’s enable the kmalloc event:

# echo nop > current_tracer
# echo kmalloc > set_event

This turns off the function tracer, and enables the kmalloc event. Looking at “trace” we now have something like:

# tracer: nop
#
# entries-in-buffer/entries-written: 1047/1047   #P:4
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
            sshd-781   [003] ....  5317.446761: kmalloc: call_site=ffffffff8153b210 ptr=000000001f5ab582 bytes_req=640 bytes_alloc=1024 gfp_flags=GFP_KERNEL|__GFP_NOWARN|__GFP_NOMEMALLOC
            sshd-781   [003] ...1  5317.446788: kmalloc: call_site=ffffffff8153cbf1 ptr=000000002525fcc0 bytes_req=1024 bytes_alloc=1024 gfp_flags=GFP_ATOMIC|__GFP_NOWARN|__GFP_NOMEMALLOC
            sshd-781   [003] ....  5317.748476: kmalloc: call_site=ffffffff8153b210 ptr=0000000095781cbc bytes_req=640 bytes_alloc=1024 gfp_flags=GFP_KERNEL|__GFP_NOWARN|__GFP_NOMEMALLOC
            sshd-781   [003] ...1  5317.748501: kmalloc: call_site=ffffffff8153cbf1 ptr=00000000c9801d3d bytes_req=1024 bytes_alloc=1024 gfp_flags=GFP_ATOMIC|__GFP_NOWARN|__GFP_NOMEMALLOC
            sshd-781   [003] ....  5317.900662: kmalloc: call_site=ffffffff8153b210 ptr=000000008e7d4585 bytes_req=640 bytes_alloc=1024 gfp_flags=GFP_KERNEL|__GFP_NOWARN|__GFP_NOMEMALLOC
            sshd-781   [003] ...1  5317.900687: kmalloc: call_site=ffffffff8153cbf1 ptr=0000000004406a83 bytes_req=1024 bytes_alloc=1024 gfp_flags=GFP_ATOMIC|__GFP_NOWARN|__GFP_NOMEMALLOC
            bash-792   [000] ....  5318.420356: kmalloc: call_site=ffffffff8119f2c0 ptr=00000000a6835237 bytes_req=184 bytes_alloc=192 gfp_flags=GFP_KERNEL_ACCOUNT|__GFP_ZERO
            bash-792   [000] ....  5318.420365: kmalloc: call_site=ffffffff8119f34d ptr=000000008e7d4585 bytes_req=640 bytes_alloc=1024 gfp_flags=GFP_KERNEL_ACCOUNT|__GFP_ZERO
            bash-792   [000] ....  5318.420404: kmalloc: call_site=ffffffff8116feff ptr=00000000c30e90f8 bytes_req=64 bytes_alloc=64 gfp_flags=GFP_KERNEL|__GFP_ZERO
            sshd-781   [003] ....  5318.420408: kmalloc: call_site=ffffffff8153b210 ptr=000000000b7b85e5 bytes_req=640 bytes_alloc=1024 gfp_flags=GFP_KERNEL|__GFP_NOWARN|__GFP_NOMEMALLOC
            bash-792   [000] ....  5318.420415: kmalloc: call_site=ffffffff811701e4 ptr=00000000f54127a0 bytes_req=32 bytes_alloc=32 gfp_flags=GFP_KERNEL|__GFP_ZERO
            bash-792   [000] ....  5318.420431: kmalloc: call_site=ffffffff812eb5e2 ptr=0000000084bbe3b4 bytes_req=24 bytes_alloc=32 gfp_flags=GFP_KERNEL|__GFP_ZERO
            sshd-781   [003] ...1  5318.420435: kmalloc: call_site=ffffffff8153cbf1 ptr=00000000a6a3ac50 bytes_req=1024 bytes_alloc=1024 gfp_flags=GFP_ATOMIC|__GFP_NOWARN|__GFP_NOMEMALLOC
            bash-792   [000] ....  5318.420473: kmalloc: call_site=ffffffff811b1aac ptr=0000000095972087 bytes_req=56 bytes_alloc=64 gfp_flags=GFP_KERNEL_ACCOUNT

These are all kmalloc calls that were executed since the event was enabled. There are of course ways to filter the events to specific occasions, but this is really not something I’ve gotten into (yet?).

It’s however interesting to see how these messages came about. To do this, try the following (after enabling the kmalloc event as shown above):

# echo '*kmalloc*' > set_graph_function
# echo function_graph > current_tracer
# echo > trace

And then the trace output can be something like:

# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 0)               |  finish_task_switch() {
 0)               |    _raw_spin_unlock_irq() {
 0)   0.794 us    |      do_raw_spin_unlock();
 0)   0.777 us    |      preempt_count_sub();
 0)   4.402 us    |    }
 0)   8.919 us    |  }
 0)               |  __kmalloc_reserve.isra.9() {
 0)               |    __kmalloc_track_caller() {
 0)   0.784 us    |      kmalloc_slab();
 0)   0.654 us    |      should_failslab();
 0)   0.711 us    |      check_irq_off();
 0)               |      cache_alloc_debugcheck_after() {
 0)   4.691 us    |        check_poison_obj();
 0)   0.685 us    |        poison_obj();
 0)   7.353 us    |      }
 0)               |      /* kmalloc: call_site=ffffffff81801a80 ptr=000000000493ffc2 bytes_req=640 bytes_alloc=1024 gfp_flags=GFP_KERNEL|__GFP_NOWARN|__GFP_NOMEMALLOC */
 0) + 15.571 us   |    }
 0) + 17.452 us   |  }

And it goes on like this several times. So now we have the event output in the middle of the function graph, and we can also see what happened: __kmalloc_reserve(), which is defined in net/core/skbuff.c, calls kmalloc_node_track_caller(), which is translated into __kmalloc_track_caller() by virtue of a #define in slab.h. That function is defined in mm/slab.c, but it just redirects the call to __do_kmalloc_node(), which makes the calls visible in the trace, and eventually calls kmem_cache_alloc_node_trace(). This call isn’t registered, most likely because it was optimized away by the compiler. And it reads:

#ifdef CONFIG_TRACING
void *
kmem_cache_alloc_trace(struct kmem_cache *cachep, gfp_t flags, size_t size)
{
	void *ret;

	ret = slab_alloc(cachep, flags, _RET_IP_);

	ret = kasan_kmalloc(cachep, ret, size, flags);
	trace_kmalloc(_RET_IP_, ret,
		      size, cachep->size, flags);
	return ret;
}
EXPORT_SYMBOL(kmem_cache_alloc_trace);
#endif

So there we have it. But wait! Where is trace_malloc defined? The answer lies at the top of slab.c:

#include <trace/events/kmem.h>

which includes include/trace/events/kmem.h, the beginning of which reads

 /* SPDX-License-Identifier: GPL-2.0 */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM kmem

#if !defined(_TRACE_KMEM_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_KMEM_H

#include <linux/types.h>
#include <linux/tracepoint.h>
#include <trace/events/mmflags.h>

DECLARE_EVENT_CLASS(kmem_alloc,

	TP_PROTO(unsigned long call_site,
		 const void *ptr,
		 size_t bytes_req,
		 size_t bytes_alloc,
		 gfp_t gfp_flags),

	TP_ARGS(call_site, ptr, bytes_req, bytes_alloc, gfp_flags),

	TP_STRUCT__entry(
		__field(	unsigned long,	call_site	)
		__field(	const void *,	ptr		)
		__field(	size_t,		bytes_req	)
		__field(	size_t,		bytes_alloc	)
		__field(	gfp_t,		gfp_flags	)
	),

	TP_fast_assign(
		__entry->call_site	= call_site;
		__entry->ptr		= ptr;
		__entry->bytes_req	= bytes_req;
		__entry->bytes_alloc	= bytes_alloc;
		__entry->gfp_flags	= gfp_flags;
	),

	TP_printk("call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s",
		__entry->call_site,
		__entry->ptr,
		__entry->bytes_req,
		__entry->bytes_alloc,
		show_gfp_flags(__entry->gfp_flags))
);

DEFINE_EVENT(kmem_alloc, kmalloc,

	TP_PROTO(unsigned long call_site, const void *ptr,
		 size_t bytes_req, size_t bytes_alloc, gfp_t gfp_flags),

	TP_ARGS(call_site, ptr, bytes_req, bytes_alloc, gfp_flags)
);

Tired already? We’re almost there. I won’t get into the details of this (mostly because I don’t know), but note two crucial points: One is the TP_printk(), which matches the output format of the event’s output. The second part is the DEFINE_EVENT, which defines a function linked with the kmem_alloc class, named trace_kmalloc(). The define-macro magic takes place in include/linux/tracepoint.h, and essentially translates DEFINE_EVENT into a DECLARE_TRACE, which is then turned into a __DECLARE_TRACE, which begins with:

#define __DECLARE_TRACE(name, proto, args, cond, data_proto, data_args) \
	extern struct tracepoint __tracepoint_##name;			\
	static inline void trace_##name(proto)				\
	{								\
		if (static_key_false(&__tracepoint_##name.key))		\
			__DO_TRACE(&__tracepoint_##name,		\
				TP_PROTO(data_proto),			\
				TP_ARGS(data_args),			\
				TP_CONDITION(cond), 0);			\
		if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) {		\
			rcu_read_lock_sched_notrace();			\
			rcu_dereference_sched(__tracepoint_##name.funcs);\
			rcu_read_unlock_sched_notrace();		\
		}							\
	}								\
[ ... ]

And then it goes on with several related functions. The point with this long story, except for the obvious masochism, was to show that events really are just some kind of printk, only not all that easy to use. It also explains why that scary message is needed to get people off trace_printk().

But since there’s some code to copy from, it shouldn’t be all that bad to set up new events.

Add a Comment

required, use real name
required, will not be published
optional, your blog address