Opening the perf file descriptor

A couple of months back I recorded this command on my blog while investigating perf:

perf record --branch-filter any,save_type,u true 

What is the interface between the perf binary and the Linux kernel when I run this? There is a system call, perf_event_open, that opens a file descriptor. The man page gives this signature:

int syscall(SYS_perf_event_open, struct perf_event_attr *attr,
                    pid_t pid, int cpu, int group_fd, 
                    unsigned long flags);
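
To make that signature concrete, here is a minimal sketch of the same call made from Python with ctypes. It is not how the perf binary does it. The structure covers only the first published layout of perf_event_attr (64 bytes), the syscall numbers are an assumption that holds for x86_64 and aarch64 only, and the helper names are my own.

```python
import ctypes
import os
import platform

# Constants as defined in <linux/perf_event.h>.
PERF_TYPE_SOFTWARE = 1
PERF_COUNT_SW_CPU_CLOCK = 0
PERF_FLAG_FD_CLOEXEC = 1 << 3
ATTR_EXCLUDE_KERNEL = 1 << 5  # position of exclude_kernel in the flag bitfield

# Assumption: only these two architectures are covered.
SYS_PERF_EVENT_OPEN = {"x86_64": 298, "aarch64": 241}


class PerfEventAttr(ctypes.Structure):
    """First published layout of struct perf_event_attr (64 bytes)."""
    _fields_ = [
        ("type", ctypes.c_uint32),
        ("size", ctypes.c_uint32),
        ("config", ctypes.c_uint64),
        ("sample_period", ctypes.c_uint64),  # union with sample_freq
        ("sample_type", ctypes.c_uint64),
        ("read_format", ctypes.c_uint64),
        ("flags", ctypes.c_uint64),          # disabled, inherit, ... as bits
        ("wakeup_events", ctypes.c_uint32),  # union with wakeup_watermark
        ("bp_type", ctypes.c_uint32),
        ("config1", ctypes.c_uint64),        # union with bp_addr
    ]


def cpu_clock_attr():
    """Build an attr for the software CPU clock, user space only."""
    attr = PerfEventAttr()
    attr.type = PERF_TYPE_SOFTWARE
    attr.size = ctypes.sizeof(PerfEventAttr)
    attr.config = PERF_COUNT_SW_CPU_CLOCK
    attr.flags = ATTR_EXCLUDE_KERNEL
    return attr


def perf_event_open(attr, pid, cpu, group_fd, flags):
    """Invoke the raw system call through libc's syscall() wrapper."""
    nr = SYS_PERF_EVENT_OPEN[platform.machine()]
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    fd = libc.syscall(
        ctypes.c_long(nr), ctypes.byref(attr),
        ctypes.c_int(pid), ctypes.c_int(cpu),
        ctypes.c_int(group_fd), ctypes.c_ulong(flags),
    )
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return fd
```

A call such as perf_event_open(cpu_clock_attr(), 0, -1, -1, PERF_FLAG_FD_CLOEXEC) asks for the calling process on any CPU. Whether it succeeds depends on the kernel's perf permission settings, so treat a failure as expected on a locked-down machine.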


But how does that get called by the perf binary? The answer is trickier than I originally thought.

When debugging system calls, the primary tool I use is strace. I ran it like this:

sudo strace  perf record --branch-filter any,save_type,u true  2>&1  | less

Inside of less, I can search for the string “event_open” and I am taken to a spot in the output where the system call is made…11 times. Each one opens a file handle, and all of them are kept open after the system call. Here is the first call:


perf_event_open({
                   type=PERF_TYPE_SOFTWARE, 
                   size=0 /* PERF_ATTR_SIZE_??? */,
                   config=PERF_COUNT_SW_CPU_CLOCK,
                   sample_period=0,
                   sample_type=0,
                   read_format=0,
                   exclude_kernel=1,
                   precise_ip=0 /* arbitrary skid */, ...},
                 -1,
                  2,
                 -1,
                  PERF_FLAG_FD_CLOEXEC) = 4 

And here is the last one:


perf_event_open({
                   type=PERF_TYPE_HARDWARE, 
                   size=PERF_ATTR_SIZE_VER7,
                   config=PERF_COUNT_HW_CPU_CYCLES,
                   sample_freq=4000,
                   sample_type=PERF_SAMPLE_IP|PERF_SAMPLE_TID|PERF_SAMPLE_TIME|PERF_SAMPLE_ID|PERF_SAMPLE_PERIOD|PERF_SAMPLE_BRANCH_STACK,
                   read_format=PERF_FORMAT_ID, 
                   disabled=1, 
                   inherit=1, 
                   mmap=1, 
                   comm=1, 
                   freq=1, 
                   enable_on_exec=1, 
                   task=1,
                   precise_ip=3 /* must have 0 skid */,
                   sample_id_all=1,
                   exclude_guest=1,
                   mmap2=1,
                   comm_exec=1,
                   ksymbol=1,
                   bpf_event=1,
                ...},
                11127,
                7,
               -1,
                PERF_FLAG_FD_CLOEXEC) = 12 

I’ve formatted this to make the relationship of the parameters a little clearer. Most significant is that the first parameter is a large structure, which strace knows how to interpret.
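
The line breaks in the calls above are mine. In the raw strace output each completed call sits on a single line, so the calls and the descriptors they return can be tallied with a short script instead of searching in less. This is a rough sketch: the function name is my own, and the sample lines are abbreviated stand-ins for the real output.

```python
import re

# Matches a completed call on one line, for example:
#   perf_event_open({...}, -1, 2, -1, PERF_FLAG_FD_CLOEXEC) = 4
# A failed call yields -1, which is captured as well.
CALL_RE = re.compile(r"perf_event_open\(.*\)\s*=\s*(-?\d+)")


def perf_fds(strace_lines):
    """Return the value returned by each perf_event_open call, in order."""
    fds = []
    for line in strace_lines:
        match = CALL_RE.search(line)
        if match:
            fds.append(int(match.group(1)))
    return fds


# Abbreviated stand-ins for two lines of strace output.
sample = [
    "perf_event_open({type=PERF_TYPE_SOFTWARE, ...}, -1, 2, -1, PERF_FLAG_FD_CLOEXEC) = 4",
    "ioctl(13, PERF_EVENT_IOC_ENABLE, 0)     = 0",
    "perf_event_open({type=PERF_TYPE_HARDWARE, ...}, 11127, 7, -1, PERF_FLAG_FD_CLOEXEC) = 12",
]
print(perf_fds(sample))
```

If the trace is saved with strace -o to a file, passing the open file object to perf_fds works the same way, since it only iterates over lines.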

If we keep looking down the output of strace, we can see another block of perf_event_open calls. When all of these are executed we end up with file descriptors 4-12 and 13-20 open for the remainder of the program as products of the perf_event_open system call. These are never read from, so the information must come from the kernel by some other means. Again, we can see a block of related system calls, this time ioctls:


ioctl(13, PERF_EVENT_IOC_ENABLE, 0)     = 0
ioctl(14, PERF_EVENT_IOC_ENABLE, 0)     = 0
ioctl(15, PERF_EVENT_IOC_ENABLE, 0)     = 0
ioctl(16, PERF_EVENT_IOC_ENABLE, 0)     = 0
ioctl(17, PERF_EVENT_IOC_ENABLE, 0)     = 0
ioctl(18, PERF_EVENT_IOC_ENABLE, 0)     = 0
ioctl(19, PERF_EVENT_IOC_ENABLE, 0)     = 0
ioctl(20, PERF_EVENT_IOC_ENABLE, 0)     = 0
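
Each of those lines is an ioctl with no argument data, so the request number is just a type character and a sequence number packed together. The sketch below rebuilds the numbers the way the kernel's _IO() macro does on x86 and ARM (other architectures encode the direction bits differently, so take that as an assumption), and shows what the enable loop might look like. The enable_all helper is hypothetical.

```python
import fcntl


def _IO(type_char, nr):
    """Encode an argument-less ioctl request as the _IO() macro does on x86/ARM."""
    return (ord(type_char) << 8) | nr


# perf uses '$' as its ioctl type character.
PERF_EVENT_IOC_ENABLE = _IO("$", 0)
PERF_EVENT_IOC_DISABLE = _IO("$", 1)
PERF_EVENT_IOC_RESET = _IO("$", 3)


def enable_all(fds):
    """Enable every perf event descriptor in fds, one ioctl per descriptor."""
    for fd in fds:
        fcntl.ioctl(fd, PERF_EVENT_IOC_ENABLE, 0)


print(hex(PERF_EVENT_IOC_ENABLE))
```

With real perf descriptors, enable_all(range(13, 21)) would issue the same eight calls that strace shows.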

But we do see a clone system call, which implies a child process or thread; the CLONE_THREAD flag in its arguments indicates a thread.

clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, child_tid=0x7faf329ef910, parent_tid=0x7faf329ef910, exit_signal=0, stack=0x7faf321ef000, stack_size=0x7ffec0, tls=0x7faf329ef640} => {parent_tid=[11539]}, 88) = 11539

If we look for system calls made by this pid (11539), we later see:

poll([{fd=13, events=POLLIN|POLLERR|POLLHUP}, {fd=14, events=POLLIN|POLLERR|POLLHUP}, {fd=15, events=POLLIN|POLLERR|POLLHUP}, {fd=16, events=POLLIN|POLLERR|POLLHUP}, {fd=17, events=POLLIN|POLLERR|POLLHUP}, {fd=18, events=POLLIN|POLLERR|POLLHUP}, {fd=19, events=POLLIN|POLLERR|POLLHUP}, {fd=20, events=POLLIN|POLLERR|POLLHUP}], 8, 1000 <unfinished ...>

However, the only time this pid is referenced again is to clean up, in this block of syscalls:


[pid 11539] <... poll resumed>)         = 0 (Timeout)
[pid 11539] rt_sigprocmask(SIG_BLOCK, ~[RT_1], NULL, 8) = 0
[pid 11539] madvise(0x7faf321ef000, 8368128, MADV_DONTNEED) = 0
[pid 11539] exit(0)                     = ?
[pid 11539] +++ exited with 0 +++
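
The poll call made by that thread, and its timeout result, can be reproduced in miniature. This sketch registers descriptors with the same event mask and waits; an idle pipe stands in for the perf descriptors, which is an assumption made so that the example runs without any perf permissions. The function name is my own.

```python
import os
import select


def wait_for_events(fds, timeout_ms=1000):
    """Poll fds with the mask seen in the trace; return the ready (fd, event) pairs."""
    poller = select.poll()
    for fd in fds:
        poller.register(fd, select.POLLIN | select.POLLERR | select.POLLHUP)
    return poller.poll(timeout_ms)


# Stand-in descriptor: the read end of a pipe that nobody writes to.
read_end, write_end = os.pipe()
ready = wait_for_events([read_end], timeout_ms=50)
print(ready)  # empty list: the wait timed out, as in "= 0 (Timeout)"
os.close(read_end)
os.close(write_end)
```

An empty result corresponds to the 0 that strace reports: no descriptor became ready before the timeout expired.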
