A couple months back I recorded this line on my blog as part of investigating perf:
perf record --branch-filter any,save_type,u true
What is the interface between the perf binary and the Linux kernel when I run this? There is a system call to open a file handle. The man page says this:
int syscall(SYS_perf_event_open, struct perf_event_attr *attr,
pid_t pid, int cpu, int group_fd,
unsigned long flags);
But how does that get called by the perf binary? The answer is trickier than I originally thought.
When debugging system calls, the primary tool I use is strace. I ran it like this:
sudo strace perf record --branch-filter any,save_type,u true 2>&1 | less
Inside of less, I can search for the string “event_open” and I am taken to a spot in the output where the system call is made…11 times. Each one opens a file handle, and all of them are kept open after the system call. Here is the first call:
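Searching in less works, but counting the calls by eye is error-prone. Here is a minimal sketch of doing the same thing in Python, assuming the strace output has been saved as text with one call per line (the way strace prints it before any manual reformatting). The function name and the two-line sample are made up for illustration; the sample is abbreviated from the calls shown in this post.

```python
import re

# Matches a single-line strace record such as:
#   perf_event_open({...}, -1, 2, -1, PERF_FLAG_FD_CLOEXEC) = 4
# An optional "[pid N] " prefix (as printed for other threads) is tolerated.
CALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?perf_event_open\(.*\)\s+=\s+(-?\d+)")


def perf_event_open_results(strace_text):
    """Return the return value of every perf_event_open call, in order."""
    results = []
    for line in strace_text.splitlines():
        match = CALL_RE.match(line)
        if match:
            results.append(int(match.group(1)))
    return results


# Hypothetical excerpt, abbreviated from the output discussed in this post.
sample = """\
perf_event_open({type=PERF_TYPE_SOFTWARE, size=0, ...}, -1, 2, -1, PERF_FLAG_FD_CLOEXEC) = 4
ioctl(13, PERF_EVENT_IOC_ENABLE, 0) = 0
perf_event_open({type=PERF_TYPE_HARDWARE, size=PERF_ATTR_SIZE_VER7, ...}, 11127, 7, -1, PERF_FLAG_FD_CLOEXEC) = 12
"""

fds = perf_event_open_results(sample)
print(len(fds), "calls, returned:", fds)
```

A return value of -1 would show up in the list too, which makes it easy to spot calls that failed rather than produced a descriptor.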
perf_event_open({
type=PERF_TYPE_SOFTWARE,
size=0 /* PERF_ATTR_SIZE_??? */,
config=PERF_COUNT_SW_CPU_CLOCK,
sample_period=0,
sample_type=0,
read_format=0,
exclude_kernel=1,
precise_ip=0 /* arbitrary skid */, ...},
-1,
2,
-1,
PERF_FLAG_FD_CLOEXEC) = 4
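The two values after the structure are the pid and cpu arguments from the prototype, and the perf_event_open(2) man page describes what each combination selects. As a rough sketch of those rules (the helper name is my own, and values below -1 are simply rejected here rather than interpreted):

```python
def describe_target(pid, cpu):
    """Summarize what a (pid, cpu) pair selects, following perf_event_open(2)."""
    if pid < -1 or cpu < -1:
        raise ValueError("this sketch only handles values >= -1")
    if pid == -1 and cpu == -1:
        return "invalid combination"
    if pid == -1:
        who = "all processes and threads"
    elif pid == 0:
        who = "the calling process or thread"
    else:
        who = f"process or thread {pid}"
    where = "on any CPU" if cpu == -1 else f"only while running on CPU {cpu}"
    return f"{who}, {where}"


# The pid and cpu arguments of the call shown above.
print(describe_target(-1, 2))
```

So the call above, with pid -1 and cpu 2, is a per-CPU event rather than one tied to a particular process.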
And here is the last one:
perf_event_open({
type=PERF_TYPE_HARDWARE,
size=PERF_ATTR_SIZE_VER7,
config=PERF_COUNT_HW_CPU_CYCLES,
sample_freq=4000,
sample_type=PERF_SAMPLE_IP|PERF_SAMPLE_TID|PERF_SAMPLE_TIME|PERF_SAMPLE_ID|PERF_SAMPLE_PERIOD|PERF_SAMPLE_BRANCH_STACK,
read_format=PERF_FORMAT_ID,
disabled=1,
inherit=1,
mmap=1,
comm=1,
freq=1,
enable_on_exec=1,
task=1,
precise_ip=3 /* must have 0 skid */,
sample_id_all=1,
exclude_guest=1,
mmap2=1,
comm_exec=1,
ksymbol=1,
bpf_event=1,
...},
11127,
7,
-1,
PERF_FLAG_FD_CLOEXEC) = 12
I’ve formatted this to make the relationship of the parameters a little clearer. Most significant is that the first parameter is a large structure, which strace knows how to interpret.
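What strace is doing for a field like sample_type is turning a single integer bitmask back into names. A small sketch of the reverse direction, using the bit positions from enum perf_event_sample_format in linux/perf_event.h (only the first twelve bits are listed here; the class and helper names are mine):

```python
from enum import IntFlag


class PerfSampleType(IntFlag):
    """First twelve PERF_SAMPLE_* bits from linux/perf_event.h."""
    IP = 1 << 0
    TID = 1 << 1
    TIME = 1 << 2
    ADDR = 1 << 3
    READ = 1 << 4
    CALLCHAIN = 1 << 5
    ID = 1 << 6
    CPU = 1 << 7
    PERIOD = 1 << 8
    STREAM_ID = 1 << 9
    RAW = 1 << 10
    BRANCH_STACK = 1 << 11


def parse_sample_type(text):
    """Turn 'PERF_SAMPLE_IP|PERF_SAMPLE_TID' into a PerfSampleType value."""
    value = PerfSampleType(0)
    for name in text.split("|"):
        short = name.strip().replace("PERF_SAMPLE_", "", 1)
        value |= PerfSampleType[short]
    return value


# The sample_type string from the last perf_event_open call above.
last_call = parse_sample_type(
    "PERF_SAMPLE_IP|PERF_SAMPLE_TID|PERF_SAMPLE_TIME|"
    "PERF_SAMPLE_ID|PERF_SAMPLE_PERIOD|PERF_SAMPLE_BRANCH_STACK"
)
print(hex(int(last_call)))  # 0x947
```

The PERF_SAMPLE_BRANCH_STACK bit is the visible trace of the --branch-filter option in that last call.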
If we keep looking down the output of strace, we can see another block of perf_event_open calls. When all of these have executed, file descriptors 4-12 and 13-20 are open for the remainder of the program, all returned by the perf_event_open system call. These are never read from, so the information must come from the kernel via another means. Again, we can see a block of related system calls, this time ioctls:
ioctl(13, PERF_EVENT_IOC_ENABLE, 0) = 0
ioctl(14, PERF_EVENT_IOC_ENABLE, 0) = 0
ioctl(15, PERF_EVENT_IOC_ENABLE, 0) = 0
ioctl(16, PERF_EVENT_IOC_ENABLE, 0) = 0
ioctl(17, PERF_EVENT_IOC_ENABLE, 0) = 0
ioctl(18, PERF_EVENT_IOC_ENABLE, 0) = 0
ioctl(19, PERF_EVENT_IOC_ENABLE, 0) = 0
ioctl(20, PERF_EVENT_IOC_ENABLE, 0) = 0
But we do see a clone system call, which implies a child process or thread. The CLONE_THREAD flag tells us it is a thread.
clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, child_tid=0x7faf329ef910, parent_tid=0x7faf329ef910, exit_signal=0, stack=0x7faf321ef000, stack_size=0x7ffec0, tls=0x7faf329ef640} => {parent_tid=[11539]}, 88) = 11539
If we look for system calls based on this pid (11539) we later see:
poll([{fd=13, events=POLLIN|POLLERR|POLLHUP}, {fd=14, events=POLLIN|POLLERR|POLLHUP}, {fd=15, events=POLLIN|POLLERR|POLLHUP}, {fd=16, events=POLLIN|POLLERR|POLLHUP}, {fd=17, events=POLLIN|POLLERR|POLLHUP}, {fd=18, events=POLLIN|POLLERR|POLLHUP}, {fd=19, events=POLLIN|POLLERR|POLLHUP}, {fd=20, events=POLLIN|POLLERR|POLLHUP}], 8, 1000 <unfinished ...>
However, the only time this pid is referenced again is to clean up, in this block of syscalls:
[pid 11539] <... poll resumed>) = 0 (Timeout)
[pid 11539] rt_sigprocmask(SIG_BLOCK, ~[RT_1], NULL, 8) = 0
[pid 11539] madvise(0x7faf321ef000, 8368128, MADV_DONTNEED) = 0
[pid 11539] exit(0) = ?
[pid 11539] +++ exited with 0 +++
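The poll in that thread waited on the eight descriptors with a 1000 ms timeout and returned 0, meaning none of them became ready in that window. The same behaviour can be sketched in Python with select.poll. This is only an illustration: a pipe stands in for the perf event descriptors, the helper name is made up, and the timeout is shortened to 50 ms so the demo finishes quickly.

```python
import os
import select

# The same event mask that appears in the strace output.
EVENTS = select.POLLIN | select.POLLERR | select.POLLHUP


def wait_for_data(fds, timeout_ms):
    """Poll the descriptors once; return the (fd, revents) pairs that are ready."""
    poller = select.poll()
    for fd in fds:
        poller.register(fd, EVENTS)
    return poller.poll(timeout_ms)


# A pipe stands in for a perf event descriptor in this sketch.
read_end, write_end = os.pipe()
try:
    idle = wait_for_data([read_end], 50)   # nothing written yet: times out
    os.write(write_end, b"x")
    ready = wait_for_data([read_end], 50)  # data pending: returns at once
finally:
    os.close(read_end)
    os.close(write_end)

print("idle:", idle)
print("ready:", ready)
```

An empty list from poll here corresponds to the "= 0 (Timeout)" that strace printed for pid 11539.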