背景

参考
https://bpftrace.org/tutorial-one-liners

bpftrace 作为 eBPF 之上的脚本语言,灵感来自于 awk 和 c
编写 bpftrace 的效率比 开发 bpf 程序高很多,简单又方便

这篇文章包含了 bpftrace 的 12 个命令,可以作为学习 bpftrac 的基础
每个命令都是单行程序,简单明了

bpftrace 的各种 Probe 类型:
23.jpg

1. Listing Probes

命令

1
bpftrace -l 'tracepoint:syscalls:sys_enter_*'

打印结果:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
tracepoint:syscalls:sys_enter_accept
tracepoint:syscalls:sys_enter_accept4
tracepoint:syscalls:sys_enter_access
tracepoint:syscalls:sys_enter_acct
......
tracepoint:syscalls:sys_enter_open
tracepoint:syscalls:sys_enter_open_by_handle_at
tracepoint:syscalls:sys_enter_open_tree
tracepoint:syscalls:sys_enter_openat
tracepoint:syscalls:sys_enter_openat2
......
tracepoint:syscalls:sys_enter_wait4
tracepoint:syscalls:sys_enter_waitid
tracepoint:syscalls:sys_enter_write
tracepoint:syscalls:sys_enter_writev

解释

  • bpftrace -l 是搜索命令,查看有哪些跟踪点
  • tracepoint 就是一个 probe,可以被跟踪的:跟踪点
  • 支持 *, ? 通配符

还可以加上-v,打印出更多的参数

1
bpftrace -lv 'tracepoint:syscalls:sys_enter_open'

结果:

1
2
3
4
5
tracepoint:syscalls:sys_enter_open
    int __syscall_nr
    const char * filename
    int flags
    umode_t mode

再测试一下

1
bpftrace -e 'tracepoint:syscalls:sys_enter_open { printf("pid->%d, filename->%s \n", pid, str(args->filename)); }'

结果:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
Attaching 1 probe...
pid->46975, filename->/kafka/kafka-logs-abc0cd3cf0b2/recovery-point-offset-checkpoint
pid->46975, filename->/kafka/kafka-logs-abc0cd3cf0b2/log-start-offset-checkpoint.tmp
pid->46975, filename->/kafka/kafka-logs-abc0cd3cf0b2/cleaner-offset-checkpoint
pid->21568, filename->/etc/hosts
pid->46718, filename->/usr/share/elasticsearch/config
pid->46718, filename->/usr/share/elasticsearch/config/jvm.options.d
pid->46718, filename->/usr/share/elasticsearch/config
pid->46718, filename->/usr/share/elasticsearch/config/jvm.options.d
pid->46718, filename->/usr/share/elasticsearch/config
pid->46718, filename->/usr/share/elasticsearch/config/jvm.options.d

2. Hello World

命令

1
bpftrace -e 'BEGIN { printf("hello world\n"); }'

BEGIN 是一个特殊的 probe,类似 awk 的 BEGIN
对应的是 END

比如 bpftrace 官网的 openscoop.bt 程序中:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
BEGIN
{
        printf("Tracing open syscalls... Hit Ctrl-C to end.\n");
        printf("%-6s %-16s %4s %3s %s\n", "PID", "COMM", "FD", "ERR", "PATH");
}

.....

END
{
        clear(@filename);
}

3. File Opens

命令

1
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'

打印结果:

1
2
3
4
5
6
7
8
9
Attaching 1 probe...
redis-server /proc/1/stat
redis-server /proc/1/stat
redis-server /proc/1/stat
redis-server /proc/1/stat
redis-server /proc/1/stat
redis-server /proc/1/stat
redis-server /proc/1/stat
redis-server /proc/1/stat

解释

  • comm,是 btrace 内置的变量
  • args->filename 用于打印文件名
  • 主要要用 str() 做转换,str()也是内置函数

参数格式:

1
tracepoint:<subsystem>:<event>

解释

  • ​​tracepoint​​: 跟踪点类型 ​​- syscalls​​: 子系统
  • ​​sys_enter_openat​​: 特定的事件

子系统类型非常多

1
bpftrace -l 'tracepoint:*:*'  | awk -F: '{print $2}' | sort | uniq

结果:

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
alarmtimer
amd_cpu
avc
block
bpf_test_run
bpf_trace
bridge
btrfs
cgroup
clk
compaction
cpuhp
cros_ec
dev
devfreq
devlink
dma_fence
drm
error_report
exceptions
ext4
fib
fib6
filelock
filemap
fs_dax
gpio
huge_memory
hwmon
hyperv
i2c
initcall
intel_iommu
interconnect
iocost
iomap
iommu
io_uring
irq
irq_matrix
irq_vectors
jbd2
kmem
libata
mce
mdio
migrate
mmap
mmap_lock
mmc
module
mptcp
msr
napi
neigh
net
netlink
nmi
oom
osnoise
page_isolation
pagemap
page_pool
percpu
power
printk
pwm
qdisc
ras
raw_syscalls
rcu
regmap
regulator
resctrl
rpm
rseq
rtc
sched
scsi
signal
skb
smbus
sock
spi
sunrpc
swiotlb
sync_trace
syscalls
task
tcp
thermal
thermal_power_allocator
timer
tlb
tls
udp
vmscan
vsyscall
wbt
workqueue
writeback
x86_fpu
xdp
xen
xhci-hcd

主要包括:

  • Core System Subsystems​
  • Memory Management​
  • Storage & Filesystems
  • Networking
  • Security
  • Virtualization
  • Debugging & Tracing​

4. Syscall Counts By Process

命令

1
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @num[comm] = count(); }'

打印:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
@[SessionTracker]: 2
@[mysql-cj-abando]: 2
@[in:imuxsock]: 3
@[in:imtcp]: 3
@[http-nio-8080-P]: 4
@[ConnnectionExpi]: 5
@[rs:main Q:Reg]: 6
@[tini]: 8
@[GC Thread#12]: 9
@[GC Thread#15]: 9
@[GC Thread#7]: 9
。。。。。。
。。。。。。

解释

  • @num 是一个特殊的变量,可以跨多个事件共享,类似全局变量
  • [], 类似一个map 的 key,关联数组
  • count(),内置的函数,统计 key 被执行了多少次

类似的函数如下:

Name Description Sync/async
avg Calculate the running average of n between consecutive calls. Sync
clear Clear all keys/values from a map. Async
count Count how often this function is called. Sync
delete Delete a single key from a map. Sync
has_key Return true (1) if the key exists in this map. Otherwise return false (0). Sync
hist Create a log2 histogram of n using buckets per power of 2, 0 <= k <= 5, defaults to 0. Sync
len Return the number of elements in a map. Sync
lhist Create a linear histogram of n. lhist creates M ((max - min) / step) buckets in the range [min,max) where each bucket is step in size. Sync
max Update the map with n if n is bigger than the current value held. Sync
min Update the map with n if n is smaller than the current value held. Sync
stats Combines the count, avg and sum calls into one. Sync
sum Calculate the sum of all n passed. Sync
zero Set all values for all keys to zero. Async

5. Distribution of read() Bytes

命令

1
bpftrace -e 'tracepoint:syscalls:sys_exit_read /pid == 18644/ { @bytes = hist(args->ret); }'

结果:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
Attaching 1 probe...
^C

@bytes:
[0, 1]                12 |@@@@@@@@@@@@@@@@@@@@                                |
[2, 4)                18 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                     |
[4, 8)                 0 |                                                    |
[8, 16)                0 |                                                    |
[16, 32)               0 |                                                    |
[32, 64)              30 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128)             19 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                    |
[128, 256)             1 |@

解释

  • /pid == 18644/ ,表示只跟踪 18644 进程
  • args->ret,是返回值,如果出错则返回 -1,否则返回读取了多少字节
  • hist(),是按照 2 的幂,打印的柱状图

@num[comm] = count()

graph LR
    A[comm] -->|Key: Process name| B[Hash Map]
    B -->|Value| C[Counter]
    C -->|Increments| D[Per-key count]
    style A stroke:#3f3
    style C stroke:#09f

@bytes = hist(args->ret)

graph LR
    A[args.ret] -->|Input value| B[hist function]
    B -->|Bucket assignment| C[Histogram Map]
    C --> D[Logarithmic Buckets]
    style D stroke:#f90

6. Kernel Dynamic Tracing of read() Bytes

命令

1
bpftrace -e 'kretprobe:vfs_read { @bytes = lhist(retval, 0, 2000, 200); }'                             

打印结果:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
@bytes:
(..., 0)             269 |@@@@@@@@@@@@@@                                      |
[0, 200)             960 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[200, 400)           134 |@@@@@@@                                             |
[400, 600)            68 |@@@                                                 |
[600, 800)            58 |@@@                                                 |
[800, 1000)          108 |@@@@@                                               |
[1000, 1200)         416 |@@@@@@@@@@@@@@@@@@@@@@                              |
[1200, 1400)          36 |@                                                   |
[1400, 1600)           4 |                                                    |
[1600, 1800)           3 |                                                    |
[1800, 2000)           1 |                                                    |
[2000, ...)          276 |@@@@@@@@@@@@@@                                      |    

解释

  • lhist(),内置函数,是线性柱状图,根据 min、max、步长来打印
  • kretprobe 这是内核的动态追踪点,跟 tracepoint不同,它是静态的

kretprobe

  • 其功能非常强大
  • bpftrace -l ‘kprobe:*’ | wc -l
  • 打印出来有:55758 数量
  • 这些都是可以跟踪的函数
  • 但是这些可能不“稳定”,随着内核版本变化,函数名,参数都有可能变化
  • 如果想要获取参数,智能查对应版本的内核源码

7. Timing read()s

命令

1
2
3
4
5
6
7
8
bpftrace -e '
kprobe:vfs_read {
  @start[tid] = nsecs;
}
kretprobe:vfs_read /@start[tid]/ {
  @ns[comm] = hist(nsecs - @start[tid]);
 delete(@start[tid]);
}'

打印结果:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
@ns[http-nio-8080-e]:
[2K, 4K)               2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K)               0 |                                                    |
[8K, 16K)              1 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |

@ns[elasticsearch[b]:
[2K, 4K)               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K)               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)              1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@ns[cleanup]:
[4K, 8K)               3 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@ns[runc:[1:CHILD]]:
[64K, 128K)            4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

解释

  • 当触发 vfs_read 时,将 @start[tid] 作为 key,
  • 当退出 vfs_read时:kretprobe:vfs_read
  • 条件:/@start[tid]/ 确保我们只处理具有记录的开始时间的操作
  • 计算延迟: nsecs - @start[tid]
  • @ns[comm] 按照 进程 name 做聚合
  • delete(@start[tid]):删除临时的 timestamp

执行过程

sequenceDiagram
    participant Kprobe
    participant Kernel
    participant Kretprobe
    participant Map

    Note over Kernel: vfs_read called
    Kprobe->>Map: Record start time (nsecs) for TID
    Note over Kernel: File read operation executes
    Note over Kernel: vfs_read returns
    Kretprobe->>Map: Check if start time exists for TID
    Kretprobe->>Map: Calculate latency (now - start)
    Kretprobe->>Map: Store latency histogram by comm
    Kretprobe->>Map: Delete TID entry

8. Count Process-Level Events

命令

1
bpftrace -e 'tracepoint:sched:sched* { @[probe] = count(); } interval:s:5 { exit(); }'

打印结果:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
@[tracepoint:sched:sched_wakeup_new]: 1
@[tracepoint:sched:sched_process_fork]: 1
@[tracepoint:sched:sched_process_exec]: 1
@[tracepoint:sched:sched_process_exit]: 1
@[tracepoint:sched:sched_process_free]: 2
@[tracepoint:sched:sched_process_wait]: 7
@[tracepoint:sched:sched_wake_idle_without_ipi]: 53
@[tracepoint:sched:sched_stat_runtime]: 212
@[tracepoint:sched:sched_wakeup]: 253
@[tracepoint:sched:sched_waking]: 253
@[tracepoint:sched:sched_switch]: 510

解释

  • 跟踪的是 sche 类型事件
  • 比如 fork,exec,上下文切换等
  • interval:s:5, 每 5秒触发一次
  • exit(),表示退出,配合 interval 表示跟踪 5秒后退出

9. Profile On-CPU Kernel Stacks

命令

1
bpftrace -e 'profile:hz:99 { @[kstack] = count(); }'

打印:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
@[
    apparmor_file_alloc_security+101
    security_file_alloc+78
    __alloc_file+89
    alloc_empty_file+69
    path_openat+71
    do_filp_open+178
    do_sys_openat2+159
    __x64_sys_openat+85
    x64_sys_call+6741
    do_syscall_64+86
    entry_SYSCALL_64_after_hwframe+108
]: 1
......
......

解释

  • profile:hz:99,cpu在 99 HZ 的时候触发
  • 不是 100HZ 也不是 1000HZ,因为太高太低都不好,100足够了
  • 因为 100可能会跟其他定时任务冲突,所以用 99
  • kstack 打印内核堆栈
  • @[kstack] 相当于 map 的key

也可以打印具体的 pid 和 comm

1
bpftrace -e 'profile:hz:999 { printf("pid->%d, comm->%s\n", pid, comm);  }'

10. Scheduler Tracing

命令

1
bpftrace -e 'tracepoint:sched:sched_switch { @[kstack] = count(); }'

打印:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
@[
    __traceiter_sched_switch+63
    __traceiter_sched_switch+63
    __schedule+855
    __cond_resched+57
    mutex_lock+19
    unix_stream_read_generic+231
    unix_stream_recvmsg+140
    sock_recvmsg+120
    sock_read_iter+143
    new_sync_read+385
    vfs_read+262
    ksys_read+181
    __x64_sys_read+25
    x64_sys_call+7610
    do_syscall_64+86
    entry_SYSCALL_64_after_hwframe+108
]: 1

......
......

解释

  • sched,用来跟踪内核中的各种 cpu 调度事件
  • 比如:sched_switch, sched_wakeup, sched_migrate_task 等等
  • sched_switch:当前线程离开cpu 时候触发,比如发生了阻塞,等待I/O,定时,paging/swapping,或者 lock
  • kstack 是内置变量,打印内核堆栈,通过线程离开,可以打印出堆栈上下文信息

另一个命令:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
bpftrace -e 'tracepoint:sched:sched_process_exec {
   printf("filename -> %s, pid -> %d\n", str(args->filename), args->pid);
}'

Attaching 1 probe...
filename -> /usr/bin/mysqladmin, pid -> 25650
filename -> /usr/bin/runc, pid -> 25656
filename -> /proc/self/exe, pid -> 25663
filename -> /dev/fd/5, pid -> 25663
filename -> /usr/sbin/dumpe2fs, pid -> 25665
filename -> /usr/sbin/dumpe2fs, pid -> 25673
filename -> /bin/sh, pid -> 25667
filename -> /usr/bin/nc, pid -> 25674

所有的 sched 事件:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
bpftrace -l 'tracepoint:sched:*'

tracepoint:sched:sched_kthread_stop
tracepoint:sched:sched_kthread_stop_ret
tracepoint:sched:sched_kthread_work_execute_end
tracepoint:sched:sched_kthread_work_execute_start
tracepoint:sched:sched_kthread_work_queue_work
tracepoint:sched:sched_migrate_task
tracepoint:sched:sched_move_numa
tracepoint:sched:sched_pi_setprio
tracepoint:sched:sched_process_exec
tracepoint:sched:sched_process_exit
tracepoint:sched:sched_process_fork
tracepoint:sched:sched_process_free
tracepoint:sched:sched_process_hang
tracepoint:sched:sched_process_wait
tracepoint:sched:sched_stat_blocked
tracepoint:sched:sched_stat_iowait
tracepoint:sched:sched_stat_runtime
tracepoint:sched:sched_stat_sleep
tracepoint:sched:sched_stat_wait
tracepoint:sched:sched_stick_numa
tracepoint:sched:sched_swap_numa
tracepoint:sched:sched_switch
tracepoint:sched:sched_wait_task
tracepoint:sched:sched_wake_idle_without_ipi
tracepoint:sched:sched_wakeup
tracepoint:sched:sched_wakeup_new
tracepoint:sched:sched_waking

11. Block I/O Tracing

命令

1
bpftrace -e 'tracepoint:block:block_rq_issue { @res = hist(args->bytes); }'

打印:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
@res:
[0]                   21 |@@@@@@@@@@@@@@@@@                                   |
[1]                    0 |                                                    |
[2, 4)                 0 |                                                    |
[4, 8)                 0 |                                                    |
[8, 16)                1 |                                                    |
[16, 32)               0 |                                                    |
[32, 64)               0 |                                                    |
[64, 128)              0 |                                                    |
[128, 256)             0 |                                                    |
[256, 512)             0 |                                                    |
[512, 1K)              0 |                                                    |
[1K, 2K)               0 |                                                    |
[2K, 4K)               0 |                                                    |
[4K, 8K)              64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)             11 |@@@@@@@@                                            |
[16K, 32K)             2 |@                                                   |
[32K, 64K)             5 |@@@@                                                |
[64K, 128K)            5 |@@@@                                                |

解释

  • tracepoint:block,用来跟踪各种 存储类的事件
  • block_rq_issue,当向设备发出 I/O 时触发
  • args.bytes,发送的字节数量
  • hist,以 2次幂,打印直方图

完整参数

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
bpftrace -lv 'tracepoint:block:block_rq_issue'

tracepoint:block:block_rq_issue
    dev_t dev
    sector_t sector
    unsigned int nr_sector
    unsigned int bytes
    char rwbs[8]
    char comm[16]
    __data_loc char[] cmd

所有 block 相关的事件:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
bpftrace -l 'tracepoint:block:*'
tracepoint:block:block_bio_backmerge
tracepoint:block:block_bio_bounce
tracepoint:block:block_bio_complete
tracepoint:block:block_bio_frontmerge
tracepoint:block:block_bio_queue
tracepoint:block:block_bio_remap
tracepoint:block:block_dirty_buffer
tracepoint:block:block_getrq
tracepoint:block:block_plug
tracepoint:block:block_rq_complete
tracepoint:block:block_rq_insert
tracepoint:block:block_rq_issue
tracepoint:block:block_rq_merge
tracepoint:block:block_rq_remap
tracepoint:block:block_rq_requeue
tracepoint:block:block_split
tracepoint:block:block_touch_buffer
tracepoint:block:block_unplug

12. Kernel Struct Tracing

命令

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
cat path.bt
#ifndef BPFTRACE_HAVE_BTF
#include <linux/path.h>
#include <linux/dcache.h>
#endif

kprobe:vfs_open
{
	printf("open path: %s\n", str(((struct path *)arg0)->dentry->d_name.name));
}

打印结果:

1
2
3
4
5
6
7
bpftrace path.bt

Attaching 1 probe...
open path: dev
open path: if_inet6
open path: retrans_time_ms
[...]

解释

  • arg0 是内置的变量表示第一个参数,以此类推arg1, arg2…
  • ((struct path *)arg0)->dentry->d_name.name
  • 将第一个参数 cast 成:struct path * 类型,再解引用:dentry
  • 因为 引用了内核的结构体,所以这里需要加入 #include

trace: vfs_read

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
#ifndef BPFTRACE_HAVE_BTF
#include <linux/path.h>
#include <linux/fs.h>
#include <linux/dcache.h>
#endif

kprobe:vfs_read /pid == 46711/
{
    $file = (struct file*)arg0;
    $path = $file->f_path;
    $name_ptr = $path.dentry->d_name.name;
    printf("open path: %s\n", str($name_ptr));
}

参考