Background

Networking

Network eBPF programs fall into the following categories:

  • XDP programs
  • TC programs
  • Socket programs
  • cgroup programs

Program Comparison

XDP
  • Trigger: the NIC driver receives a packet
  • Where it runs: kernel network driver layer; can be offloaded to NIC hardware
  • Common scenarios: firewalls, DDoS protection, layer-4 load balancing, packet filtering
  • Subtypes: N/A
  • Ratings: packet access ★★★ (headers only), logic support ★★

TC (Traffic Control)
  • Trigger: NIC queues send or receive data
  • Where it runs: kernel protocol stack (network layer)
  • Common scenarios: traffic shaping (QoS), bandwidth management, complex packet processing, network policy enforcement
  • Subtypes: N/A
  • Ratings: packet access ★★★★, connection access ★★, logic support ★★★

Socket
  • Trigger: socket operations (creation, modification, data transfer)
  • Where it runs: kernel protocol stack (transport layer)
  • Common scenarios: socket filtering, connection monitoring, socket redirection, protocol analysis
  • Subtypes: BPF_PROG_TYPE_SOCK_OPS, BPF_PROG_TYPE_SK_SKB, BPF_PROG_TYPE_SK_MSG
  • Ratings: packet access ★★★★, connection access ★★★★, logic support ★★★★

cgroup
  • Trigger: socket operations within a cgroup (creation, option changes)
  • Where it runs: kernel cgroup control points
  • Common scenarios: container network policies, per-process firewalls, connection rate limiting
  • Subtypes: BPF_PROG_TYPE_CGROUP_SOCK, BPF_PROG_TYPE_CGROUP_SOCKOPT
  • Ratings: packet access ★★, connection access ★★★★, logic support ★★★★

Processing Flow Sequence

graph LR
    XDP[XDP] --> TC[TC]
    TC --> Socket[Socket Program]
    Socket --> cgroup[cgroup Program]

Capability Rating System

  • Packet Access (ability to inspect packet contents): ★★ = limited (cgroup), ★★★★ = full access (Socket/TC)
  • Connection Access (access to connection state/metadata): ★★ = basic awareness (TC), ★★★★ = full control (Socket/cgroup)
  • Logic Support (support for complex programming constructs): ★★ = simple conditionals (XDP), ★★★★ = full BPF features (Socket/cgroup)

Processing Latency Comparison

   XDP (10-100ns) < TC (1-10μs) < Socket (10-100μs) < cgroup (100μs-1ms)

Hardware Acceleration:

  • Only XDP supports full NIC offload
  • TC has limited hardware acceleration (e.g., Intel QAT)
  • Socket/cgroup run exclusively in software

Cloud Native Usage:

  • Cilium/Linkerd: Socket + cgroup
  • Facebook Katran: XDP
  • AWS VPC: TC

Performance Optimization Guidelines

flowchart TD
    A[Network Requirement] -->|Line-rate filtering| B[XDP]
    A -->|Complex QoS| C[TC]
    A -->|Connection optimization| D[Socket]
    A -->|Container networking| E[cgroup]
    
    style B stroke:#ff5555,stroke-width:3px
    style C stroke:#5555ff,stroke-width:3px
    style D stroke:#55ff55,stroke-width:3px
    style E stroke:#ffaa00,stroke-width:3px

Why Maps Exist

  • BPF programs are stateless: maps provide persistent storage
  • Kernel/user-space separation: maps act as a shared-memory bridge
  • Inter-program communication: global data exchange via maps
  • Atomic operations: BPF_ATOMIC_ADD, BPF_XCHG, etc.
  • Event streaming: perf_event and ringbuf maps

Map Types for Different Use Cases

  • BPF_MAP_TYPE_HASH: sharing metrics between programs
  • BPF_MAP_TYPE_PERF_EVENT_ARRAY: sending events to user space
  • BPF_MAP_TYPE_RINGBUF: high-throughput event streaming
  • BPF_MAP_TYPE_PROG_ARRAY: tail calls between BPF programs
  • BPF_MAP_TYPE_SOCKHASH: socket redirection

How the Different Hook Points Interact

graph LR
    %% First Row: Network Interface to Kernel Stack
    A[Network Interface] -->|XDP| B[Layer 2-3 Processing]
    B -->|Pass| C[Kernel Stack]
    
    %% Second Row: TC Processing
    C -->|TC Ingress| D[Layer 4 Filtering]
    
    %% Third Row: Container Policy
    D -->|cgroup/skb| E[Container Policy]
    
    %% Fourth Row: Application Analysis
    E -->|sock_ops| F[Layer 5-7 Analysis]
    
    %% Fifth Row: Security Enforcement
    F -->|LSM BPF| G[Security Enforcement]
    G --> H{{Applications}}
    
    %% Styling
    style B fill:#9f0,stroke:#333
    style D fill:#f90,stroke:#333
    style F fill:#09f,stroke:#333
    style G fill:#f44,stroke:#333

Building a Load Balancer

Set up an nginx-based load-balancing environment:

docker run -itd --name=http1 --hostname=http1 feisky/webserver
docker run -itd --name=http2 --hostname=http2 feisky/webserver

# Client
docker run -itd --name=client alpine

# Nginx
docker run -itd --name=nginx nginx

Query the containers' IP addresses:

IP1=$(docker inspect http1 -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}')
IP2=$(docker inspect http2 -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}')
echo "Webserver1's IP: $IP1"
echo "Webserver2's IP: $IP2"

Update the nginx configuration:

# Generate nginx.conf
cat>nginx.conf <<EOF
user  nginx;
worker_processes  auto;

error_log  /var/log/nginx/error.log notice;
pid        /var/run/nginx.pid;

events {
    worker_connections  1024;
}

http {
   include       /etc/nginx/mime.types;
   default_type  application/octet-stream;

    upstream webservers {
        server $IP1;
        server $IP2;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://webservers;
        }
    }
}
EOF

# Update the Nginx configuration
docker cp nginx.conf nginx:/etc/nginx/nginx.conf
docker exec nginx nginx -s reload

Verify that load balancing works:

# Get the Nginx container's IP (here it is 172.17.0.5)
docker inspect nginx -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}'

# Open a shell in the client container, install curl, then access Nginx
docker exec -it client sh

# (the following commands run inside the client container)
/ # apk add curl wrk --update
/ # curl "http://172.17.0.5"

# Repeated requests return http1 and http2 in turn

Load test:

/ # apk add wrk --update
/ # wrk -c100 "http://172.17.0.5"

Network Troubleshooting

eBPF provides rich networking functionality (filtering, capture, redirection and more) across the entire network protocol stack.

bpftrace built-in functions

Print the kernel execution stack at these tracepoints:

sudo bpftrace -e 'tracepoint:skb:kfree_skb /comm=="curl"/ { 
    printf("Location: %s\n", kstack); 
}'

sudo bpftrace -e 'tracepoint:skb:consume_skb /comm=="curl"/ { 
    printf("Location: %s\n", kstack); 
}'

Output when curl-ing a website:

Location:
        sk_skb_reason_drop+173
        sk_skb_reason_drop+173
        unix_stream_connect+674
        __sys_connect_file+107
        __sys_connect+181
        __x64_sys_connect+24
        x64_sys_call+8640
        do_syscall_64+126
        entry_SYSCALL_64_after_hwframe+118

...

kfree_skb and consume_skb are the two kernel functions that free an skb.

The complete program:

#!/usr/bin/env bpftrace
/* Robust TCP drop monitoring for curl process */

BEGIN
{
    printf("Monitoring TCP drops for curl process...\n");
}

// Primary tracepoint for packet drops
tracepoint:skb:kfree_skb /comm == "curl"/
{
    $skb = (struct sk_buff *)args->skbaddr;
    $protocol = args->protocol;
    $location = args->location;

    printf("TCP packet dropped by %s at %p\n", comm, $location);
    printf("Stack trace:\n%s\n", kstack);
}

END
{
    printf("Monitoring stopped.\n");
}

Use nslookup or dig to look up baidu.com's IP addresses, then drop outbound packets to them:

iptables -I OUTPUT -d 182.61.244.181/32 -p tcp -m tcp --dport 80 -j DROP
iptables -I OUTPUT -d 182.61.201.211/32 -p tcp -m tcp --dport 80 -j DROP

Output:

TCP packet dropped by curl at 0xffffffff9fced182
Stack trace:

        sk_skb_reason_drop+173
        sk_skb_reason_drop+173
        unix_stream_connect+674
        __sys_connect_file+107
        __sys_connect+181
        __x64_sys_connect+24
        x64_sys_call+8640
        do_syscall_64+126
        entry_SYSCALL_64_after_hwframe+118

(the same stack repeats several times)

TCP packet dropped by curl at 0xffffffffc0a6677c
Stack trace:

        sk_skb_reason_drop+173
        sk_skb_reason_drop+173
        nft_do_chain+988
        nft_do_chain_ipv4+110
        nf_hook_slow+67
        __ip_local_out+249
        ip_local_out+28
        __ip_queue_xmit+392
        ip_queue_xmit+21
        __tcp_transmit_skb+2694
        tcp_connect+1062
        tcp_v4_connect+1120
        __inet_stream_connect+277
        inet_stream_connect+59
        __sys_connect_file+107
        __sys_connect+181
        __x64_sys_connect+24
        x64_sys_call+8640
        do_syscall_64+126
        entry_SYSCALL_64_after_hwframe+118

Analyzing one of the stacks

  • From the stack we can roughly tell that TCP is preparing to establish a connection
  • You can also run tcpdump -i any port 80 -X; no packets show up on the wire at this point
  • TCP transmit logic then fires and the packet is queued for IP transmission
  • Then comes local IP output
  • Which triggers the netfilter (nftables firewall) hooks
  • Finally sk_skb_reason_drop: the packet is dropped, so the firewall rules are the likely cause
sk_skb_reason_drop+173       // Actual drop point
nft_do_chain+988             // NFTables firewall processing
nf_hook_slow+67              // Netfilter hook
__ip_local_out+249           // Outbound packet processing
ip_local_out+28
__ip_queue_xmit+392          // IP transmission
ip_queue_xmit+21
__tcp_transmit_skb+2694      // TCP packet transmission
tcp_connect+1062             // TCP connection attempt
tcp_v4_connect+1120
inet_stream_connect+277
__sys_connect_file+107       // connect() syscall from curl

On Ubuntu, run the following to install kernel debug symbols:

codename=$(lsb_release -cs)
sudo tee /etc/apt/sources.list.d/ddebs.list << EOF
deb http://ddebs.ubuntu.com/ ${codename}      main restricted universe multiverse
deb http://ddebs.ubuntu.com/ ${codename}-updates  main restricted universe multiverse
EOF
sudo apt-get install -y ubuntu-dbgsym-keyring
sudo apt-get update
sudo apt-get install -y linux-image-$(uname -r)-dbgsym

Install the faddr2line script

  • The Linux kernel maintains a faddr2line script that maps a function name plus offset to a source file name and line number
  • faddr2line lives in the kernel source tree under scripts/
  • Make it executable: chmod +x faddr2line

Search for files whose names begin with vmlinux:

find /usr/lib/debug/ -name 'vmlinux*'
# Result
/usr/lib/debug/boot/vmlinux-6.11.0-29-generic

Resolve the address:

./faddr2line /usr/lib/debug/boot/vmlinux-6.11.0-29-generic ip_local_out+28

The corresponding kernel source, for reference:

int __ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
{
	struct iphdr *iph = ip_hdr(skb);

	IP_INC_STATS(net, IPSTATS_MIB_OUTREQUESTS);

	iph_set_totlen(iph, skb->len);
	ip_send_check(iph);

	/* if egress device is enslaved to an L3 master device pass the
	 * skb to its handler for processing
	 */
	skb = l3mdev_ip_out(sk, skb);
	if (unlikely(!skb))
		return 0;

	skb->protocol = htons(ETH_P_IP);

	return nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_OUT,
		       net, sk, skb, NULL, skb_dst(skb)->dev,
		       dst_output);
}

See also: nf_hook_slow in the 6.11 kernel source.

Socket-Based Load Balancing

Forwarding packets with a socket map takes the following steps:

  • Create a socket map
  • In a BPF_PROG_TYPE_SOCK_OPS program, store newly created sockets into the socket map
  • In a stream-parser program (BPF_PROG_TYPE_SK_SKB or BPF_PROG_TYPE_SK_MSG), look up the socket in the map and call a BPF helper to forward the packet
  • Load the eBPF programs and attach them to socket events

Reference source, 6.11 kernel

bpf_msg_redirect_hash()

long bpf_msg_redirect_hash(struct sk_msg_buff *msg, struct bpf_map *map,
                           void *key, u64 flags)

Description
This helper is used in programs implementing policies at the socket  level.  If  the
message  msg  is allowed to pass (i.e. if the verdict eBPF program returns SK_PASS),
redirect it to the socket referenced by map (of  type  BPF_MAP_TYPE_SOCKHASH)  using
hash  key.  Both  ingress  and  egress  interfaces  can be used for redirection. The
BPF_F_INGRESS value in flags is used to make the distinction (ingress  path  is  se‐
lected  if  the  flag is present, egress path otherwise). This is the only flag sup‐
ported for now.

Return SK_PASS on success, or SK_DROP on error.

The sockops.h header

  • A BPF_MAP_TYPE_SOCKHASH socket map is used as the example
  • The value is a socket file descriptor
  • The SEC(".maps") annotation defines the socket map
#ifndef __SOCK_OPS_H__
#define __SOCK_OPS_H__

#include <linux/bpf.h>

struct sock_key {
        __u32 sip;
        __u32 dip;
        __u32 sport;
        __u32 dport;
        __u32 family;
};

struct {
        __uint(type, BPF_MAP_TYPE_SOCKHASH);
        __uint(key_size, sizeof(struct sock_key));
        __uint(value_size, sizeof(int));
        __uint(max_entries, 65535);
        __uint(map_flags, 0);
} sock_ops_map SEC(".maps");

#endif                          /* __SOCK_OPS_H__ */

The sockops.bpf.c implementation

  • A BPF_PROG_TYPE_SOCK_OPS program tracks socket events
  • It saves socket information into the SOCKHASH map
  • It calls a BPF helper to update the socket map
  • BPF_NOEXIST means the element is added only if the key does not already exist
#include <linux/bpf.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>
#include <linux/in.h>       // AF_INET, etc.
#include <linux/tcp.h>
#include "sockops.h"

#ifndef AF_INET
#define AF_INET 2  // IPv4 protocol family number
#endif

SEC("sock_ops")
int bpf_sockmap(struct bpf_sock_ops *skops)
{
        /* skip if the packet is not ipv4 */
        if (skops->family != AF_INET) {
                return BPF_OK;
        }

        /* skip if it is not established op */
        if (skops->op != BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB
            && skops->op != BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB) {
                return BPF_OK;
        }

        struct sock_key key = {
                .dip = skops->remote_ip4,
                .sip = skops->local_ip4,
                /* convert to network byte order */
                .sport = bpf_htonl(skops->local_port),
                .dport = skops->remote_port,
                .family = skops->family,
        };

        bpf_sock_hash_update(skops, &sock_ops_map, &key, BPF_NOEXIST);
        return BPF_OK;
}

char LICENSE[] SEC("license") = "Dual BSD/GPL";

The sockredir.bpf.c redirect program. Two points to note:

  • Before socket redirection, even between two containers on the same host, traffic between the load balancer's socket and the web server's socket still traverses the full kernel protocol stack
  • After redirection, packets from sending socket 1 are handed to receiving socket 2 at the socket layer, skipping the extra protocol-stack processing
#include <linux/bpf.h>
#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>
#include "sockops.h"

#ifndef AF_INET
#define AF_INET 2  // IPv4 protocol family
#endif

SEC("sk_msg")
int bpf_redir(struct sk_msg_md *msg)
{
        struct sock_key key = {
                .sip = msg->remote_ip4,
                .dip = msg->local_ip4,
                .dport = bpf_htonl(msg->local_port), /* 32-bit network byte order, matching the sockops key */
                .sport = msg->remote_port,
                .family = msg->family,
        };

        bpf_msg_redirect_hash(msg, &sock_ops_map, &key, BPF_F_INGRESS);
        return SK_PASS;
}

char LICENSE[] SEC("license") = "Dual BSD/GPL";

Compile:

clang -g -O2 -target bpf -D__TARGET_ARCH_x86 -I/usr/include/x86_64-linux-gnu -I. -c sockops.bpf.c -o sockops.bpf.o

clang -g -O2 -target bpf -D__TARGET_ARCH_x86 -I/usr/include/x86_64-linux-gnu -I. -c sockredir.bpf.c -o sockredir.bpf.o

Makefile

cat Makefile
.PHONY: build
build:
        clang -g -O2 -target bpf -D__TARGET_ARCH_x86 -I/usr/include/x86_64-linux-gnu -I. -c sockops.bpf.c -o sockops.bpf.o
        clang -g -O2 -target bpf -D__TARGET_ARCH_x86 -I/usr/include/x86_64-linux-gnu -I. -c sockredir.bpf.c -o sockredir.bpf.o

run:
        bpftool prog load sockops.bpf.o /sys/fs/bpf/sockops type sockops pinmaps /sys/fs/bpf
        bpftool prog load sockredir.bpf.o /sys/fs/bpf/sockredir type sk_msg map name sock_ops_map pinned /sys/fs/bpf/sock_ops_map
        bpftool cgroup attach /sys/fs/cgroup/ sock_ops pinned /sys/fs/bpf/sockops
        bpftool prog attach pinned /sys/fs/bpf/sockredir msg_verdict pinned /sys/fs/bpf/sock_ops_map

map:
        sudo bpftool map dump name sock_ops_map

clean:
        bpftool prog detach pinned /sys/fs/bpf/sockredir msg_verdict pinned /sys/fs/bpf/sock_ops_map
        bpftool cgroup detach /sys/fs/cgroup/ sock_ops name bpf_sockmap
        rm -rf /sys/fs/bpf/sockops /sys/fs/bpf/sockredir /sys/fs/bpf/sock_ops_map
        docker rm -f http1 http2 client nginx

Print the sections:

llvm-objdump --section-headers sockops.bpf.o

sockops.bpf.o:  file format elf64-bpf

Sections:
Idx Name                   Size     VMA              Type
  0                        00000000 0000000000000000
  1 .strtab                00000120 0000000000000000
  2 .text                  00000000 0000000000000000 TEXT
  3 sock_ops               000000d0 0000000000000000 TEXT
  4 .relsock_ops           00000010 0000000000000000
  5 .maps                  00000028 0000000000000000 DATA
  6 license                0000000d 0000000000000000 DATA
  7 .debug_loclists        00000017 0000000000000000 DEBUG
  8 .debug_abbrev          00000168 0000000000000000 DEBUG
  9 .debug_info            000004a3 0000000000000000 DEBUG
 10 .rel.debug_info        00000050 0000000000000000
 11 .debug_str_offsets     000001cc 0000000000000000 DEBUG
 12 .rel.debug_str_offsets 00000710 0000000000000000
 13 .debug_str             000005a5 0000000000000000 DEBUG
 14 .debug_addr            00000020 0000000000000000 DEBUG
 15 .rel.debug_addr        00000030 0000000000000000
 16 .BTF                   000008df 0000000000000000
 17 .rel.BTF               00000020 0000000000000000
 18 .BTF.ext               00000130 0000000000000000
 19 .rel.BTF.ext           00000100 0000000000000000
 20 .debug_frame           00000028 0000000000000000 DEBUG
 21 .rel.debug_frame       00000020 0000000000000000
 22 .debug_line            00000100 0000000000000000 DEBUG
 23 .rel.debug_line        000000c0 0000000000000000
 24 .debug_line_str        0000009a 0000000000000000 DEBUG
 25 .llvm_addrsig          00000003 0000000000000000
 26 .symtab                00000168 0000000000000000

How BPF Loading Really Works

graph TD
    A[Source Code] -->|Compiled| B[ELF Object]
    B -->|Contains| C[Sections]
    C --> D["SEC('sock_ops')"]
    D --> E["Program bytecode"]
    E -->|Loaded via| F["bpftool type sockops"]
    C --> G["SEC('sk_msg')"]
    G --> H["Program bytecode"]
    H -->|Loaded via| I["bpftool type sk_msg"]

About attach

sudo bpftool cgroup attach /sys/fs/cgroup/ sock_ops pinned /sys/fs/bpf/sockops

  • Cgroup control: attaching to the root cgroup (/sys/fs/cgroup/) makes the BPF program affect all containers and processes on the system
  • Socket lifecycle hooks: sock_ops programs trigger on socket events (creation, connection, teardown) within cgroup-managed processes
  • Docker integration: since Docker uses cgroups for resource isolation, the BPF hooks apply to all containers, capture the socket operations of Nginx and the web services, and work across container boundaries

BPF vs Nginx Data Flow

graph TB
    Client-->Nginx[Nginx Userspace]
    subgraph Kernel Space
        Nginx-->BPF[BPF Socket Redirection]
        BPF-->Web1[Web Service 1]
        BPF-->Web2[Web Service 2]
    end

Comparison

# Normal Nginx path:
Client → NIC → Kernel TCP/IP → Nginx (userspace) → Kernel → Web Service

# BPF-optimized path:
Client → NIC → BPF Sockmap → Web Service (direct socket redirect)

Cleanup

  • The cleanup counterpart of attach is detach
  • bpftool has no unload subcommand to pair with load
  • eBPF programs and maps are bound to the BPF filesystem
  • Once the pinned files are deleted and the reference count drops to 0, the kernel cleans them up automatically

Inspect:

bpftool prog show

57: sock_ops  name my_bpf_sockmap  tag f9342171d0f7792b  gpl
        loaded_at 2025-07-12T19:13:23+0800  uid 0
        xlated 256B  jited 142B  memlock 4096B  map_ids 2
        btf_id 120
64: sk_msg  name my_bpf_redir  tag 1cb972b0d638695a  gpl
        loaded_at 2025-07-12T19:13:29+0800  uid 0
        xlated 200B  jited 135B  memlock 4096B  map_ids 2
        btf_id 131



bpftool map show 

1: hash  flags 0x0
        key 9B  value 1B  max_entries 500  memlock 59424B
2: sockhash  name sock_ops_map  flags 0x0
        key 20B  value 4B  max_entries 65535  memlock 1048984B

Force detach:

bpftool cgroup detach /sys/fs/cgroup/ sock_ops id 101

Observe the redirection:

sudo bpftool map dump name sock_ops_map

Traditional forwarding flow

sequenceDiagram
    Client->>Nginx NIC: SYN
    Nginx NIC->>Nginx Kernel: Packet
    Nginx Kernel->>Nginx Userspace: Copy
    Nginx Userspace->>Nginx Kernel: Process + New Connection
    Nginx Kernel->>Server NIC: SYN
    Server NIC->>Server Kernel: Packet
    Server Kernel->>Server App: Copy
    Note right of Server App: 4 Copy Operations<br>2 Full TCP Stacks<br>2 Userspace Context Switches

BPF forwarding flow

sequenceDiagram
    Client->>Nginx NIC: SYN
    Nginx NIC->>BPF: Packet
    BPF->>Server NIC: SYN (Direct)
    Server NIC->>Server App: Packet
    Note right of BPF: 0 Copy Operations<br>1 Partial TCP Stack<br>0 Userspace Context Switches

Execution flow

sequenceDiagram
    participant Kernel as Kernel Network Stack
    participant BPF as sockops BPF Program
    participant Map as sock_ops_map
    
    Kernel->>BPF: Triggers on socket event (TCP handshake)
    BPF->>BPF: Verify IPv4 && Established connection
    BPF->>BPF: Build sock_key from skops context
    BPF->>Map: Insert socket into sockhash map
    Kernel-->>BPF: Return BPF_OK

XDP Forwarding

Some reference information

xdp_md

  • data, data_end, and data_meta give the packet start, end, and metadata positions
  • The last three fields describe the associated NICs: the ingress interface index, the ingress queue index, and the egress interface index
struct xdp_md {
	__u32 data;
	__u32 data_end;
	__u32 data_meta;
	/* Below access go through struct xdp_rxq_info */
	__u32 ingress_ifindex; /* rxq->dev->ifindex */
	__u32 rx_queue_index;  /* rxq->queue_index  */

	__u32 egress_ifindex;  /* txq->dev->ifindex */
};

xdp-proxy.bpf.c

  • Cast the packet start pointer from ctx to an ethhdr, i.e. the Ethernet frame header
  • Then check the length and whether the frame carries an IP packet
  • Skipping the layer-2 header this way yields the layer-3 header:
  • struct iphdr *iph = data + sizeof(struct ethhdr);
#include <stddef.h>
#include <linux/bpf.h>
#include <linux/in.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* IP Addresses (network byte order) */
#define CLIENT_IP 0x40011ac        // 172.17.0.4
#define LOADBALANCER_IP 0x50011ac  // 172.17.0.5
#define ENDPOINT1_IP 0x20011ac     // 172.17.0.2
#define ENDPOINT2_IP 0x30011ac     // 172.17.0.3

/* Full MAC Addresses */
// static const __u8 CLIENT_MAC[ETH_ALEN] = {0x72, 0xF7, 0x37, 0xC2, 0x9B, 0x2A};
// static const __u8 LOADBALANCER_MAC[ETH_ALEN] = {0x72, 0xF7, 0x37, 0xC2, 0x9B, 0x2A};
// static const __u8 ENDPOINT1_MAC[ETH_ALEN] = {0xca, 0x07, 0x58, 0x08, 0x50, 0xc8};
// static const __u8 ENDPOINT2_MAC[ETH_ALEN] = {0x0a, 0x3b, 0xff, 0x01, 0xb3, 0x44};

// static const __u8 CLIENT_MAC[ETH_ALEN] = {0x62, 0xA3, 0x5B, 0x0F, 0xE9, 0x6A};
// static const __u8 LOADBALANCER_MAC[ETH_ALEN] = {0x72, 0x2e, 0x3e, 0xc1, 0x64, 0x79};
// static const __u8 ENDPOINT1_MAC[ETH_ALEN] = {0x1e, 0x5f, 0x13, 0xf3, 0xb4, 0x1f};
// static const __u8 ENDPOINT2_MAC[ETH_ALEN] = {0xfa, 0x81, 0x9f, 0x76, 0x5f, 0xb0};

static const __u8 CLIENT_MAC[ETH_ALEN] = {0x1a, 0x44, 0xba, 0xee, 0x0c, 0x36};
static const __u8 LOADBALANCER_MAC[ETH_ALEN] = {0xb2, 0x4f, 0x4e, 0x9e, 0xbe, 0xf3};
static const __u8 ENDPOINT1_MAC[ETH_ALEN] = {0x32, 0xca, 0x39, 0x89, 0xcc, 0xfa};
static const __u8 ENDPOINT2_MAC[ETH_ALEN] = {0xc6, 0x56, 0x83, 0x3d, 0x78, 0xa4};


/* Checksum helpers (unchanged) */
static __always_inline __u16 csum_fold_helper(__u64 csum) {
    int i;
    #pragma unroll
    for (i = 0; i < 4; i++) {
        if (csum >> 16)
            csum = (csum & 0xffff) + (csum >> 16);
    }
    return ~csum;
}

static __always_inline __u16 ipv4_csum(struct iphdr *iph) {
    iph->check = 0;
    unsigned long long csum = bpf_csum_diff(0, 0, (unsigned int *)iph, sizeof(struct iphdr), 0);
    return csum_fold_helper(csum);
}

/* MAC address copy helper */
static __always_inline void copy_mac(__u8 *dest, const __u8 *src) {
    #pragma unroll
    for (int i = 0; i < ETH_ALEN; i++) {
        dest[i] = src[i];
    }
}

SEC("xdp")
int xdp_proxy(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if (data + sizeof(struct ethhdr) > data_end)
        return XDP_ABORTED;

    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *iph = data + sizeof(struct ethhdr);
    if (data + sizeof(struct ethhdr) + sizeof(struct iphdr) > data_end)
        return XDP_ABORTED;

    if (iph->protocol != IPPROTO_TCP)
        return XDP_PASS;

    const __u8 *target_mac;
    __be32 target_ip;

    if (iph->saddr == CLIENT_IP) {
        // Forward to random endpoint
        if (bpf_ktime_get_ns() & 0x1) {
            target_ip = ENDPOINT2_IP;
            target_mac = ENDPOINT2_MAC;
        } else {
            target_ip = ENDPOINT1_IP;
            target_mac = ENDPOINT1_MAC;
        }
    } else {
        // Return to client
        target_ip = CLIENT_IP;
        target_mac = CLIENT_MAC;
    }

    // Update destination MAC and IP
    copy_mac(eth->h_dest, target_mac);
    iph->daddr = target_ip;

    // Update source to LB
    copy_mac(eth->h_source, LOADBALANCER_MAC);
    iph->saddr = LOADBALANCER_IP;

    // Recalculate checksum
    iph->check = ipv4_csum(iph);

    return XDP_TX;
}

char _license[] SEC("license") = "Dual BSD/GPL";

The IP and MAC mappings used above:

	client       => 172.17.0.4 (Hex 0x40011ac) => 1a:44:ba:ee:0c:36 
	loadbalancer => 172.17.0.5 (Hex 0x50011ac) => b2:4f:4e:9e:be:f3
	endpoint1    => 172.17.0.2 (Hex 0x20011ac) => 32:ca:39:89:cc:fa
	endpoint2    => 172.17.0.3 (Hex 0x30011ac) => c6:56:83:3d:78:a4

Userspace code: xdp-proxy.c

  • Include the skeleton header
  • Raise RLIMIT_MEMLOCK
  • Initialize and load the BPF bytecode
  • Attach the XDP program
#include <stdio.h>
#include <sys/resource.h>
#include <bpf/libbpf.h>
#include <bpf/bpf.h>
#include <net/if.h>
#include <linux/types.h>
#include <linux/if_link.h>
#include "xdp-proxy.skel.h"

/* Attach to eth0 by default */
#define DEV_NAME "eth0"

int main(int argc, char **argv)
{
        __u32 xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE;
        struct xdp_proxy_bpf *obj;
        int err = 0;

        struct rlimit rlim_new = {
                .rlim_cur = RLIM_INFINITY,
                .rlim_max = RLIM_INFINITY,
        };

        err = setrlimit(RLIMIT_MEMLOCK, &rlim_new);
        if (err)
        {
                fprintf(stderr, "failed to change rlimit\n");
                return 1;
        }

        unsigned int ifindex = if_nametoindex(DEV_NAME);
        if (ifindex == 0)
        {
                fprintf(stderr, "failed to find interface %s\n", DEV_NAME);
                return 1;
        }

        obj = xdp_proxy_bpf__open();
        if (!obj)
        {
                fprintf(stderr, "failed to open and/or load BPF object\n");
                return 1;
        }

        err = xdp_proxy_bpf__load(obj);
        if (err)
        {
                fprintf(stderr, "failed to load BPF object %d\n", err);
                goto cleanup;
        }

        /* Attach the XDP program to eth0 */
        int prog_fd = bpf_program__fd(obj->progs.xdp_proxy);
        LIBBPF_OPTS(bpf_xdp_attach_opts, attach_opts);
        err = bpf_xdp_attach(ifindex, prog_fd, xdp_flags, &attach_opts);
        if (err)
        {
                fprintf(stderr, "failed to attach BPF programs\n");
                goto cleanup;
        }

        printf("Successfully run! Tracing /sys/kernel/debug/tracing/trace_pipe.\n");
        system("cat /sys/kernel/debug/tracing/trace_pipe");

cleanup:
        /* detach and free XDP program on exit */
        bpf_xdp_detach(ifindex, xdp_flags, &attach_opts);
        xdp_proxy_bpf__destroy(obj);
        return err != 0;
}

Compile the BPF kernel code:

clang -g -O2 -target bpf -D__TARGET_ARCH_x86 -I/usr/include/x86_64-linux-gnu -I. -c xdp-proxy.bpf.c -o xdp-proxy.bpf.o

Generate the skeleton header:

bpftool gen skeleton xdp-proxy.bpf.o > xdp-proxy.skel.h 

Compile the userspace code:

clang -g -O2 -Wall -I. -c xdp-proxy.c -o xdp-proxy.o

Link the binary (dynamic linking):

clang -Wall -O2 -g xdp-proxy.o -lbpf -lelf -lz -o xdp-proxy

Static linking, passing the host's static library files explicitly:

clang xdp-proxy.o \
  /usr/lib/x86_64-linux-gnu/libbpf.a \
  /usr/lib/x86_64-linux-gnu/libelf.a \
  /usr/lib/x86_64-linux-gnu/libz.a \
  /usr/lib/x86_64-linux-gnu/libzstd.a \
-O2 -g    -o xdp-proxy

Static linking with -static:

clang -Wall -O2 -g xdp-proxy.o -static \
  -lbpf -lelf -lz -lzstd \
  -o xdp-proxy

Kernel processing flow

sequenceDiagram
    participant NIC as Network Interface
    participant XDP as XDP Program
    participant Driver as NIC Driver
    participant Kernel as Kernel Network Stack

    NIC->>XDP: Packet Received
    XDP->>XDP: Modify Packet (IP/MAC)
    XDP->>Driver: Return XDP_TX
    Driver->>NIC: Transmit Modified Packet
    Note right of Driver: Bypasses Kernel Network Stack
    Note left of NIC: Packet Sent Back to Wire

Start the Docker containers:

docker run -itd --name=http1 --hostname=http1 feisky/webserver
docker run -itd --name=http2 --hostname=http2 feisky/webserver

docker run -itd --name=client alpine

# LB
docker run -itd --name=lb --privileged -v /sys/kernel/debug:/sys/kernel/debug alpine

Enter the client container and run curl repeatedly; the responses should alternate between http1 and http2, showing that load balancing works:

/ # curl "http://172.17.0.5"
Hostname: http2

/ # curl "http://172.17.0.5"
Hostname: http2

/ # curl "http://172.17.0.5"
Hostname: http1

/ # curl "http://172.17.0.5"
Hostname: http1

/ # curl "http://172.17.0.5"
Hostname: http1


Improved Version

The xdp-proxy.h header

#ifndef __XDP_PROXY_H__
#define __XDP_PROXY_H__

#include <linux/types.h>
#include <linux/if_ether.h>

#define SVC1_KEY 0x1

struct endpoints {
        __be32 client;
        __be32 ep1;
        __be32 ep2;
        __be32 vip;
        unsigned char ep1_mac[ETH_ALEN];
        unsigned char ep2_mac[ETH_ALEN];
        unsigned char client_mac[ETH_ALEN];
        unsigned char vip_mac[ETH_ALEN];
} __attribute__((packed));

#ifndef memcpy
#define memcpy(dest, src, n) __builtin_memcpy((dest), (src), (n))
#endif

#endif

xdp-proxy.bpf.c

#include <stddef.h>
#include <linux/bpf.h>
#include <linux/in.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>
#include "xdp-proxy.h"

/* define a hashmap for userspace to update service endpoints */
struct
{
        __uint(type, BPF_MAP_TYPE_HASH);
        __type(key, __be32);
        __type(value, struct endpoints);
        __uint(max_entries, 1024);
} services SEC(".maps");

/* Refer https://github.com/facebookincubator/katran/blob/main/katran/lib/bpf/csum_helpers.h#L30 */
static __always_inline __u16 csum_fold_helper(__u64 csum)
{
        int i;
#pragma unroll
        for (i = 0; i < 4; i++)
        {
                if (csum >> 16)
                        csum = (csum & 0xffff) + (csum >> 16);
        }
        return ~csum;
}

static __always_inline __u16 ipv4_csum(struct iphdr *iph)
{
        iph->check = 0;
        unsigned long long csum =
                bpf_csum_diff(0, 0, (unsigned int *)iph, sizeof(struct iphdr), 0);
        return csum_fold_helper(csum);
}

SEC("xdp")
int xdp_proxy(struct xdp_md *ctx)
{
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        struct ethhdr *eth = data;
        /* abort on illegal packets */
        if (data + sizeof(struct ethhdr) > data_end)
        {
                return XDP_ABORTED;
        }

        /* do nothing for non-IP packets */
        if (eth->h_proto != bpf_htons(ETH_P_IP))
        {
                return XDP_PASS;
        }

        struct iphdr *iph = data + sizeof(struct ethhdr);
        /* abort on illegal packets */
        if (data + sizeof(struct ethhdr) + sizeof(struct iphdr) > data_end)
        {
                return XDP_ABORTED;
        }

        /* do nothing for non-TCP packets */
        if (iph->protocol != IPPROTO_TCP)
        {
                return XDP_PASS;
        }

        __be32 svc1_key = SVC1_KEY;
        struct endpoints *ep = bpf_map_lookup_elem(&services, &svc1_key);
        if (!ep)
        {
                return XDP_PASS;
        }
        // bpf_printk("Client IP: %ld", ep->client);
        // bpf_printk("Endpoint IPs: %ld, %ld", ep->ep1, ep->ep2);
        // bpf_printk("New TCP packet %ld => %ld\n", iph->saddr, iph->daddr);

        if (iph->saddr == ep->client)
        {
                iph->daddr = ep->ep1;
                memcpy(eth->h_dest, ep->ep1_mac, ETH_ALEN);

                /* simulate random selection of two endpoints */
                if ((bpf_ktime_get_ns() & 0x1) == 0x1)
                {
                        iph->daddr = ep->ep2;
                        memcpy(eth->h_dest, ep->ep2_mac, ETH_ALEN);
                }
        }
        else
        {
                iph->daddr = ep->client;
                memcpy(eth->h_dest, ep->client_mac, ETH_ALEN);
        }

        /* packet source is always LB itself */
        iph->saddr = ep->vip;
        memcpy(eth->h_source, ep->vip_mac, ETH_ALEN);

        /* recalculate IP checksum */
        iph->check = ipv4_csum(iph);

        /* send packet back to network stack */
        return XDP_TX;
}

char _license[] SEC("license") = "Dual BSD/GPL";

References