⌈xv6-fall2021⌋ lab 7：networking

mit6.828


 2022/04/11 

ABOUT

Lab page: Lab: networking

INTO

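This lab involves driver programming, so a programming manual is indispensable. For convenience, I extracted the sections the lab recommends reading and machine-translated them, for reference only. The netdisk link is as follows:

Extraction code: byhb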

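To be honest, the recommended material is very comprehensive, even covering the PCI bus, the register set and so on. But for driver programming such as this lab, all I wanted was to find out how to transmit (send) a packet and how to receive one. So after reading section 3.2 and section 3.3 and not finding the transmit and receive procedures I was hoping for, I stopped reading. What helped me most in the whole lab was the hints from its predecessor in earlier years, lab networking, which can still be found online. The following material is for reference only:

Apparently the 2018 lab?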

Background (excerpt, with translation)

You'll use a network device called the E1000 to handle network communication. To xv6 (and the driver you write), the E1000 looks like a real piece of hardware connected to a real Ethernet local area network (LAN). In fact, the E1000 your driver will talk to is an emulation provided by qemu, connected to a LAN that is also emulated by qemu. On this emulated LAN, xv6 (the "guest") has an IP address of 10.0.2.15. Qemu also arranges for the computer running qemu to appear on the LAN with IP address 10.0.2.2. When xv6 uses the E1000 to send a packet to 10.0.2.2, qemu delivers the packet to the appropriate application on the (real) computer on which you're running qemu (the "host").

In short: the E1000 you program against behaves like real hardware on a real Ethernet LAN, but both the card and the LAN are emulated by qemu. On this LAN, xv6 (the guest) is 10.0.2.15, and the computer running qemu (translator's note: the host machine, the same "host" used below) appears as 10.0.2.2. A packet that xv6 sends to 10.0.2.2 is delivered by qemu to the appropriate program on the host.

The Makefile configures QEMU to record all incoming and outgoing packets to the file packets.pcap in your lab directory. It may be helpful to review these recordings to confirm that xv6 is transmitting and receiving the packets you expect. To display the recorded packets:
tcpdump -XXnr packets.pcap

In short: the Makefile has qemu log every incoming and outgoing packet to packets.pcap in the lab directory. Inspecting that log with the command above is a handy way to confirm, while debugging, that xv6 sends and receives the packets you expect.

We've added some files to the xv6 repository for this lab. The file kernel/e1000.c contains initialization code for the E1000 as well as empty functions for transmitting and receiving packets, which you'll fill in. kernel/e1000_dev.h contains definitions for registers and flag bits defined by the E1000 and described in the Intel E1000 Software Developer's Manual. kernel/net.c and kernel/net.h contain a simple network stack that implements the IP, UDP, and ARP protocols. These files also contain code for a flexible data structure to hold packets, called an mbuf. Finally, kernel/pci.c contains code that searches for an E1000 card on the PCI bus when xv6 boots.

In short: the lab adds several files to the xv6 repo. kernel/e1000.c holds the E1000 initialization code plus the empty transmit and receive functions you must fill in. kernel/e1000_dev.h defines the E1000 registers and flag bits (see the Intel E1000 Software Developer's Manual). kernel/net.c and kernel/net.h implement a simple network stack (IP, UDP, ARP) and the mbuf packet structure. kernel/pci.c contains the code that looks for an E1000 card on the PCI bus when xv6 boots.

Lab requirements (excerpt, with translation)

Your job is to complete e1000_transmit() and e1000_recv(), both in kernel/e1000.c, so that the driver can transmit and receive packets. You are done when make grade says your solution passes all the tests.

In short: complete e1000_transmit() and e1000_recv() in kernel/e1000.c so that the driver can send and receive packets. The lab is finished only when make grade reports that all tests pass.

The e1000_init() function we provide you in e1000.c configures the E1000 to read packets to be transmitted from RAM, and to write received packets to RAM.

In short: e1000_init() in e1000.c configures the E1000 to read outgoing packets from memory and to write received packets into memory.

Because bursts of packets might arrive faster than the driver can process them, e1000_init() provides the E1000 with multiple buffers into which the E1000 can write packets. The E1000 requires these buffers to be described by an array of "descriptors" in RAM; each descriptor contains an address in RAM where the E1000 can write a received packet. struct rx_desc describes the descriptor format. The array of descriptors is called the receive ring, or receive queue. It's a circular ring in the sense that when the card or driver reaches the end of the array, it wraps back to the beginning. e1000_init() allocates mbuf packet buffers for the E1000 to DMA into, using mbufalloc(). There is also a transmit ring into which the driver places packets it wants the E1000 to send. e1000_init() configures the two rings to have size RX_RING_SIZE and TX_RING_SIZE.

In short: because a burst of packets can arrive faster than the driver processes them, e1000_init() gives the E1000 several packet buffers to write into. The E1000 requires these buffers to be described by an array of "descriptors" in memory, each holding the address of a buffer the card may write a received packet to; struct rx_desc is the descriptor format. The array is called the receive ring (or receive queue). It is circular: after the last element, the card or driver wraps back to the first. e1000_init() uses mbufalloc() to allocate the mbuf packet buffers that the E1000 will DMA into (translator's note: a packet that lands in one of these buffers is already in memory). The transmit ring, where the driver places packets for the E1000 to send, works the same way. The two rings have sizes RX_RING_SIZE and TX_RING_SIZE.

When the network stack in net.c needs to send a packet, it calls e1000_transmit() with an mbuf that holds the packet to be sent. Your transmit code must place a pointer to the packet data in a descriptor in the TX (transmit) ring. struct tx_desc describes the descriptor format. You will need to ensure that each mbuf is eventually freed, but only after the E1000 has finished transmitting the packet (the E1000 sets the E1000_TXD_STAT_DD bit in the descriptor to indicate this).

In short: when the network stack in net.c wants to send a packet, it calls e1000_transmit() with the mbuf that holds the packet. Your transmit code must put a pointer to the packet data into a descriptor of the TX (transmit) ring; struct tx_desc is that descriptor's format. Every mbuf must eventually be freed, but only after the E1000 has finished transmitting it, which the card signals by setting E1000_TXD_STAT_DD in the descriptor.

When the E1000 receives each packet from the ethernet, it first DMAs the packet to the mbuf pointed to by the next RX (receive) ring descriptor, and then generates an interrupt. Your e1000_recv() code must scan the RX ring and deliver each new packet's mbuf to the network stack (in net.c) by calling net_rx(). You will then need to allocate a new mbuf and place it into the descriptor, so that when the E1000 reaches that point in the RX ring again it finds a fresh buffer into which to DMA a new packet.

In short: for each packet arriving from the Ethernet, the E1000 first DMAs it into the mbuf that the next RX ring descriptor points to, and then raises an interrupt. Your e1000_recv() must scan the RX ring and pass each newly arrived packet's mbuf to the network stack (net.c) via net_rx(). It must then allocate a fresh mbuf and store its address in that descriptor, so that the next time the E1000 comes around to this slot it has a new buffer to DMA into.

To test your driver, run make server in one window, and in another window run make qemu and then run nettests in xv6. The first test in nettests tries to send a UDP packet to the host operating system, addressed to the program that make server runs.

In short: to test the driver, run make server in one window, then run make qemu in another and execute nettests inside xv6. The first test tries to send a UDP packet to the host, addressed to the program started by make server.

After you've completed the lab, the E1000 driver will send the packet, qemu will deliver it to your host computer, make server will see it, it will send a response packet, and the E1000 driver and then nettests will see the response packet. Before the host sends the reply, however, it sends an "ARP" request packet to xv6 to find out its 48-bit Ethernet address, and expects xv6 to respond with an ARP reply. kernel/net.c will take care of this once you have finished your work on the E1000 driver. If all goes well, nettests will print testing ping: OK, and make server will print a message from xv6!.

In short: once the lab is complete, the E1000 driver sends the packet, qemu delivers it to the host, the make server program receives it and sends a response, and the E1000 driver and then nettests see that response. Before replying, however, the host sends an "ARP" request to xv6 to learn xv6's 48-bit Ethernet address and expects an ARP reply; kernel/net.c takes care of this once your driver works. If all goes well, nettests prints testing ping: OK and make server prints a message from xv6!.

Lab hints (excerpt, with translation)

Some hints for implementing e1000_transmit:

First ask the E1000 for the TX ring index at which it's expecting the next packet, by reading the E1000_TDT control register.

Then check if the ring is overflowing. If E1000_TXD_STAT_DD is not set in the descriptor indexed by E1000_TDT, the E1000 hasn't finished the corresponding previous transmission request, so return an error.

Otherwise, use mbuffree() to free the last mbuf that was transmitted from that descriptor (if there was one).

Then fill in the descriptor. m->head points to the packet's content in memory, and m->len is the packet length. Set the necessary cmd flags (look at Section 3.3 in the E1000 manual) and stash away a pointer to the mbuf for later freeing.

Finally, update the ring position by adding one to E1000_TDT modulo TX_RING_SIZE.

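If e1000_transmit() added the mbuf successfully to the ring, return 0. On failure (e.g., there is no descriptor available to transmit the mbuf), return -1 so that the caller knows to free the mbuf.

Hints for e1000_transmit(), restated:

Read the E1000_TDT control register to ask the E1000 for the TX ring index at which it expects the next packet
Check whether the ring is overflowing: if E1000_TXD_STAT_DD is not set in the descriptor at index E1000_TDT, the E1000 has not finished the previous transmission request in that slot, so return an error
Otherwise, call mbuffree() to free the mbuf last transmitted from that descriptor (if any)
Fill in the descriptor: m->head is the address of the packet in memory and m->len is its length. Set the necessary cmd flags and save a pointer to the mbuf so it can be freed later
Finally, update the ring position by adding one to E1000_TDT modulo TX_RING_SIZE
Return 0 if e1000_transmit() added the mbuf to the ring; on failure (for example, no descriptor is available to transmit the mbuf) return -1 so that the caller knows to free the mbuf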
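
The six transmit steps above can be tried out in user space before touching the kernel. The following is only a sketch under stated assumptions: every sim_ name is made up for illustration, the bit values are assumed rather than taken from e1000_dev.h, and clearing the status byte in sim_transmit() is a simplification of the simulation so that a "busy" slot can be observed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_TX_RING_SIZE 16
#define SIM_TXD_STAT_DD  0x01 /* "descriptor done" status bit (assumed value) */
#define SIM_TXD_CMD_EOP  0x01 /* "end of packet" command bit (assumed value) */
#define SIM_TXD_CMD_RS   0x08 /* "report status" command bit (assumed value) */

struct sim_buf {              /* hypothetical stand-in for struct mbuf */
    char *head;
    uint32_t len;
    int freed;                /* set when the driver "frees" the buffer */
    char data[64];
};

struct sim_tx_desc {          /* hypothetical stand-in for struct tx_desc */
    uint64_t addr;
    uint16_t length;
    uint8_t cmd;
    uint8_t status;
};

struct sim_tx {
    struct sim_tx_desc ring[SIM_TX_RING_SIZE];
    struct sim_buf *bufs[SIM_TX_RING_SIZE];
    uint32_t tdt;             /* stand-in for the E1000_TDT register */
};

/* Start with every slot marked "done", i.e. free for the driver. */
static void sim_tx_init(struct sim_tx *t)
{
    for (int i = 0; i < SIM_TX_RING_SIZE; i++) {
        t->ring[i] = (struct sim_tx_desc){0};
        t->ring[i].status = SIM_TXD_STAT_DD;
        t->bufs[i] = NULL;
    }
    t->tdt = 0;
}

static int sim_transmit(struct sim_tx *t, struct sim_buf *b)
{
    uint32_t idx = t->tdt;                      /* step 1: read the tail index */
    struct sim_tx_desc *d = &t->ring[idx];

    if ((d->status & SIM_TXD_STAT_DD) == 0)     /* step 2: slot still busy */
        return -1;
    if (t->bufs[idx] != NULL)                   /* step 3: free the old buffer */
        t->bufs[idx]->freed = 1;

    d->addr = (uint64_t)(uintptr_t)b->head;     /* step 4: fill the descriptor */
    d->length = (uint16_t)b->len;
    d->cmd = SIM_TXD_CMD_RS | SIM_TXD_CMD_EOP;
    d->status = 0;            /* simulation only: busy until sim_tx_done() */
    t->bufs[idx] = b;

    t->tdt = (idx + 1) % SIM_TX_RING_SIZE;      /* step 5: advance the tail */
    return 0;                                   /* step 6: success */
}

/* Simulated hardware: report that the packet in slot idx has been sent. */
static void sim_tx_done(struct sim_tx *t, uint32_t idx)
{
    t->ring[idx].status |= SIM_TXD_STAT_DD;
}
```

Filling all 16 slots without calling sim_tx_done() makes the 17th sim_transmit() return -1, which is the "return an error" case of the second hint.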

Some hints for implementing e1000_recv:

First ask the E1000 for the ring index at which the next waiting received packet (if any) is located, by fetching the E1000_RDT control register and adding one modulo RX_RING_SIZE.

Then check if a new packet is available by checking for the E1000_RXD_STAT_DD bit in the status portion of the descriptor. If not, stop.

Otherwise, update the mbuf's m->len to the length reported in the descriptor. Deliver the mbuf to the network stack using net_rx().

Then allocate a new mbuf using mbufalloc() to replace the one just given to net_rx(). Program its data pointer (m->head) into the descriptor. Clear the descriptor's status bits to zero.

Finally, update the E1000_RDT register to be the index of the last ring descriptor processed.

e1000_init() initializes the RX ring with mbufs, and you'll want to look at how it does that and perhaps borrow code.

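At some point the total number of packets that have ever arrived will exceed the ring size (16); make sure your code can handle that.

Hints for e1000_recv(), restated:

Fetch the E1000_RDT control register and add one modulo RX_RING_SIZE to get the ring index at which the next waiting received packet (if any) is located
Check the E1000_RXD_STAT_DD bit in the descriptor's status to see whether a new packet is available; if not, stop
Otherwise, set the mbuf's m->len to the length reported in the descriptor, and deliver the mbuf to the network stack with net_rx()
Then use mbufalloc() to allocate a new mbuf to replace the one just given to net_rx(), program its data pointer (m->head) into the descriptor, and clear the descriptor's status bits to zero
Finally, update E1000_RDT to the index of the last ring descriptor processed
e1000_init() initializes the RX ring with mbufs; look at how it does that and perhaps borrow code
At some point the total number of packets that have ever arrived will exceed the ring size (16); make sure your code can handle that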
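
As a way to check one's reading of the receive hints, here is a small user-space simulation. It is a sketch, not xv6 code: all sim_ names are hypothetical, the addr field holds a fake buffer id instead of a real address, and two behaviours are assumptions of the simulation, namely that the ring starts with head 0 and tail RX_RING_SIZE - 1, and that the card stops writing once its head reaches the tail.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_RX_RING_SIZE 16
#define SIM_RXD_STAT_DD  0x01 /* "descriptor done" status bit (assumed value) */

struct sim_rx_desc {
    uint64_t addr;     /* here: a fake buffer id, not a real address */
    uint16_t length;
    uint8_t status;
};

struct sim_rx {
    struct sim_rx_desc ring[SIM_RX_RING_SIZE];
    uint32_t rdh;        /* head: next slot the simulated card writes */
    uint32_t rdt;        /* tail: last slot the driver handed back */
    uint64_t next_buf;   /* stand-in for mbufalloc(): next fresh buffer id */
    uint32_t delivered;  /* stand-in for net_rx(): packets passed upward */
};

static void sim_rx_init(struct sim_rx *r)
{
    r->next_buf = 1;
    for (int i = 0; i < SIM_RX_RING_SIZE; i++) {
        r->ring[i] = (struct sim_rx_desc){0};
        r->ring[i].addr = r->next_buf++;
    }
    r->rdh = 0;
    r->rdt = SIM_RX_RING_SIZE - 1;
    r->delivered = 0;
}

/* Simulated card: store one packet; -1 means no free descriptor is left. */
static int sim_card_receive(struct sim_rx *r, uint16_t len)
{
    if (r->rdh == r->rdt)               /* head caught up with the tail */
        return -1;
    r->ring[r->rdh].length = len;
    r->ring[r->rdh].status |= SIM_RXD_STAT_DD;
    r->rdh = (r->rdh + 1) % SIM_RX_RING_SIZE;
    return 0;
}

/* Driver side, following the hints; returns how many packets it delivered. */
static uint32_t sim_recv(struct sim_rx *r)
{
    uint32_t n = 0;
    for (;;) {
        uint32_t idx = (r->rdt + 1) % SIM_RX_RING_SIZE;  /* RDT + 1 */
        struct sim_rx_desc *d = &r->ring[idx];
        if ((d->status & SIM_RXD_STAT_DD) == 0)          /* nothing new: stop */
            break;
        r->delivered++;              /* "net_rx(mbuf)" with len = d->length */
        d->addr = r->next_buf++;     /* install a fresh buffer */
        d->status = 0;               /* clear the status bits */
        r->rdt = idx;                /* last descriptor processed */
        n++;
    }
    return n;
}
```

Because every index is taken modulo the ring size, pushing more than 16 packets through the simulated ring in several bursts works without special handling, as long as the driver drains the ring between bursts.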

You'll need locks to cope with the possibility that xv6 might use the E1000 from more than one process, or might be using the E1000 in a kernel thread when an interrupt arrives.

In short: xv6 may use the E1000 from more than one process, or may be using it in a kernel thread when an interrupt arrives, so the shared data must be protected by locks.

Approach

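First, the "Background" section tells us the following:

The E1000 is the device the NIC driver talks to, and the lab task is to implement that driver. The hardware is emulated by qemu, and so is the local area network (LAN)
On the LAN, xv6 (the "guest" in the lab text) has address 10.0.2.15; the host machine (the computer running qemu, the "host" in the lab text) has address 10.0.2.2
A packet the E1000 sends to 10.0.2.2 (the host) is actually received by some application on the host machine (not necessarily qemu itself)
e1000.c and e1000_dev.h: E1000 initialization and the register and flag definitions; e1000.c contains the transmit and receive functions to fill in
net.c and net.h: implement the network stack
pci.c: looks for the E1000 card on the PCI bus at boot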

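Next, the "Lab requirements" section tells us:

e1000.c/e1000_init(): configures the E1000; the two for() loops at its start set up the descriptor arrays and packet buffers for the transmit and receive sides
Transmitting means the card reads the packet to send from memory
- the driver puts the data into tx_ring
- and increments E1000_TDT
- the card then sends out the data the descriptor points to by itself; we don't need to care about this part
Receiving means the card writes the received packet into memory
- the card DMAs received packets into the buffers of rx_ring by itself; we don't need to care about this part either
- the driver reads the data from rx_ring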
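
The "increment E1000_TDT" step above has to wrap around at the end of the array, which is what makes the descriptor array a ring. A minimal sketch of that index arithmetic follows; RING_SIZE, ring_next and ring_advance are names made up for illustration, and 16 is the ring size mentioned in the lab hints.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 16 /* assumed; mirrors RX_RING_SIZE / TX_RING_SIZE */

/* Advance an index by one slot, wrapping from the last slot back to 0. */
static uint32_t ring_next(uint32_t idx)
{
    assert(idx < RING_SIZE);
    return (idx + 1) % RING_SIZE;
}

/* Walk `steps` slots forward from `start`, one slot at a time. */
static uint32_t ring_advance(uint32_t start, uint32_t steps)
{
    uint32_t idx = start;
    while (steps-- > 0)
        idx = ring_next(idx);
    return idx;
}
```

For example, ring_next(15) is 0, and advancing by exactly RING_SIZE slots returns to the starting index.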

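The "Lab hints" section then gives the concrete steps:

Transmit:
- Get the tx_ring index; this is the descriptor currently available
- Check for ring overflow, and free the mbuf from the previous, completed transmission in this slot (note: "ring overflow" here means that the driver fills the tx_ring descriptor array, which lives in kernel memory, faster than the card drains it, so the array becomes full and the next packet a user process wants to send cannot be placed in tx_ring. In practice this case needs no special handling; it is discussed in detail below)
- Fill in the current descriptor: length, cmd field, and so on
- Advance the tx_ring position, i.e. update the E1000_TDT register
Receive:
- Get the rx_ring index of the descriptor to check; note that you must add one and take the modulus first
- Check whether that descriptor is ready (i.e. whether a packet has really been received), which a status bit in the descriptor tells you
- Update the mbuf's fields, such as its length, then pass the updated mbuf to the network stack
- Allocate a new mbuf to replace the one just handed to the network stack
- Advance the rx_ring position, i.e. update the E1000_RDT register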

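You may wonder how a "descriptor" differs from a "packet buffer (mbuf)". As follows:

Descriptor: this is the hardware's view. See the first paragraph of section 3.2.6, Receive Descriptor Queue Structure, for the original text. For now it is enough to know that descriptors are what the hardware works with: whether receiving or sending a packet, the hardware gets its data through a descriptor. The driver's job is therefore to supply descriptors in the format the hardware requires
Packet buffer: this is the driver's view. See the first paragraph of section 2.8, Buffer and Descriptor Structure, for the original text. For now it is enough to know that software allocates the buffers for transmitting and receiving packets and builds descriptors that contain pointers to those buffers plus status. In one sentence: a user process that submits data only knows memory addresses (it can only access memory), and the driver is responsible for handing that data to the hardware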
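
The relationship between the two can be sketched in a few lines. This is an illustration under assumptions, not the xv6 definitions: the sim_ names are hypothetical, the 2048-byte buffer size is assumed, and the descriptor layout is modeled on the 16-byte receive descriptor as I understand it from the manual.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_BUF_SIZE 2048 /* assumed buffer size */

/* Driver's view: a packet buffer owned and allocated by software. */
struct sim_mbuf {
    char *head;              /* start of the packet data */
    uint32_t len;            /* number of valid bytes */
    char buf[SIM_BUF_SIZE];  /* backing storage */
};

/* Hardware's view: a fixed-format record that points at a buffer. */
struct sim_rx_desc {
    uint64_t addr;     /* address of the buffer the card may write to */
    uint16_t length;   /* filled in by the card: bytes received */
    uint16_t csum;
    uint8_t status;    /* filled in by the card: e.g. "descriptor done" */
    uint8_t errors;
    uint16_t special;
};

/* Driver step: make descriptor d point at buffer m and mark it empty. */
static void sim_attach(struct sim_rx_desc *d, struct sim_mbuf *m)
{
    m->head = m->buf;
    m->len = 0;
    d->addr = (uint64_t)(uintptr_t)m->head;
    d->length = 0;
    d->status = 0;
}

/* Recover the data pointer that a descriptor refers to. */
static char *sim_desc_data(const struct sim_rx_desc *d)
{
    return (char *)(uintptr_t)d->addr;
}
```

The descriptor holds no packet bytes itself, only the address and bookkeeping, which is why the driver can swap in a fresh buffer by rewriting addr.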

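To make this easier to understand, I drew the two models as I understand them; if you see it differently, feel free to discuss:

The figure above is the transmit model, i.e. the logic of e1000_transmit(): the driver keeps putting data at the tail pointer of tx_ring, and the hardware keeps taking it away at the head pointer. Remember the ring overflow mentioned above? See the figure below:

I said above that it will not happen in this lab. The reason lies in the second transmit hint: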

If E1000_TXD_STAT_DD is not set in the descriptor indexed by E1000_TDT, the E1000 hasn't finished the corresponding previous transmission request, so return an error.

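This hint uses the tail pointer to check E1000_TXD_STAT_DD, the status bit that tells whether the card has already taken the packet buffer belonging to a descriptor, even though the hardware itself works at the head pointer. What does that tell us? That the head and tail pointers coincide. Consider this scenario: after initialization both pointers are at index 0, the card is slow and the driver is fast. Because the array is a circular queue, the tail pointer keeps advancing: the index goes 0, 1, 2, 3 ... 15 and finally wraps to 0 again. Head and tail now coincide, which means the transmit ring tx_ring is full. The hint requires us to return an error in this case, which is how ring overflow is avoided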
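
The scenario can be replayed with a toy model. This is a sketch of my reading of the hint, not of the real hardware: sim_txq and its functions are hypothetical names, and a plain per-slot flag stands in for the E1000_TXD_STAT_DD bit.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_RING 16

struct sim_txq {
    uint8_t done[SIM_RING]; /* 1 = DD set: the slot is free for the driver */
    uint32_t head;          /* next slot the simulated card will send */
    uint32_t tail;          /* next slot the driver will fill (TDT) */
};

static void sim_txq_init(struct sim_txq *q)
{
    for (int i = 0; i < SIM_RING; i++)
        q->done[i] = 1;
    q->head = 0;
    q->tail = 0;
}

/* Driver side: 0 on success, -1 if the slot at the tail is still busy. */
static int sim_enqueue(struct sim_txq *q)
{
    if (!q->done[q->tail])
        return -1;
    q->done[q->tail] = 0;
    q->tail = (q->tail + 1) % SIM_RING;
    return 0;
}

/* Card side: send the packet at the head; returns 1 if one was sent. */
static int sim_card_send(struct sim_txq *q)
{
    if (q->done[q->head])
        return 0;           /* nothing pending at the head */
    q->done[q->head] = 1;
    q->head = (q->head + 1) % SIM_RING;
    return 1;
}
```

With a card that never sends, 16 calls to sim_enqueue() succeed, the tail wraps back onto the head at index 0, and the 17th call returns -1 until sim_card_send() frees a slot.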

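Correspondingly, the figure above is the receive model, i.e. the logic of e1000_recv(): the hardware keeps storing received packets into the packet buffers that the descriptors point to, and the driver takes them out afterwards. One thing worth mentioning: if you care about performance, it is best to read everything in the ring within a single interrupt. How should that logic be handled? Think about it first. Of course, if performance does not matter, do whatever gets it working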
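
One possible answer is a loop that keeps going until it meets a descriptor without the "done" bit. The toy model below contrasts that with handling a single packet per call. It is a sketch with hypothetical names (sim_rxq, sim_recv_one, sim_recv_all), and it assumes that a burst of packets may be announced by fewer interrupts than there are packets.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_RING 16

struct sim_rxq {
    uint8_t dd[SIM_RING]; /* 1 = slot holds a packet not yet processed */
    uint32_t rdt;         /* last slot the driver processed */
};

static void sim_rxq_init(struct sim_rxq *q)
{
    for (int i = 0; i < SIM_RING; i++)
        q->dd[i] = 0;
    q->rdt = SIM_RING - 1;
}

/* Simulated card: fill `count` consecutive slots starting at `first`. */
static void sim_burst(struct sim_rxq *q, uint32_t first, uint32_t count)
{
    for (uint32_t i = 0; i < count; i++)
        q->dd[(first + i) % SIM_RING] = 1;
}

/* Handle at most one packet; returns 1 if a packet was handled. */
static int sim_recv_one(struct sim_rxq *q)
{
    uint32_t idx = (q->rdt + 1) % SIM_RING;
    if (!q->dd[idx])
        return 0;
    q->dd[idx] = 0;
    q->rdt = idx;
    return 1;
}

/* Drain every waiting packet; returns how many were handled. */
static int sim_recv_all(struct sim_rxq *q)
{
    int n = 0;
    while (sim_recv_one(q))
        n++;
    return n;
}
```

After a burst of three packets, a single sim_recv_one() call leaves two packets sitting in the ring, while sim_recv_all() empties it in one go.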

There is a pitfall here. The last receive hint says:

At some point the total number of packets that have ever arrived will exceed the ring size (16); make sure your code can handle that.

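At some point the total number of packets that have arrived will exceed the ring size (16), and the code has to cope with that

At first I noticed that struct mbuf has a next field, and that the receive descriptor has a status bit E1000_RXD_STAT_EOP marking the last fragment of a packet (if I understand it correctly), so I naturally wanted to use them. As a result I got stuck on the receiver: many packets were sent and received repeatedly (make server kept printing a message from xv6, and packets.pcap kept showing the same output). It turned out these two fields have nothing to do with it. Only after finishing did I realize how simple it really was; I had overcomplicated it

Perhaps the hint only means that one interrupt should receive as many packets as possible. So far I still don't fully understand this hint, even though I passed the lab. The rest is for you to explore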

Solution

// path: kernel/net.h
void net_rx(struct mbuf *); // declare net_rx() here; otherwise the receiver cannot see the signature of the function that enters the network stack

// path: kernel/e1000.c
// Transmit: the card reads the packet to send from memory.
// The parameter m is the mbuf that holds the packet to send.
int
e1000_transmit(struct mbuf *m)
{
    //
    // Your code here.
    //
    // the mbuf contains an ethernet frame; program it into
    // the TX descriptor ring so that the e1000 sends it. Stash
    // a pointer so that it can be freed after sending.
    //

    acquire(&e1000_lock);

    // Get the TX ring index: the descriptor at which the E1000
    // expects the next packet to send.
    uint32 idx = regs[E1000_TDT];

    // Ring full? If DD is clear, the card has not finished the
    // previous request in this slot, so report failure.
    if ((tx_ring[idx].status & E1000_TXD_STAT_DD) == 0) {
        release(&e1000_lock);
        return -1;
    }

    // Free the mbuf of the previous, completed transmission (if any).
    if (tx_mbufs[idx])
        mbuffree(tx_mbufs[idx]);

    // Fill in the descriptor.
    tx_ring[idx].addr = (uint64)m->head;
    tx_ring[idx].length = m->len;
    tx_ring[idx].cmd = E1000_TXD_CMD_RS
                            | E1000_TXD_CMD_EOP;
    tx_mbufs[idx] = m; // stash away the pointer to the mbuf

    // Advance the ring position.
    regs[E1000_TDT] = (idx + 1) % TX_RING_SIZE;
    release(&e1000_lock);
    return 0;
}
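// Receive: the card writes received packets into memory.
static void
e1000_recv(void)
{
    //
    // Your code here.
    //
    // Check for packets that have arrived from the e1000
    // Create and deliver an mbuf for each packet (using net_rx()).
    //

    while (1) {
        acquire(&e1000_lock);

        // Get the RX index. RDT is the tail pointer, so the next
        // descriptor to check is one past it (add one, modulo ring size).
        // As the E1000 keeps writing packets, its head pointer catches up
        // with the tail; once they meet, the card has nowhere left to
        // store packets. If we kept reading the same empty descriptor
        // without moving on, every pass would see it empty and the E1000
        // could never receive again.
        uint32 idx = (regs[E1000_RDT] + 1) % RX_RING_SIZE;

        // Is a packet available, i.e. has the E1000 really written one?
        if ((rx_ring[idx].status & E1000_RXD_STAT_DD) == 0) {
            release(&e1000_lock);
            return;
        }

        // Fill in the mbuf that holds the received packet.
        struct mbuf *m = rx_mbufs[idx];
        m->len = rx_ring[idx].length;

        // Give the descriptor a fresh mbuf while still holding the lock.
        rx_mbufs[idx] = mbufalloc(0);
        if (!rx_mbufs[idx])
            panic("e1000_recv: mbufalloc");
        rx_ring[idx].addr = (uint64)rx_mbufs[idx]->head;
        rx_ring[idx].status = 0;

        // Update the tail pointer to the last descriptor processed.
        regs[E1000_RDT] = idx;
        release(&e1000_lock);

        // Deliver outside the lock: net_rx() may end up calling
        // e1000_transmit() (e.g. for an ARP reply), which takes e1000_lock.
        net_rx(m);
    }
}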
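
On the locking hint: the point of e1000_lock is that reading the ring index, touching the descriptor and advancing the index form one critical section. The user-space sketch below illustrates only that idea. It is not the xv6 spinlock: sim_lock is a hypothetical test-and-set lock built on C11 atomic_flag, and sim_claim_slot() stands in for the "read TDT, use slot, advance TDT" sequence.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SIM_RING 16

/* Hypothetical minimal spinlock; xv6 has its own struct spinlock. */
struct sim_lock {
    atomic_flag held;
};

static void sim_acquire(struct sim_lock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ; /* spin until the current holder releases the lock */
}

/* Returns 1 if the lock was taken, 0 if somebody already holds it. */
static int sim_try_acquire(struct sim_lock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

static void sim_release(struct sim_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

struct sim_txq {
    struct sim_lock lock;
    uint32_t tail;  /* shared ring index, stand-in for E1000_TDT */
};

static void sim_txq_init(struct sim_txq *q)
{
    atomic_flag_clear(&q->lock.held);
    q->tail = 0;
}

/* Read and advance the tail as one critical section. */
static uint32_t sim_claim_slot(struct sim_txq *q)
{
    sim_acquire(&q->lock);
    uint32_t slot = q->tail;
    q->tail = (q->tail + 1) % SIM_RING;
    sim_release(&q->lock);
    return slot;
}
```

Without the lock, two callers could read the same tail value and fill the same descriptor; with it, each caller gets a distinct slot.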