⌈xv6-fall2021⌋ lab 10：mmap

Word count: 5.9kReading time: 25 min

 2022/04/21 

ABOUT

实验地址：Lab: mmap

Mmap (HARD)

实验要求（译）（节选）

mmap can be called in many ways, but this lab requires only a subset of its features relevant to memory-mapping a file. You can assume that addr will always be zero, meaning that the kernel should decide the virtual address at which to map the file. mmap returns that address, or 0xffffffffffffffff if it fails. length is the number of bytes to map; it might not be the same as the file’s length. prot indicates whether the memory should be mapped readable, writeable, and/or executable; you can assume that prot is PROT_READ or PROT_WRITE or both. flags will be either MAP_SHARED, meaning that modifications to the mapped memory should be written back to the file, or MAP_PRIVATE, meaning that they should not. You don’t have to implement any other bits in flags. fd is the open file descriptor of the file to map. You can assume offset is zero (it’s the starting point in the file at which to map).

“mmap 可以用在许多方面，但本次 lab 只涉及关于内存映射文件的部分。你可以假设 addr 总是 0，即由内核决定将文件映射去哪里。mmap 调用成功时返回这个映射地址，或者调用失败返回 0xffffffffffffffff。length 要映射的字节数，可能和文件本身的长度不同。 prot 指出被映射的内存是否可读、可写、可执行，可以假定要么可读，要么可写，要么可读可写。flags 要么是 MAP_SHARED，意味着被映射内存的修改要写回文件；要么是 MAP_PRIVATE，意味着不用写回，这个参数不用考虑其他标志位的实现。fd 是要映射的文件打开时的文件描述符。你可以假定 offset 是 0，偏移指出映射文件从哪里开始访问”

munmap(addr, length) should remove mmap mappings in the indicated address range. If the process has modified the memory and has it mapped MAP_SHARED, the modifications should first be written to the file. An munmap call might cover only a portion of an mmap-ed region, but you can assume that it will either unmap at the start, or at the end, or the whole region (but not punch a hole in the middle of a region).

“munmap(addr, length) 要移除指定地址范围处 mmap() 设置的映射。如果进程已经修改了内存，并且标志位为 MAP_SHARED，那么这次修改应该首先写回文件。一个 munmap() 调用可能只作用于已映射的那部分区域，所以你可以假定取消映射要么开始于该区域开头处，要么开始于该区域结尾处，要么是整个趋于（而不会是区域中间某个部分）”

实验提示（译）（节选）

Fill in the page table lazily, in response to page faults. That is, mmap should not allocate physical memory or read the file. Instead, do that in page fault handling code in (or called by) usertrap, as in the lazy page allocation lab. The reason to be lazy is to ensure that mmap of a large file is fast, and that mmap of a file larger than physical memory is possible.

“页表分配推迟到相应 page-fault 时才进行，即 mmap() 不应该分配物理内存或者读文件。相反，在 usertrap() 的 page-fault 中断例程时才作出实际分配，就像 lazy page allocation lab 那样。因为这样做可以确保 mmap() 能迅速分配大文件，也可以分配一个比物理内存更大的文件”

Keep track of what mmap has mapped for each process. Define a structure corresponding to the VMA (virtual memory area) described in Lecture 15, recording the address, length, permissions, file, etc. for a virtual memory range created by mmap. Since the xv6 kernel doesn’t have a memory allocator in the kernel, it’s OK to declare a fixed-size array of VMAs and allocate from that array as needed. A size of 16 should be sufficient.

“保持追踪 mmap() 为每个进程映射了什么。定义一个 LEC-15 指出的 VMA 对应的结构体，为 mmap() 创建的 VMA 结构记录地址、长度、权限、待映射的文件等等。因为 xv6 内核没有内存分配器，所以声明一个固定大小的 VMA 数组，并在需要时才使用是没问题的。大小固定为 16 就足够了”

Implement mmap: find an unused region in the process’s address space in which to map the file, and add a VMA to the process’s table of mapped regions. The VMA should contain a pointer to a struct file for the file being mapped; mmap should increase the file’s reference count so that the structure doesn’t disappear when the file is closed (hint: see filedup). Run mmaptest: the first mmap should succeed, but the first access to the mmap-ed memory will cause a page fault and kill mmaptest.

“实现 mmap()：为待映射的文件在进程地址空间找一段不使用的区域，然后将此区域添加到进程的映射区域内。VMA 结构应该包含一个指向 struct file 的指针，指向待映射的文件，并为映射文件增加一次引用，以便文件关闭时该结构不会消失（参考 filedup()）。运行 mmaptest：第一个 mmap 测试点会成功，但第一次访问已映射区域会抛出 page-fault，然后内核会杀死 mmaptest”

Add code to cause a page-fault in a mmap-ed region to allocate a page of physical memory, read 4096 bytes of the relevant file into that page, and map it into the user address space. Read the file with readi, which takes an offset argument at which to read in the file (but you will have to lock/unlock the inode passed to readi). Don’t forget to set the permissions correctly on the page. Run mmaptest; it should get to the first munmap.

“对于已映射区域的 page-fault，遇到才分配内存，将相关文件的 4096 字节读入页（即分配一页），并将这一页映射到用户地址空间。此时调用 readi() 读取文件，readi() 需要待读文件上的偏移作为参数（但上不上锁传入 readi() 的 indoe 都可以）。同时也不要忘记分配这一页权限。此时运行 mmaptest 应该可以到达第一个 munmap”

Implement munmap: find the VMA for the address range and unmap the specified pages (hint: use uvmunmap). If munmap removes all pages of a previous mmap, it should decrement the reference count of the corresponding struct file. If an unmapped page has been modified and the file is mapped MAP_SHARED, write the page back to the file. Look at filewrite for inspiration.

“实现 munmap()：找到给定地址范围的 VMA，然后使用 uvmunmap() 取消指定的多个页的映射。如果 munmap() 移除了前面调用的 mmap() 的所有页，那就一并减去相应文件 struct file 结构体的引用计数。已修改且标为 MAP_SHARED 的页要先写回文件（参考 filewrite()）”

Ideally your implementation would only write back MAP_SHARED pages that the program actually modified. The dirty bit (D) in the RISC-V PTE indicates whether a page has been written. However, mmaptest does not check that non-dirty pages are not written back; thus you can get away with writing pages back without looking at D bits.

“理论上只需写回实际修改过的 MAP_SHARED 页，这些页通过 PTE 里的脏位（D 位）指出该页是否已写入。不过 mmaptest 不检查非脏页是否没有被写回，因此你无需检查 D 位就可以写回”

Modify exit to unmap the process’s mapped regions as if munmap had been called. Run mmaptest; mmap_test should pass, but probably not fork_test.

“exit() 要取消进程的已映射区域的所有页的映射，像 munmap() 那样。此时运行 mmaptest，mmap_test 测试点可以通过，但可能 fork_test 不能通过”

Modify fork to ensure that the child has the same mapped regions as the parent. Don’t forget to increment the reference count for a VMA’s struct file. In the page fault handler of the child, it is OK to allocate a new physical page instead of sharing a page with the parent. The latter would be cooler, but it would require more implementation work. Run mmaptest; it should pass both mmap_test and fork_test.

“修改 fork() 使子进程拥有与父进程一样的映射区域。但不要忘了增加 VMA 的 struct file 结构体的引用计数。子进程的 page-fault 时期，可以分配新物理页而不是共享父进程的页。共享页可能很高效率，但也需要更复杂的实现。此时运行 mmaptest 才能通过 fork_test 测试点”

实现思路

首先要知道 mmap() 是什么，但是 mmap 只是一个功能，需要作用在一个称为 内存映射文件（memory-mapped file） 的数据结构上，其核心思想是应用将文件映射到自己的地址空间，此时使用 mmap 这种技术就可以达到 访问用户进程自己的地址空间相当于访问文件

但是，其实上（就我个人理解来说）mmap 技术读写小文件和常规 read() / write() 逻辑是一样的，见下图：

如图所示是 常规 read() / write() 读写文件，程序读写文件需要经过两次拷贝，第一次是从磁盘到 block cache；第二次是从 block cache 到用户空间

上图所示是我理解的 mmap 技术读写文件，可以发现逻辑上也是经过了两次拷贝，同常规技术一样，都是先拷贝到内核的 block cache，然后再拷贝到用户进程可访问的缓冲区

我个人认为，这里的开销是差不多的，起码 xv6 源码看起来是这样（如 xv6 执行 read() 系统调用，先陷入内核态，然后由对应的 sys_read() 调用 fileread()，再调用 readi() 完成读文件操作。这里最后的 readi() 最终完成了将磁盘文件读入内存 inode，然后再从内存 inode 将数据拷贝至用户空间），所以 绝不是网上那种一致的说法，说使用了 mmap 技术可以省去最后拷贝回用户进程这步的开销（当然，POSIX 接口比如 Linux 的 mmap() 是怎样实现的，我不了解，可能就因为这样疏忽了更加关键的细节也说不定，所以这里涉及的开销问题的讨论仅供参考）

其实无论用户态还是内核态，本质都是内存的某一片区域，拷贝开销都是一样的 ———— 如果我在某一个内核函数里定义了一个缓冲区，那么从 block cache 到这个缓冲区（内核缓冲区到内核缓冲区）；以及从 block cache 到上图所示的用户进程 buffer（内核缓冲区到用户缓冲区），这两者开销是一样的。都是将数据从内存某处拷贝到某处，开销还能不一样吗？但区别还是有的，只是访问权限不同而已

现在我们再来想想使用 mmap 技术有什么不同。可能你会觉得是不是常规的 read() / write() 是系统调用，所以多了一步保护现场的开销？但你别忘了我们这个 lab 实现的 mmap 也是要封装成系统调用的形式

那还有什么地方不同？或者更优的地方在哪？我个人理解是，mmap 技术最大的不同是按需分配，可以回顾下实验开头的这段话：

The mmap and munmap system calls allow UNIX programs to exert detailed control over their address spaces. They can be used to share memory among processes, to map files into process address spaces, and as part of user-level page fault schemes such as the garbage-collection algorithms discussed in lecture.

这段话指出，mmap 这种技术可以用在：

进程间共享内存：常规的 read() / write() 如果是多个进程访问同一个文件，那么内存会同时存在两份副本，这就造成更大的内存开销
将文件映射至进程地址空间：这个和常规读写是一样的，常规读写虽然谈不上映射，但逻辑上访问它自己的缓冲区也是相当于在访问文件
用于 page-fault：这是我觉得和常规读写最大的区别。读写小文件无论哪种技术，由于核心逻辑都要经过两次拷贝，所以开销都是差不多的。但是读写大文件，你内存只有 16GB，我想访问 80G 的大文件，常规读写根本做不到；但是引入 page-fault（回顾前面的 lab cow）这就成为可能

现在需要考虑的是 VMA 这个数据结构，我先把上面 mmap 技术这幅图拿下来

图上紫箭头 ① ② 两步访问的这个 vma 区域就是和 mmap 技术相关的一个数据结构，你可以把这个数据结构定义在内核数据区（如在进程结构体 PCB 上新增一个字段），那么这个数据结构至少需要记录一些虚拟地址，用户进程通过这些虚拟地址访问实际的、存储着文件数据的 mmap-ed 区域（后文均统称为 “已映射区域”）

上图 ③ ⑥ 两个箭头的读写可能会产生一些误解。它们访问的是缺页时分配的页（实际的物理地址），但实际上，用户进程不可能直接访问物理地址，这里其实是先访问 vma 结构记录的虚拟地址，通过页表机制的翻译最终访问到物理内存上，但我此处画图直接画到物理内存上了

至于 vma 数据结构应该包含什么内容，你可以参考 POSIX 接口。但我这里只定义实验要求和实验提示给出的内容：

该 vma 区域对应的虚拟地址起址
映射文件的长度
权限（读写）
是否写回
文件描述符
文件偏移
待映射的文件的指针

现在考虑另一个问题，那就是 vma 结构所需的虚拟地址应该如何分配？你总不可能从 0 开始分配吧，又或是随便用一个虚拟地址。这里使用的虚拟地址必须是进程没有被使用的，但是实际上，xv6 内核可没有管理虚拟地址分配的能力。kalloc() 分配的是物理地址，管理着用户进程的物理页的分配

所以这就需要我们对 xv6 的内存分布非常熟悉：

如图所示是进程的地址空间，对于一个进程来说，我们只能使用堆区。虽然我不清楚 xv6 有没有实现堆内存分配、堆分配是怎样的。但我直觉上认为如果 xv6 有这个实现，也是从堆底开始，所以这里我管理的 vma 结构使用的虚拟内存是从堆顶开始的，如图所示：

那么，考虑 vma 分配逻辑，我们可以定义一个初始的指针，指向的地址刚好是 MAXVA 减去两个页面的大小，那么我们每次分配一个 vma 区域，思想和入栈一样，向低地址增长，像上图右半部分一样，vma 指针需要减去用户请求的映射长度。可以发现，vma 分配是非常简单的，只要指针无脑减减减就行了

但是释放逻辑稍微有点复杂，考虑过程略去，我这里直接报上结论：根据取消映射全部的文件还是部分，而复杂情况不同。这里面判是否取消部分映射，取决于：

vma 起址与 unmap 起址
vma 长度与 unmap 长度

上面两点，两两搭配共有四种情况，分别是：

第一种情况：这是释放整个文件，此时 vma 区域直接释放掉
第二种情况：这是释放部分文件，此时 vma 区域不单单要修改起始虚拟地址，还要修改映射长度
第三种情况：不存在，因为不可能文件起始地址与释放地址不同，但释放长度是整个文件的长度，这会释放后一个 vma 的部分区域
第四种情况：这也是释放部分文件，但此时 vma 区域只需修改映射长度

现在还需要考虑最后一个问题，既然 vma 的分配释放类似栈的思想（但实际上区别很大，此处 vma 可以随机访问任一个虚拟地址，但栈不能访问栈底、栈中间。这里说的只是 vma 指针的调整类似于栈），那么还需要考虑怎么让 vma 指针往上走回去的问题

上图 mmap 映射了两个文件，随后释放文件 2，此时 vma 指针可以往上调整吗？答案是不可以，因为文件 1 还未释放，此时往上调那么下次分配映射会覆盖了文件 1 的地址

现在情况来到文件 1 的释放时，此时可以往上调整 vma 指针吗？答案是可以，因为往上的这两个区域都已经没有映射了

最后的最后，剩下 fork() 和 exit() 的改进。exit 没其他注意事项，逻辑上一个进程退出时，释放所有的映射文件就行，其实就是 munmap() 的逻辑；至于 fork，由于子进程是另外的进程，它也有自己独立的地址空间，所以直接将父进程的 vma 数组全部拷贝过去就行，不需要担心子进程拥有和父进程一样的页表映射，这里只拷贝 vma 数组并不会拷贝父进程的页表，所以实际上子进程是没有任何页表映射的，如果子进程访问到自己的 vma 数组，会抛出 page-fault，然后内核分配另外的物理页（虽然也是映射到同一个文件）。这个 lab 不要求子进程共享父进程的物理页，所以实际上简单很多。注意子进程 fork 后记得为映射文件自增引用计数，否则 fork_test 在 fileclose() 时肯定会出问题

Solution

# path: ./Makefile
UPROGS=\
    ...
    $U/_mmaptest\

1
2
3

# path: user/usys.pl
entry("mmap");
entry("munmap");

// path: user/user.h
void *mmap(void *, int, int, int, int, int);
int munmap(void *, uint);

// path: kernel/syscall.h
#define SYS_mmap   22
#define SYS_munmap 23

// path: kernel/syscall.c
extern uint64 sys_mmap(void);
extern uint64 sys_munmap(void);

static uint64 (*syscalls[])(void) = {
...
[SYS_mmap]    sys_mmap,
[SYS_munmap]  sys_munmap,
};

// path: kernel/riscv.h
#define PTE_D (1L << 7)

// path: kernel/defs.h
// vm.c
pte_t *walk(pagetable_t, uint64, int);
// sysfile.c
uint64 pgfault(uint64);
uint64 subunmap(uint64, int);

// path: kernel/proc.h
#define MAXVMA 16

struct vma_t {
    int valid;// 1 表示有效；0 表示无效即可使用
    uint64 va;
    uint len;
    int  prot;
    int  flags;
    int  fd;
    long off;
    struct file *f;
};

struct proc {
    ...
    struct vma_t vma[MAXVMA];
    uint64 curend;
};

// path: kernel/sysfile.c
// mmap(虚拟地址, 映射长度, 权限, 写回, fd, 偏移)
uint64
sys_mmap(void) {
    uint64 va;
    int len, prot, flags, fd, off;

    if(argaddr(0, &va) < 0)    return -1;
    if (argint(1, &len) < 0 || argint(2, &prot) < 0
        || argint(3, &flags) < 0 || argint(4, &fd) < 0)
        return -1;
    if (argint(5, &off) < 0)    return -1;

    // 检查参数
    struct proc *p = myproc();
    if (len <= 0)    return -1;
    if (p->ofile[fd] == 0)    return -1;// fd 代表的打开文件不存在
    else    filedup(p->ofile[fd]);// 否则增加一次引用
    if (p->ofile[fd]->writable == 0// 不可写文件映射可写属性
        && (prot & PROT_WRITE) == PROT_WRITE
        && (flags & MAP_SHARED) == MAP_SHARED)    return -1;

    // 分配 vma 结构体
    int i;
    for (i = 0; i < MAXVMA; ++i) {
        if (p->vma[i].valid == 0) {
            p->vma[i].valid = 1;
            break;
        }
    }
    if (i == MAXVMA)    return -1;// vma 数组没有空闲位置

    // 填充 vma
    p->vma[i].va = p->curend - len;// 分配一个虚拟地址
    p->curend = p->vma[i].va;// 同时修正下一个虚拟地址
    p->vma[i].len = len;
    uint pteflags = 0;// 以下几句赋值 PTE 权限
    if ((prot & PROT_READ) == PROT_READ)    pteflags |= PTE_R;
    if ((prot & PROT_WRITE) == PROT_WRITE)    pteflags |= PTE_W;
    p->vma[i].prot = pteflags;
    p->vma[i].flags = flags;
    p->vma[i].fd = fd;
    p->vma[i].off = off;
    p->vma[i].f = p->ofile[fd];

    return (uint64)p->vma[i].va;
}

uint64
sys_munmap(void) {
    uint64 va;
    int len;

    if(argaddr(0, &va) < 0)    return -1;
    if (argint(1, &len) < 0)    return -1;

    // 因为 exit() 也要 unmap，所以主体逻辑封装为另一个函数
    if (subunmap(va, len) == -1)    return -1;

    return 0; 
}

// 因为头文件问题
// usertrap() 处理逻辑要写在包含 file.h 的文件里
uint64
pgfault(uint64 va) {
    struct proc *p = myproc();
    struct vma_t *v = 0;
    // 寻找出对应的 vma 元素
    for (int i = 0; i < MAXVMA; ++i) {
        if (p->vma[i].valid == 1 && p->vma[i].va <= va
            && va <= p->vma[i].va + p->vma[i].len) {
            v = &p->vma[i];
            break;
        }
    }
    if (v == 0)    return -1;

    // 分配一页
    // 安装映射
    // 将文件 4096B 写入（使用 readi()）
    uint64 pa = (uint64)kalloc();
    if (pa == 0)    return -1;
    memset((char *)pa, 0, PGSIZE);

    // 由于等会 readi() 是从 inode 读数据，输出至用户地址 va
    // 所以要先建立用户地址 va 到物理地址 pa 的映射
    if (mappages(p->pagetable, va, PGSIZE, pa, v->prot|PTE_U) == -1) {
        kfree((void *)pa);
        return -1;
    }
    // inode 数据写入 va
    ilock(v->f->ip);
    int ret = readi(v->f->ip, 1, va, v->off + va - v->va, PGSIZE);
    if (ret == -1) {
        kfree((void *)pa);
        iunlock(v->f->ip);
        return -1;
    }
    iunlock(v->f->ip);

    return 0;
}

uint64
subunmap(uint64 va, int len) {
    if (len == 0)    return 0;
    struct proc *p = myproc();
    struct vma_t *v = 0;
    // 寻找出对应的 vma 元素
    for (int i = 0; i < MAXVMA; ++i) {
        if (p->vma[i].valid == 1 && p->vma[i].va <= va
            && va <= p->vma[i].va + p->vma[i].len) {
            v = &p->vma[i];
            break;
        }
    }
    if (v == 0)    return -1;

    // 检查 va 对应的映射
    va = PGROUNDDOWN(va);
    pte_t *pte = walk(p->pagetable, va, 0);// 第一个取消页的 pte
    if (pte == 0)    return -1;
    uint64 pteflags = PTE_FLAGS(*pte);// 第一个取消页的 pte flags
    uint64 pa;
    if ((pa = walkaddr(p->pagetable, va)) != 0) {
        // 已安装映射，物理地址必存在，需要检查写回，再取消映射
        // 未安装映射，什么都不用做
        if (v->flags & MAP_SHARED) {
            if ((pteflags & PTE_D) == PTE_D) {
                int ret = filewrite(v->f, va, len);
                if (ret == -1)    return -1;
            }
        }

        // 取消映射
        uvmunmap(p->pagetable, va, len / PGSIZE, 1);
    }

    // 依照文件是部分 unmap 还是全部 unmap，情况不同
    // 判断是否是部分 unmap，可对比
    //     1.  vma 起址是否等同本次 unmap 起址
    //     2.  vma 长度是否等同本次 unmap 长度
    // 上述情况两两组合共四种情况
    if (v->va == va && v->len == len) {
        // vma 起址 == un 起址，且 vma 长度 == un 长度
        // 即完全 unmap 文件
        v->len -= len;
    } else if (v->va == va && v->len != len) {
        // vma 起址 == un 起址，且 vma 长度 != un 长度
        // 部分 unmap 文件，un 文件开头部分
        v->va += len;
        v->len -= len;
    } else if (v->va != va && v->len == len) {
        // vma 起址 != un 起址，且 vma 长度 == un 长度
        // 这种情况不存在
    } else if (v->va != va && v->len != len) {
        // vma 起址 != un 起址，且 vma 长度 != un 长度
        // 部分 unmap 文件，un 文件后面部分
        v->len -= len;
    }

    if (v->len == 0) {
        // 如果 unmap 整个文件
        v->va = 0;
        v->valid = 0;
        fileclose(v->f);

        // 调整 curend
        if (va == p->curend) {
        // 当最后一个文件 unmap
        p->curend += len;
        for (uint64 unva = PGROUNDDOWN(p->curend);
                unva < MAXVA - 2 * PGSIZE; unva += PGSIZE) {
            // 往上遍历每一个虚拟地址
            int i;
            for (i = 0; i < MAXVMA; ++i) {
            // 判断是否在 vma 数组中存在
            // 不存在的话，curend 上调
            if (p->vma[i].va == unva && p->vma[i].valid == 1)
                break;
            }
            if (i == MAXVMA)    p->curend += PGSIZE;// 不存在
            else    break;// 只要上面的 vma 区域有一个没取消，就停止上调
        }// end for()
        }// end if()
    }

    return 0;
}

// path: kernel/proc.c
static struct proc*
allocproc(void) {
    ...
    p->curend = MAXVA - 2 * PGSIZE;// 堆最高端
    ...
}

int
fork(void) {
    ...
    // 将父进程的 vma 数组全部拷贝过去
    memmove(np->vma, p->vma, sizeof(struct vma_t));
    np->curend = p->curend;
    // 注意文件计数加 1
    for (int i = 0; i < MAXVMA; ++i) {
        if (np->vma[i].valid == 1)
            filedup(np->vma[i].f);
    }

    return pid;
}

void
exit(int status) {
    ...
    if(p == initproc)
        panic("init exiting");

    for (int i = 0; i < MAXVMA; ++i) {
        if (p->vma[i].valid == 1)
            subunmap(p->vma[i].va, p->vma[i].len);
    }
    ...
}

// path: kernel/trap.c
void
usertrap(void) {
    ...
    } else if((which_dev = devintr()) != 0){
        // ok
    } else if (which_dev == 0 && (uint64)r_scause() == 13) {
        // page-fault 时 scause 记录错误码；stval 记录无法翻译的 va
        // riscv 三种 page-fault
        // 13: load page-fault, 加载指令不能翻译它的虚拟地址
        // 15: store page-fault, 存储指令不能翻译它的虚拟地址
        // 12: instruction page-fault, PC 上的地址不能翻译

        if (pgfault(PGROUNDDOWN(r_stval())) == -1)    p->killed = 1;
    } else {
        printf("usertrap(): unexpected scause %p pid=%d\n", r_scause(), p->pid);
        printf("            sepc=%p stval=%p\n", r_sepc(), r_stval());
        p->killed = 1;
    }

    ...
}