⌈xv6-fall2021⌋ lab 5：COW

Word count: 2.9kReading time: 13 min

 2022/04/04 

LAB

INTRO

COW fork，原文来自 xv6-handout-fall2021, section 4.6，大概意思是 fork() 非常浪费，不仅体现于有些场景子进程拷贝完内存映像后，一点也没使用便随着紧接着的 exec() 丢弃了；也体现在当父进程非常大时可能会耗尽内存

为了解决 fork() 这种拷贝的困境，大多数操作系统采用 COW fork 技术，在 fork() 时期并不真正分配内存给子进程，而是让父进程和子进程共享同一份内存映像。之后，无论是父进程还是子进程往共享内存写入并抛出 page-fault 异常时，内核才真正分配内存

随之而来的还有一个问题要考虑，那就是释放物理页，由于 COW fork 会使多个映射都指向同一个物理页，如果不采取手段会导致重复释放同一个地址的内存。通常的办法是为物理页引入 "引用计数器"（reference counter），多个映射每次释放时都只减少 counter，直至 counter 为零才真正释放物理页

Implement Copy-on-Write (HARD)

要求

Your task is to implement copy-on-write fork in the xv6 kernel. You are done if your modified kernel executes both the cowtest and usertests programs successfully.

"你的任务是在 xv6 内核上实现 Copy-on-Write fork。只有当你修改过的内核都通过 cowtest 和 usertests 两个测试时才算通过整个 lab"

可采取的策略

Modify uvmcopy() to map the parent's physical pages into the child, instead of allocating new pages. Clear PTE_W in the PTEs of both child and parent.

"将 uvmcopy() 的子进程物理页映射修改为父进程的映射，而不是分配新页。父进程和子进程所映射的所有 PTE 的 PTE_W 清位"

Modify usertrap() to recognize page faults. When a page-fault occurs on a COW page, allocate a new page with kalloc(), copy the old page to the new page, and install the new page in the PTE with PTE_W set.

"修改 usertrap() 让它识别 page-fault 异常。当 COW 页上出现 page-fault 时，用 kalloc() 分配一页，然后将旧页拷贝到新页，最后将新页 PTE_W 置位"

Ensure that each physical page is freed when the last PTE reference to it goes away -- but not before. A good way to do this is to keep, for each physical page, a "reference count" of the number of user page tables that refer to that page. Set a page's reference count to one when kalloc() allocates it. Increment a page's reference count when fork causes a child to share the page, and decrement a page's count each time any process drops the page from its page table. kfree() should only place a page back on the free list if its reference count is zero. It's OK to to keep these counts in a fixed-size array of integers. You'll have to work out a scheme for how to index the array and how to choose its size. For example, you could index the array with the page's physical address divided by 4096, and give the array a number of elements equal to highest physical address of any page placed on the free list by kinit() in kalloc.c.

"确保释放每个物理页只发生在失去最后一个 PTE 引用计数器时，而不是之前。一个可以采纳的策略是为每个物理页，维持一个指向这个页的映射数量的 '引用计数器'。当 kalloc() 分配一个新页时就将引用计数器设为 1。当 fork 使子进程共享一页时就为对应的引用计数器增加 1。当任意进程不使用页时就相应地将引用计数器减 1。kfree() 只有在引用计数器为 0 时才将释放的页链接回空闲页链表。可以将引用计数器用一个固定大小的整型数组来维持。你要考虑怎样索引这个数组，以及数组的大小怎么表示。比如，你可以用以 4096 划分的页的物理地址，作为数组的索引；然后将数组的大小设置为 kalloc.c 中，kinit() 分配的在最高地址范围内的，物理页的数量"

———— 首先很明确的，引用计数器的变化是这样的：

初始化：在 kernel/kalloc.c/kalloc() 刚分配的时候就设为 1
自增：在 kernel/proc.c/fork() 每次 fork() 时引用加 1
自减：在 kernel/kalloc.c/kfree() 中计数为 0 时才释放，其他情况减 1

然后是如何设置数组大小以及如果访问数组元素（索引）的问题，我是这么理解的：

数组大小：参考 kernel/kalloc.c/kinit() 发现 xv6 内核使用的 RAM 是有范围的，进而得出可用 RAM 是宏 KERNELBASE 和 PHYSTOP 之间
数组索引：最直接的方法是用当前的物理地址模除页大小（4096）

Modify copyout() to use the same scheme as page faults when it encounters a COW page.

"修改 copyout()，为（内核）抛出一个 COW 页时使用相同的策略处理 page-fault"

———— 这一点提示我们解决 COW 页的这个 "scheme" 会在两个地方使用，所以为 "scheme" 封装成一个函数会比较好

提示

It may be useful to have a way to record, for each PTE, whether it is a COW mapping. You can use the RSW (reserved for software) bits in the RISC-V PTE for this.

"为每个 PTE 设置一种方式来记录它是否 COW 映射，可以使用 risc-v PTE 的 RSW（reserved for software, 软件保留）位"

If a COW page fault occurs and there's no free memory, the process should be killed.

"如果 COW 页异常出现，但此时没有可用内存，那就杀死这个进程"

Summary

最后整理一下思路，按照 lab 提示推荐的策略，COW fork() 是分两步进行的：第一步是 "分配"，第二步是回收

分配其实很好处理，就是子进程共享父进程的映射

困难的是回收，由于每 fork() 一次就会多一个计数，也即是计数器是所有映射至同一页的，所有的进程所共用的，这就涉及同步的问题，需要考虑引入锁

现在，程序流大概是这样的：

共享映射：由于 fork() 是调用 uvmcopy() 的，所以修改 uvmcopy()，其实直接去掉 kalloc() 分配有关的语句就行。值得一提的是，只完成这一步仍会报 "remap" 的 panic，所以需要将 mappage() 对应的检查删除
缺页分配：这里会封装成一个函数，因为 uvmcopy() 和 copyout() 都需要处理 cow 页。我的思路是，对于引用等于 1 的 cow 页，比如子进程已经执行完成，现在只有父进程需要写入，那么对应的物理页就完全属于父进程了，所以直接修改属性为可写可读就行；而对于引用大于 1 的，就需要分配内存、拷贝整个页、安装页表映射、最后将引用计数减一（因为子进程已经有自己新分配的页了，已经不使用父进程的映射了）。值得一提的是，cow 的处理逻辑一定要有一些判断，比如分配大小，页表项属性之类，否则会卡 usertests 某些测试点
异常处理：riscv 硬件会在遭遇 page-fault 的时候抛出异常，并在 scause 寄存器记录错误码，在 stval 记录无法翻译的虚拟地址。利用 stval 记录的虚拟地址我们可以查看这一页是否 cow 页，如果是就丢给 cow 处理函数
释放：每次调用 kfree() 都先减少一页，然后才判断是否等于 0，如果是才真正释放；否则直接返回
引用计数器的逻辑：这应该是最难最头痛的一件事，因为每个物理页都对应一个计数器，而一个物理页可能挂靠了多个虚拟地址，这就使得不同的子进程会访问同一个计数器。所以核心思想是，读和写计数器一定一定要加锁（读也要加，因为有可能两个子进程同时访问同一个计数器，如果稍快访问那个执行写操作，那么稍慢访问那个读到的数据就不对了），否则，以我的经验来看，十有八九会在 usertests 报错 "FAILED -- lost some free pages xxxxx (out of xxxxx)"。把握这个思想加锁就行

Solution

// path: kernel/vm.c
int
mappages(pagetable_t pagetable, uint64 va, uint64 size, uint64 pa, int perm) {
    ...
    //if(*pte & PTE_V)
    //  panic("mappages: remap");
    ...
}
int
uvmcopy(pagetable_t old, pagetable_t new, uint64 sz) {
    pte_t *pte;
    uint64 pa, i;
    uint flags;
    //char *mem;

    for(i = 0; i < sz; i += PGSIZE){
        if((pte = walk(old, i, 0)) == 0)
        panic("uvmcopy: pte should exist");
        if((*pte & PTE_V) == 0)
        panic("uvmcopy: page not present");
        pa = PTE2PA(*pte);
        // lab cow
        *pte &= (~PTE_W);// pte_w 清位
        *pte |= PTE_RSW;// pte_rsw 置位
        flags = PTE_FLAGS(*pte);
        /*if((mem = kalloc()) == 0)
            goto err;
        memmove(mem, (char*)pa, PGSIZE);
        if(mappages(new, i, PGSIZE, (uint64)mem, flags) != 0){
            kfree(mem);
            goto err;
        }*/
        if (mappages(new, i, PGSIZE, (uint64)pa, flags) != 0)// 这个条件第四个参数记得要修改
            goto err;
        // lab cow
        refinc(pa);// 每 fork() 一次都需要自增引用计数器
    }
    return 0;

err:
    uvmunmap(new, 0, i / PGSIZE, 1);
    return -1;
}
// COW 页拷贝处理
// 返回 -1 出现错误，返回 0 表示非 COW，返回 1 才是拷贝 COW 成功
int
cowcopy(pagetable_t pg, uint64 va) {
    // 有一个测试会 fork() 很大内存
    if (va >= MAXVA)    return -1;

    // 以下两句一定一定要加判断，否则会卡 usertests, copyout
    // 将 va 转化为对应的 pte
    pte_t *pte = walk(pg, va, 0);
    if (pte == 0)    return -1;
    if ((*pte & PTE_V) == 0)    return -1;
    if ((*pte & PTE_U) == 0)    return -1;

    // 判断是否 COW 页
    if ((*pte & PTE_RSW) == 0)    return 0;

    // 将 va 转化为对应的 pa
    uint64 pa = walkaddr(pg, va);
    if (pa == 0)    return -1;

    if ((int)refget(pa) == 1) {
        // 如果引用只有一个，那就直接使用
        *pte |= PTE_W;
        *pte &= (~PTE_RSW);
    } else {
        char *mem;
        if((mem = kalloc()) == 0) {
            // 无内存杀死
            return -1;
        } else {
            memmove(mem, (char*)pa, PGSIZE);
            uint flags = PTE_FLAGS(*pte);
            flags |= PTE_W;// set pte_w
            flags &= (~PTE_RSW);// clear pte_rsw
            if(mappages(pg, va, PGSIZE, (uint64)mem, flags) != 0) {
                kfree(mem);
                return -1;
            }
        }
        kfree((uint64 *)pa);
    }
    return 1;
}
int
copyout(pagetable_t pagetable, uint64 dstva, char *src, uint64 len)
{
    uint64 n, va0, pa0;

    while(len > 0){
        va0 = PGROUNDDOWN(dstva);
        // lab cow
        // 如果是 COW 页需要先拷贝一页出来
        // 如果不是 COW 页返回 0，即跳过
        if (cowcopy(pagetable, va0) == -1)
            return -1;

        pa0 = walkaddr(pagetable, va0);
        if(pa0 == 0)
            return -1;
        n = PGSIZE - (dstva - va0);
        if(n > len)
            n = len;
        memmove((void *)(pa0 + (dstva - va0)), src, n);

        len -= n;
        src += n;
        dstva = va0 + PGSIZE;
    }
    return 0;
}

// path: kernel/trap.c
void
usertrap(void) {
    ...
    if(r_scause() == 8) {
        ...
    } else if ((uint64)r_scause() == 15) {// store page-fault
        // lab cow
        // stval 寄存器包含无法翻译的地址
        uint64 va = PGROUNDDOWN(r_stval());
        if (va >= p->sz || cowcopy(p->pagetable, va) <= 0)
            p->killed = 1;
    } else if((which_dev = devintr()) != 0){
        // ok
    } ...
}

// path: kernel/riscv.h
// lab cow
#define PTE_RSW (1L << 8) // RSW 使用两位共四种保留值
// 用于定义存放 reference counter 的数组的长度
#define NPAGE (((PHYSTOP) - (KERNBASE)) / (PGSIZE))
#define INDEX(p) (((p) - (KERNBASE)) / (PGSIZE))

// path: kernel/kalloc.c
// lab cow
// 定义存放 reference counter 的结构体
struct {
    struct spinlock lock;
    int arr[NPAGE];
} refcr;

inline void
refinc(uint64 pa) {
    acquire(&refcr.lock);
    ++refcr.arr[INDEX(pa)];
    release(&refcr.lock);
}
inline void
refdes(uint64 pa) {
    acquire(&refcr.lock);
    --refcr.arr[INDEX(pa)];
    release(&refcr.lock);
}
inline void
refset(uint64 pa, int n) {
    acquire(&refcr.lock);
    refcr.arr[INDEX(pa)] = n;
    release(&refcr.lock);
}
inline uint64
refget(uint64 pa) {
    uint64 ref;
    acquire(&refcr.lock);
    ref = refcr.arr[INDEX(pa)];
    release(&refcr.lock);
    return ref;
}
void
kinit() {
    initlock(&kmem.lock, "kmem");
    freerange(end, (void*)PHYSTOP);
    // lab cow
    memset(refcr.arr, 0, NPAGE);
    initlock(&refcr.lock, "refcr");
}
void
kfree(void *pa) {
    struct run *r;

    if(((uint64)pa % PGSIZE) != 0 || (char*)pa < end || (uint64)pa >= PHYSTOP)
        panic("kfree");

    // lab cow
    if ((int)refget((uint64)pa) > 1) {
        refdes((uint64)pa);
        return;
    }

    // Fill with junk to catch dangling refs.
    memset(pa, 1, PGSIZE);
    // lab cow
    refset((uint64)pa, 0);

    r = (struct run*)pa;

    acquire(&kmem.lock);
    r->next = kmem.freelist;
    kmem.freelist = r;
    release(&kmem.lock);
}
void *
kalloc(void) {
    struct run *r;

    acquire(&kmem.lock);
    r = kmem.freelist;
    // lab cow
    if(r) {
        kmem.freelist = r->next;
        refset((uint64)r, 1);
    }
    release(&kmem.lock);

    if(r)
        memset((char*)r, 5, PGSIZE); // fill with junk
    return (void*)r;
}

// path: kernel/defs.h
// lab cow
void   refinc(uint64);
void   refdes(uint64);
void   refset(uint64, int);
uint64 refget(uint64);
// vm.c
int    cowcopy(pagetable_t, uint64);