⌈xv6-fall2021⌋ lab 6：Multithreading

mit6.828

Word count: 4.3kReading time: 19 min

 2022/04/05 

LAB

传送门：Lab: Multithreading

INTRO

读 xv6-handout-fall 2021, chapter 7 有感，以下是 xv6 线程（进程）切换过程

如何标识一个线程？

PC
寄存器环境（上下文）
线程的栈

xv6 如何组织线程？如图所示：

每个用户进程都有其对应的内核线程。相当于每个用户进程都有两个线程，第一个位于用户空间，有独立的地址；第二个位于内核空间，使用共享的内存地址
所谓对应的内核线程，其实并没有具体的代码文件。我的理解是 trampoline()、usertrap()、usertropret() 这些和 trap 相关的函数（线程切换这个故事也是开始于陷入 trap ！），只不过不同线程执行速度不一样，即上下文不一样，所以才属于不同的线程

一个线程切换的例子，从 CC 命令切换到 LS 命令：

从 CC 用户线程陷入 CC 内核线程（CC_k），陷入过程中在 trampoline() 内将用户线程上下文保存到 trapframe
CC_k 要让出 CPU，进入 swtch() 上半场的处理。先保护 CC_k 现场，存入 context 对象（context 结构体总是保存了内核线程在执行 swtch() 时的状态；而 trapframe 栈帧总是保存了用户线程在执行 trampoline() 时的状态）
swtch() 恢复当前 CPU 调度器的上下文和栈，之后用这个环境执行 scheduler()
scheduler() 先将 CC_k 状态改为 RUNNABLE，然后在进程表中找到 LS_k，最后又调用 swtch()（只不过此时的 swtch() 不是刚才进入的那个）
进入 swtch() 下半场：
- 先将调度器的上下文和栈保存到自己的 context 对象
- 再将 LS_k 的 context 对象恢复，即加载为当前 CPU 寄存器组，从而恢复上次被打断的 swtch()
返回到 LS_k 的 trampoline，恢复 trapframe
恢复 LS 用户线程

值得注意的是，这里的两个 swtch() 调用实际不是同一个。进入调度器的是 sched() 的 swtch()；而调度器返回的却是 scheduler() 的 swtch()。上述例子对应的程序流如下：

陷入：trampoline() -> usertrap() -> devintr()
让出 CPU：-> yield() -> sched()
保护现场并切换至调度器：-> swtch()
调度器进行调度：-> scheduler()
另一个线程登场：-> swtch()

xv6 总共包含三种类型的上下文信息：

用户上下文：由 trapframe 结构体保存
内核上下文：由 context 结构体保存
调度器上下文：由 cpu 结构体保存（cpu 结构体包含 context 成员，所以最终也是保存在 context 对象上。只不过区分不同的调度器线程是依据 cpu 是否相同，只有 cpu 寄存器环境才标识一个调度器）

Uthread: switching between threads (MODERATE)

要求

In this exercise you will design the context switch mechanism for a user-level threading system, and then implement it.

"在这个练习中，你将为一个用户级线程系统设计上下文切换机制，并且实现"

The threading package is missing some of the code to create a thread and to switch between threads.

"线程库缺乏一些创建线程和在不同线程间切换的代码"

Your job is to come up with a plan to create threads and save/restore registers to switch between threads, and implement that plan. When you're done, make grade should say that your solution passes the uthread test.

"你的工作是想出一个策略，用以创建线程、在线程切换间保存和恢复现场，并且实现。当你完成后，make grade 会提示你的解决方案通过了 uthread test"

One goal is ensure that when thread_schedule() runs a given thread for the first time, the thread executes the function passed to thread_create(), on its own stack. Another goal is to ensure that thread_switch saves the registers of the thread being switched away from, restores the registers of the thread being switched to, and returns to the point in the latter thread's instructions where it last left off.

"第一个目标是确保在 thread_schedule() 第一次运行时，可以在自己的栈上执行 thread_create() 传递的参数。第二个目标是确保 thread_switch() 保存了将要切换走的线程的寄存器组，同时恢复将要切换过去的线程的寄存器组，最后返回到后一个线程上次停止的地方"

You'll need to add a call to thread_switch in thread_schedule; you can pass whatever arguments you need to thread_switch, but the intent is to switch from thread t to next_thread.

"你需要在 thread_schedule() 中调用 thread_switch()；传递任何你需要的参数给 thread_switch，只要保证能够从线程 t 切换至 next_thread"

提示

这次提示很简单，就是切换过程中只需要保存 callee 的上下文，并且阅读 user/uthread.asm 对 debugging 非常有参考意义

Summary

主要参考 xv6 线程调度的程序流就行，对比这里，试想当前场景是从线程 t 切换为 next_thread，那么：

核心任务是保存线程 t 的上下文，然后恢复 next_thread 的上下文，最后返回地址改为 next_thread 上次中断处
首先要保存的上下文只有 callee 负责的那些寄存器，这和 kernel/proc.h 的 context 结构体是一样的
然后考虑保存在哪？既然文件给了 thread 结构体，那自然是新增一个上下文字段更方便

这里面还有个小细节要注意：

每个线程在执行时都需要自己的栈，但栈是往低处长的（grow downwards），所以保存栈地址应该是高地址

Solution

// path: user/uthread.c
struct context {
  uint64 ra;
  uint64 sp;

  // callee-saved
  uint64 s0;
  uint64 s1;
  uint64 s2;
  uint64 s3;
  uint64 s4;
  uint64 s5;
  uint64 s6;
  uint64 s7;
  uint64 s8;
  uint64 s9;
  uint64 s10;
  uint64 s11;
};

struct thread {
  char       stack[STACK_SIZE]; /* the thread's stack */
  int        state;             /* FREE, RUNNING, RUNNABLE */
  struct context cont;
};

void 
thread_schedule(void) {
    ...
    if (current_thread != next_thread) {         /* switch threads?  */
        ...
        /* YOUR CODE HERE
         * Invoke thread_switch to switch from t to next_thread:
         * thread_switch(??, ??);
         */
        thread_switch((uint64)&t->cont, (uint64)&next_thread->cont);
    } else
        next_thread = 0;
}
void 
thread_create(void (*func)()) {
    ...
    // YOUR CODE HERE
    t->cont.ra = (uint64)func;
    t->cont.sp = (uint64)&t->stack[STACK_SIZE - 1];
}

	.text

	/*
     * save the old thread's registers,
     * restore the new thread's registers.
     */

	.globl thread_switch
thread_switch:
	/* YOUR CODE HERE */
	sd ra, 0(a0)
	sd sp, 8(a0)
	sd s0, 16(a0)
	sd s1, 24(a0)
	sd s2, 32(a0)
	sd s3, 40(a0)
	sd s4, 48(a0)
	sd s5, 56(a0)
	sd s6, 64(a0)
	sd s7, 72(a0)
	sd s8, 80(a0)
	sd s9, 88(a0)
	sd s10, 96(a0)
	sd s11, 104(a0)

	ld ra, 0(a1)
	ld sp, 8(a1)
	ld s0, 16(a1)
	ld s1, 24(a1)
	ld s2, 32(a1)
	ld s3, 40(a1)
	ld s4, 48(a1)
	ld s5, 56(a1)
	ld s6, 64(a1)
	ld s7, 72(a1)
	ld s8, 80(a1)
	ld s9, 88(a1)
	ld s10, 96(a1)
	ld s11, 104(a1)

	ret    /* return to ra */

Using threads (MODERATE)

要求

The file notxv6/ph.c contains a simple hash table that is correct if used from a single thread, but incorrect when used from multiple threads. In your main xv6 directory (perhaps ~/xv6-labs-2021), type this:
1
2
$ make ph
$ ./ph 1

"文件 notxv6/ph.c 包含一个简单的哈希表，如果用在单线程上是正确的，但使用多线程却不正确。在你的 xv6 主目录下（可能如 ~/xv6-labs-2021），输入上面命令"

The argument to ph specifies the number of threads that execute put and get operations on the the hash table. After running for a little while, ph 1 will produce output similar to this:
1
2
3
100000 puts, 3.991 seconds, 25056 puts/second
0: 0 keys missing
100000 gets, 3.981 seconds, 25118 gets/second

"ph 的参数指出在哈希表上执行 put 和 get 操作的线程数量。运行一段时间后，ph 1 会产生上面那样的输出"

ph runs two benchmarks. First it adds lots of keys to the hash table by calling put(), and prints the achieved rate in puts per second. The it fetches keys from the hash table with get(). It prints the number keys that should have been in the hash table as a result of the puts but are missing (zero in this case), and it prints the number of gets per second it achieved.

"ph 运行两种基准。第一种是通过 put() 往哈希表里增加大量的 key，并且每秒都会打印在 put() 里增加 key 的配置速度；第二种是通过 get() 从哈希表里取出 key。ph 还会打印因 put() 本应在哈希表中存在，实际却丢失的 key 的数量（在上面例子中是 0），并且也会打印 get() 的每秒配置速度"

You can tell ph to use its hash table from multiple threads at the same time by giving it an argument greater than one. Try ph 2:
1
2
3
4
5
$ ./ph 2
100000 puts, 1.885 seconds, 53044 puts/second
1: 16579 keys missing
0: 16579 keys missing
200000 gets, 4.322 seconds, 46274 gets/second

"你可以传入一个大于 1 的参数，让 ph 使用多线程操作哈希表。输入 ph 2 如上面所示"

The first line of this ph 2 output indicates that when two threads concurrently add entries to the hash table, they achieve a total rate of 53,044 inserts per second. That's about twice the rate of the single thread from running ph 1.

"ph 2 输出的第一行，指出并发执行两个线程向哈希表增加条目时，插入总速率达到 53044 项每秒。这大概是 ph 1 运行单线程的两倍"

However, the two lines saying 16579 keys missing indicate that a large number of keys that should have been in the hash table are not there. That is, the puts were supposed to add those keys to the hash table, but something went wrong. Have a look at notxv6/ph.c, particularly at put() and insert().

"然而，（接下来的）丢失 16579 个 key 这两行指出有大量 key 本应在哈希表内但实际却丢失了。即 put() 将这些 key 加入了哈希表，但发生了一些错误。阅读 notxv6/ph.c 特别是 put() 和 insert()"

Why are there missing keys with 2 threads, but not with 1 thread? Identify a sequence of events with 2 threads that can lead to a key being missing. Submit your sequence with a short explanation in answers-thread.txt

"为什么用 2 个线程会丢失 key？而 1 个线程不会？想想导致 key 丢失的 2 个线程的事件发生次序，在 answers-thread.txt 写下简要的解释并提交"

Modify your code so that some put operations run in parallel while maintaining correctness. You're done when make grade says your code passes both the ph_safe and ph_fast tests. The ph_fast test requires that two threads yield at least 1.25 times as many puts/second as one thread.

"修改你的代码以便 put 操作在正确的前提下能并行运行，当 make grade 提示你的代码均通过 ph_safe 和 ph_fast 两个测试时才算完成。其中 ph_fast 测试要求两个线程运行速度至少是单个 put 线程的 1.25 倍"

Summary

首先阅读 main()，可以发现逻辑大约如此：

创建 put 操作线程
put 线程主要调用 put() 和 insert() 两个函数
put() 遍历 table[]，找出和形参 key 相同的一项。之后判断这一项元素是否存在，若存在直接覆盖原值，否则调用 insert() 头插法链入链表
get 操作不会对并行操作造成影响所以不考虑

然后要考虑的一个问题是，为什么多线程场景会出问题？既然是链表，所以可以参考 xv6-handout-fall 2021, section 6.1 这个链表双线程的并发执行例子，大概如此：

[thread 0] 在 put() 中发现是新结点，正准备调用 insert()
-------- schedule() ----------
[thread 1] 在 put() 也发现是新结点，也准备调用 insert()
-------- schedule() ----------
[thread 0] 调用 insert() 结束后，在原本结点 i 这个地方新增了一个 {key, value} 键值对的结点
-------- schedule() ----------
[thread 1] 已经通过了 if(e) 的条件判断，走入了 else() 分支，但实际上 thread 0 已经改变了条件的情况，
            此时 thread 1 理应走 if() 分支。可惜，thread 1 不知道情况仍在相同地址上调用 insert() ，
            最终覆盖了 thread 0 刚才新增的键值对，从而丢失 key

一句话总结就是，共享的数据没有锁保护所以出问题了。这里 table[] 就是共享数据，多个线程都对这块数据写入，那么先写入的数据必然被覆盖（丢失）

所以最简单的想法就是在写入操作，也即是 insert() 头插法执行前后加锁，但如果这么加锁是有问题的：

// 错误的
static void 
insert(int key, int value, struct entry **p, struct entry *n)
{
    pthread_mutex_lock(&lock);
    struct entry *e = malloc(sizeof(struct entry));
    e->key = key;
    e->value = value;
    e->next = n;
    *p = e;
    pthread_mutex_unlock(&lock);
}

理由如下：

[thread 0] 当前索引匹配 table[0]，并检查到不存在结点，所以进入了 insert()，正准备执行
-------- schedule() ----------
[thread 1] 在 put() 循环中也找到 table[0]，也检测到结点不存在，正准备调用 insert()
-------- schedule() ----------
[thread 0] 调用 insert() 结束后，在原本结点 i 这个地方新增了一个 {key, value} 键值对的结点
-------- schedule() ----------
[thread 1] 调用 insert() 最终覆盖了 thread 0 新追加的结点

所以锁的请求应该往回调用，放在循环结束后而判断前比较合适

Solution

// notxv6/ph.c
pthread_mutex_t lock;
...

static void 
insert(int key, int value, struct entry **p, struct entry *n) {
    ...
    *p = e;
    pthread_mutex_unlock(&lock);
}

static 
void put(int key, int value)
{
    ...
    for (e = table[i]; e != 0; e = e->next) {
        ...
    }
    // lab multithreading
    pthread_mutex_lock(&lock);
    if(e) {
        // update the existing key.
        e->value = value;
        pthread_mutex_unlock(&lock);
    } else ...
}

int
main(int argc, char *argv[]) {
    ...
    for (int i = 0; i < NKEYS; i++) {
        keys[i] = random();
    }

    // lab multithreading
    pthread_mutex_init(&lock, NULL);

    //
    // first the puts
    //
    t0 = now();
    ...

Barrier (MODERATE)

要求

In this assignment you'll implement a barrier: a point in an application at which all participating threads must wait until all other participating threads reach that point too. You'll use pthread condition variables, which are a sequence coordination technique similar to xv6's sleep and wakeup.

"在这个任务中，你将要实现一个 barrier：是一个指针，作用是使得所有参与执行的线程必须等，直到其他线程全部到达 barrier 这个指针的地址处才能继续执行。你还需要使用 pthread 条件变量，这是和 xv6 的 sleep() 和 wakeup() 差不多的时序配合机制"

The file notxv6/barrier.c contains a broken barrier.
1
2
3
$ make barrier
$ ./barrier 2
barrier: notxv6/barrier.c:42: thread: Assertion `i == t' failed.

"文件 notxv6/barrier.c 包含了一个未完成的 barrier 程序，像上面那样"

The 2 specifies the number of threads that synchronize on the barrier ( nthread in barrier.c). Each thread executes a loop. In each loop iteration a thread calls barrier() and then sleeps for a random number of microseconds. The assert triggers, because one thread leaves the barrier before the other thread has reached the barrier. The desired behavior is that each thread blocks in barrier() until all nthreads of them have called barrier().

"参数 2 指定在 barrier()（在 barrier.c 里面的 nthread）程序中同步的线程数量。每个线程执行一个循环。每个循环的一次迭代中，线程调用 barrier() 然后沉睡一个随机微秒。assert 命中是因为一个线程在其他线程到达 barrier() 之前离开了。一个可采取的策略是每个线程阻塞在 barrier() 直到所有这些线程的都调用了 barrier()"

You have to deal with a succession of barrier calls, each of which we'll call a round. bstate.round records the current round. You should increment bstate.round each time all threads have reached the barrier.

"你需要解决如何将一个线程的调用组织成一个连续的循环，每个循环称为一个 'round'。bstate.round 记录了当前的 round。你需要在所有线程到达 barrier() 时自增 bstate.round"

———— 这里的循环我是这么理解的：一个线程进入 for() 之后会撞上 barrier，然后停住。这之后一段时间其他线程会发生很多事，但对于我们这个线程来说都无视，直到它被某种条件唤醒，退出 barrier 后又回到 for() 新一次迭代。这便是一个循环，即一个 round

You have to handle the case in which one thread races around the loop before the others have exited the barrier. In particular, you are re-using the bstate.nthread variable from one round to the next. Make sure that a thread that leaves the barrier and races around the loop doesn't increase bstate.nthread while a previous round is still using it.

"你必须处理其他线程退出 barrier() 之前，某一个线程在循环中产生的竞争情况。特别需要注意的是，程序流正在从这一轮到下一轮中重复使用 bstate.nthread 变量的情况。确保离开 barrier() 并在循环中出现竞争情况的一个线程在前一个循环仍然使用 bstate.nthread 变量的时候，不要自增这个变量"

———— 这里的竞争情况我理解为 lost wakeup：即线程 0 正检查条件以进入沉睡（此时可能外界看起来状态为 SLEEPING，但实际还未完全切换过去），却发生了调度。后来其他线程唤醒了线程 0。之后调度又回到线程 0，但是线程 0 先执行的仍然是进入沉睡的检查，最终丢失了唤醒信号从而死锁

———— 回到这个场景里，前半句说的竞争情况，应该是让我们注意不要丢失线程在 barrier() 的 lost wakeup；后半句说的重复使用 bstate.nthread，应该是让我们注意到这是一个共享数据，第二个线程访问的时候，就算 re-using，这需要通过锁来保护

Summary

现在总结以下思路：

根据题意，核心思想是每个线程都需要撞上 barrier，即停在这里。什么时候可以重新继续？答案是所有线程到达的时候。所以 barrier() 里面的逻辑最直接的想法是：当前撞上 barrier 的线程数量未够，就在 condition variables 上挂起；直到所有线程到达
访问 bstate.nthread 需要加锁，理由是上面说的这是一个共享数据
同时需要在 condition variables 释放时，变量 round 自增，同时当前循环的线程数归零

Solution

// path: notxv6/barrier.c
static void 
barrier()
{
    // YOUR CODE HERE
    //
    // Block until all threads have called barrier() and
    // then increment bstate.round.
    //
    pthread_mutex_lock(&bstate.barrier_mutex);
    ++bstate.nthread;
    if (bstate.nthread != nthread) {
        pthread_cond_wait(&bstate.barrier_cond, &bstate.barrier_mutex);
    } else {
        bstate.round++;
        bstate.nthread = 0;
        pthread_cond_broadcast(&bstate.barrier_cond);
    }
    pthread_mutex_unlock(&bstate.barrier_mutex);
}