# 6.824

## Introduction
@20190819

- 容错
    - 需求
        - 高可用
        - duration
    - 方式
        - 复制服务器

- 一致性
    - 困哪
        - "replica" servers are hard to keep identical
        - clients may crash midway through multi-step update
        - servers crash at awkward moments 
        - network may take live servers look dead; risk of "split brain".
    - 一致性与性能的矛盾
        - consistency requires communication, e.g. to get latest Put(). "Strong consistency" often leads to slow systems.
        - High performance often imposes "weak consistency" on applications.

https://pdos.csail.mit.edu/6.824/papers/mapreduce.pdf

### Map-Reduce

编程模型

```
map(k1, v1) -> list(k2, v2)
reduce(k2, list(v2)) -> list(k2, v3)
```

Word-Count示例

```
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
```

例程

- Distributed Grep
- Count of URL Access Frequency
- Reverse Web-Link Graph
- Term-Vector per Host
- Inverted Index
- Distributed Sort

实现

![](mapreduce.png)

- input file split分片大小
    - 16mb ~ 64mb
- master调度worker进行map和reduce, 谁空闲分配给谁.
- map worker读输入分片, 结果周期存入本地盘, 将位置报告给master, master通知reduce去拿(rpc读, p2p)
- reduce先排序读入的数据, 按顺序发给用户自定义Reduce函数.
- reduce的结果append到一个最终文件(partition)

容错

- worker失败
    - master定时ping worker, 判定worker是否失败
    - 失败后丢弃计算结果重新找worker计算指定数据
- master失败
    - 定期将master内存镜像到checkpoint. 
    - 失败后,拷贝最近checkpoint,新启动master.
    - checkpoint后的任务被重放

局部性

- 网络带宽是稀缺资源
- 数据, map任务, reduce任务就近分配
    - gfs的case是64mb数据块就近

任务粒度

M个map任务,R个reduce任务,原则上, M,R要远大于worker机器数量,便于任务调度. 但是任务数据也有限制, 上下限内的合理值.

master做O(M+R)调度, 在内存中维持O(M*R)个状态.

`straggler`问题:
一台工作机在少量任务上工作耗时超长. 

任务备份来解决`straggler`问题, 一个mapreduce任务快结束时, master将仍在运行的任务做一个副本任务重新执行, 其中有一执行完毕,就标识任务结束.