What is the difference between cache and checkpoint? (Big Data)

6 answers

我是大脸猫

Reply #2 · 2021-04-30 10:30

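Difference between cache() and persist()

An RDD that is reused repeatedly but is not too large is a good candidate for cache().

cache() simply calls persist(). The difference is that cache() on an RDD always uses the single default storage level, MEMORY_ONLY, whereas persist() lets you choose another storage level as needed. The StorageLevel class defines 12 predefined levels (the exact number depends on the Spark version).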
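The delegation described above can be pictured with a small toy model. This is only a sketch in plain Python, not Spark code: `ToyStorageLevel` and `ToyRDD` are hypothetical names made up for illustration, and only three of the storage levels are modelled.

```python
from enum import Enum


class ToyStorageLevel(Enum):
    """Hypothetical stand-in for a few of Spark's storage levels."""
    MEMORY_ONLY = "MEMORY_ONLY"
    MEMORY_AND_DISK = "MEMORY_AND_DISK"
    DISK_ONLY = "DISK_ONLY"


class ToyRDD:
    """Toy RDD that only records which storage level was requested."""

    def __init__(self):
        self.storage_level = None

    def persist(self, level=ToyStorageLevel.MEMORY_ONLY):
        # persist() accepts any storage level chosen by the caller.
        self.storage_level = level
        return self

    def cache(self):
        # cache() takes no argument and always delegates with MEMORY_ONLY.
        return self.persist(ToyStorageLevel.MEMORY_ONLY)


cached = ToyRDD().cache()
on_disk = ToyRDD().persist(ToyStorageLevel.DISK_ONLY)
```

Here `cached.storage_level` ends up as `MEMORY_ONLY` no matter what, while `on_disk.storage_level` is whatever the caller passed to `persist()`.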

Difference between cache and checkpoint

Tathagata Das answered this question as follows:

There is a significant difference between cache and checkpoint. Cache materializes the RDD and keeps it in memory and/or disk [in fact, cache() itself uses memory only]. But the lineage [that is, the computing chain] of the RDD (that is, the sequence of operations that generated the RDD) will be remembered, so that if there are node failures and parts of the cached RDDs are lost, they can be regenerated. However, checkpoint saves the RDD to an HDFS file and actually forgets the lineage completely. This allows long lineages to be truncated and the data to be saved reliably in HDFS (which is naturally fault tolerant by replication).
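To make the lineage point concrete, here is a toy model in plain Python. It is not Spark: `ToyRDD`, `lose_cache`, and the dictionary used as "reliable storage" are all hypothetical, and unlike real Spark the toy checkpoint runs eagerly. It only illustrates that a cached RDD keeps its chain of parents and can be rebuilt from it, while a checkpointed RDD drops the chain.

```python
class ToyRDD:
    """Toy RDD: a compute function plus a parent link (the lineage)."""

    def __init__(self, name, compute, parent=None):
        self.name = name
        self.compute = compute    # function: parent's rows -> this RDD's rows
        self.parent = parent      # None for a source RDD
        self.use_cache = False
        self.cached = None        # in-memory copy, may be lost

    def map(self, name, fn):
        return ToyRDD(name, lambda rows: [fn(r) for r in rows], parent=self)

    def lineage(self):
        names, node = [], self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names

    def cache(self):
        self.use_cache = True
        return self

    def collect(self):
        if self.cached is not None:
            return list(self.cached)
        parent_rows = self.parent.collect() if self.parent else None
        rows = self.compute(parent_rows)
        if self.use_cache:
            self.cached = list(rows)
        return rows

    def lose_cache(self):
        self.cached = None        # simulate a node failure

    def checkpoint(self, store):
        name = self.name
        store[name] = self.collect()              # write to "reliable storage"
        self.compute = lambda _rows: list(store[name])
        self.parent = None                        # lineage is forgotten


source = ToyRDD("source", lambda _rows: [1, 2, 3])
result = source.map("doubled", lambda x: x * 2).map("plus_one", lambda x: x + 1).cache()

result.collect()
lineage_after_cache = result.lineage()            # the full chain is still there
result.lose_cache()
recomputed = result.collect()                     # rebuilt from the lineage

reliable_store = {}
result.checkpoint(reliable_store)
lineage_after_checkpoint = result.lineage()       # only the RDD itself remains
```

After `cache()` the lineage still lists all three RDDs, so losing the cached copy is harmless. After `checkpoint()` the lineage contains only the checkpointed RDD, and its data is read back from the store.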

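Which RDDs need checkpointing?

RDDs that take a long time or a large amount of computation to produce, RDDs with a very long computing chain, and RDDs that depend on many other RDDs.

In fact, writing the output of a ShuffleMapTask to local disk can also be regarded as a kind of checkpoint, although the main purpose of that checkpoint is to partition the output data.

The cache mechanism stores each partition in memory as soon as that partition has been computed.

Checkpoint does not store the data the first time it is computed. Instead, it waits until the job has finished and then launches a separate job to write the checkpoint.

In other words, an RDD that needs to be checkpointed is computed twice. It is therefore recommended to call rdd.cache() together with rdd.checkpoint(),

so that the second job does not have to recompute the RDD and can simply read the cached data and write it to disk.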
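The "computed twice" behaviour can be sketched with a toy counter. This is plain Python rather than Spark, and `run_action_with_checkpoint` and its four fake partitions are hypothetical; the point is only to count how often the expensive computation runs when a separate checkpoint job follows the user's job.

```python
def run_action_with_checkpoint(use_cache):
    """Simulate one action followed by a separate checkpoint job.

    Returns (number of partition computations, checkpointed data).
    """
    compute_calls = 0
    cache = {}

    def compute_partition(index):
        nonlocal compute_calls
        if use_cache and index in cache:
            return cache[index]           # served from the cache, no recomputation
        compute_calls += 1                # the expensive part
        rows = [index * 10 + i for i in range(3)]
        if use_cache:
            cache[index] = rows           # cached as soon as it is computed
        return rows

    partitions = range(4)

    # Job 1: the action the user actually called.
    for p in partitions:
        compute_partition(p)

    # Job 2: the separate job that writes the checkpoint.
    checkpoint_files = {p: list(compute_partition(p)) for p in partitions}
    return compute_calls, checkpoint_files


calls_without_cache, _ = run_action_with_checkpoint(use_cache=False)
calls_with_cache, files = run_action_with_checkpoint(use_cache=True)
```

Without the cache each of the four partitions is computed once per job, eight computations in total. With the cache the checkpoint job reuses the four cached partitions, so the count stays at four.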

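Difference between persist() and checkpoint()

Going a little deeper, rdd.persist(StorageLevel.DISK_ONLY) is also different from checkpoint.

The former does persist the RDD's partitions to disk, but those partitions are managed by the blockManager.

Once the driver program finishes, that is, once the CoarseGrainedExecutorBackend process hosting the executor stops, the blockManager stops as well,

and the RDD cached on disk is cleared (the entire local directory used by the blockManager is deleted).

Checkpoint, by contrast, persists the RDD to HDFS or a local directory. Unless it is removed manually (by the way, how do you remove a checkpointed RDD?), it stays there permanently,

which means it can be used by the next driver program, whereas a cached RDD cannot be used by another driver program.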
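The lifetime difference can be mimicked with temporary directories. Again this is a toy in plain Python, not Spark: `ToyBlockManager`, `write_checkpoint`, and the JSON file layout are hypothetical and chosen only to show which files outlive the "driver program".

```python
import json
import shutil
import tempfile
from pathlib import Path


class ToyBlockManager:
    """Owns a local directory that is deleted when the manager stops."""

    def __init__(self):
        self.local_dir = Path(tempfile.mkdtemp(prefix="toy-blockmgr-"))

    def put_on_disk(self, block_id, rows):
        path = self.local_dir / f"{block_id}.json"
        path.write_text(json.dumps(rows))
        return path

    def stop(self):
        shutil.rmtree(self.local_dir)     # everything persisted here is gone


def write_checkpoint(checkpoint_dir, rdd_id, rows):
    """Write rows to a directory that no block manager owns."""
    path = Path(checkpoint_dir) / f"rdd-{rdd_id}.json"
    path.write_text(json.dumps(rows))
    return path


checkpoint_dir = tempfile.mkdtemp(prefix="toy-checkpoint-")
rows = [1, 2, 3]

# First "driver program": persist to disk and also checkpoint.
block_manager = ToyBlockManager()
persisted_path = block_manager.put_on_disk("rdd_0_0", rows)
checkpoint_path = write_checkpoint(checkpoint_dir, 0, rows)
block_manager.stop()                      # the program ends

# Next "driver program": only the checkpoint is still readable.
persisted_survives = persisted_path.exists()
restored = json.loads(checkpoint_path.read_text())

shutil.rmtree(checkpoint_dir)             # checkpoints go away only when removed by hand
```

In this toy, removing a checkpoint simply means deleting its directory. Whether that is the right way to clean up in a real deployment depends on the storage system and Spark configuration in use.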

羊羊羊羊

Reply #6 · 2021-05-31 11:57

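Notes on checkpoint
If the lineage is too long, fault recovery becomes expensive; a checkpoint can be set to store the RDD's intermediate result.
A checkpoint can store the intermediate result in highly available storage such as HDFS.
Checkpointing cuts the original lineage.
To make sure the checkpointed data is reliable, the checkpoint job recomputes the RDD from the very beginning of its lineage, so checkpoint is usually combined with cache for efficiency.
Like cache, checkpoint only runs when an action operator is encountered.

Notes on cache
Calling cache on an RDD caches the result of the preceding computation in memory or on disk; by default it is cached in JVM memory.
If a later computation needs this RDD's result, it can be read directly from the cache instead of being recomputed.
cache does not cut the original lineage; it only adds to it.
cache does not execute immediately; it takes effect when the first action operator is encountered.
Whether cached data is stored in memory or on disk, it is destroyed when the program finishes.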

征戰撩四汸

Reply #7 · 2021-12-06 14:02

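cache and persist are both RDD APIs, and cache calls persist under the hood. One difference is that cache cannot take an explicit storage level and only caches in memory, whereas persist lets you specify one explicitly, for example memory only, or memory and disk with serialization. Once an RDD is cached, later processing of that RDD, or of other RDDs derived from it, can reuse the cached dataset.

checkpoint essentially writes the RDD to disk as a checkpoint (usually to HDFS, which also takes advantage of HDFS's high availability and reliability). Spark lineage was mentioned above, but in a real production environment a business requirement can be extremely complex, so many operators may be called and many RDDs produced, and the lineage chain between RDDs becomes very long. If any step fails, recovery is very expensive. This is where checkpoint comes in: users can checkpoint the important RDDs, and after a failure the computation only needs to restart from the most recent checkpoint. Usage is simple: specify the checkpoint directory with SparkContext.setCheckpointDir("<checkpoint directory>"), then call the RDD's checkpoint method.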
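The usage steps just described might look like the following in PySpark. This is a minimal sketch under some assumptions: the caller supplies an existing `pyspark.SparkContext` as `sc`, the helper name `checkpoint_with_cache` and the doubling `map` are made up for illustration, and `cache()` is added following the advice elsewhere in this thread rather than because it is required.

```python
def checkpoint_with_cache(sc, checkpoint_dir, values):
    """Cache and checkpoint an RDD, then trigger both with one action.

    `sc` is assumed to be a pyspark.SparkContext created by the caller;
    `checkpoint_dir` is a directory path (typically on HDFS in production).
    """
    # The checkpoint directory has to be set before checkpoint() is used.
    sc.setCheckpointDir(checkpoint_dir)

    rdd = sc.parallelize(values).map(lambda x: x * 2)

    rdd.cache()         # lazy: only marks the RDD for caching
    rdd.checkpoint()    # lazy: only marks the RDD for checkpointing

    # The action runs the job; the checkpoint is written afterwards and
    # can read the cached partitions instead of recomputing them.
    total = rdd.count()
    return rdd, total
```

With a real SparkContext one would call something like `rdd, total = checkpoint_with_cache(sc, "/tmp/spark-checkpoints", range(100))`, where the path is only an example. After the action, `rdd.isCheckpointed()` should report whether the checkpoint was written.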

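1. Both are lazy operations: the caching or checkpointing actually happens only after an action operator triggers it (lazy evaluation is an important feature of Spark jobs and applies not only to Spark RDDs but also to components such as Spark SQL).

2. cache only caches the data and does not change the lineage. The data usually lives in memory, so it is more likely to be lost.

3. checkpoint changes the original lineage and produces a new CheckpointRDD. The data usually lives in HDFS, which is highly available and more reliable.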
