Skip to content

Latest commit

 

History

History
53 lines (32 loc) · 3.17 KB

spark-service-broadcastmanager.adoc

File metadata and controls

53 lines (32 loc) · 3.17 KB

Broadcast Manager

Broadcast Manager is a Spark service to manage broadcast values in Spark jobs. It is created for a Spark application as part of SparkContext’s initialization and is a simple wrapper around BroadcastFactory.

Broadcast Manager tracks the number of broadcast values (using the internal field nextBroadcastId).

The idea is to transfer values used in transformations from a driver to executors in a most effective way so they are copied once and used many times by tasks (rather than being copied every time a task is launched).

When BroadcastManager is initialized an instance of BroadcastFactory is created based on spark.broadcast.factory setting.

BroadcastFactory

BroadcastFactory is a pluggable interface for broadcast implementations in Spark. It is exclusively used and instantiated inside of BroadcastManager to manage broadcast variables.

It comes with 4 methods:

  • def initialize(isDriver: Boolean, conf: SparkConf, securityMgr: SecurityManager): Unit

  • def newBroadcast[T: ClassTag](value: T, isLocal: Boolean, id: Long): Broadcast[T] - called after SparkContext.broadcast() has been called.

  • def unbroadcast(id: Long, removeFromDriver: Boolean, blocking: Boolean): Unit

  • def stop(): Unit

TorrentBroadcast

The BroadcastManager implementation used in Spark by default is org.apache.spark.broadcast.TorrentBroadcast (see spark.broadcast.factory). It uses a BitTorrent-like protocol to do the distribution.

sparkcontext broadcast bittorrent
Figure 1. TorrentBroadcast - broadcasting using BitTorrent

TorrentBroadcastFactory is the factory of TorrentBroadcast-based broadcast values.

When a new broadcast value is created using SparkContext.broadcast() method, a new instance of TorrentBroadcast is created. It is divided into blocks that are put in Block Manager.

sparkcontext broadcast bittorrent newBroadcast
Figure 2. TorrentBroadcast puts broadcast chunks to driver’s BlockManager

Compression

When spark.broadcast.compress is true (default), compression is used.

There are the following compression codec implementations available:

  • lz4 or org.apache.spark.io.LZ4CompressionCodec

  • lzf or org.apache.spark.io.LZFCompressionCodec - a fallback when snappy is not available.

  • snappy or org.apache.spark.io.SnappyCompressionCodec - the default implementation

An implementation of CompressionCodec trait has to offer a constructor that accepts SparkConf.

Settings

  • spark.broadcast.factory (default: org.apache.spark.broadcast.TorrentBroadcastFactory) - the fully-qualified class name for the implementation of BroadcastFactory interface.

  • spark.broadcast.compress (default: true) - a boolean value whether to use compression or not. See Compression.

  • spark.io.compression.codec (default: snappy) - compression codec to use. See Compression.

  • spark.broadcast.blockSize (default: 4m) - the size of a block