The PS-Service feature was introduced in Angel 1.0.0. With it, Angel can not only run as a complete PS framework, but also serve as a PS-Service that adds parameter-server capability to other distributed frameworks, making them faster and more powerful. Spark is the first beneficiary of the PS-Service design.
As a popular in-memory computing framework, Spark revolves around the concept of the RDD, which is immutable in order to avoid a range of potential problems caused by concurrent updates from multiple threads. The RDD abstraction works well for data analytics: it solves the distribution problem thoroughly, keeps its operators simple, and provides high-performance, distributed data-processing capabilities.
In the machine learning domain, however, iteration and parameter updating are the core demands. RDDs are a lightweight fit for iterative algorithms because they keep data in memory and avoid repeated I/O, but their immutability is a barrier to repeated parameter updates. We believe this trade-off in the RDD's capability is one of the causes of the slow development of Spark MLlib, which has lacked substantive innovation and has suffered from unsatisfying performance in recent years.
Now, based on this platform design, Angel provides a PS-Service to Spark, so Spark can take full advantage of the parameter-updating capability. Complex models can be trained efficiently with elegant code and minimal rewriting cost.
Spark-On-Angel's system architecture is shown below. Note that:
- Spark RDD is the immutable layer, whereas Angel PS is the mutable layer
- Angel and Spark cooperate and communicate via PSAgent and AngelPSClient
Spark-On-Angel is lightweight due to Angel's interface design. The core modules include:
- `PSContext`
  - uses the Spark context and the Angel configuration to create the `PSContext`, which is in charge of overall initialization and startup on the driver side
- `PSModel`
  - `PSModel` is the general name for a `PSVector`/`PSMatrix` stored on the PS server, and it encapsulates the `PSClient` object
  - `PSModel` is the parent class of `PSVector` and `PSMatrix`
- `PSVector`
  - PSVector application: apply for a `PSVector` via `PSVector.dense(dim: Int, capacity: Int = 50, rowType: RowType = RowType.T_DENSE_DOUBLE)`, which creates a `VectorPool` of dimension `dim`, capacity `capacity`, and type double; two PSVectors in the same `VectorPool` can operate on each other. Apply for a `PSVector` in the same `VectorPool` as `psVector` via `PSVector.duplicate(psVector)`.
- `PSMatrix`
  - PSMatrix creation and destruction: a `PSMatrix` is created via `PSMatrix.dense(rows: Int, cols: Int)`; once it is no longer used, you need to manually call `destroy` to destroy the matrix (see the sketch after this list).
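The following is a minimal sketch of the calls described above: applying for two PSVectors in the same VectorPool, then creating and destroying a PSMatrix. The concrete signatures may differ slightly between Spark-on-Angel versions, so treat it as an illustration rather than a definitive reference.

```scala
// Assumes PSContext has already been started on the driver;
// imports for PSVector / PSMatrix are omitted (package paths depend on the version).

// apply for a dense, double-typed PSVector of dimension 1000
val vecA = PSVector.dense(dim = 1000, capacity = 50)

// apply for another PSVector in the same VectorPool as vecA,
// so that the two vectors can operate on each other
val vecB = PSVector.duplicate(vecA)

// create a dense PSMatrix; destroy it manually once it is no longer used
val matrix = PSMatrix.dense(rows = 10, cols = 1000)
// ... use the matrix ...
matrix.destroy()
```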
The simple code for using Spark on Angel is as follows:

```scala
// Initialize the PS context and apply for a dense PSVector
PSContext.getOrCreate(spark.sparkContext)
val psVector = PSVector.dense(dim, capacity)

rdd.map { case (label, feature) =>
  // accumulate each feature vector into the PSVector held on the PS servers
  psVector.increment(feature)
  ...
}
// note: map is lazy, so a Spark action must run before the result is pulled
println("feature sum:" + psVector.pull.mkString(" "))
```
Spark on Angel is essentially a Spark application. When Spark starts, the driver starts up Angel PS through the Angel PS interface and, when necessary, encapsulates part of the data into PSVectors to be managed by the PS nodes. Therefore, the execution process of Spark on Angel is similar to that of Spark.
Spark driver's new execution process:
Compared with a plain Spark driver, it has the added responsibility of starting up and managing the PS nodes (a minimal driver-side skeleton follows this list):
- starting up SparkSession
- starting up PSContext
- creating PSVector/PSMatrix
- executing the logic
- stopping PSContext and SparkSession
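Putting these driver-side steps together, here is a minimal sketch. It assumes `PSContext.stop()` is the shutdown call on the PS side and omits the Spark-on-Angel imports, whose package paths vary between versions; it illustrates the flow rather than the exact API.

```scala
import org.apache.spark.sql.SparkSession
// imports for PSContext / PSVector omitted: the package path depends on the Spark-on-Angel version

object DriverSkeleton {
  def main(args: Array[String]): Unit = {
    // 1. start up SparkSession
    val spark = SparkSession.builder().appName("spark-on-angel-demo").getOrCreate()

    // 2. start up PSContext, which launches and manages the Angel PS nodes
    PSContext.getOrCreate(spark.sparkContext)

    // 3. create the PSVector/PSMatrix that will hold the model on the PS side
    val weights = PSVector.dense(dim = 100)

    // 4. execute the application logic (data processing and model training),
    //    e.g. rdd operations that call weights.increment(...) as in the example above

    // 5. stop PSContext and SparkSession
    PSContext.stop()   // assumed shutdown call; check your version's API
    spark.stop()
  }
}
```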
Spark executor's new execution process (an illustrative sketch follows this list):
- data processing
- model training
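On the executor side, a typical pattern is that each partition pulls the current model from the PS, computes updates from its local data, and pushes them back with `increment`. The sketch below is only illustrative: the linear-model gradient, the learning rate `lr`, and the assumption that `pull` returns an `Array[Double]` (as the `mkString` call in the earlier example suggests) are not taken from the source.

```scala
import org.apache.spark.rdd.RDD
// import for PSVector omitted: the package path depends on the Spark-on-Angel version

// One pass over the training data: data processing and model training on the executors.
// Assumes a linear model with squared loss; `weights` lives on the Angel PS servers.
def trainOneEpoch(data: RDD[(Double, Array[Double])], weights: PSVector, lr: Double = 0.01): Unit = {
  data.foreachPartition { iter =>
    // pull the current model once per partition (assumed to return Array[Double])
    val w = weights.pull
    iter.foreach { case (label, feature) =>
      // local gradient of the squared loss for this sample
      val pred = w.zip(feature).map { case (wi, xi) => wi * xi }.sum
      val update = feature.map(xi => -lr * (pred - label) * xi)
      // push the scaled negative gradient back to the PS
      weights.increment(update)
    }
  }
}
```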