Architecture (layered stack)
Spark core
GraphX core
Pregel + GraphLab API
GraphX toolkit (built-in algorithms)
Processing pipeline
Raw data from FS/DB -> initial graph built by ETL -> subgraph extracted by Slice -> PageRank computed via the GraphLab/Pregel compute APIs -> results stored back to FS/DB
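A minimal sketch of this pipeline in GraphX, assuming an edge-list file at a hypothetical HDFS path; the Slice step is illustrated by dropping self-loops and the compute step by the built-in PageRank:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pipeline").setMaster("local"))
    // ETL: load a raw edge list from the file system into an initial graph
    val raw = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")  // hypothetical input path
    // Slice: extract a subgraph, here by dropping self-loops
    val sub = raw.subgraph(epred = t => t.srcId != t.dstId)
    // Compute: run the built-in PageRank until it converges to the given tolerance
    val ranks = sub.pageRank(0.001).vertices
    // Store: write the result back to the file system
    ranks.saveAsTextFile("hdfs:///out/ranks")                         // hypothetical output path
    sc.stop()
  }
}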
Data structure (physical data structure)
RDPG (Resilient Distributed Property Graph): a directed multigraph with user-defined properties attached to both vertices and edges
In practice it is materialized as RDDs (RDD[VertexPartition] and RDD[EdgePartition])
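From user code, the two sides of that representation are reachable through the graph's members; the helper below (a made-up name, for illustration only) just pulls them out:

import org.apache.spark.graphx.{EdgeRDD, Graph, VertexRDD}

// Hypothetical helper: VertexRDD[VD] behaves as an RDD[(VertexId, VD)] and
// EdgeRDD[ED] as an RDD[Edge[ED]]; both are partitioned collections under the hood.
def inspect[VD, ED](g: Graph[VD, ED]): Unit = {
  val vs: VertexRDD[VD] = g.vertices   // vertex attributes, one entry per vertex id
  val es: EdgeRDD[ED]   = g.edges      // edge attributes plus src/dst vertex ids
  println(s"vertices: ${vs.count()}, edges: ${es.count()}")
}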
Operation views (logical data structure)
The RDPG supports two views: a Graph view and a Collection/Table view (contrasted in the sketch after this list)
The Collection/Table view consists of a vertex table and an edge table and is manipulated with the ordinary Spark RDD API
The Graph view operates on the graph directly, returning new graphs
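A small sketch contrasting the two views on a graph shaped like the example at the end of this post ((name, age) vertex attributes, Int edge weights); the helper name is made up:

import org.apache.spark.graphx.Graph

def twoViews(g: Graph[(String, Int), Int]): Unit = {
  // Collection/Table view: the vertex and edge tables are plain RDDs,
  // so any Spark RDD operator (filter, map, join, ...) applies.
  val adults = g.vertices.filter { case (_, (_, age)) => age >= 18 }.count()
  val heavy  = g.edges.filter(e => e.attr > 5).count()
  // Graph view: transform the graph as a whole and get a new graph back.
  val relabeled: Graph[String, Int] = g.mapVertices((id, attr) => s"${attr._1} (${attr._2})")
  println(s"adults=$adults, heavy edges=$heavy, vertices=${relabeled.numVertices}")
}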
Graph storage
Edge-cut: every vertex is stored exactly once, but a cut edge is split across two machines
Vertex-cut (used by GraphX): every edge is stored exactly once on a single machine, while high-degree vertices are replicated across machines
Each graph is backed by three RDD-based structures: the vertex table, the edge table, and a routing table that records which edge partitions reference each vertex
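The vertex-cut layout is chosen through a PartitionStrategy; a sketch of repartitioning an existing graph (any of the built-in strategies could be substituted):

import org.apache.spark.graphx.{Graph, PartitionStrategy}

// Repartition the edges with the 2D strategy, which bounds vertex replication
// at roughly 2 * sqrt(numParts) copies per vertex.
def repartition[VD, ED](g: Graph[VD, ED]): Graph[VD, ED] =
  g.partitionBy(PartitionStrategy.EdgePartition2D)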
Graph computation
Base model: BSP (Bulk Synchronous Parallel); a computation is split into supersteps, and each superstep has three phases (concurrent local computation, communication, barrier synchronization)
Models built on top of BSP:
Message-passing model (Pregel): think-like-a-vertex, the user implements a vertex update function
GAS model (GraphLab): neighborhood-update model, the user implements the gather, apply and scatter functions
GraphX adopts a GAS-style decomposition internally and also exposes a Pregel-style vertex-centric API (see the sketch below)
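A sketch of that vertex-centric API: single-source shortest paths written against graph.pregel, following the standard GraphX usage pattern; it assumes an existing SparkContext sc (as created in the example at the end) and picks vertex 42 as an arbitrary source.

import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.graphx.util.GraphGenerators

// Synthetic input graph with Double edge weights
val ssspInput: Graph[Long, Double] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
val sourceId: VertexId = 42L
// Start at distance 0 for the source and infinity everywhere else
val initialGraph = ssspInput.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)
val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),   // vertex program (apply): keep the shorter distance
  triplet => {                                       // send messages (scatter): relax each edge
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    }
  },
  (a, b) => math.min(a, b)                           // merge messages (gather)
)
println(sssp.vertices.take(5).mkString("\n"))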
Toolkit (built-in algorithms; a usage sketch follows the list)
PageRank
Connected components
Strongly connected components
Shortest Path
SVD++
ALS
K-core
Triangle Count
LDA
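Most of these are available either as methods on the graph or as objects under org.apache.spark.graphx.lib. A sketch of calling a few of them, assuming the SparkContext sc from the example below, a hypothetical edge-list file, and vertex 1 as an arbitrary landmark:

import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}
import org.apache.spark.graphx.lib.ShortestPaths

val g = GraphLoader.edgeListFile(sc, "data/edges.txt")           // hypothetical input path
val ranks      = g.pageRank(0.0001).vertices                     // PageRank score per vertex
val components = g.connectedComponents().vertices                // component id per vertex
val scc        = g.stronglyConnectedComponents(numIter = 10).vertices
val triangles  = g.partitionBy(PartitionStrategy.RandomVertexCut)
                  .triangleCount().vertices                      // triangle count per vertex
val paths      = ShortestPaths.run(g, landmarks = Seq(1L)).vertices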
Example
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)
//设置运行环境
val conf = new SparkConf().setAppName("GX").setMaster("local")
val sc = new SparkContext(conf)
//Vertex attribute type VD: (String, Int), i.e. (name, age)
val vertexArray = Array(
(1L, ("Alice", 28)),
(2L, ("Bob", 27)),
(3L, ("Charlie", 65)),
(4L, ("David", 42)),
(5L, ("Ed", 55)),
(6L, ("Fran", 50))
)
//Edge attribute type ED: Int (edge weight)
val edgeArray = Array(
Edge(2L, 1L, 7),
Edge(2L, 4L, 2),
Edge(3L, 2L, 4),
Edge(3L, 6L, 3),
Edge(4L, 1L, 1),
Edge(5L, 2L, 2),
Edge(5L, 3L, 8),
Edge(5L, 6L, 3)
)
//Build the vertex RDD
val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
//Build the edge RDD
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
//Build the graph Graph[VD, ED]
val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)
println("图中年龄大于30的顶点:")
graph.vertices.filter { case (id, (name, age)) => age > 30}.collect.foreach {
case (id, (name, age)) => println(s"$name is $age")
}
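A short sketch on the same graph: the triplet view joins each edge with the attributes of its two endpoints (EdgeTriplet exposes srcAttr, dstAttr and attr).

println("Who is connected to whom, with edge weight:")
graph.triplets.collect.foreach { t =>
  println(s"${t.srcAttr._1} -> ${t.dstAttr._1} (weight ${t.attr})")
}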
Source: oschina
Link: https://my.oschina.net/u/98127/blog/916757