Tutorial: Run GraphX Applications on GraphScope
Apache Spark is a widely used engine for large-scale data analytics. Spark GraphX is Spark's graph computing module, which provides a flexible and efficient graph computation framework.
GraphScope is designed to integrate with Spark GraphX. Users can easily deploy a GraphScope cluster co-located with a Spark cluster. By switching from SparkSession to GSSparkSession, users can see up to a 7x performance improvement when running GraphX algorithms.
Deploy GraphScope along with Spark
We assume you already have a Spark cluster deployed. If you do not, please refer to spark-cluster-overview to deploy one. Spark 3.1.3 has been tested to be compatible with GraphScope.
GraphScope is distributed as a Python package. Since GraphScope supports only Python 3, upgrade your Python environment if necessary before proceeding.
Then, on the client side, use venv to create a virtual environment that contains the graphscope package, and venv-pack to pack it into an archive.
pip3 install virtualenv venv-pack
python -m venv pyspark_venv
source pyspark_venv/bin/activate
pip3 install graphscope
venv-pack -o pyspark_venv_gs.tar.gz
Now pyspark_venv_gs.tar.gz contains the environment GraphScope needs. Every time you submit a job to your Spark cluster, remember to upload this archive.
export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_venv_gs.tar.gz#environment ...
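Before submitting, it can help to confirm that the archive contains an interpreter at the path PYSPARK_PYTHON points to. The following is a minimal sketch, not part of GraphScope or Spark: check_env_pack is a hypothetical helper name, and the check assumes the archive was produced by venv-pack with bin/python at its root.

```shell
# Hypothetical pre-flight check (not a GraphScope or Spark command).
# Returns 0 if the archive exists and has bin/python at its root, which is
# what PYSPARK_PYTHON=./environment/bin/python expects after unpacking.
check_env_pack() {
    archive="$1"
    if [ ! -f "$archive" ]; then
        echo "archive not found: $archive" >&2
        return 1
    fi
    # List the archive once; fail if tar cannot read it.
    listing=$(tar -tzf "$archive") || return 1
    if ! printf '%s\n' "$listing" | grep -Eq '^(\./)?bin/python$'; then
        echo "no bin/python at the root of $archive" >&2
        return 1
    fi
}
```

For example, `check_env_pack pyspark_venv_gs.tar.gz` could be run once after venv-pack, so that a broken archive is caught on the client instead of on the executors.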
Run example GraphX apps
Several GraphX algorithms are included in grape-demo.jar, so you can try running them on GraphScope.
You can download the p2p dataset and grape-demo.jar with the following commands.
wget https://raw.githubusercontent.com/GraphScope/gstest/master/p2p-31.e -O /home/graphscope/p2p-31.e
wget https://graphscope.oss-cn-beijing.aliyuncs.com/jar/grape-demo-0.19.0-shaded.jar -O /home/graphscope/grape-demo-0.19.0-shaded.jar
Unlike Giraph-on-GraphScope, the GraphX-GraphScope integration requires submitting jobs to the Spark cluster rather than through the GraphScope Python client.
Submit to Spark
# The paths to the GraphScope jars are needed to run GraphX algorithms on GraphScope.
# FIXME(yuansi): Here we assume env var GRAPHSCOPE_HOME available in environment.
export GS_JARS=`ls ${GRAPHSCOPE_HOME}/lib/grape-graphx-*.jar`:`ls ${GRAPHSCOPE_HOME}/lib/grape-runtime-*.jar`
# default port is 7077, for standalone cluster, like spark://${host}:${port}
/bin/spark-submit --verbose --master spark://${master_url} \
--archives pyspark_venv_gs.tar.gz#environment --jars ${GS_JARS} \
--conf spark.executor.instances=2 \
--conf spark.driver.memory=2g \
--conf spark.executor.memory=10g \
--conf spark.scheduler.minRegisteredResourcesRatio=1.0 \
--conf spark.gs.submit.jar=/home/graphscope/grape-demo-0.19.0-shaded.jar \
--class com.alibaba.graphscope.example.graphx.BFSTest \
/home/graphscope/grape-demo-0.19.0-shaded.jar /home/graphscope/p2p-31.e 2 1
Remember to replace placeholders such as ${master_url} with the actual cluster URL.
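The two inputs that most often go wrong in the command above are the jar list and the master URL. The helpers below are a minimal sketch under stated assumptions: gs_jars and spark_master_url are hypothetical names, the jar glob patterns and the ":" separator are taken from the GS_JARS line above, and picking only the first match of each pattern is an assumption made here to keep the result well defined.

```shell
# Hypothetical helpers (not part of GraphScope or Spark).

# Print "<grape-graphx jar>:<grape-runtime jar>" found under <home>/lib,
# or fail with a message if either jar is missing.
gs_jars() {
    gs_home="$1"
    # Assumption: if several jars match, use the first one.
    graphx_jar=$(ls "$gs_home"/lib/grape-graphx-*.jar 2>/dev/null | head -n 1)
    runtime_jar=$(ls "$gs_home"/lib/grape-runtime-*.jar 2>/dev/null | head -n 1)
    if [ -z "$graphx_jar" ] || [ -z "$runtime_jar" ]; then
        echo "GraphScope jars not found under $gs_home/lib" >&2
        return 1
    fi
    printf '%s:%s\n' "$graphx_jar" "$runtime_jar"
}

# Print a standalone master URL, spark://<host>:<port>.
# The port defaults to 7077, the standalone default mentioned above.
spark_master_url() {
    printf 'spark://%s:%s\n' "$1" "${2:-7077}"
}
```

A possible use is `export GS_JARS=$(gs_jars "${GRAPHSCOPE_HOME}") || exit 1`, which stops early when GRAPHSCOPE_HOME is unset or wrong. Note that spark_master_url prints the full URL including the spark:// prefix, so it would be passed as `--master "$(spark_master_url myhost)"`.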
Run customized GraphX apps
To develop your own GraphX algorithms that run on GraphScope, program against the RDD interfaces provided by Spark GraphX. GraphScope supports all GraphX interfaces.
Include dependency
Include the grape-graphx dependency in your project's pom.xml.
<dependency>
  <groupId>com.alibaba.graphscope</groupId>
  <artifactId>grape-graphx</artifactId>
  <classifier>shaded</classifier>
  <scope>provided</scope>
  <version>0.19.0</version>
</dependency>
You also need to configure maven-shade-plugin with the following configuration so that class conflicts are resolved correctly.
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <goals>
        <goal>shade</goal>
      </goals>
      <phase>package</phase>
      <configuration>
        <filters>
          <filter>
            <artifact>org.apache.spark:*</artifact>
            <includes>
              <include>org/apache/spark/**</include>
            </includes>
          </filter>
        </filters>
        <relocations>
          <relocation>
            <pattern>org.apache.spark.graphx</pattern>
            <shadedPattern>org.apache.spark.gs.graphx</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
Develop customized GraphX algorithms for GraphScope
In addition to the interfaces provided by GraphX, GraphScope provides some GraphScope-only features via GSSparkSession. Users should use GSSparkSession instead of SparkSession to make their algorithms runnable on GraphScope.
GSSparkSession extends SparkSession with the following new methods.
/** GraphScope-related parameter: sets the vineyard memory size.
  */
def vineyardMemory(memoryStr: String): Builder =
  config("spark.gs.vineyard.memory", memoryStr)

/** GraphScope vineyard socket file. The vineyard process should be bound to this address on all workers.
  */
def vineyardSock(filePath: String): Builder = {
  config("spark.gs.vineyard.sock", filePath)
}

/** Users need to specify the path of the jar submitted to the Spark cluster.
  */
def gsSubmitJar(filePath: String): Builder = {
  config("spark.gs.submit.jar", filePath)
}

// Convert a GraphX Graph to a GrapeGraph.
def toGSGraph[VD: ClassTag, ED: ClassTag](
    graph: Graph[VD, ED]
): GrapeGraphImpl[VD, ED] = {
  // implementation omitted in this listing
}

// Load a GrapeGraph from files.
def loadGraphToGS[VD: ClassTag, ED: ClassTag](
    vFilePath: String,
    eFilePath: String,
    numPartitions: Int
): GrapeGraphImpl[VD, ED] = {
  // implementation omitted in this listing
}
Run customized GraphX algorithms on Spark with GraphScope support
A significant performance improvement is observed when running GraphX algorithms on GraphScope rather than on GraphX. To enable GraphScope support, add the necessary arguments to the spark-submit command when submitting your job, as in the "Submit to Spark" section above. Remember to change the jar name, app name, and parameters.
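As a rough illustration of which arguments change for a custom app, the sketch below prints a spark-submit command line without running it. print_gs_submit is a hypothetical helper, and the jar path, class name, and app arguments in the usage line are placeholders. It reuses the flags from the example submission and assumes GS_JARS is already exported. The resource-related --conf options are left out for brevity.

```shell
# Hypothetical dry-run helper (not part of GraphScope or Spark).
# Usage: print_gs_submit <master host:port> <app jar> <main class> [app args...]
# Prints the spark-submit command for a custom GraphX app; it does not run it.
print_gs_submit() {
    if [ "$#" -lt 3 ]; then
        echo "usage: print_gs_submit <host:port> <app jar> <main class> [args...]" >&2
        return 2
    fi
    master_url="$1"
    app_jar="$2"
    main_class="$3"
    shift 3
    # The app jar appears twice: once for spark.gs.submit.jar and once as
    # the application jar, mirroring the example submission.
    printf '%s ' spark-submit --verbose \
        --master "spark://${master_url}" \
        --archives "pyspark_venv_gs.tar.gz#environment" \
        --jars "${GS_JARS}" \
        --conf "spark.gs.submit.jar=${app_jar}" \
        --class "${main_class}" \
        "${app_jar}" "$@"
    printf '\n'
}
```

For instance, `print_gs_submit myhost:7077 /path/to/my-app.jar com.example.MyApp /path/to/graph.e` prints a command that can be reviewed before it is run by hand.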