Persistent storage of graphs on the Kubernetes cluster#

If you want to persistently store specific graphs that have been calculated over a long period of time on the Kubernetes cluster and restore them later, this document provides step-by-step instructions on how to do this with the Kubernetes PersistentVolumes.

Prerequisites#

  • You have a Kubernetes cluster on hand. If you don’t have a Kubernetes cluster, please refer to Prepare a Kubernetes cluster for details.

  • You have the graphscope Python library installed. If you don’t have installed it, please refer to Install GraphScope Client for details.

Create a pv and pvc#

$ kubectl create namespace graphscope-system

Then create the pv as follows, the pv will be mounted to /var/vineyard/dump in the Kubernetes node. You can change the path to any other path you want.

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: graphscope-pv
  labels:
    app.kubernetes.io/name: test-pv
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /var/vineyard/dump
  storageClassName: manual
EOF

Create pvc as follows. Most importantly, the pvc can’t be deleted, otherwise the data will be lost.

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: graphscope-pvc
  namespace: graphscope-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: test-pv
  resources:
    requests:
      storage: 1Gi
  accessModes:
    - ReadWriteOnce
  storageClassName: manual
EOF

Store graphs to the pvc#

After the above preparations are completed, you can deploy the graphscope cluster as follows:

import graphscope
import os
import vineyard
from graphscope.dataset import load_modern_graph

# export the gs_test_dir to the environment variable
k8s_volumes = {
    "data": {
        "type": "hostPath",
        "field": {"path": os.environ["GS_TEST_DIR"], "type": "Directory"},
        "mounts": {"mountPath": "/testingdata"},
    }
}

# create a graphscope session with the external 
# vineyard deployment.
#  
# Notice, the num_workers should not be greater than the
# number of nodes in the kubernetes cluster.  
sess = graphscope.session(
    num_workers=1,
    k8s_image_registry="docker.io",
    k8s_image_tag="ccc",
    k8s_namespace="graphscope-system",
    k8s_vineyard_deployment="vineyardd-sample",
    k8s_volumes=k8s_volumes,
)

# load modern graph 
graph = load_modern_graph(sess, "/testingdata/modern_graph")

# create the gie instance
interactive = sess.gremlin(graph)

# get the subgraph
sub_graph = interactive.subgraph(
    'g.V().hasLabel("person").outE("knows")'
)

# project the projected graph to simple graph.
simple_g = sub_graph.project(vertices={"person": []}, edges={"knows": []})

pr_result = graphscope.pagerank(simple_g, delta=0.8)
tc_result = graphscope.triangles(simple_g)

# add the PageRank and triangle-counting results as new columns to the property graph
sub_graph.add_column(pr_result, {"Ranking": "r"})
sub_graph.add_column(tc_result, {"TC": "r"})

# print the simple graph and subgraph's vineyard_id
# REMEMBER the several vineyard ids, you need to use them to restore the graphs next time.
print(simple_g.vineyard_id)
# REMEMBER THIS: 997255889378630
print(sub_graph.vineyard_id)
# REMEMBER THIS: 997163552113975

# store the simple graph and subgraph to the pvc
# use the previous path of the pv and the pvc name here
sess.store_graphs_to_pvc(
    graphIDs=[vineyard.ObjectID(simple_g.vineyard_id), vineyard.ObjectID(sub_graph.vineyard_id)],
    path="/var/vineyard/dump",
    pvc_name="graphscope-pvc",
)

# check the simple graph's schema
print(simple_g.schema)
# oid_type: LONG
# vid_type: ULONG
# type: VERTEX
# Label: person
# Properties: 
#
# type: EDGE
# Label: knows
# Properties: 
# Relations: [Relation(source='person', destination='person')]

# check the subgraph's schema
print(sub_graph.schema)

# oid_type: LONG
# vid_type: ULONG
# type: VERTEX
# Label: person
# Properties: Property(0, name, STRING), Property(1, age, INT), Property(2, id, LONG)
#
# type: VERTEX
# Label: software
# Properties: Property(0, name, STRING), Property(1, lang, STRING), Property(2, id, LONG)
#
# type: EDGE
# Label: created
# Properties: Property(0, eid, LONG), Property(1, weight, DOUBLE)
# Relations: [Relation(source='person', destination='software')]
# type: EDGE
# Label: knows
# Properties: Property(0, eid, LONG), Property(1, weight, DOUBLE)
# Relations: [Relation(source='person', destination='person')]

# close the session
sess.close()

Retore graphs from the pvc#

Remember the vineyard ids printed above and the pvc name and then you can restore the graphs from the pvc as follows.

import graphscope
import os
import vineyard

# create a graphscope session with the external 
# vineyard deployment.
#  
# Notice, the num_workers should not be greater than the
# number of nodes in the kubernetes cluster.  
sess = graphscope.session(
    num_workers=1,
    k8s_image_registry="docker.io",
    k8s_image_tag="ccc",
    k8s_namespace="graphscope-system",
    k8s_vineyard_deployment="vineyardd-sample",
)

# load graphs from the pvc
sess.restore_graphs_from_pvc(
    path="/var/vineyard/dump",
    pvc_name="graphscope-pvc"
)

# get the simple graph and subgraph
simple_g = sess.g(vineyard.ObjectID(997255889378630))
sub_graph = sess.g(vineyard.ObjectID(997163552113975))

# check the graphs' schema
print(simple_g.schema)
# oid_type: LONG
# vid_type: ULONG
# type: VERTEX
# Label: person
# Properties: 
#
# type: EDGE
# Label: knows
# Properties: 
# Relations: [Relation(source='person', destination='person')]

print(sub_graph.schema)
# oid_type: LONG
# vid_type: ULONG
# type: VERTEX
# Label: person
# Properties: Property(0, name, STRING), Property(1, age, INT), Property(2, id, LONG)
#
# type: VERTEX
# Label: software
# Properties: Property(0, name, STRING), Property(1, lang, STRING), Property(2, id, LONG)
#
# type: EDGE
# Label: created
# Properties: Property(0, eid, LONG), Property(1, weight, DOUBLE)
# Relations: [Relation(source='person', destination='software')]
# type: EDGE
# Label: knows
# Properties: Property(0, eid, LONG), Property(1, weight, DOUBLE)
# Relations: [Relation(source='person', destination='person')]

Clean up#

If you don’t need the graphs anymore, you can delete the pvc and pv as follows.

$ kubectl delete pvc graphscope-pvc -n graphscope-system
$ kubectl delete pv graphscope-pv