Simplifying Complex Graph Loading with Jupyter Notebook

jupyter-notebook Schema construction and graph data loading are usually the complicated steps in graph computing processes. Currently, GraphScope has released a graphscope-notebook plugin which through an interactive way help users complete the graph loading in the Jupyterlab environment. This article will provide a detailed introduction to the use of this plugin, and users can try it in the Playground environment.

Background

For any graph computing product, the loading of graph data is often the first and most important step, as well as a very complicated step, mainly due to the complexity of the graph data itself. Therefore, in order to improve the loading experience, GraphScope has built-in various datasets. For example, for the TinkerPop Modern Graph, users just need one statement to complete the loading operation:

>>> from graphscope.dataset import load_modern_graph
>>> modern_graph = load_modern_graph()

However, for the user’s dataset, the loading process needs to define a very long code, we use ogbn-mag this property graph as an example:

modern-graph-schema

This graph has four kinds of vertices, labeled as paper, author, institution and field_of_study. There are four kinds of edges connecting them, each kind of edges has a label and specifies the vertex labels for its two ends. For example, cites edges connect two vertices labeled paper. Another example is writes, it requires the source vertex is labeled author and the destination is a paper vertex. All the vertices and edges may have properties. e.g., paper vertices have properties like features, publish year, subject label, etc.

Usually, each type of vertex(edge) corresponds to a csv file, which can be downloaded from here.

$ tree
├── author_affiliated_with_institution.csv
├── author.csv
├── author_writes_paper.csv
├── field_of_study.csv
├── institution.csv
├── paper_cites_paper.csv
├── paper.csv
└── paper_has_topic_field_of_study.csv

Finally, for the user, the actual loading code is as follows:

modern-graph-python-loading-code

We can see that even for such a flexible language as Python, the schema definition is very complicated for the above-mentioned property graph containing 4 labels, not to mention hundreds of labels of vertices and edges. Even each vertex(edge) may have thousands of properties.

Therefore, in order to reduce the complexity and error rate of the loading process, GraphScope has developed a graphscope-notebook plugin, which can help GraphScope to complete the loading process of complex graph data interactively in the Jupyterlab environment. Currently, the plugin has been deployed in the GraphScope Playground environment, and everyone is welcome to try it out. Next, this article will use this above-mentioned property graph as an example to introduce in detail how to load graph interactively in the Jupyterlab environment.

Plugin Installation

The plugin requires the following conditions:

  • JupyterLab >= 3.0
  • GraphScope >= 0.12.0

You can install the plugin with the following command:

pip3 install graphscope-notebook

It is worth noting that after the plugin installation is complete, you need to restart Jupyterlab. Finally, if the left sidebar displays as following, it means that the installation is successful; Or you can find it in GraphScope Community to report problems encountered.

plugin-installation-successful

Using the Plugin

First, we run the following code in the Jupyterlab to create a GraphScope Session.

>>> import graphscope
>> graphscope.set_option(show_log=True)
>>> sess = graphscope.session(cluster_type="hosts")

After the Session is created, we can monitor the resource in the left resource panel. Click the “+” button to open the interactive page on the right side of the notebook:

open-interactive-page

1. Create Vertex

Next, we can complete the graph data construction according to the prompts on the interactive page. Take the “paper” type of vertex as an example. The page for creating points is as follows. After the information is filled in, click “Create Vertex” to complete the construction of the point.

Next, we can complete the construction of the graph data according to the prompts on the interactive page. Taking the “paper” type of vertex as an example. After filling in the information, click “Create Vertex” to complete the vertex construction.

create-vertex

The explanations for each field are as follows:

field comment
Label the label of the vertex
Data Source local file represents local data files; online file represents network files, such as OSS, etc.
Location the path of the data source
Header Row If true, the column name will be read from the first row of the source file
Delimiter optional values are “,” “;”, “ “, “\t”
Extra Params additional parameters required by data loading, such as OSS key/secret and endpoint
ID Field which column in the source file is selected as the ID, it can be a number like 0, 1, 2, or a string represents the property name
Select all Properties If true, all properties are loaded, otherwise the properties to be loaded need to be specified

2. Create Edge

Similar to “Create Vertex”, take the “cites” type of edge as an example (paper -> cites -> paper). After the information is filled in, click “Create Edge” to complete the construction of the edge.

create-edge

field comment
Edge Only True for cases where only one edge file and no vertex file
Src Vertex Label the label of source vertex
Dst Vertex Label the label of destination vertex
Src Vertex Field which column used as the ID of the source vertex, it can be a number like 0, 1, 2, or a string represents the property name
Dst Vertex Field which column used as the ID of the destination vertex, it can be a number like 0, 1, 2, or a string represents the property name

define-graph-info

After building all types of vertices/edges, we set the graph-related information:

field comment
Name the name of the graph
OID Type the original vertex type of the graph, optional values are “string” and “int64”
Directed whether the graph is directed or not
Generate EID whether to generate a unique id for each edge. Set True if you need to use the GIE service

4. Generate Code and Load Graph

After all the information above is filled in, select one “Notebook Cell” with the mouse, and click the “Generate Code” button to generate the corresponding graph loading code. Run this cell to finish the data loading process, and monitor the graph resource in the left resource panel:

dataloading-in-cell

So far, we have successfully used the graphscope-notebook plugin to complete the graph data loading process. Next, we can refer to the documentation to play with GraphScope.

Conclusion

The graphscope-notebook jupterlab plugin currently has the functions of 1) monitoring GraphScope runtime resources; 2) loading graph through an interactive way to reduce the complexity during graph loading process. In addition, we also plan to add visualization analysis of graph data in subsequent versions. Stay tuned!