GAIA-IR: Graph Interactive Query Engine in GraphScope

GAIA-IR In this blog, we introduce GAIA-IR, an interactive graph query engine for GraphScope. GAIA-IR not only showcases exceptional efficiency in handling Gremlin queries within a distributed framework but also present a unified intermediate representation layer that offers remarkable adaptability for incorporating additional query languages. This feature makes the engine scalable, efficient, and user-friendly, rendering it an invaluable tool for individuals engaged in graph databases.

Background and Challenges in Graph Query

Graph querying is an pivotal tool in the analysis of massive data. Gremlin, which is an industry-standard graph query language proposed and maintained by Apache Tinkerpop, is widely used in popular graph databases such as OrientDB, JanusGraph, Microsoft Cosmos DB, and Amazon Neptune. The graph query engine GAIA in GraphScope is the pioneering system that embrace Gremlin for large-scale distributed environment in the industry. While the versatility exhibited by Gremlin remains its standout advantage, our exploration during GAIA’s design and implementation to support Gremlin has uncovered certain challenges.

Drawbacks in GAIA System

The GAIA system has the following drawbacks:

D1: There are a large number of Gremlin operators, and there are also many different ways of expressing the same semantics in graph queries by Gremlin operators. For example, if we want to project entries’ properties, we can get similar results using different operators in Gremlin, such as elementMap(), valueMap(), values(), select().valueMap(), project().valueMap(), etc. An example is shown below:

gremlin> g.V().elementMap() 
==>[id:1,label:person,name:marko,age:29] 
==>[id:2,label:person,name:vadas,age:27]

gremlin> g.V().valueMap('name','age') 
==>[name:[marko],age:[29]] 
==>[name:[vadas],age:[27]]

gremlin> g.V().as('a').select('a').by(valueMap('name', 'age')) 
==>[name:[marko], age:[29]] 
==>[name:[vadas], age:[27]]

gremlin> g.V().as('a').project('a').by(valueMap('name', 'age')) 
==>[a:[name:[marko], age:[29]]] 
==>[a:[name:[vadas], age:[27]]]

This leads to the redundancy implementations in GAIA to support such Gremlin operators, which is not developer-friendly and has poor scalability.

D2: GAIA has poor language extensibility. GAIA is a customized implementation of parallel Gremlin queries, while there are now many other commonly used graph querying languages, such as Cypher, GSQL, and the upcoming graph query standard language GQL. If we need to further integrate more querying languages in the future, it is almost impossible to achieve this by extending GAIA.

D3: Gremlin language has poor support for complex expressions. For example, we may want to find people in the two-degree neighborhood of person 'a' that meet certain 'age' property conditions through the following Gremlin query statement:

g.V().as('a').out().as('b').out().as('c')
 .where('c', P.lt('a').or(P.gt('a').and(P.gt('b')))).by('age')

Complex nested conditional filtering, like in where(), is not intuitive and user-friendly.

D4: GAIA lacks well-defined Gremlin syntax specifications and it is difficult to define the scope of support for Gremlin operators and operator combinations in the current system, which is not user-friendly.

Solutions in GAIA-IR System

To address the above issues, we further proposed the language-independent and more general intermediate representation layer GAIA-IR (IR in short) to describe the common semantics in graph queries. The operators we abstracted can be divided into two categories: relational operators and graph-related operators. Relational operators mainly correspond to operations on traditional relational databases, such as Projection, Selection, GroupBy, OrderBy, etc., while graph-related operators correspond to operations on graph data, such as queries on graph vertices, graph edges, etc. Through this query language-independent intermediate representation layer, we can solve the problems in GAIA as follows:

A1. GAIA-IR uses a unified intermediate representation to express operators of similar semantics in Gremlin. For example, we abstract the Project operator to unify the various property acquisition operations in Gremlin mentioned in D1.

A2. GAIA-IR is independent of the graph query language, which makes it easier to integrate more languages. We only need to translate the operators in different languages to the unified IR operators to naturally support the parallelized implementations, without the need to design for each query language.

A3. GAIA-IR also provides rich expression support to meet user needs. For example, compared with the example in D3, adding expression support in the where() operator would be much more intuitive:

g.V().as('a').out().as('b').out().as('c')
 .where(expr("@c.age < @a.age || (@c.age > @a.age && @a.age > @b.age)"))

A4: GAIA-IR introduces the Antlr tool to support Gremlin syntax checking and clarifies the system’s support scope for Gremlin operators and combinations, which is more user-friendly.

Design of GAIA-IR

Next, we will introduce the overall design of GAIA-IR.

Concepts in GAIA-IR

First, we introduce some basic concepts in GAIA-IR. GAIA-IR abstracts the basic computations on graph data, providing a unified, concise, and language-independent intermediate representation layer.

IR Operators

Currently, we abstract operators (Graph-Relational Algebra) into two categories: relational operators and graph-related operators:

  • Relational operators include: Projection, Selection, Join, GroupBy, OrderBy, DeDup, Limit, etc., which are consistent with operations on traditional relational databases.
  • Graph-related operators include: GetV, E(dge)-Join, P(ath)-Join, for searching for vertex properties, neighbors (edges), and paths on the graph respectively.

Through these operator abstractions, we can support most graph query scenarios. At the same time, these operator abstractions are not limited by the querying language, making it easy to extend to other query languages.

IR Data Structure

In GAIA-IR, we defined the data structure GRecord to represent the input and output of each IR Operator. GRecord is a multi-column structure, and each column has its own alias and value:

  • Alias, which is similar to the alias by As in SQL. Specifically, to adapt to Gremlin, we provide an additional unique alias – "HEAD" as an anonymous alias, referring to the output of the previous operator, which is also the input of the current operator.
  • Value, which is of data types includes: 1) Simple data type CommonObject (including int/string/intArray/stringArray, etc.) and 2) Graph data type GraphObject (including Vertex, Edge, and Path).

Next, we give an example to elaborate how to execute a gremlin query in GAIA-IR.

Gremlin Query Example

We support Gremlin by translating Gremlin queries into a series of IR Operators on GRecord. An example is shown below:

g.V().as('a').select('a').by(valueMap('name','age'))

First, g.V().as('a') will produce the following intermediate results, with the alias 'a' and data type Vertex:

GR1 Vertex { name:[marko], age:[29] }, Alias: ‘a’
GR2 Vertex { name:[vadas], age:[27] }, Alias: ‘a’

Then select('a').by(valueMap('name','age')) would be translated into Project('{a.name,a.age}'). With GR1 and GR2 as the input of Project, we can get the output GR1' and GR2', which are the properties we need:

GR1’ CommonObject { a.name:[marko], a.age:[29] }
GR2’ CommonObject { a.name:[vadas], a.age:[27] }

Similarly, for the Gremlin query g.V().valueMap('name','age'), we only need to change the aliases of GR1 and GR2 to the anonymous alias "HEAD" and translate valueMap('name','age') into Project("{HEAD.name,HEAD.age}"), which gives the same result. Thus, we can translate Gremlin queries with similar semantics but different operators into a unified intermediate representation. Moreover, for other languages, such as retrieving properties in SQL, we can also translate into the Project operator in GAIA-IR. This shows that GAIA-IR abstracts a more concise, general, and language-independent intermediate representation layer.

GAIA-IR System Architecture

Next, we present the parallel computing architecture of GAIA-IR for Gremlin, as shown in the following figure:

GAIA-IR-SYSTEM-ARCH

Overall, we are compatible with both the official Gremlin Console and Gremlin SDK. After the user submits a Gremlin query:

  1. The frontend IR Compiler is responsible for syntax checking of the query. For a valid query, the IR Compiler compiles the query syntax tree into a logical plan consisting of IR Operators through the IR Library API, and further calls the IR Library API to generate a physical plan, which is then sent to the distributed backend engine servers.
  2. The backend engine servers pre-launch the distributed graph partitions during the service start-up phase. Upon receiving the physical plan sent by the IR Compiler, the IR Runtime is responsible for parsing the physical plan and constructing an engine-executable execution plan. Specifically, for each IR Operator, the IR Runtime is responsible for generating the corresponding engine-understandable UDF to implement the specific query semantics of the IR Operator. After the calculation is completed, the IR Runtime returns the results back to the IR Compiler, which will be further parsed and returned to the client.

How to Use GAIA-IR

After introducing the overall design of GAIA-IR, we will now explain how to use the GAIA-IR for graph querying.

Deployment

In our previous blogs, we introduced how to deploy GraphScope. As an important implementation of GIE in GraphScope, the overall launch method of GAIA-IR is consistent with GraphScope. Taking Helm deployment as an example, the installation command is as follows:

helm repo add graphscope https://graphscope.oss-cn-beijing.aliyuncs.com/charts/
helm install [RELEASE_NAME] graphscope/graphscope-store

More detailed deployment options can be found in the official document.

Gremlin Query

After successfully launching the service, we can query through the Gremlin Server host and port. Taking Gremlin Console as an example, after the service is successfully launched and the data is imported (the specific data import steps can be found in the official document), we can query by configuring the Gremlin Console. An example is as follows:

  1. First, we modify the host and port in the conf/remote.yaml configuration file of the Gremlin Console.
  2. Open the Gremlin Console, input the configuration of remote.yaml, and then you can start querying.
gremlin> :remote connect tinkerpop.server conf/remote.yaml
==>Configured localhost/127.0.0.1:8182
gremlin> :remote console
==>All scripts will now be sent to Gremlin Server - [localhost/127.0.0.1:8182] - type ':remote console' to return to local mode
gremlin> g.V().valueMap('name','age') 
==>[name:[marko],age:[29]] 
==>[name:[vadas],age:[27]]

Conclusion

This article describes the design intention and overall architecture of GAIA-IR, as well as how to use the GAIA-IR engine for graph querying. You can find the current release version on GitHub. As the interactive graph query engine of GraphScope, GAIA-IR provides an efficient implementation of parallel Gremlin queries. At the same time, based on the unified intermediate representation of IR, we are supporting more graph query languages, e.g., Cypher. Besides, we also introduce optimization techniques in supporting important scenarios such as pattern matching in graph queries. In future articles, we will introduce more technical details. We will continue to optimize the GAIA-IR system, and we warmly welcome feedback and contributions from the community.