NeuG v0.1.3: The Era of Graph Computing Inside the Graph Database


NeuG v0.1.3 is out. The theme of this release is giving the graph database native graph computing capabilities. The GDS extension brings 9 graph algorithms, and COPY TEMP lets you load external data for temporary analysis without polluting your production database.

Here’s why we built this, and what exactly shipped.


What NeuG Has Always Been Missing: Graph Computing

The core capability of graph databases is graph storage — representing entities and relationships as nodes and edges. With Cypher’s MATCH statement, you can do pattern-matching queries: find a node’s neighbors, match a path, filter subgraphs by conditions.

But there’s a class of problems that Cypher queries struggle to solve efficiently:

  • How many communities are in this graph? Which nodes belong together?
  • Which node is the most influential hub?
  • What are the shortest paths from node A to all other nodes?
  • What is the local clustering coefficient of each node?

These are graph algorithm problems, not graph query problems. The distinction: a graph query says “I know what pattern I’m looking for, find it”; a graph algorithm says “compute some structural property across the entire graph.”
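
To make the distinction concrete, here is a minimal pure-Python sketch on a hypothetical five-node graph. The node names and edges are made up for illustration and the code does not use NeuG's API. The query touches a single adjacency list, while the algorithm has to visit every node to count connected components.

```python
from collections import deque

# Hypothetical undirected toy graph as an adjacency map (not NeuG data).
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"e"},
    "e": {"d"},
}

def neighbors(g, node):
    """Graph query: a local lookup for a known pattern (node -> neighbor)."""
    return sorted(g[node])

def connected_components(g):
    """Graph algorithm: a global property that requires visiting every node."""
    seen, components = set(), []
    for start in sorted(g):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            current = queue.popleft()
            component.append(current)
            for nxt in sorted(g[current]):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(sorted(component))
    return components

print(neighbors(graph, "a"))
print(len(connected_components(graph)))
```

The first call answers "what is next to a?", and the second answers "how is the whole graph partitioned?", which is the kind of question GDS targets.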

Previously, because NeuG did not support graph algorithms, you had to export the data, hand it off to NetworkX or another graph analysis library, compute the results, and import them back. That meant data movement, format conversion, and environment setup: the usual engineering overhead.

The GDS extension in v0.1.3 brings graph algorithms directly into the graph database. No data export, no second environment to maintain — just CALL and go.


GDS: 9 Algorithms, One API

Algorithm Overview

The GDS extension covers the three most common categories of graph algorithms:

Category | Algorithms | What It Solves
Traversal & Centrality | WCC, BFS, SSSP, PageRank | What’s reachable from A? Shortest path? How many connected components? Which node is most influential?
Community Detection | Louvain, Leiden, CDLP | How many communities? Which nodes belong together?
Structural Analysis | LCC, K-Core | How connected is the graph? Local structure density?

Unified Calling Convention

All algorithms follow the same pattern: project a subgraph, then call the algorithm.

-- 1. Project a subgraph (define which nodes and edges participate)
CALL project_graph(
    'social',
    ['person'],
    {'[person, knows, person]': ''}
);

-- 2. Call the algorithm
CALL page_rank('social', {max_iterations: 20})
RETURN node.fName, rank
ORDER BY rank DESC;

-- 3. Clean up
CALL drop_projected_graph('social');

project_graph defines the computation scope — which node labels and edge types participate. After projection, all algorithms use the unified CALL algo_name('graph_name', {options}) interface. Return values can be processed with standard Cypher clauses like ORDER BY, WHERE, and RETURN.
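
When these statements are issued from a host language, it can help to generate them from data. The sketch below is a hypothetical helper (not part of NeuG) that only formats strings in the shape of the example above. It assumes the names are trusted identifiers, does no escaping, and copies the empty-string value in the edge map from the example without interpreting it.

```python
def project_graph_stmt(graph_name, node_labels, edge_triples):
    """Build a CALL project_graph(...) statement in the shape shown above."""
    labels = ", ".join("'{}'".format(label) for label in node_labels)
    # Each edge entry maps '[src, edge, dst]' to '' exactly as in the example;
    # this sketch does not assume any other value for that slot.
    edges = ", ".join(
        "'[{}, {}, {}]': ''".format(src, edge, dst)
        for src, edge, dst in edge_triples
    )
    return "CALL project_graph('{}', [{}], {{{}}})".format(graph_name, labels, edges)

def drop_projected_graph_stmt(graph_name):
    """Build the matching cleanup statement."""
    return "CALL drop_projected_graph('{}')".format(graph_name)

project_stmt = project_graph_stmt("social", ["person"], [("person", "knows", "person")])
drop_stmt = drop_projected_graph_stmt("social")
print(project_stmt)
print(drop_stmt)
```

Pairing the two builders makes it harder to forget the cleanup step when projections are created programmatically.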

BFS and SSSP also support returning full paths:

CALL bfs('social', {source: '0'})
YIELD node, distance, path
RETURN node.fName, distance, nodes(path) AS path_nodes;

The path column is a standard Cypher PATH type, supporting nodes(path), relationships(path), and length(path).
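
For intuition about what the distance and path columns contain, here is a minimal pure-Python BFS that records a parent pointer per node and rebuilds one shortest path from it. It is an illustrative sketch on a hypothetical four-node graph, not NeuG's implementation, and the adjacency map and node ids are assumptions.

```python
from collections import deque

def bfs_with_paths(adjacency, source):
    """Return {node: (distance, path)} for every node reachable from source."""
    parent = {source: None}
    distance = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for nxt in adjacency.get(current, []):
            if nxt not in parent:
                parent[nxt] = current
                distance[nxt] = distance[current] + 1
                queue.append(nxt)
    result = {}
    for node in parent:
        # Walk parent pointers back to the source, then reverse.
        path, step = [], node
        while step is not None:
            path.append(step)
            step = parent[step]
        result[node] = (distance[node], path[::-1])
    return result

# Hypothetical directed toy graph keyed by string ids.
adjacency = {"0": ["1", "2"], "1": ["3"], "2": ["3"], "3": []}
reached = bfs_with_paths(adjacency, "0")
for node, (dist, path) in reached.items():
    print(node, dist, path)
```

In this sketch the number of hops in a path always equals the reported distance, which mirrors the relationship between length(path) and distance in the query above.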

Performance: LDBC Graphalytics Benchmark

We evaluated GDS on the LDBC Graphalytics benchmark, a widely used standard for graph analysis platforms. The evaluation covers 5 datasets ranging from 1 billion to 1.8 billion edges, including synthetic R-MAT graphs (graph500-26), weighted social graphs (the datagen-9 series), and a real-world social graph (com-friendster, 1.8B edges). We compared against SuiteSparse:GraphBLAS, GeminiGraph, and ladybug (ladybug implements only PageRank and WCC, with a significant performance gap). All tests ran on a 32-core/64-thread machine with 495GB RAM, and all systems were compiled with -O3.

Partial results:

Algorithm | Dataset | NeuG | GraphBLAS | GeminiGraph
CDLP | datagen-9_2-zf | 19s | 316s | n/a
LCC | datagen-9_0 | 21.8s | 65.1s | n/a
SSSP | datagen-9_0 | 1.14s | 6.47s | 1.24s
BFS | com-friendster | 0.84s | 1.58s | 0.99s
WCC | com-friendster | 0.95s | 1.14s | 8.47s

Per-algorithm analysis:

  • CDLP (Label Propagation): fastest on all datasets; about 16x faster than GraphBLAS on datagen-9_2-zf (GeminiGraph does not implement this algorithm)
  • LCC (Local Clustering Coefficient): fastest on all datasets; about 3x faster than GraphBLAS on datagen-9_0 (GeminiGraph does not implement this algorithm)
  • SSSP (Single-Source Shortest Path): fastest on all three weighted datasets; about 5.7x faster than GraphBLAS on datagen-9_0 and slightly faster than GeminiGraph
  • BFS (Breadth-First Search) / WCC (Weakly Connected Components): lead or tie with the best on most datasets
  • PageRank: close to GraphBLAS; GeminiGraph slightly faster on some datasets
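
The speedup factors quoted above follow directly from the partial results table. A short sketch of the arithmetic, using only the runtimes listed in that table (None marks a system that does not implement the algorithm):

```python
# Runtimes in seconds, copied from the partial results table above.
results = {
    ("CDLP", "datagen-9_2-zf"): {"NeuG": 19.0, "GraphBLAS": 316.0, "GeminiGraph": None},
    ("LCC", "datagen-9_0"): {"NeuG": 21.8, "GraphBLAS": 65.1, "GeminiGraph": None},
    ("SSSP", "datagen-9_0"): {"NeuG": 1.14, "GraphBLAS": 6.47, "GeminiGraph": 1.24},
    ("BFS", "com-friendster"): {"NeuG": 0.84, "GraphBLAS": 1.58, "GeminiGraph": 0.99},
    ("WCC", "com-friendster"): {"NeuG": 0.95, "GraphBLAS": 1.14, "GeminiGraph": 8.47},
}

def speedups(row, baseline="NeuG"):
    """Ratio of each other system's runtime to the baseline runtime."""
    base = row[baseline]
    return {
        system: round(runtime / base, 2)
        for system, runtime in row.items()
        if system != baseline and runtime is not None
    }

table = {key: speedups(row) for key, row in results.items()}
for (algorithm, dataset), ratios in table.items():
    print(algorithm, dataset, ratios)
```

A ratio above 1 means the other system took longer than NeuG on that row; these are single data points from the partial table, not averages across datasets.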

Full benchmark documentation: GitHub.

The complete algorithm list, parameters, and usage are in the GDS extension docs.


Temporary Graphs: Load External Data, Compute, Don’t Pollute

GDS solves the “how to compute” problem. But there’s another practical question: where does the data come from?

Suppose you want to use GDS to analyze a batch of external data — relationship data exported from another system, a delta change file, or a temporarily assembled test dataset. You don’t want to create tables in your production database because this is a one-time analysis; but you need the data to exist as a graph in the database to run GDS.

v0.1.3’s COPY TEMP solves this:

-- Load temporary node table (from CSV)
COPY TEMP temp_user FROM 'users.csv' (header=true, primary_key='id');

-- Load temporary edge table (specify from/to endpoint tables)
COPY TEMP temp_knows FROM 'edges.csv' (header=true, from='temp_user', to='temp_user');

Temporary tables have three key properties:

  1. Lifecycle bound to the Connection: closing the connection automatically cleans up all temporary tables. No manual DROP TABLE needed.
  2. Not written to checkpoint: temporary tables are gone after a restart and never pollute persistent data.
  3. Mixable with persistent data: standard Cypher queries can join temporary and persistent tables.

COPY TEMP supports CSV, JSON, JSONL, and Parquet formats. Combined with v0.1.2’s httpfs extension, you can also load directly from remote OSS, S3, or HTTPS paths. See the data import docs for COPY TEMP details.
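
As a rough sketch of preparing input for the two statements above, the snippet below writes a pair of small CSV files and formats the matching COPY TEMP statements as strings. The column layouts (id and name for nodes, src and dst for edges) and the helper names are assumptions made for illustration; check the data import docs for the layout NeuG actually expects. Nothing here calls NeuG.

```python
import csv
import os
import tempfile

def write_csv(path, header, rows):
    """Write a header row followed by data rows."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)

def copy_temp_stmt(table, path, options):
    """Format a COPY TEMP statement in the shape shown above."""
    def render(value):
        # Booleans are written bare (header=true); everything else is quoted.
        if isinstance(value, bool):
            return "true" if value else "false"
        return "'{}'".format(value)
    rendered = ", ".join(
        "{}={}".format(key, render(value)) for key, value in options.items()
    )
    return "COPY TEMP {} FROM '{}' ({})".format(table, path, rendered)

workdir = tempfile.mkdtemp()
users_csv = os.path.join(workdir, "users.csv")
edges_csv = os.path.join(workdir, "edges.csv")

# Assumed layouts: nodes as (id, name), edges as (src, dst) endpoint ids.
write_csv(users_csv, ["id", "name"], [[1, "alice"], [2, "bob"]])
write_csv(edges_csv, ["src", "dst"], [[1, 2]])

statements = [
    copy_temp_stmt("temp_user", users_csv, {"header": True, "primary_key": "id"}),
    copy_temp_stmt(
        "temp_knows", edges_csv,
        {"header": True, "from": "temp_user", "to": "temp_user"},
    ),
]
for statement in statements:
    print(statement)
```

The options are passed as a dict rather than keyword arguments because from is a reserved word in Python.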


In Practice: Discovering New Concepts from Code Changes

Combining GDS and COPY TEMP solves a real-world problem: how do code changes impact the knowledge graph?

Scenario

Take the NeuG codebase as an example. We parse it (up to v0.1.2) into a graph, with functions as nodes and function calls as edges. (A real code graph schema would be more complex and would also include imports and other relationships; it is simplified here for demonstration.) The knowledge graph is overlaid on this same graph: each function is tagged with the concept it belongs to (e.g., execute_query() belongs to the “Query Execution” concept).

NeuG releases v0.1.3, and the code repository has new changes (new feature modules, documentation updates, etc.). We want to know: what impact will these code changes have on the knowledge graph? Do we need to add new concepts?

The approach: parse the new code from v0.1.3 into a delta graph (new function nodes + call edges), load it temporarily via COPY TEMP, then analyze its structure with GDS:

  • Which new code modules are the most influential?
  • How many communities do the new code changes form? Does each community suggest a new concept?

This is a pre-analysis — you don’t want to pollute the production knowledge base before deciding whether to apply the changes. COPY TEMP is the perfect fit.

Code Graph and Delta Data

The production database already has the current code graph — function nodes and call edges parsed from the v0.1.2 codebase, plus concept nodes and function → concept assignment relationships.

The new code from v0.1.3 is parsed into delta CSV files: new function nodes (e.g., project_graph, page_rank, leiden, bfs, etc.) and their call edges. We use COPY TEMP to temporarily load the delta, then use GDS to analyze its community structure — if new functions naturally cluster into communities, that suggests the knowledge graph may need new concepts.
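
One way to think about the delta is as a set difference between two parsed versions of the code graph. The sketch below is a hypothetical illustration: the function compute_delta, the sample inputs, and in particular the page_rank to execute_query edge are assumptions, not output of the actual code analysis tool. It separates calls that stay inside the delta from calls that cross into existing code.

```python
def compute_delta(old_funcs, new_funcs, new_calls):
    """Split a new code graph into added functions, calls internal to the
    added functions, and calls that cross between added and existing code."""
    added = sorted(set(new_funcs) - set(old_funcs))
    added_set = set(added)
    internal = sorted(
        (src, dst) for src, dst in new_calls
        if src in added_set and dst in added_set
    )
    crossing = sorted(
        (src, dst) for src, dst in new_calls
        if (src in added_set) != (dst in added_set)
    )
    return added, internal, crossing

# Hypothetical inputs: one pre-existing function and three new ones.
old_funcs = {"execute_query"}
new_funcs = {"execute_query", "project_graph", "page_rank", "leiden"}
new_calls = [
    ("page_rank", "project_graph"),
    ("leiden", "project_graph"),
    ("page_rank", "execute_query"),  # assumed cross-version call, for illustration
]

added, internal, crossing = compute_delta(old_funcs, new_funcs, new_calls)
print(added)
print(internal)
print(crossing)
```

The internal edges correspond to the simplified delta used in this walkthrough; the crossing edges are what would require endpoints in the persistent tables.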

Analysis Flow

from neug import Database
import tempfile, os

# 1. Create database (with persistent code graph, simulating production)
db = Database(db_path=tempfile.mkdtemp(), mode="w")
conn = db.connect()

# [Persistent code graph already exists: function nodes + call edges + concept assignments]

# 2. Load GDS extension
conn.execute("install gds")
conn.execute("load gds")

# 3. New functions and call edges parsed from v0.1.3 code (CSV provided by code analysis tool)
nodes_csv = os.path.join(tempfile.gettempdir(), "delta_funcs.csv")
edges_csv = os.path.join(tempfile.gettempdir(), "delta_calls.csv")
# ... write CSV files ...

# 4. COPY TEMP: load delta code as temporary graph (no pollution to persistent data)
#    Note: in practice, delta call edges may also connect to persistent func nodes
#    (new code calling existing functions, or vice versa); simplified to delta_func internal calls here
conn.execute("COPY TEMP delta_func FROM '{}' (header=true, primary_key='id')".format(nodes_csv))
conn.execute("COPY TEMP delta_call FROM '{}' (header=true, from='delta_func', to='delta_func')".format(edges_csv))

# 5. Project the delta code graph
conn.execute("""
    CALL project_graph(
        'delta_graph',
        ['delta_func'],
        {'[delta_func, delta_call, delta_func]': ''}
    )
""")

# 6. PageRank: which new code modules are most influential?
result = conn.execute("""
    CALL page_rank('delta_graph', {max_iterations: 20})
    RETURN node.name, rank
    ORDER BY rank DESC
""")
# (example results)
# project_graph        -> highest rank (called by page_rank, leiden, wcc, bfs; the delta's hub)
# ...
# NodeDatabase::Init   -> low (called only by InitAll)

# 7. Leiden: how many communities do the new code changes form?
#    Here we run community detection on the delta graph; in practice, you can also run
#    Leiden on the combined persistent + delta graph, comparing with original communities
#    to discover newly emerged communities → suggesting new concepts
result = conn.execute("""
    CALL leiden('delta_graph', {concurrency: 1})
    RETURN node.name, community
    ORDER BY community
""")
# (example results, actual community partitions may vary by algorithm parameters)
# Group A: project_graph, page_rank, leiden, wcc, bfs, parse_subgraph_entries
# Group B: InitAll, NodeDatabase::Init
# -> Two main groups: graph analytics functions (6 nodes) -> may correspond to new concept "GDS Extension"
#    Node.js binding functions (2 nodes) -> may correspond to new concept "Node.js Binding"

# 8. Clean up
conn.execute("CALL drop_projected_graph('delta_graph')")
conn.close()
# -> Temporary tables automatically cleaned up

# 9. Verify: reopen, persistent code graph intact
conn2 = db.connect()
result = conn2.execute("MATCH (n:func) RETURN count(n)")

conn2.close()
db.close()

Analysis Conclusions

PageRank reveals: project_graph is the most influential function among the new code — it is called by page_rank, leiden, and other algorithm functions, making it the core hub of the new code.
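
To see why a function with many callers ends up on top, here is a minimal textbook PageRank in pure Python. It is a sketch, not NeuG's implementation: the damping factor of 0.85 and the uniform redistribution of rank from nodes with no outgoing calls are assumptions. The graph contains only the call edges stated in this post (parse_subgraph_entries is omitted because its edges are not given), so the ordering of the remaining nodes need not match the real results.

```python
def pagerank(nodes, edges, damping=0.85, max_iterations=20):
    """Iterative PageRank with uniform redistribution of dangling rank."""
    n = len(nodes)
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iterations):
        # Rank held by nodes without outgoing edges is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        nxt = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    nxt[dst] += share
        rank = nxt
    return rank

# Caller -> callee edges taken from the example results above.
nodes = ["project_graph", "page_rank", "leiden", "wcc", "bfs",
         "InitAll", "NodeDatabase::Init"]
edges = [
    ("page_rank", "project_graph"),
    ("leiden", "project_graph"),
    ("wcc", "project_graph"),
    ("bfs", "project_graph"),
    ("InitAll", "NodeDatabase::Init"),
]

ranks = pagerank(nodes, edges)
top = max(ranks, key=ranks.get)
print(top)
```

Because rank flows along call edges from caller to callee, a function called by four others accumulates more rank than one called by a single function.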

Leiden reveals: the new code forms multiple communities — GDS functions (project_graph, page_rank, leiden, etc., 6 nodes) are tightly coupled, suggesting that v0.1.3’s code changes may require adding a new “GDS Extension” concept to the knowledge graph; the Node.js binding functions (InitAll, NodeDatabase::Init, 2 nodes) are relatively independent and can be treated as a separate concept.
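
Leiden and Louvain are commonly run with modularity as the quality function, so one way to sanity-check a partition like the one above is to score it. The sketch below computes modularity for an undirected, unweighted graph; it is an illustration rather than NeuG's implementation, it uses only the call edges stated in this post, and the group labels are assumptions.

```python
def modularity(edges, community_of):
    """Modularity Q = sum over communities of L_c/m - (d_c / 2m)^2, where
    L_c is the number of intra-community edges and d_c the total degree."""
    m = len(edges)
    internal, total_degree = {}, {}
    for u, v in edges:
        for node in (u, v):
            c = community_of[node]
            total_degree[c] = total_degree.get(c, 0) + 1
        if community_of[u] == community_of[v]:
            c = community_of[u]
            internal[c] = internal.get(c, 0) + 1
    return sum(
        internal.get(c, 0) / m - (total_degree[c] / (2 * m)) ** 2
        for c in total_degree
    )

# Call edges from the example, treated as undirected.
edges = [
    ("page_rank", "project_graph"),
    ("leiden", "project_graph"),
    ("wcc", "project_graph"),
    ("bfs", "project_graph"),
    ("InitAll", "NodeDatabase::Init"),
]
gds = ["project_graph", "page_rank", "leiden", "wcc", "bfs"]
nodejs = ["InitAll", "NodeDatabase::Init"]

two_groups = {**{f: "gds" for f in gds}, **{f: "nodejs" for f in nodejs}}
one_group = {f: "all" for f in gds + nodejs}
singletons = {f: f for f in gds + nodejs}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
q_single = modularity(edges, singletons)
print(round(q_two, 2), round(q_one, 2), round(q_single, 2))
```

On this toy graph the two-group split scores higher than either lumping everything together or leaving every function on its own, which is consistent with treating the GDS functions and the Node.js binding functions as separate concepts.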

This is the core idea of “discovering new concepts from the code graph”: code updates are temporarily loaded via COPY TEMP, GDS analyzes their community structure, and newly emerging communities suggest that the knowledge graph needs new concepts. Throughout the analysis, the production database data was never modified. COPY TEMP’s temporary tables are automatically cleaned up when the connection closes, and persistent data remains intact.

We’ll publish a separate blog post covering the full code-graph-based concept discovery scenario in detail.


Also Worth Noting

A few more improvements in v0.1.3:

  • Node.js Binding (#424): A native C++ addon (N-API) that provides a high-level API (Database, Connection, QueryResult) for running the embedded graph database and Cypher queries. Install with npm install @graphscope-neug/neug.
  • COW Snapshot Isolation (#370): UpdateTransaction switched to Copy-on-Write snapshot isolation for safer read-write concurrency.
  • Unified Type System (#525): Compiler LogicalType and engine DataType unified, reducing type conversion overhead.
  • Database Directory Management Refactor (#148): Introduced Module, Checkpoint, and CheckpointManager for improved database directory management.
  • macOS ARM64 CI (#576): A new dedicated macOS ARM64 testing workflow to safeguard quality on Apple Silicon.
  • Key Bug Fixes: Non-standard column names causing Parquet files to load 0 rows (#455), and macOS ARM64 nightly build failures (#515, #517).

Full release notes: GitHub Release.


Try It

pip install neug==0.1.3

Copy the demo code above, fill in the CSV-writing step, and run it. For a complete reproducible script (including CSV generation and assertions), see reproduce.py.

GDS extension docs: load_gds

GitHub: https://github.com/alibaba/neug