Parquet Extension
Apache Parquet is a columnar storage format widely used in data engineering and analytics workloads. NeuG supports both Parquet file import and export functionality through the Extension framework.
- Import: Load external Parquet files using
LOAD FROMsyntax - Export: Export query results to Parquet files using
COPY TOsyntax
Install Extension
INSTALL PARQUET;Load Extension
LOAD PARQUET;Using Parquet Extension
LOAD FROM reads Parquet files and exposes their columns for querying. Schema is automatically inferred from the Parquet file metadata by default.
Parquet Format Options
The following options control how Parquet files are read:
| Option | Type | Default | Description |
|---|---|---|---|
buffered_stream | bool | true | Enable buffered I/O stream for improved sequential read performance. |
pre_buffer | bool | false | Pre-buffer column data before decoding. Recommended for high-latency filesystems such as S3. |
enable_io_coalescing | bool | true | Enable Arrow I/O read coalescing (hole-filling cache) to reduce I/O overhead when reading non-contiguous byte ranges. When true, uses lazy coalescing; when false, uses eager coalescing. |
parquet_batch_rows | int64 | 65536 | Number of rows per Arrow record batch when converting Parquet row groups into in-memory batches. |
Query Examples
Basic Parquet Loading
Load all columns from a Parquet file:
LOAD FROM "person.parquet"
RETURN *;Specifying Batch Size
Tune memory usage by adjusting the number of rows read per batch:
LOAD FROM "person.parquet" (parquet_batch_rows=8192)
RETURN *;Enabling I/O Coalescing
Enable eager I/O coalescing for workloads that benefit from pre-fetching contiguous data:
LOAD FROM "person.parquet" (enable_io_coalescing=false)
RETURN *;Column Projection
Return only specific columns from Parquet data:
LOAD FROM "person.parquet"
RETURN fName, age;Column Aliases
Use AS to assign aliases to columns:
LOAD FROM "person.parquet"
RETURN fName AS name, age AS years;Note: All relational operations supported by
LOAD FROM— including type conversion, WHERE filtering, aggregation, sorting, and limiting — work the same way with Parquet files. See the LOAD FROM reference for the complete list of operations.
Export to Parquet
NeuG supports exporting query results to Parquet files using the COPY TO command. This is useful for:
- Data archiving: Store query results in efficient columnar format
- Data sharing: Exchange data with other analytics tools (Spark, Pandas, DuckDB, etc.)
- Performance: Parquet’s columnar format provides excellent compression and query performance
Basic Export Syntax
Export query results to a Parquet file:
COPY (
MATCH (p:person)
RETURN p.ID, p.fName, p.age
) TO 'output.parquet';Export Options
The following options control how Parquet files are written:
| Option | Type | Default | Description |
|---|---|---|---|
compression | string | snappy | Compression codec: snappy, gzip, zstd, or none |
row_group_size | int64 | 1048576 | Number of rows per row group (1,048,576 = 1M rows). Larger values improve compression but use more memory. |
dictionary_encoding | bool | true | Enable dictionary encoding for string columns. Reduces file size for columns with repeated values. |
Export Examples
Export with ZSTD Compression
COPY (
MATCH (p:person)
RETURN p.*
) TO 'person.parquet' (compression='zstd');Export with Custom Row Group Size
COPY (
MATCH (v:node)
RETURN v.*
) TO 'nodes.parquet' (row_group_size=500000);Export Without Compression
COPY (
MATCH (p:person)-[k:knows]->(p2:person)
RETURN p.fName, p2.fName, k.since
) TO 'relationships.parquet' (compression='none');Export with Dictionary Encoding Disabled
COPY (
MATCH (p:person)
RETURN p.fName, p.email
) TO 'contacts.parquet' (dictionary_encoding=false);Supported Data Types
Parquet export supports all NeuG data types:
Primitive Types:
- INT32, INT64, UINT32, UINT64
- FLOAT, DOUBLE, BOOLEAN
- STRING, DATE, TIMESTAMP, INTERVAL
Complex Types:
- List
: Variable-length arrays (e.g., list<string>,list<int64>) - Struct: Nested structures with named fields
- Vertex: Graph vertices exported as JSON string (due to mixed-type schema conflicts)
- Edge: Graph edges exported as JSON string (due to mixed-type schema conflicts)
- Path: Graph paths exported as JSON string (due to mixed-type schema conflicts)
Note on Vertex/Edge/Path export: These graph types are exported as JSON strings rather than Parquet StructArrays. This design choice is necessary because Parquet StructArrays require all rows to have the same schema, but mixed-type vertices/edges (e.g., person vs. organisation) have different properties, which would cause schema conflicts and sparse data.
Export Vertex and Edge Data
Export complete vertex objects:
COPY (
MATCH (p:person)
RETURN p
) TO 'vertices.parquet';This creates a Parquet file with a JSON string column containing serialized vertex data:
p: string (JSON)
e.g. {"_ID": 1, "_LABEL": "person", "fName": "Alice", "age": 30, ...}Export complete edge objects:
COPY (
MATCH (p:person)-[k:knows]->(p2:person)
RETURN k
) TO 'edges.parquet';This creates a Parquet file with a JSON string column containing serialized edge data:
k: string (JSON)
e.g. {"_ID": 100, "_LABEL": "knows", "_SRC_ID": 1, "_DST_ID": 2, "since": "2020-01-01", ...}Performance Tips
-
Choose appropriate compression:
snappy: Good balance of speed and compression (default)zstd: Best compression ratio, slightly slowernone: Fastest, but larger files
-
Adjust row group size based on your use case:
- Large datasets (>10M rows): Use default 1M rows per group
- Medium datasets (100K-10M rows): Use 500K rows per group
- Small datasets (<100K rows): Use 100K rows per group
-
Enable dictionary encoding for string columns with repeated values (e.g., categories, status codes)
-
Export only needed columns to reduce file size:
COPY ( MATCH (p:person) RETURN p.ID, p.fName -- Not p.* ) TO 'subset.parquet';
Round-Trip Example
Export data and verify by loading it back:
-- Step 1: Export to Parquet
COPY (
MATCH (p:person)
RETURN p.ID, p.fName, p.age
) TO 'export.parquet';
-- Step 2: Load it back to verify
LOAD FROM "export.parquet"
RETURN *;