Skip to Content
ExtensionsParquet Extension

Parquet Extension

Apache Parquet is a columnar storage format widely used in data engineering and analytics workloads. NeuG supports both Parquet file import and export functionality through the Extension framework.

  • Import: Load external Parquet files using LOAD FROM syntax
  • Export: Export query results to Parquet files using COPY TO syntax

Install Extension

INSTALL PARQUET;

Load Extension

LOAD PARQUET;

Using Parquet Extension

LOAD FROM reads Parquet files and exposes their columns for querying. Schema is automatically inferred from the Parquet file metadata by default.

Parquet Format Options

The following options control how Parquet files are read:

OptionTypeDefaultDescription
buffered_streambooltrueEnable buffered I/O stream for improved sequential read performance.
pre_bufferboolfalsePre-buffer column data before decoding. Recommended for high-latency filesystems such as S3.
enable_io_coalescingbooltrueEnable Arrow I/O read coalescing (hole-filling cache) to reduce I/O overhead when reading non-contiguous byte ranges. When true, uses lazy coalescing; when false, uses eager coalescing.
parquet_batch_rowsint6465536Number of rows per Arrow record batch when converting Parquet row groups into in-memory batches.

Query Examples

Basic Parquet Loading

Load all columns from a Parquet file:

LOAD FROM "person.parquet" RETURN *;

Specifying Batch Size

Tune memory usage by adjusting the number of rows read per batch:

LOAD FROM "person.parquet" (parquet_batch_rows=8192) RETURN *;

Enabling I/O Coalescing

Enable eager I/O coalescing for workloads that benefit from pre-fetching contiguous data:

LOAD FROM "person.parquet" (enable_io_coalescing=false) RETURN *;

Column Projection

Return only specific columns from Parquet data:

LOAD FROM "person.parquet" RETURN fName, age;

Column Aliases

Use AS to assign aliases to columns:

LOAD FROM "person.parquet" RETURN fName AS name, age AS years;

Note: All relational operations supported by LOAD FROM — including type conversion, WHERE filtering, aggregation, sorting, and limiting — work the same way with Parquet files. See the LOAD FROM reference for the complete list of operations.

Export to Parquet

NeuG supports exporting query results to Parquet files using the COPY TO command. This is useful for:

  • Data archiving: Store query results in efficient columnar format
  • Data sharing: Exchange data with other analytics tools (Spark, Pandas, DuckDB, etc.)
  • Performance: Parquet’s columnar format provides excellent compression and query performance

Basic Export Syntax

Export query results to a Parquet file:

COPY ( MATCH (p:person) RETURN p.ID, p.fName, p.age ) TO 'output.parquet';

Export Options

The following options control how Parquet files are written:

OptionTypeDefaultDescription
compressionstringsnappyCompression codec: snappy, gzip, zstd, or none
row_group_sizeint641048576Number of rows per row group (1,048,576 = 1M rows). Larger values improve compression but use more memory.
dictionary_encodingbooltrueEnable dictionary encoding for string columns. Reduces file size for columns with repeated values.

Export Examples

Export with ZSTD Compression

COPY ( MATCH (p:person) RETURN p.* ) TO 'person.parquet' (compression='zstd');

Export with Custom Row Group Size

COPY ( MATCH (v:node) RETURN v.* ) TO 'nodes.parquet' (row_group_size=500000);

Export Without Compression

COPY ( MATCH (p:person)-[k:knows]->(p2:person) RETURN p.fName, p2.fName, k.since ) TO 'relationships.parquet' (compression='none');

Export with Dictionary Encoding Disabled

COPY ( MATCH (p:person) RETURN p.fName, p.email ) TO 'contacts.parquet' (dictionary_encoding=false);

Supported Data Types

Parquet export supports all NeuG data types:

Primitive Types:

  • INT32, INT64, UINT32, UINT64
  • FLOAT, DOUBLE, BOOLEAN
  • STRING, DATE, TIMESTAMP, INTERVAL

Complex Types:

  • List: Variable-length arrays (e.g., list<string>, list<int64>)
  • Struct: Nested structures with named fields
  • Vertex: Graph vertices exported as JSON string (due to mixed-type schema conflicts)
  • Edge: Graph edges exported as JSON string (due to mixed-type schema conflicts)
  • Path: Graph paths exported as JSON string (due to mixed-type schema conflicts)

Note on Vertex/Edge/Path export: These graph types are exported as JSON strings rather than Parquet StructArrays. This design choice is necessary because Parquet StructArrays require all rows to have the same schema, but mixed-type vertices/edges (e.g., person vs. organisation) have different properties, which would cause schema conflicts and sparse data.

Export Vertex and Edge Data

Export complete vertex objects:

COPY ( MATCH (p:person) RETURN p ) TO 'vertices.parquet';

This creates a Parquet file with a JSON string column containing serialized vertex data:

p: string (JSON) e.g. {"_ID": 1, "_LABEL": "person", "fName": "Alice", "age": 30, ...}

Export complete edge objects:

COPY ( MATCH (p:person)-[k:knows]->(p2:person) RETURN k ) TO 'edges.parquet';

This creates a Parquet file with a JSON string column containing serialized edge data:

k: string (JSON) e.g. {"_ID": 100, "_LABEL": "knows", "_SRC_ID": 1, "_DST_ID": 2, "since": "2020-01-01", ...}

Performance Tips

  1. Choose appropriate compression:

    • snappy: Good balance of speed and compression (default)
    • zstd: Best compression ratio, slightly slower
    • none: Fastest, but larger files
  2. Adjust row group size based on your use case:

    • Large datasets (>10M rows): Use default 1M rows per group
    • Medium datasets (100K-10M rows): Use 500K rows per group
    • Small datasets (<100K rows): Use 100K rows per group
  3. Enable dictionary encoding for string columns with repeated values (e.g., categories, status codes)

  4. Export only needed columns to reduce file size:

    COPY ( MATCH (p:person) RETURN p.ID, p.fName -- Not p.* ) TO 'subset.parquet';

Round-Trip Example

Export data and verify by loading it back:

-- Step 1: Export to Parquet COPY ( MATCH (p:person) RETURN p.ID, p.fName, p.age ) TO 'export.parquet'; -- Step 2: Load it back to verify LOAD FROM "export.parquet" RETURN *;