GFFBase

What is GFFBase?

GFFBase is a high-performance genomic-annotation engine combining a SIMD Rust parser, a DuckDB columnar backend, and a zero-copy PyArrow interface — purpose-built for whole-genome-scale ingest and bulk machine-learning feature extraction, while remaining a drop-in successor to gffutils.

A SIMD Rust+PyO3 parser feeds DuckDB's columnar storage through record-batch Arrow handoffs. A smart query router auto-picks an R-tree or B-tree spatial index per query, and a closure-cache / recursive-CTE relational dispatcher selects the right strategy based on the corpus's actual hierarchy depth. The full FeatureDB / Feature / create_db / DataIterator / GFFWriter / merge_criteria legacy API is preserved verbatim.
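The depth-based dispatch idea can be sketched in a few lines of pure Python. This is a hypothetical illustration only; the function name and threshold are invented for the demo and are not gffbase internals:

```python
# Illustrative sketch: pick a relational strategy from the corpus's measured
# hierarchy depth. Shallow hierarchies fit a precomputed closure cache; deep
# or unbounded ones fall back to a recursive CTE at query time.
def pick_relational_strategy(max_depth: int, depth_threshold: int = 3) -> str:
    return "closure-cache" if max_depth <= depth_threshold else "recursive-CTE"

print(pick_relational_strategy(2))   # e.g. gene -> transcript -> exon
print(pick_relational_strategy(7))   # deeply nested annotation hierarchy
```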

Three reasons it matters

  1. 🚀 ≥ 32× faster GENCODE GTF ingest (v49, 6.07 M lines) — set-based DuckDB GROUP BY synthesis + recursive-CTE closure replaces legacy's millions of Python ↔ SQLite round-trips spent inventing the missing gene/transcript rows.
  2. ⚡ 36.68× faster bulk ML extraction — children_batched(format='arrow') returns 50 000 transcripts → 1.6 M exons as a zero-copy PyArrow table in 1.16 s. No Python Feature objects, ever.
  3. 🛡️ Validated NCBI compliance — all four canonical human-genome annotations (GENCODE / RefSeq / MANE / CHESS 3) ingest cleanly with zero strict-mode warnings. RefSeq's split-CDS duplicate-ID convention is handled automatically.
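The recursive-CTE closure behind point 1 can be illustrated with a self-contained demo. It uses stdlib sqlite3 rather than DuckDB purely so the snippet runs anywhere (the WITH RECURSIVE syntax is the same), and the table and column names are invented for the example, not gffbase's schema:

```python
import sqlite3

# Toy parent/child table: one gene, one transcript, two exons.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE relations (parent TEXT, child TEXT)")
con.executemany(
    "INSERT INTO relations VALUES (?, ?)",
    [("gene1", "tx1"), ("tx1", "exon1"), ("tx1", "exon2")],
)

# One recursive CTE computes every ancestor -> descendant pair with its
# level, replacing a per-row Python traversal with a single set-based query.
rows = con.execute("""
    WITH RECURSIVE closure(ancestor, descendant, level) AS (
        SELECT parent, child, 1 FROM relations
        UNION ALL
        SELECT c.ancestor, r.child, c.level + 1
        FROM closure c JOIN relations r ON r.parent = c.descendant
    )
    SELECT ancestor, descendant, level
    FROM closure ORDER BY ancestor, level, descendant
""").fetchall()
for row in rows:
    print(row)
```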

⚡ Comprehensive Human Genome Annotations — validated across every canonical corpus

Head-to-head benchmark against legacy gffutils on the four canonical human-genome annotation sources, with the v0.1.0 GFF3 ingest pipeline optimizations applied:

| Corpus | Format | Lines | gffbase ingest | legacy ingest | speedup | spatial qps | batched (5 k anchors) |
|---|---|---|---|---|---|---|---|
| GENCODE v49 (basic) | GTF | 6,068,892 | 4 min 37 s | ≥ 2 hr 30 min | 🚀 ≥ 32× | 1,204 | 172 ms / 596 k desc |
| GENCODE v49 (basic) | GFF3 | 6,066,054 | 6 min 7 s | 11 min 23 s | 1.86× | 1,292 | 422 ms / 1.93 M desc |
| RefSeq GRCh38.p14 | GFF3 | 4,932,571 | 4 min 12 s | 6 min 5 s | 1.45× | 1,011 | 263 ms / 999 k desc |
| MANE v1.5 (Ensembl) | GFF3 | 524,834 | 21.6 s | 45.1 s | 2.09× | 1,766 | 78 ms / 156 k desc |
| CHESS 3.1.3 | GFF3 | 2,761,061 | 53.6 s | 2 min 13.1 s | 2.48× | 1,175 | 91 ms / 161 k desc |

Every corpus ingests with zero strict-mode warnings from the NCBI-spec-hardened Rust parser. RefSeq's duplicate-ID=cds-NP_xxx convention (split CDS segments) is handled transparently via the duplicates table. Full reproducible numbers + per-corpus root-cause analysis: see Performance Comparison.
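As a quick sanity check, the headline ratios follow from plain arithmetic on the wall times in the table (the legacy GTF figure is a lower bound, so the resulting speedup is too):

```python
# Wall times from the table, converted to seconds.
gtf_gffbase = 4 * 60 + 37            # 4 min 37 s -> 277 s
gtf_legacy_floor = 2.5 * 3600        # >= 2 hr 30 min -> 9000 s
print(f"GTF ingest speedup >= {gtf_legacy_floor / gtf_gffbase:.1f}x")

mane_speedup = 45.1 / 21.6           # MANE v1.5 row
print(f"MANE ingest speedup = {mane_speedup:.2f}x")
```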


🚀 The Killer Feature — zero-copy PyArrow for ML pipelines

Modern ML genomics pipelines have one shape: pull every exon for 50 000 transcripts, push the column-oriented table into a tensor, train. Legacy gffutils forces a per-feature Python loop — constructing 1.6 M throwaway Feature objects per pull, which crushes both wall time and memory. gffbase bypasses Python entirely:

```python
import torch

exons = db.children_batched(
    transcript_ids,                # 50 000 IDs
    featuretype="exon",
    format="arrow",                # zero-copy pyarrow.Table
)

starts = torch.from_numpy(exons.column("start").to_numpy())
ends   = torch.from_numpy(exons.column("end").to_numpy())
```

| Path | Wall on 50 k transcripts | vs legacy |
|---|---|---|
| db.children_batched(format='arrow') | 1.16 s | 36.68× faster |
| legacy gffutils row-by-row loop | 42.55 s | 1.0× |
| gffbase row-by-row loop | ≥ 642 s | 0.07× (slower!) |

This is the reason GFFBase exists. See the Machine Learning Workflows Cookbook for end-to-end pipelines.
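The object-versus-column trade-off can be sketched without any dependencies. This toy contrasts the two data shapes using the stdlib array module; the class name is a made-up stand-in, and the real gffbase win comes from Arrow buffers and DuckDB, not Python arrays:

```python
from array import array

class FeatureRow:                    # toy stand-in for a per-row object
    def __init__(self, start, end):
        self.start, self.end = start, end

n = 100_000
# Row-by-row shape: one heap object per feature.
objects = [FeatureRow(i, i + 100) for i in range(n)]
# Columnar shape: two contiguous typed buffers.
starts = array("q", range(n))
ends   = array("q", (i + 100 for i in range(n)))

# Same computation, two shapes: summing feature lengths.
row_total = sum(f.end - f.start for f in objects)
col_total = sum(e - s for s, e in zip(starts, ends))
assert row_total == col_total == n * 100
```

The columnar form is what a tensor library actually wants: contiguous numeric buffers it can wrap without materializing 100 000 Python objects first.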


📦 Installation

```shell
pip install gffbase
```

Universal abi3-py39 wheels — one binary per arch covers CPython 3.9 → 3.13.


🏃 Quick start (row-by-row)

```python
from gffbase import create_db

db = create_db("gencode.v49.chr_patch_hapl_scaff.basic.annotation.gtf.gz",
               "gencode.duckdb", force=True)

for tx in db.children("ENSG00000139618", level=1, featuretype="transcript"):
    print(tx.id, tx.start, tx.end)

for f in db.region("chr17:43044295-43125483", featuretype="exon"):
    print(f)
```
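The region string above follows the familiar seqid:start-end form. A stdlib-only parser for that form might look like this (a hypothetical helper for illustration, not part of the gffbase API):

```python
import re

def parse_region(region: str) -> tuple:
    """Split 'chr17:43044295-43125483' into (seqid, start, end)."""
    m = re.fullmatch(r"([^:]+):([\d,]+)-([\d,]+)", region)
    if m is None:
        raise ValueError(f"bad region string: {region!r}")
    seqid = m.group(1)
    start = int(m.group(2).replace(",", ""))   # tolerate thousands separators
    end   = int(m.group(3).replace(",", ""))
    return seqid, start, end

print(parse_region("chr17:43044295-43125483"))
```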

🤖 Quick start (vectorized for ML)

```python
from gffbase import FeatureDB

db = FeatureDB("gencode.duckdb")
exons = db.children_batched(transcript_ids, featuretype="exon", format="arrow")

# Hand off to PyTorch / Hugging Face / JAX / Lance — no Python copies.
```

✨ What's inside

  • Rust + PyO3 parser — SIMD splitting, lazy URL-decoding, GTF semicolon-in-quotes safe, gzipped input transparent. Hardened against the NCBI GFF3 spec.
  • DuckDB columnar storage — set-based GTF synthesis, recursive-CTE closure, per-seqid-banded R-tree built inline during ingest.
  • Smart routing — R-tree / B-tree spatial; closure-cache / dynamic-CTE relational.
  • Vectorized batched API — pyarrow.Table / pandas.DataFrame / polars.DataFrame, directly out of DuckDB's buffer pool.
  • Drop-in legacy API — FeatureDB, Feature, create_db, DataIterator, GFFWriter, merge_criteria, bed12, execute(), export_sqlite().
  • abi3 wheels — single binary per arch covers CPython 3.9–3.13.

📚 Where to next

| Page | What's there |
|---|---|
| Usage Gallery | Copy-pasteable snippets for every public API method |
| Performance | Head-to-head numbers across every canonical human-genome annotation + the v0.1.0 ingest optimization story |
| Migration from gffutils | Drop-in compatibility + the one OLAP gotcha |
| Cookbooks | GENCODE/Ensembl, RefSeq, MANE, ML workflows |
| API Reference | Every public method, full signatures + docstrings |

🧪 Testing

```shell
pip install -e .[test]
pytest                  # 523 passed, 7 skipped, 99.19% coverage
```

CI runs the full matrix on Linux + macOS + Windows, both R-tree and B-tree fallback paths, on Python 3.9 / 3.11 / 3.13.

🤝 Contributing

Pull requests welcome. See CONTRIBUTING.md for development setup (Rust ≥ 1.69, Python 3.9–3.13, maturin develop --release), the test-and-coverage gates, and the full PR checklist.

🪪 License

Apache License 2.0.