Migrating from gffutils to gffbase¶
GFFBase is a drop-in successor to legacy
gffutils. For most users, the
migration is one import change.
⚠️ READ THIS FIRST — the OLAP/OLTP gotcha¶
There is exactly one common code pattern that gets slower, not faster, when you migrate to gffbase. It's the per-id Python loop:
```python
# ❌ ANTI-PATTERN with gffbase: 50 000 small queries.
# Pays DuckDB's vectorization startup × 50 000 + per-row Feature
# construction × 1.6 M. ≥ 10 minutes wall on GENCODE v49.
for transcript_id in fifty_thousand_transcript_ids:
    for exon in db.children(transcript_id, featuretype="exon"):
        starts.append(exon.start)
        ends.append(exon.end)
```

DuckDB is an OLAP engine — designed for big set-based queries. Iterating it row-by-row pays vectorization startup per call and never amortizes. SQLite (legacy gffutils) is OLTP — its B-tree seek on a cache-warm file is microseconds per call.
✅ The fix — one canonical PyArrow snippet¶
```python
# ✅ ONE set-based SQL query for all 50 000 transcripts.
# Returns a zero-copy pyarrow.Table — no `Feature` object is ever
# constructed. 1.16 s wall on GENCODE v49 → a 553× speedup.
exons = db.children_batched(
    fifty_thousand_transcript_ids,
    featuretype="exon",
    format="arrow",  # or "df" / "polars"
)

# NumPy / PyTorch / JAX / Hugging Face datasets — all native.
starts = exons.column("start").to_numpy()
ends = exons.column("end").to_numpy()

# The "anchor" column carries the input transcript_id for each row,
# so you can groupby in Python or downstream Arrow tooling without
# re-issuing N queries:
import pyarrow.compute as pc
per_tx_exon_count = pc.value_counts(exons.column("anchor"))
```

If your code has a `for x in ids: db.children(x, …)` loop and you care about wall time, convert it now, before you migrate. This is the only common change required of legacy `gffutils` users.
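The arrays from the snippet above drop straight into ML tooling. A minimal PyTorch sketch, assuming torch is installed and `starts` / `ends` are the NumPy arrays from the snippet:

```python
import torch

# torch.from_numpy shares the underlying NumPy buffer (no copy).
exon_lengths = torch.from_numpy(ends - starts + 1)
```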
1. Drop-in compatibility — the easy part¶
Every public surface from legacy gffutils is preserved verbatim:
| gffutils symbol | gffbase equivalent |
|---|---|
| `gffutils.create_db(path, dbfn, ...)` | `gffbase.create_db(path, dbfn, ...)` |
| `gffutils.FeatureDB(dbfn)` | `gffbase.FeatureDB(dbfn)` |
| `gffutils.Feature(...)` | `gffbase.Feature(...)` |
| `gffutils.DataIterator(...)` | `gffbase.DataIterator(...)` |
| `gffutils.GFFWriter(...)` | `gffbase.GFFWriter(...)` |
| `gffutils.merge_criteria.*` | `gffbase.merge_criteria.*` |
| `gffutils.example_filename(name)` | `gffbase.example_filename(name)` |
| Exceptions (`FeatureNotFoundError`, …) | same names |
```python
# Before
import gffutils
db = gffutils.create_db("annotation.gff3", "annotation.db")

# After
import gffbase as gffutils  # one-line alias migration
db = gffutils.create_db("annotation.gff3", "annotation.duckdb")
```
All FeatureDB methods (children, parents, region,
features_of_type, interfeatures, merge, bed12, update,
delete, add_relation, execute, …) accept the same arguments and
return generators of Feature objects — identical to the legacy API.
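For example, a minimal sketch of legacy-style calls against the database created above (the gene ID is a placeholder):

```python
db = gffutils.FeatureDB("annotation.duckdb")

# Point lookup and generator of Feature objects, as in legacy gffutils.
gene = db["ENSG00000000003"]  # placeholder ID
for tx in db.children(gene, featuretype="transcript"):
    print(tx.id, tx.start, tx.end, tx.strand)

# Spatial query, same signature as the legacy API.
for feat in db.region("chr1:1000000-2000000", featuretype="gene"):
    print(feat.id)
```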
The storage backend changes (DuckDB instead of SQLite). This is
transparent for almost all callers, but raw SQL queries that hit the
legacy schema directly via db.execute(...) need rewriting against
the GFFBase schema (or against the SQLite-compat views; see §4). We
also ship gffbase.export_sqlite(con, path) to dump a GFFBase
database into a legacy .sqlite file when you need the old format.
2. What you gain immediately, no code changes¶
The comprehensive human-genome benchmark, with the v0.1.0 GFF3 ingest
optimizations applied — head-to-head against legacy gffutils across
GENCODE, RefSeq, MANE, and CHESS 3:
| Corpus | Format | gffbase ingest | legacy ingest | speedup | spatial qps (gffbase R-tree) | batched query, 5 k anchors (time / descendants) |
|---|---|---|---|---|---|---|
| GENCODE v49 (basic) | GTF | 4 min 37 s | ≥ 2 hr 30 min | 🚀 ≥ 32× | 1,204 | 172 ms / 596 k desc |
| GENCODE v49 (basic) | GFF3 | 6 min 7 s | 11 min 23 s | 1.86× | 1,292 | 422 ms / 1.93 M desc |
| RefSeq GRCh38.p14 | GFF3 | 4 min 12 s | 6 min 5 s | 1.45× | 1,011 | 263 ms / 999 k desc |
| MANE v1.5 (Ensembl) | GFF3 | 21.6 s | 45.1 s | 2.09× | 1,766 | 78 ms / 156 k desc |
| CHESS 3.1.3 | GFF3 | 53.6 s | 2 min 13.1 s | 2.48× | 1,175 | 91 ms / 161 k desc |
| Single-call workload | Speedup |
|---|---|
| Spatial overlap (`db.region(...)`, p50 latency) | 8.35× lower (0.72 ms vs 6.01 ms) |
| `db.children(id, level=1)` indexed lookup | comparable |
| `db.children_batched(ids, format="arrow")` (50 k ids) | 🚀 36.68× |
Your existing gffutils script gets the ingest, spatial, and
attribute-query wins the moment you swap the import. To unlock the
36.68× ML-batched win, see the warning at the top of this page.
3. ⚠️ Deep-dive: the OLAP vs OLTP tradeoff¶
DuckDB is an OLAP engine. It's optimized for big set-based queries (JOINs, aggregations, scans of millions of rows). SQLite is an OLTP engine — optimized for tiny indexed point lookups against cache-warm pages. For tiny, repeated point queries against a cache-warm DB, SQLite (and therefore legacy gffutils) is faster.
The fix is the canonical PyArrow snippet at the top of this page.
Full benchmark: at 50 000 transcripts the row-by-row gffbase loop is
≥ 10 minutes; the batched call is 1.16 s — a 553× speedup over the
gffbase loop, 36.68× over legacy (PERFORMANCE_COMPARISON.md §4b).
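To reproduce the comparison on your own data, a minimal timing sketch (illustrative only; `transcript_ids` is whatever ID list your workload uses):

```python
import time

# Row-by-row (OLTP-style): one small query per transcript.
t0 = time.perf_counter()
rowwise = [
    (exon.start, exon.end)
    for tx in transcript_ids
    for exon in db.children(tx, featuretype="exon")
]
print(f"row-by-row: {time.perf_counter() - t0:.1f} s, {len(rowwise)} exons")

# Set-based (OLAP-style): one vectorized query for every transcript.
t0 = time.perf_counter()
exons = db.children_batched(transcript_ids, featuretype="exon", format="arrow")
print(f"batched:    {time.perf_counter() - t0:.1f} s, {exons.num_rows} exons")
```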
Vectorized methods at a glance¶
| Vectorized method | Replaces this loop |
|---|---|
| `db.children_batched(ids, level=…, featuretype=…, format='arrow')` | `for x in ids: db.children(x, …)` |
| `db.parents_batched(ids, …, format='arrow')` | `for x in ids: db.parents(x, …)` |
| `db.region_batched(regions, …, format='arrow')` | `for r in regions: db.region(r, …)` |
format accepts "arrow" (default — pyarrow.Table), "df"
(pandas.DataFrame), or "polars" (polars.DataFrame). All three
share memory with DuckDB's query buffers — no per-row Python
materialization happens at any layer.
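For example, a minimal sketch of the two DataFrame outputs, assuming pandas and a recent polars are installed:

```python
import polars as pl

# pandas.DataFrame: convenient for quick exploration.
exon_df = db.children_batched(transcript_ids, featuretype="exon", format="df")
print(exon_df.groupby("anchor")["start"].min().head())

# polars.DataFrame: for polars-native pipelines.
exon_pl = db.children_batched(transcript_ids, featuretype="exon", format="polars")
print(exon_pl.group_by("anchor").agg(pl.col("start").min()).head())
```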
When you don't need to migrate the pattern¶
- One-off scripts that ask `db[gene_id]` or `db.children(gene)` for fewer than ~100 anchors.
- Small annotations (< 100 k features) where SQL startup overhead is not visible.
For everything else — ML feature extraction, BED12 export of every
transcript, "for each peak in this 50 000-row BED file find every
overlapping CDS" — switch to *_batched.
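For the peak-overlap case, a minimal sketch; the `(seqid, start, end)` tuple shape and the `featuretype` filter on `region_batched` are assumptions to check against its docstring:

```python
# One tuple per BED row (50 000 in the real workload).
peaks = [
    ("chr1", 1_000_000, 1_001_000),
    ("chr1", 2_500_000, 2_500_400),
]

cds_hits = db.region_batched(peaks, featuretype="CDS", format="arrow")
# Each output row carries an "anchor" column identifying its input region.
```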
4. SQL-compat views (for raw execute() users)¶
Legacy code that did db.execute("SELECT * FROM features WHERE …")
hits the new DuckDB schema (features, attributes, edges,
closure). Two compatibility views provide the legacy column shapes:
```sql
-- features_compat: legacy SQLite-style 12-column features table.
SELECT * FROM features_compat WHERE seqid = 'chr1' LIMIT 5;

-- relations_compat: legacy parent/child/level table.
SELECT parent, child, level FROM relations_compat WHERE level = 1;
```
The attributes column on features_compat is the raw col-9
bytes (UTF-8), not legacy-style JSON. If your raw-SQL code parses
JSON out of that column, switch to querying the normalized
attributes table directly:
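A minimal sketch of such a query, using the `attributes(feature_id, key, value, idx)` shape described in §6 (the join column on `features` is assumed to be `id`):

```sql
-- Features with gene_type=protein_coding, via the normalized attributes table.
SELECT f.*
FROM features AS f
JOIN attributes AS a ON a.feature_id = f.id
WHERE a."key" = 'gene_type' AND a."value" = 'protein_coding';
```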
This is also faster: the attributes_kv index covers (key, value), so
attribute filters become indexed seeks.
5. SQLite export — the safety valve¶
If a downstream tool only knows how to read legacy
gffutils-compatible SQLite files:
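A minimal sketch, assuming the DuckDB connection behind a `FeatureDB` is exposed as `db.conn` (that attribute name is an assumption; `export_sqlite(con, path)` itself is described in §1):

```python
import gffbase

db = gffbase.FeatureDB("annotation.duckdb")
gffbase.export_sqlite(db.conn, "legacy_compatible.sqlite")  # db.conn: assumed attribute
```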
This produces a SQLite database with the original gffutils schema, a
populated UCSC bin column, and the closure flattened back into
relations(parent, child, level). The downstream tool can open this
file with gffutils.FeatureDB("legacy_compatible.sqlite").
6. Things that changed (small list)¶
- Storage backend: SQLite → DuckDB. Database file extension is `.duckdb` by convention. The legacy SQLite layout is reachable via `export_sqlite()` (above) or the compat views.
- Disk size: GFFBase databases are ~1.5× larger than legacy SQLite (the price of materializing the closure + R-tree). Worth it for the 5–550× query speedups.
- Peak ingest RSS: ~10× higher (~1.6 GB vs ~150 MB on GENCODE v49). DuckDB allocates a vectorized ingest buffer pool; reduce with `PRAGMA memory_limit='512MB'` if needed (see the sketch after this list).
- Hierarchy depth: GFFBase materializes the closure to depth 8 by default (vs depth 2 in legacy). Anything past 8 falls through to a dynamic recursive CTE — the dispatcher is automatic.
- Attributes column shape: in raw SQL, the legacy single-cell JSON blob is replaced by a normalized `attributes(feature_id, key, value, idx)` long-form table. Filtering by attribute is now an indexed query, not a full scan.
- Duplicate IDs: NCBI RefSeq emits multiple GFF3 rows that share `ID=cds-NP_xxx`. gffbase auto-suffixes repeats with `__N` (mirroring `gffutils.merge_strategy="create_unique"`) and records the remap in the `duplicates` table. No config change required.
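A minimal sketch of applying that memory cap; the PRAGMA itself is standard DuckDB, but whether `db.execute(...)` forwards it verbatim, and how to set it before `create_db()` runs, are assumptions to verify against the gffbase docs:

```python
db = gffbase.FeatureDB("annotation.duckdb")
# Assumed to be forwarded unchanged to DuckDB; caps the buffer pool
# for this session's queries.
db.execute("PRAGMA memory_limit='512MB'")
```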
7. Migration checklist¶
- `pip install gffbase`
- Replace `import gffutils` with `import gffbase as gffutils` (or use the new name directly).
- Re-ingest your annotations (`create_db`) — old `.sqlite` files can still be read by legacy gffutils; they're not GFFBase databases.
- Audit your code for `for x in ids: db.children(x, …)` loops and convert them to `db.children_batched(ids, format='arrow')`. This is the only common change that requires user action.
- If you have raw `db.execute(...)` SQL: use the `features_compat` / `relations_compat` views, or move attribute filters onto the normalized `attributes` table.
- Run your existing test suite. Everything else should be identical.
If anything breaks, please open an issue at https://github.com/Kuanhao-Chao/gffbase/issues with a minimal reproducer.