Machine Learning Workflows Cookbook¶
The flagship cookbook. The patterns here are the reason
children_batched(format='arrow') exists: bulk feature extraction for
ML pipelines with zero Python Feature object construction and
zero per-row boundary crossings between DuckDB and the consuming
ML framework.
The benchmark numbers behind every claim here are documented in
PERFORMANCE_COMPARISON.md §4b: 50 000 transcripts in 1.16 s
(36.7× faster than legacy gffutils, 553× faster than gffbase row-by-row).
1. The anti-pattern (don't do this for ML)¶
The "obvious" Pythonic loop is the slow path. It instantiates one
Feature object per row × N anchors:
# DON'T — this hits OLAP point-query overhead AND pays per-row Feature
# construction. At 50 000 genes it takes 10+ minutes.
for gene_id in fifty_thousand_gene_ids:
    for exon in db.children(gene_id, featuretype="exon"):
        starts.append(exon.start)
        ends.append(exon.end)
        ...
2. The vectorized pattern¶
One SQL query for the entire batch. One zero-copy Arrow buffer back.
Never materializes a Feature:
import pyarrow as pa
from gffbase import FeatureDB
db = FeatureDB("gencode.duckdb")
# Pull 50 000 transcript IDs (any predicate works — here, MANE_Select).
tx_ids = [r[0] for r in db.execute("""
    SELECT f.id FROM features f
    JOIN attributes a ON a.feature_id = f.id
    WHERE f.featuretype = 'transcript'
      AND a.key = 'tag' AND a.value = 'MANE_Select'
    LIMIT 50000
""").fetchall()]
# One call, one query, one Arrow table.
exons: pa.Table = db.children_batched(
    tx_ids,
    featuretype="exon",
    format="arrow",
)
print(exons.num_rows, "exons in", exons.num_columns, "columns")
print(exons.schema)
# anchor: string ← input transcript_id (for groupby)
# descendant_id: string ← exon_id
# seqid, source, featuretype: string
# start, end: int64
# score, strand, frame: string
# file_order: int64
# depth: int16
The columns are real pyarrow chunks backed by the same memory DuckDB used during query execution. For a single-chunk numeric column with no nulls, exons.column("start").to_numpy() is a view on that buffer, not a copy.
3. Hand-off to Hugging Face datasets¶
The datasets.Dataset constructor consumes the Arrow table directly. The dataset shares memory with DuckDB until you write it to disk:
from datasets import Dataset
ds = Dataset(exons)
print(ds)
# Dataset({
#     features: ['anchor','descendant_id','seqid','source','featuretype',
#                'start','end','score','strand','frame','file_order','depth'],
#     num_rows: 287456
# })
# Collate exons by their parent transcript: sorting on the anchor column
# makes each transcript's exons contiguous.
ds = ds.sort("anchor")
# Add a sequence column via a batched UDF (map hands each batch over as
# columnar slices of the Arrow table):
import pyfaidx
fa = pyfaidx.Fasta("GRCh38.primary_assembly.genome.fa")
def add_seq(batch):
    seqs = []
    for seqid, s, e, strand in zip(batch["seqid"], batch["start"],
                                   batch["end"], batch["strand"]):
        # GFF coordinates are 1-based and end-inclusive; Python slices are
        # 0-based and end-exclusive, hence s - 1.
        seq = str(fa[seqid][s - 1 : e])
        if strand == "-":
            seq = seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]
        seqs.append(seq)
    batch["sequence"] = seqs
    return batch
ds = ds.map(add_seq, batched=True, batch_size=4096)
# Ready to feed a tokenizer + model.
ds.save_to_disk("mane_select_exons")
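From here, tokenization is one more ds.map call. A hedged sketch: the checkpoint name below is a placeholder for whatever DNA tokenizer your model expects, not something this library ships:
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("your-dna-tokenizer")  # placeholder name
ds = ds.map(
    lambda batch: tok(batch["sequence"], truncation=True, max_length=512),
    batched=True,
)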
4. Hand-off to PyTorch DataLoader¶
Two equivalent paths — both bypass per-row Python:
4a. Direct from Arrow → NumPy → tensor¶
import torch
from torch.utils.data import TensorDataset, DataLoader
starts = torch.from_numpy(exons.column("start").to_numpy())
ends = torch.from_numpy(exons.column("end").to_numpy())
strands = exons.column("strand").to_pylist() # small string column
ds = TensorDataset(starts, ends)
loader = DataLoader(ds, batch_size=512, shuffle=True, num_workers=4)
for batch_starts, batch_ends in loader:
    ...  # train step
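TensorDataset only takes tensors, so the strands list above is left out. One way to carry strand along, sketched here with an arbitrary +1/-1 encoding:
strand_t = torch.tensor([1 if s == "+" else -1 for s in strands])
ds = TensorDataset(starts, ends, strand_t)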
4b. Polars DataFrame for ad-hoc feature engineering¶
exons_df = db.children_batched(tx_ids, featuretype="exon", format="polars")
exon_lengths = (exons_df["end"] - exons_df["start"] + 1).to_numpy()
format='polars' returns a polars.DataFrame built directly from the
Arrow buffers — same zero-copy property.
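From the same DataFrame, per-transcript features are one aggregation away. A sketch, assuming a recent polars release (the group_by and pl.len spellings):
import polars as pl

per_tx = (
    exons_df
    .with_columns((pl.col("end") - pl.col("start") + 1).alias("exon_len"))
    .group_by("anchor")
    .agg(
        pl.col("exon_len").sum().alias("exonic_bp"),
        pl.len().alias("n_exons"),
    )
)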
5. Spatial bulk feature extraction for ML¶
region_batched is the spatial counterpart. Want every CDS overlapping
a 50 000-row ATAC-seq peak BED file?
peaks = [
    ("chr1", 1_000_000, 1_000_500),
    ("chr1", 1_002_000, 1_002_300),
    ...  # 50 000 entries
]
cds = db.region_batched(peaks, featuretype="CDS", format="arrow")
# query_idx column maps each result row back to its peak.
print(cds.num_rows, "CDS overlaps across", len(peaks), "peaks")
# Per-peak counts in a single Arrow operation:
import pyarrow.compute as pc
peak_overlap_counts = pc.value_counts(cds.column("query_idx"))
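pc.value_counts returns a struct array of (values, counts) and omits peaks with zero overlaps. A sketch that scatters it into a dense per-peak vector, assuming query_idx is a 0-based index into peaks:
import numpy as np

dense_counts = np.zeros(len(peaks), dtype=np.int64)
idx = peak_overlap_counts.field("values").to_numpy()
cnt = peak_overlap_counts.field("counts").to_numpy()
dense_counts[idx] = cnt  # peaks absent from value_counts stay at 0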
6. Why this is fast¶
| Layer | Anti-pattern | Vectorized |
|---|---|---|
| SQL | 50 000 separate SELECT calls | one SELECT … WHERE id IN (?, ?, …) |
| Python objects | 1.6 M Feature instances + dialect dicts + _LazyAttributes | zero (Arrow buffers held by reference) |
| Result hand-off | per-row Python iteration | column-wise NumPy view |
| Memory | ~800 MB (Feature graveyard + GC pressure) | ~520 MB (DuckDB query buffer pool) |
| Wall (50 k transcripts → 1.6 M exons) | ≥ 10 minutes | 1.16 s |
The benchmark script at benchmarks/05_vectorized.py reproduces these
numbers head-to-head against legacy gffutils.
7. Tips for production ML pipelines¶
- Pin format='arrow' unless you specifically need a pandas DataFrame; Arrow → pandas conversion is the only Python-side cost in the path.
- Use db.children_batched(level=1, featuretype="exon") when you only want direct children; DuckDB's planner can prune the closure table to depth-1 entries via the index on (ancestor, depth).
- Reuse the FeatureDB instance across batches. The per-DB-open cost is dominated by LOAD spatial and meta-table reads (~50 ms); amortize it across many children_batched calls.
- Partition by chromosome for embarrassingly parallel work:
for seqid in db.seqids():
    ids_on_seqid = [r[0] for r in db.execute(
        "SELECT id FROM features WHERE featuretype='transcript' AND seqid=?",
        [seqid],
    ).fetchall()]
    table = db.children_batched(ids_on_seqid, featuretype="exon",
                                format="arrow")
    yield table  # one Arrow table per chromosome → multiprocessing-ready
- Multiple workers: open the DB read-only with duckdb.connect(path, read_only=True) per worker; DuckDB's MVCC allows unlimited concurrent readers. A minimal sketch follows.
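A per-worker sketch with multiprocessing; the raw duckdb connection is used for illustration, and the count query stands in for whatever extraction each worker actually runs:
import duckdb
from multiprocessing import Pool

def count_features(seqid):
    # One read-only connection per process; never share a connection.
    con = duckdb.connect("gencode.duckdb", read_only=True)
    return con.execute(
        "SELECT count(*) FROM features WHERE seqid = ?", [seqid]
    ).fetchone()[0]

with Pool(processes=4) as pool:
    totals = pool.map(count_features, ["chr1", "chr2", "chr3"])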