GENCODE / Ensembl GTF Cookbook¶
GENCODE and Ensembl ship full mammalian annotations as gzipped GTF. Their hierarchy is gene → transcript → exon / CDS / UTR / start_codon / stop_codon — three levels deep, with several million features per annotation and ~100 features per gene on average.
This cookbook covers the five most common tasks downstream pipelines need.
1. Ingest¶
```python
from gffbase import create_db

db = create_db(
    "gencode.v49.chr_patch_hapl_scaff.basic.annotation.gtf.gz",
    "gencode.duckdb",
    force=True,
)
print(f"{db.count_features_of_type():,} features")  # ~6.5 M (incl. synth parents)
print(db.fmt)  # 'gtf'
```
GTF input is auto-detected; `disable_infer_genes` and
`disable_infer_transcripts` default to `False`, so missing parent rows
are synthesized by a single set-based `GROUP BY` over the
attribute-normalized `transcript_id` / `gene_id` columns. (See
Performance Comparison — "GTF Synthesis
Bottleneck" — for why this matters.)
2. Walk a single gene's hierarchy¶
```python
gene = db["ENSG00000139618"]  # BRCA2
print(gene.featuretype, gene.start, gene.end)

for tx in db.children(gene, level=1, featuretype="transcript"):
    n_exons = sum(1 for _ in db.children(tx, level=1, featuretype="exon"))
    print(f"  {tx.id} {tx.start}-{tx.end} ({n_exons} exons)")

# All descendants (exons + CDSs + UTRs) in one shot
for f in db.children(gene, level=None):
    pass
```
`children(level=1)` is a closure-cache point lookup; `children(level=None)`
returns the full descendant set. The relational dispatcher auto-routes
between the materialized closure and a recursive CTE.
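The recursive-CTE path can be sketched against a bare parent/child edge table. The `relations` schema below is illustrative, not gffbase's actual layout:

```python
# Sketch of the CTE fallback: walk all descendants of one gene over a
# hypothetical parent/child edge table, using stdlib sqlite3.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE relations (parent TEXT, child TEXT)")
con.executemany(
    "INSERT INTO relations VALUES (?, ?)",
    [
        ("GENE1", "TX1"), ("GENE1", "TX2"),
        ("TX1", "EXON1"), ("TX1", "CDS1"),
        ("TX2", "EXON2"),
    ],
)
# level=None semantics: descendants at any depth, not just direct children.
descendants = [r[0] for r in con.execute("""
    WITH RECURSIVE walk(id) AS (
        SELECT child FROM relations WHERE parent = 'GENE1'
        UNION
        SELECT r.child FROM relations r JOIN walk w ON r.parent = w.id
    )
    SELECT id FROM walk ORDER BY id
""")]
print(descendants)
# ['CDS1', 'EXON1', 'EXON2', 'TX1', 'TX2']
```

A materialized closure table trades this per-query recursion for a single indexed lookup, which is why the dispatcher prefers it when the cache is available.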
3. Filter by attribute (e.g. all protein-coding genes)¶
GENCODE attributes are stored in a normalized long-form table indexed on
(key, value). Attribute filtering is therefore an indexed query, not a
JSON scan:
```python
rows = db.execute("""
    SELECT f.id, f.seqid, f.start, f."end"
    FROM features f
    JOIN attributes a ON a.feature_id = f.id
    WHERE f.featuretype = 'gene'
      AND a.key = 'gene_type'
      AND a.value = 'protein_coding'
""").fetchall()
print(f"{len(rows):,} protein-coding genes")
```
4. BED12 export for genome-browser tracks¶
```python
with open("gencode.bed", "w") as fout:
    for tx in db.features_of_type("transcript", limit="chr1"):
        fout.write(db.bed12(tx, name_field="transcript_id") + "\n")
```
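For readers unfamiliar with BED12, the record layout can be illustrated with a small hand-rolled packer. This is a hypothetical helper, not gffbase's `bed12()` implementation, and it omits CDS-aware `thickStart`/`thickEnd` handling:

```python
# Pure-Python sketch of how a transcript's exons pack into one BED12 line.
def to_bed12(chrom, tx_start, tx_end, name, strand, exons):
    """exons: sorted (start, end) pairs in 0-based half-open coordinates."""
    block_sizes = ",".join(str(e - s) for s, e in exons)
    # blockStarts are relative to chromStart (the transcript start).
    block_starts = ",".join(str(s - tx_start) for s, e in exons)
    return "\t".join(map(str, [
        chrom, tx_start, tx_end, name, 0, strand,
        tx_start, tx_end,        # thickStart/thickEnd (no CDS logic here)
        "0,0,0", len(exons),     # itemRgb, blockCount
        block_sizes, block_starts,
    ]))

line = to_bed12("chr1", 100, 400, "TX1", "+", [(100, 200), (300, 400)])
print(line)
# chr1	100	400	TX1	0	+	100	400	0,0,0	2	100,100	0,200
```

Genome browsers render `blockSizes`/`blockStarts` as exon boxes joined by intron lines, which is why one transcript collapses into a single track row.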
5. Bulk extraction into PyArrow (the ML path)¶
For ML pipelines that need exons for many transcripts at once, prefer
the vectorized API — see
machine_learning_workflows.md for the
full pattern:
```python
gene_ids = [r[0] for r in db.execute(
    "SELECT id FROM features WHERE featuretype = 'gene' LIMIT 50000"
).fetchall()]

table = db.children_batched(gene_ids, featuretype="exon", format="arrow")
print(table.num_rows, "exons,", len(table.column_names), "columns")
```
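Once the Arrow table is in hand, a common next step is regrouping the flat exon rows per gene. A pure-Python sketch over `to_pylist()`-style row dicts (the column names here are assumptions, not gffbase's guaranteed output schema):

```python
# Regroup flat (gene_id, start, end) exon rows into per-gene lists.
from collections import defaultdict

rows = [  # stand-in for table.to_pylist()
    {"gene_id": "G1", "start": 100, "end": 200},
    {"gene_id": "G1", "start": 300, "end": 400},
    {"gene_id": "G2", "start": 50,  "end": 90},
]

exons_by_gene = defaultdict(list)
for r in rows:
    exons_by_gene[r["gene_id"]].append((r["start"], r["end"]))

print(dict(exons_by_gene))
# {'G1': [(100, 200), (300, 400)], 'G2': [(50, 90)]}
```

At 50 000 genes, real pipelines would keep the data columnar (e.g. Arrow group-by or zero-copy conversion to a dataframe) rather than materializing Python dicts, but the grouping logic is the same.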
Performance notes (real GENCODE v49 numbers)¶
| Task | Wall | Source |
|---|---|---|
| Full ingest (6.07 M lines) | ~277 s | PERFORMANCE_COMPARISON.md §0 |
| `children(g, level=1)` (single gene) | <1 ms | materialized closure cache |
| 50 000 × `children_batched()` | 1.16 s | PERFORMANCE_COMPARISON.md §4b |
| Random `region(seqid:start-end)` | ~0.7 ms | R-tree path |