MANE (Matched Annotation from NCBI and EMBL-EBI) Cookbook
The MANE project picks one transcript per protein-coding gene as
the canonical reference (MANE_Select) and adds a small
MANE_Plus_Clinical set for clinically important alternates. Both
labels are exposed as tag attributes in the GENCODE GTF / Ensembl
GFF3 stream:
9 HAVANA transcript ... gene_id "ENSG..."; transcript_id "ENST..."; tag "MANE_Select"; tag "Ensembl_canonical";
This cookbook shows how to filter for these tags efficiently using GFFBase's normalized attribute index.
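Because tag is a repeated key, any parser has to keep every value rather than overwrite the previous one. The sketch below illustrates that normalization step for GTF-style key "value"; pairs. parse_gtf_attributes is a hypothetical helper written for illustration, not part of GFFBase, and the IDs are placeholders:

```python
import re
from collections import defaultdict

# Matches GTF pairs such as:  tag "MANE_Select";   or   level 2;
_PAIR = re.compile(r'(\S+)\s+(?:"([^"]*)"|([^";\s]+))\s*;')


def parse_gtf_attributes(col9: str) -> dict[str, list[str]]:
    """Return every value for every key, preserving repeated keys like `tag`."""
    attrs: dict[str, list[str]] = defaultdict(list)
    for key, quoted, bare in _PAIR.findall(col9):
        # Exactly one of the two capture groups is non-empty per match.
        attrs[key].append(quoted if quoted else bare)
    return dict(attrs)


# Placeholder IDs; the layout mirrors the GTF line shown above.
col9 = (
    'gene_id "ENSG_EXAMPLE"; transcript_id "ENST_EXAMPLE"; level 2; '
    'tag "MANE_Select"; tag "Ensembl_canonical";'
)
attrs = parse_gtf_attributes(col9)
print(attrs["tag"])
```

Each (key, value) pair in the resulting lists corresponds to one row of a normalized attribute table, which is why a transcript with two tags contributes two tag rows.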
1. Ingest GENCODE/Ensembl, then filter
from gffbase import create_db
db = create_db("gencode.v49.chr_patch_hapl_scaff.basic.annotation.gtf.gz",
"gencode.duckdb", force=True)
# Every MANE_Select transcript:
mane_select = db.execute("""
SELECT f.id, f.seqid, f.start, f."end", f.strand
FROM features f
JOIN attributes a ON a.feature_id = f.id
WHERE f.featuretype = 'transcript'
AND a.key = 'tag'
AND a.value = 'MANE_Select'
ORDER BY f.seqid, f.start
""").fetchall()
print(f"{len(mane_select):,} MANE_Select transcripts")
The attributes_kv index on (key, value) lets the tag predicate resolve
through an index lookup rather than a full-table scan. On GENCODE v49 the
query returns ~19 000 rows in under 100 ms.
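The rows come back as plain tuples, so handing them to interval tools is a small conversion. A minimal sketch that writes BED6 lines, assuming the start and end columns keep the 1-based, end-inclusive coordinates of the source GTF; rows_to_bed and the example row are hypothetical:

```python
def rows_to_bed(rows):
    """Convert (id, seqid, start, end, strand) tuples to BED6 lines.

    GFF/GTF coordinates are 1-based and end-inclusive; BED is 0-based and
    end-exclusive, so only the start shifts by one.
    """
    lines = []
    for feature_id, seqid, start, end, strand in rows:
        # The score column is unused here and set to 0.
        lines.append(f"{seqid}\t{start - 1}\t{end}\t{feature_id}\t0\t{strand}")
    return lines


# Hypothetical row in the same column order as the query above.
example_rows = [("ENST_EXAMPLE", "chr13", 1000, 2000, "+")]
bed_lines = rows_to_bed(example_rows)
print("\n".join(bed_lines))
```

With real data, pass mane_select in place of example_rows.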
2. Combine MANE_Select and MANE_Plus_Clinical
mane_any = db.execute("""
SELECT f.id, a.value AS mane_label
FROM features f
JOIN attributes a ON a.feature_id = f.id
WHERE f.featuretype = 'transcript'
AND a.key = 'tag'
AND a.value IN ('MANE_Select', 'MANE_Plus_Clinical')
""").fetchall()
print(f"{len(mane_any):,} MANE-tagged transcripts")
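Since each row carries its label, a quick tally per label needs no second query. A minimal sketch using the standard library; count_labels and the example rows are hypothetical stand-ins for mane_any:

```python
from collections import Counter


def count_labels(rows):
    """Tally (feature_id, mane_label) rows by label."""
    return Counter(label for _feature_id, label in rows)


# Hypothetical rows shaped like the query result above.
example = [
    ("tx1", "MANE_Select"),
    ("tx2", "MANE_Select"),
    ("tx3", "MANE_Plus_Clinical"),
]
label_counts = count_labels(example)
for label, n in label_counts.items():
    print(f"{label}: {n:,}")
```

Note that the join yields one row per matching tag, so a transcript carrying both labels would be counted once under each.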
3. Pull the MANE transcript for a specific gene
gene_id = "ENSG00000139618" # BRCA2
# GENCODE writes gene_id with a version suffix (ENSG00000139618.N), so match
# on the stable-ID prefix; this also works if the stored value is unversioned.
row = db.execute("""
SELECT f.id
FROM features f
JOIN attributes a_tag ON a_tag.feature_id = f.id
JOIN attributes a_gene ON a_gene.feature_id = f.id
WHERE f.featuretype = 'transcript'
AND a_tag.key = 'tag' AND a_tag.value = 'MANE_Select'
AND a_gene.key = 'gene_id' AND a_gene.value LIKE ?
""", [gene_id + "%"]).fetchone()
mane_tx_id = row[0] if row else None
print(f"BRCA2 MANE_Select transcript: {mane_tx_id}")
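GENCODE GTF files write stable IDs with a version suffix (for example ENSG00000139618.N), whereas Ensembl GTF files keep the version in a separate gene_version attribute. Whether GFFBase normalizes this at ingest is not stated here, so when comparing IDs across sources it can help to strip the suffix first. A minimal sketch; strip_version is a hypothetical helper, not a GFFBase function, and the version numbers in the example are made up:

```python
def strip_version(stable_id: str) -> str:
    """Return the stable ID without its '.<version>' suffix.

    Anything after the first dot is dropped, which also removes GENCODE's
    '_PAR_Y' marker; keep the original string if that distinction matters.
    """
    return stable_id.split(".", 1)[0]


# Version numbers below are illustrative, not taken from a real release.
versioned = strip_version("ENSG00000139618.1")
par_copy = strip_version("ENST00000000001.2_PAR_Y")
unversioned = strip_version("ENSG00000139618")
print(versioned, par_copy, unversioned)
```

Applied to the result above, strip_version(mane_tx_id) gives an ID that can be compared against unversioned Ensembl identifiers, provided mane_tx_id is not None.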
4. Get every exon of every MANE_Select transcript (vectorized)
This bulk pattern scales well. Pass all MANE_Select transcript IDs
to children_batched to get a zero-copy Arrow result:
mane_ids = [r[0] for r in db.execute("""
SELECT f.id FROM features f
JOIN attributes a ON a.feature_id = f.id
WHERE f.featuretype = 'transcript' AND a.key = 'tag'
AND a.value = 'MANE_Select'
""").fetchall()]
exons = db.children_batched(mane_ids, featuretype="exon", format="arrow")
print(f"{exons.num_rows:,} exons across {len(mane_ids):,} MANE_Select transcripts")
# The 'anchor' column holds the parent transcript ID, so you can count
# exons per transcript without a Python loop:
import pyarrow.compute as pc
exon_count_per_tx = pc.value_counts(exons.column("anchor"))
5. Build a MANE-only sub-database for downstream tooling
If you want a smaller .duckdb containing only the MANE subset, attach a
second database file and populate it with CREATE TABLE ... AS SELECT:
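db.execute("""
ATTACH 'mane_only.duckdb' AS mane_db (TYPE DUCKDB);
CREATE TABLE mane_db.transcripts AS
SELECT f.*, a.value AS mane_tag
FROM features f
JOIN attributes a ON a.feature_id = f.id
-- GENCODE repeats transcript tags on child rows (exon, CDS, ...), so
-- restrict to transcript features to match the table name.
WHERE f.featuretype = 'transcript'
AND a.key = 'tag' AND a.value IN ('MANE_Select','MANE_Plus_Clinical');
""")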
Why this is fast
GENCODE's tag attribute appears on ~3 M attribute rows in the
normalized table. The attributes_kv index turns the
WHERE key='tag' AND value='MANE_Select' predicate into a single
index lookup (DuckDB indexes are adaptive radix trees rather than
B-trees) that returns ~19 k feature_ids. Compare this with the legacy
gffutils model, where the entire col-9 JSON blob has to be parsed for
every feature row before the tag can even be examined.
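The effect of a (key, value) index on a normalized attribute table can be reproduced in miniature. The sketch below uses the standard-library sqlite3 module rather than DuckDB, so the planner and index structure differ from what GFFBase runs on; the three-column schema is an assumption, and only the table and index names are borrowed from the text above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE attributes (feature_id TEXT, key TEXT, value TEXT);
    CREATE INDEX attributes_kv ON attributes (key, value);
""")
# One row per (feature, key, value); repeated keys become repeated rows.
con.executemany(
    "INSERT INTO attributes VALUES (?, ?, ?)",
    [
        ("tx1", "tag", "MANE_Select"),
        ("tx1", "tag", "Ensembl_canonical"),
        ("tx2", "tag", "basic"),
        ("tx2", "gene_id", "GENE_B"),
    ],
)

query = (
    "SELECT feature_id FROM attributes "
    "WHERE key = 'tag' AND value = 'MANE_Select'"
)
hits = [row[0] for row in con.execute(query)]
# The fourth column of EXPLAIN QUERY PLAN is the human-readable step.
plan = " ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + query))
print(hits)
print(plan)
```

The printed plan names attributes_kv, showing that the predicate is answered from the index without touching rows for other keys or values.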