VEP high memory usage for BRCA1 transcripts work around #1228

davmlaw · 2025-01-16T01:45:21Z

Latest RefSeq for BRCA1 has 368 transcripts, which causes massive memory usage.

Ensembl/ensembl-vep#1732 (comment)

nakib suggests:

--transcript_filter "stable_id in ENST1,ENST2,..."

I wonder if you could use

--transcript_filter "not stable_id in /data/files/transcript_block_list.txt"

I'd be interested in just testing the block list out, as it could trivially solve the issue.


davmlaw · 2025-01-16T03:52:08Z

Generated BRCA1 transcripts: blocklist_brca1_new_transcripts.txt

Use by adding to VEP:

--transcript_filter "not stable_id in blocklist_brca1_new_transcripts.txt"
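If VEP is launched from a Python wrapper, the flag could be assembled along these lines. This is only a sketch: the helper name `blocklist_filter_args` is hypothetical, and it assumes `--transcript_filter` accepts a file path after `in`, as used above.

```python
from pathlib import Path


def blocklist_filter_args(blocklist_path):
    """Return extra VEP arguments that exclude transcripts listed in a file.

    Hypothetical helper; assumes --transcript_filter accepts a file path
    after "in", as in the command line shown above.
    """
    path = Path(blocklist_path)
    # Two list elements, so no shell quoting is needed when the list is
    # passed to subprocess.run() without shell=True.
    return ["--transcript_filter", f"not stable_id in {path}"]


extra_args = blocklist_filter_args("blocklist_brca1_new_transcripts.txt")
print(extra_args)
```

The returned list can be appended to whatever argument list the wrapper already builds for VEP.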

Generated via:

wget https://github.com/SACGF/cdot/releases/download/data_v0.2.27/cdot-0.2.27.GCF_000001405.40_GRCh38.p14_genomic.110.gff.json.gz https://github.com/SACGF/cdot/releases/download/data_v0.2.27/cdot-0.2.27.GCF_000001405.40_GRCh38.p14_genomic.RS_2023_10.gff.json.gz

Then Python:

import gzip
import json


def load_cdot(filename):
    # cdot data files are gzipped JSON
    with gzip.open(filename, "rt") as f:
        return json.load(f)


data_110 = load_cdot("cdot-0.2.27.GCF_000001405.40_GRCh38.p14_genomic.110.gff.json.gz")
data_rs_2023_10 = load_cdot("cdot-0.2.27.GCF_000001405.40_GRCh38.p14_genomic.RS_2023_10.gff.json.gz")

# BRCA1 transcript IDs (version stripped) present in the earlier (110) file
previous_brca1_transcripts = set()
for transcript_accession, tdata in data_110["transcripts"].items():
    if tdata.get("gene_name") == "BRCA1":
        previous_brca1_transcripts.add(transcript_accession.split(".")[0])

# Split BRCA1 transcripts in RS_2023_10 into new ones (block) and ones already in 110 (keep)
kept_brca1_transcripts = set()
new_brca1_transcript_accessions = set()
for transcript_accession, tdata in data_rs_2023_10["transcripts"].items():
    if tdata.get("gene_name") == "BRCA1":
        transcript_id = transcript_accession.split(".")[0]
        if transcript_id not in previous_brca1_transcripts:
            new_brca1_transcript_accessions.add(transcript_accession)
        else:
            kept_brca1_transcripts.add(transcript_accession)

filename = "blocklist_brca1_new_transcripts.txt"
print(f"Writing {len(new_brca1_transcript_accessions)} transcripts to '{filename}'")
with open(filename, "wt") as f:
    # Sorted so the file is stable between runs
    f.writelines(s + "\n" for s in sorted(new_brca1_transcript_accessions))

print(f"Kept {','.join(sorted(kept_brca1_transcripts))}")
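The same comparison could be wrapped in a function so it works for any gene, not just BRCA1. A rough sketch follows. The function name `split_new_and_kept` is hypothetical, and the accessions in the toy data are made up for illustration. It assumes the same dict shape as `data["transcripts"]` in the cdot files above (accession mapped to a dict with a `gene_name` key).

```python
def split_new_and_kept(old_transcripts, new_transcripts, gene_name):
    """Split a gene's transcripts in the newer data into (new, kept) sets.

    A transcript counts as "kept" if its unversioned ID already existed
    for that gene in the older data, otherwise it is "new".
    """
    old_ids = {
        accession.split(".")[0]
        for accession, tdata in old_transcripts.items()
        if tdata.get("gene_name") == gene_name
    }
    new, kept = set(), set()
    for accession, tdata in new_transcripts.items():
        if tdata.get("gene_name") != gene_name:
            continue
        if accession.split(".")[0] in old_ids:
            kept.add(accession)
        else:
            new.add(accession)
    return new, kept


# Toy data with made-up accessions, only to show the behaviour
old = {
    "NM_000001.1": {"gene_name": "GENE_A"},
    "NM_000002.1": {"gene_name": "GENE_B"},
}
new = {
    "NM_000001.2": {"gene_name": "GENE_A"},  # version bump, so kept
    "NM_000003.1": {"gene_name": "GENE_A"},  # not in old data, so new
    "NM_000002.1": {"gene_name": "GENE_B"},  # other gene, ignored
}
new_accessions, kept_accessions = split_new_and_kept(old, new, "GENE_A")
print(sorted(new_accessions), sorted(kept_accessions))
```

With the real files this would be called as `split_new_and_kept(data_110["transcripts"], data_rs_2023_10["transcripts"], "BRCA1")`.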

davmlaw changed the title ~~VEP high memory usage for BRCA2 transcripts work around~~ VEP high memory usage for BRCA1 transcripts work around Jan 16, 2025

davmlaw mentioned this issue Jan 16, 2025

High memory usage due to 368 BRCA1 RefSeq transcripts - Transcript blocklist / allowlists or option for max transcripts Ensembl/ensembl-vep#1732

Closed
