import gzip
import json

# Older RefSeq annotation release (110) vs. the newer RS_2023_10 release.
data_110 = json.load(gzip.open("cdot-0.2.27.GCF_000001405.40_GRCh38.p14_genomic.110.gff.json.gz"))
data_rs_2023_10 = json.load(gzip.open("cdot-0.2.27.GCF_000001405.40_GRCh38.p14_genomic.RS_2023_10.gff.json.gz"))

# BRCA1 transcript IDs (version stripped) present in the older release.
previous_brca1_transcripts = set()
for transcript_accession, tdata in data_110["transcripts"].items():
    if tdata.get("gene_name") == 'BRCA1':
        previous_brca1_transcripts.add(transcript_accession.split(".")[0])

# Split the new release's BRCA1 transcripts into "kept" (also in the old
# release) and "new" (candidates for the blocklist).
kept_brca1_transcripts = set()
new_brca1_transcript_accessions = set()
for transcript_accession, tdata in data_rs_2023_10["transcripts"].items():
    if tdata.get("gene_name") == 'BRCA1':
        transcript_id, version = transcript_accession.split(".")
        if transcript_id not in previous_brca1_transcripts:
            new_brca1_transcript_accessions.add(transcript_accession)
        else:
            kept_brca1_transcripts.add(transcript_accession)

filename = "blocklist_brca1_new_transcripts.txt"
print(f"Writing {len(new_brca1_transcript_accessions)} transcripts to '{filename}'")
with open(filename, "wt") as f:
    f.writelines([s + '\n' for s in new_brca1_transcript_accessions])
print(f"Kept {','.join(kept_brca1_transcripts)}")
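As a quick sanity check on the script's output, the blocklist can be read back and validated before handing it to VEP. This is a sketch; the `load_blocklist` helper is illustrative, not part of the script above.

```python
# Sketch: read a blocklist file back and confirm it contains only
# versioned accessions (e.g. "NM_001234.5") with no duplicates.
def load_blocklist(path):
    with open(path) as f:
        accessions = [line.strip() for line in f if line.strip()]
    assert len(accessions) == len(set(accessions)), "duplicate accessions"
    assert all("." in a for a in accessions), "unversioned accession found"
    return set(accessions)
```

A kept accession appearing in the blocklist would indicate the version-stripping comparison above went wrong, so checking `load_blocklist(filename).isdisjoint(kept_brca1_transcripts)` is also cheap insurance.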
davmlaw changed the title from "VEP high memory usage for BRCA2 transcripts work around" to "VEP high memory usage for BRCA1 transcripts work around" on Jan 16, 2025.
The latest RefSeq annotation for BRCA1 has 368 transcripts, which causes massive memory usage.
Ensembl/ensembl-vep#1732 (comment)
nakib suggests:
--transcript_filter "stable_id in ENST1,ENST2,..."
I wonder if you could use
--filter "not stable_id in /data/files/transcript_block_list.txt"
I'd be interested in just testing the block list out; it could trivially solve the issue.
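If the file-based `--filter` form isn't supported, the kept accessions could instead be expanded into the comma-separated `--transcript_filter "stable_id in ..."` form nakib suggested. A minimal sketch; the helper name and the sample accessions are illustrative, and whether `stable_id` matches versioned RefSeq accessions would need checking against VEP's filter docs:

```python
# Sketch: build the argument for nakib's suggested --transcript_filter
# from a set of kept transcript accessions (sorted for reproducibility).
def build_transcript_filter(accessions):
    return "stable_id in " + ",".join(sorted(accessions))

# Example with hypothetical accessions:
print(build_transcript_filter({"NM_007294.4", "NM_007300.4"}))
```

The allowlist form is the inverse of the blocklist approach above, so only one of the two should be needed depending on which flag VEP accepts.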