Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sc_long_pipeline.py--> ValueError: invalid contig chr1 #37

Open
jeneaadams opened this issue Jan 16, 2023 · 10 comments
Open

sc_long_pipeline.py--> ValueError: invalid contig chr1 #37

jeneaadams opened this issue Jan 16, 2023 · 10 comments

Comments

@jeneaadams
Copy link

Hi, both my genome.fa and gff3 files use contig chr1. Is there support for this format or parameters I can set to solve this error?

Traceback (most recent call last):
File "PATH/TO/FLAMES/python/sc_long_pipeline.py", line 240, in
sc_long_pipeline(args)
File "PATH/TO/FLAMES/python/sc_long_pipeline.py", line 193, in sc_long_pipeline
raw_gff3=raw_splice_isoform if config_dict["global_parameters"]["generate_raw_isoform"] else None)
File "PATH/TO/FLAMES/python/sc_longread.py", line 1123, in group_bam2isoform
it_region = bamfile.fetch(ch, bl.s, bl.e)
File "pysam/libcalignmentfile.pyx", line 1081, in pysam.libcalignmentfile.AlignmentFile.fetch
File "pysam/libchtslib.pyx", line 686, in pysam.libchtslib.HTSFile.parse_region
ValueError: invalid contig `chr1

@LuyiTian
Copy link
Owner

Oh I think it is a pysam error which happens when reading the bam file. It is likely that your bam file is generated with a different reference without chr1 or there is no chr1 reads in your bam files so it give this error. you can check your bam file header using samtools view -H $your_bam_file.

@jeneaadams
Copy link
Author

Here's my output. There seem to be chr1 reads.

@hd VN:1.6 SO:coordinate
@sq SN:chr1 LN:248956422
@sq SN:chr10 LN:133797422
@sq SN:chr11 LN:135086622
@sq SN:chr12 LN:133275309
@sq SN:chr13 LN:114364328
@sq SN:chr14 LN:107043718
@sq SN:chr15 LN:101991189
@sq SN:chr16 LN:90338345
@sq SN:chr17 LN:83257441
@sq SN:chr18 LN:80373285
@sq SN:chr19 LN:58617616
@sq SN:chr2 LN:242193529
@sq SN:chr20 LN:64444167
@sq SN:chr21 LN:46709983
@sq SN:chr22 LN:50818468
@sq SN:chr3 LN:198295559
@sq SN:chr4 LN:190214555
@sq SN:chr5 LN:181538259
@sq SN:chr6 LN:170805979
@sq SN:chr7 LN:159345973
@sq SN:chr8 LN:145138636
@sq SN:chr9 LN:138394717
@sq SN:chrM LN:16569
@sq SN:chrX LN:156040895
@sq SN:chrY LN:57227415
@sq SN:KI270728.1 LN:1872759
@sq SN:KI270727.1 LN:448248
@sq SN:KI270442.1 LN:392061
@sq SN:KI270729.1 LN:280839
@sq SN:GL000225.1 LN:211173
@sq SN:KI270743.1 LN:210658
@sq SN:GL000008.2 LN:209709
@sq SN:GL000009.2 LN:201709
@sq SN:KI270747.1 LN:198735
@sq SN:KI270722.1 LN:194050
@sq SN:GL000194.1 LN:191469
@sq SN:KI270742.1 LN:186739
@sq SN:GL000205.2 LN:185591
@sq SN:GL000195.1 LN:182896
@sq SN:KI270736.1 LN:181920
@sq SN:KI270733.1 LN:179772
@sq SN:GL000224.1 LN:179693
@sq SN:GL000219.1 LN:179198
@sq SN:KI270719.1 LN:176845
@sq SN:GL000216.2 LN:176608
@sq SN:KI270712.1 LN:176043
@sq SN:KI270706.1 LN:175055
@sq SN:KI270725.1 LN:172810
@sq SN:KI270744.1 LN:168472
@sq SN:KI270734.1 LN:165050
@sq SN:GL000213.1 LN:164239
@sq SN:GL000220.1 LN:161802
@sq SN:KI270715.1 LN:161471
@sq SN:GL000218.1 LN:161147
@sq SN:KI270749.1 LN:158759
@sq SN:KI270741.1 LN:157432
@sq SN:GL000221.1 LN:155397
@sq SN:KI270716.1 LN:153799
@sq SN:KI270731.1 LN:150754
@sq SN:KI270751.1 LN:150742
@sq SN:KI270750.1 LN:148850
@sq SN:KI270519.1 LN:138126
@sq SN:GL000214.1 LN:137718
@sq SN:KI270708.1 LN:127682
@sq SN:KI270730.1 LN:112551
@sq SN:KI270438.1 LN:112505
@sq SN:KI270737.1 LN:103838
@sq SN:KI270721.1 LN:100316
@sq SN:KI270738.1 LN:99375
@sq SN:KI270748.1 LN:93321
@sq SN:KI270435.1 LN:92983
@sq SN:GL000208.1 LN:92689
@sq SN:KI270538.1 LN:91309
@sq SN:KI270756.1 LN:79590
@sq SN:KI270739.1 LN:73985
@sq SN:KI270757.1 LN:71251
@sq SN:KI270709.1 LN:66860
@sq SN:KI270746.1 LN:66486
@sq SN:KI270753.1 LN:62944
@sq SN:KI270589.1 LN:44474
@sq SN:KI270726.1 LN:43739
@sq SN:KI270735.1 LN:42811
@sq SN:KI270711.1 LN:42210
@sq SN:KI270745.1 LN:41891
@sq SN:KI270714.1 LN:41717
@sq SN:KI270732.1 LN:41543
@sq SN:KI270713.1 LN:40745
@sq SN:KI270754.1 LN:40191
@sq SN:KI270710.1 LN:40176
@sq SN:KI270717.1 LN:40062
@sq SN:KI270724.1 LN:39555
@sq SN:KI270720.1 LN:39050
@sq SN:KI270723.1 LN:38115
@sq SN:KI270718.1 LN:38054
@sq SN:KI270317.1 LN:37690
@sq SN:KI270740.1 LN:37240
@sq SN:KI270755.1 LN:36723
@sq SN:KI270707.1 LN:32032
@sq SN:KI270579.1 LN:31033
@sq SN:KI270752.1 LN:27745
@sq SN:KI270512.1 LN:22689
@sq SN:KI270322.1 LN:21476
@sq SN:GL000226.1 LN:15008
@sq SN:KI270311.1 LN:12399
@sq SN:KI270366.1 LN:8320
@sq SN:KI270511.1 LN:8127
@sq SN:KI270448.1 LN:7992
@sq SN:KI270521.1 LN:7642
@sq SN:KI270581.1 LN:7046
@sq SN:KI270582.1 LN:6504
@sq SN:KI270515.1 LN:6361
@sq SN:KI270588.1 LN:6158
@sq SN:KI270591.1 LN:5796
@sq SN:KI270522.1 LN:5674
@sq SN:KI270507.1 LN:5353
@sq SN:KI270590.1 LN:4685
@sq SN:KI270584.1 LN:4513
@sq SN:KI270320.1 LN:4416
@sq SN:KI270382.1 LN:4215
@sq SN:KI270468.1 LN:4055
@sq SN:KI270467.1 LN:3920
@sq SN:KI270362.1 LN:3530
@sq SN:KI270517.1 LN:3253
@sq SN:KI270593.1 LN:3041
@sq SN:KI270528.1 LN:2983
@sq SN:KI270587.1 LN:2969
@sq SN:KI270364.1 LN:2855
@sq SN:KI270371.1 LN:2805
@sq SN:KI270333.1 LN:2699
@sq SN:KI270374.1 LN:2656
@sq SN:KI270411.1 LN:2646
@sq SN:KI270414.1 LN:2489
@sq SN:KI270510.1 LN:2415
@sq SN:KI270390.1 LN:2387
@sq SN:KI270375.1 LN:2378
@sq SN:KI270420.1 LN:2321
@sq SN:KI270509.1 LN:2318
@sq SN:KI270315.1 LN:2276
@sq SN:KI270302.1 LN:2274
@sq SN:KI270518.1 LN:2186
@sq SN:KI270530.1 LN:2168
@sq SN:KI270304.1 LN:2165
@sq SN:KI270418.1 LN:2145
@sq SN:KI270424.1 LN:2140
@sq SN:KI270417.1 LN:2043
@sq SN:KI270508.1 LN:1951
@sq SN:KI270303.1 LN:1942
@sq SN:KI270381.1 LN:1930
@sq SN:KI270529.1 LN:1899
@sq SN:KI270425.1 LN:1884
@sq SN:KI270396.1 LN:1880
@sq SN:KI270363.1 LN:1803
@sq SN:KI270386.1 LN:1788
@sq SN:KI270465.1 LN:1774
@sq SN:KI270383.1 LN:1750
@sq SN:KI270384.1 LN:1658
@sq SN:KI270330.1 LN:1652
@sq SN:KI270372.1 LN:1650
@sq SN:KI270548.1 LN:1599
@sq SN:KI270580.1 LN:1553
@sq SN:KI270387.1 LN:1537
@sq SN:KI270391.1 LN:1484
@sq SN:KI270305.1 LN:1472
@sq SN:KI270373.1 LN:1451
@sq SN:KI270422.1 LN:1445
@sq SN:KI270316.1 LN:1444
@sq SN:KI270340.1 LN:1428
@sq SN:KI270338.1 LN:1428
@sq SN:KI270583.1 LN:1400
@sq SN:KI270334.1 LN:1368
@sq SN:KI270429.1 LN:1361
@sq SN:KI270393.1 LN:1308
@sq SN:KI270516.1 LN:1300
@sq SN:KI270389.1 LN:1298
@sq SN:KI270466.1 LN:1233
@sq SN:KI270388.1 LN:1216
@sq SN:KI270544.1 LN:1202
@sq SN:KI270310.1 LN:1201
@sq SN:KI270412.1 LN:1179
@sq SN:KI270395.1 LN:1143
@sq SN:KI270376.1 LN:1136
@sq SN:KI270337.1 LN:1121
@sq SN:KI270335.1 LN:1048
@sq SN:KI270378.1 LN:1048
@sq SN:KI270379.1 LN:1045
@sq SN:KI270329.1 LN:1040
@sq SN:KI270419.1 LN:1029
@sq SN:KI270336.1 LN:1026
@sq SN:KI270312.1 LN:998
@sq SN:KI270539.1 LN:993
@sq SN:KI270385.1 LN:990
@sq SN:KI270423.1 LN:981
@sq SN:KI270392.1 LN:971
@sq SN:KI270394.1 LN:970
@pg ID:minimap2 PN:minimap2 VN:2.24-r1122 CL:minimap2 -ax splice -k14 -t 1 --junc-bed /PATH/TO/gencode.v39.annotation.sorted.bed --secondary=no --sam-hit-only /PATH/TO/genome.fa PATH/TO//RPL41_204_205_5.fq.gz
@pg ID:samtools PN:samtools PP:minimap2 VN:1.15.1 CL:samtools view -hb PATH/TO//RPL41_204_205_5.sam
@pg ID:samtools.1 PN:samtools PP:samtools VN:1.15.1 CL:samtools sort -o PATH/TO//RPL41_204_205_5.sort.bam PATH/TO//RPL41_204_205_5.bam

@jeneaadams
Copy link
Author

jeneaadams commented Jan 19, 2023

Hello,

Here's the structure of our input fastq from the match_cell_barcode step:
image

Here's the BAM file the full pipeline starts to output:
image

The transcript ID replaced the chromosome ID in the name file, and there's no chromosome contig. Also there's no read name, just a cell barcode as a read name in the bam file.

Usage for sc_long_pipeline.py:
~/FLAMES/python/sc_long_pipeline.py --gff3 ~/gencode.v39.annotation.gff3 --infq ~/out_pcr5_hq_fastq.gz --outdir ~/flames_test_MYL6/ --genomefa ~/gencode.v39.transcripts.fa --minimap2_dir ~/anaconda3/envs/FLAMES3/bin/

Original fastq before barcode match:
image

@LuyiTian
Copy link
Owner

The genomefa should be the fasta file of genome instead of transcript. For human the most recent primary assembly should be: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_42/GRCh38.primary_assembly.genome.fa.gz. other versions have similar naming convention.

The FLAMES software will generate the transcript fasta file, which contains the known transcript and the new transcript called from the bam file.

@jeneaadams
Copy link
Author

jeneaadams commented Jan 25, 2023

I see, thanks. With the updated input, I get a new error:

Traceback (most recent call last):
File "~FLAMES_analysis/FLAMES/python/count_tr.py", line 159, in parse_realigned_bam
bc, umi = r.split("#")[0].split("_") # assume cleaned barcode
ValueError: not enough values to unpack (expected 2, got 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "~FLAMES/python/sc_long_pipeline.py", line 240, in
sc_long_pipeline(args)
File "~FLAMES/python/sc_long_pipeline.py", line 225, in sc_long_pipeline
realign_bam, transcript_fa_idx, config_dict["isoform_parameters"]["Min_sup_cnt"], config_dict["transcript_counting"]["min_tr_coverage"], config_dict["transcript_counting"]["min_read_coverage"])
File "~FLAMES/python/count_tr.py", line 163, in parse_realigned_bam
"Please check if barcode and UMI are delimited by "_"")
ValueError: Please check if barcode and UMI are delimited by "__"

Our barcode is the first 16nt and the umi is the next 10nt that follows.

The fastq I provided for --infq is the binary fastq file from the first barcode matching step. Can you also confirm this should be a binary file generated from match_cell_barcode?

@ChangqingW
Copy link
Collaborator

Yes, it is expecting fastq files outputted by match_cell_barcode, where the identifier field has the format of @[barcode]_[umi]#[original id]

@yuntianf
Copy link

yuntianf commented Feb 3, 2023

But this fastq file couldn't be viewed by less, it's a biniary file. And it couldn't be aligned with minimap2 either, here is the error log:

Input parameters:
        gene annotation: /home/reference/annotation/human/gencode.v39.annotation.gff3
        genome fasta: /home/reference/refdata-gex-GRCh38-2020-A/fasta/genome.fa
        input fastq: /home/FLAMES/out.fastq
        output directory: /home/FLAMES/
        directory contains minimap2: /home/software/minimap2-2.24_x64-linux/
### align reads to genome using minimap2 2023-02-03 11:25:22
b''
Traceback (most recent call last):
  File "./python/sc_long_pipeline.py", line 240, in <module>
    sc_long_pipeline(args)
  File "./python/sc_long_pipeline.py", line 168, in sc_long_pipeline
    seed=config_dict["alignment_parameters"]["seed"])
  File "/home/software/FLAMES/python/minimap2_align.py", line 41, in minimap2_align
    shell=True, stderr=subprocess.STDOUT))
  File "/home/software/anaconda3/envs/py3.7/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/home/software/anaconda3/envs/py3.7/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/home/software/minimap2-2.24_x64-linux/minimap2 -ax splice -t 12 --junc-bed /home/FLAMES/tmp.splice_anno.bed12 --junc-bonus 1  -k14 --secondary=no --seed 2022 /home/reference/refdata-gex-GRCh38-2020-A/fasta/genome.fa /home/FLAMES//out.fastq | samtools view -bS -@ 4 -m 2G -o /home/FLAMES/tmp.alig
n.bam -  ']' returned non-zero exit status 1.

Here is the output statistic for match_cell_barcode, which may help:

###total read: 26292
###barcode hm match: 13191
###barcode fuzzy match: 2127
###barcode not match: 10974
###too short: 0

@LuyiTian
Copy link
Owner

@dontwantcode there seem to be an extra / in the minimap code: /home/FLAMES//out.fastq

subprocess.CalledProcessError: Command '['/home/software/minimap2-2.24_x64-linux/minimap2 -ax splice -t 12 --junc-bed /home/FLAMES/tmp.splice_anno.bed12 --junc-bonus 1  -k14 --secondary=no --seed 2022 /home/reference/refdata-gex-GRCh38-2020-A/fasta/genome.fa /home/FLAMES//out.fastq | samtools view -bS -@ 4 -m 2G -o /home/FLAMES/tmp.alig
n.bam -  ']'

@yuntianf
Copy link

Hi Luyi,
I don't think it should be the problem of \\, which is equivalent as \ in linux path.
Here is a verification:

Input parameters:
        gene annotation: /home/reference/annotation/human/gencode.v39.annotation.gff3
        genome fasta: /home/reference/refdata-gex-GRCh38-2020-A/fasta/genome.fa
        input fastq: /home/FLAMES/out.fastq
        output directory: /home/FLAMES
        directory contains minimap2: /home/software/minimap2-2.24_x64-linux/
### align reads to genome using minimap2 2023-02-11 22:09:24
b''
Traceback (most recent call last):
  File "./python/sc_long_pipeline.py", line 240, in <module>
    sc_long_pipeline(args)
  File "./python/sc_long_pipeline.py", line 168, in sc_long_pipeline
    seed=config_dict["alignment_parameters"]["seed"])
  File "/home/software/FLAMES/python/minimap2_align.py", line 41, in minimap2_align
    shell=True, stderr=subprocess.STDOUT))
  File "/home/software/anaconda3/envs/py3.7/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/home/software/anaconda3/envs/py3.7/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/home/software/minimap2-2.24_x64-linux/minimap2 -ax splice -t 12 --junc-bed /home/FLAMES/tmp.splice_anno.bed12 --junc-bonus 1  -k14 --secondary=no --seed 2022 /home/reference/refdata-gex-GRCh38-2020-A/fasta/genome.fa /home/FLAMES/out.fastq | samtools view -bS -@ 4 -m 2G -o /home/FLAMES/tmp.align
.bam -  ']' returned non-zero exit status 1.

@ChangqingW
Copy link
Collaborator

But this fastq file couldn't be viewed by less, it's a biniary file. And it couldn't be aligned with minimap2 either, here is the error log:

Input parameters:
        gene annotation: /home/reference/annotation/human/gencode.v39.annotation.gff3
        genome fasta: /home/reference/refdata-gex-GRCh38-2020-A/fasta/genome.fa
        input fastq: /home/FLAMES/out.fastq
        output directory: /home/FLAMES/
        directory contains minimap2: /home/software/minimap2-2.24_x64-linux/
### align reads to genome using minimap2 2023-02-03 11:25:22
b''
Traceback (most recent call last):
  File "./python/sc_long_pipeline.py", line 240, in <module>
    sc_long_pipeline(args)
  File "./python/sc_long_pipeline.py", line 168, in sc_long_pipeline
    seed=config_dict["alignment_parameters"]["seed"])
  File "/home/software/FLAMES/python/minimap2_align.py", line 41, in minimap2_align
    shell=True, stderr=subprocess.STDOUT))
  File "/home/software/anaconda3/envs/py3.7/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/home/software/anaconda3/envs/py3.7/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/home/software/minimap2-2.24_x64-linux/minimap2 -ax splice -t 12 --junc-bed /home/FLAMES/tmp.splice_anno.bed12 --junc-bonus 1  -k14 --secondary=no --seed 2022 /home/reference/refdata-gex-GRCh38-2020-A/fasta/genome.fa /home/FLAMES//out.fastq | samtools view -bS -@ 4 -m 2G -o /home/FLAMES/tmp.alig
n.bam -  ']' returned non-zero exit status 1.

Here is the output statistic for match_cell_barcode, which may help:

###total read: 26292
###barcode hm match: 13191
###barcode fuzzy match: 2127
###barcode not match: 10974
###too short: 0

Should be a gzipped fastq, could you check with zcat? Or maybe post the first line of xxd out.fastq

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants