Error with Salmon build: It removes identical transcript sequences #214

kvittingseerup · 2018-04-16T12:09:39Z

We just discovered that Salmon's index build removes/collapses identical transcripts. This is very problematic, as many genes are duplicated throughout the genome. When they are collapsed in the index build, one of them is arbitrarily selected (the others are removed), meaning all downstream analysis will assume all expression originates from one genomic region instead of many.

In the most recent GENCODE mouse release this problem affects 1563 sequences annotated as 13812 transcripts and covers all transcript types (incl. 840 protein coding, although the major ones are lincRNA (n=3658) and snoRNAs (n=2622)).

We strongly believe that if one wants to analyse these duplicated regions jointly, this should be done just like one would sum all transcripts from a particular gene to get the gene expression.

rob-p · 2018-04-16T12:21:31Z

Hi Kristoffer,

The duplicate transcript issue is a frustrating one. It came to our attention when we noticed that Ensembl often annotated transcripts on patch / haplotype contigs that were identical to, and unlikely to differ in any way from, more "canonical" transcripts. Further, these transcripts are indistinguishable from the quantification perspective. That being said, the removal of sequence-duplicate transcripts is optional in Salmon. If you pass --keepDuplicates to the indexer, it won't remove them. Also, Salmon does record, in the index directory, the "collapsing map". Specifically, there is a TSV file that records, for every collapsed transcript, the transcript that was sequence-identical and retained in the index. You can use this map to recover the abundances for the collapsed transcripts: since they are all sequence-identical, they should all have an abundance of x / num duplicates (where x is the abundance of the retained transcript). I hope this info helps. Let me know if there is anything else I can clarify or help with.

Best,
Rob
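To make the "x / num duplicates" recovery concrete, here is a minimal Python sketch. It rests on several assumptions that are not stated in this thread: that the collapsing map is a two-column TSV of (retained, removed duplicate) names with a header row (in recent Salmon versions this is believed to be `duplicate_clusters.tsv` in the index directory, but please check your own index), and that "num duplicates" means the retained transcript plus all of its removed copies. The function names and the toy transcript names are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_collapse_map(handle):
    """Read a two-column TSV of (retained, duplicate) names, skipping an assumed header row."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # assumption: the first line is a header
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


def expand_collapsed(abundances, collapse_rows):
    """Split each retained transcript's abundance evenly across its duplicate group.

    abundances: dict mapping transcript name -> abundance (e.g. TPM or NumReads)
    collapse_rows: iterable of (retained_name, removed_duplicate_name) pairs
    """
    groups = defaultdict(list)
    for retained, duplicate in collapse_rows:
        groups[retained].append(duplicate)

    expanded = dict(abundances)
    for retained, duplicates in groups.items():
        if retained not in abundances:
            continue  # retained transcript absent from this quantification
        # Assumption: the group size is the retained copy plus every removed copy.
        share = abundances[retained] / (len(duplicates) + 1)
        expanded[retained] = share
        for name in duplicates:
            expanded[name] = share
    return expanded


# Toy example with made-up transcript names and values.
toy_map = io.StringIO("RetainedRef\tDuplicateRef\ntxA\ttxB\ntxA\ttxC\n")
toy_quant = {"txA": 30.0, "txD": 5.0}
result = expand_collapsed(toy_quant, read_collapse_map(toy_map))
print(result)
```

The total abundance is preserved by construction, so summing the recovered values over a duplicate group gives back the value originally reported for the retained transcript.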

kvittingseerup · 2018-04-16T12:36:28Z

That is frustrating. But I have to agree the haplotype problem is probably the larger one of the two...

rob-p · 2018-04-16T12:38:57Z

Yea. Both are frustrating, which is why we spam warning messages to the console when we remove duplicates. Sorry if this default behavior caused you any trouble, but hopefully it's easy to recover these quants without rerunning anything, using the map of collapsed transcripts.

kvittingseerup · 2018-04-16T14:50:23Z

Actually it's fairly easy with GENCODE annotation as they do not have haplotypes in the general annotation - so we can just use the --keepDuplicates :-)

rbenel · 2018-08-05T11:02:07Z

Hi,
I am writing here because I think this issue is relevant to both @rob-p and @kvittingseerup. I ran my salmon analysis twice with the most recent GENCODE annotation https://www.gencodegenes.org/releases/current.html -> PRI: once with the --keepDuplicates option in the indexing and once without (because I read this post late..).
When loading the data into IsoformSwitchAnalyzeR the first time (w/o --keepDuplicates), I received the following warning message: "The annotation (count matrix and isoform annotation) contain differences in which isoforms are analyzed... 875 more isoforms than the count matrix...". Following the run with --keepDuplicates, I now receive "67 more isoforms than the count matrix". If I am using the --keepDuplicates option, what exactly are these 67 isoforms?

rob-p · 2018-08-05T13:06:22Z

Hi @rbenel,

Can you post here the output of salmon's indexing phase? Does it mention discarding anything? Presumably, we can just do a set difference on the isoform sets to see what's happening.

--Rob
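The set difference suggested above could be sketched in Python roughly as follows. This assumes a tab-separated quant.sf with a `Name` column, and that names may carry GENCODE-style pipe-delimited suffixes that should be trimmed to the leading transcript ID; the toy IDs and the `annotation_ids` set are hypothetical stand-ins for whatever the annotation actually contains.

```python
import io


def quant_names(handle):
    """Return the set of transcript IDs found in the Name column of a quant.sf-style table."""
    header = handle.readline().rstrip("\n").split("\t")
    name_col = header.index("Name")
    names = set()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > name_col and fields[name_col]:
            # Keep only the text before the first '|' (the transcript ID in GENCODE headers).
            names.add(fields[name_col].split("|")[0])
    return names


# Toy quant.sf content with made-up values.
toy_quant = io.StringIO(
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx1|gene1|\t1000\t800.0\t1.5\t10\n"
    "tx2|gene2|\t500\t300.0\t0.0\t0\n"
)
annotation_ids = {"tx1", "tx2", "tx3"}  # hypothetical IDs taken from the annotation

quantified = quant_names(toy_quant)
only_in_annotation = annotation_ids - quantified
only_in_quant = quantified - annotation_ids
print(sorted(only_in_annotation), sorted(only_in_quant))
```

Looking at both directions of the difference helps distinguish transcripts missing from the quantification from IDs that simply fail to match because of naming differences.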

rbenel · 2018-08-06T06:22:53Z

Hi @rob-p,
Sure, I am posting the output of the indexing phase with the --keepDuplicates option.

[Step 1 of 4] : counting k-mers
counted k-mers for 40000 transcripts[2018-08-02 16:23:28.827] [jointLog] [warning] Entry with header [ENST00000473810.1|ENSG00000239255.1|OTTHUMG00000157482.1|OTTHUMT00000348942.1|RP11-145M9.2-001|RP11-145M9.2|25|processed_pseudogene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:28.909] [jointLog] [warning] Entry with header [ENST00000603775.1|ENSG00000271544.1|OTTHUMG00000184300.1|OTTHUMT00000468575.1|AC006499.9-001|AC006499.9|23|processed_pseudogene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
counted k-mers for 80000 transcripts[2018-08-02 16:23:29.870] [jointLog] [warning] Entry with header [ENST00000632684.1|ENSG00000282431.1|OTTHUMG00000190602.2|OTTHUMT00000485301.2|RP11-520H11.10-001|TRBD1|12|TR_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
counted k-mers for 120000 transcripts[2018-08-02 16:23:31.098] [jointLog] [warning] Entry with header [ENST00000626826.1|ENSG00000281344.1|OTTHUMG00000189570.1|OTTHUMT00000479989.1|RP11-210L7.2-001|HELLPAR|205012|macro_lncRNA|] was longer than 200000 nucleotides.  Are you certain that we are indexing a transcriptome and not a genome?
[2018-08-02 16:23:31.151] [jointLog] [warning] Entry with header [ENST00000543745.1|ENSG00000255972.1|OTTHUMG00000168883.1|OTTHUMT00000401485.1|RP11-324E6.8-001|RP11-324E6.8|28|processed_pseudogene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
counted k-mers for 130000 transcripts[2018-08-02 16:23:31.291] [jointLog] [warning] Entry with header [ENST00000415118.1|ENSG00000223997.1|OTTHUMG00000170844.2|OTTHUMT00000410670.2|AE000661.52-001|TRDD1|8|TR_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.291] [jointLog] [warning] Entry with header [ENST00000434970.2|ENSG00000237235.2|OTTHUMG00000170845.2|OTTHUMT00000410671.2|AE000661.53-001|TRDD2|9|TR_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.291] [jointLog] [warning] Entry with header [ENST00000448914.1|ENSG00000228985.1|OTTHUMG00000170846.2|OTTHUMT00000410672.2|AE000661.54-001|TRDD3|13|TR_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000439842.1|ENSG00000236597.1|OTTHUMG00000152435.2|OTTHUMT00000326213.2|AL122127.38-001|IGHD7-27|11|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000390567.1|ENSG00000211907.1|OTTHUMG00000152429.2|OTTHUMT00000326207.2|AL122127.37-001|IGHD1-26|20|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000452198.1|ENSG00000225825.1|OTTHUMG00000152436.2|OTTHUMT00000326214.2|AL122127.36-001|IGHD6-25|18|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000390569.1|ENSG00000211909.1|OTTHUMG00000152427.2|OTTHUMT00000326205.2|AL122127.35-001|IGHD5-24|20|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000437320.1|ENSG00000227196.1|OTTHUMG00000152438.2|OTTHUMT00000326216.2|AL122127.34-001|IGHD4-23|19|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000390572.1|ENSG00000211912.1|OTTHUMG00000152428.2|OTTHUMT00000326206.2|AL122127.32-001|IGHD2-21|28|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000450276.1|ENSG00000237020.1|OTTHUMG00000152432.2|OTTHUMT00000326210.2|AL122127.31-001|IGHD1-20|17|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000390574.1|ENSG00000211914.1|OTTHUMG00000152431.2|OTTHUMT00000326209.2|AL122127.30-001|IGHD6-19|21|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000390575.1|ENSG00000211915.1|OTTHUMG00000152433.2|OTTHUMT00000326211.2|AL122127.29-001|IGHD5-18|20|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000431870.1|ENSG00000227800.1|OTTHUMG00000152437.2|OTTHUMT00000326215.2|AL122127.28-001|IGHD4-17|16|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000451044.1|ENSG00000227108.1|OTTHUMG00000152369.2|OTTHUMT00000326003.2|AB019441.47-001|IGHD1-14|17|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000390580.1|ENSG00000211920.1|OTTHUMG00000152370.2|OTTHUMT00000326004.2|AB019441.46-001|IGHD6-13|21|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000390581.1|ENSG00000211921.1|OTTHUMG00000152367.2|OTTHUMT00000326001.2|AB019441.45-001|IGHD5-12|23|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000431440.2|ENSG00000232543.2|OTTHUMG00000152368.2|OTTHUMT00000326002.2|AB019441.44-001|IGHD4-11|16|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000430425.1|ENSG00000237197.1|OTTHUMG00000152357.2|OTTHUMT00000325963.2|AB019441.40-001|IGHD1-7|17|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000454691.1|ENSG00000228131.1|OTTHUMG00000152353.2|OTTHUMT00000325959.2|AB019441.39-001|IGHD6-6|18|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000390588.1|ENSG00000211928.1|OTTHUMG00000152360.2|OTTHUMT00000325966.2|AB019441.38-001|IGHD5-5|20|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000414852.1|ENSG00000233655.1|OTTHUMG00000152355.2|OTTHUMT00000325961.2|AB019441.37-001|IGHD4-4|16|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.545] [jointLog] [warning] Entry with header [ENST00000454908.1|ENSG00000236170.1|OTTHUMG00000152359.2|OTTHUMT00000325965.2|AB019441.34-001|IGHD1-1|17|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.546] [jointLog] [warning] Entry with header [ENST00000518246.1|ENSG00000254045.1|OTTHUMG00000152060.1|OTTHUMT00000325154.1|AB019439.71-001|IGHVIII-22-2|28|IG_V_pseudogene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.547] [jointLog] [warning] Entry with header [ENST00000604642.1|ENSG00000270961.1|OTTHUMG00000184622.2|OTTHUMT00000468982.2|RP11-1360M22.8-001|IGHD5OR15-5A|23|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.547] [jointLog] [warning] Entry with header [ENST00000603326.1|ENSG00000271317.1|OTTHUMG00000184621.3|OTTHUMT00000468981.3|RP11-1360M22.7-001|IGHD4OR15-4A|19|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.547] [jointLog] [warning] Entry with header [ENST00000605284.1|ENSG00000271336.1|OTTHUMG00000184580.2|OTTHUMT00000468908.2|RP11-1360M22.3-001|IGHD1OR15-1A|17|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.549] [jointLog] [warning] Entry with header [ENST00000604446.1|ENSG00000270824.1|OTTHUMG00000184624.2|OTTHUMT00000468984.2|RP11-810K23.15-001|IGHD5OR15-5B|23|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.549] [jointLog] [warning] Entry with header [ENST00000603693.1|ENSG00000270451.1|OTTHUMG00000184611.3|OTTHUMT00000468945.3|RP11-810K23.14-001|IGHD4OR15-4B|19|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-02 16:23:31.549] [jointLog] [warning] Entry with header [ENST00000604838.1|ENSG00000270185.1|OTTHUMG00000184585.2|OTTHUMT00000468915.2|RP11-1360M22.4-001|IGHD1OR15-1B|17|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
counted k-mers for 150000 transcripts[2018-08-02 16:23:32.097] [jointLog] [warning] Entry with header [ENST00000579054.1|ENSG00000266416.1|OTTHUMG00000179204.1|OTTHUMT00000445280.1|RP1-66C13.2-001|RP1-66C13.2|28|processed_pseudogene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
counted k-mers for 170000 transcripts[2018-08-02 16:23:32.554] [jointLog] [warning] Entry with header [ENST00000634174.1|ENSG00000282732.1|OTTHUMG00000191398.1|OTTHUMT00000487783.1|RP11-157B13.10-001|RP11-157B13.10|28|unprocessed_pseudogene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
counted k-mers for 200000 transcriptsElapsed time: 5.76935s

[2018-08-02 16:23:33.248] [jointLog] [warning] There were 808 transcripts that would need to be removed to avoid duplicates.
Replaced 4 non-ATCG nucleotides
Clipped poly-A tails from 1586 transcripts
Building rank-select dictionary and saving to disk done
Elapsed time: 0.0169059s
Writing sequence data to file . . . done
Elapsed time: 0.13359s
[info] Building 32-bit suffix array (length of generalized text is 309778559)
Building suffix array . . . success
saving to disk . . . done
Elapsed time: 6.96499s
done
Elapsed time: 33.5821s
processed 309000000 positions
khash had 130317526 keys
saving hash to disk . . . done
Elapsed time: 34.8185s
[2018-08-02 16:26:58.153] [jLog] [info] done building index

I reproduced the warnings from the initial run w/o the --keepDuplicates argument.

[Step 1 of 4] : counting k-mers
[2018-08-06 09:29:02.061] [jointLog] [warning] Entry with header [ENST00000473810.1|ENSG00000239255.1|OTTHUMG00000157482.1|OTTHUMT00000348942.1|RP11-145M9.2-001|RP11-145M9.2|25|processed_pseudogene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:02.143] [jointLog] [warning] Entry with header [ENST00000603775.1|ENSG00000271544.1|OTTHUMG00000184300.1|OTTHUMT00000468575.1|AC006499.9-001|AC006499.9|23|processed_pseudogene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:03.084] [jointLog] [warning] Entry with header [ENST00000632684.1|ENSG00000282431.1|OTTHUMG00000190602.2|OTTHUMT00000485301.2|RP11-520H11.10-001|TRBD1|12|TR_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.306] [jointLog] [warning] Entry with header [ENST00000626826.1|ENSG00000281344.1|OTTHUMG00000189570.1|OTTHUMT00000479989.1|RP11-210L7.2-001|HELLPAR|205012|macro_lncRNA|] was longer than 200000 nucleotides.  Are you certain that we are indexing a transcriptome and not a genome?
[2018-08-06 09:29:04.359] [jointLog] [warning] Entry with header [ENST00000543745.1|ENSG00000255972.1|OTTHUMG00000168883.1|OTTHUMT00000401485.1|RP11-324E6.8-001|RP11-324E6.8|28|processed_pseudogene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.496] [jointLog] [warning] Entry with header [ENST00000415118.1|ENSG00000223997.1|OTTHUMG00000170844.2|OTTHUMT00000410670.2|AE000661.52-001|TRDD1|8|TR_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.496] [jointLog] [warning] Entry with header [ENST00000434970.2|ENSG00000237235.2|OTTHUMG00000170845.2|OTTHUMT00000410671.2|AE000661.53-001|TRDD2|9|TR_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.496] [jointLog] [warning] Entry with header [ENST00000448914.1|ENSG00000228985.1|OTTHUMG00000170846.2|OTTHUMT00000410672.2|AE000661.54-001|TRDD3|13|TR_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.748] [jointLog] [warning] Entry with header [ENST00000439842.1|ENSG00000236597.1|OTTHUMG00000152435.2|OTTHUMT00000326213.2|AL122127.38-001|IGHD7-27|11|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.748] [jointLog] [warning] Entry with header [ENST00000390567.1|ENSG00000211907.1|OTTHUMG00000152429.2|OTTHUMT00000326207.2|AL122127.37-001|IGHD1-26|20|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.748] [jointLog] [warning] Entry with header [ENST00000452198.1|ENSG00000225825.1|OTTHUMG00000152436.2|OTTHUMT00000326214.2|AL122127.36-001|IGHD6-25|18|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.748] [jointLog] [warning] Entry with header [ENST00000390569.1|ENSG00000211909.1|OTTHUMG00000152427.2|OTTHUMT00000326205.2|AL122127.35-001|IGHD5-24|20|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.748] [jointLog] [warning] Entry with header [ENST00000437320.1|ENSG00000227196.1|OTTHUMG00000152438.2|OTTHUMT00000326216.2|AL122127.34-001|IGHD4-23|19|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.748] [jointLog] [warning] Entry with header [ENST00000390572.1|ENSG00000211912.1|OTTHUMG00000152428.2|OTTHUMT00000326206.2|AL122127.32-001|IGHD2-21|28|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000450276.1|ENSG00000237020.1|OTTHUMG00000152432.2|OTTHUMT00000326210.2|AL122127.31-001|IGHD1-20|17|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000390574.1|ENSG00000211914.1|OTTHUMG00000152431.2|OTTHUMT00000326209.2|AL122127.30-001|IGHD6-19|21|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000390575.1|ENSG00000211915.1|OTTHUMG00000152433.2|OTTHUMT00000326211.2|AL122127.29-001|IGHD5-18|20|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000431870.1|ENSG00000227800.1|OTTHUMG00000152437.2|OTTHUMT00000326215.2|AL122127.28-001|IGHD4-17|16|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000451044.1|ENSG00000227108.1|OTTHUMG00000152369.2|OTTHUMT00000326003.2|AB019441.47-001|IGHD1-14|17|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000390580.1|ENSG00000211920.1|OTTHUMG00000152370.2|OTTHUMT00000326004.2|AB019441.46-001|IGHD6-13|21|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000390581.1|ENSG00000211921.1|OTTHUMG00000152367.2|OTTHUMT00000326001.2|AB019441.45-001|IGHD5-12|23|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000431440.2|ENSG00000232543.2|OTTHUMG00000152368.2|OTTHUMT00000326002.2|AB019441.44-001|IGHD4-11|16|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000430425.1|ENSG00000237197.1|OTTHUMG00000152357.2|OTTHUMT00000325963.2|AB019441.40-001|IGHD1-7|17|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000454691.1|ENSG00000228131.1|OTTHUMG00000152353.2|OTTHUMT00000325959.2|AB019441.39-001|IGHD6-6|18|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000390588.1|ENSG00000211928.1|OTTHUMG00000152360.2|OTTHUMT00000325966.2|AB019441.38-001|IGHD5-5|20|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000414852.1|ENSG00000233655.1|OTTHUMG00000152355.2|OTTHUMT00000325961.2|AB019441.37-001|IGHD4-4|16|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000454908.1|ENSG00000236170.1|OTTHUMG00000152359.2|OTTHUMT00000325965.2|AB019441.34-001|IGHD1-1|17|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.749] [jointLog] [warning] Entry with header [ENST00000518246.1|ENSG00000254045.1|OTTHUMG00000152060.1|OTTHUMT00000325154.1|AB019439.71-001|IGHVIII-22-2|28|IG_V_pseudogene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.750] [jointLog] [warning] Entry with header [ENST00000604642.1|ENSG00000270961.1|OTTHUMG00000184622.2|OTTHUMT00000468982.2|RP11-1360M22.8-001|IGHD5OR15-5A|23|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.750] [jointLog] [warning] Entry with header [ENST00000603326.1|ENSG00000271317.1|OTTHUMG00000184621.3|OTTHUMT00000468981.3|RP11-1360M22.7-001|IGHD4OR15-4A|19|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.750] [jointLog] [warning] Entry with header [ENST00000605284.1|ENSG00000271336.1|OTTHUMG00000184580.2|OTTHUMT00000468908.2|RP11-1360M22.3-001|IGHD1OR15-1A|17|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.752] [jointLog] [warning] Entry with header [ENST00000604446.1|ENSG00000270824.1|OTTHUMG00000184624.2|OTTHUMT00000468984.2|RP11-810K23.15-001|IGHD5OR15-5B|23|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.752] [jointLog] [warning] Entry with header [ENST00000603693.1|ENSG00000270451.1|OTTHUMG00000184611.3|OTTHUMT00000468945.3|RP11-810K23.14-001|IGHD4OR15-4B|19|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:04.752] [jointLog] [warning] Entry with header [ENST00000604838.1|ENSG00000270185.1|OTTHUMG00000184585.2|OTTHUMT00000468915.2|RP11-1360M22.4-001|IGHD1OR15-1B|17|IG_D_gene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:05.304] [jointLog] [warning] Entry with header [ENST00000579054.1|ENSG00000266416.1|OTTHUMG00000179204.1|OTTHUMT00000445280.1|RP1-66C13.2-001|RP1-66C13.2|28|processed_pseudogene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
[2018-08-06 09:29:05.761] [jointLog] [warning] Entry with header [ENST00000634174.1|ENSG00000282732.1|OTTHUMG00000191398.1|OTTHUMT00000487783.1|RP11-157B13.10-001|RP11-157B13.10|28|unprocessed_pseudogene|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)
Elapsed time: 5.65811s

[2018-08-06 09:29:06.451] [jointLog] [warning] Removed 808 transcripts that were sequence duplicates of indexed transcripts.
[2018-08-06 09:29:06.451] [jointLog] [warning] If you wish to retain duplicate transcripts, please use the `--keepDuplicates` flag
Replaced 4 non-ATCG nucleotides
Clipped poly-A tails from 1586 transcripts
Building rank-select dictionary and saving to disk done
Elapsed time: 0.0178594s
Writing sequence data to file . . . done
Elapsed time: 0.702003s
[info] Building 32-bit suffix array (length of generalized text is 308972089)
Building suffix array . . . success
saving to disk . . . done
Elapsed time: 8.62493s
done
Elapsed time: 35.9517s
processed 308000000 positions
khash had 130317526 keys
saving hash to disk . . . done
Elapsed time: 29.414s
[2018-08-06 09:34:12.370] [jLog] [info] done building index
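Both logs report the same figure of 808 duplicate transcripts, so one way to sanity-check it independently is to group the input FASTA records by exact sequence. The Python sketch below does that under a simplifying assumption: it compares upper-cased sequences exactly as they appear in the file, whereas Salmon's own count may be taken after steps such as poly-A clipping, so the two numbers need not agree exactly. Function names and the toy records are hypothetical.

```python
import io
from collections import defaultdict


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def duplicate_groups(records):
    """Group headers by identical upper-cased sequence; keep groups with more than one member."""
    by_sequence = defaultdict(list)
    for header, sequence in records:
        by_sequence[sequence.upper()].append(header)
    return [names for names in by_sequence.values() if len(names) > 1]


# Toy FASTA with made-up records: tx1/tx2 and tx3/tx4 are sequence-identical pairs.
toy_fasta = io.StringIO(
    ">tx1\nACGT\nACGT\n>tx2\nacgtACGT\n>tx3\nGGGG\n>tx4\nGGGG\n>tx5\nTTTT\n"
)
groups = duplicate_groups(read_fasta(toy_fasta))
# Number of records that would have to go to leave one copy per group.
removable = sum(len(names) - 1 for names in groups)
print(groups, removable)
```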

kvittingseerup · 2018-08-06T13:28:45Z

Could you also specify exactly which of the GENCODE files you are using?

rbenel · 2018-08-06T13:32:59Z

Yes, it is in the previous post.. https://www.gencodegenes.org/releases/current.html -> PRI.

rob-p · 2018-08-06T13:40:39Z

Could you post one of your output quant.sf files? I can investigate.

rbenel · 2018-08-06T14:02:32Z

Hi,

Here is a link to Dropbox: https://www.dropbox.com/s/herbw9te1g9sgv2/quant.sf?dl=0

rob-p · 2018-08-06T14:22:10Z

Hi @rbenel,

This is quite interesting. So I downloaded both the Gencode transcriptome (all transcript sequences) and the annotation you point out (PRI --- comprehensive gene annotation). There are a few transcripts present in the latter but not the former:

-ENST00000618686.1
-ENST00000613230.1
-ENST00000400754.4
-ENST00000618679.1
-ENST00000612465.1
-ENST00000611619.1
-ENST00000620032.1
-ENST00000621382.1
-ENST00000616049.4
-ENST00000616157.1
-ENST00000616468.1
-ENST00000611062.1
-ENST00000612565.1
-ENST00000612919.1
-ENST00000619317.1
-ENST00000611446.1
-ENST00000614535.1
-ENST00000619779.1
-ENST00000621409.1
-ENST00000611690.1
-ENST00000620265.1
-ENST00000614336.4
-ENST00000612640.4
-ENST00000612721.4
-ENST00000616361.1
-ENST00000619109.1
-ENST00000618083.1
-ENST00000612315.1
-ENST00000601199.2
-ENST00000612848.1
-ENST00000612801.1
-ENST00000617089.1
-ENST00000614351.1
-ENST00000619729.1
-ENST00000618003.1
-ENST00000615005.1
-ENST00000516246.2
-ENST00000621137.1
-ENST00000614604.4
-ENST00000620810.1
-ENST00000613373.1
-ENST00000612882.1
-ENST00000622674.1
-ENST00000616048.1
-ENST00000616638.1
-ENST00000618201.1
-ENST00000621028.1
-ENST00000619806.1
-ENST00000611339.1
-ENST00000613216.4
-ENST00000619130.1
-ENST00000612243.1
-ENST00000614110.1
-ENST00000611746.1
-ENST00000619792.1
-ENST00000620795.1
-ENST00000618675.1
-ENST00000616292.1
-ENST00000615130.1
-ENST00000618998.1
-ENST00000615362.1
-ENST00000617983.1
-ENST00000613204.1
-ENST00000615165.1
-ENST00000621424.4
-ENST00000616830.1
-ENST00000612925.1

Specifically, these are not dropped by salmon. They are not in the input reference transcriptome file. So it looks like Gencode includes these in the GTF, but not in the transcriptome fasta. I looked at the first few, and nothing immediately jumps out as to why Gencode would have dropped them from the fasta file. Do these transcript names have any special significance to you?

If you really want to include them, one option would be to use the GTF + the genome, and a tool like gffread to extract the transcriptome sequences from the genome and annotation. However, I might first try to investigate what these transcripts are, and if they are something that you want to quantify / consider.
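A comparison like the one described above (transcripts in the GTF but not in the transcriptome FASTA) could be reproduced with a short Python sketch. It assumes the standard nine-column GTF layout with a `transcript_id "..."` attribute, and FASTA headers whose transcript ID is the text before the first `|` or whitespace; the toy records and function names are hypothetical.

```python
import io
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def gtf_transcript_ids(handle):
    """Collect transcript_id values from the 'transcript' rows of a GTF."""
    ids = set()
    for line in handle:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            ids.add(match.group(1))
    return ids


def fasta_transcript_ids(handle):
    """Collect the leading ID of each FASTA header (text before the first '|' or whitespace)."""
    return {
        re.split(r"[|\s]", line[1:].strip(), maxsplit=1)[0]
        for line in handle
        if line.startswith(">")
    }


# Toy inputs with made-up IDs: tx2 sits on a scaffold and is absent from the FASTA.
toy_gtf = io.StringIO(
    "##description: toy annotation\n"
    'chr1\tHAVANA\ttranscript\t1\t100\t.\t+\t.\tgene_id "g1"; transcript_id "tx1";\n'
    'chr1\tHAVANA\texon\t1\t100\t.\t+\t.\tgene_id "g1"; transcript_id "tx1";\n'
    'GL000009.2\tENSEMBL\ttranscript\t5\t90\t.\t-\t.\tgene_id "g2"; transcript_id "tx2";\n'
)
toy_fasta = io.StringIO(">tx1|g1|\nACGT\n")

missing_from_fasta = gtf_transcript_ids(toy_gtf) - fasta_transcript_ids(toy_fasta)
print(sorted(missing_from_fasta))
```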

kvittingseerup · 2018-08-07T09:21:15Z

GENCODE provides one FASTA file called "Transcript sequences", which "only" contains the "CHR" (chromosomal) regions.

GENCODE provides many GTF files (specifically 9). The GTF file corresponding to the FASTA file mentioned above is the "Comprehensive gene annotation" from the "CHR" regions (aka chromosomal regions), which is the first on their list.

You have downloaded the "Pri" (third entry), which covers the normal chromosomes (Chr) as well as scaffolds, which explains the 67 extra transcripts. Specifically, the scaffolds included in "Pri" but not in "Chr" are:

"GL000009.2" "GL000194.1" "GL000195.1" "GL000205.2" "GL000213.1"
"GL000216.2" "GL000218.1" "GL000219.1" "GL000220.1" "GL000225.1"
"KI270442.1" "KI270711.1" "KI270713.1" "KI270721.1" "KI270726.1"
"KI270727.1" "KI270728.1" "KI270731.1" "KI270733.1" "KI270734.1"
"KI270744.1" "KI270750.1"

So the solution is as @rob-p suggested:

Use gffread to make your own fasta file
Remove those extra transcripts (or use the "Chr" GTF file instead)
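For the second option, removing the scaffold-only records from the "Pri" GTF could look something like this Python sketch. The scaffold names are copied from the list above; the function name and the toy GTF lines are hypothetical, and the only assumption about the file is that the first tab-separated column holds the sequence name.

```python
import io

# Scaffold names from the list above (present in "Pri" but not in "Chr").
SCAFFOLDS = {
    "GL000009.2", "GL000194.1", "GL000195.1", "GL000205.2", "GL000213.1",
    "GL000216.2", "GL000218.1", "GL000219.1", "GL000220.1", "GL000225.1",
    "KI270442.1", "KI270711.1", "KI270713.1", "KI270721.1", "KI270726.1",
    "KI270727.1", "KI270728.1", "KI270731.1", "KI270733.1", "KI270734.1",
    "KI270744.1", "KI270750.1",
}


def drop_scaffold_records(lines, excluded=SCAFFOLDS):
    """Yield GTF lines whose first column (sequence name) is not in the excluded set."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep header/comment lines untouched
            continue
        seqname = line.split("\t", 1)[0]
        if seqname not in excluded:
            yield line


# Toy GTF with made-up records: one on chr1, one on a scaffold.
toy_gtf = io.StringIO(
    "##description: toy annotation\n"
    'chr1\tHAVANA\ttranscript\t1\t100\t.\t+\t.\tgene_id "g1"; transcript_id "tx1";\n'
    'KI270711.1\tENSEMBL\ttranscript\t5\t90\t.\t-\t.\tgene_id "g2"; transcript_id "tx2";\n'
)
kept = list(drop_scaffold_records(toy_gtf))
print("".join(kept), end="")
```

Filtering on the sequence name rather than on a fixed list of transcript IDs means the same sketch keeps working if a later release adds or removes transcripts on those scaffolds.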

Cheers
Kristoffer

rbenel · 2018-08-08T08:09:17Z

Thank you both! I need to look into those transcripts, to see if anything looks important.

Tima-Ze · 2020-12-26T13:18:36Z

Hi all,
Just an update:
I also got the same warning message (the one @rbenel talks about here) when creating an index along with decoy sequences. I took @kvittingseerup's advice and made a transcripts.fa file with the gffread command. Here are my input files and commands:
All GTF and genome references were downloaded from GENCODE: GRCh38.primary_assembly.genome.fa.gz, gencode.v36.annotation.gtf (CHR) and gencode.v36.transcripts.fa.gz.
commands:
grep "^>" <(gunzip -c GRCh38.primary_assembly.genome.fa.gz) | cut -d " " -f 1 > decoys.txt
sed -i.bak -e 's/>//g' decoys.txt
cat salmon_transcripts.fa.gz GRCh38.primary_assembly.genome.fa.gz > gentrome.fa.gz
salmon index -t gentrome.fa.gz -d decoys.txt -p 12 -i salmon-decoy-sa-index --gencode
warnings:

**So using gffread I created a transcripts.fa file:
gffread -w salmon_transcripts.fa -g GRCh38.primary_assembly.genome.fa gencode.v36.annotation.gtf

Using this new transcripts.fa, I ran the above-mentioned salmon index with decoy command again, but the warning message showed up again:**

[Step 1 of 4] : counting k-mers
[2020-12-26 11:30:08.799] [puff::index::jointLog] [warning] Entry with header [ENST00000473810.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:08.951] [puff::index::jointLog] [warning] Entry with header [ENST00000603775.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:10.751] [puff::index::jointLog] [warning] Entry with header [ENST00000632684.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:12.936] [puff::index::jointLog] [warning] Entry with header [ENST00000543745.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.188] [puff::index::jointLog] [warning] Entry with header [ENST00000415118.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.188] [puff::index::jointLog] [warning] Entry with header [ENST00000434970.2], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.188] [puff::index::jointLog] [warning] Entry with header [ENST00000448914.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000439842.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390567.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000452198.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390569.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000437320.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390571.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390572.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000450276.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390574.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390575.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000431870.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390578.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000451044.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390580.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390581.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000431440.2], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390583.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390584.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390585.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000430425.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000454691.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390588.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000414852.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390590.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.483] [puff::index::jointLog] [warning] Entry with header [ENST00000390591.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.484] [puff::index::jointLog] [warning] Entry with header [ENST00000454908.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.484] [puff::index::jointLog] [warning] Entry with header [ENST00000518246.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.486] [puff::index::jointLog] [warning] Entry with header [ENST00000604642.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.486] [puff::index::jointLog] [warning] Entry with header [ENST00000603326.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.486] [puff::index::jointLog] [warning] Entry with header [ENST00000604950.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.486] [puff::index::jointLog] [warning] Entry with header [ENST00000603077.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.486] [puff::index::jointLog] [warning] Entry with header [ENST00000605284.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.489] [puff::index::jointLog] [warning] Entry with header [ENST00000604446.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.489] [puff::index::jointLog] [warning] Entry with header [ENST00000603693.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.489] [puff::index::jointLog] [warning] Entry with header [ENST00000603935.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.489] [puff::index::jointLog] [warning] Entry with header [ENST00000604102.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:13.489] [puff::index::jointLog] [warning] Entry with header [ENST00000604838.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:14.411] [puff::index::jointLog] [warning] Entry with header [ENST00000579054.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:30:15.280] [puff::index::jointLog] [warning] Entry with header [ENST00000634174.1], had length less than equal to the k-mer length of 31 (perhaps after poly-A clipping)
[2020-12-26 11:31:24.590] [puff::index::jointLog] [warning] Removed 829 transcripts that were sequence duplicates of indexed transcripts.
[2020-12-26 11:31:24.590] [puff::index::jointLog] [warning] If you wish to retain duplicate transcripts, please use the --keepDuplicates flag
[2020-12-26 11:31:24.641] [puff::index::jointLog] [info] Replaced 151,122,967 non-ATCG nucleotides
[2020-12-26 11:31:24.641] [puff::index::jointLog] [info] Clipped poly-A tails from 1,829 transcripts
wrote 231443 cleaned references
[2020-12-26 11:31:28.118] [puff::index::jointLog] [info] Filter size not provided; estimating from number of distinct k-mers
[2020-12-26 11:31:58.286] [puff::index::jointLog] [info] ntHll estimated 2628436199 distinct k-mers, setting filter size to 2^36
Threads = 12
Vertex length = 31
Hash functions = 5
Filter size = 68719476736
Capacity = 2
Files:
salmon-decoy-sa-index/ref_k31_fixed.fa

**My concern is: would this cause problems for the rest of the downstream analysis?

Thanks,
Tima**

rob-p · 2020-12-26T15:25:48Z

Hi @Tima-Ze,

This should not cause any trouble with downstream analysis. The indexing procedure is simply informing you that these transcripts (about which you are being warned) are shorter than the seed length used for alignment. This means that it simply won't be possible for fragments to align to these transcripts, and so they will always have a 0 abundance in the resulting quant.sf files. This isn't a problem, as these transcripts are too short to be measured via RNA-seq anyway. The indexing messages just let you know this in advance. You can safely ignore these warnings for your downstream analysis.
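For anyone who wants to check this on their own data, here is a minimal sketch that pulls the affected transcript IDs out of an indexing log, so they can be compared against the entries in quant.sf. It assumes the warning text matches the wording in the log pasted above; the function name `short_entries` and the `example_log` list are made up for illustration and are not part of Salmon.

```python
import re

# Pattern for the short-entry warning, copied from the wording in the log above.
# Assumption: other Salmon versions may phrase this message differently.
SHORT_ENTRY = re.compile(
    r"Entry with header \[([^\]]+)\], had length less than equal to "
    r"the k-mer length of (\d+)"
)


def short_entries(log_lines):
    """Return (transcript_id, k) pairs for every short-entry warning found."""
    found = []
    for line in log_lines:
        match = SHORT_ENTRY.search(line)
        if match:
            found.append((match.group(1), int(match.group(2))))
    return found


# Two lines taken from the log in this thread; only the first is a short-entry warning.
example_log = [
    "[2020-12-26 11:30:14.411] [puff::index::jointLog] [warning] Entry with header "
    "[ENST00000579054.1], had length less than equal to the k-mer length of 31 "
    "(perhaps after poly-A clipping)",
    "[2020-12-26 11:31:24.590] [puff::index::jointLog] [warning] Removed 829 "
    "transcripts that were sequence duplicates of indexed transcripts.",
]

print(short_entries(example_log))  # [('ENST00000579054.1', 31)]
```

To run it on a real log, pass an open file handle instead of the list, since the function only iterates over lines.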

kvittingseerup closed this as completed Apr 16, 2018

jiazhou0116 mentioned this issue Jan 31, 2024

salmon quantmerge skipped the nucleotide IDs that have multiple sequences - Metagenome dataset #910

Open

This was referenced Mar 12, 2024

nf-test quantify pseudoalignment nf-core/rnaseq#1246

Merged

Salmon --keepDuplicates by default nf-core/rnaseq#1259

Open
