MD5SUM from picard and samtools do not match #1814

Closed
1 of 2 tasks
fgvieira opened this issue Jun 12, 2022 · 8 comments · Fixed by #1884

@fgvieira

fgvieira commented Jun 12, 2022

Bug Report

Affected tool(s)

CreateSequenceDictionary

Affected version(s)

  • Latest public release version [2.27.2]
  • Latest development/master branch as of [date of test?]

Description

When creating a dictionary with samtools dict, I get:

$ samtools dict Oegenome10scaffoldC3G.fasta
@HD     VN:1.0  SO:unsorted
@SQ     SN:scaffold1    LN:114512629    M5:f6412f880b27671e3789d5836f5803f1     UR:file:///Oegenome10scaffoldC3G.fasta
@SQ     SN:scaffold2    LN:111878775    M5:67b1df915e157c1b1486862a810eb4f6     UR:file:///Oegenome10scaffoldC3G.fasta
@SQ     SN:scaffold3    LN:97763355     M5:3d1af6b898da3f3c017bf0a6a105ccc7     UR:file:///Oegenome10scaffoldC3G.fasta
@SQ     SN:scaffold4    LN:96428841     M5:23694b6ad3ef81aa4f891e5c7bee80e2     UR:file:///Oegenome10scaffoldC3G.fasta
@SQ     SN:scaffold5    LN:96132218     M5:1a846b8a41a7d622c9af78465232c99f     UR:file:///Oegenome10scaffoldC3G.fasta
@SQ     SN:scaffold6    LN:95771753     M5:310f1513181b44d338486d331271f5aa     UR:file:///Oegenome10scaffoldC3G.fasta
@SQ     SN:scaffold7    LN:83323684     M5:08e47a0f3d60a7abcb436663fc3d52dc     UR:file:///Oegenome10scaffoldC3G.fasta
@SQ     SN:scaffold8    LN:58502465     M5:546e6a679d3e736713a3b383ab312723     UR:file:///Oegenome10scaffoldC3G.fasta
@SQ     SN:scaffold9    LN:52333144     M5:539d8ed1fcbf4a67bd26780a1ea0dc5b     UR:file:///Oegenome10scaffoldC3G.fasta
@SQ     SN:scaffold10   LN:51426335     M5:91f5884fd589e4bae7d6178b63380d0a     UR:file:///Oegenome10scaffoldC3G.fasta

Also if it is from the gzipped genome:

$ samtools dict Oegenome10scaffoldC3G.fasta.gz
@HD     VN:1.0  SO:unsorted
@SQ     SN:scaffold1    LN:114512629    M5:f6412f880b27671e3789d5836f5803f1     UR:file:///Oegenome10scaffoldC3G.fasta.gz
@SQ     SN:scaffold2    LN:111878775    M5:67b1df915e157c1b1486862a810eb4f6     UR:file:///Oegenome10scaffoldC3G.fasta.gz
@SQ     SN:scaffold3    LN:97763355     M5:3d1af6b898da3f3c017bf0a6a105ccc7     UR:file:///Oegenome10scaffoldC3G.fasta.gz
@SQ     SN:scaffold4    LN:96428841     M5:23694b6ad3ef81aa4f891e5c7bee80e2     UR:file:///Oegenome10scaffoldC3G.fasta.gz
@SQ     SN:scaffold5    LN:96132218     M5:1a846b8a41a7d622c9af78465232c99f     UR:file:///Oegenome10scaffoldC3G.fasta.gz
@SQ     SN:scaffold6    LN:95771753     M5:310f1513181b44d338486d331271f5aa     UR:file:///Oegenome10scaffoldC3G.fasta.gz
@SQ     SN:scaffold7    LN:83323684     M5:08e47a0f3d60a7abcb436663fc3d52dc     UR:file:///Oegenome10scaffoldC3G.fasta.gz
@SQ     SN:scaffold8    LN:58502465     M5:546e6a679d3e736713a3b383ab312723     UR:file:///Oegenome10scaffoldC3G.fasta.gz
@SQ     SN:scaffold9    LN:52333144     M5:539d8ed1fcbf4a67bd26780a1ea0dc5b     UR:file:///Oegenome10scaffoldC3G.fasta.gz
@SQ     SN:scaffold10   LN:51426335     M5:91f5884fd589e4bae7d6178b63380d0a     UR:file:///Oegenome10scaffoldC3G.fasta.gz

If I use picard CreateSequenceDictionary on the gzipped genome, I get the same:

$ picard CreateSequenceDictionary -R Oegenome10scaffoldC3G.fasta.gz -O gz.dict
19:35:35.294 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/conda/78615535f0c003a62a8c9be0ecbbbab4/share/picard-2.27.2-0/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Sun Jun 12 19:35:35 CEST 2022] CreateSequenceDictionary --OUTPUT gz.dict --REFERENCE Oegenome10scaffoldC3G.fasta.gz --TRUNCATE_NAMES_AT_WHITESPACE true --NUM_SEQUENCES 2147483647 --VERBOSITY INFO --QUIET \
false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false -\
-showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Sun Jun 12 19:35:35 CEST 2022] Executing as fgvieira@RH-GM on Linux 5.15.41-1-MANJARO amd64; OpenJDK 64-Bit Server VM 1.8.0_112-b16; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard\
 version: Version:2.27.2
[Sun Jun 12 19:35:44 CEST 2022] picard.sam.CreateSequenceDictionary done. Elapsed time: 0.15 minutes.
Runtime.totalMemory()=514850816

$ cat gz.dict
@HD     VN:1.6
@SQ     SN:scaffold1    LN:114512629    M5:f6412f880b27671e3789d5836f5803f1     UR:file:/Oegenome10scaffoldC3G.fasta.gz
@SQ     SN:scaffold2    LN:111878775    M5:67b1df915e157c1b1486862a810eb4f6     UR:file:/Oegenome10scaffoldC3G.fasta.gz
@SQ     SN:scaffold3    LN:97763355     M5:3d1af6b898da3f3c017bf0a6a105ccc7     UR:file:/Oegenome10scaffoldC3G.fasta.gz
@SQ     SN:scaffold4    LN:96428841     M5:23694b6ad3ef81aa4f891e5c7bee80e2     UR:file:/Oegenome10scaffoldC3G.fasta.gz
@SQ     SN:scaffold5    LN:96132218     M5:1a846b8a41a7d622c9af78465232c99f     UR:file:/Oegenome10scaffoldC3G.fasta.gz
@SQ     SN:scaffold6    LN:95771753     M5:310f1513181b44d338486d331271f5aa     UR:file:/Oegenome10scaffoldC3G.fasta.gz
@SQ     SN:scaffold7    LN:83323684     M5:08e47a0f3d60a7abcb436663fc3d52dc     UR:file:/Oegenome10scaffoldC3G.fasta.gz
@SQ     SN:scaffold8    LN:58502465     M5:546e6a679d3e736713a3b383ab312723     UR:file:/Oegenome10scaffoldC3G.fasta.gz
@SQ     SN:scaffold9    LN:52333144     M5:539d8ed1fcbf4a67bd26780a1ea0dc5b     UR:file:/Oegenome10scaffoldC3G.fasta.gz
@SQ     SN:scaffold10   LN:51426335     M5:91f5884fd589e4bae7d6178b63380d0a     UR:file:/Oegenome10scaffoldC3G.fasta.gz

But with the plain genome, the md5sum is different:

$ picard CreateSequenceDictionary -R Oegenome10scaffoldC3G.fasta -O plain.dict
19:36:20.235 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/conda/78615535f0c003a62a8c9be0ecbbbab4/share/picard-2.27.2-0/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Sun Jun 12 19:36:20 CEST 2022] CreateSequenceDictionary --OUTPUT plain.dict --REFERENCE Oegenome10scaffoldC3G.fasta --TRUNCATE_NAMES_AT_WHITESPACE true --NUM_SEQUENCES 2147483647 --VERBOSITY INFO --QUIET \
false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 5 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false -\
-showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Sun Jun 12 19:36:20 CEST 2022] Executing as fgvieira@RH-GM on Linux 5.15.41-1-MANJARO amd64; OpenJDK 64-Bit Server VM 1.8.0_112-b16; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard\
 version: Version:2.27.2
[Sun Jun 12 19:36:24 CEST 2022] picard.sam.CreateSequenceDictionary done. Elapsed time: 0.07 minutes.
Runtime.totalMemory()=514850816

$ cat plain.dict
@HD     VN:1.6
@SQ     SN:scaffold1    LN:114512629    M5:cbd6e01c9f6f65b8c2c0aca4d94ace7f     UR:file:/Oegenome10scaffoldC3G.fasta
@SQ     SN:scaffold2    LN:111878775    M5:e812ff18e081b5a63b3cb22bbce32277     UR:file:/Oegenome10scaffoldC3G.fasta
@SQ     SN:scaffold3    LN:97763355     M5:cc356e6e23e06d6fa88c98e65fc97a44     UR:file:/Oegenome10scaffoldC3G.fasta
@SQ     SN:scaffold4    LN:96428841     M5:a854288adeccc524f1a68069fa5514a8     UR:file:/Oegenome10scaffoldC3G.fasta
@SQ     SN:scaffold5    LN:96132218     M5:1f97fe64c0bcacaf08b2c7a6d2930f3a     UR:file:/Oegenome10scaffoldC3G.fasta
@SQ     SN:scaffold6    LN:95771753     M5:9b619d84e78bdc163b29e78c5903bb45     UR:file:/Oegenome10scaffoldC3G.fasta
@SQ     SN:scaffold7    LN:83323684     M5:3e2d90074e37ac9b3172ad18e06654f1     UR:file:/Oegenome10scaffoldC3G.fasta
@SQ     SN:scaffold8    LN:58502465     M5:530fa79a23418b01c16040d8ad6a5ab4     UR:file:/Oegenome10scaffoldC3G.fasta
@SQ     SN:scaffold9    LN:52333144     M5:271d029039e1f3a48319a6ccb85cdabe     UR:file:/Oegenome10scaffoldC3G.fasta
@SQ     SN:scaffold10   LN:51426335     M5:b3de28c8bcde29f4d1802a914b1c835a     UR:file:/Oegenome10scaffoldC3G.fasta

If I calculate the MD5 checksum manually for (e.g.) scaffold1, I get f6412f880b27671e3789d5836f5803f1, which matches the samtools output. So something seems to be wrong with how picard handles plain (uncompressed) genomes.
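For reference, a manual per-sequence checksum can be sketched in Python as below. This is only a minimal sketch, not the command used above (which is not shown) and not the code of samtools or picard. It assumes the M5 rule from the SAM specification: drop every byte outside the printable range 33 to 126, uppercase the rest, and take the MD5 of the result. The function name `fasta_m5` is made up for illustration.

```python
import hashlib

# Bytes that are NOT part of the sequence under the assumed M5 rule:
# everything outside printable, non-space ASCII (33..126).
# This covers "\n", "\r", spaces and tabs alike.
_STRIP = bytes(b for b in range(256) if not 33 <= b <= 126)


def fasta_m5(lines):
    """Return {sequence_name: M5 digest} for an iterable of FASTA lines (bytes)."""
    digests = {}
    name, md5 = None, None
    for line in lines:
        if line.startswith(b">"):
            if name is not None:
                digests[name] = md5.hexdigest()
            # The name is the header text up to the first whitespace.
            fields = line[1:].split()
            name = fields[0].decode("ascii") if fields else ""
            md5 = hashlib.md5()
        elif name is not None:
            # Strip non-sequence bytes, uppercase, and feed the running digest.
            md5.update(line.translate(None, _STRIP).upper())
    if name is not None:
        digests[name] = md5.hexdigest()
    return digests
```

Passing a file opened in binary mode, for example `fasta_m5(open("Oegenome10scaffoldC3G.fasta", "rb"))`, yields one digest per scaffold. Because carriage returns are stripped together with newlines, the result does not depend on the file's line endings.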

And the two genome files have identical content:

$ md5sum Oegenome10scaffoldC3G.fasta
f25ede3532b9a04f29ccb5796414184c  Oegenome10scaffoldC3G.fasta

$ zcat Oegenome10scaffoldC3G.fasta.gz | md5sum
f25ede3532b9a04f29ccb5796414184c  -

Expected behavior

I'd expect all dicts to be identical.

Actual behavior

The M5 values that picard writes for the plain FASTA differ from those in the other three dicts.

@fgvieira
Author

fgvieira commented Jun 13, 2022

Some extra info: the problem seems to be that the genome file has non-Unix line breaks. After converting them with dos2unix, everything worked fine.

But shouldn't picard deal with these automatically (just like samtools)?
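A quick way to check which line endings a FASTA file uses, and to normalise them, is sketched below. This is an illustrative stand-in rather than dos2unix itself: the helper names `count_line_endings` and `to_unix` are made up, and `to_unix` only converts CRLF pairs to LF, leaving bare carriage returns untouched.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, bare LF and bare CR terminators in a byte string."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        "lf": data.count(b"\n") - crlf,  # LF not preceded by CR
        "cr": data.count(b"\r") - crlf,  # CR not followed by LF
    }


def to_unix(data: bytes) -> bytes:
    """Convert CRLF terminators to LF (bare CR bytes are left alone)."""
    return data.replace(b"\r\n", b"\n")
```

Reading the file with `open(path, "rb").read()` keeps the original bytes intact. Text mode would translate line endings on read and hide the difference.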

@lbergelson
Member

This definitely seems like a bug we should look into.

@lbergelson lbergelson added the bug label Feb 14, 2023
@droazen
Contributor

droazen commented Mar 21, 2023

After looking at the relevant code, we believe that this issue only affects files with multi-byte line endings (e.g., Windows line endings).
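To illustrate why a two-byte terminator can matter for a checksum, here is a toy Python example. It is not picard's code, and this thread does not establish that picard behaves this way. It only shows that splitting CRLF text on the single byte `\n` leaves a `\r` on every line, and that hashing those bytes gives a different MD5 than hashing the bare residues.

```python
import hashlib

# Two sequence lines with Windows (CRLF) endings; toy data for illustration.
record = b"ACGT\r\nACGT\r\n"

# Splitting on the single byte "\n" leaves a trailing "\r" on every line.
naive = b"".join(record.split(b"\n"))

# Removing both terminator bytes gives the bare residues.
clean = record.replace(b"\r", b"").replace(b"\n", b"")

naive_md5 = hashlib.md5(naive).hexdigest()
clean_md5 = hashlib.md5(clean).hexdigest()
```

Here `naive` still contains the carriage returns while `clean` does not, so the two digests differ even though the residues are the same.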

@kachulis kachulis self-assigned this Mar 21, 2023
@kachulis
Contributor

@fgvieira I am having trouble reproducing this behavior. Are you able to share the fasta file with which you saw this issue?

@fgvieira
Author

It was a while ago, so I don't think I still have it, but you should be able to reproduce it by creating a multi-sequence FASTA file on Windows.
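A test file with Windows line endings can also be produced on any platform. The Python sketch below writes CRLF terminators explicitly in binary mode. The helper name `write_crlf_fasta`, the line width, and the toy records are all assumptions for illustration, not the original genome.

```python
import io


def write_crlf_fasta(handle, records, width=60):
    """Write (name, sequence) pairs to a binary handle as FASTA with CRLF endings."""
    for name, seq in records:
        handle.write(b">" + name.encode("ascii") + b"\r\n")
        # Wrap the sequence at a fixed width, ending every line with CRLF.
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width].encode("ascii") + b"\r\n")


# Hypothetical toy records; names and sequences are made up for illustration.
buffer = io.BytesIO()
write_crlf_fasta(buffer, [("scaffold1", "ACGT" * 5), ("scaffold2", "GGCC")], width=8)
crlf_bytes = buffer.getvalue()
```

To write to disk instead, pass a handle from `open("crlf.fasta", "wb")`. The resulting file can then be given to `samtools dict` and `picard CreateSequenceDictionary -R` as in the report above to compare the M5 fields.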

@cmnbroad
Contributor

cmnbroad commented May 9, 2023

@cmnbroad will take a look (@kachulis was unable to reproduce).

@cmnbroad
Contributor

I can't repro this either - AFAICT CreateSequenceDictionary respects the line endings correctly (I've included my tests in a PR here). Since I've already done the work, we might as well keep them.

@fgvieira
Author

Have been trying to find the original file, but to no avail.
Feel free to close it now, and I'll re-open it if I find the original file.
