sourmash.exceptions.Io: Invalid checksum #3233

Closed
musquita opened this issue Jul 1, 2024 · 12 comments

musquita commented Jul 1, 2024

Hi!

I am sequencing environmental and animal host samples to identify potential causes of infection, and I wanted to test sourmash.
Most of the time I am unable to assemble the non-host reads, so I intend to classify my long reads directly.
I downloaded the GenBank databases from here, and this is what I tried running:

gbkdb="/media/DBs/smash/genbank-2022.03/genbank-2022.03"
gbktaxdb="/media/DBs/smash/genbank-2022.03/genbank-2022.03-lineages.db"

for file in ./*.fastq; do
   sample=$(basename "$file" | sed -E 's/\.fastq$|\._notHost.fastq$|\._notHuman.fastq$//')
   sourmash gather sourmash/"${sample}.sig" "$gbkdb"-{archaea,bacteria,protozoa,fungi,viral}-k51.zip -o sourmash/"${sample}.smash.gbk.csv" --save-prefetch-csv sourmash/"${sample}.prefetch.gbk.csv" -k 51 --scaled 1000 --threshold-bp 5000 --estimate-ani-ci --no-fail-on-empty-database
   if [[ -s sourmash/"${sample}.smash.gbk.csv" ]]; then
      sourmash tax metagenome --gather-csv sourmash/"${sample}.smash.gbk.csv" --taxonomy-csv "$gbktaxdb" --output-dir sourmash --output-base "${sample}.tax.gbk.txt" -F human --rank species
   else
      echo "The genbank profile file for sample ${sample} is empty. Skipping taxonomic processing."
   fi
done

But I get:

Traceback (most recent call last):
  File "/home/cris/soft/miniconda3/envs/ont-diag/bin/sourmash", line 11, in <module>
    sys.exit(main())
  File "/home/cris/soft/miniconda3/envs/ont-diag/lib/python3.10/site-packages/sourmash/__main__.py", line 20, in main
    retval = mainmethod(args)
  File "/home/cris/soft/miniconda3/envs/ont-diag/lib/python3.10/site-packages/sourmash/cli/gather.py", line 204, in main
    return sourmash.commands.gather(args)
  File "/home/cris/soft/miniconda3/envs/ont-diag/lib/python3.10/site-packages/sourmash/commands.py", line 910, in gather
    counter = db.counter_gather(prefetch_query, args.threshold_bp)
  File "/home/cris/soft/miniconda3/envs/ont-diag/lib/python3.10/site-packages/sourmash/index/__init__.py", line 316, in counter_gather
    for result in self.prefetch(prefetch_query, threshold_bp, **kwargs):
  File "/home/cris/soft/miniconda3/envs/ont-diag/lib/python3.10/site-packages/sourmash/index/__init__.py", line 256, in prefetch
    yield from self.find(search_fn, query, **kwargs)
  File "/home/cris/soft/miniconda3/envs/ont-diag/lib/python3.10/site-packages/sourmash/index/__init__.py", line 151, in find
    for subj, location in self.signatures_with_location():
  File "/home/cris/soft/miniconda3/envs/ont-diag/lib/python3.10/site-packages/sourmash/index/__init__.py", line 85, in signatures_with_location
    for ss in self.signatures():
  File "/home/cris/soft/miniconda3/envs/ont-diag/lib/python3.10/site-packages/sourmash/index/__init__.py", line 656, in signatures
    data = self.storage.load(filename)
  File "/home/cris/soft/miniconda3/envs/ont-diag/lib/python3.10/site-packages/sourmash/sbt_storage.py", line 158, in load
    rawbuf = self._methodcall(
  File "/home/cris/soft/miniconda3/envs/ont-diag/lib/python3.10/site-packages/sourmash/utils.py", line 25, in _methodcall
    return rustcall(func, self._get_objptr(), *args)
  File "/home/cris/soft/miniconda3/envs/ont-diag/lib/python3.10/site-packages/sourmash/utils.py", line 78, in rustcall
    raise exc
sourmash.exceptions.Io: Invalid checksum

I managed to get results using the available gtdb-rs214 database, but I would really like to sort this out for GenBank.
Running the command separately for each database, I get the error for protozoa, fungi and bacteria.

Sorry for not getting this at once, pressed enter twice before :/

Contributor

ctb commented Jul 1, 2024

aieee that's not good!

can you share the shell command(s) you are running, please? thank you!

@musquita

This comment was marked as outdated.

Contributor

ctb commented Jul 1, 2024

ok, thanks!

nothing wrong with the commands, AFAICT!
I ran for i in *k51.zip; do sourmash sig summarize $i; done and that all worked. So it's fine with the version of sourmash (as expected, but I wanted to verify).

I'm wondering if maybe the downloads are corrupt?

I ran

% md5sum *-k51.zip
95270f1371bdc196b36d5e47dd4a13a4  genbank-2022.03-archaea-k51.zip
c7ef7c815a00337a7252ab49ffab3e8b  genbank-2022.03-bacteria-k51.zip
8abe5e76b484024b93a2302c6dc39e76  genbank-2022.03-fungi-k51.zip
abc0da0103c9f44ae0bbe460b1e475f1  genbank-2022.03-protozoa-k51.zip
bd7b29e8beb1518a38075b28b678c395  genbank-2022.03-viral-k51.zip

can you verify those results? Your md5sum should be the same.

(You can also try running unzip -v $i | head for each file; that will tell you if they are corrupted zip files.)
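
A minimal sketch of that comparison, assuming GNU `md5sum` and Info-ZIP `unzip` are installed. The reference checksums are the ones listed above; the function names (`write_reference_md5`, `check_downloads`, `test_zip`) are made up for illustration and are not part of sourmash. As far as I know, `unzip -v` only lists the archive's table of contents, so `test_zip` uses `unzip -t` instead, which reads every member and checks its CRC-32.

```shell
# write_reference_md5 LIST: write the reference checksums from this thread to LIST.
write_reference_md5() {
    cat > "$1" <<'EOF'
95270f1371bdc196b36d5e47dd4a13a4  genbank-2022.03-archaea-k51.zip
c7ef7c815a00337a7252ab49ffab3e8b  genbank-2022.03-bacteria-k51.zip
8abe5e76b484024b93a2302c6dc39e76  genbank-2022.03-fungi-k51.zip
abc0da0103c9f44ae0bbe460b1e475f1  genbank-2022.03-protozoa-k51.zip
bd7b29e8beb1518a38075b28b678c395  genbank-2022.03-viral-k51.zip
EOF
}

# check_downloads LIST DIR: verify the files in DIR against the checksum list.
# Prints one OK/FAILED line per file; returns non-zero if any file fails
# or is missing.
check_downloads() {
    local list dir="$2"
    # Make the list path absolute before changing directory.
    list="$(cd "$(dirname "$1")" && pwd)/$(basename "$1")"
    (cd "$dir" && md5sum -c "$list")
}

# test_zip FILE: read every archive member and check its CRC-32.
# This is slow for the large archives, but it catches damage inside the file.
test_zip() {
    unzip -tq "$1"
}

# Example (paths are the ones from the original post):
#   write_reference_md5 /tmp/genbank-k51.md5
#   check_downloads /tmp/genbank-k51.md5 /media/DBs/smash/genbank-2022.03
```

A file that fails here should be downloaded again before it is used with gather.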

Author

musquita commented Jul 1, 2024

My md5sums for fungi and bacteria do not match yours:

95270f1371bdc196b36d5e47dd4a13a4  genbank-2022.03-archaea-k51.zip
a0a15206e08717f864156cb233862898  genbank-2022.03-bacteria-k51.zip
502d9795ddb93c00610972210ef12e8d  genbank-2022.03-fungi-k51.zip
abc0da0103c9f44ae0bbe460b1e475f1  genbank-2022.03-protozoa-k51.zip
bd7b29e8beb1518a38075b28b678c395  genbank-2022.03-viral-k51.zip

Running your other commands for just these two, I get:

== This is sourmash version 4.8.9. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'genbank-2022.03-bacteria-k51.zip'
path filetype: ZipFileLinearIndex
location: /media/DBs/smash/genbank-2022.03/teste/genbank-2022.03-bacteria-k51.zip
is database? yes
has manifest? yes
num signatures: 1148010
** examining manifest...
total hashes: 4657991678
summary of sketches:
   1148010 sketches with DNA, k=51, scaled=1000, abund 4657991678 total hashes

== This is sourmash version 4.8.9. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'genbank-2022.03-fungi-k51.zip'
path filetype: ZipFileLinearIndex
location: /media/DBs/smash/genbank-2022.03/teste/genbank-2022.03-fungi-k51.zip
is database? yes
has manifest? yes
num signatures: 10285
** examining manifest...
total hashes: 343918794
summary of sketches:
   10285 sketches with DNA, k=51, scaled=1000, abund  343918794 total hashes

Archive:  genbank-2022.03-bacteria-k51.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
   21207  Stored    21207   0% 2022-03-29 08:19 15914876  signatures/46bd9dca15ad7b06b85e8c6aa1dfc5bf.sig.gz
   56312  Stored    56312   0% 2022-03-29 08:19 992a43ed  signatures/1eb5a47f92db27c870944837cfe56186.sig.gz
   15472  Stored    15472   0% 2022-03-29 08:19 95afa1fe  signatures/f0bef8a28010a54c50d02b884cab1e8d.sig.gz
   39033  Stored    39033   0% 2022-03-29 08:19 f1489e45  signatures/84aad26da19a645b20a8adfc31863e37.sig.gz
   40071  Stored    40071   0% 2022-03-29 08:19 a461bbe3  signatures/3df28795f09062da9901160bbc86cb06.sig.gz
   37123  Stored    37123   0% 2022-03-29 08:19 e321a654  signatures/c4f2d6791ab63a8a4a4991b6591e2f4f.sig.gz
   44645  Stored    44645   0% 2022-03-29 08:19 a35000d3  signatures/a34c7043ea61d9eb6c2891b3797c27e2.sig.gz

Archive:  genbank-2022.03-fungi-k51.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
  388785  Stored   388785   0% 2022-03-29 07:02 51ba631f  signatures/057948456438cd0aa12bf61e40b8039b.sig.gz
  435298  Stored   435298   0% 2022-03-29 07:02 152340bf  signatures/abea28849bb6414d5c9b1bd28fcf6b37.sig.gz
  436168  Stored   436168   0% 2022-03-29 07:02 d13ab8c5  signatures/fb947834a1bb7d894d8f00f1ae3062b3.sig.gz
  286460  Stored   286460   0% 2022-03-29 07:02 199cce7a  signatures/482a29a85175f85e6605d21f90f5d542.sig.gz
  435411  Stored   435411   0% 2022-03-29 07:02 818e574e  signatures/8686182d8684a5bf3757f7527e687151.sig.gz
  433058  Stored   433058   0% 2022-03-29 07:02 49161ad7  signatures/00393cf1f654cbe641f2d5cf433ed229.sig.gz
  399624  Stored   399624   0% 2022-03-29 07:02 6ed18404  signatures/703e51a45d1f2218736c964c46492319.sig.gz

Contributor

ctb commented Jul 1, 2024

but your md5sum for protozoa match? Weird...

This is all consistent with your .zip files having some small amount of corruption; I would suggest just re-downloading them and trying again. Sorry for the hassle!

Two other notes:

first, we are about to release a new updated GenBank! I'll try to remember to mention it here when we do!

second, we have much faster versions of gather now available in the branchwater plugin. fastgather in particular seems like it might be a good investment for you - you should be able to just replace your sourmash gather command with sourmash scripts fastgather.

You'll need to install with something like mamba install -y sourmash_plugin_branchwater.

then, the revised command that should work:

   sourmash scripts fastgather sourmash/"${sample}.sig" "$gbkdb"-{archaea,bacteria,protozoa,fungi,viral}-k51.zip -o sourmash/"${sample}.smash.gbk.csv" --output-prefetch sourmash/"${sample}.prefetch.gbk.csv" -k 51 --scaled 1000 --threshold-bp 5000 

if you get a chance to try it out (maybe on a small .zip file first?) and it doesn't work, please let me know!

Contributor

ctb commented Jul 1, 2024

oh! sorry, no, you can only currently run fastgather against a single .zip file. My apologies. Still, if you try it and like the speed, I can give you some suggestions for applying it to multiple, but it's more complicated ;(.

Author

musquita commented Jul 1, 2024

I'll be glad to download a new updated GenBank when it is available !
My protozoa matches, but I was in the process of redownloading files, so I verified that one. Running it separately, it now works. Not the same luck with fungi and bacteria: after my 'new' fungi download, I got yet another md5sum that also fails (2c071d29f713a2f97eb80288d77106b8). The bacteria download will still take a while...

I've tested your fastgather command with the GTDB database, and I should've found it sooner. It is much faster (at least 10x). If you can provide suggestions for applying it to multiple .zip files, I would greatly appreciate it.

Author

musquita commented Jul 2, 2024

Update: the new download of the bacteria file also fails. The md5sum I got this time was: 16b027bd1d3f934e4f1b84769f3d6a59
Is there any other way to get these files?

Not sure if it helps, but does the output from running fastgather on this .zip identify the potentially corrupted signatures? (I might not be interpreting this correctly.)

ksize: 51 / scaled: 1000 / moltype: DNA / threshold bp: 5000.0
gathering all sketches in 'test.sig' against '/media/smash/new/genbank-2022.03-bacteria-k51.zip' using 32 threads
Reading query(s) from: 'test.sig'
Loaded 1 query signature(s)
Reading search(s) from: '/media/DBs/smash/new/genbank-2022.03-bacteria-k51.zip'
Loaded 1148010 search signature(s)
using threshold overlap: 5 5000
WARNING: could not load sketches for record 'signatures/d3a3aa7140280708b174bf6f305c320e.sig.gz'
WARNING: could not load sketches for record 'signatures/ee1b9013b80350d1872f657ea7905e03.sig.gz'
WARNING: could not load sketches for record 'signatures/603f0d1583b2af7d08973fbb31503d90.sig.gz'
WARNING: could not load sketches for record 'signatures/96759e2b3113ea4799ed2472783bbb81.sig.gz'
WARNING: could not load sketches for record 'signatures/d759808294e0b5d4b81c515707524a33.sig.gz'
WARNING: could not load sketches for record 'signatures/c60319d3fe13d408a90fde475842b5ab.sig.gz'
WARNING: could not load sketches for record 'signatures/6b83c20b6f96a8cf1c89775feb5b7166.sig.gz'
WARNING: could not load sketches for record 'signatures/5060876cc8ec206edd3f7ef3cb613be6.sig.gz'
WARNING: could not load sketches for record 'signatures/45ede8fda5548842d323d472afc06c30.sig.gz'
WARNING: could not load sketches for record 'signatures/2ec5df438b265dc3a5651a4fff204142.sig.gz'
WARNING: could not load sketches for record 'signatures/519cc3ba36fa3ae718567c0c6ac8063a.sig.gz'
WARNING: could not load sketches for record 'signatures/08c6addb2a44f948cd7fd34450521519.sig.gz'
WARNING: could not load sketches for record 'signatures/0d92e4258be7d47510d614b515e6ac45.sig.gz'
WARNING: could not load sketches for record 'signatures/5a1bfa79c6815a1bbb16913d21b53632.sig.gz'
WARNING: could not load sketches for record 'signatures/db07b298dd199dd12663965fd260ca10.sig.gz'
thread '<unnamed>' panicked at /home/conda/feedstock_root/build_artifacts/sourmash_plugin_branchwater_1718836602865/_build_env/.cargo/registry/src/index.crates.io-6f17d22bba15001f/piz-0.5.1/src/spec.rs:660:9:
assertion `left == right` failed
  left: [0, 0, 0, 0]
 right: [80, 75, 3, 4]
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
WARNING: could not load sketches for record 'signatures/108a2eae27e5e7b0d2ebbc4fbbb632fb.sig.gz'
WARNING: could not load sketches for record 'signatures/5951285ed572438077ee204c09a6d888.sig.gz'
WARNING: could not load sketches for record 'signatures/8e56761327235bbbc89ce1f345d3bb90.sig.gz'
thread '<unnamed>' panicked at /home/conda/feedstock_root/build_artifacts/sourmash_plugin_branchwater_1718836602865/_build_env/.cargo/registry/src/index.crates.io-6f17d22bba15001f/piz-0.5.1/src/spec.rs:660:9:
assertion `left == right` failed
  left: [0, 0, 0, 0]
 right: [80, 75, 3, 4]
WARNING: could not load sketches for record 'signatures/8f986e6cb302037fcb3707e869a1ce18.sig.gz'
WARNING: could not load sketches for record 'signatures/6d6e69a5a815d45f3f21b31d06fb7ed1.sig.gz'
WARNING: could not load sketches for record 'signatures/e78e09e748fd031214a427d5a132fb19.sig.gz'
WARNING: could not load sketches for record 'signatures/a3b234d28aa9ae1b7af6add1ddc7eaf0.sig.gz'
WARNING: could not load sketches for record 'signatures/9805a87efd1c1d25c8f318164667f9f5.sig.gz'
WARNING: could not load sketches for record 'signatures/cf47b730fd38078aa9f6d778c79d5867.sig.gz'
WARNING: could not load sketches for record 'signatures/85d74e1edcd6d7ac7eee6301c2a4be47.sig.gz'
WARNING: could not load sketches for record 'signatures/b67624a22227eaf549c504b57da6a2c3.sig.gz'
thread '<unnamed>' panicked at /home/conda/feedstock_root/build_artifacts/sourmash_plugin_branchwater_1718836602865/_build_env/.cargo/registry/src/index.crates.io-6f17d22bba15001f/piz-0.5.1/src/spec.rs:660:9:
assertion `left == right` failed
  left: [0, 0, 0, 0]
 right: [80, 75, 3, 4]
WARNING: could not load sketches for record 'signatures/b0e82200d3f411da0a2abe96b8fc8a54.sig.gz'
Traceback (most recent call last):
  File "/home/cris/soft/miniconda3/envs/ont-diag/bin/sourmash", line 11, in <module>
    sys.exit(main())
  File "/home/cris/soft/miniconda3/envs/ont-diag/lib/python3.10/site-packages/sourmash/__main__.py", line 20, in main
    retval = mainmethod(args)
  File "/home/cris/soft/miniconda3/envs/ont-diag/lib/python3.10/site-packages/sourmash_plugin_branchwater/__init__.py", line 117, in main
    status = sourmash_plugin_branchwater.do_fastgather(args.query_sig,
pyo3_runtime.PanicException: assertion `left == right` failed
  left: [0, 0, 0, 0]
 right: [80, 75, 3, 4]

Contributor

ctb commented Jul 2, 2024

yikes, I don't know what to do about the downloads! This is extremely strange, I've never in my life had repeated problems with downloads 😓

The errors above are just the typical errors of "wow this file is corrupted, I don't know what to do about it."
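
For what it's worth, the `right: [80, 75, 3, 4]` in the panic is the byte sequence `P`, `K`, 0x03, 0x04, the signature that starts every local file header in a ZIP archive, and `left: [0, 0, 0, 0]` means the reader found zeros where it expected one. A small sketch for looking at those bytes by hand, assuming `head -c`, `tail -c` and `od` are available; `bytes_at` and `has_zip_magic` are hypothetical helper names, not part of sourmash:

```shell
# bytes_at FILE [OFFSET]: print the four bytes starting at OFFSET (default 0)
# as space-separated decimal values, e.g. "80 75 3 4".
bytes_at() {
    local file="$1" offset="${2:-0}"
    # tail -c +N starts output at byte N, counting from 1.
    tail -c "+$((offset + 1))" "$file" | head -c 4 \
        | od -An -tu1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# has_zip_magic FILE [OFFSET]: succeed if a local file header signature
# ("PK\003\004") sits at OFFSET.
has_zip_magic() {
    [ "$(bytes_at "$1" "${2:-0}")" = "80 75 3 4" ]
}

# Example: a non-empty zip normally starts with a local file header.
#   has_zip_magic genbank-2022.03-bacteria-k51.zip && echo "starts with PK header"
```

This only inspects one position, so it cannot show that a whole archive is intact; a full checksum comparison is still the real test.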

This is very likely to be connected to either your Internet connection or the computer to which you are downloading it.

A few thoughts -

  • try using a different Internet connection, if possible. like, on-campus vs off-campus. Coffee shops probably won't have the bandwidth (or will throttle your download speed), unf.
  • if you're downloading to your laptop via clicking on the link, try using a command-line approach like curl -JLO https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/genbank-2022.03/genbank-2022.03-bacteria-k51.zip
  • if you're already using curl or wget, try switching to the other one - e.g. wget https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/genbank-2022.03/genbank-2022.03-bacteria-k51.zip should work
  • you might also try asking your IT support to download it for you, if you have such. You can point them to this issue if you like.
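
A rough sketch of a download-then-verify step built around the curl suggestion, assuming `curl` and GNU `md5sum` are installed. The URL and checksum in the example are the ones from this thread; `md5_matches`, `fetch_and_verify` and the `--retry 3` setting are illustrative additions, not something anyone in the thread ran.

```shell
# md5_matches FILE EXPECTED: succeed if FILE exists and its MD5 equals EXPECTED.
md5_matches() {
    [ "$(md5sum "$1" 2>/dev/null | cut -d' ' -f1)" = "$2" ]
}

# fetch_and_verify URL EXPECTED_MD5: download URL into the current directory
# (named after the last path component) unless a verified copy already exists,
# then check the result against EXPECTED_MD5.
fetch_and_verify() {
    local url="$1" expected="$2" out
    out="$(basename "$url")"
    # Skip the download if a good copy is already present.
    md5_matches "$out" "$expected" && return 0
    # -L follows redirects, --fail turns HTTP errors into a non-zero exit,
    # --retry repeats the transfer on transient errors.
    curl -L --fail --retry 3 -o "$out" "$url" || return 1
    md5_matches "$out" "$expected"
}

# Example:
#   fetch_and_verify \
#     https://farm.cse.ucdavis.edu/~ctbrown/sourmash-db/genbank-2022.03/genbank-2022.03-bacteria-k51.zip \
#     c7ef7c815a00337a7252ab49ffab3e8b
```

The download always restarts from scratch here rather than resuming, so a partially corrupted file is never reused.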

If there is a public FTP site where I can drop a file, or you want me to upload it via dropbox or box, lmk. I think I can do the latter? Not sure how they like such big files these days.

The last option I can think of is sending a USB stick - drop me an e-mail at [email protected] with a shipping address, and I'll see what I can do. I'd probably wait until the new genbank is out tho.

Author

musquita commented Jul 2, 2024

That did it! I'd tried downloading with wget and by clicking the link, but didn't remember to use curl... Now, the files have finished downloading with the correct md5sum!

Thank you for your help, and I'm sorry for taking your time with this trivial issue. I will be marking this as closed.

I'll be sure to keep an eye out for the release of the updated GenBank and will read more about fastgather and how I could apply it to multiple .zip files.

musquita closed this as completed Jul 2, 2024

Contributor

ctb commented Jul 2, 2024

No worries, glad it wasn't our server!

I'll post more answers in a separate issue soon! But the short version with fastgather is you have two options -

First,

  • run each fastgather separately
  • combine the resulting CSVs with something like csvtk concat
  • use the resulting combined CSV as a picklist with sourmash gather

Second/alternative,

  • run each fastgather separately
  • build a combined database
  • use the resulting combined database for another round of fastgather

There are a few annoying mechanical steps in there that I need to work through. Will not take me very long.
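
The shared first step of both options ("run each fastgather separately") could look roughly like this. It is only a sketch: the fastgather flags are copied from the command earlier in this thread, while `fastgather_each`, the `SOURMASH` override and the output naming scheme are assumptions made for illustration. The later steps (picklist or combined database) are left to the tutorial.

```shell
# Command to invoke; overridable so the loop can be dry-run, e.g. SOURMASH=echo.
SOURMASH="${SOURMASH:-sourmash}"

# fastgather_each QUERY_SIG OUTDIR DB_ZIP...
# Run fastgather once per database .zip, writing one CSV per database
# into OUTDIR. Stops at the first failing run.
fastgather_each() {
    local query="$1" outdir="$2" db name
    shift 2
    mkdir -p "$outdir"
    for db in "$@"; do
        name="$(basename "$db" .zip)"
        "$SOURMASH" scripts fastgather "$query" "$db" \
            -o "$outdir/$name.fastgather.csv" \
            -k 51 --scaled 1000 --threshold-bp 5000 || return 1
    done
}

# Example (paths from the original post):
#   fastgather_each sourmash/sample.sig sourmash/fastgather \
#       /media/DBs/smash/genbank-2022.03/genbank-2022.03-{archaea,bacteria,protozoa,fungi,viral}-k51.zip
```

Setting `SOURMASH=echo` prints the commands instead of running them, which is a cheap way to check the paths first.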

Contributor

ctb commented Jul 3, 2024

posted a tutorial here: #3239

and I learned something new - that I could use tail +2 to get all of a CSV file but the header row. cool!
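
A small sketch of that trick, using the POSIX spelling `tail -n +2` (print from line 2 onward) to merge several CSVs under one header. `concat_csv` is a hypothetical helper, and it assumes every input file has the same header row and ends with a newline.

```shell
# concat_csv FILE...: print the header of the first file once,
# followed by the data rows (everything after line 1) of every file.
concat_csv() {
    [ "$#" -gt 0 ] || return 1
    local f
    head -n 1 "$1"
    for f in "$@"; do
        tail -n +2 "$f"
    done
}

# Example:
#   concat_csv sourmash/fastgather/*.csv > sourmash/combined.fastgather.csv
```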

ask questions as you have them!
