Disc-hashing of non-PSX discs #1075

Open · adelikat opened this issue Dec 9, 2017 · 19 comments
Labels: App: EmuHawk · Repro: Patch pending · Request: Feature/Enhancement

adelikat (Contributor) commented Dec 9, 2017

Open a Saturn game and click the ROM status icon in the status bar: nothing happens.

This is also true of any core that takes a list of ROMs (some cores have both single-ROM and multi-ROM constructors, and the problem occurs only for multi-ROM).

Either the client needs to take a stab at implementing RomAnnotationDetails for multi-ROMs, or each affected core needs to implement that logic.

Asnivor (Contributor) commented Mar 22, 2018

see #1152

adelikat pushed a commit that referenced this issue Jun 22, 2018
Asnivor (Contributor) commented Aug 23, 2018

This now works as of a few months ago.

The last thing is that disc-based 'ROMs' just show their hashes as 'N/A' (because that's what I did at the time).

Can we have a discussion on what we should be showing here in this instance?

  • cue hashes are pointless
  • bin/img/iso hashes can be expensive, I would imagine
  • IIRC PSX has some black-magic internal hash thing that @zeromus mentioned to me a while back. Do we want to use this?
  • is there such a thing as a 'known good dump' when it comes to disc images?
  • or do we just not bother?

zeromus (Contributor) commented Aug 23, 2018

cue is the same situation as bin/img/iso: it's the mounted disc that gets hashed. All of the disc images have a black-magic hashing method (although some, like PSX, are far blacker than others). I did generate a known-good list of PSX ROMs by running my blacker-magic hashing method on all of redump. It's in our gamedb.

We should be able to show a disc "hash" in the rom status. These are useful for identification purposes, and that's what the movie tools should be doing to confirm that the "correct" game is probably being used. Of course, it can't be known 100% for sure unless the entire disc's contents are hashed. For now, a weaker identifying hash is all we can do.

For all discs we should be using (maybe we aren't, I don't know) a hash based on the TOC plus the first several sectors. "First several sectors" was chosen by vecna originally. I determined how many sectors were required to uniquely identify all the PSX games (part of the blacker magic), although it would be possible to trick it.
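(For illustration, a rough sketch of the TOC-plus-leading-sectors idea. The Disc/Toc types, the reader method, and the digest algorithm are all hypothetical stand-ins; BizHawk's real method is DiscHasher.Calculate_PSX_BizIDHash(), mentioned later in this thread.)

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Sketch only: "Disc", "Toc", and "ReadSector" are hypothetical.
static string ComputeDiscIdHash(Disc disc, int sectorCount = 26)
{
    using var ms = new MemoryStream();

    // Fold in the TOC: track numbers, modes, and start addresses.
    foreach (var track in disc.Toc.Tracks)
    {
        ms.WriteByte((byte)track.Number);
        ms.WriteByte((byte)track.Mode);
        ms.Write(BitConverter.GetBytes(track.StartLba));
    }

    // Fold in the first N sectors; 26 was the count that uniquely
    // identified the whole PSX redump corpus.
    var sector = new byte[2048];
    for (int lba = 0; lba < sectorCount; lba++)
    {
        disc.ReadSector(lba, sector);
        ms.Write(sector, 0, sector.Length);
    }

    return Convert.ToHexString(MD5.HashData(ms.ToArray()));
}
```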

A more robust identification would involve parsing the filesystem and hashing that, AND identifying the executable and hashing that. The goal is to hash as much as we can, short of going the last step and hashing everything, while keeping in mind that romhacks are likely to change only the EXE, and bad rips are likely to have different TOCs before anything else is wrong.

However, we're probably not changing any hashing methods now. I'm just telling you what my grand plan was.

At any rate, we DO have a disc-identifying hash. We should display that. How we indicate that it's a disc-identifying hash and not a complete verification hash, I don't know. I had thought we should offer to compute a complete verification hash, and people should be able to put that in movies; people who want to be sure can run the verification hash on their media to check. Without that, only the cursory identification hash would be used.

Asnivor (Contributor) commented Sep 11, 2018

@zeromus

So should we be using DiscHasher.Calculate_PSX_BizIDHash() for non-PSX discs as well?
Or DiscHasher.OldHash()?

zeromus (Contributor) commented Sep 11, 2018

The right way to do this is to repeat the analysis I did for PSX: use the PSX hash, but adjust the 26 (the leading-sector count) as needed while running it over the whole Saturn corpus until an appropriate value is found (or just use 26 again if that works, so that we can use the same hash for both systems). At that point the gamedb also needs to be built. I already know how to do that, so you can either assign this to me in order to make me do it, or tackle it yourself.
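(Sketched out, that analysis is just a collision search over the corpus. This reuses the hypothetical ComputeDiscIdHash from the earlier sketch; it is not DBMan's actual code.)

```csharp
using System.Collections.Generic;

// Grow the leading-sector count until every dump in the corpus hashes
// uniquely. For the PSX redump set this bottomed out at 26.
static int FindMinimumSectorCount(IReadOnlyList<Disc> corpus)
{
    for (int n = 1; ; n++)
    {
        var seen = new HashSet<string>();
        var collided = false;
        foreach (var disc in corpus)
        {
            if (!seen.Add(ComputeDiscIdHash(disc, n)))
            {
                collided = true;
                break;
            }
        }
        if (!collided) return n;
    }
}
```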

note: DBMan's DiscHash.cs seems to be what I used to perform this analysis. I think it would be nice if we could improve that tool to formalize the process, so that the analysis can be performed again on an updated corpus. Keep in mind, though, that if the hashes ever change, people's movies will contain obsolete disc ID hashes. That isn't the end of the world.

Also, if we DO ever change the hashing approach, we should go ahead and try the filesystem-and-EXE thing. Since I just thought of the 'obsoleting hashes in movies' argument, that may be justification for doing the work now, for Saturn. I think this cdfs-and-exe thing may suit your interests well, but it's up to you.

@Asnivor Asnivor changed the title Rom status icon doesn't work on Saturn Disc-hashing of non-PSX discs Sep 13, 2018
@Asnivor Asnivor added Request: Feature/Enhancement For feature requests or possible improvements App: EmuHawk Relating to EmuHawk frontend labels Sep 13, 2018
Asnivor (Contributor) commented Sep 13, 2018

I think I'll have to push it to you for now. The 'whole Saturn corpus' thing is where I'd struggle at the moment.

CasualPokePlayer (Member) commented
Since 32e8afc, it's no longer super expensive to hash the entire disc; generally it should take less than a second. We could probably just do full-disc hashes for every disc and scrap the identifying hashes.
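(A full-disc hash can be streamed sector by sector, along these lines. Again the Disc API is a hypothetical stand-in, and this is not the code from 32e8afc.)

```csharp
using System;
using System.Security.Cryptography;

// Stream every raw 2352-byte sector through an incremental SHA1 so the
// whole image never has to sit in memory at once.
static string ComputeFullDiscHash(Disc disc)
{
    using var sha1 = IncrementalHash.CreateHash(HashAlgorithmName.SHA1);
    var sector = new byte[2352];
    for (int lba = 0; lba < disc.SectorCount; lba++)
    {
        disc.ReadRawSector(lba, sector);
        sha1.AppendData(sector);
    }
    return Convert.ToHexString(sha1.GetHashAndReset());
}
```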

YoshiRulz (Member) commented
Keep the one which matches redump.org still

zeromus (Contributor) commented Aug 28, 2022

Just reading 700MB should take more than 300ms, so I'm skeptical of the analysis. You need to run it on "cold" discs, or do an analysis that incorporates the reading time in the first place. The scenario of waiting seconds while a game is opened, to make 100% sure it's something our database has, is not something casual users care about; only TASers and such do. Maybe it's as simple as having an option to do complete verification of the discs at load time, although I'm still in favor of doing it in the background somehow and offering it as an advisory info-tip: "! complete disc verification failed !"

CasualPokePlayer (Member) commented
To clarify, around 270-275 ms of that 369 ms is spent actually reading the disc, from what I measured. And the old CRC32 algorithm took over 4 seconds by itself. What exactly do you mean by cold discs?

CasualPokePlayer (Member) commented Aug 28, 2022

On doing it in the background: I assume that's possible (with a separate thread given a new Disc object for that disc). (EDIT: No, we can't. Probably. The hash is used for obtaining the entry in the gamedb, so it's blocking.) Even so, we could still parallelize the reading and hashing, and that should offer a bit of a performance boost (although I'm not sure how worthwhile that is).

Also, due to some sloppy code, ROMs are guaranteed to be hashed at least 3 times with SHA1 and at least once with MD5 (possibly additional times for both in some cases). This by itself is a bit annoying if you have a rather large ROM file. A worst case can be seen with Pokémon Black/White 2, which is 512 MiB and takes 1.3 seconds to hash each time. Granted, booting the core takes much more time anyway, partially due to waterboxing and probably other sloppy code elsewhere. We should probably do something about that too.
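(The obvious shape of a fix, sketched with a hypothetical wrapper type rather than BizHawk's actual loader structure: compute each digest lazily and at most once.)

```csharp
using System;
using System.Security.Cryptography;

public sealed class RomDigests
{
    private readonly Lazy<string> _sha1;
    private readonly Lazy<string> _md5;

    public RomDigests(byte[] rom)
    {
        // Lazy<T> runs each factory at most once, so every later lookup
        // reuses the first result instead of re-hashing 512 MiB.
        _sha1 = new Lazy<string>(() => Convert.ToHexString(SHA1.HashData(rom)));
        _md5 = new Lazy<string>(() => Convert.ToHexString(MD5.HashData(rom)));
    }

    public string Sha1 => _sha1.Value;
    public string Md5 => _md5.Value;
}
```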

> Keep the one which matches redump.org still

Those are the full disc hashes.

zeromus (Contributor) commented Aug 28, 2022

"cold discs" are ones you didn't load a few seconds ago on another test. it must be completely out of the filesystem caches, even the disk caches. maybe you should read as many gigs of garbage data as you have ram to make sure it's all completely gone... actually there is probably a win32 api call to invalidate the whole disk cache? or maybe just the cache for a certain filename? yeah that would be a better way to do it

CasualPokePlayer (Member) commented
Seems with a "cold disc" it's around 1811 ms for 700MB. Afterwards it goes back down to around the numbers I saw.

Still not so bad, probably (especially as that's close to a worst case here; cold reads for smaller games are generally much shorter). I'll see how well my threading idea works and whether it makes a significant difference.

zeromus (Contributor) commented Aug 29, 2022

Threading the reads is not likely to help; in fact it's likely to hurt, by making the OS and disk try their hardest to serve multiple users without realizing they're actually one user pretending to be 3. 3 users in parallel can't be serviced as quickly as 3 users in serial; you only take on the added overhead of going parallel to increase the responsiveness of unrelated tasks. Here, where it's all the same task, it's nuts. But good luck.
You might get some overall improvement by threading the reading and the hashing separately, though (read chunks in one thread and hand them off for hashing to another thread). But not a whole lot.
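(That pipeline might look something like the following sketch, using System.Threading.Channels and the same hypothetical Disc API as the earlier sketches.)

```csharp
using System;
using System.Security.Cryptography;
using System.Threading.Channels;
using System.Threading.Tasks;

static async Task<string> HashPipelinedAsync(Disc disc)
{
    // Bounded so the reader can't run arbitrarily far ahead of the hasher.
    var channel = Channel.CreateBounded<byte[]>(capacity: 16);

    var reader = Task.Run(async () =>
    {
        for (int lba = 0; lba < disc.SectorCount; lba++)
        {
            var sector = new byte[2352]; // note: one allocation per sector
            disc.ReadRawSector(lba, sector);
            await channel.Writer.WriteAsync(sector);
        }
        channel.Writer.Complete();
    });

    using var sha1 = IncrementalHash.CreateHash(HashAlgorithmName.SHA1);
    await foreach (var sector in channel.Reader.ReadAllAsync())
        sha1.AppendData(sector);

    await reader;
    return Convert.ToHexString(sha1.GetHashAndReset());
}
```

The per-sector allocation in the reader loop is exactly the kind of managed-memory churn that can eat the gain, which is consistent with the measurements reported later in the thread.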

CasualPokePlayer (Member) commented
Yes, I was suggesting threading the reading and the hashing, as those are in theory parallelizable. Probably not much of an improvement, though.

CasualPokePlayer (Member) commented Aug 29, 2022

Apparently not even that helps; it just makes hashing take double the time (or worse, depending on how I do it). I suspect it's because far more managed memory gets allocated, and that hurts performance more than the threads can help.

YoshiRulz (Member) commented
Doesn't the disc need to be parsed before it's hashed? Or is that the other one? Point is, the managed parsing code is going to be the bottleneck.

zeromus (Contributor) commented Aug 29, 2022

Disc "parsing" is not time-consuming, all the data is in a "header" (aka a cue file, or the actual header of a ccd file) and it's limited to 99 tracks. The big binary blob part/file requires no further analysis. It's designed to be random-access once the header is parsed

YoshiRulz (Member) commented
This issue needs re-triaging after that PR ^ e2a6942

@YoshiRulz YoshiRulz added the Repro: Patch pending Potentially fixed in dev build, see readme for download label Sep 14, 2024