Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Identify task #1839

Merged
merged 53 commits into from
Oct 28, 2021
Merged

Identify task #1839

merged 53 commits into from
Oct 28, 2021

Conversation

WithoutPants
Copy link
Collaborator

@WithoutPants WithoutPants commented Oct 13, 2021

As discussed in #1802.

Resolves #1353 (alongside #1812).
Resolves #1589.

Supercedes #1522.

Adds an Identify task to the Tasks page, grouped with Scan and Auto Tag in the Library section:

image

Opens a dialog as follows:

image

By default, only the first stash-box is added as a source. Other sources can be added, but must either be a stash-box instance or a scene fragment scraper.

Allows setting default options which can be overridden in the individual sources. Includes field options:

image

This allows specific field behaviour to be changed. Strategy can be "ignore", "merge", or "overwrite". For single value fieds, merge is functionally the same as Overwrite only sets if no value is already set on the field. For multivalue fields (performers, tags), it means that existing values are kept while adding to them. For performers, tags and studios, a create missing option is available that creates any missing objects as needed. This defaults to false.

It is possible to set the defaults for the task. This pattern will be applied to the other tasks, alongside a general refactor of the Tasks pages, in another PR.

The Identify task itself tries to identify the scene starting from the first source. If it finds a result, then it sets the scene fields in accordance with the field settings (preferring source-specific settings, then defaulting to the global values. Where a scene field isn't specified, its strategy is set to "merge" and not to create missing objects.

The other sources are only checked if the previous ones had no results.

The task logs if a scene was identified or not.

This PR still has some pending things to do, but it's worth giving it a test as it is.

TODO:

  • identify from scene selection
  • support Identify using folder selection
  • documentation
  • unit tests

Future iterations:

  • add identify task as scan option
  • add filename filtering to scraper configs
  • add some sort of throttling for scraper requests

@WithoutPants WithoutPants added the feature Pull requests that add a new feature label Oct 13, 2021
@WithoutPants WithoutPants added this to the Version 0.11.0 milestone Oct 13, 2021
@WithoutPants WithoutPants marked this pull request as draft October 13, 2021 05:27
Copy link
Contributor

@SmallCoccinelle SmallCoccinelle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, this looks good. I only looked at the Go code though, since I'm bad at Typescript idioms and stuff. Most of the comments are just my curiosity as to why certain things are written as they are.

graphql/schema/types/metadata.graphql Show resolved Hide resolved
pkg/manager/config/config.go Show resolved Hide resolved
pkg/manager/task_identify.go Outdated Show resolved Hide resolved
pkg/models/stash_box.go Show resolved Hide resolved
pkg/models/stash_box.go Outdated Show resolved Hide resolved
pkg/models/stash_box.go Outdated Show resolved Hide resolved
pkg/autotag/identify.go Outdated Show resolved Hide resolved
pkg/autotag/identify.go Outdated Show resolved Hide resolved
pkg/autotag/identify.go Outdated Show resolved Hide resolved
pkg/autotag/identify.go Outdated Show resolved Hide resolved
@kermieisinthehouse
Copy link
Collaborator

  • It should complain if you have no sources configured, otherwise it will confuse users who haven't configured scrapers / stashdb
  • The message "Default options will be used where source-specific options are not set." is unclear. It took a little while to figure out what it refers to.
  • The "Unable to identify" should maybe be at the warning log level, since they outnumber the matches 1000:1 (at least in my dataset)

We should also be much clearer with the user that media information about their entire library is about to be transmitted over the network to their sources. Maybe have an option about what the match criteria to send, as in only send phashes/oshashes, or filename fragments, etc.

@kermieisinthehouse
Copy link
Collaborator

Some questions: What does merge mean for a studio? What does it mean for titles?

@stg-annon
Copy link
Collaborator

stg-annon commented Oct 13, 2021

Some questions: What does merge mean for a studio? What does it mean for titles?

Seems to be ignored and only used for multi field values based off this comment

@WithoutPants
Copy link
Collaborator Author

Some questions: What does merge mean for a studio? What does it mean for titles?

For single value fieds, merge is functionally the same as Overwrite. For multivalue fields (performers, tags), it means that existing values are kept while adding to them.

@WithoutPants
Copy link
Collaborator Author

Potential future enhancement - would be useful if this (could) include the blacklist from the scene tagger (or it's own blacklist). There are files where it works if query done in the scene tagger due to blacklist stripping out a value like _1080, but doesn't work in Identify because it uses the raw file name.

May be somewhat redundant with the TODO of adding filename filtering to scraper configs, but it may also be less development work if the blacklist data is already available.

Otherwise my tests found that it identified all files that could be found with "fingerprint match", and seemed to correctly identify all files that could be found with the 'query' button in scene tagger as long as that query string had not had any values removed from it by blacklist.

The identify task does not use filenames and the blacklist does not apply. The Identify task is the equivalent of running a query by fingerprint not a keyword query. I realise that some scrapers - for example ThePornDB scraper - use the filename, but in this instance the blacklist is not used; the blacklist is only applied when using the keyword query function.

In short, the Identify task is the equivalent of running the "Scrape by fragment" or "Scrape All" operations in the Tagger, not the "Search" operation.

There is future scope for having scraper-specific options and a blacklist might be worth including, but at least with the current implementation, the blacklist simply doesn't apply.

Copy link
Contributor

@SmallCoccinelle SmallCoccinelle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Went over the Go code only. Added some comments here and there, and some of them are definitely worth looking into.

graphql/schema/schema.graphql Outdated Show resolved Hide resolved
graphql/schema/types/metadata.graphql Show resolved Hide resolved
graphql/schema/types/metadata.graphql Show resolved Hide resolved
pkg/autotag/identify.go Outdated Show resolved Hide resolved
}

if scene == nil {
return fmt.Errorf("no scene found with id %d", t.SceneID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An option here is to have an error at the top-level:

var ErrNotFound = errors.New("entity not found")

so you can wrap that static error here: fmt.Errorf("scene with id %d: %w", t.SceneID, ErrNotFound).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll reuse models.ErrNotFound instead, unless there's a reason not to.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depends on where you want to go. Consolidation is good. Unless you want to tell the errors apart, or you want to avoid cross-dependencies between packages.

pkg/autotag/identify.go Outdated Show resolved Hide resolved
return ret
}

func idListEquals(a, b []int) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure this does what it's supposed to. Consider e.g.:

func TestIdEqualsReject(t *testing.T) {
	x := []int{1, 1, 1, 2, 2, 2}
	y := []int{1, 2, 3, 4, 5, 6}

	if IdListEquals(x, y) {
		t.Errorf("Lists aren't the same")
	}
}

Which fails.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are just interested in precise equality, you can use reflect.DeepEqual. Or use the hammer which is also safer: `https://pkg.go.dev/github.com/google/go-cmp.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that we want to compare lists while ignoring order, since the order may be changed. I've written a utility function to do this generically.

pkg/autotag/identify.go Outdated Show resolved Hide resolved
pkg/autotag/identify.go Outdated Show resolved Hide resolved
pkg/scene/query.go Outdated Show resolved Hide resolved
Copy link
Contributor

@SmallCoccinelle SmallCoccinelle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - just updating so I'm caught up.

@SmallCoccinelle
Copy link
Contributor

Went over this PR as well. Managed to push the wrong button so I reviewed some outdated stuff as well. But I'm sure you can manage to find if it has moved elsewhere.

@ghost
Copy link

ghost commented Oct 22, 2021

The UI feels a bit confusing. I would suggest listing all fields all the time, and using a button group for the strategy instead of a dropdown, that way it's easy to quickly toggle on/off different fields. It's also clear what actions will be taken for all fields. For source settings you could have an additional "inherit" option which is selected by default.

I also think it would be better to remove the option to set invalid values, like merge for single value fields. Considering the identify task can run on an entire collection, it's best to be clear about what it will do.

Lastly, for the set cover images option, isn't that basically just another field?

@WithoutPants
Copy link
Collaborator Author

The UI feels a bit confusing. I would suggest listing all fields all the time, and using a button group for the strategy instead of a dropdown, that way it's easy to quickly toggle on/off different fields. It's also clear what actions will be taken for all fields. For source settings you could have an additional "inherit" option which is selected by default.

Sounds reasonable, I'll take a look.

I also think it would be better to remove the option to set invalid values, like merge for single value fields. Considering the identify task can run on an entire collection, it's best to be clear about what it will do.

Merge is a valid value, but perhaps could be named better for single-value fields. It basically means overwrite if empty.

Lastly, for the set cover images option, isn't that basically just another field?

Scene cover is in a bit of a funny place at the moment. Conceptually, the image cover always exists - it is automatically generated during scan. We can check for the presence of the cover image in the database, but I don't think this is good long-term design (shifting images out of the database is something I want to do in future). As such, there's no good way to determine if a cover is set (to something other than the default) or not, meaning that the option for it is a simple boolean. There's no reason to have overwrite/merge/ignore state for it, so a simple boolean state seemed like the sensible option.

@WithoutPants
Copy link
Collaborator Author

  • Did some refactoring to support running the scene update post hook after updating a scene.
  • Fixed bug where directories could not be cleared in the identify dialog.
  • Identify dialog now shows all fields and presents strategy options with radio buttons instead of dropdown:

image

@WithoutPants WithoutPants merged commit 27cdefb into stashapp:develop Oct 28, 2021
WithoutPants added a commit to WithoutPants/stash that referenced this pull request Oct 28, 2021
* Add identify task
* Change type naming
* Debounce folder select text input
* Add generic slice comparison function
@SmallCoccinelle SmallCoccinelle mentioned this pull request Oct 28, 2021
19 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Pull requests that add a new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] Bulk/Automate Scrap from stashdb or other scraper How to scrape multiple videos
6 participants