ISSUE-91:ML processors #92

Merged
merged 44 commits into from
Jun 20, 2024

Conversation

DiegoPino
Member

@DiegoPino DiegoPino commented May 14, 2024

See #91

Fully depends on esmero/strawberryfield#326

DiegoPino added 10 commits May 13, 2024 17:59
ML processors use an IIIF URL, so there is no need to download the file at all
Abstract because it can't be used directly: each ML model is quite opinionated. Full of bugs and missing checks, but it works. Will keep refining and trying to make it elegant before merging
Generates the proper embeddings, and those end up properly in Solr
DiegoPino added 19 commits May 15, 2024 20:50
But it will. Basically I can't pass an image URL and a bounding box via GET (damn Drupal 10.2), but! I can gzip and base64-encode. I will do that next.
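The gzip + base64 trick described above can be sketched as a simple round trip. This is an illustrative Python sketch, not the module's PHP implementation; the payload keys (`iiif_image_url`, `bbox`) are hypothetical stand-ins:

```python
import base64
import gzip
import json

def encode_payload(payload: dict) -> str:
    """Compress a JSON payload and make it URL-safe for a GET argument."""
    raw = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(gzip.compress(raw)).decode("ascii")

def decode_payload(token: str) -> dict:
    """Reverse the encoding on the receiving end (e.g. inside a Views filter)."""
    return json.loads(gzip.decompress(base64.urlsafe_b64decode(token)))

payload = {
    "iiif_image_url": "https://example.org/iiif/2/abc/full/full/0/default.jpg",
    "bbox": {"x": 0.1, "y": 0.2, "w": 0.5, "h": 0.4},
}
token = encode_payload(payload)
```

Using the URL-safe base64 alphabet avoids `+` and `/`, so the token survives a query string without extra escaping.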
The idea here is that this filter can be fed via "ajax" by another view/formatter/plugin with a JSON structure that I will then decode here. If no bounding box was given, I will check whether I have a vector for the same type already configured in Solr and use that one; if not, I will call the corresponding API on the NLP container, generate a vector on the fly (well, in about 3 seconds at least), and then alter the query to do a KNN search... cool stuff here
Basically preparing for re-use inside a Views Filter. That way we can ensure we search with the same ML model/vector size that generated what we are searching against
and expand the public interface methods of the runnerUtility service (just because it aids autocompleting/validating/overriding if ever needed in the future)
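The "alter the query to do a KNN" step boils down to building a Solr `{!knn}` query parser string against a dense-vector field. A minimal sketch, assuming a hypothetical field name (`sbf_vector_384` is not necessarily the real Solr field):

```python
def knn_query(field: str, vector: list, top_k: int = 10) -> str:
    """Build a Solr {!knn} query parser string for a dense_vector field."""
    vec = "[" + ", ".join(str(float(v)) for v in vector) + "]"
    return "{!knn f=%s topK=%d}%s" % (field, top_k, vec)

q = knn_query("sbf_vector_384", [0.1, 0.2, 0.3], top_k=5)
```

The on-the-fly vector from the NLP container would be dropped in as the `vector` argument before the query alter runs.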
…gins

that are not "top level"... because we might end up needing deeper-level listings for external access of the method
Because vectors are so specific. And we don't want white screens of sorts, e.g. "Solr died" because it felt lonely, tired, who knows.
+ match the description users would read; I think nobody is into vectors, so better this way
Simpler than a multiple if/then/what/else/who-cares situation
So, now that we are actually:
- Fetching an image
- Allowing a region
- Calling the processor on the backend to generate a vector (and object detection... will be useful in the future too)
we need to start thinking of query/alter via tags, similar to what we do with the OCR one. But there are many things here we will have to figure out... like how does KNN interact with the other ones? The pre-query vs. the post-query, and how we will deal with facets. But baby steps first. We need to add a new query option, alter/act on it, and see if KNN works first
(at the strawberryfield level, since that module provides the KNN fields) the query and join/etc., the weird magic
Plus the filter processing via exposed filters (which is indeed a different plugin, and the one I really need for interactive filtering)
… similar!

@alliomeria (*since I did not ping you before)
I think that for exposed ML arguments, instead of passing a JSON,
I will only allow a UUID of an existing file + a fragment selector,
so UUID#xywh=percent:{$left},{$top},{$width},{$height}
Why? The argument is shorter/easier to decode... it also basically removes the creepiness of passing JSON and s3:// URIs.
It adds an extra file load, though, but that adds security
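The proposed `UUID#xywh=percent:...` argument is compact enough to parse with a single pattern. A hedged sketch of what such a parser could look like (the function and regex are illustrative, not the module's actual filter code):

```python
import re

# Hypothetical parser for the proposed argument format:
#   UUID#xywh=percent:{left},{top},{width},{height}
_ARG = re.compile(
    r"^(?P<uuid>[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
    r"#xywh=percent:(?P<l>[\d.]+),(?P<t>[\d.]+),(?P<w>[\d.]+),(?P<h>[\d.]+)$"
)

def parse_ml_argument(arg: str):
    """Return (uuid, (left, top, width, height)) or None if malformed."""
    m = _ARG.match(arg)
    if m is None:
        return None
    return m.group("uuid"), tuple(float(m.group(k)) for k in ("l", "t", "w", "h"))

parsed = parse_ml_argument(
    "0b1e3c2a-1111-4222-8333-444455556666#xywh=percent:10,20,50,40"
)
```

Rejecting anything that doesn't match the pattern is what removes the "creepiness" of accepting arbitrary JSON or `s3://` strings from the URL.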

(ask me tomorrow, if you want, why a contextual filter, and I will answer)
This filter could also allow a "hide" option... so basically only to be set
via URL arguments... right? @alliomeria right?
Processors are getting cleaner (not there yet) and the Image Argument filter less beefy
DiegoPino added 12 commits May 25, 2024 19:20
One annoying thing is that, because of how I built Runners, a runner cannot generate multiple Flavors (without a pager). Insightface so far is the one where embeddings have the closest alignment with the logic of detecting a face. The main features of a face are encoded in the vector, which in tests allowed me to even match family members (tested with myself and mom!), but it also means that if a single image has multiple detections I need multiple vectors, and Solr allows one vector per field per document. Will explore the one-to-many option in our code @alliomeria
Todo: ask @alliomeria and the community about "empty completeness" vs. "absence of data" (the first implies normalization, e.g. doc count == processed count; the second, a slimmer Archipelago)
We want people to know this will be possible, but for now it is easier to treat all filters as pre-queries (as we do)
Also, don't state it is YOLOv8; in the (near) future it might be YOLOv10, YOLOv11, etc.
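One common way around the "one vector per field per document" limit is to flatten each detection into its own sibling document carrying a single vector. A sketch of that one-to-many shape, assuming hypothetical field names (not the module's actual Solr schema):

```python
def detections_to_docs(parent_id: str, detections: list) -> list:
    """Flatten N face detections into N documents, one vector each.

    Field names here are hypothetical stand-ins for the real Solr schema.
    """
    return [
        {
            "id": "%s/face/%d" % (parent_id, i),
            "parent_id_s": parent_id,
            "bbox_s": ",".join(str(v) for v in d["bbox"]),
            "vector": d["embedding"],
        }
        for i, d in enumerate(detections)
    ]

docs = detections_to_docs("ado:123", [
    {"bbox": [10, 20, 50, 40], "embedding": [0.1, 0.2]},
    {"bbox": [60, 20, 30, 30], "embedding": [0.3, 0.4]},
])
```

A KNN match on any child then resolves back to the parent image via the shared parent id.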
Also:
Need to document
\Drupal\strawberry_runners\Plugin\views\filter\StrawberryRunnersMLImagefilter::IMAGEML_INPUT_SCHEMA so people testing know what the expected SHAPE of the JSON is. Again, this filter is just a way of exposing values. Programmatically (a user/dev needs to know what needs to be done), one would submit a JSON with that structure @alliomeria
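Until that constant is documented, a caller could at least guard a programmatic submission with a key check. Purely illustrative: the authoritative shape lives in `IMAGEML_INPUT_SCHEMA`, and the keys below are hypothetical stand-ins, not the real schema:

```python
# Hypothetical required keys; the real ones are defined by
# StrawberryRunnersMLImagefilter::IMAGEML_INPUT_SCHEMA.
REQUIRED_KEYS = frozenset({"iiif_image_id", "bbox"})

def validate_input(payload) -> bool:
    """Check that a programmatic submission carries the required keys."""
    return isinstance(payload, dict) and REQUIRED_KEYS <= payload.keys()

ok = validate_input({"iiif_image_id": "abc", "bbox": [0, 0, 1, 1]})
```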
@@ -298,19 +318,91 @@ public function invokeProcessorForAdo(ContentEntityInterface $entity, array $sbf
}
}
}
// JSON/Metadata level plugins coming from ADO JSON directly, $plugin_definition['input_type'] == entity:node
Member Author

@alliomeria this is code for the future. This will allow Post processors to also act on pure NODES and their metadata. I have no plugin that needs this yet, but once we have a few ideas it can be used directly.

@DiegoPino
Member Author

@alliomeria I would like this to be mentioned in the new release, but documentation is something that we actually need to work on internally and generate/decide on for the demo implementations. The least harmful ML method here is MobileNet, so maybe enable that one as an admin? Also, I need to know which permissions/users should be able to use ML-driven Views for testing.

@alliomeria
Contributor

Ok @DiegoPino, understood about your preferences for this. Admin only is what you've been saying for a while, and it sounds good to me too.

@DiegoPino
Member Author

@alliomeria the only issue with Admin Only is if we have a group of people working on/testing the (query) part whom the institution does not trust with full admin access, but still provides Authenticated/User access.
What if I make it "has a certain permission" OR "is admin", but never "anonymous"?
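The proposed access rule (a named permission OR admin, but never anonymous) is a small predicate. A language-agnostic sketch; in Drupal this would be a permission check on the Views filter, and the permission string below is hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class User:
    is_authenticated: bool
    is_admin: bool = False
    permissions: set = field(default_factory=set)

# Hypothetical permission string, not the module's actual one.
ML_VIEWS_PERMISSION = "use strawberry runners ml views"

def can_use_ml_views(user: User) -> bool:
    """Admin OR holder of the ML permission, but never anonymous."""
    if not user.is_authenticated:
        return False
    return user.is_admin or ML_VIEWS_PERMISSION in user.permissions
```

The early return makes "never anonymous" unconditional, so granting the permission to the anonymous role by mistake would still not open the filter.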

@alliomeria
Contributor

Would defer to you on this @DiegoPino; it all depends on how widely you think this tool should be used. Having the option for any type of Authenticated User with the checked permissions, but never anonymous/non-authenticated users, also works & enables non-administrator-level users to work/test as you noted.

@DiegoPino
Member Author

DiegoPino commented Jun 20, 2024

@alliomeria I will try to work tonight on generating a very simple documentation piece. Added the permissions and restrictions. This requires a new NLP processor container, which I will build and publish tomorrow AM (it takes me at least 3 hours of Docker build). Thanks for reading all this; this is at least 3K lines of code.

@DiegoPino DiegoPino merged commit 42a36eb into 0.8.0 Jun 20, 2024