ISSUE-91:ML processors #92

Merged
merged 44 commits into from
Jun 20, 2024
Commits
a66007d
Future comment for myself
DiegoPino May 13, 2024
cde946d
Don't pre-fetch the file if input_property != 'filepath'
DiegoPino May 13, 2024
746210d
Cleanup on OCR. Mostly comments/unreachable code
DiegoPino May 13, 2024
88585a5
Allow Metadata only processing. Needs more code on the Queue itself
DiegoPino May 13, 2024
6a3df50
Focus diego!
DiegoPino May 14, 2024
4d6baea
First pass on an abstract ML processor.
DiegoPino May 14, 2024
f81c5dd
ML YOLO. Missing some $output keys and validations still
DiegoPino May 14, 2024
faf595c
remove unused method
DiegoPino May 14, 2024
76f40a6
process Object boxes and names as OCR with name as text + certainty
DiegoPino May 14, 2024
01e2fe6
Confidence goes with label detected
DiegoPino May 14, 2024
73b9790
Adds an ML Filter. Still Working on it. Does nothing yet
DiegoPino May 16, 2024
f478f27
Clean up Abstract ML processor and add public methods/constants
DiegoPino May 17, 2024
3645846
Clean up YOLO processor (more to come)
DiegoPino May 17, 2024
23da661
Not close to ready, but better commit now than be sorry later
DiegoPino May 17, 2024
d1351f9
Allow the runners service to provide (not default) also a list of plu…
DiegoPino May 17, 2024
21a1b51
Overly complex validation of configured field v/s Processor
DiegoPino May 17, 2024
73e0929
Make documenting easier by giving errors that explain why
DiegoPino May 17, 2024
d4f7ac2
adds topk (topK solr) as an argument
DiegoPino May 17, 2024
fe578cd
Little bit of JSON schema magic for ML filter sanitizing
DiegoPino May 18, 2024
f87f549
The live backend ML vectorizing on query time is working... time to !KNN
DiegoPino May 18, 2024
f6d4070
Yep. This is working now. We need to alter based on the backend option
DiegoPino May 18, 2024
fe68ce3
ha! it works... not perfect and needs more validation/cleaning
DiegoPino May 18, 2024
24d2344
Oh gosh Drupal. Argument queries (even if basically a filter) are 0.0…
DiegoPino May 20, 2024
7722afb
Clean up the demo filter ...
DiegoPino May 20, 2024
8151899
Remove non sense here. Still need to rewrite how this one reads data
DiegoPino May 21, 2024
2c20fed
More supporting ML code commit (tested)
DiegoPino May 23, 2024
2e7f883
Why am i so distracted? wrong accessor
DiegoPino May 23, 2024
aba8b76
You made typos 3 years ago? "consume this ouput"
DiegoPino May 24, 2024
0b9e503
Put the right time this was created. Not 2022
DiegoPino May 24, 2024
2201ac3
Small updates. Insightface is working now
DiegoPino May 25, 2024
257aa9e
Adds KNN Text exposed filter
DiegoPino May 26, 2024
f0ccaee
Sbert Filter and Tiny Fix on ML Image Filter (the one for debugging)
DiegoPino May 26, 2024
a43c352
Make sure Processors that are Indexing also pass the whole ->searchapi
DiegoPino May 26, 2024
7e7f25d
Cleans ML processors
DiegoPino May 26, 2024
5ae461c
Simplify a bit the AbstractML Processor all the other ones inherit
DiegoPino May 26, 2024
5bbf75a
Adds the Text filter as a Filter/Plugin for Search API
DiegoPino May 26, 2024
4eed5d2
If no face, then maybe just don't generate search API at all?
DiegoPino May 29, 2024
c3eba87
mark pre query as disabled/future features.
DiegoPino Jun 20, 2024
51a6c3d
Limit this. ADO processing does not yet exist. But will.
DiegoPino Jun 20, 2024
c491d64
Mark facet pre queries as future feature
DiegoPino Jun 20, 2024
a387f0e
mark Test/Internal Filter for ML pre queries as future features
DiegoPino Jun 20, 2024
820f3eb
Adds Permissions and note about Anonymous users
DiegoPino Jun 20, 2024
adc699b
Apply permissions on Exposed form and Query itself. Check for Anonymous
DiegoPino Jun 20, 2024
9e26bda
Apply permissions on Exposed Image Argument/Views for ML
DiegoPino Jun 20, 2024
2 changes: 1 addition & 1 deletion src/Annotation/StrawberryRunnersPostProcessor.php
@@ -63,7 +63,7 @@ class StrawberryRunnersPostProcessor extends Plugin {
public $input_argument;

/**
- * Processing stage: can be Entity PreSave or PostSave
+ * Processing stage: can be Entity PreSave or PostSave. Pre save is good for ADO/Metadata. Implementation to follow.
*
* @var string $when;
*
169 changes: 119 additions & 50 deletions src/Plugin/QueueWorker/AbstractPostProcessorQueueWorker.php

Large diffs are not rendered by default.

@@ -122,7 +122,7 @@ public function settingsForm(array $parents, FormStateInterface $form_state) {
'searchapi' => 'In a Search API Document using the Strawberryfield Flavor Data Source (e.g used for HOCR highlight)',
],
'#default_value' => (!empty($this->getConfiguration()['output_destination']) && is_array($this->getConfiguration()['output_destination'])) ? $this->getConfiguration()['output_destination'] : [],
- '#description' => t('As Input for another processor Plugin will only have an effect if another Processor is setup to consume this ouput.'),
+ '#description' => t('As Input for another processor Plugin will only have an effect if another Processor is setup to consume this output.'),
'#required' => TRUE,
];

@@ -182,7 +182,7 @@ public function run(\stdClass $io, $context = StrawberryRunnersPostProcessorPlug
// We use the actual file UUID as part of the ID
// e.g default_solr_index-strawberryfield_flavor_datasource/5801:1:en:1e9f687c-e29e-4c23-91ba-655d9c5cdfe6:ocr
// For the general ID we will use this number when there are multiple siblings
- // or 1 if the File is a single ouput
+ // or 1 if the File is a single output
$sequence_number[] = $io->input->metadata['sequence'];
}
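
The ID shape quoted in the comment above can be sketched as a small standalone function. This is only an illustration, not code from strawberry_runners: the function name `sketch_flavor_item_id` and all of its parameter names are hypothetical, and the meaning given to each colon-separated part (entity ID, sequence number, language code, file UUID, processor ID) is an assumption read off the single example in that comment.

```php
<?php
/**
 * Hypothetical helper (not part of strawberry_runners).
 *
 * Assembles an ID shaped like the example in the comment above:
 *   {prefix}/{entity id}:{sequence}:{language}:{file uuid}:{processor id}
 * The role assigned to each segment is an assumption based on that example.
 */
function sketch_flavor_item_id(
  string $prefix,
  int $entity_id,
  int $sequence,
  string $langcode,
  string $file_uuid,
  string $processor_id
): string {
  // Sequence is the sibling number, or 1 when the file is a single output.
  $segments = [$entity_id, $sequence, $langcode, $file_uuid, $processor_id];
  return $prefix . '/' . implode(':', $segments);
}
```

Called with the values from the comment (prefix `default_solr_index-strawberryfield_flavor_datasource`, `5801`, `1`, `en`, the file UUID and `ocr`), it returns the same string the comment shows.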

@@ -0,0 +1,217 @@
<?php
/**
* Created by PhpStorm.
* User: dpino
* Date: 05/22/24
* Time: 8:07AM
*/

namespace Drupal\strawberry_runners\Plugin\StrawberryRunnersPostProcessor;

use Drupal\Core\Cache\CacheBackendInterface;
use Drupal\Core\Form\FormStateInterface;
use Drupal\Core\StreamWrapper\StreamWrapperManager;
use Drupal\strawberry_runners\Annotation\StrawberryRunnersPostProcessor;
use Drupal\strawberry_runners\Plugin\StrawberryRunnersPostProcessorPluginInterface;
use Drupal\strawberry_runners\VTTLine;
use Drupal\strawberry_runners\VTTProcessor;
use Drupal\strawberryfield\Plugin\search_api\datasource\StrawberryfieldFlavorDatasource;
use Drupal\strawberry_runners\Web64\Nlp\NlpClient;
use Laracasts\Transcriptions\Transcription;

/**
*
* ML InsightFace
*
* @StrawberryRunnersPostProcessor(
* id = "ml_insightface",
* label = @Translation("Post processor that generates Face detection(s) and Vector Embedding(s) using InsightFace/ArcFace"),
* input_type = "entity:file",
* input_property = "filepath",
* input_argument = "sequence_number"
* )
*/
class MLInsightfacePostProcessor extends abstractMLPostProcessor {

public $pluginDefinition;

/**
* {@inheritdoc}
*/
public function defaultConfiguration() {
return [
'source_type' => 'asstructure',
'mime_type' => ['image/jpeg'],
'output_type' => 'json',
'output_destination' => 'searchapi',
'processor_queue_type' => 'background',
'language_key' => 'language_iso639_3',
'language_default' => 'eng',
'timeout' => 300,
'nlp_url' => 'http://esmero-nlp:6400',
'ml_method' => '/image/insightface',
] + parent::defaultConfiguration();
}

public function settingsForm(array $parents, FormStateInterface $form_state) {
$element = parent::settingsForm($parents, $form_state);
$element['source_type'] = [
'#type' => 'select',
'#title' => $this->t('The type of source data this processor works on'),
'#options' => [
'asstructure' => 'File entities referenced in the as:filetype JSON structure',
],
'#default_value' => $this->getConfiguration()['source_type'],
'#description' => $this->t('Select from where the source data this processor needs is fetched'),
'#required' => TRUE,
];
$element['ml_method'] = [
'#type' => 'radios',
'#title' => $this->t('ML endpoint to use (fixed)'),
'#options' => [
'/image/insightface' => 'InsightFace (Detections as MiniOCR Annotations and one embedding as a Unit Length Vector)',
],
'#default_value' => $this->getConfiguration()['ml_method'],
'#description' => $this->t('The ML endpoint/Model. This is fixed for this processor.'),
'#required' => TRUE,
];
// Only Images for now.
$element['jsonkey']['#options'] = [ 'as:image' => 'as:image'];
return $element;
}

protected function runTextMLfromJSON($io, NlpClient $nlpClient): \stdClass {
$output = new \stdClass();
return $output;
}

protected function runImageMLfromIIIF($io, NlpClient $nlpClient): \stdClass {
$output = new \stdClass();
$config = $this->getConfiguration();
$input_argument = $this->pluginDefinition['input_argument'];
$file_languages = isset($io->input->lang) ? (array) $io->input->lang : [$config['language_default'] ? trim($config['language_default'] ?? '') : 'eng'];
$sequence_number = isset($io->input->{$input_argument}) ? (int) $io->input->{$input_argument} : 1;
setlocale(LC_CTYPE, 'en_US.UTF-8');
$width = $io->input->metadata['flv:identify'][$sequence_number]['width'] ?? NULL;
$height = $io->input->metadata['flv:identify'][$sequence_number]['height'] ?? NULL;
if (!($width && $height)) {
$width = $io->input->metadata['flv:exif']['ImageWidth'] ?? NULL;
$height = $io->input->metadata['flv:exif']['ImageHeight'] ?? NULL;
}
$iiifidentifier = isset($io->input->metadata['url'])
? urlencode((string) StreamWrapperManager::getTarget($io->input->metadata['url']))
: '';

if (empty($iiifidentifier)) {
return $output;
}
// Mobilenet does its own (via mediapipe) image scaling, so we can pass a smaller image if needed. Internally
// it uses 480 x 480, but passing a square is not good because it makes % bbox calculation harder.
// But requires us to call info.json and pre-process the sizes.
$iiif_image_url = $config['iiif_server']."/{$iiifidentifier}/full/full/0/default.jpg";
//@TODO we are not filtering here by label yet. Next release.
$labels = [];
$page_text = NULL;
$output->plugin = NULL;
$labels = [];
$ML = $this->callImageML($iiif_image_url,$labels);
$output->searchapi['vector_512'] = isset($ML['insightface']['vector']) && is_array($ML['insightface']['vector']) && count($ML['insightface']['vector'])== 512 ? $ML['insightface']['vector'] : NULL;
if (isset($ML['insightface']['objects']) && is_array($ML['insightface']['objects']) && count($ML['insightface']['objects']) > 0 ) {
// Don't do anything if no detection.
$miniocr = $this->insightfacenetToMiniOCR($ML['insightface']['objects'], $width, $height, $sequence_number);
$output->searchapi['fulltext'] = $miniocr;
$page_text = isset($output->searchapi['fulltext']) ? strip_tags(str_replace("<l>",
PHP_EOL . "<l> ", $output->searchapi['fulltext'])) : '';
// What is a good confidence ratio here?
// based on the % of the bounding box?
// Just the value?
$labels['Face'] = 'Face';
$output->searchapi['metadata'] = $labels;
$output->searchapi['service_md5'] = isset($ML['insightface']['modelinfo']) ? md5(json_encode($ML['insightface']['modelinfo'])) : NULL;
$output->searchapi['plaintext'] = $page_text ?? '';
$output->searchapi['processlang'] = $file_languages;
$output->searchapi['ts'] = date("c");
$output->searchapi['label'] = $this->t("Insightface ML Image Embeddings & Vectors") . ' ' . $sequence_number;
$output->plugin['searchapi'] = $output->searchapi;
}
return $output;
}


protected function insightfacenetToMiniOCR(array $objects, $width, $height, $pageid) {
$miniocr = new \XMLWriter();
$miniocr->openMemory();
$miniocr->startDocument('1.0', 'UTF-8');
$miniocr->startElement("ocr");
$atleastone_word = FALSE;
// To avoid divisions by 0
$pwidth = (float) $width;
$pheight = (float) $height;
// Format here is again different. Instead of normalizing on Python we do here?
// @TODO make all methods in python act the same
// :[{"bbox":[x1,y1,x2,y2],"score":0.8881509304046631}]
// We are not using labels here. We have age, gender. Discriminatory!
// NOTE: MiniOCR floats are written in the form .1, so we strip the leading 0.
$miniocr->startElement("p");
$miniocr->writeAttribute("xml:id", 'ml_insightface_' . $pageid);
$miniocr->writeAttribute("wh",
ltrim((string) $pwidth, '0') . " " . ltrim((string) $pheight, '0'));
$miniocr->startElement("b");
foreach ($objects as $object) {
$notFirstWord = FALSE;
if ($object['bbox'] ?? FALSE) {
$miniocr->startElement("l");
$x0 = (float)$object['bbox'][0];
$y0 = (float)$object['bbox'][1];
$w = (float)$object['bbox'][2] - $x0;
$h = (float)$object['bbox'][3] - $y0;
$l = ltrim(sprintf('%.3f', $x0), '0');
$t = ltrim(sprintf('%.3f', $y0), '0');
$w = ltrim(sprintf('%.3f', $w), '0');
$h = ltrim(sprintf('%.3f', $h), '0');
$text = 'Face ~ ' . sprintf('%.3f', $object['score'] ?? 0);

if ($notFirstWord) {
$miniocr->text(' ');
}
$notFirstWord = TRUE;
// New OCR Highlight does not like empty <w> tags at all
if (strlen(trim($text ?? '')) > 0) {
$miniocr->startElement("w");
$miniocr->writeAttribute("x",
$l . ' ' . $t . ' ' . $w . ' ' . $h);
$miniocr->text($text);
// Only assume we have at least one word for <w> tags
// Since lines? could end empty?
$atleastone_word = TRUE;
$miniocr->endElement();
}
$miniocr->endElement();
}
}
$miniocr->endElement();
$miniocr->endElement();
$miniocr->endElement();
$miniocr->endDocument();
if ($atleastone_word) {
return $miniocr->outputMemory(TRUE);
}
else {
return StrawberryfieldFlavorDatasource::EMPTY_MINIOCR_XML;
}
}

public function callImageML($image_url, $labels):mixed {
$nlpClient = $this->getNLPClient();
$config = $this->getConfiguration();
$arguments['iiif_image_url'] = $image_url;
//@TODO we are not filtering here by label yet. Next release.
$arguments['labels'] = $labels;
$ML = $nlpClient->get_call($config['ml_method'], $arguments, 1);
return $ML;
}

public function callTextML($text, $query):mixed {
return FALSE;
}
}
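
The per-detection coordinate handling in `insightfacenetToMiniOCR()` can be tried in isolation with a sketch like the one below. It is not the processor's code: `sketch_bbox_to_miniocr` is a hypothetical name, and it assumes the ML endpoint returns `bbox` as `[x1, y1, x2, y2]` with values already relative to the image size (0 to 1), which the diff's comments suggest but do not confirm.

```php
<?php
/**
 * Hypothetical standalone helper (not part of strawberry_runners).
 *
 * Converts a [x1, y1, x2, y2] box into the "left top width height" string
 * used for the MiniOCR <w x="..."> attribute.
 * Assumption: coordinates are relative (0..1) to the image dimensions.
 */
function sketch_bbox_to_miniocr(array $bbox): string {
  [$x1, $y1, $x2, $y2] = array_map('floatval', $bbox);
  // Width and height are derived from the two corners.
  $parts = [$x1, $y1, $x2 - $x1, $y2 - $y1];
  $formatted = array_map(
    // Three decimals, then strip the leading zero: 0.250 becomes .250.
    static fn(float $value): string => ltrim(sprintf('%.3f', $value), '0'),
    $parts
  );
  return implode(' ', $formatted);
}
```

For example, `sketch_bbox_to_miniocr([0.25, 0.5, 0.75, 0.875])` gives `.250 .500 .500 .375`. Passing the string `'0'` to `ltrim()` rather than the integer `0` keeps the call valid if `strict_types` is ever enabled.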