Feat: Make detection training and inference Multiclass #1097
Conversation
Test in CI
After release? 😅
@felixdittrich92 Yes, it seems weird written like that haha.
@aminemindee following @felixdittrich92's remark, I just saw that we need to enforce docTR to be the latest version in the
@odulcy-mindee @aminemindee you also need to increase the major version specifier in the root pyproject.toml (breaking change) and sort the milestones/open tickets after the merge/release :)
Hi @aminemindee @odulcy-mindee (it seems you have both worked on this 😅), thanks for the PR! I added a first general review without going too deep and left some questions and remarks :)
Due to the breaking change and the amount of changes, someone should have a second look at it. @frgfm?
@@ -25,6 +25,11 @@ def _read_sample(self, index: int) -> Tuple[torch.Tensor, Any]:
        if isinstance(target, dict):
            assert "boxes" in target, "Target should contain 'boxes' key"
            assert "labels" in target, "Target should contain 'labels' key"
        elif isinstance(target, tuple):
We have to be careful with this; currently we have two cases:
- more than one annotation: dict (OCR dataset with labels and boxes, or object detection dataset)
- otherwise: a label (recognition) or boxes (text detection)
We need to be as clear as possible here so we don't run into problems with the transformations.
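For illustration, a minimal sketch of the two target shapes being discussed (the classification helper and its return strings are hypothetical; the real `_read_sample` logic lives in the dataset classes):

```python
import numpy as np

def describe_target(target):
    """Classify a sample target into the two cases above (illustrative only)."""
    if isinstance(target, dict):
        # more than one annotation: OCR dataset (labels + boxes) or object detection
        assert "boxes" in target, "Target should contain 'boxes' key"
        assert "labels" in target, "Target should contain 'labels' key"
        return "multi-annotation dict"
    if isinstance(target, str):
        return "recognition label"
    if isinstance(target, np.ndarray):
        return "detection boxes"
    raise TypeError(f"unexpected target type: {type(target)}")
```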
For now, the only case where the target is a tuple comes from the detection dataset when there are multiple classes, to keep that information and include it in the target in _pre_transform.
Yes, we should be careful, but for now this is the only case where it will be used.
@@ -25,6 +25,11 @@ def _read_sample(self, index: int) -> Tuple[tf.Tensor, Any]:
        if isinstance(target, dict):
            assert "boxes" in target, "Target should contain 'boxes' key"
            assert "labels" in target, "Target should contain 'labels' key"
        elif isinstance(target, tuple):
same
I still see this as troublesome (transformations); it would be better to keep a dict for everything that has multiple annotations 🤔
@@ -10,13 +10,16 @@
import os
import sys


CLASS_NAME: str = "words"
This is a text-detection-specific attribute.
We have to add it to the cfg of each text detection model, as is already done for object detection, and then work with model.cfg['class_names'] afterwards:
"classes": ["background", "qr_code", "bar_code", "logo", "photo"],
This also makes the names flexible for training.
NOTE: this change will also need an update of the HF API (factory) and of the config.json for uploaded text detection models.
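A minimal sketch of that lookup, assuming a plain cfg dict with a "class_names" key (the key name and the helper are assumptions, not doctr's current API):

```python
# Hypothetical default, mirroring the hardcoded CLASS_NAME discussed above.
DEFAULT_CLASS_NAMES = ["words"]

def get_class_names(cfg):
    """Return the detection class names from a model cfg, falling back to the
    single default class when the model was not trained with explicit classes."""
    return list(cfg.get("class_names", DEFAULT_CLASS_NAMES))
```

A multiclass checkpoint could then simply ship its own "class_names" list in its config.json.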
I think it is OK to have this default single class and otherwise use the classes of the dataset to build the model, but the user should be able to set this CLASS_NAME to a specific value like "words", "numbers", "default_class", "handwritten"... So it shouldn't be hardcoded here; I rather see it as a parameter given by the user, either in the config or when building the model. WDYT?
Hi @charlesmindee, yep, sounds good to me 👍
I see it as much more problematic to modify the blocks output structure. The results should be added to each predicted word, like:
'house', conf: '0.99', class: 'word'
Instead of:
Blocks:
'word': { Line .. Words .. },
'salary': { Line .. Words .. }
WDYT?
(Written from my mobile phone, sorry 😅)
doctr/io/elements.py
Outdated
@@ -311,65 +316,66 @@ def export_as_xml(self, file_title: str = "docTR - XML export (hOCR)") -> Tuple[
            },
        )
        # iterate over the blocks / lines / words and create the XML elements in body line by line with the attributes
        for block in self.blocks:
It has been a while since I wrote this export-as-XML functionality 😅, but we should test that it still works as expected with the PDF/A notebook.
I looked at it a little bit. In Google Colab, the installation using git doesn't seem to work for me. However, the code should work; the output will just be slightly different, I think.
val_metric.update(gts=boxes_gt, preds=boxes_pred[:, :4])
for target, loc_pred in zip(targets, loc_preds):
    if isinstance(target, np.ndarray):
        target = {CLASS_NAME: target}
Shouldn't this name come from the names specified by the dataset? 🤔
For the training part, you can still use a data format where you don't give any class name; in that case, targets are not a dict, so I change them here. But now that I'm looking at it, I think it's better to just add the CLASS_NAME in the dataset and not handle it outside. Right?
I think we need to avoid this CLASS_NAME completely. Class names should come from the dataset and update the default class_names in model.cfg. To explain: a box can be annotated with different labels, like box 1: name, box 2: salary. I do not fully understand why we need to modify the blocks output structure in the end.
I think what we want is this:
(words): [
    Word(value='ABC Gmbh.', confidence=0.91, class_name='name'),
    Word(value='1200', confidence=0.99, class_name='salary'),
]
And not something like:
(blocks): {'salary':
    (lines): [Line(
        (words): [
            Word(value='1200.', confidence=0.91),
        ]
    )}, {'name': ....
]
Right?
The multiclass detection model predicts bounding boxes that represent the blocks, so in theory the whole block is of that class, and so are the words inside it. A user will want to get all the words of the same block of a given class, and they can get that directly with the blocks-as-dict approach. Different words in the same block will always have the same class, too.
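The two views are not mutually exclusive: if each word carried a hypothetical class_name field, the per-class dict view could be derived by grouping. A sketch with plain dicts standing in for Word objects (the field names are assumptions):

```python
from collections import defaultdict

def group_words_by_class(words):
    """Build the per-class dict view from a flat word list (illustrative only)."""
    grouped = defaultdict(list)
    for word in words:
        grouped[word["class_name"]].append(word)
    return dict(grouped)
```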
val_metric.update(gts=boxes_gt, preds=boxes_pred[:, :4])
for target, loc_pred in zip(targets, loc_preds):
    if isinstance(target, np.ndarray):
        target = {CLASS_NAME: target}
same ?
doctr/io/elements.py
Outdated
@@ -215,7 +220,7 @@ class Page(Element):
    """Implements a page element as a collection of blocks

    Args:
        blocks: list of block elements
        blocks: Dictionary with list of block elements for each detection class
I am still trying to find a solution to avoid this breaking change, but a short explanation would help me understand the use case for multiclass text detection, or rather what you want to achieve with it 😅
Is it not possible to add another key/value pair like 'class_names': 'words' instead of changing the output structure?
Hmm, something like a class_names list that has the same length as blocks and gives the class name of each block?
Mh, are you sure that block is the correct level? If I understood it correctly, each box can have a different label, so it is the words part and not the block (blocks can contain multiple lines, which can contain multiple words):
Document(
(pages): [Page(
dimensions=(340, 600)
(blocks): [Block(
(lines): [Line(
(words): [
Word(value='No.', confidence=0.91),
Word(value='RECEIPT', confidence=0.99),
Word(value='DATE', confidence=0.96),
]
)]
(artefacts): []
)]
)]
)
Actually blocks are not resolved, so this sub-object in the Page might be obsolete. We could either rename it to Layer to have a more meaningful page structure and keep it like that, or add a Layer object between Page and Block in the hierarchy, so that a class would be on a layer and inside each layer we can resolve blocks and lines.
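The Layer proposal could look roughly like this (hypothetical dataclasses sketching the hierarchy, not doctr's actual Element classes):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Word:
    value: str
    confidence: float

@dataclass
class Line:
    words: List[Word] = field(default_factory=list)

@dataclass
class Block:
    lines: List[Line] = field(default_factory=list)

@dataclass
class Layer:
    # one layer per detection class; blocks and lines are resolved inside each layer
    class_name: str
    blocks: List[Block] = field(default_factory=list)

@dataclass
class Page:
    layers: List[Layer] = field(default_factory=list)
```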
img, target = self.sample_transforms(img, target)
if isinstance(target, dict):
    img_transformed = copy_tensor(img)
    for class_name, bboxes in target.items():
An OCRDataset also comes with a dict as target and has no class names (string labels and bbox annotations), so if we transform samples in that case, this should normally fail.
Yes, that's true 😨 I'm trying to think how to handle it. Open to any suggestions x)
As described in the other comments, if we handle it like object detection (classes come from model.cfg, which is flexible and can be updated with classes from the dataset), we could remove this.
Maybe thinking again about this dataset format:
polygons: [....]
classes: ['printed', 'handwritten', 'printed', 'printed', ...] ? @frgfm @odulcy-mindee @charlesmindee wdyt ?
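Under that format, a dataset could regroup the parallel lists into one polygon list per class before building the detection target. A sketch under that assumption (the target layout is not doctr's current one):

```python
from collections import defaultdict

def to_per_class_target(polygons, classes):
    """Group parallel polygons/classes lists into {class_name: [polygons]}."""
    assert len(polygons) == len(classes), "expected one class label per polygon"
    per_class = defaultdict(list)
    for polygon, class_name in zip(polygons, classes):
        per_class[class_name].append(polygon)
    return dict(per_class)
```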
I think we should split the topics of this PR and discuss them in several issues/discussions to find a proper implementation. There are several breaking changes in this PR over quite a lot of files; it's quite hard to ensure the review quality will be both high and exhaustive 😅
I added another check, in addition to the dict check, to ensure the target is the one coming from detection.
references/detection/README.md
Outdated
@@ -57,7 +57,30 @@ labels.json
    ...
}
```
If you want to train a model with multiple classes, you can use the following format
This should be a bit clearer: does multiclass mean box a ['printed', 'handwritten'], or box a ['printed'] and box b ['handwritten']?
OK, I think I got it now 😅
For example, class_name_1 == 'printed' contains all boxes for printed, and class_name_2 == 'handwritten' contains all boxes for handwritten.
Maybe a better format:
polygons: [....]
classes: ['printed', 'handwritten', 'printed', 'printed', ...]
So we can avoid introducing a new dataset format; then it should also be possible to avoid the breaking change and only add a new key/value pair like 'class': 'printed'. @aminemindee @odulcy-mindee wdyt ?
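For comparison, the two candidate per-image annotation layouts could look like this (both hypothetical; neither is the current labels.json schema):

```python
# (a) one polygon list per class name
per_class_entry = {
    "printed": [[[0.1, 0.1], [0.4, 0.1], [0.4, 0.2], [0.1, 0.2]]],
    "handwritten": [[[0.5, 0.5], [0.8, 0.5], [0.8, 0.6], [0.5, 0.6]]],
}

# (b) parallel lists: one class label per polygon, keeping the existing
# single polygon list untouched
parallel_entry = {
    "polygons": [
        [[0.1, 0.1], [0.4, 0.1], [0.4, 0.2], [0.1, 0.2]],
        [[0.5, 0.5], [0.8, 0.5], [0.8, 0.6], [0.5, 0.6]],
    ],
    "classes": ["printed", "handwritten"],
}
```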
Hello @felixdittrich92, the general use case of multiclass text detection is when you want to find not just the words in a document but some specific information. For example, in a receipt from a restaurant, you may want to find the name of the restaurant and the total you paid. Then you can have two classes and predict different bounding boxes for each class.
OK, got it. So in the end it is like the LayoutLM model (without the text input).
Codecov Report
@@ Coverage Diff @@
## main #1097 +/- ##
==========================================
- Coverage 95.23% 94.97% -0.26%
==========================================
Files 142 146 +4
Lines 6017 6389 +372
==========================================
+ Hits 5730 6068 +338
- Misses 287 321 +34
Flags with carried forward coverage won't be shown. Click here to find out more.
Hello everyone 👋
Thanks a lot for the PR! It's only my opinion, but I think we have too many things at stake in this PR to be able to say whether it's OK to merge or not. I would suggest pausing this one shortly and opening GH issues for the following topics:
- dataset target format modification for multiple class of text/objects
- versioning of docTR (when is 1.0.0 within reach, etc.)
- minimal API modifications
- export format modifications
- detection model modifications
- training/evaluation
I might have forgotten a few
Once a topic has reached a good implementation idea / way to handle it, we can cherry pick from this PR or reimplement it :)
I already added comments for the section I checked, but then thought there is too much to handle in a single PR 😅 If you prefer to go with that one in a single PR, no worries!
Cheers ✌️
api/pyproject.toml
Outdated
version = "0.5.2a0"
version = "1.0.1a0"
I'd suggest reverting back
Actually, I realize you want to bump docTR to 1.0.0. Perhaps it might be easier to discuss this in an issue and lower the amount of modifications in this specific PR?
api/pyproject.toml
Outdated
python = ">=3.8,<3.11"
python = ">=3.8.2,<3.11"
Could you add a comment to specify the reason for the minor version bump? (It is much easier to reassess later when we have a comment saying why.)
Yes, good idea! Added it.
api/pyproject.toml
Outdated
python-doctr = ">=0.2.0"
python-doctr = { version = ">=1.0.0", extras = ['tf'] }
Correct me if I'm wrong, but I think this will result in double version specifiers. If we go with extras, we need to remove the "tensorflow" & co. specifiers.
I don't think I understand your comment; can you be more specific?
We added the extras because now, if you don't install the "tf" extra of doctr, you don't have the tf2onnx library and the like, which are needed to run the code.
Yes, sure, I wasn't referring to the extra but to the version specifier: you switched it from >=0.2.0 to >=1.0.0, which is a big deal 😅
Regarding the extra, I'd argue it's up for discussion, but the API doesn't need to convert models to ONNX.
Ideally, we'd like it to run on ONNX, but some postprocessing ops are not compatible yet. So the alternative is to use doctr, and then carefully add as few dependencies as possible to run it.
I agree, but for now, the way doctr is set up, tf2onnx is needed to run the code in the API.
doctr/datasets/datasets/pytorch.py
Outdated
elif isinstance(target, tuple):
    assert isinstance(target[0], str) or isinstance(
        target[0], np.ndarray
    ), "first element of the tuple should be a string or a numpy array"
    assert isinstance(target[1], list), "second element of the tuple should be a list"
The problem with this type of condition is that if the target is a tuple with a single element, the code will crash. Either we have to add assert len(target) == 2, or change that a bit.
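A hardened version of the check, with the length assertion first so a one-element tuple fails with a clear message instead of an IndexError (a sketch; the helper name and message strings are illustrative):

```python
import numpy as np

def check_tuple_target(target):
    """Validate a tuple target before unpacking its two elements."""
    assert isinstance(target, tuple), "target should be a tuple"
    assert len(target) == 2, "tuple target should have exactly two elements"
    assert isinstance(
        target[0], (str, np.ndarray)
    ), "first element of the tuple should be a string or a numpy array"
    assert isinstance(target[1], list), "second element of the tuple should be a list"
```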
Added it.
A few remarks, but minor
doctr/models/_utils.py
Outdated
@@ -161,3 +161,18 @@ def get_language(text: str) -> Tuple[str, float]:
    if len(text) <= 1 or (len(text) <= 5 and lang.prob <= 0.2):
        return "unknown", 0.0
    return lang.lang, lang.prob


def invert_between_dict_of_lists_and_list_of_dicts(
Maybe invert_data_structure, with a clear docstring that explains the typing? So that we keep a short and clear name?
Sounds good :)
Done
Hello @frgfm, I understand that this PR is kind of big, with a lot of changes. The goal is to have it available as soon as possible. I know that the review won't be exhaustive, but we can see this 1.0.0 release as experimental, and once it's there we can improve things. After this PR, I will work on improving the training part (changing losses maybe, and also adding some augmentations and things of the sort).
Hi @aminemindee :)
I could agree with your suggestion if it were 0.7.0, but 1.0.0 in software dev, at least to the best of my experience and knowledge, is a mark of stability of the API (in the broad sense) and a performance milestone. So jumping from 0.6.0 to 1.0.0 looks a bit hasty to me (even the 0.7.0 release tracker is full of things that would be worth considering before releasing a 1.0.0). To put this into context, the day the community saw PyTorch jump from 0.4.0 to 1.0.0, it was because they added the C++ API and the JIT compiler (which were their targets for the first 1.0.0 release); we knew that it could now be considered for running in production. Here, at least, I think the release roadmap should follow a structure readable by the community (cf. release trackers) 😅
Sounds good 👍 |
About the versioning, I agree with @frgfm: 1.0.0 should be the first stable (production-ready) version specifier. (See line 57 in acb9f64.)
I don't think adding an attribute [...]. I think the main problem here is that, once the detection model is not trying to detect all the words in a page, the fact that [...]. So I agree with @charlesmindee, we can add a new object [...].
Mh .. we should keep in mind that this is a feature task ... the core is still OCR (where a user wants to get all the extracted text at once), so I think it would be totally OK if the user needs to iterate through the results ... but yeah, that's my opinion 😅 let's wait on @frgfm @charlesmindee 👍
Hello again! After discussing it internally, I have a new suggestion to add multiclass detection without a big breaking change: I added a new kie_predictor (like ocr_predictor) that handles the different classes and has a dict-like output, while keeping ocr_predictor as it was. So, a summary of the changes:
WDYT @felixdittrich92 @odulcy-mindee @frgfm @charlesmindee ?
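To make the contrast concrete, the two output shapes could be sketched like this with plain dicts (hypothetical shapes only; the real predictors return Document objects, and the kie_predictor key names are assumptions):

```python
# ocr_predictor keeps its flat page structure (unchanged)
ocr_page = {
    "blocks": [
        {"lines": [{"words": [{"value": "1200", "confidence": 0.99}]}]},
    ],
}

# kie_predictor returns one word list per detection class
kie_page = {
    "predictions": {
        "name": [{"value": "ABC Gmbh.", "confidence": 0.91}],
        "salary": [{"value": "1200", "confidence": 0.99}],
    },
}
```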
Hi @aminemindee, thanks for the changes. I added a few comments, but in general there are some questions/remarks, and the PR is still much too big to review it carefully 😅
…lements for kie predictor (#6)
* feat: ✨ add load backbone
* feat: change kie predictor out
* fix new elements for kie, dataset when class is empty and fix and add tests
* fix api kie route
* fix evaluate kie script
* fix black
* remove commented code
* update README
Hello, this PR is to make the detection part of docTR multiclass. blocks of class Page is no longer a list of blocks but a dictionary with keys being the class names. A first review of the code was done on my fork; you can find it here: aminemindee#4
Any feedback is welcome!