Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SynthText dataset integration #624

Merged
merged 36 commits into from
Nov 24, 2021
Merged

Conversation

felixdittrich92
Copy link
Contributor

@felixdittrich92 felixdittrich92 commented Nov 15, 2021

This PR adds the Synthtext dataset

cf. #587

Problems:

  • the unpack part needs a bit longer (>800k data) so i think we have to implement a check that unpacking is finished before continue
  • same for the hash check this will end in a MemoryError

wdyt @fg-mindee how to handle? :)
suggestions are welcome 🤗

  • tests
  • bypass the hash check

@fg-mindee fg-mindee self-requested a review November 15, 2021 11:13
@fg-mindee fg-mindee self-assigned this Nov 15, 2021
@fg-mindee fg-mindee added topic: documentation Improvements or additions to documentation module: datasets Related to doctr.datasets ext: tests Related to tests folder labels Nov 15, 2021
Copy link
Contributor

@fg-mindee fg-mindee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I added some comments

doctr/datasets/synthtext.py Show resolved Hide resolved
doctr/datasets/synthtext.py Show resolved Hide resolved
doctr/datasets/synthtext.py Outdated Show resolved Hide resolved
doctr/datasets/synthtext.py Outdated Show resolved Hide resolved
doctr/datasets/synthtext.py Outdated Show resolved Hide resolved
doctr/datasets/synthtext.py Outdated Show resolved Hide resolved
doctr/datasets/synthtext.py Outdated Show resolved Hide resolved
doctr/utils/data.py Outdated Show resolved Hide resolved
@felixdittrich92
Copy link
Contributor Author

@fg-mindee
Dataset size: ~41 GB

Copy link
Contributor

@fg-mindee fg-mindee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another round of comments 😅

doctr/datasets/synthtext.py Show resolved Hide resolved
doctr/datasets/synthtext.py Show resolved Hide resolved
@felixdittrich92
Copy link
Contributor Author

@fg-mindee
I added to bypass the hash check and removed the tests for this dataset ftm also the train/val split now before iterating.
Already any decision about the dataset test ? :)

@felixdittrich92
Copy link
Contributor Author

@fg-mindee
after attaching the example can you trigger integration again pls ? 😃

Copy link
Contributor

@fg-mindee fg-mindee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look at this Felix!
I added some modification suggestions 👌

doctr/datasets/synthtext.py Outdated Show resolved Hide resolved
doctr/datasets/synthtext.py Outdated Show resolved Hide resolved
doctr/datasets/synthtext.py Outdated Show resolved Hide resolved
tests/pytorch/test_datasets_pt.py Outdated Show resolved Hide resolved
tests/pytorch/test_datasets_pt.py Outdated Show resolved Hide resolved
tests/pytorch/test_datasets_pt.py Outdated Show resolved Hide resolved
tests/tensorflow/test_datasets_tf.py Outdated Show resolved Hide resolved
tests/tensorflow/test_datasets_tf.py Outdated Show resolved Hide resolved
tests/tensorflow/test_datasets_tf.py Outdated Show resolved Hide resolved
tests/pytorch/test_datasets_pt.py Outdated Show resolved Hide resolved
Copy link
Contributor

@fg-mindee fg-mindee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I added a few comments!

doctr/datasets/synthtext.py Outdated Show resolved Hide resolved
doctr/datasets/synthtext.py Outdated Show resolved Hide resolved
doctr/datasets/synthtext.py Outdated Show resolved Hide resolved
doctr/datasets/synthtext.py Outdated Show resolved Hide resolved
Copy link
Contributor

@fg-mindee fg-mindee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!

@fg-mindee fg-mindee merged commit 2284575 into mindee:main Nov 24, 2021
@felixdittrich92 felixdittrich92 deleted the synthtext branch November 25, 2021 07:32
@fg-mindee fg-mindee added the type: new feature New feature label Dec 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ext: tests Related to tests folder module: datasets Related to doctr.datasets topic: documentation Improvements or additions to documentation type: new feature New feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants