Add Container and Block for Text #207

Chandu-4444 · 2022-03-26T06:30:11Z

I've made a first attempt at a simple text recipe, based on the ImageFolders dataset recipe. It works for imdb and similarly structured datasets. Any feedback is highly appreciated.

Chandu-4444 · 2022-03-28T12:31:04Z

julia> using FastAI

julia> name, recipe = finddatasets(blocks=(Any, Any), name="imdb")[1]
Pair{String, FastAI.Datasets.DatasetRecipe}("imdb", TextFolders(FastAI.Datasets.parentname, false, FastAI.Text.var"#2#4"()))

julia> data, blocks = loadrecipe(recipe, datasetpath("imdb"))
((mapobs(loadfile, ["/home/luna/.julia/datadeps/fastai-imdb/imdb/test/neg/0_2.txt", "/home/luna/.ju…]), mapobs(parentname, ["/home/luna/.julia/datadeps/fastai-imdb/imdb/test/neg/0_2.txt", "/home/luna/.ju…])), (TextBlock(), Label{String}(["neg", "pos"])))

julia> text, class = obs = getobs(data, 1000)
("Every movie I have PPV'd because Leonard Maltin praised it to the skies has blown chunks! 
Every single one! 
When will I ever learn?<br /><br />Evie is a raving Old Bag who thinks nothing of saying she's dying of breast cancer to get her way! 
Laura is an insufferable Medusa filled with  The Holy Spirit (and her hubby's protégé)! 
Caught between these harpies is Medusa's dumb-as-a-rock boy who has been pressed into weed-pulling servitude by The Old Bag!<br /><br />
As I said, when will I ever learn?<br /><br />
I was temporarily lifted out of my malaise when The Old Bag stuck her head in a sink, but, unfortunately, she did not die. 
I was temporarily lifted out of my malaise again when Medusa got mowed down, but, unfortunately, she did not die. 
It should be a capital offense to torture audiences like this!<br /><br />
Without Harry Potter to kick him around, Rupert Grint is just a pair of big blue eyes that practically bulge out of its sockets.  
Julie Walters's scenery-chewing (especially the scene when she \"plays\" God) is even more shameless than her character.
<br /><br />
At least this Harold bangs some bimbo instead of Maude. 
For that, I am truly grateful. And if you're reading this Mr. Maltin, you owe me \$3.99!", "neg")
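
For readers unfamiliar with folder-based recipes, the session above can be approximated in plain Julia. This is only a minimal sketch, not the code in this PR: `load_text`, `parent_name`, `text_files` and `text_observation` are hypothetical stand-ins for the `loadfile`/`parentname` pair visible in the output, and the real FastAI.jl definitions may differ. The idea is that each observation is a pair of (file contents, name of the parent directory), loaded lazily by index.

```julia
# Hypothetical stand-ins for the `loadfile` and `parentname` functions seen
# in the REPL output above; the real FastAI.jl definitions may differ.
load_text(path::AbstractString) = read(path, String)
parent_name(path::AbstractString) = basename(dirname(path))

# Collect every `.txt` file below `root`, sorted so indices are reproducible.
function text_files(root::AbstractString)
    paths = String[]
    for (dir, _, files) in walkdir(root)
        for file in files
            endswith(file, ".txt") && push!(paths, joinpath(dir, file))
        end
    end
    return sort!(paths)
end

# Return (text, label) for the `i`-th file; the file is only read here.
function text_observation(paths::Vector{String}, i::Integer)
    path = paths[i]
    return (load_text(path), parent_name(path))
end
```

With a directory laid out like `imdb/test/neg/0_2.txt`, `text_observation(text_files(root), i)` would yield the review text together with `"neg"` or `"pos"`.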

Chandu-4444 · 2022-03-30T09:43:34Z

I have started adding functions that mark words starting with an uppercase letter, or written entirely in uppercase, with special tokens such as xxmaj and xxup. The remaining preprocessing utilities can be taken from JuliaText.
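
To illustrate what such transforms might look like, here is a rough sketch in plain Julia. It assumes the fastai convention that `xxup` precedes a lowercased all-caps word, `xxmaj` precedes a lowercased capitalized word, and `xxbos` marks the beginning of a text. The function names and regular expressions are hypothetical and ASCII-only; they are not the ones added in this PR.

```julia
# Hypothetical helper names; not the functions added in this PR.

# Mark fully uppercase words of two or more letters: "GREAT" -> "xxup great".
function mark_all_caps(text::AbstractString)
    return replace(text, r"\b[A-Z]{2,}\b" => w -> "xxup " * lowercase(w))
end

# Mark capitalized words: "Movie" -> "xxmaj movie".
function mark_capitalized(text::AbstractString)
    return replace(text, r"\b[A-Z][a-z]+\b" => w -> "xxmaj " * lowercase(w))
end

# Prepend the beginning-of-stream token.
add_bos(text::AbstractString) = "xxbos " * text

# Apply the all-caps rule first so its output is not re-matched as capitalized.
preprocess(text::AbstractString) = add_bos(mark_capitalized(mark_all_caps(text)))
```

For example, `preprocess("This movie is GREAT")` gives `"xxbos xxmaj this movie is xxup great"`. Single uppercase letters such as "I" are left untouched in this simplification.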

lorenzoh

Thanks for the PR!

I've left some comments.

Is there a fastai tutorial that uses this dataset? Would be helpful to know what kind of tasks could be tackled with this.

src/Text/blocks/text.jl

src/Text/Text.jl

src/Text/recipes.jl

src/Text/Text.jl

Chandu-4444 · 2022-03-31T19:50:03Z

> Is there a fastai tutorial that uses this dataset? Would be helpful to know what kind of tasks could be tackled with this.

Yes, fastai has a tutorial that uses this dataset: https://docs.fast.ai/tutorial.text.html. The tutorial focuses on sentiment analysis. The first part takes a language model (AWD-LSTM) that was pre-trained on Wikipedia to predict the next word (language generation) and uses it directly for predicting the sentiment of a given review. The second part uses the ULMFiT approach, which fine-tunes the language model on the IMDB dataset before using it to predict the sentiment. They achieved SOTA results with the second method.

I'll commit the suggested changes and build on them.

At the same time, I'll start reading the AWD-LSTM paper (https://arxiv.org/abs/1708.02182) to get a deeper understanding of how the model works. After that, the plan is to go through the ULMFiT paper (https://arxiv.org/abs/1801.06146).

src/Text/recipes.jl

src/Text/transform.jl

src/datasets/containers.jl

lorenzoh

Left a comment regarding the block type, but other than that and Brian's open suggestion, this looks good to me!

src/Text/blocks/text.jl

lorenzoh · 2022-04-20T06:15:12Z

Sorry for letting this sit!

Tests were failing due to issues that should be fixed on master, so merging master into this should make the CI green.

The last thing that would be good to have is some tests.

Chandu-4444 · 2022-04-20T08:04:51Z

Sure! Will synchronise it with master and add some tests.

Chandu-4444 · 2022-04-20T18:26:27Z

Umm... To write tests for TextFolders(), I need access to the IMDb dataset. I remember Lorenz mentioning that using large datasets for testing isn't a good idea, as it might overload the CI system. For other recipes, there are smaller datasets that replicate the structure of the original, larger ones. I couldn't find any such dataset for IMDb (there is one available as a CSV file, but I need an IMDb-like directory structure to test the recipe). Is there any workaround?
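
One possible workaround, sketched here purely as an assumption and not something this PR adopts, would be to generate a tiny IMDb-like directory tree in a temporary folder at test time. `make_mock_imdb` is a hypothetical helper, and the file names and contents are made up for illustration.

```julia
# Hypothetical helper: builds a tiny IMDb-like tree under `root`, i.e.
# root/{train,test}/{pos,neg}/<i>_0.txt, with `n` short files per folder.
function make_mock_imdb(root::AbstractString; n::Integer = 2)
    for subset in ("train", "test"), label in ("pos", "neg")
        dir = joinpath(root, subset, label)
        mkpath(dir)
        for i in 1:n
            write(joinpath(dir, "$(i)_0.txt"), "A $label review number $i.")
        end
    end
    return root
end
```

Calling `make_mock_imdb(mktempdir())` would give a throwaway directory whose layout mirrors the real dataset, so a folder-based recipe could be exercised without downloading anything.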

ToucheSir · 2022-04-20T19:14:07Z

For now, I wouldn't worry about testing the bits that require file IO; focus mostly on the helper functionality.

Chandu-4444 · 2022-04-20T19:15:25Z

That sounds good!

Add basic Text module and sample recipe.

a8d0f52

Chandu-4444 changed the title ~~Add basic Text module and sample recipe.~~ Add basic Container and Block for Text Mar 26, 2022

Chandu-4444 changed the title ~~Add basic Container and Block for Text~~ Add Container and Block for Text Mar 26, 2022

Add docstrings

6ea531c

Start adding text transforms

c533ce3

Chandu-4444 added 2 commits March 30, 2022 15:17

Remove basic preprocessing functions

50cfd2e

Add xxbos transform and minor updates

e51b866

lorenzoh requested changes Mar 31, 2022

src/Text/blocks/text.jl (outdated, resolved)

src/Text/Text.jl (outdated, resolved)

src/Text/recipes.jl (outdated, resolved)

src/Text/Text.jl (outdated, resolved)

Chandu-4444 and others added 2 commits March 31, 2022 21:11

Update src/Text/Text.jl

15f710c

Sure, I just forgot to remove them from export after I tried testing them. Co-authored-by: lorenzoh <[email protected]>

Update src/Text/recipes.jl

2d5e4ba

Co-authored-by: lorenzoh <[email protected]>

Update TextBlock documentation.

d44e1b0

Chandu-4444 requested a review from lorenzoh April 1, 2022 04:47

Chandu-4444 added 2 commits April 1, 2022 14:38

Update declaration of checkblock method

e714966

for TextBlock.

Update Text.jl to remove an unexpected error

357eaf0

ToucheSir reviewed Apr 1, 2022

src/Text/recipes.jl (outdated, resolved)

src/Text/recipes.jl (outdated, resolved)

src/Text/recipes.jl (outdated, resolved)

src/Text/transform.jl (outdated, resolved)

src/datasets/containers.jl (outdated, resolved)

Chandu-4444 and others added 4 commits April 1, 2022 22:13

Update src/Text/recipes.jl

cbebcb6

Co-authored-by: Brian Chen <[email protected]>

Update src/Text/transform.jl

a18d72d

Co-authored-by: Brian Chen <[email protected]>

Update src/datasets/containers.jl

1a1266a

Co-authored-by: Brian Chen <[email protected]>

Update recipes.jl with suggestions provided

d94c94e

lorenzoh requested changes Apr 4, 2022

src/Text/blocks/text.jl (outdated, resolved)

Chandu-4444 added 4 commits April 4, 2022 17:37

Change TextBlock to more reasonable Paragraph

ce50151

Change Text to Textual to resolve conflict

db511fb

between `Base.Text`

Remove type annotations for text transforms

afc479f

Add mockblock for text

5a779cf

Merge branch 'FluxML:master' into master

98df8ec

Chandu-4444 added 3 commits April 21, 2022 01:48

Add simple test for TextFolders

b6224b4

Remove tests

14a96d0

Add test (again).

cb561b7

Make mockblock for `Paragraph()` work.

lorenzoh requested changes Apr 23, 2022

src/Textual/blocks/text.jl (outdated, resolved)

src/Textual/recipes.jl (outdated, resolved)

src/Textual/transform.jl (resolved)

Chandu-4444 and others added 3 commits April 23, 2022 14:16

Update src/Textual/blocks/text.jl

8a17345

Co-authored-by: lorenzoh <[email protected]>

Update test for TextFolders

70405ac

Add tests for text transforms

Merge branch 'FluxML:master' into master

290ae32

Chandu-4444 requested a review from lorenzoh May 4, 2022 14:20

lorenzoh merged commit 2f227aa into FluxML:master May 12, 2022
