Add Container and Block for Text #207
julia> using FastAI
julia> name, recipe = finddatasets(blocks=(Any, Any), name="imdb")[1]
Pair{String, FastAI.Datasets.DatasetRecipe}("imdb", TextFolders(FastAI.Datasets.parentname, false, FastAI.Text.var"#2#4"()))
julia> data, blocks = loadrecipe(recipe, datasetpath("imdb"))
((mapobs(loadfile, ["/home/luna/.julia/datadeps/fastai-imdb/imdb/test/neg/0_2.txt", "/home/luna/.ju…]), mapobs(parentname, ["/home/luna/.julia/datadeps/fastai-imdb/imdb/test/neg/0_2.txt", "/home/luna/.ju…])), (TextBlock(), Label{String}(["neg", "pos"])))
julia> text, class = obs = getobs(data, 1000)
("Every movie I have PPV'd because Leonard Maltin praised it to the skies has blown chunks!
Every single one!
When will I ever learn?<br /><br />Evie is a raving Old Bag who thinks nothing of saying she's dying of breast cancer to get her way!
Laura is an insufferable Medusa filled with The Holy Spirit (and her hubby's protégé)!
Caught between these harpies is Medusa's dumb-as-a-rock boy who has been pressed into weed-pulling servitude by The Old Bag!<br /><br />
As I said, when will I ever learn?<br /><br />
I was temporarily lifted out of my malaise when The Old Bag stuck her head in a sink, but, unfortunately, she did not die.
I was temporarily lifted out of my malaise again when Medusa got mowed down, but, unfortunately, she did not die.
It should be a capital offense to torture audiences like this!<br /><br />
Without Harry Potter to kick him around, Rupert Grint is just a pair of big blue eyes that practically bulge out of its sockets.
Julie Walters's scenery-chewing (especially the scene when she \"plays\" God) is even more shameless than her character.
<br /><br />
At least this Harold bangs some bimbo instead of Maude.
For that, I am truly grateful. And if you're reading this Mr. Maltin, you owe me \$3.99!", "neg")
I have started adding functions that replace words starting with an uppercase letter, or consisting entirely of uppercase letters, with special tokens such as xxmaj and xxup. The remaining preprocessing utilities can be used from JuliaText.
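To make the case-marking idea above concrete, here is a minimal sketch. It assumes fastai's convention, where `xxup` precedes a word that was all uppercase and `xxmaj` precedes a word that was capitalized. The names `mark_case` and `mark_case_tokens` are hypothetical and are not the functions added in this PR; whitespace splitting stands in for a real tokenizer.

```julia
# Hypothetical helpers for illustration only; not the API added in this PR.
const UP = "xxup"    # marks a word that was written in all capitals
const MAJ = "xxmaj"  # marks a word that started with a capital letter

function mark_case(word::AbstractString)
    # Tokens without any letters (numbers, punctuation) are left untouched.
    any(isletter, word) || return [String(word)]
    if length(word) > 1 && all(c -> !isletter(c) || isuppercase(c), word)
        return [UP, lowercase(word)]
    elseif isuppercase(first(word))
        return [MAJ, lowercase(word)]
    end
    return [String(word)]
end

function mark_case_tokens(text::AbstractString)
    tokens = String[]
    # Assumption: naive whitespace splitting instead of a proper tokenizer.
    for word in split(text)
        append!(tokens, mark_case(word))
    end
    return join(tokens, " ")
end

marked = mark_case_tokens("Hello WORLD foo")
```

Lowercasing after inserting the marker keeps the vocabulary small while preserving the capitalization information as a separate token.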
Thanks for the PR!
I've left some comments.
Is there a fastai tutorial that uses this dataset? Would be helpful to know what kind of tasks could be tackled with this.
Sure, I just forgot to remove them from export after I tried testing them.
Co-authored-by: lorenzoh <[email protected]>
Yes, fastai does have a tutorial that uses this dataset: https://docs.fast.ai/tutorial.text.html. The tutorial focuses on sentiment analysis. The first part takes a language model (AWD-LSTM) that was pre-trained on Wikipedia to predict the next word and uses it directly to predict the sentiment of a given review. The second part uses the ULMFiT approach, which fine-tunes the language model on the IMDb dataset before using it to predict sentiment. They achieved SOTA with the second method. I'll commit the suggestions provided and improve on them. In parallel, I'll start reading the AWD-LSTM paper (https://arxiv.org/abs/1708.02182) to understand in more depth how the model works. After that, the plan is to go through the ULMFiT paper (https://arxiv.org/abs/1801.06146).
Co-authored-by: Brian Chen <[email protected]>
Left a comment regarding the block type, but other than that and Brian's open suggestion, this looks good to me!
Sorry for letting this sit! Tests were failing due to issues that should now be fixed on master, so merging master into this branch should make the CI green. The last thing that would be good to have is some tests.
Sure! I'll synchronise it with master and add some tests.
Umm... To write tests for TextFolders(), I need access to the IMDb dataset. I remember Lorenz mentioning that using large datasets for testing isn't a good idea, as it might overload the CI system. For other recipes, there are smaller datasets that replicate the larger originals. I couldn't find one for IMDb (there is one available as a CSV file, but I need an IMDb-like directory structure to test the recipe). Is there any workaround?
I wouldn't worry about testing the bits that require file IO for now; focus mostly on the helper functionality.
That sounds good!
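For reference, one possible workaround for the directory-structure question above, as a sketch only: generate a tiny IMDb-like tree in a temporary directory at test time. The `split/label/file.txt` layout is assumed from the paths in the REPL output (`imdb/test/neg/0_2.txt`); the helper name `make_mock_textfolders`, its keyword arguments, and the file contents are hypothetical.

```julia
# Hypothetical test helper; the layout is assumed from the paths shown in the REPL output.
function make_mock_textfolders(root::AbstractString;
                               subsets=("train", "test"),
                               labels=("neg", "pos"),
                               n=2)
    for subset in subsets, label in labels
        dir = joinpath(root, subset, label)
        mkpath(dir)
        for i in 1:n
            # Tiny placeholder reviews keep the fixture small enough for CI.
            write(joinpath(dir, "$(i)_0.txt"), "Sample $label review $i.")
        end
    end
    return root
end

# mktempdir() creates a fresh directory that Julia cleans up at process exit.
root = make_mock_textfolders(mktempdir())
```

A tree like this is only a few bytes in size, so it avoids the CI load concern while still exercising folder-based loading.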
Make mockblock for `Paragraph()` work.
Co-authored-by: lorenzoh <[email protected]>
Add tests for text transforms
I tried creating a simple text recipe based on the ImageFolders dataset recipe. It specifically works for `imdb` and similar datasets. Any feedback is highly appreciated.