Introduction

Corpus Flows Associated with the DHN Workshop

source: stylistic-profile.json

Etymology Flow Associated with the DHN Workshop

source: dictionary-etymology.json

This flow generates an etymology dataset from dictionary etymology data. Once the dataset is generated (under the namespace-name dictionary.etymology.wiktionary.deep-partial), it can be used across the project. At the moment it is used in the corpora flow stylistic-profile.

Execute the flow (dictionary-etymology): (server)

python RunBukvik.py -env ../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.server.json -exp ../experiments/projects/bukvik-workshop-project/flows/dictionary-etymology.json

(mprinc):

python RunBukvik.py -env ../../../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.mprinc.json -exp ../../../experiments/projects/bukvik-workshop-project/flows/dictionary-etymology.json

Corpora Flows Associated with the DHN Workshop

Each corpora-flow has a dedicated section for each plain-text corpus it processes. For each corpus it invokes the necessary tasks from the corpus-flow and tweaks any relevant parameters, such as how a particular document should be parsed and split into words and sentences.
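
The per-corpus sections described above (shared tasks plus a few overridden parameters) can be sketched in Python as follows. This is only an illustration of the idea: the parameter names (`encoding`, `sentence_splitter`, `word_pattern`), their values, and the `resolve_params` helper are hypothetical and do not reflect the actual schema of the Bukvik flow files.

```python
from copy import deepcopy

# Hypothetical defaults shared by every corpus in a corpora-flow.
DEFAULT_PARAMS = {
    "encoding": "utf-8",
    "sentence_splitter": "punctuation",
    "word_pattern": r"\w+",
}

# Hypothetical per-corpus sections: each one overrides only what it needs.
# The override values are made up for illustration.
CORPUS_OVERRIDES = {
    "lolita-en": {},
    "pnin-en": {"sentence_splitter": "newline"},
    "lolita-ru": {"word_pattern": r"[а-яёА-ЯЁ-]+"},
}


def resolve_params(corpus_id):
    """Return a copy of the defaults with the corpus-specific overrides applied."""
    params = deepcopy(DEFAULT_PARAMS)
    params.update(CORPUS_OVERRIDES.get(corpus_id, {}))
    return params


if __name__ == "__main__":
    for corpus_id in CORPUS_OVERRIDES:
        print(corpus_id, resolve_params(corpus_id))
```

Keeping the defaults in one place and letting each corpus state only its differences makes it easy to see at a glance which texts need special parsing.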

nabokov-english

Novels written in English:

Nabokov's novels in English: corpus-en, sinister-en, lolita-en, pnin-en, harlequins-en, invite-en, gift-en, defense-en, knave-en, speak-en

nabokov-russian

Russian: corpus-ru, mary-ru, knave-ru, defense-ru, invite-ru, gift-ru, speak-ru, lolita-ru

Other books to be added in future experiments

Nabokov’s Novels in English

  1. (1941) The Real Life of Sebastian Knight
  2. (1947) Bend Sinister
  3. (1955) Lolita, self-translated into Russian (1965)
  4. (1957) Pnin
  5. (1962) Pale Fire
  6. (1969) Ada or Ardor: A Family Chronicle
  7. (1972) Transparent Things
  8. (1974) Look at the Harlequins!
  9. Plus Speak, Memory (1951/1967)

Nabokov’s Novels in Russian

  1. (1926) Mashen'ka (Машенька); English translation: Mary (1970)
  2. (1928) Korol' Dama Valet (Король, дама, валет); English translation: King, Queen, Knave (1968)
  3. (1930) Zashchita Luzhina (Защита Лужина); English translation: The Luzhin Defense or The Defense (1964) (also adapted to film, The Luzhin Defence, in 2000)
  4. (1930) Sogliadatai (Соглядатай (The Voyeur)), novella; first publication as a book 1938; English translation: The Eye (1965)
  5. (1932) Podvig (Подвиг (Deed)); English translation: Glory (1971)
  6. (1933) Kamera Obskura (Камера Обскура); English translations: Camera Obscura (1936), Laughter in the Dark (1938)
  7. (1934) Otchayanie (Отчаяние); English translation: Despair (1937, 1965)
  8. (1936) Priglasheniye na kazn' (Приглашение на казнь (Invitation to an execution)); English translation: Invitation to a Beheading (1959)
  9. (1938) Dar (Дар); English translation: The Gift (1963)
  10. Plus Lolita and Drugie berega.

Example Flow

import documents (corpora)

English

Russian

import documents (balanced corpora)

  • Brown Fiction
  • Russian National Corpus (fiction)

parse documents

POS: both unigrams and bigrams (combinations of two tags). We can mention that trigrams are supported, but it is not necessary to use them.

do POS-mapping

visualize distributions

Call up examples of text with the POS bigrams and unigrams (more than X)

Sentence length

Richness (mention exoticism, but unnecessary to run)

Etymology

  • As a special bonus, it would be great to run etymology in Russian too. Is that possible?

SoW?
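
The POS steps in the flow above (unigram and bigram distributions, then calling up example sentences for a pattern) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Bukvik's implementation: the function names are hypothetical, the input is assumed to be a list of sentences already tagged as `(word, tag)` pairs, and the sample data is a hand-made toy.

```python
from collections import Counter


def pos_ngrams(tags, n):
    """Return the list of n-grams (as tuples) over a sequence of POS tags."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def pos_distribution(tagged_sentences, n):
    """Count POS n-grams across sentences without crossing sentence boundaries."""
    counts = Counter()
    for sentence in tagged_sentences:
        tags = [tag for _word, tag in sentence]
        counts.update(pos_ngrams(tags, n))
    return counts


def sentences_with_pattern(tagged_sentences, pattern, min_count=1):
    """Return sentences that contain the POS pattern at least min_count times."""
    pattern = tuple(pattern)
    hits = []
    for sentence in tagged_sentences:
        tags = [tag for _word, tag in sentence]
        occurrences = pos_ngrams(tags, len(pattern)).count(pattern)
        if occurrences >= min_count:
            hits.append(" ".join(word for word, _tag in sentence))
    return hits


if __name__ == "__main__":
    # Toy, hand-tagged input for illustration only.
    sample = [
        [("the", "DT"), ("old", "JJ"), ("house", "NN")],
        [("stone", "NN"), ("garden", "NN"), ("wall", "NN")],
    ]
    print(pos_distribution(sample, 1).most_common(3))
    print(pos_distribution(sample, 2).most_common(3))
    print(sentences_with_pattern(sample, ["NN", "NN"], min_count=2))
```

Passing `n=3` to `pos_distribution` gives trigrams with no further changes, and `min_count` plays the role of the "X" threshold mentioned above (here as "at least" rather than "more than").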

Tentative Schedule

First hour:

  1. 15 min intro to methods in stylometry, existing tools, their pros and cons
  2. 15 min intro to Bukvik and existing results
  3. 15 min – Accessing Bukvik on the participants’ computers

Second hour:

  1. Make experiments. We will pre-run each task beforehand to make sure everything runs smoothly.
  2. Simulate the research process through experiments. Give a quote from Grayson/Nabokov on differences for researchers. How do we test that?
     • POS: first, run Nabokov in Russian and English on POS, focusing on nouns and adjectives.
         1. Aha, more nouns, interesting.
         2. Find examples of sentences with lots of nouns.
     • Sentence length. Hypothesis: longer sentences? Run.
         1. Nope, but look at the translation effect and make a note.
         2. So, original texts in English have more nouns but not longer sentences.
     • POS combinations. For example, can it be lists?
         1. See NN NN NN. (not done for dissertation)
         2. Demonstrate POS combinations. (not done for dissertation)
         3. Ask if we should check other combinations; do a couple they suggest.
     • Richness. Another hypothesis: richer vocabulary?
         1. Run, yep.
         2. Why richer? More foreign words, for sure. Knowing Nabokov, it seems like we’re on the right path. Here, cite Chepiga and give an example from Ada.
     • Etymology. Hm, but what about choices in non-foreign words? Run etymology, indeed.
         1. Show different languages.
         2. Create a diagram with average distribution by origin. (not done for dissertation)
         3. If possible, repeat the experiment for Russian. (not done for dissertation)
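
The sentence-length and richness checks in the second hour can be approximated with a few lines of Python. This is a rough sketch, not Bukvik's method: the README does not say which richness measure Bukvik uses, so a plain type-token ratio is swapped in here as a common proxy, and the regex-based sentence splitter and tokenizer are deliberately naive. All function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(text):
    """Lowercased word tokens; the word-character class is Unicode-aware."""
    return re.findall(r"\w+", text.lower())


def mean_sentence_length(text):
    """Average number of word tokens per sentence."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


def type_token_ratio(text):
    """Distinct word forms divided by total tokens (a simple richness proxy)."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


if __name__ == "__main__":
    # Toy text for illustration only.
    text = "The cat sat. The cat ran away!"
    print(mean_sentence_length(text))
    print(type_token_ratio(text))
```

The type-token ratio drops as texts get longer, so comparisons between novels are only meaningful on samples of equal size.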

Third hour:

  1. Sum up (give reference for forthcoming publication).
  2. Is etymology then the reason for richer vocab? For more nouns? Well, it can contribute to varied.
  3. Nabokov has more nouns and richer vocab in his L2, and one factor that may help account for it is his preference for a distribution of words with particular kinds of origins, different from the normal distribution. This is one feature of deviation from the norm within standard language = style.
  4. Stylistic profile.
  5. Get samples of text, read in the light of what we learned – paying attention to nouns. Find a good passage for that.
  6. What could we look at next? Brainstorm. Potential projects. Potential development.
  7. Intro to other capabilities of Bukvik, existing and in progress: Society of Words, semantic…
  8. Brainstorm on the future of such tools, and where Bukvik can/will develop (modular etc).
  9. Collaborations?
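
The idea in item 3 above, that an author's distribution of word origins can deviate from a reference distribution, can be illustrated with a short Python sketch. It assumes a lookup table from word to origin is already available (for example, one derived from the etymology dataset); the function names, the lookup table, and the origin labels below are hypothetical placeholders, not real etymological data.

```python
from collections import Counter


def origin_distribution(tokens, origin_of):
    """Relative frequency of each origin among the tokens found in the lookup."""
    counts = Counter(origin_of[t] for t in tokens if t in origin_of)
    total = sum(counts.values())
    if not total:
        return {}
    return {origin: n / total for origin, n in counts.items()}


def deviation(author_dist, reference_dist):
    """Per-origin difference; positive means the author over-uses that origin."""
    origins = set(author_dist) | set(reference_dist)
    return {
        o: author_dist.get(o, 0.0) - reference_dist.get(o, 0.0)
        for o in origins
    }


if __name__ == "__main__":
    # Placeholder lookup and tokens; labels are abstract on purpose.
    origin_of = {"w1": "origin-A", "w2": "origin-A", "w3": "origin-B"}
    author = origin_distribution(["w1", "w3", "w3", "w9"], origin_of)
    reference = origin_distribution(["w1", "w2", "w2", "w3"], origin_of)
    print(deviation(author, reference))
```

Tokens missing from the lookup are skipped rather than counted as a separate category, which is a design choice worth revisiting if dictionary coverage differs a lot between languages.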

Running

IMPORTANT ABOUT NAMING: corpora flows are renamed to avoid collisions of namespace and NamespaceName, for example: namespace: bukvik-workshop.data.lolita-ru and NsN: bukvik-workshop.data.lolita-ru:pos

(NOTE: BukvikDatasets: setDataset > BukvikNamespaceContainer: setEntity > _getContainer >
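
To make the naming convention concrete, here is a small Python sketch. It assumes, based only on the single example above, that a NamespaceName (NsN) is the namespace followed by a colon and a short dataset name. The helpers `make_nsn` and `split_nsn` are hypothetical and are not part of Bukvik's API.

```python
def make_nsn(namespace, name):
    """Join a namespace and a dataset name into a NamespaceName (NsN)."""
    if ":" in namespace:
        raise ValueError("namespace must not contain ':'")
    return f"{namespace}:{name}"


def split_nsn(nsn):
    """Split a NamespaceName back into (namespace, name)."""
    namespace, sep, name = nsn.partition(":")
    if not sep:
        raise ValueError(f"not a NamespaceName: {nsn!r}")
    return namespace, name


if __name__ == "__main__":
    nsn = make_nsn("bukvik-workshop.data.lolita-ru", "pos")
    print(nsn)
    print(split_nsn(nsn))
```

Because each corpus gets its own namespace, two corpora can both have a dataset named `pos` without their NamespaceNames colliding.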

cdbd
cd ../..
mkdir datasets
cd datasets
git clone https://github.com/Cha-OS/bukvik-workshop-corpora
cdbp

Execute a corpora-flow:

python RunBukvik.py -env ../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.server.json -exp ../experiments/projects/bukvik-workshop-project/flows/nabokov-in-english.json

Execute a particular task (one text) of the corpora-flow:

python RunBukvik.py -env ../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.server.json -exp ../experiments/projects/bukvik-workshop-project/flows/nabokov-in-english.json -cmd execTask -t "<NAMESPACE_TASK>.defense-en"

(mprinc)

python RunBukvik.py -env ../../../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.mprinc.json -exp ../../../experiments/projects/bukvik-workshop-project/flows/nabokov-in-english.json

Execute a task in corpora-flow:

python RunBukvik.py -env ../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.server.json -exp ../experiments/projects/bukvik-workshop-project/flows/nabokov-in-english.json -cmd execTask -t "<NAMESPACE_TASK>.corpus-en"

(mprinc)

python RunBukvik.py -env ../../../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.mprinc.json -exp ../../../experiments/projects/bukvik-workshop-project/flows/nabokov-in-english.json -cmd execTask -t "<NAMESPACE_TASK>.corpus-en"

(mprinc)

python RunBukvik.py -env ../../../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.mprinc.json -exp ../../../experiments/projects/bukvik-workshop-project/flows/nabokov-in-russian.json -cmd execTask -t "<NAMESPACE_TASK>.execute.speak-ru"

Execute the joint-flow:

python RunBukvik.py -env ../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.server.json -exp ../experiments/projects/bukvik-workshop-project/flows/stylistic-profile-joined.json

(mprinc)

python RunBukvik.py -env ../../../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.mprinc.json -exp ../../../experiments/projects/bukvik-workshop-project/flows/stylistic-profile-joined.json

(mprinc)

python RunBukvik.py -env ../../../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.mprinc.json -exp ../../../experiments/projects/bukvik-workshop-project/flows/stylistic-profile-joined.json -cmd execTask -t "<NAMESPACE_TASK>.execute.english-stylistic-profile-pos-joining"

Execute whole flow: (server)

python RunBukvik.py -env ../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.server.json -exp ../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json

Execute particular task (<NAMESPACE_TASK>.import.importing-the-corpus): (server)

python RunBukvik.py -env ../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.server.json -exp ../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.import.importing-the-corpus"

(mprinc):

python RunBukvik.py -env ../../../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.mprinc.json -exp ../../../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.import.importing-the-corpus"

Execute particular task (<NAMESPACE_TASK>.parsers.words):

python RunBukvik.py -env ../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.server.json -exp ../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.parsers.words"

(mprinc):

python RunBukvik.py -env ../../../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.mprinc.json -exp ../../../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.parsers.words"

Execute particular task (<NAMESPACE_TASK>.pos.parsing-pos-external): (server)

python RunBukvik.py -env ../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.server.json -exp ../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.pos.parsing-pos-external"

(mprinc):

python RunBukvik.py -env ../../../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.mprinc.json -exp ../../../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.pos.parsing-pos-external"

Execute particular task (<NAMESPACE_TASK>.pos.remapping-pos-tags): (server)

python RunBukvik.py -env ../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.server.json -exp ../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.pos.remapping-pos-tags"

(mprinc):

python RunBukvik.py -env ../../../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.mprinc.json -exp ../../../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.pos.remapping-pos-tags"

Execute particular task (<NAMESPACE_TASK>.corpora.brown.words.distribution.generating-brown-list-of-words-distribution): (server)

python RunBukvik.py -env ../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.server.json -exp ../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.corpora.brown.words.distribution.generating-brown-list-of-words-distribution"

(mprinc):

python RunBukvik.py -env ../../../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.mprinc.json -exp ../../../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.corpora.brown.words.distribution.generating-brown-list-of-words-distribution"

Execute particular task (<NAMESPACE_TASK>.stats.calculating-simple-stats): (server)

python RunBukvik.py -env ../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.server.json -exp ../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.stats.calculating-simple-stats"

(mprinc):

python RunBukvik.py -env ../../../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.mprinc.json -exp ../../../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.stats.calculating-simple-stats"

Execute particular task (<NAMESPACE_TASK>.stats.calculating-etymology): (server)

python RunBukvik.py -env ../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.server.json -exp ../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.stats.calculating-etymology"

(mprinc):

python RunBukvik.py -env ../../../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.mprinc.json -exp ../../../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.stats.calculating-etymology"

Execute particular task (<NAMESPACE_TASK>.distribution): (server)

python RunBukvik.py -env ../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.server.json -exp ../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.distribution"

Execute particular task (<NAMESPACE_TASK>.distribution-out): (server)

python RunBukvik.py -env ../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.server.json -exp ../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.distribution-out"

(mprinc):

python RunBukvik.py -env ../../../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.mprinc.json -exp ../../../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.distribution-out"

(mprinc):

python RunBukvik.py -env ../../../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.mprinc.json -exp ../../../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.distribution"

Execute particular task (<NAMESPACE_TASK>.pos.distribution): (server)

python RunBukvik.py -env ../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.server.json -exp ../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.pos.distribution"

(mprinc):

python RunBukvik.py -env ../../../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.mprinc.json -exp ../../../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.pos.distribution"

Execute particular task (<NAMESPACE_TASK>.pos.distribution-out): (server)

python RunBukvik.py -env ../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.server.json -exp ../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.pos.distribution-out"

(mprinc):

python RunBukvik.py -env ../../../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.mprinc.json -exp ../../../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.pos.distribution-out"

Execute particular task (<NAMESPACE_TASK>.pos.searching-pos): (server)

python RunBukvik.py -env ../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.server.json -exp ../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.pos.searching-pos"

Execute particular task (<NAMESPACE_TASK>.pos.exporting-pos-patterns): (server)

python RunBukvik.py -env ../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.server.json -exp ../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.pos.exporting-pos-patterns"

Execute particular task (<NAMESPACE_TASK>.pos.exporting-pos-document): (server)

python RunBukvik.py -env ../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.server.json -exp ../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.pos.exporting-pos-document"

Execute particular task (<NAMESPACE_TASK>.dictionary.ner.characters.import.importing-ner-dictionary-file): (server)

python RunBukvik.py -env ../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.server.json -exp ../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.dictionary.ner.characters.import.importing-ner-dictionary-file"

Execute particular task (<NAMESPACE_TASK>.dictionary.ner.characters.recognize.recognizing-ner): (server)

python RunBukvik.py -env ../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.server.json -exp ../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.dictionary.ner.characters.recognize.recognizing-ner"

Execute particular task (<NAMESPACE_TASK>.distribution-out): (server)

python RunBukvik.py -env ../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.server.json -exp ../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.distribution-out"

Execute particular task (<NAMESPACE_TASK>.words-society.wordssociety): (server)

python RunBukvik.py -env ../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.server.json -exp ../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.words-society.wordssociety"

Execute particular task (<NAMESPACE_TASK>.words-society.wordssociety-out-graph): (server)

python RunBukvik.py -env ../experiments/projects/bukvik-workshop-project/environments/bukvik-workshop.env.server.json -exp ../experiments/projects/bukvik-workshop-project/flows/stylistic-profile.json -cmd execTask -t "<NAMESPACE_TASK>.words-society.wordssociety-out-graph"
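
All of the commands above share the same shape: an environment file, a flow file, and optionally `-cmd execTask -t <task>`. The following Python sketch assembles such a command so the long paths do not have to be retyped. It uses only the flags and paths that appear in this README; the helper name `bukvik_command` and the `PREFIX` table are hypothetical conveniences, not part of Bukvik.

```python
import shlex

PROJECT = "experiments/projects/bukvik-workshop-project"

# Relative path prefix used by each environment in the commands above.
PREFIX = {"server": "..", "mprinc": "../../.."}


def bukvik_command(env, flow, task=None):
    """Assemble the RunBukvik.py argument list for a whole flow or a single task."""
    root = f"{PREFIX[env]}/{PROJECT}"
    args = [
        "python", "RunBukvik.py",
        "-env", f"{root}/environments/bukvik-workshop.env.{env}.json",
        "-exp", f"{root}/flows/{flow}.json",
    ]
    if task is not None:
        # Run only one task of the flow instead of the whole flow.
        args += ["-cmd", "execTask", "-t", task]
    return args


if __name__ == "__main__":
    cmd = bukvik_command("server", "stylistic-profile", "<NAMESPACE_TASK>.parsers.words")
    # Print a shell-quoted version that can be pasted into a terminal.
    print(shlex.join(cmd))
```

The returned list can also be passed to `subprocess.run(cmd, check=True)` from the directory that contains RunBukvik.py.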