preprocessing of large corpora to induce various cluster types



This README explains the pre-processing performed to create the cluster lexicons that are used as features in the IXA pipes tools. So far we use the following three methods: Brown, Clark and Word2vec.


  1. Overview
  2. Brown clusters
  3. Clark clusters
  4. Word2vec clusters
  5. XML/HTML cleaning


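Overview

We induce the following clustering types:

  • Brown clusters
  • Clark clusters
  • Word2vec clusters (K-means on top of word2vec word embeddings)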


Brown clusters

Let us assume that the source data is in plain text format (i.e., without HTML or XML tags) and that every document is in a directory called corpus-directory. The following steps are then performed:

Preclean corpus

This step is performed by using the following function in ixa-pipe-convert:

java -jar ixa-pipe-convert-$version.jar --brownClean corpus-directory/

ixa-pipe-convert will create a .clean file for each file contained in the folder corpus-directory.

  • Move all .clean files into a new directory called, for example, corpus-preclean.
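The move can be scripted. The following is a minimal sketch, assuming the .clean files are written next to their source files at the top level of corpus-directory. The function name collect_clean is hypothetical and not part of ixa-pipe-convert.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of ixa-pipe-convert.
# Assumption: the .clean files sit at the top level of SRC.
collect_clean() {
  local src="$1" dst="$2"
  mkdir -p "$dst"
  # find (rather than a bare glob) avoids "argument list too long"
  # on very large corpora and copes with spaces in file names.
  find "$src" -maxdepth 1 -type f -name '*.clean' -exec mv -- {} "$dst"/ \;
}

# Usage: collect_clean corpus-directory corpus-preclean
```

Only files ending in .clean are moved; the original documents stay in corpus-directory.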

Tokenize clean files to oneline format

  • Tokenize all the files in the folder to one line per sentence. This step is performed by using ixa-pipe-tok in the following shell script:
./ $lang corpus-preclean

The tokenized version of each file in the directory corpus-preclean will be saved with a .tok suffix.

  • cat to one large file: all the tokenized files are concatenated into one large file called corpus-preclean.tok.
cd corpus-preclean
cat *.tok > corpus-preclean.tok
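Because the output file also ends in .tok and is written into the same directory, running the command a second time makes the *.tok glob match the previous output as well. The sketch below is one possible way to make the step repeatable. The function name concat_tok is hypothetical and the approach is an illustration, not a script from this repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper, shown only as an illustration.
concat_tok() {
  local dir="$1" out="$2" tmp f
  # Write to a temporary file next to OUT; its name does not end in .tok,
  # so the glob below never picks it up.
  tmp="$(mktemp "$out.XXXXXX")" || return 1
  for f in "$dir"/*.tok; do
    [ -f "$f" ] || continue            # unmatched glob or not a regular file
    [ "$f" -ef "$out" ] && continue    # skip a previous copy of the output
    cat -- "$f" >> "$tmp"
  done
  mv -- "$tmp" "$out"
}

# Usage: concat_tok corpus-preclean corpus-preclean/corpus-preclean.tok
```

Files are concatenated in the shell's glob order, so the result is the same on every run over the same directory.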

Format the corpus for Liang's implementation

  • Run the script like this to create the format required to induce Brown clusters using Percy Liang's program.
./ corpus-preclean.tok > corpus-preclean.tok.punct

Induce Brown clusters:

brown-cluster/wcluster --text corpus-preclean.tok.punct --c 1000 --threads 8

This induces 1000 Brown clusters using 8 threads in parallel.
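Brown clusters are often consumed as bit-string prefixes of several lengths. The sketch below assumes that wcluster writes a paths file whose lines have the form bit-string, word, count, separated by tabs. Check this layout against the output of your own run. The function name brown_prefix and the prefix length used in the usage line are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper. Assumption: each input line is
#   bitstring<TAB>word<TAB>count
brown_prefix() {
  local len="$1" paths="$2"
  # Print "word<TAB>prefix", where prefix is the first LEN bits of the path
  # (shorter paths are printed whole).
  awk -F '\t' -v n="$len" 'NF >= 2 { print $2 "\t" substr($1, 1, n) }' "$paths"
}

# Usage (path to the paths file is an assumption): brown_prefix 4 path/to/paths
```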


Clark clusters

Let us assume that the source data is in plain text format (i.e., without HTML or XML tags) and that every document is in a directory called corpus-directory. The following steps are then performed:

Tokenize clean files to oneline format

  • Tokenize all the files in the folder to one line per sentence. This step is performed by using ixa-pipe-tok in the following shell script:
./ $lang corpus-directory

The tokenized version of each file in the directory corpus-directory will be saved with a .tok suffix.

  • cat to one large file: all the tokenized files are concatenated into one large file called corpus.tok.
cd corpus-directory
cat *.tok > corpus.tok

Format the corpus

  • Run the script like this to create the format required to induce Clark clusters using Clark's implementation.
./ corpus.tok > corpus.tok.punct.lower

Train Clark clusters:

To train 100 word clusters use the following command line:

cluster_neyessenmorph -s 5 -m 5 -i 10 corpus.tok.punct.lower - 100 > corpus.tok.punct.lower.100


Word2vec clusters

Let us assume that the source data is in plain text format (i.e., without HTML or XML tags) and that every document is in a directory called corpus-directory. The following steps are then performed:

Tokenize clean files to oneline format

  • Tokenize all the files in the folder to one line per sentence. This step is performed by using ixa-pipe-tok in the following shell script:
./ $lang corpus-directory

The tokenized version of each file in the directory corpus-directory will be saved with a .tok suffix.

  • cat to one large file: all the tokenized files are concatenated into one large file called corpus.tok.
cd corpus-directory
cat *.tok > corpus.tok

Format the corpus

  • Run the script like this to create the format required by Word2vec.
./ corpus.tok > corpus-word2vec.txt

Train K-means clusters on top of word2vec word embeddings

To train 400 class clusters using 8 threads in parallel we use the following command:

word2vec/word2vec -train corpus-word2vec.txt -output corpus-s50-w5.400 -cbow 0 -size 50 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 8 -classes 400
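As a quick sanity check on the result, the sketch below counts how many words were assigned to each class. It assumes that, with -classes set, the output file contains one "word class-id" pair per line separated by a space. The function name class_sizes is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper. Assumption: each input line is "word class_id".
class_sizes() {
  # Tally the second field, then sort numerically by class id.
  awk '{ n[$2]++ } END { for (c in n) print c, n[c] }' "$1" | sort -n
}

# Usage: class_sizes corpus-s50-w5.400
```

A very uneven distribution (a few huge classes and many near-empty ones) may be a hint that the corpus formatting step needs another look.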

Cleaning XML, HTML and other formats

There are many ways of cleaning XML, HTML and other metadata that often comes in corpora. As we will usually be processing very large amounts of text, we do not pay too much attention to detail and crudely remove every tag using regular expressions. In the scripts directory there is a shell script that takes a file as an argument like this:

./ file.html > file.txt

NOTE that this script will replace your original files with a cleaned version of them.
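To illustrate what crude regex-based tag removal looks like, here is a minimal sketch. It is not the script shipped in the scripts directory, and the function name strip_tags is hypothetical. It only handles tags that open and close on the same line, leaves character entities such as &amp; untouched, and writes to standard output instead of modifying any file.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not the repository's own script.
strip_tags() {
  # 1st expression: delete every <...> sequence found within a line.
  # 2nd expression: drop lines that are empty or whitespace-only afterwards.
  sed -e 's/<[^>]*>//g' -e '/^[[:space:]]*$/d' "$@"
}

# Usage: strip_tags file.html > file.txt
```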


If you are interested in using the Wikipedia for your language, here you can find many Wikipedia dumps already extracted to XML which can be directly fed to the script:


If your language is not among them, we usually use the Wikipedia Extractor and then the script:


Contact information

Rodrigo Agerri
University of the Basque Country (UPV/EHU)
E-20018 Donostia-San Sebastián
[email protected]

