Skip to content

Files

Latest commit

author
Austin Matthews
Jan 22, 2015
3517ce0 · Jan 22, 2015

History

History

hw1

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
Jan 22, 2015
Jan 22, 2015
Jan 22, 2015
Jan 22, 2015
Jan 22, 2015

There are three Python programs here (-h for usage):

  • ./align aligns words using Dice's coefficient.
  • ./check checks for out-of-bounds alignment points.
  • ./grade computes alignment error rate.

The commands are designed to work in a pipeline. For instance, this is a valid invocation:

./align -t 0.9 -n 1000 | ./check | ./grade -n 5

The data/ directory contains a fragment of the German/English Europarl corpus.

  • data/dev-test-train.de-en is the German/English parallel data to be aligned. The first 150 sentences are for development; the next 150 is a blind set you will be evaluated on; and the remainder of the file is unannotated parallel data.

  • data/dev.align contains 150 manual alignments corresponding to the first 150 sentences of the parallel corpus. When you run ./check these are used to compute the alignment error rate. You may use these in any way you choose. The notation i-j means the word at position i (0-indexed) in the German sentence is aligned to the word at position j in the English sentence; the notation i?j means they are "probably" aligned.