Unknown words

The default translation mode allows the model to produce the <unk> symbol when it is not sure of the specific target word.

Often, <unk> symbols correspond to proper names that can be directly transposed between languages. The -replace_unk option substitutes each <unk> with the source word that has the highest attention weight. The -replace_unk_tagged option does the same, but wraps the token in a ⦅unk:xxxxx⦆ tag.
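As a rough illustration of the replacement described above, the following sketch picks, for each <unk> in the output, the source word with the highest attention weight. It is not the toolkit's actual implementation: the function name replace_unk, its arguments, and the layout of the attention matrix (one row per target token, one column per source token) are assumptions made for this example.

```python
from typing import List, Sequence

UNK = "<unk>"


def replace_unk(
    source: Sequence[str],
    target: Sequence[str],
    attention: Sequence[Sequence[float]],
    tagged: bool = False,
) -> List[str]:
    """Replace each <unk> in `target` with the most-attended source word.

    Hypothetical helper: attention[t][s] is assumed to hold the weight that
    target position t puts on source position s.
    """
    output = []
    for t, token in enumerate(target):
        if token != UNK:
            output.append(token)
            continue
        weights = attention[t]
        # Index of the source word with the highest attention weight
        # (the first one wins on ties).
        best = max(range(len(source)), key=lambda s: weights[s])
        word = source[best]
        # The tagged variant wraps the copied word in a ⦅unk:...⦆ marker.
        output.append(f"⦅unk:{word}⦆" if tagged else word)
    return output


if __name__ == "__main__":
    src = ["Herr", "Schmidt", "schläft"]
    tgt = ["Mr", UNK, "sleeps"]
    att = [[0.8, 0.1, 0.1], [0.1, 0.85, 0.05], [0.05, 0.05, 0.9]]
    print(replace_unk(src, tgt, att))
    print(replace_unk(src, tgt, att, tagged=True))
```

Copying works well here because a proper name such as "Schmidt" is identical in both languages; for ordinary words the copied source token would stay untranslated.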

Phrase table

Alternatively, advanced users may prefer to provide a pre-constructed phrase table from an external aligner (such as fast_align) using the -phrase_table option to allow for non-identity replacement.

Instead of copying the source token with the highest attention, the translator looks it up in the phrase table for a possible translation. The source token is copied only if no valid replacement is found.

The phrase table is a file with one translation per line in the format:

source|||target

Here, source and target are case-sensitive single tokens.
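To make the file format and the fallback behaviour concrete, here is a minimal sketch that parses source|||target lines into a dictionary and looks up a source token, copying it unchanged when no entry exists. The helper names load_phrase_table and replace_token are hypothetical, and skipping malformed lines and letting later duplicates override earlier ones are assumptions of this example, not documented behaviour.

```python
from typing import Dict, Iterable


def load_phrase_table(lines: Iterable[str]) -> Dict[str, str]:
    """Parse 'source|||target' lines into a case-sensitive lookup table."""
    table: Dict[str, str] = {}
    for line in lines:
        line = line.rstrip("\n")
        source, separator, target = line.partition("|||")
        # Assumption: lines without the separator or with an empty side
        # are silently skipped.
        if not separator or not source or not target:
            continue
        # Assumption: a later duplicate entry overrides an earlier one.
        table[source] = target
    return table


def replace_token(source_token: str, table: Dict[str, str]) -> str:
    """Return the phrase-table translation, or copy the source token."""
    return table.get(source_token, source_token)


if __name__ == "__main__":
    table = load_phrase_table(["Haus|||house\n", "haus|||home\n"])
    print(replace_token("Haus", table))     # found in the table
    print(replace_token("Schmidt", table))  # not found, copied as-is
```

Because keys are compared exactly, "Haus" and "haus" are distinct entries, which matches the case-sensitive matching described above.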

Workarounds

Several techniques exist to minimize the out-of-vocabulary issue:

  • sub-tokenization such as BPE or "wordpiece" to simulate an open vocabulary
  • mixed word/character models, as described in Wu et al. (2016)