Import of Moses SMT Phrase Tables
crossMining also enables the import of phrase tables of Moses SMT, a free system for statistical machine translation.
On the basis of the phrase tables, statistical lexica can be created and used for terminology harvesting or auto-completion in crossDesk, just like the conventional lexica created on the basis of the crossTank data.
The phrase tables created with Moses SMT are text files containing source-language phrases (e.g. individual words, several words, or sentences) and their – statistically determined – target-language equivalents including statistical information.
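As a rough illustration of what such a text file contains, the following Python sketch reads individual phrase-table lines. It assumes the field layout commonly documented for Moses phrase tables: fields separated by "|||", starting with the source phrase, the target phrase, and a space-separated list of scores. The sample data and all names (PhraseEntry, parse_line, read_phrase_table) are invented for illustration and are not part of crossMining. How the wizard itself reads the file and which of the scores it evaluates is not specified here.

```python
import gzip
from typing import Iterator, NamedTuple, Tuple


class PhraseEntry(NamedTuple):
    """One phrase pair with its statistical scores (hypothetical structure)."""
    source: str
    target: str
    scores: Tuple[float, ...]


def parse_line(line: str) -> PhraseEntry:
    """Split one phrase-table line on the '|||' field separator."""
    fields = [field.strip() for field in line.rstrip("\n").split("|||")]
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(fields)}")
    source, target, raw_scores = fields[0], fields[1], fields[2]
    # Any further fields (e.g. alignments, counts) are ignored in this sketch.
    scores = tuple(float(value) for value in raw_scores.split())
    return PhraseEntry(source, target, scores)


def read_phrase_table(path: str) -> Iterator[PhraseEntry]:
    """Yield entries from a plain-text or gzip-compressed phrase table."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                yield parse_line(line)


# Made-up sample line, not taken from a real phrase table.
sample = "das Haus ||| the house ||| 0.8 0.6 0.7 0.5 ||| 0-0 1-1 ||| 10 12 8"
entry = parse_line(sample)
```

Reading the file line by line rather than loading it as a whole keeps memory use low, which matters for tables with several million entries.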
The Dictionary Import Wizard assists you in creating a statistical lexicon on the basis of a Moses SMT phrase table.
Proceed as follows to import a Moses SMT phrase table and create a statistical lexicon:
- Start the Dictionary Import Wizard via File > Import.
- Once the wizard has started, click Next.
- Select the source and target languages (and the sublanguage, if applicable) contained in the phrase table.
- Click Next.
- Select the storage location of the phrase table by clicking Browse.
The phrase table may exist in the form of a plain TXT file or a compressed GZ file.
Under the option Count co-occurrences, you can specify a training set consisting of a parallel text pair (one text in the source language and one in the target language). In the next wizard step, you can then set a minimum co-occurrence count for the lexicon creation, i.e. a minimum number of search hits in both the source text and the target text of the training set.
- Click Next.
- Determine the minimum probability from which terms are to be proposed. Moreover, you can specify the minimum probability value of correspondence of the source and target-language terms.
Furthermore, you can determine that terms are to be proposed only above a specified co-occurrence count.
- Finally, you can exclude phrase-table entries from the lexicon creation. To do so, define words that must not occur at the beginning or end of the source-language and target-language entries. Click Edit to specify the words. You can enter the words manually, import them from a file, and/or import the stopword list of the respective language from Across. Click Save to finish the definition of words.
- Click Next.
- Set the output folder for the lexicon. By default, a subfolder of the "Common Files" directory in the "Program Files" folder is used for this purpose.
- Click Start Import to start creating the statistical lexicon.
Attention: As the creation of the lexicon is very resource-intensive, it may take some time, depending on the size of the chosen phrase table. Therefore, you should only run the lexicon creation at times when the computer has nothing or little else to do.
- Upon completion of the lexicon creation, the lexicon is displayed with the determined equivalents, the respective probability, and the co-occurrence count.
As Moses SMT phrase tables can be very large and contain several million entries, the statistical lexica generated on the basis of these can also be very large. Therefore, you can narrow down the determined equivalents by means of extensive filter functions.
- To edit the created lexicon, you can define filter criteria.
First, select one of the following three filters:
- Text value: Filter on the basis of a particular text or character string.
- Text length: Filter on the basis of a particular number of characters.
- Number: Filter on the basis of the probability or co-occurrence count.
After selecting a filter, you can select the column to which the filter criterion is to be applied. For Text value and Text length, you can select either the source text or the target text. For Number, you can determine that the filter is to refer to the probability or to the co-occurrence count.
Subsequently, enter the value for the filter, e.g. a word or special characters (for Text value) or a numeric value (for Text length and Number). For numeric values, you can use one of the following operators: > (greater than), >= (greater than or equal to), < (less than), <= (less than or equal to), = (equal to).
Click Add to adopt the filter criterion.
Attention: Please note that the filter process will take place immediately after adding a filter criterion. For large lexica, this might take some time.
- Click Save to save the statistical lexicon to the selected output folder.
- A message is displayed after the lexicon is saved to the output folder.
You can now use the lexicon for the auto-completion functions and the terminology harvesting functions of crossDesk, just like conventional lexica created on the basis of the crossTank data.
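The selection criteria from the wizard (minimum probability, minimum co-occurrence count, excluded words at the beginning or end of an entry) and the three filter types can be pictured with a short Python sketch. This is only an approximation under stated assumptions: the Entry structure and all function names are hypothetical, the sketch assumes that entries matching a filter are the ones kept, and it assumes that word comparison is case-insensitive. How crossMining actually implements these steps is not documented here.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Set


@dataclass
class Entry:
    """One lexicon row (hypothetical structure for illustration)."""
    source: str
    target: str
    probability: float
    cooccurrence: int


def edge_blocked(phrase: str, excluded: Set[str]) -> bool:
    """True if the first or last word of the phrase is an excluded word."""
    words = phrase.lower().split()
    return bool(words) and (words[0] in excluded or words[-1] in excluded)


def keep_at_creation(entry: Entry, min_probability: float, min_count: int,
                     source_excluded: Set[str], target_excluded: Set[str]) -> bool:
    """Apply the thresholds and word exclusions set in the wizard."""
    return (entry.probability >= min_probability
            and entry.cooccurrence >= min_count
            and not edge_blocked(entry.source, source_excluded)
            and not edge_blocked(entry.target, target_excluded))


# The comparison operators offered for numeric filter values.
OPERATORS = {">": operator.gt, ">=": operator.ge,
             "<": operator.lt, "<=": operator.le, "=": operator.eq}

EntryFilter = Callable[[Entry], bool]


def text_value_filter(column: str, text: str) -> EntryFilter:
    """Text value: match entries whose column contains the given string."""
    return lambda entry: text in getattr(entry, column)


def text_length_filter(column: str, op: str, length: int) -> EntryFilter:
    """Text length: compare the number of characters in the column."""
    compare = OPERATORS[op]
    return lambda entry: compare(len(getattr(entry, column)), length)


def number_filter(column: str, op: str, value: float) -> EntryFilter:
    """Number: compare the probability or the co-occurrence count."""
    compare = OPERATORS[op]
    return lambda entry: compare(getattr(entry, column), value)
```

In this sketch, a criterion such as number_filter("probability", ">=", 0.75) returns a function that can be applied to every entry of the lexicon, and several criteria can be combined by applying them one after another.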