Statistical Lexica
Statistical lexica are the basis for working with the various functions of crossMining. They are created automatically in several steps, mainly from the crossTank data of an Across Language Server. Optionally, existing crossTerm terminology can also be taken into account when creating lexica.
Statistical lexica can also be created from phrase tables generated by Moses, a free system for statistical machine translation (SMT).
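For orientation, the following minimal sketch shows how a single line of a Moses phrase table can be read. It assumes the standard Moses text format, in which fields are separated by "|||" and the third field holds the space-separated scores. The names PhraseEntry and parse_phrase_table_line and the sample line are hypothetical and serve only as illustration. The sketch does not describe how crossMining itself imports phrase tables.

```python
from typing import NamedTuple, Tuple


class PhraseEntry(NamedTuple):
    """One phrase pair from a Moses phrase table (hypothetical helper type)."""
    source: str
    target: str
    scores: Tuple[float, ...]


def parse_phrase_table_line(line: str) -> PhraseEntry:
    # Fields are separated by "|||". Only the first three are used here:
    # source phrase, target phrase, and the space-separated scores.
    fields = [field.strip() for field in line.split("|||")]
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(fields)}")
    source, target, raw_scores = fields[:3]
    scores = tuple(float(value) for value in raw_scores.split())
    return PhraseEntry(source, target, scores)


# Invented sample line, for illustration only.
sample = "das Haus ||| the house ||| 0.8 0.6 0.7 0.5 ||| 0-0 1-1 ||| 10 12 9"
entry = parse_phrase_table_line(sample)
print(entry.source, "->", entry.target, entry.scores)
```

Any further fields, such as word alignments and counts, are ignored in this sketch.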
Statistical lexica have the file extension DIC and are created for a particular language pair. The other crossMining functions can use a lexicon in one direction only, namely the language direction selected when the lexicon was created.
Before you use the statistical lexica for the other crossMining functions, test the lexicon creation thoroughly with your own data, if necessary with professional help, to determine the most suitable values and settings for that data.
A certain amount of data (translation units) is necessary for the efficient use of crossMining. The smaller the amount of data available for the calculation of probabilities, the poorer the results will be. Generally, about 10,000 translation units (per language pair) should be provided, though this does not mean that good results cannot be achieved with fewer translation units.
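The effect of the data volume can be illustrated with a simple relative-frequency estimate. The following sketch is not the method used by crossMining. The function estimate_translation_probs and the word pairs are hypothetical, and the sketch assumes that aligned source and target words are already available.

```python
from collections import Counter, defaultdict


def estimate_translation_probs(pairs):
    """Relative-frequency estimate of p(target | source) from aligned word pairs."""
    counts = defaultdict(Counter)
    for source, target in pairs:
        counts[source][target] += 1
    probs = {}
    for source, targets in counts.items():
        total = sum(targets.values())
        probs[source] = {target: count / total for target, count in targets.items()}
    return probs


# Invented data: one observation versus ten observations of the same source word.
small = [("Haus", "house")]
larger = [("Haus", "house")] * 8 + [("Haus", "home")] * 2

print(estimate_translation_probs(small))
print(estimate_translation_probs(larger))
```

With a single observation, the estimate assigns the full probability to the only translation seen. With ten observations, a second translation becomes visible and the probabilities are distributed accordingly.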
The quality of the results also depends on the language or language combination. Languages with a simpler morphological structure, such as English, yield good results even with a relatively small amount of data. In contrast, highly inflectional languages such as Finnish require a larger amount of training data before probabilities can be determined satisfactorily. The language direction also matters.
Because lexicon creation is very resource-intensive, it may take some time, depending on the data volume. You should therefore run it only when the computer has little or nothing else to do.
You can create statistical lexica as often as necessary. Creating new lexica is recommended especially when the crossTank data have changed substantially, e.g. after importing a large translation memory or upon completion of a major translation project. Some users may want to create lexica at regular intervals, e.g. once a month.