Creating Statistical Lexica
Proceed as follows to create a statistical lexicon:
- Start the lexicon creation via the icon in the crossMining toolbar or via the menu item File > Create Lexicon.
Tip: When creating lexica, the settings defined under Tools > Settings are used.
Further information is available here.
- The first step is the selection of the languages in which the lexicon is to be created. crossMining automatically determines the languages set up in Across. Select a source language and then a target language (and sublanguages if applicable). You can freely combine the source and target languages and also define multiple language pairs. A separate lexicon is created for each language pair.
- Now you can define crossTank and/or crossTerm filters to limit the lexicon creation to certain crossTank and crossTerm entries or ranges. For crossTerm, you can filter by instances, relations, and subjects. For crossTank, you can filter by users, subjects, projects, relations, and user-defined system attributes.
- Subsequently, you can set the output folder for the lexicon. By default, a subfolder of the "Common Files" directory in the "Program Files" folder is used for this purpose.
From this subdirectory, the statistical lexica are read and deployed to the Across Clients.
If you wish, you can select a different output folder. For example, this enables you to optimize the creation of the statistical lexica for test purposes before you store the lexica in the default output folder for deployment to the clients. To select a different output folder, disable the option Use default output folder and click Browse to select a different folder.
- Now the lexica are created. This process comprises the following steps:
- Compilation of the crossTerm data in the selected languages (this step is skipped if the option for including terms (see above) is disabled).
- Compilation of the crossTank data in the selected languages
- First phase of the lexicon creation: probability calculation of possible word equivalents.
- Second phase of the lexicon creation: inclusion of the word position in the probability calculation.
- Third phase of the lexicon creation: determination of possible equivalents of multi-word combinations (e.g. English "table of contents" vs. German "Inhaltsverzeichnis"), applying the minimum probability and frequency values configured in the settings.
Tip: Under View > Process output, you can have the process steps currently performed by crossMining displayed in a pane.
- A message is displayed upon completion of the lexicon creation. Click OK.
- The statistical lexicon has been saved to the selected storage location as a .dic file. The name of the file consists of the installation GUID of the Across Language Server and the locale identifiers (LCIDs) of the source and target languages.
You can now use the created lexicon for the auto-completion functions and the terminology harvesting functions of crossMining.
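The first phase described above (probability calculation of possible word equivalents) can be illustrated with a small sketch. This documentation does not state which algorithm crossMining uses internally, so the example below is only an assumption: it applies IBM Model 1 style expectation-maximization, a well-known technique for estimating word translation probabilities from aligned sentence pairs. The function name train_word_probabilities and the sample data are hypothetical and are not part of crossMining.

```python
from collections import defaultdict


def train_word_probabilities(sentence_pairs, iterations=5):
    """Estimate P(target_word | source_word) from aligned sentence pairs.

    Illustrative IBM Model 1 style EM; NOT crossMining's actual implementation.
    sentence_pairs: list of (source_tokens, target_tokens) tuples.
    Returns a dict keyed by (target_word, source_word).
    """
    target_vocab = {t for _, tgt in sentence_pairs for t in tgt}
    # Start from a uniform distribution over all target words.
    uniform = 1.0 / len(target_vocab)
    prob = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # expected counts per source word
        # E-step: distribute each target word over the source words
        # of the same sentence pair, weighted by the current estimates.
        for src, tgt in sentence_pairs:
            for t in tgt:
                norm = sum(prob[(t, s)] for s in src)
                for s in src:
                    share = prob[(t, s)] / norm
                    count[(t, s)] += share
                    total[s] += share
        # M-step: renormalize the expected counts into probabilities.
        for (t, s), c in count.items():
            prob[(t, s)] = c / total[s]
    return dict(prob)


# Hypothetical translation-memory content (English source, German target).
pairs = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]
probabilities = train_word_probabilities(pairs, iterations=10)
```

With each iteration, the probability mass for a source word such as "book" shifts toward the target word it consistently co-occurs with ("Buch"). This is the kind of development over iterations that a probability graph would make visible.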
Process Graphs
Under View > Graph, you can view the development of probabilities during the generation of lexica in graphical form. The tabs allow you to select the graph for the first or second phase of the lexicon creation. The iterations are displayed on the x axis and the probabilities on the y axis.
The creation of statistical lexica can be optimized by analyzing the graphs, e.g. by duly adjusting the number of iterations.
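As a rough illustration of how a graph can guide the choice of the iteration count, the following sketch looks for the point at which the curve flattens. It assumes that the per-iteration values have been read off the graph by hand; the function suggest_iteration_count, the threshold min_gain, and the sample values are hypothetical and do not correspond to any crossMining setting or file format.

```python
def suggest_iteration_count(values, min_gain=0.01):
    """Return the number of iterations after which the curve flattens.

    values: one probability value per iteration, in order (hypothetical data).
    min_gain: smallest change between two iterations still considered useful.
    """
    for i in range(1, len(values)):
        # Stop at the first iteration whose successor barely changes the value.
        if abs(values[i] - values[i - 1]) < min_gain:
            return i
    # The curve never flattened within the recorded iterations.
    return len(values)


# Hypothetical values read from a graph, one per iteration.
observed = [0.50, 0.64, 0.75, 0.752, 0.753]
suggested = suggest_iteration_count(observed)
```

For these sample values the change after the third iteration is below the threshold, so additional iterations would add processing time without a noticeable effect on the probabilities.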
If you select a section of the graph while keeping the left mouse button pressed, the respective section will be enlarged.
Click Reset zoom to restore the original display.
Using File > Save graph and File > Load graph, you can save and load graphs as XML files. For example, this enables you to compare different graphs with each other.