Sentence detection

The sentence detection used by Across is rule-based, i.e. Across uses rules to determine where a sentence ends and a new sentence begins.

You can import or export language settings in XML format via Import or Export.

Sentence rules are structured as follows:

| Part | Function | Example |
|------|----------|---------|
| 1 | Specifies which separator the rule handles | [?] |
| 2 | Type of rule, i.e. whether the rule defines the end of a sentence (+) or not (-) | + or - |
| 3 | The actual rule | [?^_] |
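
The three-part structure can be illustrated with a small parser. This is only a sketch: the names `SentenceRule` and `parse_rule` are hypothetical and not part of Across, and the regular expression assumes that a rule is always written as `[separator]`, then `+` or `-`, then `[rule]`.

```python
import re
from typing import NamedTuple


class SentenceRule(NamedTuple):
    separator: str       # part 1: the separator the rule handles
    ends_sentence: bool  # part 2: True for "+", False for "-"
    pattern: str         # part 3: the actual rule (without the brackets)


# Assumed layout: [separator] followed by + or - followed by [rule]
_RULE_RE = re.compile(r"^\[(.+?)\]([+-])\[(.+)\]$")


def parse_rule(text: str) -> SentenceRule:
    """Split a rule string such as '[?]+[?^_]' into its three parts."""
    match = _RULE_RE.match(text)
    if match is None:
        raise ValueError(f"not a sentence rule: {text!r}")
    separator, kind, pattern = match.groups()
    return SentenceRule(separator, kind == "+", pattern)


rule = parse_rule("[?]+[?^_]")
print(rule.separator, rule.ends_sentence, rule.pattern)
```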

Default sentence rules

By default, Across uses the following sentence rules (Standard language set > Sentence rules):

| Rule | Function |
|------|----------|
| [!]+[!^_] | An exclamation mark followed by a white space is interpreted as the end of a sentence. |
| [!]-[!^_^a] | An exclamation mark followed by a white space and a lower case letter is not interpreted as the end of a sentence. |
| [.]+[.^_] | A period followed by a white space is interpreted as the end of a sentence. |
| [.]-[.^_^a] | A period followed by a white space and a lower case letter is not interpreted as the end of a sentence. |
| [.]-[^_^n.] | A space followed by a one-digit number and a period is not interpreted as the end of a sentence. Multi-digit numbers should be covered by additional rules with multiple placeholders n, for example [.]-[^_^n^n.] for a two-digit number. |
| [?]+[?^_] | A question mark followed by a white space is interpreted as the end of a sentence. |
| [?]-[?^_^a] | A question mark followed by a white space and a lower case letter is not interpreted as the end of a sentence. |
| [n]+[.\n] | A period followed by a backslash and the letter n is interpreted as the end of a sentence. The background to this rule is that the character sequence \n represents a line break, especially in the localization of software resources. In the following string, for example, the sentence ends after \n according to the rule: Cannot load file.\nError: 0x%x |
| [n]+[!\n] | An exclamation mark followed by a backslash and the letter n is interpreted as the end of a sentence. |
| [n]+[?\n] | A question mark followed by a backslash and the letter n is interpreted as the end of a sentence. |
| [t]+[.\t] | A period followed by a backslash and the letter t is interpreted as the end of a sentence. The background to this rule is that the character sequence \t represents a horizontal tab, especially in the localization of software resources. In the following string, for example, the sentence ends after \t according to the rule: &Find...\tCtrl+F |
| [t]+[!\t] | An exclamation mark followed by a backslash and the letter t is interpreted as the end of a sentence. |
| [t]+[?\t] | A question mark followed by a backslash and the letter t is interpreted as the end of a sentence. |
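
Because each placeholder n stands for exactly one digit, a separate rule is needed for every number length. The following sketch builds such rule strings; the helper name `number_rule` is hypothetical, and the resulting strings still have to be added to the sentence rules manually.

```python
def number_rule(digits: int) -> str:
    """Build the exception rule for a number with the given digit count.

    One ^n placeholder is emitted per digit, following the pattern of the
    default rule [.]-[^_^n.] for one-digit numbers.
    """
    if digits < 1:
        raise ValueError("digits must be at least 1")
    return "[.]-[^_" + "^n" * digits + ".]"


# Rules for one-digit to three-digit numbers
for count in range(1, 4):
    print(number_rule(count))
```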

Example:

[.]+[.^_]
Defines the end of a sentence: A period (.) followed by a white space (^_) is interpreted as the end of a sentence. In rules, the underscore _ stands for a white space. The ^ character before the underscore marks the underscore as a placeholder. Without the ^ character, the underscore would be interpreted as a normal character, i.e. as an actual underscore, and not as a wildcard.
[.]-[.^_^a]
Defines an exception to the rule above: If a period is followed by a white space and a lower case letter (^a), the sentence has not ended.

In the text "This is a sentence. This is another sentence.", the first period constitutes the end of a sentence, because it is followed by a white space and then a capital letter. In the text "But not. this!", however, the period does not end the sentence, because it is followed by a white space and a lower case letter.
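
The interplay of the two rules can be sketched in code. The following is an illustration under stated assumptions, not the Across implementation: it assumes that every placeholder and every normal character matches exactly one character, that a matching exception rule (-) overrides a matching end-of-sentence rule (+), and that ^a, ^n and ^_ correspond to Python's `str.islower`, `str.isdigit` and `str.isspace`. All function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# (separator, ends_sentence, pattern), e.g. (".", True, ".^_")
Rule = Tuple[str, bool, str]

# Assumed meaning of the placeholders that follow the ^ character
PLACEHOLDERS: Dict[str, Callable[[str], bool]] = {
    "_": str.isspace,  # white space
    "a": str.islower,  # lower case letter
    "n": str.isdigit,  # digit
}


def tokenize(pattern: str) -> List[Tuple[str, str]]:
    """Split a rule pattern into placeholder and literal tokens."""
    tokens, i = [], 0
    while i < len(pattern):
        if pattern[i] == "^" and i + 1 < len(pattern) and pattern[i + 1] in PLACEHOLDERS:
            tokens.append(("placeholder", pattern[i + 1]))
            i += 2
        else:
            tokens.append(("literal", pattern[i]))
            i += 1
    return tokens


def rule_matches(rule: Rule, text: str, pos: int) -> bool:
    """Check whether the rule pattern fits around the separator at text[pos]."""
    separator, _, pattern = rule
    tokens = tokenize(pattern)
    if ("literal", separator) not in tokens:
        return False
    # Align the separator inside the pattern with the separator in the text
    start = pos - tokens.index(("literal", separator))
    if start < 0 or start + len(tokens) > len(text):
        return False
    for offset, (kind, value) in enumerate(tokens):
        char = text[start + offset]
        if kind == "placeholder":
            if not PLACEHOLDERS[value](char):
                return False
        elif char != value:
            return False
    return True


def is_sentence_end(text: str, pos: int, rules: List[Rule]) -> bool:
    """True if a + rule matches at pos and no - rule matches there."""
    applicable = [rule for rule in rules if rule[0] == text[pos]]
    ends = any(r[1] and rule_matches(r, text, pos) for r in applicable)
    blocked = any(not r[1] and rule_matches(r, text, pos) for r in applicable)
    return ends and not blocked


rules = [(".", True, ".^_"), (".", False, ".^_^a")]
first = "This is a sentence. This is another sentence."
second = "But not. this!"
print(is_sentence_end(first, first.index("."), rules))    # True
print(is_sentence_end(second, second.index("."), rules))  # False
```

In this sketch, the period at the very end of a text is not reported as a sentence end, because no white space follows it; a real implementation would presumably treat the end of the text separately.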

Abbreviations

The definition of abbreviations represents a special case of the sentence rules: An abbreviation in a source text is only identified as such, and not taken to represent the end of a sentence, if it has been defined as an abbreviation.

Uppercase and lowercase spelling is not taken into consideration for abbreviations. The abbreviation "max." will be identified as such even if a sentence contains "Max." (at the beginning of the sentence).
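
A case-insensitive lookup of this kind can be sketched as follows. The function name `is_abbreviation` is hypothetical, and using `str.casefold` for the comparison is an assumption; how Across compares the spellings internally is not documented here.

```python
from typing import Iterable


def is_abbreviation(token: str, abbreviations: Iterable[str]) -> bool:
    """Compare a token with the defined abbreviations, ignoring case."""
    folded = {entry.casefold() for entry in abbreviations}
    return token.casefold() in folded


defined = ["max.", "min.", "approx."]
print(is_abbreviation("Max.", defined))   # True
print(is_abbreviation("Tree.", defined))  # False
```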

In the list of abbreviations, the abbreviations are sorted in ascending order according to the ASCII code of the characters. Abbreviations with an initial umlaut or accent are displayed at the end of the list of abbreviations.
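
The effect of sorting by character code can be reproduced with a short sketch. It assumes that the order corresponds to a plain comparison of code points, which is what Python's `ord` returns; the sample abbreviations and the function name `sort_abbreviations` are made up for illustration.

```python
from typing import List


def sort_abbreviations(abbreviations: List[str]) -> List[str]:
    """Sort in ascending order of the character codes (code points)."""
    return sorted(abbreviations, key=lambda entry: [ord(char) for char in entry])


sample = ["Äq.", "z.B.", "ca.", "max."]
print(sort_abbreviations(sample))  # ['ca.', 'max.', 'z.B.', 'Äq.']
```

The umlaut Ä has a higher character code than any unaccented Latin letter, which is why "Äq." lands at the end of the list.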

Abbreviations added while editing the sentence detection in crossDesk are automatically added to the language settings.