Best Way to Tokenize German Texts

The landscape of NLP tools is vast. However, most of those tools still don’t work as well for other languages as they do for English. Even the simplest NLP tasks, such as tokenization, become a challenge for mainstream tools like spaCy, especially when the texts get more conversational and domain-specific.

With some searching, though, you might find tools that perform much better for isolated tasks. SoMaJo, a self-proclaimed state-of-the-art tokenizer for German texts, is backed by academic research and worked very well for my highly domain-specific and conversational dataset.

Installation and Usage

pip install SoMaJo

The only default you have to be aware of is that paragraphs are expected to be separated by new lines. If you have your text prepared like that in a file, e.g. corpus.txt, you can write your tokens to a new file like this:

somajo-tokenizer corpus.txt > tokens.txt

For further functionality, such as tokenizing XML files or running the tokenizer on multiple processes, refer to the tool’s documentation.
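SoMaJo also has a Python API if you would rather stay inside a script. The following is a minimal sketch along the lines of the library’s documented usage; the exact arguments (e.g. split_camel_case) may differ between versions:

from somajo import SoMaJo

# "de_CMC" selects the German model for computer-mediated communication / web texts
tokenizer = SoMaJo("de_CMC", split_camel_case=True)

paragraphs = ["Ca. 90min mit newmotion geladen.", "Säule hat keinen RFID-Leser usw."]

# tokenize_text yields sentences; each sentence is a list of Token objects
for sentence in tokenizer.tokenize_text(paragraphs):
    print([token.text for token in sentence])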

Performance Example

import spacy

# Load spaCy's small German model and run its default pipeline on the example text
nlp = spacy.load('de_core_news_sm')
doc = nlp(u"""
Ca. 90min mit newmotion geladen, weil ich mit Maingau/EinfachStromLaden keine Verbindung über die App bekam. 
Säule hat keinen RFID-Leser usw.
2. Buchse seit 2 Tagen mit Kommunalfahrzeug/EWV blockiert. 
""")

# Print one token per line
for token in doc:
    print(token.text)

… yields these tokens:

Ca  
.  
90min
mit  
newmotion  
geladen  
,  
weil  
ich  
mit  
Maingau/EinfachStromLaden  
keine  
Verbindung  
über  
die  
App  
bekam  
.  
Säule  
hat  
keinen  
RFID-Leser    
usw.    
2  
.  
Buchse  
seit  
2  
Tagen  
mit
Kommunalfahrzeug/EWV  
blockiert  
.  

Doesn’t recognize Ca. as one token:
Ca
.

Doesn’t recognize 90min as two tokens:
90min

Doesn’t split the slash-joined compound:
Maingau/EinfachStromLaden

Doesn’t recognize the end of the sentence after the abbreviation:
usw.

Doesn’t recognize that this full stop isn’t a sentence end and that it should stay together as the ordinal 2. (as in “second”):
2
.

Besides the fact that the sentences are split quite wrongly and the slash-joined words are not split, it gets even tougher when spaCy tries to PoS-tag the text and 90min is tagged as a VERB…
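You can check this yourself by printing the tags from the same doc:

# Print the coarse part-of-speech tag spaCy assigns to each token
for token in doc:
    print(token.text, token.pos_)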

SoMaJo handles a lot of those German and non-German special cases and splits the sentences above perfectly.
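As a rough sketch of how to verify that with the Python API shown earlier, you can feed the same text in and print one detected sentence per line to inspect the splitting:

from somajo import SoMaJo

tokenizer = SoMaJo("de_CMC")
paragraphs = [
    "Ca. 90min mit newmotion geladen, weil ich mit Maingau/EinfachStromLaden "
    "keine Verbindung über die App bekam. Säule hat keinen RFID-Leser usw. "
    "2. Buchse seit 2 Tagen mit Kommunalfahrzeug/EWV blockiert."
]

# Print each detected sentence on its own line
for sentence in tokenizer.tokenize_text(paragraphs):
    print(" ".join(token.text for token in sentence))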

Extension

The tool is easy to extend for the special cases of your dataset. For example, you can make sure that E.ON is handled as one word by adding it to the library’s single_token_abbreviations_de.txt file.
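A minimal sketch of that edit in Python follows; the location of the data files inside the installed package is an assumption, so check where the file actually lives in your version before writing to it:

from pathlib import Path
import somajo

# Assumption: the abbreviation lists ship with the installed package,
# e.g. under .../somajo/data/single_token_abbreviations_de.txt
abbrev_file = Path(somajo.__file__).parent / "data" / "single_token_abbreviations_de.txt"
print(abbrev_file, abbrev_file.exists())

# Append the special case so that "E.ON" is kept as a single token
with abbrev_file.open("a", encoding="utf-8") as f:
    f.write("E.ON\n")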

Next Steps

Unfortunately, it doesn’t seem to be possible to simply load already tokenized text into spaCy. You would rather have to customize spaCy’s own tokenizer to get better results with it.

But from here you could feed your tokens to something like NLTK’s corpus reader and do basic statistics on the texts, as well as preprocess your corpus further. Or you could tag them with part-of-speech tags first. The latter I will describe for German texts in one of my next posts, again using a research-proven tool from the SoMaJo creator: SoMeWeTa.
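As a small example of such basic statistics, here is a sketch that skips the corpus reader and simply runs NLTK’s FreqDist over the tokens.txt produced earlier (one token per line):

import nltk

# Read the newline-separated tokens written by somajo-tokenizer
with open("tokens.txt", encoding="utf-8") as f:
    tokens = f.read().split()

# Basic corpus statistics: token count, vocabulary size, most frequent tokens
freq = nltk.FreqDist(tokens)
print(len(tokens), "tokens,", len(freq), "types")
print(freq.most_common(10))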
