MDParser (multilingual dependency parser) is a data-driven system that can parse text in any language for which training data is available. The parser can create both unlabeled and labeled dependency structures; the number of possible relation types depends on the granularity of the training data. It currently supports two standard output formats, Stanford and CoNLL-X, and computes dependency relations for German and English.
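As an illustration of the CoNLL-X output format, the sketch below serialises a small dependency structure into the ten tab-separated CoNLL-X columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). The `Token` type and `to_conllx` function are hypothetical names made up for this example and are not part of MDParser; the sample sentence and its labels are invented as well.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    """Hypothetical token record, not an MDParser type."""
    form: str
    pos: str
    head: int                     # 1-based index of the head token, 0 for the root
    deprel: Optional[str] = None  # None means the structure is unlabeled


def to_conllx(tokens: List[Token]) -> str:
    """Serialise one sentence as CoNLL-X: one token per line, ten columns."""
    lines = []
    for index, tok in enumerate(tokens, start=1):
        columns = [
            str(index),         # ID
            tok.form,           # FORM
            "_",                # LEMMA (unknown here)
            tok.pos,            # CPOSTAG
            tok.pos,            # POSTAG
            "_",                # FEATS
            str(tok.head),      # HEAD
            tok.deprel or "_",  # DEPREL ("_" for unlabeled structures)
            "_",                # PHEAD
            "_",                # PDEPREL
        ]
        lines.append("\t".join(columns))
    # Sentences in a CoNLL-X file are separated by a blank line.
    return "\n".join(lines) + "\n"


# Invented example sentence with labeled relations.
sentence = [Token("She", "PRP", 2, "nsubj"), Token("sleeps", "VBZ", 0, "root")]
print(to_conllx(sentence))
```

Leaving `deprel` unset yields an unlabeled structure, in which only the HEAD column carries dependency information.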
The system's models are based on features extracted from the words of a sentence, including word forms and part-of-speech tags. To process previously unannotated text, MDParser therefore also includes the following preprocessing components:
- a sentence splitter, since the parser constructs a dependency structure for individual sentences
- a tokenizer, in order to recognise the elements between which the dependency relations will be built
- a part-of-speech tagger, in order to determine the part-of-speech tags, which are among the most important factors in constructing the dependency structure.
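The order of these preprocessing steps can be sketched as a small pipeline. This is a minimal illustration only: the regular expressions, the toy lexicon, and all function names are assumptions made for this example and do not reflect MDParser's actual components, which are trained on data.

```python
import re
from typing import Dict, List, Tuple

# Toy lexicon standing in for a trained tagger (assumption for illustration).
TOY_LEXICON: Dict[str, str] = {
    "the": "DT", "dog": "NN", "barks": "VBZ",
    "it": "PRP", "sleeps": "VBZ", ".": ".",
}


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence: str) -> List[str]:
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def tag(tokens: List[str]) -> List[Tuple[str, str]]:
    """Look each token up in the toy lexicon, defaulting to 'NN'."""
    return [(tok, TOY_LEXICON.get(tok.lower(), "NN")) for tok in tokens]


def preprocess(text: str) -> List[List[Tuple[str, str]]]:
    """Sentence splitting, then tokenization, then tagging, per sentence."""
    return [tag(tokenize(sentence)) for sentence in split_sentences(text)]


for tagged_sentence in preprocess("The dog barks. It sleeps."):
    print(tagged_sentence)
```

Each inner list is one sentence of (word form, tag) pairs, which is the kind of input a dependency parser needs in order to build one structure per sentence.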
MDParser is an especially fast system (~ 10 sentences / second), which makes it particularly suitable for processing very large amounts of data and for use as a component of larger applications that need dependency structures. It has already been tested on several languages, including German and English, and currently achieves quite competitive results (86% - 88%), considering that it is based on a fast linear classification approach and a deterministic parsing strategy.