Erik Dörnenburg

Extension of the EMILE algorithm for inductive learning of context-free grammars for natural languages.

The following passage is quoted from the introduction:

“Most of the customary methods for knowledge discovery are not directly applicable to free-text entries in databases. These methods require a formal and rigid structure in the input against which passages of texts in natural languages with their ambiguity and richness of form appear to be unstructured, almost arbitrary. It is, however, also obvious that natural languages do adhere to certain rules and linguistic research has revealed clear principles, structure and functional dependencies on all levels of language, ranging from word formation to discourse. For a successful application of knowledge discovery techniques to free-texts it therefore seems to be necessary to transform the structure inherent in these texts into a more accessible, explicit form. This can be achieved in several steps starting with a syntactic analysis.

“It is the goal of this project to devise a procedure that given short texts written by different authors learns a grammar that can be utilised to process other texts written by same or even different authors. The result of processing unknown input in the form of sentences should be a structural description of each sentence with categories that were established automatically. For the intended area of application, i.e. information extraction and data-mining, the algorithm must also be suited to handle large amounts of data efficiently.

“Pieter Adriaans has carried out extensive work in the area of language learning. His research was motivated differently as it was his foremost interest to establish a model of dialogical language learning, to examine if and under which circumstances natural languages can be learnt and also to provide a formal proof for his findings. In the course of his research he developed several learning algorithms and a later version of his Emile algorithm seems to be suited to the task outlined above. In this thesis I will investigate whether the Emile algorithm with its theoretical background can indeed be used successfully in a very practical situation. There are, of course, a variety of different approaches to the problem and my concentration on the Emile algorithm should not be misunderstood to imply that only the Emile algorithm can be used to tackle the problem. This thesis is merely an exploration of one of the possible solutions.

“In the first chapter I will present an overview of natural language processing in general and theories of grammar in particular. Some of these theories have a long tradition in computational models of language and I assume that the reader will be somewhat familiar with them. Nevertheless I present a brief overview in order to establish a consistent terminology and to explicate certain points which might be open to interpretation otherwise. A special emphasis is placed on those properties that are relevant in the envisioned area of application. I will also present an alternative theory of grammar, mainly because Adriaans' research is based on it. In chapter2 Iwill introduce some concepts from machine learning theory and I will determine where Adriaans' approach is located in the field. I will also present other approaches to the problem of language/grammar learning, some of which can be understood as precursors of Adriaans' approach. Finally, I will describe in some detail the theoretical framework of the Emile algorithm. It might seem that chapters 1 and 2 are overly extensive in relation to the rest of the thesis but I believe that a sound understanding of the complete background is necessary for a proper evaluation of any approach to the problem. If certain aspects are left undiscussed, research might be directed in the wrong direction or false hope might be found in approaches that are doomed right from the start. Also, the field of language/grammar learning seems to have a similar history as the area of connectionist computing, i.e. after a devastating criticism research activity was extremely low for an extended period of time before a realistic reassessment of the complexity issues involved resulted in a second spring. In the last couple of years a number of novel approaches to the problem of language/grammar learning have been explored but there are few works relating them to each other. Thus, chapter 2 can also be seen as my modest attempt to present an overview and comparison of several recent approaches to the problem.

“In chapter 3 I will present the specific version of the Emile algorithm developed by Pieter Adriaans and Arno Knobbe which I believe might be suitable to solve the problem of learning a grammar for the purpose of structuring large amounts of text such that customary knowledge-discovery techniques can be applied to these texts. I will also describe a concrete implementation of the algorithm in an object-oriented environment and will present results from some experiments. These results are not extensive and systematic enough to provide an irrefutable characterization or an accurate quantitative analysis of the algorithm but they do indicate certain general problems that will inevitably occur when the algorithm is used for practical purposes. In chapters 4 and 5 I will explore various approaches to solve these problems and provide an evaluation of these approaches in terms of the experiments from chapter 3.”

You can download the complete thesis, including all figures, from the University of Dortmund AI publications page as either PDF or PostScript. Alternatively, you can download a local copy of the PDF Document. The implementation is available upon request.

Copyright © 1997 Erik Dörnenburg. All rights reserved. The author hereby grants permission to reproduce and distribute publicly paper and electronic copies of this thesis document in whole, and to grant others the right to do so.