C-structures and f-structures for the British National Corpus
Wagner, Joachim; Seddah, Djamé; Foster, Jennifer; van Genabith, Josef
We describe how the British National Corpus (BNC), a one-hundred-million-word balanced corpus of British English, was parsed into Lexical Functional Grammar (LFG) c-structures and f-structures using a treebank-based parsing architecture. The parsing architecture uses a state-of-the-art statistical parser and reranker trained on the Penn Treebank to produce context-free phrase structure trees, and an annotation algorithm to automatically annotate these trees with LFG f-structures. We describe the pre-processing steps taken to accommodate the differences between the Penn Treebank and the BNC, and discuss some of the issues encountered in applying the parsing architecture on such a large scale. The process of annotating a gold standard set of 1,000 parse trees is described. We present the results of evaluating the c-structures produced by the statistical parser against the c-structure gold standard, and of evaluating the f-structures produced by the annotation algorithm against an automatically constructed f-structure gold standard. The c-structures achieve an f-score of 83.7% and the f-structures an f-score of 91.2%.
Keyword(s): Machine translating; lexical functional grammar
Publication Date: 2007
Type: Conference item
Peer-Reviewed: Yes
Language(s): English
Institution: Dublin City University
Funder(s): Irish Research Council for Science Engineering and Technology; Science Foundation Ireland
Citation(s): Wagner, Joachim and Seddah, Djamé and Foster, Jennifer and van Genabith, Josef (2007) C-structures and f-structures for the British National Corpus. In: Lexical Functional Grammar 2007, 28-30 July 2007, California, USA.
Publisher(s): CSLI Publications
File Format(s): application/pdf
Related Link(s): http://doras.dcu.ie/15205/1/jwagner_et_al_07.pdf
First Indexed: 2010-02-18 05:05:06
Last Updated: 2014-08-23 05:18:48