Abstracts IWPT '97-'98
The Computation of Movement
NEC Research Institute, Inc.
4 Independence Way, Princeton NJ 08540. USA
A central goal of parsing is to recover linguistic structure for interpretation. One property of language that seems to be prevalent is the so-called displacement property. That is, syntactic items commonly appear in places other than where we would normally expect for interpretation. Some examples of phenomena involving displacement include Wh-movement, raising, passivization, scrambling, topicalization and focus. As Chomsky (1995) points out, displacement is an irreducible fact about human language that every contemporary theory of language has to address. In the principles-and-parameters framework, it is customary to posit a general movement operation, Move-a , that in concert with conditions on its application serve to link displaced elements with their base positions. In terms of parsing, the task is to decode or unravel the effects of Move-a from the surface order. More specifically, for each element, we have to determine whether that element has been displaced or not, and, if so, determine the original position it was displaced from and reconstruct the path it took - including any intermediate positions or landing sites. In general, each displaced element is said to head a (non-trivial) chain with one or more empty categories known as traces occupying the positions that it passed through. Note that in such theories, empty categories are not just simple placeholders, but elements with much of the same type and range of syntactic properties displayed by their overt counterparts. For example, empty categories in argument positions, like anaphors and pronouns, participate in binding theory and theta role discharge. Hence, the well-formedness of a given sentence will depend, in general. in recovering both the visible and non-visible parts of syntactic structure.
In this talk, we will describe how PAPPI, a multi-lingual parser for theories in the principles-and-parameters frameworks, deals with the computation of movement chains and empty categories in general. Drawing from implemented examples across a variety of languages, we will discuss the mechanism used to handle standard cases of phrasal movement commonly discussed in the literature such as Wh-movement, passivization. raising and verb second (V2) phemomena. We will also describe how this mechanism is adapted to handle instances of argument scrambling in languages like Korean and Japanese. We will also focus our attention on head movement. Here, following Pollock (1989), we will discuss the mechanism used to handle the surface differences in the behaviour of verbal inflection in English and French. Following Pesetsky (1995), we will also discuss the implementation of a theory of double object constructions involving the incorporation of both overt and non-overt prepositions into verbal heads.
Finally, we will describe two recent additions to the movement mechanism in the PAPPI system. Moving towards a theory of goal-driven movement - as opposed to the free movement system implied bv Move-a , we will discuss an implementation of Case-driven movement within the VP-shell to handle examples involving focus, backgrounding and topicalization in Turkish. Finally, using examples from English and Turkish, we will discuss the necessity of a mechanism of reconstruction that optionally "undoes" or reverses the effects of movement to handle facts involving binding and scope.
Parsing Technology and RNA Folding:
a Promising Start
LIX, Ecole Polytechnique91128 Palaiseau Cedex, FRANCE
The determination of the secondary structure of RNAs is a problem which has been tackled bv distantly related methods ranging from comparative analysis to thermodynamic energy optimization or stochastic context-free grammars (SCFGs). Because of its very nature (properly nested pairs of bases of a single stranded sequence) the secondary structure of RNAs is well modeled by context-free grammars (CFGs). This fact has been recognized several years ago by people who used context-free grammars as a tool to discover some combinatorial properties of secondary structures. More recently SCFGs were used bv several teams (esp. David Haussler's team at UC Santa Cruz) as an effective tool to fold RNAs through Cocke-Younger-Kasami-Iike parsers. Until 1996, and in the context of RNA folding, CFGs and their derivatives where still considered theoretical tools, barely usable outside the computer scientist lab. The exception of SCFGs seemed promising, with all the hype around Hidden Markov Models and other stochastic methods, but it remained to be confirmed for RNAs longer than 200 bases.
The main obstacle to the use of context-free grammars and parsing technology for RNA folding and other closely related problems is the following : suitable grammars are exponentially, ambiguous, and sentences to parse (i.e. RNA or DNA sequences) typically have more than 200 words. and sometimes more than 4000 words. These figures are rather unusual for ordinary parsers or parser generators, because they are mostly, used in the context of natural language parsing., and thus do not have to face the same computation problems. Fact is, most people dealing with RNA folding problems were manually writing dynamic programming based tools. This was the case for folding models popularized by Michael Zuker, and based on free energy-mininimization. This was also the case for folding models based on SCFGs. This was in effect the case for just about every computer method available to fold or align sequenees. Parsing sequences was not an issue because it simply seemed too slow, too memory hungry and even unrelated.
In 1995, I showed that S-attribute grammars were perfectly able to handle both the thermodynamic model and the stochastic model of RNA folding. I then introduced a parser generator which was able, given a proper S-attribute grammar, to automatically write an efficient parser based on suitable optimizations of Earley's parsing algorithm. All generated parsers turned out to be faster and less memory hungry than other available parsers for the same exponentially ambiguous grammars and the same sequences. More surprisingly, these parsers also turned out to be faster than hand-written programs based on dynamic programming equations. This was the first proof that improvements in parsing technology may certainly be put to good use in biocomputing problems, and that they shall lead to better algorithms and tools.
While trying to overcome some limitations of SCFGs, I generalized S-attribute grammars to multi-tape S-attribute grammars (MTSAGs). The automata theory counterpart of a MTSAG would be a non-deterministic push-down automaton with several one-way reading heads, instead of a single one-way reading head as it is the case for CFG. Given these MTSAGs, a generalization of the previous single-tape parser generator was the obvious way forward.
Thanks to this new parser generator, I was able to show that most biocomputing models previously based on dynamic programming equations were unified bv MTSAGs, and that they were better handled by automatically generated parsers than by handwritten programs. lt did not matter whether these models were trying to align sequences, fold RNAS, align folded RNAS, align folded and unfolded RNAS, simultaneously align and fold RNAS, etc. It also turned out that the way SCFGs and HMMs are currently used rnay be better pictured, thanks to 2-tape MTSAGs, as the simultaneous alignrnent and folding of a first special tape, representing the target model against a second tape, containing the actual sequence. This representation may lead to algorithms which will efficiently learn SCFGs frorn initially unaligned sequences.
While the current parser generator for MTSAGs is a usable proof of concept, which nevertheless required several rnonths of work, 1 arn quite convinced that there should be better ways than the current algorithm to parse several tapes. There should also exist other generalizations of CFGs which rnay reveal themselves fruitful. Current results are only promising starting points.
The irony of the story is that HMMs and SCFGs were borrowed by biocomputing people from other fields such as signal or speech analysis. lt may very well be the time for these fields to retrofit their own rnodels with current advances in biocomputing such as MTSAGS.
Intelligent Multimedia Information Access
Mark T. Maybury
Advanced Information Systems Center
The MITRE Corporation
202 Burlington Road
Bedford, MA 01730, USA
The expansion of the information highway has generated requirements for more effective access to global and corporatic information repositories. These repositories are increasingly multimedia, including text, audio (e.g., spoken language, music), graphics, imagery, and video. The advent of large, multimedia digital libraries has turned attention toward the problem of processing and managing multiple and heterogeneous media in a principled manner, including their creation, storage, indexing, browsing, search, visualization, and summarization.
Intelligent multimedia information access is a multidisciplinary area that lies at the intersection of artificial intelligence, information retrieval, human computer interaction, and multimedia computing. Intelligent multimedia information access includes those systems which go beyond traditional hypermedia or hypertext environments and analyze media, generate media, or support intelligent interaction with or via multiple media using knowledge of the user, discourse, domain, world, or the media itself.
Providing machines with the ability to interpret, generate, and support interaction with multimedia artifacts (e.g., documents, broadcasts, hypermedia) will be a valuable facility for a number of key applications such as videoteleconference archiving, custom on-line news, and briefing assistants. These media facilities, in turn, may support a variety of tasks ranging from training to information analysis to decision support.
In this talk I will describe our group's efforts to provide content based access to broadeast news sources, including our use of corpus-based processing techniques to the problems of video indexing, segmentation, and summarization. In addition to better access to content, we also need to concern ourselves with enabling more effective, efficient and natural human computer or computer mediated human-human interaction. This will require automated understanding and generation of multimedia and demand explicit representation of and reasoning about the user, discourse, task and context (Maybury 1993). To this end, I will describe our work in progress that aims to fully instrument the interface and build (automatically and semi-automatically) annotated corpora of human-machine interaction. We believe this will yield deeper and more comprehensive models of interaction which should ultimately enable more principled interface design.
Making Use of Intonation in
Interactive Dialogue Translation
University of Pennsylvania, USA
Intonational information is frequently discarded in speech recognition, and assigned by default heuristies in text-to-speech generation. However, in many applications involving dialogue and interactive discourse, intonation conveys significant information, and we ignore it at our peril. Translating telephones and personal assistants are an interesting test case, in which the salience of rapidly shifting discourse topics and the fact that sentences are machine-generated, rather than written by humans, combine to make the application particularly vulnerable to our poor theoretical grasp of intonation and its functions. I will discuss a number of approaches to the problem for such applications, ranging from cheap tricks to a combinatory grammar-based theory of the semantics involved and a syntax-phonology interface for building and generating from interpretations.
Disambiguating with Controlled Disjunctions
2LC - CNRS
1361 route des Lucioles
F-06560 Sophia Antipolis
In this paper, we propose a disambiguating technique called controlled disjunctions. This extension of the socalled named disjuiictions relies on the relations existing between feature values (covariation, control, etc.). We show that controlled disjunctions can implement different kind of ambiguities in a consistent and homogeneous way. We describe the integration of controlled dijunctions into a HPSG feature structure representation. Finally we present a direct implementation by means of delayed evaluation and we develop an example within the functional programming paradigm.
Encoding Frequency Information In
John Carroll David Weir
School of Cognitive and Computing Sciences
University of Sussex
Faliller, Brighton, BN1 9QH, UK
We address the issue of how to associate frequency information with lexicalized grammar formalisms, using Lexicalized Tree Adjoining Grammar as a representative framework. We consider systematically a number of alternative probabilistic frameworks, evaluating their adequacy from both a theoretical and empirical perspective using data from existing large treebanks. We also propose three orthogonal approaches for backing off probability estimates to cope with the large number of parameters involved.
Towards a Reduced Commitment, D-Theory Style TAG
Department of Computer and lnformation Sciences, University of Delaware
Newark, DE 19716, USA
Many traditional TAG parsers handle ambiguity by considering all of the possible choices as they unfold during parsing. In contrast, D-theory parsers cope with ambiguity by using underspecified descriptions of trees. This paper introduces a novel approach to parsing TAG, namely one that explores how D-theoretic notions may be applied toTAG parsing.
Combining the D-tlieoretic approach to TAG parsing as we do here raises new issues and problems. D-theoretic underspecifition is used as a novel approach in the context of TAG parsing for delaying attachment decisions. Conversely, the use of TAG reveals the need for additional types of underspecification that have not been considered so far in the D-theoretic framework. These include combining sets of trees into their underspecified equivalents as well as underspecifying combination of trees.
ln this paper, we examine various issues that arise in this new approach to TAG parsing and present solutions to some of the problems. We, also describe other issues which need to be resolved for this method of parsing to be implemented.
Controlling Bottom-Up Chart Parsers
through Text Chunking
Fabio Ciravegna, Alberto Lavelli
Istituto per la Ricerca Scientifica e Technologica
Loc. Pante' di Povo, I-38050 Trento, Italy
In this paper we propose to use text chucking for controlling a bottom-up parsing. As it is well known, during analysis such parsers produce many constituents not contributing to the final solution(s). Most of these constituents are introduced due to the parser inability of checking the input context around them. Preliminary text chunking allows to focus directly on the constituents that seem more likely and to prune the search space in the case some satisfactory solutions are found. Preliminary experiments show that a CYK-like parser. controlled through chunking is definitely more efficient than a traditional parser without significantly losing in correctness. Moreover the quality of possible partial results produced by the controlled parser is high. The strategy is particularly suited for tasks like Information Extraction from text (IE) where sentences are often long and complex and it is very difficult to have a complete coverage. Hence, there is a strong necessity of focusing on the most likely solutions; furthermore, in IE the quality of partial results is important.
Pruning Search Space For Parsing Free Coordination In Categorial Grammar
Department of General Linguistics, Leiden University, The Netherlands
The standard resource sensitive invariants of categorial grammar are not suited to prune search space in the presence of coordination. We propose a weaker variant of count invariance in order to prune the search space for parsing coordinated sentences at a stage prior to proper parsing. This Coordinative Count Invariant is argued to be the strongest possible instrument to prune search space for parsing coordination in categorial grammar. Its mode of operation is explained, and its effect at pruning search space is exemplified.
Bilexical Grammars And a
Cubic-Time Probabilistic Parser
Dept. of Computer and Information Science, Univ. of Pennsylvania
200 South 33rd St., Philadelphia, PA 19103-6389 USA
Computational linguistics has a long tradition of lexicalized grammars, in which each grammatical rule is specialized for some individual word. The earliest lexicalized rules were word-specific subeategoriza,tion frames. It is now common to find fully lexicalized versions of many grammatical formalisms, such as context-ftee and tree-adjoining grammars [Schabes et al. 1988]. Other formalisms, such as dependency grammar [Mel'cuk 1988] and head-driven phrase-structure grammar [Pollard & Sag 1994], are explicitly lexical from the start.
Lexicalized grammars have two well-known advantages. Where syntactic acceptability is sensitive to the quirks of individual words, lexicalized rules are necessary for linguistic description. Lexicalized rules are also computationally cheap for parsing written text: a parser may ignore those rules that do not mention any input words.
More recently, a third advantage of lexicalized grammars has emerged. Even when syntactic acceptability is not sensitive to the particular words chosen, syntactic distribution may be [Resnik 1993]. Certain words may be able but highly unlikely to modify certain other words. Such facts can be captured by a probabilistic lexicalized grammar, where they may be rises to resolve ambiguity in favor of the most probable analysis, and also to speed parsing by avoiding ("pruning") unlikely search paths. Accuracy and efficiency can therefore both benefit.
Automaton-based Parsing For
Roger Evans, David Weir
University of Brighton University of Sussex
In wide-coverage lexicalized grammars many of the elementary structures have substructures in common. This means that during parsing some of the computation associated with different structures is duplicated. This paper explores ways in which the grammar can be precompiled into finite state automata so that some of this shared structure results in shared computation at runtime.
From Part of Speech Tagging
to Memory-based Deep Syntactic Analysis
Emmanuel Giguet and Jacques Vergue
GREYC - CNR UPRESA 6072 - Universite' de Caen
14032 Caen cedex - France
This paper presents a robust system for deep syntactic parsing of unrestricted French. This system uses techniques from Part-of-Speech tagging in order to build a constituent structure and uses other techniques from dependency grammar in an original framework of memories in order to build a functional structure. The two structures are build simultaneously by two interacting processes. The processes share the same aim, that is, to recover efficiently and reliably syntactic information with no explicit expectation on text structure.
Probabilistic Feature Grammars
40 Oxford St.
Cambridge, MA 02138
We present a new formalism, probabilistic feature grammar (PFG). PFGs combine most of the best properties of several other formalisms, including those of Collins, Magerman, and Charniak, and in experiments have comparable or better performance. PFGs generate features one at a time, probabilistically, conditioning the probabilities of each feature on other features in a local context. Because the conditioning is local, efficient polynomial time parsing algorithms exist for computing inside, outside, and Viterbi parses. PFGs can produce probabilities of strings, making them potentially useful for language modeling. Precision and recall results are comparable to the state of the art with words, and the best reported without words.
Message-Passing Protocols for Real-World Parsing
An Object-Oriented Model and its Preliminary Evaluation
Udo Hahn and Peter Neuhaus and Norbert Brueker
Computational Linguistics Lab
Freiburg University, Werthmannplatz, D-79085 Freiburg, Germany
We argue for a performance-based design of natural language grammars and their associated parsers in order to meet the constraints imposed by real-world NLP. Our approach incorporates declarative and procedural knowledge about language and language use within an object-oriented specification framework. We discuss several message-passing protocols for parsing and provide reasons for sacrificing completeness of the parse in favor of efficiency based on a preliminary empirical evaluation.
Probabilistic Parse Selection Based On
Computer Laboratory, University of Cambridge
This paper presents a new technique for selecting the correct parse of ambiguous sentences based on a probabilistic analysis of lexical cooccurrences in semantic forms. The method is called "Semco" (for semantic cooccurrence analysis) and is specifically targeted at the differential distribution of such cooccurreiices in correct and incorrect parses. It uses Bavesian Estimation for the cooccurrence probabilities to achieve higher accuracy for sparse data than the more common Maximum Likelihood Estimation would. It has been tested on the Wall Street Journal corpus (in the PENN Treebank) and shown to find the correct parse of 60.9% of parseable sentences of 6-20 words.
A New Formalization of Probabilistic GLR Parsing
Kentaro Inui, Virach Sornlertlamvanich, Hozumi Tanaka, and Takenobu Tokunaga
Graduate School of Information Science and Engineering
Tokyo Institute of Technology
2-12-1 O-okayama Meguro Tokyo 152, Japan
This paper presents a new formalization of probabilistic GLR language modeling for statistical parsing. Our model inherits its essential features from Briscoe and Carroll's generalized probabilistic LR model, which obtains context-sensitivity bv assigning a probability to each LR parsing action according to its left and right context. Briscoe and Carroll's model, however, has a drawback in that it is not formalized in any probabilistically wellfounded way, which may degrade its parsing performance. Our formulation overcomes this drawback with a few significant refinements, while maintaining all the advantages of Briscoe and Carroll's modeling.
Efficient Parsing For CCGS With
Generalized Type-Raised Categories
Department of Computer and Information Science
University of Pennsylvania
A type of 'non-traditional constituents' motivates an extended class of Combinatory Categorial Grammars, CCGs with Generalized Type-Raised Categories (CCG-GTRC) involving variables. Although the class of standard CCGs is known to be polynomially parsable, unrestricted use of variables can destroy this essential requirement for a practical parser. This paper argues for polynomial parsabibty of CCG-GTRC from practical and theoretical points of view. First, we show that an experimental parser runs polynomially in practice on a realistic fragment of Japanese by eliminating spurious ambiguity and excluding genuine ambiguities. Then, we present a worst-case polynomial recognition algorithm for CCG-GTRC by extending the polynomial algorithm for the standard CCGS.
Using Left Corner Language Models
Christopher D. Manning
University of Sydney NSW 2006
Lucent Technologies Bell Labs
600 Mountain Avenue, Room 2D-329
Murray Hill NJ 07974
We introduce a novel parser based on a probabilistic version of a left-corner parser. The left-corner strategy is attractive because rule probabilities can be conditioned on both top-down goals and bottom-up derivations. We develop the underlying theory and explain how a grammar can be induced from analyzed data. We show that the left-corner approach provides an advantage over simple top-down probabilistic context-free grammars in parsing the Wall Street Journal using a grammar induced from the Penn Treebank. We also conclude that the Penn Treebank provides a fairly weak testbed due to the flatness of its bracketings and to the obvious overgeneration and undergeneration of its induced grammar.
Regular Approximations Of CFLS: A
Dept. of Humanities Computing, University of Groningen
P.O. Box 716, NL-9700 AS Groningen, The Netherlands
We show that for each context-free grammar a new grammar can be constructed that generates a regular language. This construction differs from existing methods of approximation in that use of a pushdown automaton is avoided. This allows better insight into how the generated language is affected. The new method is also more attractive from a computational viewpoint.
A Left-To-Right Tagger For Word Graphs
Bell Laboratories, Lucent Technologies
600 Mountain Ave, Room 2D-339
Murray Hill, NJ 07974, USA
An algorithm is presented for tagging input word graphs and producing output tag graphs that are to be subjected to further syntactic processing. It is based on an extension of the basic HMM equations for tagging an input wordstring that allows it to handle word-graph input, where each are has been assigned a probability. The scenario is that of some word-graph source, e.g., an acoustic speech recognizer, producing the arcs of a word graph, and the tagger will in turn produce output arcs, labelled with tags and assigned probabilities. The processing as done entirely left-to-right, and the output tag graph is constructed using a mininum of lookahead, facilitating real-time processing.
Parsing By Successive Approximation
IMS-CL, University of Stuttgart
Azenbergstr. 12, D-70174 Stuttgart, Germany
It is proposed to parse feature structure-based grammars in several steps. Each step is aimed to eliminate as many invalid analyses as possible as efficiently as possible. To this end the set of feature constraints is divided into three subsets, a set of context-free constraints. a set of filtering constraints and a set of structure-building constraints, which are solved in that order. The best processing strategy differs: Context-free constraints are solved efficiently with one of the well-known algorithms for context-free parsing. Filtering constraints can be solved using unification algorithms for non-disjunctive feature structures whereas structure-building constraints require special techniques to represent feature structures with embedded disjunctions efficiently. A compilation method and an efficient processing strategy for filtering constraints are presented.
Performance Evaluation of Supertagging for
Dept. of Computer and Information Science
University of Pennsylvania
Philadelphia, PA 19104
In previous work we introduced the idea of supertagging as a means of improving the efficiency of a lexicalized grammar parser. In this paper, we present supertagging in conjunction with a lightweight dependency analyzer as a robust and efficient partial parser. The present work is significant for two reasons. First, we have vastly improved our results; 92% accurate for supertag disambiguation using lexical information, larger training corpus and smoothing techniques. Second, we show how supertagging can be used for partial parsing and provide detailed evaluation results for detecting noun chunks, verb chunks, preposition phrase attachment and a variety of other linguistic constructions. Using supertag representation, we achieve a recall rate of 93.0% and a precision rate of 91.8% for noun chunking, improving on the best known result for noun chunking.
An Earley Algorithm For Generic Attribute
Augmented Grammars And Applications
INRIA-Rocquencourt, BP 105, 78153 Le Chesnay CEDEX France
We describe an extension of Earley's algorithm which computes the decoration of a shared forest in a generic domain. Attribute computations are defined by a morphism from leftmost derivations to the generic domain, which leaves the computations independent from (even if guided by) the parsing strategy. The approach is illustrated by the example of a definite clause grammar, seen as CF-grainmars decorated by attributes.
A Case Study In
Optimizing Parsing Schemata
By Disambiguation Filters
Programming Research Group
University of Amsterdam
Disambiguation methods for context-free grammars enable concise specification of programming languages by ambiguous grammars. A disambiguation filter is a function that selects a subset from a set of parse trees- the possible parse trees for an ambiguous sentence. The framework of filters provides a declarative description of disambiguation methods independent of parsing. Although filters can be implemented straightforwardly as functions that prune the parse forest produced by some generalized parser, this can be too inefficient for practical applications.
In this paper the optimization of parsing schemata, a framework for high-level description of parsing algorithms, by disambiguation filters is considered in order to find efficient parsing algorithms for declaratively specified disambiguation methods. As a case study the optimization of the parsing schema of Earley's parsing algorithm by two filters is investigated. The main result is a technique for generation of efficient LR-Iike parsers for ambiguous grammars disambiguated by means of priorities.
New Parsing Method Using
Global Association Table
Juntae Yoon and Seonho Kim and Mansuk Song
Department of Computer Science
Yonsei University, Seoul, Korea
This paper presents a new parsing method using statistical information extracted from corpus, especially for Korean. The structural ambiguities are occurred in deciding the dependency relation between words in Korean. While figuring out the correct dependence, the lexical associations play an important role in resolving the ambiguities. Our parser rises statistical cooccurrence data to compute the lexical associations. In addition, it can be shown that sentences are parsed deterministically by the global management of the association. In this paper, the global association table(GAT) is defined and the association between words is recorded in the GAT. The system is the hybrid semi-deterministic parser and is controlled not by the condition-action rule, but by the association value between phrases. Whenever the expectation of the parser fails, it chooses the alternatives using a chart to remove the backtracking.
Constraint-driven Concurrent Parsing
Applied to Romanian VP
Dept. of Computer Science
University of Iasi, ROMANIA
We show that LP constraints (together with language specific constraints) could be interpreted as metarules in (an extended) head-corner parsing algorithm using weakened ID rule schemata from the theory of HPSG [Pollard and Sag, 1994].
Reducing the Complexity of Parsing by a Method of
Caroline Lyon and Bob Dickerson
School of Information Sciences, University of Hertfordshire,UK
A method of automatically locating the subject of a sentence has been developed, and we may be able to take advantage of this to reduce the complexity of parsing English text. Declarative sentences can almost always be segmented into three concatenated sections: pre-subject - subject - predicate. The pre-subject segment may be empty; for imperative sentences the subject section is empty. Other constituents, such as clauses, phrases, noun groups, are contained within these segments, but do not normally cross the boundaries between them. Though a constituent in one section may have dependent links to elements in other sections, such as agreement between the head of the subject and the main verb, once the three sections have been located, they can then be partially processed separately, in parallel.
The tripartite segmentation can be produced automatically, using the ALPINE parser. This is a hybrid processor in which neural networks operate within a rule based framework. Readers are invited to access a prototype via telnet, to use on their own text (for details contact the authors).
Formal Tools For Separating Syntactically Correct And
Martin Pla'tek, Vladislav Kubon, Toma's Holan
Faculty of Mathematics and Physics
Charles University, Prague, Czech Republic
firstname.lastname@example.org, email@example.com, firstname.lastname@example.org
In this paper we introduce a class of formal grammars with special measures capable to describe typical s' syntactic inconsistencies in free word order languages. By means of these measures it is possible to characterize more precisely the problems connected with the task of building a robust parser or a, grammar checker of Czech.
Parsers Optimization for Wide-coverage Unification-Based
Grammars Using The Restriction Technique
Nora La Sema Arantxa Di'az
Department of Computer Languages and Systems, University of the Basque Country
p.k. 649, 20080 Donostia, Spain.
Department of Computer Languages and Systems, Universitat Polite'cnica de Catalunya
Pau Gargallo 5, 08028 Barcelona, Spain.
This article describes the methodology we have followed in order to improve the efficiency of a parsing algorithm for widecoverage unification-based grammars. The technique used is the restriction technique (Shieber 85), which has been recognized as an important operation to obtain efficient parsers for unification-based grammars. The main objective of the research is how to choose appropriate restrictors for using the restriction technique. We have developed a statistical model for selecting restrictors. Several experiments have been done in order to characterise those restrictors.
Language Analysis in SCHISMA
Danny Lie, Joris Hulstijn, Hugo ter Doest, Anton Nijholt
University of Twente
Po Box 217
7500 AE Enschede
SCHISMA is a research project that is concerned with the development of a natural language accessible theatre information and booking system. In this project two research approaches can be distinguished. One approach is devoted to theoretical research in the areas syntax, semantics and pragmatics. Research on syntax has been conducted on a unifying parsing approach, left and head corner grammars, on stochastic context-free grammars and on unification grammar parsing. Research on semantics and pragmatics has been conducted on dialogue act classification and from a logical point of view. The other approach is more practically oriented in that it is more focused on the realisation of a fully functional prototype of the SCHISMA system. Our most recent achievement is a fully functional SCHISMA system (implemented in Java) based on a context-sensitive string rewrite mechanism. Although the system is primitive in nature and not necessarily built on linguistic principles (rather on intuitions), we believe its performance is such that the majority of users will be satisfied. Users will adapt to the system rather than reject it because of poor performance (when compared to human-human dialogues).
Bilexical Grammars And a Cubic-Time Probabilistic Parser
Dept. of Computer and Information Science, Univ. of Pennsylvania
200 South 33rd St., Philadelphia, PA 19103-6389 USA
Computational linguistics has a long tradition of lexicalized grammars, in which each grammatical rule is specialized for some individual word. More recently, a new advantage of lexicalized grammars has emerged. Even when syntactic acceptability is not sensitive to the particular words chosen, syntactic distribution may be. Certain words may be able but highly unlikely to modify certain other words. Such facts can be captured by a probabilistic lexicalized grammar, where they may be used to resolve ambiguity in favor of the most probable analysis, and also to speed parsing by avoiding unlikely search paths. A more recent flurry of probabilistic lexicalized parsers has focused on what one might call bilexical grammars, in which each grammatical rule is specialized for not one but two individual words. The present paper formalizes an inclusive notion of bilexical grammars and shows that they can be parsed efficiently.