\documentclass[a4paper,11pt]{article}
\usepackage[utf8]{inputenc}
% unicode input encoding is also allowed
% use the following lines instead of the previous:
%\usepackage{ucs}
%\usepackage[utf8x]{inputenc}
\usepackage{times}
\usepackage[T1]{fontenc}
\usepackage{pslatex}
\pagestyle{empty}
\bibliographystyle{plain}

\title{What kinds of trees grow in Swedish soil?}
\author{John Doe and John Smith \\[0.5cm]Department of Linguistics \\University of Franeker \\E-mail: \texttt{email@email}}
\date{}

\begin{document}
\maketitle

\begin{abstract}
\noindent This workshop concerns the relationship between the syntactic properties of a given language and the choice of linguistic theory for annotation purposes. In this paper, I will discuss and compare four different annotation schemes that have been proposed for Swedish in terms of their suitability for Swedish syntax as well as their relationship to linguistic theory and annotation schemes proposed for other languages.
\end{abstract}
\thispagestyle{empty}

\section{Introduction}
One of the issues brought up in this workshop concerns the relationship between the syntactic properties of a given language and the choice of linguistic theory for annotation purposes. Our Swedish treebank consortium, consisting of researchers from Växjö University, KTH and Stockholm University, is currently facing a specific instance of this issue in trying to define an annotation standard for a large-scale treebank of Swedish written and spoken language. In this paper, I will discuss and compare four different annotation schemes that have been proposed for Swedish in terms of their suitability for Swedish syntax as well as their relationship to linguistic theory and annotation schemes proposed for other languages.
Other aspects that will be touched upon are the availability of parsers and/or annotated training data for developing parsers, the different requirements for annotation of spoken and written language, and the different needs of different user groups.

By way of background, I will start by reviewing some basic facts about the syntax of Swedish, a Germanic verb-second language with moderately fixed word order. In doing this I will also introduce the Scandinavian tradition of descriptive grammar, in particular the influential field model due to Diderichsen \cite{did46}. The background section also contains a brief discussion of existing annotation schemes for other languages and their relation to current linguistic theory.

The main part of the paper will be devoted to a discussion and comparison of the following four annotation schemes for Swedish:
\begin{itemize}
\item MAMBA (Teleman \cite{tel74})
\item SynTag (Järborg \cite{jar86})
\item SWECG (Birn \cite{bir98})
\item S-CLE (Gambäck \cite{gam92})
\end{itemize}
The four schemes fall naturally into two groups: MAMBA and SynTag are standards designed for manual annotation of corpus material, while SWECG and S-CLE are primarily general-purpose parsing systems which have corpus annotation as one of their (potential) applications.

\section{Treebanks and Linguistic Theory}
\label{tlt}
The number of treebanks available for different languages is growing steadily and with them the number of different annotation schemes. This makes it very difficult to say something general about the relation between annotation schemes and linguistic theory, but broadly speaking I think we may distinguish three main kinds of annotation in current practice:
\begin{itemize}
\item Annotation of constituent structure
\item Annotation of functional structure
\item Theory-specific annotation
\end{itemize}
This is obviously not a proper taxonomy, since theory-specific annotation may concern both constituent structure and functional structure.
Rather, the first two categories are meant to cover more or less theory-neutral annotation schemes, focusing on constituent structure or functional structure, respectively. It should also be pointed out immediately that the annotation found in many if not most of the existing treebanks actually combines two or even all three of these categories. Still, I believe that the categories may be useful in discussing existing annotation schemes and their relation to linguistic theory. I will treat the categories in the order in which they are listed above, which I think roughly corresponds to the historical development of treebank annotation schemes.

The annotation of \emph{constituent structure}, often referred to as \emph{bracketing}, is the main kind of annotation found in pioneering projects such as the Lancaster Parsed Corpus (Garside et al.\ \cite{gar92}) and the original Penn Treebank (Marcus et al.\ \cite{mar93}). Normally, this kind of annotation consists of part-of-speech tagging for individual word tokens and annotation of major phrase structure categories such as NP, VP, etc. Figure \ref{ibm} shows a representative example, taken from the IBM Paris Treebank using a variant of the Lancaster annotation scheme.
\begin{figure}[htbp]
\vspace*{0.3cm}
\begin{verbatim}
[N Vous_PPSA5MS N]
[V accedez_VINIP5
  [P a_PREPA
    [N cette_DDEMFS session_NCOFS N]
  P]
  [Pv a_PREP31 partir_PREP32 de_PREP33
    [N la_DARDFS fenetre_NCOFS
      [A Gestionnaire_AJQFS
        [P de_PREPD
          [N taches_NCOFP N]
        P]
      A]
    N]
  Pv]
V]
\end{verbatim}
\caption{Constituency annotation in the IBM Paris Treebank}
\label{ibm}
\end{figure}

Annotation schemes of this kind are usually intended to be theory-neutral and therefore try to use mostly uncontroversial categories that are recognized in all or most syntactic theories that assume some notion of constituent structure.
Moreover, the structures produced tend to be rather flat, since intermediate phrase level categories are usually avoided, as well as complex structures such as Chomsky adjunction. The drawback of this is that the number of distinct expansions of the same phrase category can become very high. For example, Charniak \cite{cha96} was able to extract 10,605 distinct context-free rules from a 300,000-word sample of the Penn Treebank. Of these, only 3,943 occurred more than once in the sample.

The status of grammatical functions and their relation to constituent structure has long been a controversial issue in linguistic theory. Thus, whereas the standard view in transformational syntax since Chomsky \cite{cho65} has been that grammatical functions are derivable from constituent structure, proponents of dependency syntax such as Mel'\v{c}uk \cite{mel88} have argued that functional structure is more fundamental than constituent structure. Other theories, such as LFG, steer a middle course by assuming both notions as primitive.

When it comes to treebank annotation, the annotation of \emph{functional structure} has become increasingly important in recent years. The most radical examples are perhaps the annotation schemes based on dependency syntax, exemplified by the Prague Dependency Treebank of Czech (Haji\v{c} \cite{haj98}) and the METU Treebank of Turkish (Oflazer et al.\ \cite{ofl00}), where the annotation of dependency structure is added directly on top of the morphological annotation without any layer of constituent structure. Figure \ref{pdt} shows a simple example of dependency annotation from the Prague Dependency Treebank.
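
As a minimal sketch, and not a description of the Prague file format itself, a dependency annotation of this kind can be held as one record per token, giving the word form, the position of its head and the function label. The Python fragment below encodes the Czech sentence of Figure \ref{pdt} in this way. The \texttt{Token} class, the 1-based positions with 0 standing for the artificial root node (\#, AuxS), and the helpers \texttt{dependents} and \texttt{is\_tree} are all illustrative assumptions introduced here.
\begin{verbatim}
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    index: int     # 1-based position in the sentence (assumed convention)
    form: str      # word form
    head: int      # position of the head; 0 is the artificial root
    function: str  # function label attached to the token


# The sentence of the figure: "Kominík vymetá komíny ."
SENTENCE = [
    Token(1, "Kominík", 2, "Sb"),
    Token(2, "vymetá", 0, "Pred"),
    Token(3, "komíny", 2, "Obj"),
    Token(4, ".", 0, "AuxK"),
]


def dependents(sentence, head):
    """Return the forms of all tokens directly governed by `head`."""
    return [t.form for t in sentence if t.head == head]


def is_tree(sentence):
    """Check that every token reaches the root (0) without a cycle."""
    heads = {t.index: t.head for t in sentence}
    for start in heads:
        seen = set()
        node = start
        while node != 0:
            if node in seen or node not in heads:
                return False
            seen.add(node)
            node = heads[node]
    return True
\end{verbatim}
With this encoding, \texttt{dependents(SENTENCE, 2)} lists the subject and the object of the verb, and \texttt{is\_tree(SENTENCE)} checks that the head links contain no cycle. No constituent layer is involved, which is the point made above about annotation added directly on top of the word tokens.
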
\begin{figure}[htbp]
\vspace*{0.3cm}
\begin{center}
\begin{picture}(200,170)
\put(20,160){\circle*{5}}
\put(40,40){\circle*{5}}
\put(80,100){\circle*{5}}
\put(120,40){\circle*{5}}
\put(140,100){\circle*{5}}
\put(20,150){\makebox(0,0){\#}}
\put(20,140){\makebox(0,0){AuxS}}
\put(45,30){\makebox(0,0){Komin\'{i}k}}
\put(45,20){\makebox(0,0){Sb}}
\put(55,105){\makebox(0,0){vymet\'{a}}}
\put(55,95){\makebox(0,0){Pred}}
\put(125,30){\makebox(0,0){kom\'{i}ny}}
\put(125,20){\makebox(0,0){Obj}}
\put(145,90){\makebox(0,0){.}}
\put(145,80){\makebox(0,0){AuxK}}
\put(20,160){\line(1,-1){60}}
\put(20,160){\line(2,-1){120}}
\put(80,100){\line(-2,-3){40}}
\put(80,100){\line(2,-3){40}}
\end{picture}\\
\begin{tabular}{llll}
Komin\'{i}k&vymet\'{a}&kom\'{i}ny&.\\
Chimneysweep&sweeps&chimneys&.\\
\end{tabular}
\caption{Functional annotation in the Prague Dependency Treebank}
\label{pdt}
\end{center}
\end{figure}

The trend towards more functionally oriented annotation schemes is also reflected in the extension of constituency-based schemes with annotation of grammatical functions. Cases in point are SUSANNE (Sampson \cite{sam95}), which is a development of the Lancaster annotation scheme mentioned above, and Penn Treebank II (Marcus et al.\ \cite{mar94}), which adds functional tags to the original phrase structure annotation. One of the most interesting examples in this respect is the annotation scheme adopted in the TIGER Treebank of German (Brants and Hansen \cite{bra02}), developed from the earlier NEGRA treebank and annotation scheme, which integrates the annotation of constituency and dependency in a graph where node labels represent phrasal categories while edge labels represent syntactic functions.

The third kind of annotation scheme that is found in available treebanks is the kind that adheres to a specific linguistic theory and uses representations from that theory to annotate sentences.
Thus, HPSG has been used as the basis for treebanks of Bulgarian (Simov et al.\ \cite{sim02}) and Polish (Marciniak et al.\ \cite{mar00}), and the Prague Dependency Treebank mentioned earlier is based on the theory of Functional Generative Description (Sgall et al.\ \cite{sga86}). There has also been work on automatic f-structure annotation in the theoretical framework of LFG (see, e.g., Sadler et al.\ \cite{sad00}).

To summarize, there has been a trend towards more functionally oriented annotation schemes in recent years, and theory-specific annotation schemes have become more common, but the dominant paradigm in treebank annotation is probably still the kind of theory-neutral annotation of constituent structure with added functional tags represented by schemes such as the Penn Treebank II standard.

\section{Conclusion}
In conclusion, MAMBA and SWECG emerge as the strongest candidates for use in the annotation of a Swedish treebank. The other two schemes considered, SynTag and S-CLE, are interesting in their own right but are on the whole less suitable for adoption in a large-scale treebank project. MAMBA and SWECG have the advantage of being firmly based in the Swedish tradition of descriptive grammar and can therefore be expected to have good descriptive adequacy and coverage. This is especially true of MAMBA, which was designed specifically to handle spoken as well as written language. Moreover, the fact that these schemes are based on notions of traditional grammar means that they provide an annotation which may be more accessible to non-expert treebank users.

The main weakness of SWECG is that the annotation contains little or no information about phrase structure and is therefore difficult to relate to many current linguistic theories.
However, this situation has clearly improved with the development of Functional Dependency Grammar (FDG), which establishes a more direct connection to dependency-based theories of syntax and also provides a better basis for the reconstruction of phrase structure from dependency structure if this is required. For MAMBA the biggest problem is instead the lack of resources for automatic annotation, although it may be possible to improve the situation by using the available annotated corpora for bootstrapping a parsing system.

\begin{thebibliography}{99}

\bibitem {bir98} Birn, Juhani (1998) Swedish Constraint Grammar. Lingsoft Inc. (URL: http://www.lingsoft.fi/doc/swecg/intro/).

\bibitem {bra02} Brants, Sabine and Hansen, Silvia (2002) Developments in the TIGER Annotation Scheme and their Realization in the Corpus. In \emph{Proceedings of the Third Conference on Language Resources and Evaluation (LREC 2002)}, pp.\ 1643--1649, Las Palmas.

\bibitem {cha96} Charniak, Eugene (1996) Tree-Bank Grammars. In \emph{AAAI/IAAI}, Vol.\ 2, pp.\ 1031--1036.

\bibitem {cho65} Chomsky, Noam (1965) \emph{Aspects of the Theory of Syntax.} MIT Press.

\bibitem {did46} Diderichsen, Paul (1946) \emph{Elementær dansk grammatik.} Copenhagen: Gyldendal.

\bibitem {jar86} Järborg, Jerker (1986) Manual för syntaggning [Manual for syntagging]. Göteborgs universitet: Institutionen för språkvetenskaplig databehandling.

\bibitem {gam92} Gambäck, Björn and Rayner, Manny (1992) The Swedish Core Language Engine. In \emph{Papers from the 3rd Nordic Conference on Text Comprehension in Man and Machine}, Linköping University, Linköping, Sweden, pp.\ 71--85.

\bibitem {gar92} Garside, R., Leech, G. and Varadi, T. (compilers) (1992) \emph{Lancaster Parsed Corpus}. A machine-readable syntactically-analysed corpus of 144,000 words, available for distribution through ICAME, The Norwegian Computing Centre for the Humanities, Bergen.

\bibitem {haj98} Haji\v{c}, Jan (1998) Building a Syntactically Annotated Corpus: The Prague Dependency Treebank.
In \emph{Issues of Valency and Meaning}, pp.\ 106--132. Prague: Karolinum.

\bibitem {mar93} Marcus, Mitchell P., Santorini, Beatrice and Marcinkiewicz, Mary Ann (1993) Building a Large Annotated Corpus of English: The Penn Treebank. \emph{Computational Linguistics} 19, 313--330. [Reprinted in Armstrong, Susan (ed.) (1994) \emph{Using large corpora}, pp.\ 273--290. Cambridge, MA: MIT Press.]

\bibitem {mar94} Marcus, Mitchell P., Kim, Grace, Marcinkiewicz, Mary Ann, MacIntyre, Robert, Bies, Ann, Ferguson, Mark, Katz, Karen and Schasberger, Britta (1994) The Penn Treebank: Annotating Predicate Argument Structure. In \emph{ARPA Human Language Technology Workshop}.

\bibitem {mar00} Marciniak, Małgorzata, Mykowiecka, Agnieszka, Kup\'{s}\'{c}, Anna and Przepi\'{o}rkowski, Adam (2000) An HPSG-Annotated Test Suite for Polish. In \emph{Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000)}.

\bibitem {mel88} Mel'\v{c}uk, Igor (1988) \emph{Dependency Syntax: Theory and Practice}. State University of New York Press.

\bibitem {ofl00} Oflazer, Kemal, Say, Bilge and Hakkani-T\"{u}r, Dilek (2000) A Syntactic Annotation Scheme for Turkish. In \emph{Proceedings of 10th International Conference on Turkish Linguistics (ICTL-2000)}.

\bibitem {sad00} Sadler, Louisa, van Genabith, Josef and Way, Andy (2000) Automatic F-Structure Annotation from the AP Treebank. In Butt, Miriam and Holloway King, Tracy (eds.) \emph{Proceedings of the Fifth International Conference on Lexical-Functional Grammar}, The University of California at Berkeley, 19 July -- 20 July 2000. Stanford, CA: CSLI Publications.

\bibitem {sga86} Sgall, Petr, Haji\v{c}ov\'{a}, Eva and Panevov\'{a}, Jarmila (1986) \emph{The Meaning of the Sentence in Its Pragmatic Aspects}. Reidel.

\bibitem {sam95} Sampson, Geoffrey (1995) \emph{English for the Computer}. Oxford University Press.

\bibitem {sim02} Simov, Kiril, Popova, Gergana and Osenova, Petya (forthcoming) HPSG-Based Syntactic Treebank of Bulgarian (BulTreeBank). In Wilson, Andrew, Rayson, Paul and McEnery, Tony (eds.) \emph{A Rainbow of Corpora: Corpus Linguistics and the Languages of the World}, pp.\ 135--142. Munich: Lincom-Europa.

\bibitem {tel74} Teleman, Ulf (1974) \emph{Manual för grammatisk beskrivning av talad och skriven svenska [Manual for grammatical description of spoken and written Swedish].} Lund: Studentlitteratur.

\end{thebibliography}
\end{document}