CLARIN Tool Portal

ELAN Multimedia Annotator

1 resource

ELAN is a professional tool for creating complex annotations on video and audio resources. A user can add an unlimited number of annotations to audio and/or video streams. An annotation can be a sentence, word or gloss, a comment, a translation, or a description of any feature observed in the media. Annotations can be created on multiple layers, called tiers, and tiers can be hierarchically interconnected. An annotation can either be time-aligned to the media or refer to other existing annotations. The textual content of annotations is always in Unicode, and the transcription is stored in an XML format.

ELAN provides several different views on the annotations; each view is connected and synchronized to the media playhead. Up to 4 video files can be associated with an annotation document. Each video can be integrated in the main document window or displayed in its own resizable window. ELAN delegates media playback to an existing media framework, such as Windows Media Player, QuickTime or JMF (Java Media Framework). As a result, a wide variety of audio and video formats is supported and high-performance media playback can be achieved.

ELAN is written in the Java programming language and the sources are available for non-commercial use. It runs on Windows, Mac OS X and Linux.

ELAN has been functionally extended with the help of the following CLARIN-NL-funded projects:
- ColTime: Collaboration on Time-Based Resources.
- EXILSEA: Exploiting ISOcat's Language Sections in ELAN and ANNEX.
- MultiCon: Multilayer Concordance Functions in ELAN and ANNEX.
- SignLinC: Linking lexical databases and annotated corpora of signed languages.

Over the years, many funders have contributed to the development of ELAN in several projects, including the Volkswagen Foundation, the Royal Netherlands Academy of Arts and Sciences, the Berlin-Brandenburg Academy of Sciences and Humanities, the German Federal Ministry of Education and Research, the Max Planck Society and the ARC Centre of Excellence for the Dynamics of Language.
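
The tier model in the ELAN description (annotations on hierarchically connected tiers, each either time-aligned or referring to another annotation) can be pictured as a small data structure. The following is only an illustrative Python sketch, not ELAN's own code or its XML file format; the names `Tier`, `Annotation`, `start_ms`, `end_ms` and `ref` are assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Annotation:
    """One annotation: either time-aligned or referring to another annotation."""
    value: str                          # Unicode text content
    start_ms: Optional[int] = None      # set only for time-aligned annotations
    end_ms: Optional[int] = None
    ref: Optional["Annotation"] = None  # set only for referring annotations

    def interval(self) -> tuple[int, int]:
        """Resolve the media interval, following references until a
        time-aligned annotation is reached."""
        if self.start_ms is not None and self.end_ms is not None:
            return (self.start_ms, self.end_ms)
        if self.ref is None:
            raise ValueError("annotation is neither time-aligned nor a reference")
        return self.ref.interval()


@dataclass
class Tier:
    """A layer of annotations; `parent` expresses the tier hierarchy."""
    name: str
    parent: Optional["Tier"] = None
    annotations: list[Annotation] = field(default_factory=list)


# Hypothetical example data: an utterance tier with a dependent gloss tier.
utterances = Tier("utterance")
glosses = Tier("gloss", parent=utterances)

utt = Annotation("ik zie een hond", start_ms=1200, end_ms=2850)
utterances.annotations.append(utt)

gloss = Annotation("I see a dog", ref=utt)
glosses.annotations.append(gloss)

print(gloss.interval())
```

In this sketch a referring annotation carries no times of its own and inherits its interval from the annotation it points to, which is one simple way to keep dependent tiers consistent when the parent annotation is moved.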

Syntactic Profiler of Dutch

1 resource

SPOD is a syntactic profiler that covers a broad spectrum of properties. It is part of the PaQu application but has its own interface with a menu of predefined queries. It can be used to provide general information about corpus properties, such as the number of main and subordinate clauses, the types of main and subordinate clauses and their frequencies, and the average length of clauses per clause type (e.g. relative clauses, indirect questions, finite complement clauses, infinitival clauses, finite adverbial clauses). It yields output in HTML and tab-separated text format.
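
The kind of summary described above (counts and average lengths per clause type, written as tab-separated text) can be illustrated with a short Python sketch. It does not use SPOD or PaQu and says nothing about their actual implementation or column layout: the input list, the function names `profile` and `to_tsv`, and the column headers are all assumptions for the example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical input: one (clause type, length in tokens) pair per clause.
clauses = [
    ("main", 8),
    ("relative", 5),
    ("main", 12),
    ("finite complement", 7),
    ("relative", 9),
]


def profile(clauses):
    """Return (clause type, count, average length) rows, sorted by type."""
    lengths = defaultdict(list)
    for clause_type, n_tokens in clauses:
        lengths[clause_type].append(n_tokens)
    return [
        (clause_type, len(ns), sum(ns) / len(ns))
        for clause_type, ns in sorted(lengths.items())
    ]


def to_tsv(rows):
    """Serialise the rows as tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["clause_type", "count", "avg_length"])
    for clause_type, count, avg in rows:
        writer.writerow([clause_type, count, f"{avg:.2f}"])
    return buf.getvalue()


print(to_tsv(profile(clauses)))
```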

Corpus Studio Web

1 resource

Summary

CorpusStudio is a web application that facilitates in-depth quantitative syntactic research for linguists.

Background

CorpusStudio supports researchers in writing queries that operate on syntactically parsed text corpora in a number of major XML formats. Queries that belong together are kept in XML documents called ‘Corpus Research Projects’ (CRPs). These documents contain the queries, the order in which they are to be executed, meta-information about the queries and the project as a whole, and a specification of the input used for the project. The use of CRPs helps improve the replicability of corpus research.

Access

Any CLARIN-NL user can access the CorpusStudio web application and make use of the 'standard' corpora. New users must provide a login name and password, after which they can make use of the application.

Adaptable

The CorpusStudio code is open source. Users can take the code, adapt it and use it for their own purposes. Users can also take the code from GitHub as it is and build their own server in order to run the application on their own text corpora. User documentation and an API are available. The current version of CorpusStudio supports XML text corpora in the FoLiA and Psdx formats. Extensions to other XML formats are possible. The code is split over three repositories:
- CrpxProcessor provides the basic functionality: https://github.com/ErwinKomen/CrpxProcessor
- CrppServer takes care of /crpp and uses CrpxProcessor: https://github.com/ErwinKomen/CrppServer
- CrpStudio takes care of /crpstudio and uses CrpxProcessor: https://github.com/ErwinKomen/CrpStudio

Main features

- Keep all important aspects of a research project in one file
- Define one or more search queries in a hierarchy
- Uses the W3C-developed XQuery and XPath
- Integrated CorpusStudio-specific XQuery functions
- User-definable functions and variables
- Create corpus result databases with user-definable features accompanying each hit
- Divide the output into calculatable categories
- Divide the results into meta-data-dependent groups
- Parallel processing yields a speed-up of a factor 20-100 compared to the Windows version
- Compatibility with the Windows programs "Cesax" and "CorpusStudio"

Limitations and future developments

Current limitations of the program concern working with result databases, the restricted login system, the absence of a document view, grouping being restricted to system-defined groups, and the absence of a query or project wizard. Although the CLARIN-NL project ended in December 2015, every effort will be made to add a number of essential features.
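
The idea behind a Corpus Research Project, a single document that bundles an input specification with named queries in a fixed execution order, can be sketched in a few lines of Python. This is an illustration under stated assumptions only: it is not the CRP file format or the CrpxProcessor API, the toy XML is neither FoLiA nor Psdx, and the queries use the limited XPath subset of Python's standard `xml.etree.ElementTree` rather than XQuery. The names `Query` and `ResearchProject` are hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Query:
    name: str
    xpath: str  # ElementTree's limited XPath subset, not full XQuery


@dataclass
class ResearchProject:
    """Bundles the input and the ordered queries in one object."""
    name: str
    input_xml: str
    queries: list[Query] = field(default_factory=list)

    def run(self) -> dict[str, int]:
        """Run the queries in their stored order and count the hits."""
        root = ET.fromstring(self.input_xml.strip())
        return {q.name: len(root.findall(q.xpath)) for q in self.queries}


# Made-up toy corpus, used only for this example.
TOY_CORPUS = """
<corpus>
  <sentence id="s1">
    <clause type="main"><clause type="relative"/></clause>
  </sentence>
  <sentence id="s2">
    <clause type="main"/>
  </sentence>
</corpus>
"""

project = ResearchProject(
    name="clause-counts",
    input_xml=TOY_CORPUS,
    queries=[
        Query("main clauses", ".//clause[@type='main']"),
        Query("relative clauses", ".//clause[@type='relative']"),
    ],
)

print(project.run())
```

Because the input, the queries and their order all live in one object, rerunning the project on the same input gives the same counts, which is the replicability benefit the description attributes to CRPs.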

Komen, E. R. 2017. Beyond Counting Syntactic Hits. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 259–268. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.21. License: CC-BY 4.0

Komen, Erwin R. 2011. Coreferenced corpora for information structure research. In Outposts of Historical Corpus Linguistics: From the Helsinki Corpus to a Proliferation of Resources. (Studies in Variation, Contacts and Change in English 10) Jukka Tyrkkö, Terttu Nevalainen, Matti Rissanen & Matti Kilpiö (eds). Helsinki, Finland: Research Unit for Variation, Contacts, and Change in English.

Komen, Erwin R. 2013. Finding focus: a study of the historical development of focus in English. Utrecht: LOT.

Komen, Erwin R. 2013. Corpus databases with feature pre-calculation. In Proceedings of the twelfth workshop on treebanks and linguistic theories (TLT12). Sandra Kübler, Petya Osenova & Martin Volk (eds), 85-96. Sofia, Bulgaria: The institute of information and communication technologies, Bulgarian academy of sciences.

Frog: An advanced Natural Language Processing suite for Dutch

1 resource

Frog's current version will tokenize, tag, lemmatize, and morphologically segment word tokens in Dutch text files, will assign a dependency graph to each sentence, will identify the base phrase chunks in the sentence, and will attempt to find and label all named entities.

Van den Bosch, A., Busser, G.J., Daelemans, W., and Canisius, S. (2007). An efficient memory-based morphosyntactic tagger and parser for Dutch, In F. van Eynde, P. Dirix, I. Schuurman, and V. Vandeghinste (Eds.), Selected Papers of the 17th Computational Linguistics in the Netherlands Meeting, Leuven, Belgium, pp. 99-114

COAVA: Cognition, Acquisition and Variation Tool

1 resource

In COAVA two sets of databases are made available in a standardized way: one with historical dialect data (the databases WBD and WLD, with lexical data of the Brabantish and Limburgian dialects between 1880 and 1980) and one with first language acquisition data (four databases from the CHILDES project). The databases contain linguistic information (dialect form, standardised or “Dutchified” form, lexical meaning), geographical information (locality, dialect area, province) and information on the source (inquiry forms or monotopic dictionaries, and the date of documentation). Visualising the linguistic and geographical information yields lexical maps.

The most typical way for the user to get to the data is through the browsable concept taxonomy. In other words, the databases are approachable via search tools but also via a thematic taxonomy. This taxonomy was developed for the dialect databases and covers the general vocabulary.

COAVA (COgnition, Acquisition and VAriation Tool) brings together two strange bedfellows: first language acquisition and historical dialectology. In historical linguistics it is commonly assumed that language change in the past is due to non-target-like transmission of linguistic features between generations, i.e. between parents and children. Despite this assumption, the two disciplines remain isolated from each other due to, among other things, different methods of data collection and different types of resources with empirical data. The aim of the COAVA project was to demonstrate that this common assumption in historical linguistics can be examined in detail with the help of Digital Humanities. This interdisciplinary research aims at developing a tool for easily exploring the linguistic characteristics of concepts.

Leonie Cornips, Jos Swanenberg, Wilbert Heeringa, Folkert de Vriend (2016). The relationship between first language acquisition and dialect variation: Linking resources from distinct disciplines in a CLARIN-NL project. Lingua, Vol. 178, 07.2016, p. 32-45. doi:10.1016/j.lingua.2015.11.007

Cornips, L., Swanenberg, J., Vriend, F. de, Heeringa, W. (2012), Is what we have acquired early, less vulnerable to variation? A comparison between data from dialectlexicography and data from first language acquisition. http://www.meertens.knaw.nl/coavasite/wp-content/uploads/2012/10/Abstract-SIDG-2-JS.pdf

Cornips, L., Kemps-Snijders, M., Snijders, M., Swanenberg, J. and Vriend, F. de (2011). Bridging the Gap between First Language Acquisition and Historical Dialectology with the Help of Digital Humanities. SDH Copenhagen. http://www.meertens.knaw.nl/coavasite/wp-content/uploads/2011/11/Paper-SDH.pdf

Namescape Search

1 resource

Searching and visualizing Named Entities in modern Dutch novels. The named entity (NE) tagging and resolution in NameScape enables quantitative and repeatable research where previously only guesswork and anecdotal evidence were feasible. The visualisation module enables researchers with a less technical background to draw conclusions about the functions of names in literary work and helps them explore the material in search of more interesting questions (and answers). Users from other communities (sociolinguistics, sentiment analysis, …) also benefit from the NE-tagged data, especially since the NE recognizer is available as a web service, enabling researchers to annotate their own research data.

Datasets in NameScape (total of 1,129 books):
- Corpus Sanders: A corpus of 582 Dutch novels written and published between 1970 and 2009.
- Corpus Huygens: Consists of 22 novels manually tagged with detailed named entity information. IPR for this corpus does not allow distribution.
- Corpus eBooks: Consists of 7000+ Dutch eBooks tagged automatically with basic NER features and person name part information. IPR for this corpus does not allow distribution.
- Corpus SoNaR Books: 105 Dutch books; NE tagged.
- Corpus Gutenberg Dutch: Consists of 530 NE-tagged TEI files converted from the Epub versions of the corresponding Gutenberg documents.

Recent research has conclusively proven that names in literary works can only be put fully into perspective when studied in a wider context (landscape) of names, either in the same text or in related material (the onymic landscape or “namescape”). Research on large corpora is needed to gain a better understanding of, for example, what is characteristic of a certain period, genre, author or cultural region. The data necessary for research on this scale simply does not exist yet.

NameScape aims to fill this need by providing a substantial number of literary works annotated with a rich tag set, thereby enabling researchers to perform their research in more depth than previously possible. Several exploratory visualization tools help the scholar to answer old questions and uncover many new ones, which can be addressed using the demonstrator.

de Does, J, Depuydt, K, van Dalen-Oskam, K and Marx, M. 2017. Namescape: Named Entity Recognition from a Literary Perspective. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 361–370. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.30. License: CC-BY 4.0

Karina van Dalen-Oskam (2013), Nordic Noir: a background check on Inspector Van Veeteren, 31 May 2012, http://blog.namescape.nl/?p=47

Dutch Memory-Based Coreference Resolution Demo

1 resource

This demo is part of the Corea project. Its purpose is to tag noun-phrase referents in a text.

Stylene, a robust, modular system for stylometry and readability research

1 resource

Stylene is a robust, modular system for stylometry and readability research, built on existing techniques for automatic text analysis and machine learning. It is offered as a web service that allows researchers in the humanities and social sciences to analyze texts with the system. In this way, the project makes recent advances in research on the computational modeling of style and readability available to researchers.

Background

Stylene consists of an educational demonstration interface and tools for stylometry (authorship attribution and profiling) and readability research for Dutch. The system comprises a popularization interface for learning to understand stylometric analysis, and web-based interfaces to software for readability and stylometry research, aimed at researchers from the humanities and social sciences who do not want to develop or install such software themselves. Stylene was created in the context of CLARIN Flanders.

Daelemans, W, De Clercq, O and Hoste, V. 2017. Stylene: an Environment for Stylometry and Readability Research for Dutch. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 195–209. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.16. License: CC-BY 4.0

Polimedia: Interlinking multimedia for the analysis of media coverage of political debates

1 resource

PoliMedia links the minutes of the debates in the Dutch Parliament (Dutch Hansard) to the databases of historical newspapers and ANP radio bulletins to allow cross-media analysis of coverage in a uniform search interface. For each fragment from a single speaker in a debate, the developers extracted relevant information: the speaker, the date, important terms from its content and important terms from the description of the complete debate. This information was then combined to create a query with which they searched the archives of newspapers, radio bulletins and television programmes. Media items that corresponded to this query were retrieved and a link was created between the speech and the media item, creating a Semantic Web of Dutch Hansard and media coverage. This Semantic Web contains links from the Dutch Hansard to newspaper articles and radio bulletins. Evaluations found a recall of 62% and a precision of 75%.

To navigate this Semantic Web, a search user interface was developed based on a requirements study with five scholars in history and political communication. The developers created a faceted search interface in which the Dutch parliamentary minutes can be searched in full text and in which results can be refined by speaker, role of the speaker (parliament or government), political party and year. These debates are presented with links to the original locations of the media items.

PoliMedia is a collaboration of the TU Delft and the Free University (development of the Semantic Web of Dutch Hansard and media), the Netherlands Institute of Sound and Vision (development of the search user interface) and Erasmus University Rotterdam (project leader and user research among historians and political communication researchers).
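
The linking step described for PoliMedia (combine speaker, date and important terms into one query, then judge the retrieved links by precision and recall) can be illustrated with a minimal Python sketch. It is not the PoliMedia implementation: the query structure, the `window_days` parameter, the example speaker and the item identifiers are all assumptions invented for the example. Only the four extracted fields and the two evaluation measures come from the description.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class SpeechFragment:
    """The information extracted for one speaker fragment in a debate."""
    speaker: str
    day: date
    speech_terms: list[str]   # important terms from the fragment itself
    debate_terms: list[str]   # important terms from the debate description


def build_query(fragment: SpeechFragment, window_days: int = 7) -> dict:
    """Combine the extracted fields into one query description.
    The date window is an assumption, not a documented PoliMedia setting."""
    return {
        "must": [fragment.speaker],
        "should": sorted(set(fragment.speech_terms) | set(fragment.debate_terms)),
        "from": fragment.day,
        "to": fragment.day + timedelta(days=window_days),
    }


def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Precision: share of retrieved links that are correct.
    Recall: share of correct links that were retrieved."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical fragment and hypothetical evaluation data.
fragment = SpeechFragment(
    speaker="J. Jansen",
    day=date(1975, 3, 4),
    speech_terms=["woningbouw", "huur"],
    debate_terms=["begroting", "huur"],
)
query = build_query(fragment)
print(query["should"])

retrieved = {"item-a", "item-b", "item-c", "item-d"}
relevant = {"item-a", "item-b", "item-c", "item-e", "item-f"}
print(precision_recall(retrieved, relevant))
```

In the toy evaluation three of the four retrieved items are correct and three of the five correct items are found, which shows how precision and recall can differ for the same set of links.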

Juric, D., Hollink, L., and Houben, G. (2013). Discovering links between political debates and media. The 13th International Conference on Web Engineering (ICWE'13). Aalborg, Denmark.

Juric, D., Hollink, L., and Houben, G. (2012). Bringing parliamentary debates to the Semantic Web. DeRiVE workshop on Detection, Representation, and Exploitation of Events in the Semantic Web.

Kemman, M. J., and Kleppe, M. (2013). PoliMedia - Improving Analyses of Radio, TV and Newspaper Coverage of Political Debates. In T. Aalberg et al. (Eds.), TPDL2013, LNCS 8092 (pp. 409-412). Springer-Verlag Berlin Heidelberg.

Kemman, M. J., Kleppe, M., and Maarseveen, J. (2013). Eye Tracking the Use of a Collapsible Facets Panel in a Search Interface. In T. Aalberg et al. (Eds.), TPDL2013, LNCS 8092 (pp. 405-408). Springer-Verlag Berlin Heidelberg.

Martijn Kleppe, Laura Hollink, Max Kemman, Damir Juric, Henri Beunders, Jaap Blom, Johan Oomen and Geert-Jan Houben. PoliMedia: Analysing Media Coverage of political debates by automatically generated links to Radio & Newspaper Items. http://ceur-ws.org/Vol-1124/linkedup_veni2013_04.pdf

Namescape Visualizer

1 resource

Searching and visualizing Named Entities in modern Dutch novels. The named entity (NE) tagging and resolution in NameScape enables quantitative and repeatable research where previously only guesswork and anecdotal evidence were feasible. The visualisation module enables researchers with a less technical background to draw conclusions about the functions of names in literary work and helps them explore the material in search of more interesting questions (and answers). Users from other communities (sociolinguistics, sentiment analysis, …) also benefit from the NE-tagged data, especially since the NE recognizer is available as a web service, enabling researchers to annotate their own research data.

Datasets in NameScape (total of 1,129 books):
- Corpus Sanders: A corpus of 582 Dutch novels written and published between 1970 and 2009.
- Corpus Huygens: Consists of 22 novels manually tagged with detailed named entity information. IPR for this corpus does not allow distribution.
- Corpus eBooks: Consists of 7000+ Dutch eBooks tagged automatically with basic NER features and person name part information. IPR for this corpus does not allow distribution.
- Corpus SoNaR Books: 105 Dutch books; NE tagged.
- Corpus Gutenberg Dutch: Consists of 530 NE-tagged TEI files converted from the Epub versions of the corresponding Gutenberg documents.

Recent research has conclusively proven that names in literary works can only be put fully into perspective when studied in a wider context (landscape) of names, either in the same text or in related material (the onymic landscape or “namescape”). Research on large corpora is needed to gain a better understanding of, for example, what is characteristic of a certain period, genre, author or cultural region. The data necessary for research on this scale simply does not exist yet.

NameScape aims to fill this need by providing a substantial number of literary works annotated with a rich tag set, thereby enabling researchers to perform their research in more depth than previously possible. Several exploratory visualization tools help the scholar to answer old questions and uncover many new ones, which can be addressed using the demonstrator.

de Does, J, Depuydt, K, van Dalen-Oskam, K and Marx, M. 2017. Namescape: Named Entity Recognition from a Literary Perspective. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 361–370. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.30. License: CC-BY 4.0

Karina van Dalen-Oskam (2013), Nordic Noir: a background check on Inspector Van Veeteren, 31 May 2012, http://blog.namescape.nl/?p=47
