CLARIN Tool Portal

Frog: An advanced Natural Language Processing suite for Dutch

1 resources

Frog's current version will tokenize, tag, lemmatize, and morphologically segment word tokens in Dutch text files, will assign a dependency graph to each sentence, will identify the base phrase chunks in the sentence, and will attempt to find and label all named entities.

Van den Bosch, A., Busser, G.J., Daelemans, W., and Canisius, S. (2007). An efficient memory-based morphosyntactic tagger and parser for Dutch. In F. van Eynde, P. Dirix, I. Schuurman, and V. Vandeghinste (Eds.), Selected Papers of the 17th Computational Linguistics in the Netherlands Meeting, Leuven, Belgium, pp. 99-114.

COAVA: Cognition, Acquisition and Variation Tool

1 resources

In COAVA two sets of databases are made available in a standardized way: one with historical dialect data (the databases WBD and WLD, with lexical data of the Brabantish and Limburgian dialects between 1880 and 1980) and one with first language acquisition data (four databases from the CHILDES project). The databases contain linguistic information (dialect form, standardised ("Dutchified") form, lexical meaning), geographical information (locality, dialect area, province) and information on the source (inquiry forms or monotopic dictionaries, and the date of documentation). Visualising the first two sets of information yields lexical maps. The most typical way for the user to reach the data is through the browsable concept taxonomy; in other words, the databases can be approached via search tools but also via a thematic taxonomy. This taxonomy was developed for the dialect databases and covers the general vocabulary.

COAVA (COgnition, Acquisition and VAriation Tool) brings together two strange bedfellows: first language acquisition and historical dialectology. In historical linguistics it is commonly assumed that language change in the past is due to non-target-like transmission of linguistic features between generations, i.e. between parents and children. Despite this assumption, the two disciplines remain isolated from each other because of, among other things, different methods of data collection and different types of resources with empirical data. The aim of the COAVA project was to demonstrate that this common assumption can be examined in detail with the help of Digital Humanities. This interdisciplinary research aims at developing a tool for easily exploring the linguistic characteristics of concepts.

Leonie Cornips, Jos Swanenberg, Wilbert Heeringa, Folkert de Vriend (2016). The relationship between first language acquisition and dialect variation: Linking resources from distinct disciplines in a CLARIN-NL project. Lingua, Vol. 178, 07.2016, p. 32-45. doi:10.1016/j.lingua.2015.11.007

Cornips, L., Swanenberg, J., Vriend, F. de, Heeringa, W. (2012), Is what we have acquired early, less vulnerable to variation? A comparison between data from dialectlexicography and data from first language acquisition. http://www.meertens.knaw.nl/coavasite/wp-content/uploads/2012/10/Abstract-SIDG-2-JS.pdf

Cornips, L., Kemps-Snijders, M., Snijders, M., Swanenberg, J. and Vriend, F. de (2011). Bridging the Gap between First Language Acquisition and Historical Dialectology with the Help of Digital Humanities. SDH Copenhagen. http://www.meertens.knaw.nl/coavasite/wp-content/uploads/2011/11/Paper-SDH.pdf

Namescape Search

1 resources

Searching and visualizing Named Entities in modern Dutch novels. The named entity (NE) tagging and resolution in NameScape enables quantitative and repeatable research where previously only guesswork and anecdotal evidence were feasible. The visualisation module enables researchers with a less technical background to draw conclusions about the functions of names in literary work and helps them to explore the material in search of more interesting questions (and answers). Users from other communities (sociolinguistics, sentiment analysis, …) also benefit from the NE-tagged data, especially since the NE recognizer is available as a web service, enabling researchers to annotate their own research data.

Datasets in NameScape (total of 1,129 books):
- Corpus Sanders: a corpus of 582 Dutch novels written and published between 1970 and 2009.
- Corpus Huygens: 22 novels manually tagged with detailed named entity information. IPR for this corpus do not allow distribution.
- Corpus eBooks: 7000+ Dutch eBooks tagged automatically with basic NER features and person name part information. IPR for this corpus do not allow distribution.
- Corpus SoNaR Books: 105 Dutch books, NE tagged.
- Corpus Gutenberg Dutch: 530 NE-tagged TEI files converted from the Epub versions of the corresponding Gutenberg documents.

Recent research has conclusively proven that names in literary works can only be put fully into perspective when studied in a wider context (landscape) of names, either in the same text or in related material (the onymic landscape or "namescape"). Research on large corpora is needed to gain a better understanding of, e.g., what is characteristic of a certain period, genre, author or cultural region. The data necessary for research on this scale simply do not exist yet. NameScape aims to fill this need by providing a substantial number of literary works annotated with a rich tag set, thereby enabling researchers to perform their research in more depth than previously possible. Several exploratory visualization tools help the scholar to answer old questions and uncover many more new ones, which can be addressed using the demonstrator.

de Does, J, Depuydt, K, van Dalen-Oskam, K and Marx, M. 2017. Namescape: Named Entity Recognition from a Literary Perspective. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 361–370. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.30. License: CC-BY 4.0

Karina van Dalen-Oskam (2013), Nordic Noir: a background check on Inspector Van Veeteren, 31 May 2012, http://blog.namescape.nl/?p=47

Dutch Memory-Based Coreference Resolution Demo

1 resources

This demo is part of the Corea project; its purpose is to tag noun-phrase referents in a text.

Polimedia: Interlinking multimedia for the analysis of media coverage of political debates

1 resources

PoliMedia links the minutes of the debates in the Dutch Parliament (Dutch Hansard) to the databases of historical newspapers and ANP radio bulletins to allow cross-media analysis of coverage in a uniform search interface. For each fragment from a single speaker in a debate, the developers extracted relevant information: the speaker, the date, important terms from its content and important terms from the description of the complete debate. This information was then combined to create a query with which they searched the archives of newspapers, radio bulletins and television programmes. Media items that corresponded to this query were retrieved and a link was created between the speech and the media item, creating a Semantic Web of Dutch Hansard and media coverage. This Semantic Web contains links from the Dutch Hansard to newspaper articles and radio bulletins. Evaluations found a recall of 62% and a precision of 75%.

To navigate this Semantic Web, a search user interface was developed based on a requirements study with five scholars in history and political communication. The developers created a faceted search interface in which the Dutch parliamentary minutes can be searched in full text and in which results can be refined by speaker, role of the speaker (parliament or government), political party and year. The debates are presented with links to the original locations of the media items. PoliMedia is a collaboration of the TU Delft and the Free University (development of the Semantic Web of Dutch Hansard and media), the Netherlands Institute of Sound and Vision (development of the search user interface) and Erasmus University Rotterdam (project leader and user research among historians and political communication researchers).
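
The linking step described above (extract speaker, date and terms from a speech fragment, combine them into a query, match media items) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the PoliMedia implementation: the `SpeechFragment` class, the dictionary query format, the 7-day date window and the `min_terms` threshold are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class SpeechFragment:
    """One contribution by a single speaker in a debate (hypothetical model)."""
    speaker: str
    spoken_on: date
    content_terms: list[str]  # important terms from the speech itself
    debate_terms: list[str]   # important terms from the debate description


def build_query(fragment: SpeechFragment, window_days: int = 7) -> dict:
    """Combine the extracted fields into one query (hypothetical format)."""
    # Merge both term lists, dropping duplicates while keeping first-seen order.
    terms = list(dict.fromkeys(fragment.content_terms + fragment.debate_terms))
    return {
        "speaker": fragment.speaker,
        "terms": terms,
        # Assumption: coverage appears within a few days after the speech.
        "date_from": fragment.spoken_on,
        "date_to": fragment.spoken_on + timedelta(days=window_days),
    }


def matches(query: dict, item_text: str, item_date: date, min_terms: int = 2) -> bool:
    """Decide whether a media item should be linked to the speech."""
    if not (query["date_from"] <= item_date <= query["date_to"]):
        return False
    text = item_text.lower()
    if query["speaker"].lower() not in text:
        return False
    # Count how many query terms occur in the item (case-insensitive).
    hits = sum(1 for term in query["terms"] if term.lower() in text)
    return hits >= min_terms
```

A link between a speech and a media item would be recorded whenever `matches` returns `True`; tightening `min_terms` or the date window trades recall for precision.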

Juric, D., Hollink, L., and Houben, G. (2013). Discovering links between political debates and media. The 13th International Conference on Web Engineering (ICWE'13). Aalborg, Denmark.

Juric, D., Hollink, L., and Houben, G. (2012). Bringing parliamentary debates to the Semantic Web. DeRiVE workshop on Detection, Representation, and Exploitation of Events in the Semantic Web.

Kemman, M. J., and Kleppe, M. (2013). PoliMedia - Improving Analyses of Radio, TV and Newspaper Coverage of Political Debates. In T. Aalberg et al. (Eds.), TPDL 2013, LNCS 8092 (pp. 409-412). Springer-Verlag Berlin Heidelberg.

Kemman, M. J., Kleppe, M., and Maarseveen, J. (2013). Eye Tracking the Use of a Collapsible Facets Panel in a Search Interface. In T. Aalberg et al. (Eds.), TPDL 2013, LNCS 8092 (pp. 405-408). Springer-Verlag Berlin Heidelberg.

Martijn Kleppe, Laura Hollink, Max Kemman, Damir Juric, Henri Beunders, Jaap Blom, Johan Oomen and Geert-Jan Houben. PoliMedia: Analysing Media Coverage of political debates by automatically generated links to Radio & Newspaper Items. http://ceur-ws.org/Vol-1124/linkedup_veni2013_04.pdf

Namescape Visualizer

1 resources

Searching and visualizing Named Entities in modern Dutch novels. The named entity (NE) tagging and resolution in NameScape enables quantitative and repeatable research where previously only guesswork and anecdotal evidence were feasible. The visualisation module enables researchers with a less technical background to draw conclusions about the functions of names in literary work and helps them to explore the material in search of more interesting questions (and answers). Users from other communities (sociolinguistics, sentiment analysis, …) also benefit from the NE-tagged data, especially since the NE recognizer is available as a web service, enabling researchers to annotate their own research data.

Datasets in NameScape (total of 1,129 books):
- Corpus Sanders: a corpus of 582 Dutch novels written and published between 1970 and 2009.
- Corpus Huygens: 22 novels manually tagged with detailed named entity information. IPR for this corpus do not allow distribution.
- Corpus eBooks: 7000+ Dutch eBooks tagged automatically with basic NER features and person name part information. IPR for this corpus do not allow distribution.
- Corpus SoNaR Books: 105 Dutch books, NE tagged.
- Corpus Gutenberg Dutch: 530 NE-tagged TEI files converted from the Epub versions of the corresponding Gutenberg documents.

Recent research has conclusively proven that names in literary works can only be put fully into perspective when studied in a wider context (landscape) of names, either in the same text or in related material (the onymic landscape or "namescape"). Research on large corpora is needed to gain a better understanding of, e.g., what is characteristic of a certain period, genre, author or cultural region. The data necessary for research on this scale simply do not exist yet. NameScape aims to fill this need by providing a substantial number of literary works annotated with a rich tag set, thereby enabling researchers to perform their research in more depth than previously possible. Several exploratory visualization tools help the scholar to answer old questions and uncover many more new ones, which can be addressed using the demonstrator.

de Does, J, Depuydt, K, van Dalen-Oskam, K and Marx, M. 2017. Namescape: Named Entity Recognition from a Literary Perspective. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 361–370. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.30. License: CC-BY 4.0

Karina van Dalen-Oskam (2013), Nordic Noir: a background check on Inspector Van Veeteren, 31 May 2012, http://blog.namescape.nl/?p=47

TTNWW - TST Tools for the Dutch Language as Web services in a Workflow

1 resources

TTNWW integrates and makes available existing Language Technology (LT) software components for the Dutch language that have been developed in the STEVIN and CGN projects. The LT components (for text and speech) are made available as web services in a simplified workflow system that enables researchers without much technical background to use standard LT workflow recipes. The web services are available in two separate domains: "Text" and "Speech" processing.

For "Text", TTNWW offers workflows for the following functionality:
- Orthographic Normalisation using TICCLops (version CLARIN-NL 1.0);
- Part of Speech Tagging, Lemmatisation, Chunking, limited Multiword Unit Recognition, and Grammatical Relation Assignment by Frog (version 012.012);
- Syntactic Parsing (including grammatical relation assignment, limited named entity recognition, and limited multiword unit recognition) by the Alpino Parser (version 1.3);
- Semantic Annotation;
- Named Entity Recognition;
- Co-reference Assignment.

For "Speech", the following workflows are offered:
- Automatic Transcription of speech files using a Netherlands Dutch acoustic model;
- Automatic Transcription of speech files using a Flemish Dutch acoustic model;
- Conversion of the input speech file to the required sampling rate, followed by automatic transcription.

The TTNWW services have been created in a Dutch and Flemish collaboration project building on the results of past Dutch and Flemish projects. The web services are partly deployed in the SURF-SARA BiG-Grid cloud or at CLARIN centres in the Netherlands and at CLARIN VL university partners.

The architecture of the TTNWW portal consists of several components and follows the principles of Service Oriented Architecture (SOA). The TTNWW GUI front-end is a Flex module that communicates with the TTNWW web application, which keeps track of the different sessions and knows which LT recipes are available. TTNWW communicates assignments (workflow specifications) to the WorkflowService, which evaluates the requested workflow and requests the DeploymentService to start the required LT web services. After initialization of the LT web services, the workflow specification is sent to the Taverna Server, which takes further care of the workflow.

To facilitate the wrapping of applications that were originally designed as standalone applications into web services, the CLAM (Computational Linguistics Application Mediator) wrapper software allows for easy and transparent transformation of applications into RESTful web services. The CLAM software has been used extensively in the TTNWW project for both text and speech processing tools. With the exception of Alpino and MBSRL, all web services operate on CLAM wrappers.

Given the number of web services involved in the TTNWW project and the possibilities offered by the cloud environment, the preferred method of delivering the web service installations was delivery of complete virtual machine images by the LT providers. These could be uploaded directly into the cloud environment, thus relieving the CLARIN centres and LT providers of the originally foreseen task of running the web services themselves. A potential advantage of this method, which has not yet been exploited in the project, is that these images may also be delivered directly to end users, who can then run them in a local configuration using virtualization software such as VMware or VirtualBox.

The workflow engine used in the project was Taverna. Built on top of this were a number of selectable task recipes, following a task-oriented approach in line with the premise that users with little or no technical expertise should be able to use the system. In this context, tasks are understood in terms of the end results of processes such as semantic role labelling, POS tagging or syntactic analysis, and ready-made workflows are constructed that can be readily used by the end user.
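
The idea of task recipes (a task name standing for a ready-made workflow, from which the services to be started can be derived) can be sketched as below. This is a hypothetical illustration, not TTNWW code: the recipe names, the lowercase service identifiers and the dictionary layout of the workflow specification are assumptions made for the example.

```python
# Hypothetical recipe table: each task maps to an ordered list of LT services.
RECIPES = {
    "pos_tagging": ["frog"],
    "syntactic_analysis": ["alpino"],
    "normalise_then_tag": ["ticclops", "frog"],
}


def plan_workflow(task: str) -> dict:
    """Turn a task name into a workflow specification (illustrative only)."""
    if task not in RECIPES:
        raise KeyError(f"unknown task recipe: {task}")
    steps = RECIPES[task]
    return {
        "task": task,
        "steps": steps,
        # Each distinct service has to be running before the workflow starts.
        "services_to_start": sorted(set(steps)),
    }
```

In this sketch the end user only ever picks a key of `RECIPES`; the order of steps and the set of services to deploy stay hidden behind the task name.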

Kemps-Snijders, M, Schuurman, I, Daelemans, W, Demuynck, K, Desplanques, B, Hoste, V, Huijbregts, M, Martens, J-P, Paulussen, H, Pelemans, J, Reynaert, M, Vandeghinste, V, van den Bosch, A, van den Heuvel, H, van Gompel, M, van Noord, G and Wambacq, P. 2017. TTNWW to the Rescue: No Need to Know How to Handle Tools and Resources. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 83–93. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.7. License: CC-BY 4.0

RemBench - a Digital Workbench for Rembrandt Research

1 resources

RemBench lets users search and browse works of art, artists, primary sources and library sources related to Rembrandt, using faceted search by location, author/artist name, author/artist type and date range, and/or exact and fuzzy keyword search. It offers both a web application and a RESTful web service. RemBench combines the content of four different databases behind one search interface: RKDartists and RKDimages, two databases maintained by the Netherlands Institute for Art History (RKD); RemDoc, a collection of original documents related to Rembrandt van Rijn from the period between 1475 and circa 1750; and RUQuest, a library system that provides access to full-text articles as well as the complete collection of (e-)books and journals from the Radboud University Library Catalogue. RemBench does not influence the content of these databases.
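
How facets and exact or fuzzy keyword search can combine is sketched below. This is a minimal illustration and not RemBench's actual implementation or web service API: the `Record` fields, the word-level similarity test with `difflib.SequenceMatcher` and the 0.8 threshold are assumptions, and the sample records are only illustrative data.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Record:
    """A simplified search record (hypothetical fields)."""
    title: str
    creator: str
    location: str
    year: int


def keyword_match(keyword: str, text: str, fuzzy: bool = False,
                  threshold: float = 0.8) -> bool:
    """Exact mode: keyword equals a word; fuzzy mode: keyword resembles a word."""
    words = text.lower().split()
    keyword = keyword.lower()
    if not fuzzy:
        return keyword in words
    return any(SequenceMatcher(None, keyword, w).ratio() >= threshold
               for w in words)


def search(records, keyword, fuzzy=False, location=None, creator=None,
           year_range=None):
    """Apply the facet filters first, then the keyword test on the title."""
    results = []
    for r in records:
        if location is not None and r.location != location:
            continue
        if creator is not None and r.creator != creator:
            continue
        if year_range is not None and not (year_range[0] <= r.year <= year_range[1]):
            continue
        if keyword_match(keyword, r.title, fuzzy):
            results.append(r)
    return results


records = [
    Record("The Night Watch", "Rembrandt", "Amsterdam", 1642),
    Record("The Anatomy Lesson", "Rembrandt", "The Hague", 1632),
]
# A misspelled keyword still finds the second record in fuzzy mode.
fuzzy_hits = search(records, "anatomi", fuzzy=True)
```

Each facet narrows the candidate set independently, so leaving a facet as `None` simply skips that filter.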

Verberne, S, van Leeuwen, R, Gerritsen, G and Boves, L. 2017. RemBench: A Digital Workbench for Rembrandt Research. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 337–350. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.28. License: CC-BY 4.0

Evaluating Repetitions, or how to Improve your Multilingual ASR System by doing Nothing

1 resources

A demo of a speech recognizer for POIs (Points of Interest). This demo recognizes overnight accommodation and eateries in some big cities (among others Amsterdam, Antwerpen, Gent and Rotterdam).

This STEVIN project investigates new pronunciation modeling technologies that can improve the automatic recognition of spoken names in the context of a business service that provides POI (Point-of-Interest) information. It is a collaboration with RU (Nijmegen), UiL (Utrecht), Nuance and TeleAtlas.

MIGMAP: Detailed interactive mapping of migration in The Netherlands in the 20th century.

1 resources

MIGMAP is a web application that can show migration flows between Dutch municipalities. The user first chooses a generation (forward or backward in time) and a gender; the migration map of the Netherlands for an interactively selected municipality (or other aggregation unit) is then shown. The data underlying the migration maps originate from the first name selection from the Civil Registration, acquired by Utrecht University and the Meertens Institute in 2006. These comprise 16 million records of persons with Dutch citizenship who were alive in 2006, and in addition 6 million persons who died before 2006 but are mentioned in other records, mainly as parents. The records include identifiers by which family relations can be reconstructed. After considerable effort in data cleaning and reconstruction of older generations, the data provide an almost complete overview of the Dutch population born after 1930, and a fairly good sample from the period 1880-1930 (>25%).

The user will be given options to choose a generation (places of birth of the current population, their parents, grandparents or great-grandparents; or, starting with the persons born between 1880 and 1900, the current places of residence of their children and grandchildren) and a gender. Each map will be made available as a .csv file with municipality number and percentage as fields, and can thus be used in correlation studies with other variables.

Utrecht University and the Meertens Institute have the signed permission of the "Basisadministratie voor Persoonsgegevens en Reisdocumenten, The Hague" to use the data for scientific purposes. The migration maps present the data in an aggregated way and do not violate privacy requirements (no individual can be identified from the maps). However, the underlying data, which contain information about individual persons and their family relations, cannot be made available for reasons of privacy.

Bloothooft, G, Onland, D and Kunst, J.P. 2017. Mapping Migration across Generations. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 351–360. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.29. License: CC-BY 4.0

Ekamper, P. and Bloothooft, G. (2013), "Weg van je wortels. De afstand tussen overgrootouders en achterkleinkinderen", DEMOS 29, 2, p. 8.