CLARIN Tool Portal

Corpus Studio Web

1 resource

Summary

CorpusStudio is a web application that facilitates in-depth quantitative syntactic research for linguists.

Background

CorpusStudio supports researchers in writing queries that operate on syntactically parsed text corpora in a number of major XML formats. Queries that belong together are kept in XML documents called 'Corpus Research Projects' (CRPs). These documents contain the queries, the order in which they are to be executed, meta-information about the queries and the project as a whole, and a specification of the input used for the project. The use of CRPs helps improve the replicability of corpus research.

Access

Any CLARIN-NL user can access the CorpusStudio web application and make use of the 'standard' corpora. New users must provide a login name and password, after which they can make use of the application.

Adaptable

The CorpusStudio code is open source. Users can take the code, adapt it and use it for their own purposes. Users can also take the code from GitHub as it is and build their own server in order to run the application on their own text corpora. User documentation and an API are available (see below). The current version of CorpusStudio supports XML text corpora in the FoLiA and Psdx formats. Extensions to other XML formats are possible.

CrpxProcessor provides the basic functionality and is on GitHub at https://github.com/ErwinKomen/CrpxProcessor.
CrppServer takes care of /crpp and uses CrpxProcessor. It is on GitHub at https://github.com/ErwinKomen/CrppServer.
CrpStudio takes care of /crpstudio and uses CrpxProcessor. It is on GitHub at https://github.com/ErwinKomen/CrpStudio.

Main features

Keep all important aspects of a research project in one file
Define one or more search queries in a hierarchy
Uses the W3C-developed XQuery and XPath languages
Integrated CorpusStudio-specific XQuery functions
User-definable functions and variables
Create corpus result databases with user-definable features accompanying each hit
Divide the output into calculable categories
Divide the results into metadata-dependent groups
Parallel processing yields a speed-up by a factor of 20-100 compared to the Windows version
Compatibility with the Windows programs "Cesax" and "CorpusStudio"

Limitations and future developments

Current limitations of the program include: working with result databases, a restricted login system, no document view, grouping restricted to system-defined groups, and no query or project wizard. Although the CLARIN-NL project ended in December 2015, every effort will be made to add a number of essential features.
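As a rough illustration of what a query over a syntactically parsed XML corpus looks like, the sketch below counts hits per sentence with an XPath expression from Python's standard library. It does not use CorpusStudio, CRPs or the CorpusStudio XQuery functions. The element and attribute names (`sentence`, `node`, `cat`, `word`) and the toy corpus are invented for this example and are not the FoLiA or Psdx schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy corpus; real FoLiA or Psdx documents use different markup.
TOY_CORPUS = """
<corpus>
  <sentence id="s1">
    <node cat="NP"><word>the</word><word>king</word></node>
    <node cat="VP"><word>sleeps</word></node>
  </sentence>
  <sentence id="s2">
    <node cat="NP"><word>she</word></node>
    <node cat="VP"><word>reads</word>
      <node cat="NP"><word>a</word><word>book</word></node>
    </node>
  </sentence>
</corpus>
"""


def count_hits_per_sentence(xml_text, category):
    """Return {sentence id: number of constituents with the given category}."""
    root = ET.fromstring(xml_text)
    hits = {}
    for sentence in root.findall("sentence"):
        # XPath: every descendant <node> whose cat attribute equals the category.
        matches = sentence.findall(f".//node[@cat='{category}']")
        hits[sentence.get("id")] = len(matches)
    return hits


np_hits = count_hits_per_sentence(TOY_CORPUS, "NP")
print(np_hits)
```

Counting hits per sentence, rather than over the whole corpus, keeps each hit tied to its context. That is the kind of information a result database with features per hit would need.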

Komen, E. R. 2017. Beyond Counting Syntactic Hits. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 259–268. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.21. License: CC-BY 4.0

Komen, Erwin R. 2011. Coreferenced corpora for information structure research. In Outposts of Historical Corpus Linguistics: From the Helsinki Corpus to a Proliferation of Resources. (Studies in Variation, Contacts and Change in English 10) Jukka Tyrkkö, Terttu Nevalainen, Matti Rissanen & Matti Kilpiö (eds). Helsinki, Finland: Research Unit for Variation, Contacts, and Change in English.

Komen, Erwin R. 2013. Finding focus: a study of the historical development of focus in English. Utrecht: LOT.

Komen, Erwin R. 2013. Corpus databases with feature pre-calculation. In Proceedings of the twelfth workshop on treebanks and linguistic theories (TLT12). Sandra Kübler, Petya Osenova & Martin Volk (eds), 85-96. Sofia, Bulgaria: The institute of information and communication technologies, Bulgarian academy of sciences.

Frog: An advanced Natural Language Processing suite for Dutch

1 resource

Frog's current version tokenizes, tags, lemmatizes, and morphologically segments word tokens in Dutch text files, assigns a dependency graph to each sentence, identifies the base phrase chunks in the sentence, and attempts to find and label all named entities.

Van den Bosch, A., Busser, G.J., Daelemans, W., and Canisius, S. (2007). An efficient memory-based morphosyntactic tagger and parser for Dutch. In F. van Eynde, P. Dirix, I. Schuurman, and V. Vandeghinste (Eds.), Selected Papers of the 17th Computational Linguistics in the Netherlands Meeting, Leuven, Belgium, pp. 99-114.

Namescape Search

1 resource

Searching and visualizing named entities in modern Dutch novels.

The named entity (NE) tagging and resolution in NameScape enables quantitative and repeatable research where previously only guesswork and anecdotal evidence were possible. The visualisation module enables researchers with a less technical background to draw conclusions about the functions of names in literary work and helps them to explore the material in search of more interesting questions (and answers). Users from other communities (sociolinguistics, sentiment analysis, …) also benefit from the NE-tagged data, especially since the NE recognizer is available as a web service, enabling researchers to annotate their own research data.

Datasets in NameScape (total of 1,129 books):

Corpus Sanders: a corpus of 582 Dutch novels written and published between 1970 and 2009.
Corpus Huygens: 22 novels manually tagged with detailed named entity information. The IPR for this corpus does not allow distribution.
Corpus eBooks: 7000+ Dutch eBooks tagged automatically with basic NER features and person name part information. The IPR for this corpus does not allow distribution.
Corpus SoNaR Books: 105 Dutch books, NE tagged.
Corpus Gutenberg Dutch: 530 NE-tagged TEI files converted from the Epub versions of the corresponding Gutenberg documents.

Recent research has conclusively proven that names in literary works can only be put fully into perspective when studied in a wider context (landscape) of names, either in the same text or in related material (the onymic landscape or "namescape"). Research on large corpora is needed to gain a better understanding of, for example, what is characteristic of a certain period, genre, author or cultural region. The data necessary for research on this scale simply does not exist yet.

NameScape aims to fill this need by providing a substantial number of literary works annotated with a rich tag set, thereby enabling researchers to perform their research in more depth than previously possible. Several exploratory visualization tools help the scholar to answer old questions and uncover many new ones, which can be addressed using the demonstrator.

de Does, J, Depuydt, K, van Dalen-Oskam, K and Marx, M. 2017. Namescape: Named Entity Recognition from a Literary Perspective. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 361–370. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.30. License: CC-BY 4.0

Karina van Dalen-Oskam (2013), Nordic Noir: a background check on Inspector Van Veeteren, 31 May 2012, http://blog.namescape.nl/?p=47

Dutch Memory-Based Coreference Resolution Demo

1 resource

This demo is part of the Corea project. Its purpose is to tag noun-phrase referents in a text.

Namescape Visualizer

1 resource

RemBench - a Digital Workbench for Rembrandt Research

1 resource

RemBench lets users search and browse works of art, artists, primary sources and library sources related to Rembrandt. It offers faceted search by location, author/artist name, author/artist type and date range, as well as exact and fuzzy keyword search. It is available both as a web application and as a RESTful web service. RemBench combines the content of four databases behind one search interface: RKDartists and RKDimages, two databases maintained by the Netherlands Institute for Art History (RKD); RemDoc, a collection of original documents related to Rembrandt van Rijn from the period between 1475 and circa 1750; and RUQuest, a library system that provides access to full-text articles, as well as the complete collection of (e-)books and journals from the Radboud University Library Catalogue. RemBench does not influence the content of these databases.
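The combination of facets and keyword search described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not RemBench's implementation or its web service API. The `Record` fields, the function names, the placeholder records and the fuzzy-matching threshold are all assumptions made for this example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Record:
    # Hypothetical record layout, invented for this sketch.
    title: str
    creator: str
    location: str
    year: int


def keyword_matches(keyword, text, fuzzy, threshold=0.8):
    """Exact mode: substring match. Fuzzy mode: any word similar enough."""
    keyword, text = keyword.lower(), text.lower()
    if not fuzzy:
        return keyword in text
    return any(
        SequenceMatcher(None, keyword, word).ratio() >= threshold
        for word in text.split()
    )


def faceted_search(records, location=None, creator=None,
                   year_range=None, keyword=None, fuzzy=False):
    """Keep the records that satisfy every facet that was supplied."""
    hits = []
    for record in records:
        if location is not None and record.location != location:
            continue
        if creator is not None and record.creator != creator:
            continue
        if year_range is not None and not (year_range[0] <= record.year <= year_range[1]):
            continue
        if keyword is not None and not keyword_matches(keyword, record.title, fuzzy):
            continue
        hits.append(record)
    return hits


# Placeholder data, not taken from any of the four databases.
records = [
    Record("Portrait of a merchant", "Artist A", "Amsterdam", 1632),
    Record("Landscape with mill", "Artist A", "Leiden", 1645),
    Record("Portrait of a scholar", "Artist B", "Amsterdam", 1660),
]

fuzzy_hits = faceted_search(records, location="Amsterdam",
                            year_range=(1600, 1650), keyword="portrat", fuzzy=True)
exact_hits = faceted_search(records, location="Amsterdam",
                            year_range=(1600, 1650), keyword="portrat")
print([r.title for r in fuzzy_hits], [r.title for r in exact_hits])
```

The misspelled keyword only finds a record in fuzzy mode, which is the practical difference between the two keyword modes.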

Verberne, S, van Leeuwen, R, Gerritsen, G and Boves, L. 2017. RemBench: A Digital Workbench for Rembrandt Research. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 337–350. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.28. License: CC-BY 4.0

Evaluating Repetitions, or how to Improve your Multilingual ASR System by doing Nothing

1 resource

A demo of a speech recognizer for POIs (Points of Interest). This demo recognizes overnight accommodation addresses and eateries in several large cities (including Amsterdam, Antwerpen, Gent and Rotterdam).

This STEVIN project investigates new pronunciation modeling technologies that can improve the automatic recognition of spoken names in the context of a business service that provides POI (Point-of-Interest) information. It is a collaboration with RU (Nijmegen), UiL (Utrecht), Nuance and TeleAtlas.

MIGMAP: Detailed interactive mapping of migration in The Netherlands in the 20th century.

1 resource

MIGMAP is a web application that shows migration flows between Dutch municipalities. The user first chooses a generation (forward or backward in time) and a gender. The application then shows the migration map of the Netherlands for a municipality (or other aggregation unit) that the user selects interactively.

The data underlying the migration maps originate from the first name selection from the Civil Registration, acquired by Utrecht University and the Meertens Institute in 2006. They comprise 16 million records of persons with Dutch citizenship who were alive in 2006, plus 6 million persons who died before 2006 but are mentioned in other records, mainly as parents. The records include identifiers by which family relations can be reconstructed. After considerable effort in data cleaning and reconstruction of older generations, the data provide an almost complete overview of the Dutch population born after 1930, and a fairly good sample (>25%) from the period 1880-1930.

The generation options are: the places of birth of the current population, their parents, grandparents or great-grandparents; or, starting with the persons born between 1880 and 1900, the current places of residence of their children and grandchildren. Each map will be made available as a .csv record with municipality number and percentage as fields, so that users can use it in correlation studies with other variables.

Utrecht University and the Meertens Institute have the signed permission of the "Basisadministratie voor Persoonsgegevens en Reisdocumenten, The Hague" to use the data for scientific purposes. The migration maps present the data in an aggregated way and do not violate privacy requirements (no individual can be identified from the maps). However, the underlying data, which contain information about individual persons and their family relations, cannot be made available for reasons of privacy.
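A correlation study with one of the .csv map exports could look like the following sketch. The column names (`municipality`, `percentage`), the municipality numbers and all values are assumptions made up for illustration. The description above only states that each record has a municipality number and a percentage field.

```python
import csv
import io
import math

# Hypothetical map export; the real header names and values may differ.
MAP_CSV = """municipality,percentage
101,12.0
102,8.0
103,4.0
104,2.0
"""

# Hypothetical second variable per municipality (for example, from another survey).
OTHER_VARIABLE = {"101": 30.0, "102": 22.0, "103": 9.0, "104": 5.0}


def load_map(csv_text):
    """Read a map export into {municipality number: percentage}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["municipality"]: float(row["percentage"]) for row in reader}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


migration = load_map(MAP_CSV)
# Only municipalities present in both datasets can be compared.
shared = sorted(set(migration) & set(OTHER_VARIABLE))
r = pearson([migration[m] for m in shared], [OTHER_VARIABLE[m] for m in shared])
print(f"Pearson r over {len(shared)} municipalities: {r:.3f}")
```

Joining on the municipality number before computing the coefficient guards against maps and external variables that cover different sets of municipalities.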

Bloothooft, G, Onland, D and Kunst, J.P. 2017. Mapping Migration across Generations. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 351–360. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.29. License: CC-BY 4.0

Ekamper, P. and Bloothooft, G. (2013), "Weg van je wortels. De afstand tussen overgrootouders en achterkleinkinderen", DEMOS 29, 2, p. 8.

Namescape Barcode Browser

1 resource

A Visualiser/Editor for part-of-speech tagged corpora

1 resource

With this web application, an end user can view and edit corpora that have been tokenized, lemmatized and part-of-speech tagged with Adelheid.
