CLARIN Tool Portal

Evaluating Repetitions, or how to Improve your Multilingual ASR System by doing Nothing

1 resource

A demo of a speech recognizer for POIs (Points of Interest). The demo recognizes accommodation addresses and eateries in several large cities (including Amsterdam, Antwerpen, Gent and Rotterdam).

This STEVIN project investigates new pronunciation modeling technologies that can improve the automatic recognition of spoken names in the context of a business service that provides POI (Point of Interest) information. It is a collaboration with RU (Nijmegen), UiL (Utrecht), Nuance and TeleAtlas.

MIGMAP: Detailed interactive mapping of migration in The Netherlands in the 20th century.

1 resource

MIGMAP is a web application that shows migration flows between Dutch municipalities. The user first chooses a generation (forward or backward in time) and a gender; the application then shows the migration map of the Netherlands for a municipality (or other aggregation unit) that the user points to interactively. The available generations are the places of birth of the current population, of their parents, grandparents and great-grandparents or, starting from the persons born between 1880 and 1900, the current places of residence of their children and grandchildren.

The data underlying the migration maps originate from the first name selection from the Civil Registration, acquired by Utrecht University and the Meertens Institute in 2006. They comprise 16 million records of persons with Dutch citizenship who were alive in 2006, plus 6 million records of persons who died before 2006 but are mentioned in other records, mainly as parents. The records include identifiers from which family relations can be reconstructed. After considerable effort in data cleaning and in the reconstruction of older generations, the data provide an almost complete overview of the Dutch population born after 1930 and a fairly good sample (>25%) for the period 1880-1930.

Each map will be made available as a .csv file with municipality number and percentage as fields, so that users can use it in correlation studies with other variables.

Utrecht University and the Meertens Institute have the signed permission of the "Basisadministratie voor Persoonsgegevens en Reisdocumenten, The Hague" to use the data for scientific purposes. The migration maps present the data in aggregated form and do not violate privacy requirements (no individual can be identified from the maps). However, the underlying data, which contain information about individual persons and their family relations, cannot be made available for reasons of privacy.
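
The correlation use of the .csv export mentioned above can be sketched as follows. This is a minimal illustration, not part of MIGMAP: the column names (municipality_number, percentage), the sample values and the second variable are all hypothetical, since the description only states that each file holds a municipality number and a percentage.

```python
import csv
import io
import math

# Hypothetical export: column names and values are invented for illustration.
MIGRATION_CSV = """municipality_number,percentage
344,12.5
363,8.1
518,3.4
599,6.0
"""

# Hypothetical second variable per municipality (any indicator of interest).
OTHER_VARIABLE = {"344": 40.2, "363": 31.0, "518": 12.7, "599": 25.5}


def load_percentages(text):
    """Read a municipality -> percentage mapping from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["municipality_number"]: float(row["percentage"]) for row in reader}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlate(percentages, other):
    """Correlate the two variables over municipalities present in both."""
    shared = sorted(set(percentages) & set(other))
    return pearson([percentages[m] for m in shared], [other[m] for m in shared])


percentages = load_percentages(MIGRATION_CSV)
r = correlate(percentages, OTHER_VARIABLE)
print(f"municipalities compared: {len(percentages)}, Pearson r = {r:.3f}")
```

Joining on the municipality number rather than on row order keeps the comparison valid when the two sources list municipalities in a different order or cover different sets.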

Bloothooft, G, Onland, D and Kunst, J.P. 2017. Mapping Migration across Generations. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 351–360. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.29. License: CC-BY 4.0

Ekamper, P. en Bloothooft, G. (2013), "Weg van je wortels. De afstand tussen overgrootouders en achterkleinkinderen", DEMOS 29, 2, p8.

Namescape Barcode Browser

1 resource

Searching and visualizing named entities in modern Dutch novels. The named entity (NE) tagging and resolution in NameScape enable quantitative and repeatable research where previously only guesswork and anecdotal evidence were feasible. The visualisation module enables researchers with a less technical background to draw conclusions about the functions of names in literary work and helps them explore the material in search of more interesting questions (and answers). Users from other communities (sociolinguistics, sentiment analysis, …) also benefit from the NE-tagged data, especially since the NE recognizer is available as a web service that lets researchers annotate their own research data.

Datasets in NameScape (total of 1,129 books):
- Corpus Sanders: a corpus of 582 Dutch novels written and published between 1970 and 2009.
- Corpus Huygens: 22 novels manually tagged with detailed named entity information. IPR for this corpus does not allow distribution.
- Corpus eBooks: 7000+ Dutch eBooks tagged automatically with basic NER features and person name part information. IPR for this corpus does not allow distribution.
- Corpus SoNaR Books: 105 Dutch books, NE tagged.
- Corpus Gutenberg Dutch: 530 NE-tagged TEI files converted from the Epub versions of the corresponding Gutenberg documents.

Recent research indicates that names in literary works can only be put fully into perspective when studied in a wider context (landscape) of names, either in the same text or in related material (the onymic landscape or “namescape”). Research on large corpora is needed to gain a better understanding of, for example, what is characteristic of a certain period, genre, author or cultural region. The data necessary for research on this scale do not exist yet. NameScape aims to fill this need by providing a substantial number of literary works annotated with a rich tag set, thereby enabling researchers to perform their research in more depth than was previously possible. Several exploratory visualization tools, available in the demonstrator, help the scholar answer old questions and uncover new ones.

de Does, J, Depuydt, K, van Dalen-Oskam, K and Marx, M. 2017. Namescape: Named Entity Recognition from a Literary Perspective. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 361–370. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.30. License: CC-BY 4.0

Karina van Dalen-Oskam (2013), Nordic Noir: a background check on Inspector Van Veeteren, 31 May 2012, http://blog.namescape.nl/?p=47

TiCClops: Text-Induced Corpus Clean-up online processing system

3 resources

TICCL (Text-Induced Corpus Clean-up) is a system designed to search a corpus for all existing variants of (potentially) all words occurring in it. The corpus can be one text or several, in one or more directories, located on one or more machines. TICCL creates word frequency lists that state, for each word type, how often it occurs in the corpus. The frequency of a normalized word form is the sum of the frequencies of the actual word forms found in the corpus.

TICCL is intended to detect and correct typographical errors (misprints) and OCR (optical character recognition) errors in texts. When books or other texts are scanned from paper and the resulting images are converted into digital text files, errors occur. For instance, the letter combination 'in' can be read as 'm', so that the word 'regeering' is incorrectly reproduced as 'regeermg'. TICCL can be used to detect such errors and to suggest a correct form.

TICCL was first developed as a prototype at the request of the Koninklijke Bibliotheek, The Hague (KB), and was reworked into a production tool according to KB specifications, mainly during the second half of 2008 (currently at production version 2.0). It is a fully functional environment for processing possibly very large corpora in order to remove most of the undesirable lexical variation in them. It has provisions for various input and output formats, is flexible and robust, and has very high recall and acceptable precision. As a spelling variation detection system it is, to the developer’s knowledge, unique in making principled use of the input text as a possible source of canonical target forms. As such it is far less domain-sensitive than other approaches: the domain is largely covered by the input text collection.

TICCL comes in two variants: one with a classic CLAM web application interface and one with the PhilosTEI interface.
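
The idea of a word frequency list whose normalized frequencies sum over the observed variants can be illustrated with a small sketch. It is not TICCL's actual algorithm: it assumes a brute-force pairwise Levenshtein comparison, which is only practical for tiny vocabularies, and the function names, thresholds and sample text are hypothetical.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lowercased word type occurs in the text."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def suggest_corrections(freqs, max_distance=2, min_ratio=2):
    """Map each rarer form to the most frequent nearby form in the same corpus."""
    suggestions = {}
    for word, count in freqs.items():
        candidates = [
            other
            for other, other_count in freqs.items()
            if other != word
            and other_count >= min_ratio * count
            and edit_distance(word, other) <= max_distance
        ]
        if candidates:
            suggestions[word] = max(candidates, key=lambda w: freqs[w])
    return suggestions


def normalized_frequencies(freqs, suggestions):
    """Sum the counts of all variants under their suggested canonical form."""
    totals = Counter()
    for word, count in freqs.items():
        totals[suggestions.get(word, word)] += count
    return totals


sample = "De regeering besluit. De regeering meldt. De regeermg zwijgt."
freqs = word_frequencies(sample)
suggestions = suggest_corrections(freqs)
normalized = normalized_frequencies(freqs, suggestions)
print(suggestions)
print(normalized["regeering"])
```

The sketch takes its correction candidates from the input text itself, which mirrors the text-induced principle described above, but the candidate search and ranking are deliberately simplistic.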

Reynaert, M. (2008). All, and only, the errors: More complete and consistent spelling and OCR-error correction evaluation. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), Marrakech, Morocco.

Reynaert, M. (2010). Character confusion versus focus word-based correction of spelling and OCR variants in corpora. International Journal on Document Analysis and Recognition, pp. 1-15. URL: http://dx.doi.org/10.1007/s10032-010-0133-5

DuELME: Search interface to the Dutch Electronic Lexicon of Multiword Expressions

1 resource

The DuELME search interface provides access to the DuELME electronic lexicon, which contains more than 5,000 Dutch multiword expressions (MWEs). MWEs with the same syntactic pattern are grouped in the same equivalence class. The search interface lets users search for MWEs on the basis of a range of syntactic and semantic criteria, including expression, pattern id, written form, type, conjugation, polarity, parameters and form. Extensive documentation on the structure of the database is available. DuELME (Dutch Electronic Lexicon of Multiword Expressions) is one of the results of the project Identification and Representation of Multiword Expressions (IRME). The lexical descriptions are intended to be highly theory- and implementation-neutral. The DuELME LMF lexicon is suitable for theoretical research on multiword expressions as well as for use in NLP systems. The DuELME-LMF project was carried out within the CLARIN-NL programme.

Grégoire, N. (2009), Untangling Multiword Expressions. A study on the representation and variation of Dutch multiword expressions, PhD thesis, University of Utrecht.

Alpino: a dependency parser for Dutch

1 resource

Alpino is a dependency parser for Dutch, developed in the context of the PIONIER Project Algorithms for Linguistic Processing.

Bouma, G., van Noord, G. J. M. and Malouf, R. 2001. Alpino: Wide-coverage computational analysis of Dutch. In: Daelemans, W., Simaan, K., Veenstra, J. and Zavrel, J. (eds.) Computational Linguistics in the Netherlands 2000. Amsterdam: Rodopi, pp. 45-59. (Language and Computers: Studies in Practical Linguistics)

Robert Malouf and Gertjan van Noord. Wide Coverage Parsing with Stochastic Attribute Value Grammars. In: IJCNLP-04 Workshop Beyond Shallow Analyses - Formalisms and statistical modeling for deep analyses.

Leonoor van der Beek, Gosse Bouma, and Gertjan van Noord. Een brede computationele grammatica voor het Nederlands. 2002. Nederlandse Taalkunde. https://www.let.rug.nl/~vannoord/papers/taalkunde.pdf .

VK: Verrijkt Koninkrijk (Enriched Kingdom)

2 resources

Dr Loe de Jong’s Het Koninkrijk der Nederlanden in de Tweede Wereldoorlog remains the most appealing history of German-occupied Dutch society (1940-1945). Published between 1969 and 1991, its 14 volumes, consisting of 30 parts and 18,000 pages, combine the qualities of an authoritative work for a general audience and an unavoidable point of reference for scholars. In VK this corpus is enriched with:
- Tokenization, sentence splitting, part-of-speech tagging and lemmatization, done with the FROG software from Tilburg University;
- Named entity recognition, done with UvA's NE tagger, which was specially trained for Dutch within the Stevin DuoMan project;
- Polarity tagging (positive/negative connotation of words), done with UvA's FietsTas software, which was developed for Dutch within the Stevin DuoMan project;
- Named entity reconciliation by linking to Wikipedia, done with software developed by Edgar Meij (UvA).

REST web interface, HTTP GET

De Boer, V., J. van Doornik, L. Buitinck, K. Ribbens, and T. Veken. Enriched Access to a Large War Historical Text using the Back of the Book Index. Extended abstract presented at the Workshop on Semantic Web and Information Extraction (SWAIE 2012), Galway, Ireland, 9 October 2012.

L. Buitinck and M. Marx, Two-Stage Named-Entity Recognition Using Averaged Perceptrons. In: Proceedings of NLDB, Groningen, Netherlands, 2012. http://link.springer.com/chapter/10.1007%2F978-3-642-31178-9_17

Automatic Transcription of Oral History Interviews

1 resource

This web service and web application use automatic speech recognition to produce transcriptions of recordings spoken in Dutch. You can upload and process only one file per project. For bulk processing and other questions, please contact Henk van den Heuvel at h.vandenheuvel@let.ru.nl.

PILNAR: Pilgrimage Narratives - a corpus for studying the profile of the modern pilgrim

1 resource

A corpus of pilgrimage narratives with Dutch texts written after ca. 2000 that present the thoughts and impressions of pilgrims to Santiago de Compostela. The PILNAR corpus is a source for research in a variety of (sub)disciplines: culture studies, ritual and religious studies, but also media and e-culture studies (cf. the use of blogs and other social media for the self-presentation of experiences). Only for authorized users. The PILNAR corpus contains six subcorpora:
- Volumes of De Jacobsstaf 1986-: 84 pdf files;
- Volumes of De Pelgrim of the Flemish Society of Santiago de Compostela, nos. 1-4 (16 MB, 10 MB, 16 MB) (both societies collaborate closely);
- Volumes of Ultreia, a newsletter; 3 issues available now: January, February, April 2011;
- Pilgrimage accounts and blogs by pilgrims available via the Societies (Netherlands: circa 140 files; Flemish: circa 138 files);
- A corpus of pilgrimage narratives compiled on the occasion of the exhibition in Museum Catharijneconvent held in collaboration with the Society: www.pelgrimsverhalen.nl; already on the site now: about 180 fields (as of July 2011);
- Accounts and narratives that come in after a specially targeted notice via the site and periodical of the Society (De Jacobsstaf), with perhaps a Flemish companion piece (De Pelgrim).

AVResearcherXL: Exploring audiovisual metadata in historical context

1 resource

AVResearcherXL is a tool for exploring radio and television programme descriptions, television subtitles and general newspaper articles. The interface searches across the "iMMix" catalogue of the Netherlands Institute for Sound and Vision and a selection of newspapers of the KB, the National Library of the Netherlands. As of the end of 2014, AVResearcherXL uses the following data:

iMMix
- 932,035 broadcasts indexed
- 18,124 broadcasts with subtitles
- Date of the first broadcast in the index: 1 January 1900
- Date of the last broadcast in the index: 26 October 2013

KB newspapers
- 25,811,413 articles indexed
- 16,294,029 articles of type "artikel"
- 8,483,542 articles of type "advertentie"
- 630,929 articles of type "illustratie met onderschrift"
- 402,913 articles of type "familiebericht"
- Date of the first article in the index: 1 January 1900
- Date of the last article in the index: 30 November 1994

AVResearcherXL is financially supported by CLARIN-NL within the QuaMeRDES project and by CLARIAH-SEED within the Research Instruments for Media Studies project. AVResearcherXL is an extended version of MeRDES, the tool developed in 2012 by the NWO-CATCH project BRIDGE. MeRDES was further developed into AVResearcher by the Netherlands Institute for Sound and Vision in 2013. AVResearcherXL is a collaborative project of the Centre for Television in Transition (Utrecht University), the Intelligent Systems Lab Amsterdam (University of Amsterdam) and the Netherlands Institute for Sound and Vision. The partners worked together with Dispectu on the development of the interface and back-end, and with Frontwise on the styling of the interface.
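
The index figures quoted above can be checked and put in proportion with a few lines of arithmetic. The sketch below only restates numbers from the description; the variable and function names are chosen for illustration. The four article-type counts add up exactly to the reported total of indexed articles.

```python
# Figures as quoted in the description (state at the end of 2014).
KB_TOTAL_ARTICLES = 25_811_413
KB_ARTICLES_BY_TYPE = {
    "artikel": 16_294_029,
    "advertentie": 8_483_542,
    "illustratie met onderschrift": 630_929,
    "familiebericht": 402_913,
}
IMMIX_BROADCASTS = 932_035
IMMIX_WITH_SUBTITLES = 18_124


def percentage(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Consistency check: the per-type counts should sum to the reported total.
type_sum = sum(KB_ARTICLES_BY_TYPE.values())

# Share of each article type in the newspaper index.
kb_shares = {
    name: percentage(count, KB_TOTAL_ARTICLES)
    for name, count in KB_ARTICLES_BY_TYPE.items()
}

# Share of indexed broadcasts that come with subtitles.
subtitle_share = percentage(IMMIX_WITH_SUBTITLES, IMMIX_BROADCASTS)

print(f"sum over types: {type_sum:,} (reported total: {KB_TOTAL_ARTICLES:,})")
for name, share in kb_shares.items():
    print(f"{name}: {share:.1f}%")
print(f"broadcasts with subtitles: {subtitle_share:.2f}%")
```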

Bron, M., Gorp, J. van, Nack, F., Rijke, M. de, Vishneuski, Andrei and Leeuw, J.S. de (2012). A Subjunctive Exploratory Search Interface to Support Media Studies Researchers. SIGIR '12: 35th international ACM SIGIR conference on Research and development in information retrieval Portland, Oregon: ACM.

Huurnink, B., Bronner, A., Bron, M., Gorp, J. van, Goede, B. de and Wees, J. van (2013). AVResearcher: Exploring Audiovisual Metadata. DIR 2013: Dutch-Belgian Information Retrieval Conference Delft: DIR.
