CLARIN Tool Portal

698 record(s) found

Search results

Cornetto: Combinatorial and Relational Network as Toolkit for Dutch Language Technology

1 resources

Cornetto is a lexical resource for the Dutch language which combines two resources with different semantic organisations: the Dutch Wordnet with its synset organisation and the Dutch Reference Lexicon which includes definitions, usage constraints, selectional restrictions, syntactic behaviours, illustrative contexts, etc. The Cornetto database contains over 92K lemmas and almost 120K word meanings. The Cornetto lexical resource for Dutch covers the most generic and central part of the language. Cornetto combines the structures of the Princeton Wordnet, some of the features from the FrameNet for English and the information on morphological, syntactic, semantic and combinatorial features of lexemes normally found in dictionaries. The Cornetto resource is compiled by combining and aligning two existing semantic resources for Dutch: the Dutch wordnet (DWN) and the Referentie Bestand Nederlands (RBN). Recently, the resource is revised and extended with sentiment values in the From Text to Political Positions project , and with semantic annotations in SONAR, CGN and texts from the Web in the DutchSemCor project. The Cornetto Lexical Resource consists of two large repositories of lexicon data: the lexical entry repository and the synset repository. A Lexical Entry (LE) is a word-meaning pair (i.e. a single meaning of a certain word form), for which morphological, syntactical, semantical and combinatorial information is given. As such, LEs are word senses in the lexical semantic tradition, containing the linguistic knowledge that is needed to properly use the word in a specific meaning in a language. Since the LEs follow a word-to-meaning view, the semantical and combinatorial information for each meaning clarify the differences across the meanings. LEs focus on the polysemy of words and typically follow an approach to represent condensed and generalised meanings from which more specific ones can be derived. Each LE is aligned with a synset (set of synonyms) in the synset repository. As such, a synset can be seen as a set of LEs with the same meaning and every synset stands for a concept. The synsets in Cornetto are interconnected by different semantic relations such as hyponymy, antonymy and meronymy. The Cornet-to Resource is aligned with the English Wordnet, from which domain information was imported. The domains represent clusters of concepts that are related by a shared area of interest, such as sport, education or politics. The definitions of LEs from the same synset should be semantically equivalent and the LEs of a single word form should belong to different synsets. The LEs of a single word form typically differ in terms of connotation, pragmatics, syntax and semantics but synonymous words in the same synset can be differen-tiated along connotation, pragmatics and syntax but not semantics. This structure of the resource makes it possible to combine the very detailed information on form and usage of a specific LE or a group of LEs with the semantic relations which are specified in the corresponding synset(s). For an Open Source version lexico-semantic database for Dutch see the Open Source Dutch Wordnet (ODWN): http://wordpress.let.vupr.nl/odwn/

Vossen, P., I. maks, R. Segers, H. van der Vliet, M.F. Moens, K. Hofmann, E. Tjong Kim Sang, M. de Rijke (2013), Corntto: a lexical semantic database for Dutch, Chapter in: P. Spyns and J. Odijk (eds): Essential Speech and Language Technology for Dutch, Results by the STEVIN-programme, Publ. Springer series Theory and Applications of Natural Language Processing, ISBN 978-3-642-30909-0.

Vossen, P., I. Maks, R. Seegers and H. van der Vliet (2008). Integrating Lexical Units, Synsets, and Ontology in the Cornetto Database. In Proceedings of LREC-2008, Marrakech, Morocco.
General Dutch Dictionary

1 resources

Corpus based dictionary describing contemporary Dutch in the Netherlands and in Flanders of the period 1970-2019.

Modern Dutch Lemma

Describes the origin of a word

describes the meaning of a words

describes the structure of a word

Use "General Dutch Dictionary"
ISOcat

1 resources

This service is no longer operational! The ISO TC37 Data Category Registry (DCR) was created in 2008 as one of the first ISO standards delivered in the form of a database (ISOcat). The Max Planck Institute for Psycholinguistics (MPI) has provided development, hosting, and support services and acted as the Registration Authority (RA) until December 2014. For users from the European CLARIN research infrastructure, the Meertens Institute develops and hosts a new registry for CLARIN relevant concepts based on the corresponding ISOcat data categories, such as those used for the Component MetaData Infrastructure (CMDI). This can be found here: http://portal.clarin.nl/node/4216. ISO 12620 provides a framework for defining data categories compliant with the ISO/IEC 11179 family of standards. According to this model, each data category is assigned a unique administrative identifier, together with information on the status or decision-making process associated with the data category. In addition, data category specifications in the DCR contain linguistic descriptions, such as data category definitions, statements of associated value domains, and examples. Data category specifications can be associated with a variety of data element names and with language-specific versions of definitions, names, value domains and other attributes. For now the entries of the Data Category Registry are still available in a static manner, i.e., can't be changed anymore. All Data Category Peristent IDentifiers, e.g., http://www.isocat.org/datcat/DC-4146 (link is external), remain resolvable. The public part of the registry can be browsed via the Guest workspace: http://www.isocat.org/rest/user/guest/workspace . new location for this data category registry is http://www.datcatinfo.net/ .
INPOLDER: Integrated Parser and Lemmatizer Dutch in Retrospect

1 resources

INPOLDER (Integrated Parser and Lemmatizer of Dutch in Retrospect) provides a tool that assigns morphological tagging, lemmatization, and syntactic parsing for historical Dutch texts. It is built on the Adelheid tool (tagging and lemmatization) and Collins-Bikel statistical Parser. As an essential part of the Dutch cultural heritage, it is of vital importance that the Dutch historical record be made accessible for research into a wide range of historical and linguistic research questions. In the transition from the Middle Ages to the modern era, the Netherlands developed from speaking a diverse group of dialects (Hollandic, Brabantic, Flemish, North-eastern, Limburgian) to a country with a standard language, and there is good reason to believe that this process was an extremely dynamic one. Systematic research into these processes affecting syntax, phonology, morphology and spelling cannot be done without access to lemmatized, tagged and parsed corpora of historical Dutch. In recent years, a tagger-lemmatizer has been developed by Hans van Halteren (Adelheid, also available in the CLARIN infrastructure). INPOLDER complements these enrichment tool with a parser for historical Dutch. The INPOLDER parser is trained using a subset of the corpus of fourteenth-century texts (Corpus van Reenen/Mulder CRM, van Reenen and Mulder, 1993; Rem, 2003) and a subset of the Drenthe corpus (DC). CRM consists of 2700 charters from 345 places of origin. The corpus was designed as representative for the local language use of Middle Dutch and to be suitable for all types of linguistic research.
Fast and easy development of pronunciation lexicons for names

1 resources

The AUTONOMATA transcription tool set consists of a transcription tool and learning tools, with which one can enrich word lists with precise information on the pronunciation. Thee uses a general grapheme-to-phoneme converter (the g2p-converter).

This STEVIN project is about the investigation of new pronunciation modeling technologies that can improve the automatic recognition of spoken names in the context of a POI (Point-of-Interest) information providing business service. Collaboration with RU (Nijmegen), UiL (Utrecht), Nuance and TeleAtlas.

De AUTONOMATA-transcriptietoolset bestaat uit een transcriptietool en learning tools, waarmee men woordenlijsten kan verrijken met nauwkeurige uitspraakinformatie. De tool maakt gebruik van een algemene grafeem-naar-foneemomzetter (de g2p-omzetter).
SHEBANQ: System for HEBrew Text: ANnotations for Queries and Markup

2 resources

The WIVU (Werkgroep Informatica Vrije Universiteit) Hebrew Text Database contains the Biblia Hebraica Stuttgartensia (BHS) version of the text of the Hebrew Bible. Portions of other Semitic languages are included as well: the Aramaic sections of the Old Testament, two Syriac versions, and annotated portions of the Syriac and Aramaic translations. All these texts have been enriched with features that primarily result from linguistic analysis. The database can be queried by means of a language that is optimized to deal with data that is modeled as objects + features. SHEBANQ builds a bridge between the linguistically annotated Hebrew Text corpus and biblical scholars by (1) making this text, including its annotations, available to scholars; (2) demonstrating how queries can function to address research questions; the query saver and the metadata added to them will be a growing repository of valuable best practices of what queries are used in addressing research questions and how they contribute to answering these questions; (3) giving textual scholarship a more empirical basis, by creating the opportunity that claims made in scholarly articles (e.g.: “this syntactic pattern is not attested elsewhere in the Hebrew Bible”) can be accompanied by the unique identifiers that refer to the saved queries that have led to the claim. The WIVU database is a resource under long-term development. New features are being added, new corrections are being made over time.

Roorda, D. 2017. The Hebrew Bible as Data: Laboratory - Sharing - Experiences. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 217–229. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.18. License: CC-BY 4.0

Roorda, D. (2015). The Hebrew Bible as Data: Laboratory - Sharing - Experiences http://arxiv.org/abs/1501.01866

Roorda, D. (2014). LAF-Fabric: a data analysis tool for Linguistic Annotation Framework with an application to the Hebrew Bible, Computational Linguistics in the Netherlands Journal, Volume 4, December 2014, pp. 105-109 http://www.clinjournal.org/sites/clinjournal.org/files/08-Roorda-etal-CLIN2014.pdf and http://arxiv.org/abs/1410.0286
Namescape Named Entity Recognition

1 resources

Searching and visualizing Named Entities in modern Dutch novels. The named entity (NE) tagging and resolution in NameScape enables quantitative and repeatable research where previously only guesswork and anecdotal evidence was feasible. The visualisation module enables researchers with a less technical background to draw conclusions about functions of names in literary work and help them to explore the material in search of more interesting questions (and answers). Users from other communities (sociolinguistics, sentiment analysis, …) also benefit from the NE tagged data, especially since the NE recognizer is available as a web service, enabling researchers to annotate their own research data. Datasets in NameScape (total of 1.129 books): Corpus Sanders: A corpus of 582 Dutch novels written and published between 1970 and 2009 will. Corpus Huygens: Consists of 22 novels manually tagged with detailed named entity information. IPR for this corpus do not allow distribution. Corpus eBooks: Consists of 7000+ Dutch eBooks tagged automatically with basic NER features and person name Part information. IPR for this corpus do not allow distribution. Corpus SoNaR Books: 105 Dutch books; NE tagged. Corpus Gutenberg Dutch: Consists of 530 NE tagged TEI files converted from the Epub versions of the corresponding Gutenberg documents. Recent research has conclusively proven names in literary works can only be put fully into perspective when studied in a wider context (landscape) of names either in the same text or in related material (the onymic landscape or “namescape”). Research on large corpora is needed to gain a better understanding of e.g. what is characteristic for a certain period, genre, author or cultural region. The data necessary for research on this scale simply does not exist yet. NameScape aims to fill the need by providing a substantial amount of literary works annotated with a rich tag set, thereby enabling researchers to perform their research in more depth than previously possible. Several exploratory visualization tools help the scholar to answer old questions and uncover many more new ones, which can be addressed using the demonstrator.

de Does, J, Depuydt, K, van Dalen-Oskam, K and Marx, M. 2017. Namescape: Named Entity Recognition from a Literary Perspective. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 361–370. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.30. License: CC-BY 4.0

Karina van Dalen-Oskam (2013), Nordic Noir: a background check on Inspector Van Veeteren, 31 May 2012, http://blog.namescape.nl/?p=47
Manual Oral History Annotation Tool

1 resources

The Oral History Annotation tool, developed by the Centre for Language and Speech Technology (CLST) at the Radboud University Nijmegen, enables one to annotate and search in oral history resources. The tool has been used to enrich a corpus of 250 interviews from the Living Oral History Workbench with commentary . All 250 interviews are searchable through a fragment finder and can be annotated. These annotations can be shared with other researchers, making the interviews available and easier accessible for a much wider range of researchers in the humanities in general and in linguistics in particular. The Annotation Tool is only available for scientific research and only after approval by the Veterans Institute. Interview data can be used in a number of ways, such as comparative research, restudy or follow-up study, re-analysis / secondary analysis, research design and methodological advancement, replication and validation of published work, and for teaching and learning. Recent experiences with the re-use of interview data show that there is an enormous potential for this type of data. Especially in the field of interview data related to the Second World War and other military conflicts multidisciplinary research is carried out. This corpus consists of (about) 30 interviews that are fully transcribed from the Veteran Tapes VP project, and 250 interviews resulting from the Living Oral History Workbench project: - 120 World War II interviews presenting a range of experiences and frames of reference of Dutch soldiers between 1935-1945; - 100 interviews with veterans of the Dutch East Indies. This collection exhibits a large diversity in experiences at the local level in guerilla warfare; - 30 interviews with veterans of New Guinea. This is a relatively unknown conflict with very interesting elements (soldiers left in uncertainty and isolation, and the pressure of the international community to decolonize the area). Each interview lasts between 1 and 1.5 hours.
GrNe: Greek-Dutch dictionary

1 resources

Online dictionary (ancient) Greek - Dutch for the letter Pi. Search functions include searches for Greek lemmata, search of Greek declined or conjugated word-forms that lead to the correct lemma ('lemmatizer'), searches for Dutch words leading to different Greek lemmata, and etymological searches. The dictionary is linked to Logeion, the international website of Greek dictionaries at the University of Chicago. The developers estimate that a complete version of the dictionary will be finished by the end of 2015 and that it will be published by the end of 2016. A new dictionary ancient Greek – Dutch is currently under construction at Leiden University. The dictionary is being financed through the 2010 Spinoza award of project director Ineke Sluiter. CLARIN funding enabled the digital production of the letter Pi. Currently, the letters beta, gamma, zeta, pi and sigma are available online. The developers estimate that a complete first version of the dictionary will be finished by the end of 2015 and that it will be published by the end of 2016. The corpus that is being covered by this dictionary covers Greek literature from its beginnings (Homer) and consists of ca. 3.680.000 words (tokens); it includes all classical authors from the 5th and 4th centuries BCE, and a selection of later Greek (selection based on the likelihood that the text will be used by our target groups), but all of the New Testament, Lucian and Plutarch. The dictionary will eventually contain ca. 52.500 headwords. It is based on a thorough comparison of state of the art dictionaries, supplemented with the help of the material from the Thesaurus Linguae Graecae. Greek morphology is complicated. In order to use a dictionary effectively, a rather high level of initial language competence is necessary for the user to be able to relate the word-form s/he finds in a text to the correct basic lemma form, where the definition of the word can be found. This digital dictionary however has an added ‘lemmatizer’ function, which enables the user to type in the word as found in the text and to be redirected to the correct lemma. The digital resource enables both Greek-Dutch searches and searches for the possible Greek equivalents of Dutch terms. This also makes it possible to explore the relation of semantic fields in Dutch and Greek. E.g., it is possible to locate all Greek words that have ‘courage’ as part of their definition. Furthermore, the digital resource makes it possible to locate different Greek words with the same etymological roots. And finally, the dictionary is linked to the website of the University of Chicago, where a comparison of all Greek-x dictionaries is supported. Here, one can enter a Greek word and be provided with the equivalents and definitions in all the dictionaries that are linked on this website.
COBWWWEB: Connections Between Women and Writings Within European Borders

1 resources

The WomenWriters database includes biographical data on more than 4.000 authors and over 22.000 references to reception data found in sources like the periodical press, early literary history and private correspondences. A significant part of the dataset was collected in the NWO digitizing project The International Reception of Women’s Writing (2004-2007), focusing on authors received in the Netherlands. A second NWO internationalising project called New approaches to European Women’s Writing (2007-2010) and the subsequent COST Action Women Writers in History (2009‐2013) brought together a large international community of scholars and used the Dutch data collection as an example for other colleagues. COBWWWEB enables a connection between the various national projects on this subject into a growing international data network. A virtual research environment on top of this network makes all material from participating data providers accessible for European and interdisciplinary research.

Result filters

Metadata provider

Language

Resource type

Type of tool

Tool task

Field of study

Availability

Organisation

Project

Keywords

Search results

Cornetto: Combinatorial and Relational Network as Toolkit for Dutch Language Technology

General Dutch Dictionary

ISOcat

INPOLDER: Integrated Parser and Lemmatizer Dutch in Retrospect

Fast and easy development of pronunciation lexicons for names

SHEBANQ: System for HEBrew Text: ANnotations for Queries and Markup

Namescape Named Entity Recognition

Manual Oral History Annotation Tool

GrNe: Greek-Dutch dictionary

COBWWWEB: Connections Between Women and Writings Within European Borders

Result filters

Metadata provider

Language

Resource type

Type of tool

Tool task

Field of study

Availability

Organisation

Project

Keywords

Search results

Session recording