CLARIN Tool Portal

OpenConvert

1 resources

The OpenConvert tools convert to TEI or FOLiA from a number of input formats (ALTO, text, Word, HTML, ePub). The tools are available as a Java command line tool, a web service and a web application. The OpenConvert tools were created by IVDNT in the OpenConvert project. Furthermore, as a proof of concept, the website currently provides two annotation tools: a simple tokenizer for TEI files and a part-of-speech tagger for modern Dutch.

The tool service can be called as a REST webservice which returns responses in XML, allowing it to be part of a webservice tool chain.

Input TEI, plain text, HTML

ALTO XML input

ePub input

directory containing files of a valid input type

zip file (with extension .zip) containing files of a valid input type

Free for academic use. Not applicable to commercial parties.

CLARIN-based login required. The CLARIN federation accepts logins from many European institutions. Please see http://www.clarin.eu/content/service-provider-federation for more details.

input file name (File upload)

Format of input file

Format of output file

to specify the tagger or tokeniser

input file mimetype is application/tei+xml

input file mimetype is text/html

input file mimetype is text/alto+xml

input file mimetype is application/msword

input file mimetype is application/epub+zip

input file mimetype is text/plain

output file mimetype is application/tei+xml

output file mimetype is text/folia+xml

Basic tagger-lemmatizer for modern Dutch

a TEI tokenizer
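
The parameters and MIME types listed above can be checked before a file is sent to the service. The following is a minimal sketch, not the service's actual API: the function name and the dictionary keys (`input`, `format`, `to`, `tagger`) are hypothetical, and only the MIME type strings are taken from the listing above.

```python
# MIME types copied from the listing above.
INPUT_MIMETYPES = {
    "application/tei+xml",
    "text/html",
    "text/alto+xml",
    "application/msword",
    "application/epub+zip",
    "text/plain",
}
OUTPUT_MIMETYPES = {"application/tei+xml", "text/folia+xml"}


def build_conversion_request(filename, input_mimetype, output_mimetype, tagger=None):
    """Validate a conversion request and return it as a plain dict.

    The dict keys are hypothetical parameter names chosen for this sketch;
    they are not the real parameter names of the OpenConvert web service.
    """
    if input_mimetype not in INPUT_MIMETYPES:
        raise ValueError(f"unsupported input MIME type: {input_mimetype}")
    if output_mimetype not in OUTPUT_MIMETYPES:
        raise ValueError(f"unsupported output MIME type: {output_mimetype}")
    request = {"input": filename, "format": input_mimetype, "to": output_mimetype}
    if tagger is not None:
        # Optional: name of the tagger or tokenizer to apply.
        request["tagger"] = tagger
    return request


# Example: convert an HTML file to TEI without tagging.
request = build_conversion_request("letter.html", "text/html", "application/tei+xml")
```

Rejecting unsupported formats on the client side avoids a round trip to the service for requests that cannot succeed.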

ePistolarium: A Web-based Humanities’ Collaboratory on Correspondences

1 resources

Circulation of Knowledge and Learned Practices in the 17th-century Dutch Republic (CKCC) investigates the circulation of knowledge in the 17th-century Dutch Republic. A multi-disciplinary project team consisting of historians, literature researchers, linguists and computer scientists works together in this project and created a web-based Humanities’ Collaboratory on Correspondences. This project is carried out thanks to an NWO Medium investment subsidy and with CLARIN subsidies to make the resources available within the CLARIN domain. A consortium of Dutch universities and cultural heritage institutions is building a web-based collaboratory (an online space for asynchronous collaboration) around a corpus of 20,000 letters of scholars who lived in the 17th-century Dutch Republic to answer the research question: how did knowledge circulate in the 17th century? To this end, it is necessary to analyze this large amount of correspondence systematically. Based on this (extendable) corpus, we will implement a content processing workflow that consists of iterative cycles of conceptual analysis, enrichment with several layers of annotation, and visualization. With advice from CLARIN-EU, a demonstrator that implements keyword extraction techniques was developed in the first stage of the project. The second stage consists of evaluating existing, more complex tools and techniques that can tackle one or more aspects of the targeted grammatical, content-related, and network complexity analysis, annotation, and visualization. This phase shall identify a set of tools that can be readily utilized in CKCC, as well as tools that need to be adapted or extended to the needs of CKCC; in short, by the end of this phase resources, requirements and risks shall become clear (deadline: December 2010). In the third stage the collaboratory is further developed according to the description in the CKCC project goals, centering around the technique of concept extraction. 
These three stages constitute the Work Package Analysis Tools, the core of the CKCC project, which was supported by CLARIN-NL. Other Work Packages provide data and software tools needed to create a complete system: the digital corpus of letters (WP6), the editing collaboratory that will contain the letters (WP1), and the archiving environment for data and software (WP2).

Ravenek, W, van den Heuvel, C and Gerritsen, G. 2017. The ePistolarium: Origins and Techniques. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 317–323. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.26. License: CC-BY 4.0

WFT-GTB: Integrating the Wurdboek fan 'e Fryske Taal into the Geïntegreerde TaalBank

1 resources

The Dictionary of the Frisian Language (Wurdboek fan de Fryske Taal) is available online via the GTB dictionary web application. The GTB also holds other major Dutch historical dictionaries, such as the Dictionary of Old Dutch (ONW), the Dictionary of Early Middle Dutch (VMNW), the Dictionary of Middle Dutch (MNW), and the Dictionary of the Dutch Language (WNT). The digital environment enables extensive forms of free and structured search queries, including comparative studies with Dutch materials. The Wurdboek fan de Fryske Taal (Dictionary of the Frisian Language) project covers the vocabulary of Modern West Frisian from the period 1800-1975. The dictionary’s metalanguage is Dutch. A volume of 400 pages has come out every year, the first one in 1984. The editorial phase was finalized in 2009, the final editing and publication phase in 2010.

Modern Dutch Lemma and Frisian lemma

Describes the origin of a word

Describes the meaning of a word

Describes the structure of a word

Describes the possible spellings of a word

Depuydt, K, de Does, J, Duijff, P and Sijens, H. 2017. Making the Dictionary of the Frisian Language Available in the Dutch Historical Dictionary Portal. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 151–165. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.13. License: CC-BY 4.0

Cornetto: Combinatorial and Relational Network as Toolkit for Dutch Language Technology

1 resources

Cornetto is a lexical resource for the Dutch language which combines two resources with different semantic organisations: the Dutch Wordnet with its synset organisation and the Dutch Reference Lexicon, which includes definitions, usage constraints, selectional restrictions, syntactic behaviours, illustrative contexts, etc. The Cornetto database contains over 92K lemmas and almost 120K word meanings. The Cornetto lexical resource for Dutch covers the most generic and central part of the language. Cornetto combines the structures of the Princeton Wordnet, some of the features from the FrameNet for English and the information on morphological, syntactic, semantic and combinatorial features of lexemes normally found in dictionaries. The Cornetto resource is compiled by combining and aligning two existing semantic resources for Dutch: the Dutch wordnet (DWN) and the Referentie Bestand Nederlands (RBN). The resource has recently been revised and extended with sentiment values in the From Text to Political Positions project, and with semantic annotations in SoNaR, CGN and texts from the Web in the DutchSemCor project. The Cornetto Lexical Resource consists of two large repositories of lexicon data: the lexical entry repository and the synset repository. A Lexical Entry (LE) is a word-meaning pair (i.e. a single meaning of a certain word form), for which morphological, syntactic, semantic and combinatorial information is given. As such, LEs are word senses in the lexical semantic tradition, containing the linguistic knowledge that is needed to properly use the word in a specific meaning in a language. Since the LEs follow a word-to-meaning view, the semantic and combinatorial information for each meaning clarifies the differences across the meanings. LEs focus on the polysemy of words and typically follow an approach to represent condensed and generalised meanings from which more specific ones can be derived. 
Each LE is aligned with a synset (set of synonyms) in the synset repository. As such, a synset can be seen as a set of LEs with the same meaning, and every synset stands for a concept. The synsets in Cornetto are interconnected by different semantic relations such as hyponymy, antonymy and meronymy. The Cornetto resource is aligned with the English Wordnet, from which domain information was imported. The domains represent clusters of concepts that are related by a shared area of interest, such as sport, education or politics. The definitions of LEs from the same synset should be semantically equivalent, and the LEs of a single word form should belong to different synsets. The LEs of a single word form typically differ in terms of connotation, pragmatics, syntax and semantics, but synonymous words in the same synset can be differentiated along connotation, pragmatics and syntax but not semantics. This structure of the resource makes it possible to combine the very detailed information on form and usage of a specific LE or a group of LEs with the semantic relations which are specified in the corresponding synset(s). For an open-source lexico-semantic database for Dutch, see the Open Source Dutch Wordnet (ODWN): http://wordpress.let.vupr.nl/odwn/
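
The relation between lexical entries and synsets described above can be illustrated with a small data model. This is a minimal sketch under stated assumptions, not Cornetto's actual schema: the class, field and function names are hypothetical, and the three sample entries are invented for illustration rather than taken from the database.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalEntry:
    """One word-meaning pair (LE). Field names are hypothetical."""
    word_form: str
    definition: str
    synset_id: str  # the synset this LE is aligned with


def build_synsets(entries):
    """Group LEs by synset: each synset is the set of LEs sharing one meaning."""
    synsets = defaultdict(list)
    for entry in entries:
        synsets[entry.synset_id].append(entry)
    return dict(synsets)


def find_violations(entries):
    """Return word forms that have more than one LE in the same synset.

    The description states that the LEs of a single word form should
    belong to different synsets, so any hit here breaks that rule.
    """
    seen = set()
    violations = set()
    for entry in entries:
        key = (entry.word_form, entry.synset_id)
        if key in seen:
            violations.add(entry.word_form)
        seen.add(key)
    return violations


# Invented sample data: Dutch "bank" has a financial sense and a furniture sense.
entries = [
    LexicalEntry("bank", "financial institution", "s1"),
    LexicalEntry("bank", "piece of furniture to sit on", "s2"),
    LexicalEntry("sofa", "piece of furniture to sit on", "s2"),
]
synsets = build_synsets(entries)
violations = find_violations(entries)
```

Here the two senses of "bank" land in different synsets, while "bank" and "sofa" share synset `s2` as synonyms, so the check reports no violations.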

Vossen, P., I. Maks, R. Segers, H. van der Vliet, M.F. Moens, K. Hofmann, E. Tjong Kim Sang, M. de Rijke (2013), Cornetto: a lexical semantic database for Dutch, Chapter in: P. Spyns and J. Odijk (eds): Essential Speech and Language Technology for Dutch, Results by the STEVIN-programme, Publ. Springer series Theory and Applications of Natural Language Processing, ISBN 978-3-642-30909-0.

Vossen, P., I. Maks, R. Segers and H. van der Vliet (2008). Integrating Lexical Units, Synsets, and Ontology in the Cornetto Database. In Proceedings of LREC-2008, Marrakech, Morocco.

ISOcat

1 resources

This service is no longer operational! The ISO TC37 Data Category Registry (DCR) was created in 2008 as one of the first ISO standards delivered in the form of a database (ISOcat). The Max Planck Institute for Psycholinguistics (MPI) provided development, hosting, and support services and acted as the Registration Authority (RA) until December 2014. For users from the European CLARIN research infrastructure, the Meertens Institute develops and hosts a new registry for CLARIN-relevant concepts based on the corresponding ISOcat data categories, such as those used for the Component MetaData Infrastructure (CMDI). This can be found here: http://portal.clarin.nl/node/4216. ISO 12620 provides a framework for defining data categories compliant with the ISO/IEC 11179 family of standards. According to this model, each data category is assigned a unique administrative identifier, together with information on the status or decision-making process associated with the data category. In addition, data category specifications in the DCR contain linguistic descriptions, such as data category definitions, statements of associated value domains, and examples. Data category specifications can be associated with a variety of data element names and with language-specific versions of definitions, names, value domains and other attributes. For now, the entries of the Data Category Registry remain available in static form, i.e., they can no longer be changed. All Data Category Persistent IDentifiers, e.g., http://www.isocat.org/datcat/DC-4146, remain resolvable. The public part of the registry can be browsed via the Guest workspace: http://www.isocat.org/rest/user/guest/workspace. The new location for this data category registry is http://www.datcatinfo.net/.

Fast and easy development of pronunciation lexicons for names

1 resources

The AUTONOMATA transcription tool set consists of a transcription tool and learning tools, with which one can enrich word lists with precise information on the pronunciation. The tool uses a general grapheme-to-phoneme converter (the g2p converter).

This STEVIN project investigates new pronunciation modeling technologies that can improve the automatic recognition of spoken names in the context of a business service providing POI (Point-of-Interest) information. It is a collaboration with RU (Nijmegen), UiL (Utrecht), Nuance and TeleAtlas.

SHEBANQ: System for HEBrew Text: ANnotations for Queries and Markup

2 resources

The WIVU (Werkgroep Informatica Vrije Universiteit) Hebrew Text Database contains the Biblia Hebraica Stuttgartensia (BHS) version of the text of the Hebrew Bible. Portions of other Semitic languages are included as well: the Aramaic sections of the Old Testament, two Syriac versions, and annotated portions of the Syriac and Aramaic translations. All these texts have been enriched with features that primarily result from linguistic analysis. The database can be queried by means of a language that is optimized to deal with data that is modeled as objects + features. SHEBANQ builds a bridge between the linguistically annotated Hebrew Text corpus and biblical scholars by (1) making this text, including its annotations, available to scholars; (2) demonstrating how queries can function to address research questions; the query saver and the metadata added to them will be a growing repository of valuable best practices of what queries are used in addressing research questions and how they contribute to answering these questions; (3) giving textual scholarship a more empirical basis, by creating the opportunity that claims made in scholarly articles (e.g.: “this syntactic pattern is not attested elsewhere in the Hebrew Bible”) can be accompanied by the unique identifiers that refer to the saved queries that have led to the claim. The WIVU database is a resource under long-term development. New features are being added, new corrections are being made over time.

Roorda, D. 2017. The Hebrew Bible as Data: Laboratory - Sharing - Experiences. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 217–229. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.18. License: CC-BY 4.0

Roorda, D. (2015). The Hebrew Bible as Data: Laboratory - Sharing - Experiences http://arxiv.org/abs/1501.01866

Roorda, D. (2014). LAF-Fabric: a data analysis tool for Linguistic Annotation Framework with an application to the Hebrew Bible, Computational Linguistics in the Netherlands Journal, Volume 4, December 2014, pp. 105-109 http://www.clinjournal.org/sites/clinjournal.org/files/08-Roorda-etal-CLIN2014.pdf and http://arxiv.org/abs/1410.0286

Namescape Named Entity Recognition

1 resources

Searching and visualizing Named Entities in modern Dutch novels. The named entity (NE) tagging and resolution in NameScape enables quantitative and repeatable research where previously only guesswork and anecdotal evidence were feasible. The visualisation module enables researchers with a less technical background to draw conclusions about the functions of names in literary work and helps them to explore the material in search of more interesting questions (and answers). Users from other communities (sociolinguistics, sentiment analysis, …) also benefit from the NE-tagged data, especially since the NE recognizer is available as a web service, enabling researchers to annotate their own research data. Datasets in NameScape (total of 1,129 books): Corpus Sanders: A corpus of 582 Dutch novels written and published between 1970 and 2009 will. Corpus Huygens: Consists of 22 novels manually tagged with detailed named entity information. IPR for this corpus do not allow distribution. Corpus eBooks: Consists of 7000+ Dutch eBooks tagged automatically with basic NER features and person name part information. IPR for this corpus do not allow distribution. Corpus SoNaR Books: 105 Dutch books; NE tagged. Corpus Gutenberg Dutch: Consists of 530 NE-tagged TEI files converted from the ePub versions of the corresponding Gutenberg documents. Recent research has conclusively proven that names in literary works can only be put fully into perspective when studied in a wider context (landscape) of names either in the same text or in related material (the onymic landscape or “namescape”). Research on large corpora is needed to gain a better understanding of e.g. what is characteristic for a certain period, genre, author or cultural region. The data necessary for research on this scale simply does not exist yet. 
NameScape aims to fill the need by providing a substantial amount of literary works annotated with a rich tag set, thereby enabling researchers to perform their research in more depth than previously possible. Several exploratory visualization tools help the scholar to answer old questions and uncover many more new ones, which can be addressed using the demonstrator.

de Does, J, Depuydt, K, van Dalen-Oskam, K and Marx, M. 2017. Namescape: Named Entity Recognition from a Literary Perspective. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 361–370. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.30. License: CC-BY 4.0

Karina van Dalen-Oskam (2013), Nordic Noir: a background check on Inspector Van Veeteren, 31 May 2012, http://blog.namescape.nl/?p=47

Manual Oral History Annotation Tool

1 resources

The Oral History Annotation tool, developed by the Centre for Language and Speech Technology (CLST) at the Radboud University Nijmegen, enables one to annotate and search in oral history resources. The tool has been used to enrich a corpus of 250 interviews from the Living Oral History Workbench with commentary. All 250 interviews are searchable through a fragment finder and can be annotated. These annotations can be shared with other researchers, making the interviews available and more easily accessible for a much wider range of researchers in the humanities in general and in linguistics in particular. The Annotation Tool is only available for scientific research and only after approval by the Veterans Institute. Interview data can be used in a number of ways, such as comparative research, restudy or follow-up study, re-analysis / secondary analysis, research design and methodological advancement, replication and validation of published work, and for teaching and learning. Recent experiences with the re-use of interview data show that there is an enormous potential for this type of data. Multidisciplinary research is carried out especially in the field of interview data related to the Second World War and other military conflicts. This corpus consists of (about) 30 fully transcribed interviews from the Veteran Tapes VP project, and 250 interviews resulting from the Living Oral History Workbench project: 120 World War II interviews presenting a range of experiences and frames of reference of Dutch soldiers between 1935 and 1945; 100 interviews with veterans of the Dutch East Indies, a collection that exhibits a large diversity in experiences at the local level in guerilla warfare; and 30 interviews with veterans of New Guinea, a relatively unknown conflict with very interesting elements (soldiers left in uncertainty and isolation, and the pressure of the international community to decolonize the area). Each interview lasts between 1 and 1.5 hours.

COBWWWEB: Connections Between Women and Writings Within European Borders

1 resources

The WomenWriters database includes biographical data on more than 4,000 authors and over 22,000 references to reception data found in sources like the periodical press, early literary history and private correspondences. A significant part of the dataset was collected in the NWO digitizing project The International Reception of Women’s Writing (2004-2007), focusing on authors received in the Netherlands. A second NWO internationalising project called New approaches to European Women’s Writing (2007-2010) and the subsequent COST Action Women Writers in History (2009-2013) brought together a large international community of scholars and used the Dutch data collection as an example for other colleagues. COBWWWEB connects the various national projects on this subject into a growing international data network. A virtual research environment on top of this network makes all material from participating data providers accessible for European and interdisciplinary research.
