CLARIN Tool Portal

SHEBANQ: System for HEBrew Text: ANnotations for Queries and Markup

2 resources

The WIVU (Werkgroep Informatica Vrije Universiteit) Hebrew Text Database contains the Biblia Hebraica Stuttgartensia (BHS) version of the text of the Hebrew Bible. Portions of other Semitic languages are included as well: the Aramaic sections of the Old Testament, two Syriac versions, and annotated portions of the Syriac and Aramaic translations. All these texts have been enriched with features that primarily result from linguistic analysis. The database can be queried by means of a language that is optimized to deal with data that is modeled as objects + features. SHEBANQ builds a bridge between the linguistically annotated Hebrew Text corpus and biblical scholars by (1) making this text, including its annotations, available to scholars; (2) demonstrating how queries can function to address research questions; the query saver and the metadata added to them will be a growing repository of valuable best practices of what queries are used in addressing research questions and how they contribute to answering these questions; (3) giving textual scholarship a more empirical basis, by creating the opportunity that claims made in scholarly articles (e.g.: “this syntactic pattern is not attested elsewhere in the Hebrew Bible”) can be accompanied by the unique identifiers that refer to the saved queries that have led to the claim. The WIVU database is a resource under long-term development. New features are being added, new corrections are being made over time.

Roorda, D. 2017. The Hebrew Bible as Data: Laboratory - Sharing - Experiences. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 217–229. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.18. License: CC-BY 4.0

Roorda, D. (2015). The Hebrew Bible as Data: Laboratory - Sharing - Experiences http://arxiv.org/abs/1501.01866

Roorda, D. (2014). LAF-Fabric: a data analysis tool for Linguistic Annotation Framework with an application to the Hebrew Bible, Computational Linguistics in the Netherlands Journal, Volume 4, December 2014, pp. 105-109 http://www.clinjournal.org/sites/clinjournal.org/files/08-Roorda-etal-CLIN2014.pdf and http://arxiv.org/abs/1410.0286

Namescape Named Entity Recognition

1 resources

Searching and visualizing Named Entities in modern Dutch novels. The named entity (NE) tagging and resolution in NameScape enables quantitative and repeatable research where previously only guesswork and anecdotal evidence was feasible. The visualisation module enables researchers with a less technical background to draw conclusions about functions of names in literary work and help them to explore the material in search of more interesting questions (and answers). Users from other communities (sociolinguistics, sentiment analysis, …) also benefit from the NE tagged data, especially since the NE recognizer is available as a web service, enabling researchers to annotate their own research data. Datasets in NameScape (total of 1.129 books): Corpus Sanders: A corpus of 582 Dutch novels written and published between 1970 and 2009 will. Corpus Huygens: Consists of 22 novels manually tagged with detailed named entity information. IPR for this corpus do not allow distribution. Corpus eBooks: Consists of 7000+ Dutch eBooks tagged automatically with basic NER features and person name Part information. IPR for this corpus do not allow distribution. Corpus SoNaR Books: 105 Dutch books; NE tagged. Corpus Gutenberg Dutch: Consists of 530 NE tagged TEI files converted from the Epub versions of the corresponding Gutenberg documents. Recent research has conclusively proven names in literary works can only be put fully into perspective when studied in a wider context (landscape) of names either in the same text or in related material (the onymic landscape or “namescape”). Research on large corpora is needed to gain a better understanding of e.g. what is characteristic for a certain period, genre, author or cultural region. The data necessary for research on this scale simply does not exist yet. NameScape aims to fill the need by providing a substantial amount of literary works annotated with a rich tag set, thereby enabling researchers to perform their research in more depth than previously possible. Several exploratory visualization tools help the scholar to answer old questions and uncover many more new ones, which can be addressed using the demonstrator.

de Does, J, Depuydt, K, van Dalen-Oskam, K and Marx, M. 2017. Namescape: Named Entity Recognition from a Literary Perspective. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 361–370. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.30. License: CC-BY 4.0

Karina van Dalen-Oskam (2013), Nordic Noir: a background check on Inspector Van Veeteren, 31 May 2012, http://blog.namescape.nl/?p=47

Manual Oral History Annotation Tool

1 resources

The Oral History Annotation tool, developed by the Centre for Language and Speech Technology (CLST) at the Radboud University Nijmegen, enables one to annotate and search in oral history resources. The tool has been used to enrich a corpus of 250 interviews from the Living Oral History Workbench with commentary . All 250 interviews are searchable through a fragment finder and can be annotated. These annotations can be shared with other researchers, making the interviews available and easier accessible for a much wider range of researchers in the humanities in general and in linguistics in particular. The Annotation Tool is only available for scientific research and only after approval by the Veterans Institute. Interview data can be used in a number of ways, such as comparative research, restudy or follow-up study, re-analysis / secondary analysis, research design and methodological advancement, replication and validation of published work, and for teaching and learning. Recent experiences with the re-use of interview data show that there is an enormous potential for this type of data. Especially in the field of interview data related to the Second World War and other military conflicts multidisciplinary research is carried out. This corpus consists of (about) 30 interviews that are fully transcribed from the Veteran Tapes VP project, and 250 interviews resulting from the Living Oral History Workbench project: - 120 World War II interviews presenting a range of experiences and frames of reference of Dutch soldiers between 1935-1945; - 100 interviews with veterans of the Dutch East Indies. This collection exhibits a large diversity in experiences at the local level in guerilla warfare; - 30 interviews with veterans of New Guinea. This is a relatively unknown conflict with very interesting elements (soldiers left in uncertainty and isolation, and the pressure of the international community to decolonize the area). Each interview lasts between 1 and 1.5 hours.

GrNe: Greek-Dutch dictionary

1 resources

Online dictionary (ancient) Greek - Dutch for the letter Pi. Search functions include searches for Greek lemmata, search of Greek declined or conjugated word-forms that lead to the correct lemma ('lemmatizer'), searches for Dutch words leading to different Greek lemmata, and etymological searches. The dictionary is linked to Logeion, the international website of Greek dictionaries at the University of Chicago. The developers estimate that a complete version of the dictionary will be finished by the end of 2015 and that it will be published by the end of 2016. A new dictionary ancient Greek – Dutch is currently under construction at Leiden University. The dictionary is being financed through the 2010 Spinoza award of project director Ineke Sluiter. CLARIN funding enabled the digital production of the letter Pi. Currently, the letters beta, gamma, zeta, pi and sigma are available online. The developers estimate that a complete first version of the dictionary will be finished by the end of 2015 and that it will be published by the end of 2016. The corpus that is being covered by this dictionary covers Greek literature from its beginnings (Homer) and consists of ca. 3.680.000 words (tokens); it includes all classical authors from the 5th and 4th centuries BCE, and a selection of later Greek (selection based on the likelihood that the text will be used by our target groups), but all of the New Testament, Lucian and Plutarch. The dictionary will eventually contain ca. 52.500 headwords. It is based on a thorough comparison of state of the art dictionaries, supplemented with the help of the material from the Thesaurus Linguae Graecae. Greek morphology is complicated. In order to use a dictionary effectively, a rather high level of initial language competence is necessary for the user to be able to relate the word-form s/he finds in a text to the correct basic lemma form, where the definition of the word can be found. This digital dictionary however has an added ‘lemmatizer’ function, which enables the user to type in the word as found in the text and to be redirected to the correct lemma. The digital resource enables both Greek-Dutch searches and searches for the possible Greek equivalents of Dutch terms. This also makes it possible to explore the relation of semantic fields in Dutch and Greek. E.g., it is possible to locate all Greek words that have ‘courage’ as part of their definition. Furthermore, the digital resource makes it possible to locate different Greek words with the same etymological roots. And finally, the dictionary is linked to the website of the University of Chicago, where a comparison of all Greek-x dictionaries is supported. Here, one can enter a Greek word and be provided with the equivalents and definitions in all the dictionaries that are linked on this website.

COBWWWEB: Connections Between Women and Writings Within European Borders

1 resources

The WomenWriters database includes biographical data on more than 4.000 authors and over 22.000 references to reception data found in sources like the periodical press, early literary history and private correspondences. A significant part of the dataset was collected in the NWO digitizing project The International Reception of Women’s Writing (2004-2007), focusing on authors received in the Netherlands. A second NWO internationalising project called New approaches to European Women’s Writing (2007-2010) and the subsequent COST Action Women Writers in History (2009‐2013) brought together a large international community of scholars and used the Dutch data collection as an example for other colleagues. COBWWWEB enables a connection between the various national projects on this subject into a growing international data network. A virtual research environment on top of this network makes all material from participating data providers accessible for European and interdisciplinary research.

PaQu - Parse and Query

1 resources

PaQu uses the Alpino parser to make treebanks of your own text corpus, and to search in these treebanks using an interface based on the LASSY Word Relations Search interface (http://dev.clarin.nl/node/1966). Several treebanks are already available in the application, such as: Lassy Klein (1M words, manually checked syntactic analysis) and Lassy Groot (700M words, syntactic analysis automatically assigned by Alpino). PaQu offers two ways to search through the syntactically annotated texts. The first option is to use the search bar to look for word pairs, optionally complemented by their syntactic relationship. The second search option is to use the query language XPath.

Odijk, J, van Noord, G, Kleiweg, P and Tjong Kim Sang, E. 2017. The Parse and Query (PaQu) Application. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 281–297. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.23. License: CC-BY 4.0

A Distributed Lemmatizer for Historical Dutch

1 resources

With this web-application an end user can have historical Dutch texts tokenized, lemmatized and part-of-speech tagged, using the most appropriate resources (such as lexica) for the text in question. For each specific text, the user can select the best resources from those available in CLARIN, wherever they might reside, and where necessary supplemented by own lexica. The software can also be used as a web service.

Use "A Distributed Lemmatizer for Historical Dutch"

Arthurian Fiction

1 resources

This research tool provides information on medieval Arthurian narratives and the manuscripts in which they are transmitted throughout Europe. The tool discloses a database consists of linked records on over two hundred texts, more than thousand manuscripts and two hundred persons. The database is work in progress: a considerable number of records have yet to be completed, while fresh discoveries of narratives and manuscripts invite new entries. The compilers of the database hope that this tool will contribute to further research into Arthurian fiction as a pan-European phenomenon. The Arthurian Fiction web application enables searching for manuscripts, narratives and persons from the Arthurian Fiction narratives and manuscripts metadata database Arthurian Fiction Data. Each of these object types can be searched for using facets specific to the object type. These include: - for manuscripts: institute, date, origin, physical form, extant leave, leaf sizes, illustration type, scripts, scribe, patron and several more; - for narratives: date, origin, languages, cycle, manuscript, author, patron, verse type, meter, length, intertextuality properties and many more; - for persons: name, gender, subtype, background, manuscript, and narratives. The user can, if desired, select a subset of the facets to work with. In addition, keyword search is possible for all fields, query results can be sorted by a variety of keys and queries can be saved. There is also a web service with an API for the Arthurian Fiction narratives and manuscripts database. This web service makes use of SOLR queries via HTTP POST requests.

This movie is in Dutch with English subtitles.

Besamusca, A.A.M. and Quinlan, J. (2012). The Fringes of Arthurian Fiction. Arthurian literature, 29, 191-241.

Boot, P. (2012), Manuscripten koning Arthur op tafel, E-Data & Research 7(1), 2012.

Dalen-Oskam, K. van and Besamusca, B. (2011), Arthurian Fiction in Medieval Europe: Narratives and Manuscripts, presentation held at the CLARIN-NL Kick-off meeting Call 2, Utrecht, February 9, 2011.

Dalen-Oskam, K. van (2011), ArthurianFiction, presentation held at the Call 3 information session, Utrecht, August 25, 2011.

Gabmap is a free web-based application for dialectometry. It measures the differences in sets of phonetic (or phonemic) transcriptions via edit distance. Gabmap has a graphical user interface that makes string comparison facility available as a web application.

1 resources

Gabmap is a free web-based application for dialectometry. It measures the differences in sets of phonetic (or phonemic) transcriptions via edit distance. Gabmap has a graphical user interface that makes string comparison facility available as a web application. This enables wider experimentation with the techniques. Gabmap (a.k.a. ADEPT) measures pronunciation distances based on transcriptions and aligns pronunciation transcription data. Because the measurements are numeric, they can be aggregated in order to obtain an estimation of overall pronunciation differences among varieties. The software uses a range of edit distance (or Levenshtein) algorithms. It is useful for dialectologists, and has been used extensively in dialectology. It has occasionally been used for other purposes, e.g. trying to identify loan words automatically (Paris, Musée de l’Homme, central Asian project involving Turkic and also Indo-Iranian languages). The software has also been used as the basis of a program to multi-align pronunciation data for the purpose of phylogenetic analysis. The Gabmap developers claim that the program could also be used to measure deviant pronunciation e.g. of second-language learners, or of speakers with speech defects. A variety of related algorithms are implemented in the package of C programs (and R programs) the developers turned into a web application, including a basic version regarding segments only as same or different, and other versions variously respecting consonant/vowel distinctions; using phonetic segment distances as provided via an assignment of phonetic or phonological features to segments; using segment distances as learned from refining alignment correspondences; and applying weightings derived from (inverse) frequency (derived from Goebl’s work) or depending on the position within a word. There are useful auxiliary programs aimed at assisting users in converting phonetic data to X-SAMPA and at spotting errors. (In working with users in the past, the developers have noted that data conversion is a major hurdle.) There are additional meta-analytical calculations aimed at gauging how reliable the signal is from a given set of data, and aimed at comparing various options with respect to the degree to which they capture the geographic cohesion one assumes in dialectology. Gabmap was developed in the CLARIN-NL project ADEPT: Assaying Differences via Edit-Distance of Pronunciation Transcriptions.

Nerbonne, J., Colen, R., Gooskens, C., Kleiweg, P., and Leinonen, T. (2011). Gabmap — A Web Application for Dialectology. Dialectologica, Special issue II, 65-89.

T. Leinonen, Ç. Çöltekin, J. Nerbonne, Using Gabmap. Lingua Vol. 178, 71-83, doi:10.1016/j.lingua.2015.02.004

Usage

3 resources

The system here allows you to convert your book pages' images into editable text, presented in a particular text format called XML (eXtended Markup Language) of a particular type called Text-Encoding Initiative or TEI XML. This particular format was developed specifically for being able to mark-up or annotate the text you want to work on, i.e. to add all manner of further information to the actual text, e.g. to build a critical edition of it, which is most likely exactly what you want to do with your author's work.

Betti, A, Reynaert, M and van den Berg, H. 2017. @PhilosTEI: Building Corpora for Philosophers. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 379–392. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.32. License: CC-BY 4.0

Result filters

Metadata provider

Language

Resource type

Tool task

Field of study

Availability

Organisation

Project

Keywords

Active filters:

Search results

SHEBANQ: System for HEBrew Text: ANnotations for Queries and Markup

Namescape Named Entity Recognition

Manual Oral History Annotation Tool

GrNe: Greek-Dutch dictionary

COBWWWEB: Connections Between Women and Writings Within European Borders

PaQu - Parse and Query

A Distributed Lemmatizer for Historical Dutch

Arthurian Fiction

Gabmap is a free web-based application for dialectometry. It measures the differences in sets of phonetic (or phonemic) transcriptions via edit distance. Gabmap has a graphical user interface that makes string comparison facility available as a web application.

Usage

Result filters

Metadata provider

Language

Resource type

Tool task

Field of study

Availability

Organisation

Project

Keywords

Active filters:

Search results

Session recording