CLARIN Tool Portal

WebLicht Dep Parsing DE

1 resources

WebLicht Easy Chain for Dependency Parsing (German). The pipeline makes use of WebLicht's TCF converter, the IMS tokenizer, the POS Tagger from the OpenNLP projet, and the MaltParser, a system for data-driven dependency parsing. WebLicht's Tundra can be used to visualize the result.

NameTag

1 resources

NameTag is an open-source tool for named entity recognition (NER). NameTag identifies proper names in text and classifies them into predefined categories, such as names of persons, locations, organizations, etc.

LINDAT Translation

1 resources

The input file size is limited to 100kB. Translates from->to: Czech->English, Hindi, French, Russian, German English->Russsian, German, Czech, Hindi, French Russian->German, French, Czech, Hindi, English German->Russian, Hindi, Czech, English, French French->Russian, German, Czech, English, Hindi

WebLicht NamedEntities DE

1 resources

WebLicht Easy Chain for German Named Entity Recognition (German). The pipeline makes use of WebLicht's TCF converter, the IMS tokenizer, the IMS TreeTagger, and a German Named Entity Recognizer that has been trained based on a maximum entropy approach using the OpenNLP maxent library. WebLicht's Tundra can be used to visualize the result.

WebLicht Morphology DE

1 resources

WebLicht Easy Chain for Morphological Analysis (German). The pipeline makes use of WebLicht's TCF converter, the IMS tokenizer, and the IMS tool on German morphology. WebLicht's Tundra can be used to visualize the result.

Wikipedia Search DE

1 resources

Wikipedia is an online free-content encyclopedia that you can edit and contribute to. Wikipedia co-founder Jimmy Wales has described Wikipedia as "an effort to create and distribute a free encyclopedia of the highest possible quality to every single person on the planet in their own language." Wikipedia exists to bring knowledge to everyone who seeks it.

WebLicht All In One (DE)

1 resources

WebLicht Easy Chain for all German text annotations: part of speech, lemmas, morphology, dependency parsing, topological fields and named entities. The pipeline makes use of WebLicht's TCF converter, the SoMaJo tokenizer, and the SfS Sticker. WebLicht's Tundra can be used to visualize the result.

UDPipe

1 resources

UDPipe is an trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files. UDPipe is language-agnostic and can be trained given only annotated data in CoNLL-U format. Trained models are provided for nearly all UD treebanks.

TiCClops: Text-Induced Corpus Clean-up online processing system

3 resources

TICCL (Text Induced Corpus Clean-up) is a system that is designed to search a corpus for all existing variants of (potentially) all words occurring in the corpus. This corpus can be one text, or several, in one or more directories, located on one or more machines. TICCL creates word frequency lists, listing for each word type how often the word occurs in the corpus. These frequencies of the normalized word forms are the sum of the frequencies of the actual word forms found in the corpus. TICCL is a system that is intended to detect and correct typographical errors (misprints) and OCR errors (optical character recognition) in texts. When books or other texts are scanned from paper by a machine, that then turns these scans, i.e. images, into digital text files, errors occur. For instance, the letter combination `in' can be read as `m', and so the word `regeering' is incorrectly reproduced as `regeermg'. TICCL can be used to detect these errors and to suggest a correct form. Text-Induced Corpus Clean-up (TICCL) was developed first as a prototype at the request of the Koninklijke Bibliotheek - The Hague (KB) and reworked into a production tool according to KB specifications (currently at production version 2.0) mainly during the second half of 2008. It is a fully functional environment for processing possibly very large corpora in order to largely remove the undesirable lexical variation in them. It has provisions for various input and output formats, is flexible and robust and has very high recall and acceptable precision. As a spelling variation detection system it is to the developer’s knowledge unique in making principled use of the input text as possible source for target output canonical forms. As such it is far less domain-sensitive than other approaches: the domain is largely covered by the input text collection. TICCL comes in two variants: one with a classic CLAM web application interface, and one with the PhilosTEI interface.

Reynaert, M. (2008). All, and only, the errors: More complete and consistent spelling and OCR-error correction evaluation. In: Proceedings of the Sixth International Language Resources and Evaluation (LREC’08), Marrakech, Morocco.

Reynaert, M. (2010). Character confusion versus focus word-based correction of spelling and ocr variants in corpora. International Journal on Document Analysis and Recognition, pp 1-15, URL http://dx.doi.org/10.1007/s10032-010-0133-5

Usage

3 resources

The system here allows you to convert your book pages' images into editable text, presented in a particular text format called XML (eXtended Markup Language) of a particular type called Text-Encoding Initiative or TEI XML. This particular format was developed specifically for being able to mark-up or annotate the text you want to work on, i.e. to add all manner of further information to the actual text, e.g. to build a critical edition of it, which is most likely exactly what you want to do with your author's work.

Betti, A, Reynaert, M and van den Berg, H. 2017. @PhilosTEI: Building Corpora for Philosophers. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 379–392. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.32. License: CC-BY 4.0

Result filters

Metadata provider

Language

Resource type

Tool task

Availability

Organisation

Project

Keywords

Active filters:

Search results

WebLicht Dep Parsing DE

NameTag

LINDAT Translation

WebLicht NamedEntities DE

WebLicht Morphology DE

Wikipedia Search DE

WebLicht All In One (DE)

UDPipe

TiCClops: Text-Induced Corpus Clean-up online processing system

Usage

Result filters

Metadata provider

Language

Resource type

Tool task

Availability

Organisation

Project

Keywords

Active filters:

Search results

Session recording