CLARIN Tool Portal

Ucto Tokeniser Engine

1 resource

The Ucto tokenisation engine is a language-independent engine that, given an external configuration file with tokenisation rules for a specific language, yields a tokeniser for that language. The resulting tokeniser processes text files: it separates words from punctuation and splits sentences. This is one of the first tasks for almost any Natural Language Processing application. Ucto also offers several other basic preprocessing steps, such as changing case, all of which you can use to make your text suited for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules for English, Dutch, French, Italian, and Swedish; it is easily extensible to other languages. It recognises dates, times, units, currencies, and abbreviations, as well as paired quote spans, sentences, and paragraphs. It produces UTF-8 encoded, NFC-normalised output and optionally accepts other encodings as input. Conversion to all lowercase or all uppercase is optional. Ucto supports FoLiA XML.
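The idea of deriving a tokeniser from language-specific rules can be illustrated with a minimal Python sketch. This is not Ucto's implementation or its configuration format: the `RULES` table, the abbreviation lists, and the function names are hypothetical, chosen only to show how one engine plus a per-language rule set might yield a tokeniser.

```python
import re

# Hypothetical per-language rule sets (assumption for illustration only;
# Ucto's real configuration files use their own format).
RULES = {
    "en": {"abbreviations": {"Mr.", "Dr.", "e.g.", "etc."}},
    "nl": {"abbreviations": {"dhr.", "mevr.", "bijv."}},
}

# A word (optionally with internal hyphens or apostrophes) or one punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def make_tokenizer(language):
    """Build a tokeniser function from the rule set of one language."""
    abbreviations = RULES[language]["abbreviations"]

    def tokenize(text):
        tokens = []
        for chunk in text.split():
            if chunk in abbreviations:
                # Keep known abbreviations whole, including the final period.
                tokens.append(chunk)
            else:
                # Otherwise separate words from punctuation.
                tokens.extend(TOKEN_RE.findall(chunk))
        return tokens

    return tokenize


def split_sentences(tokens):
    """Start a new sentence after each sentence-final punctuation token."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


tokenize_en = make_tokenizer("en")
tokens = tokenize_en("Dr. Smith arrived. He left!")
sentences = split_sentences(tokens)
```

Because "Dr." is listed as an abbreviation in the toy English rules, its period is not treated as a sentence boundary, which is the kind of decision a language-specific rule file is meant to capture.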

The Typological Database System (TDS)

1 resource

The Typological Database System (TDS) is a web-based service that provides integrated access to a collection of independently developed typological databases. Unified querying is supported with the help of an integrated ontology. The component databases of the TDS are cross-linguistic databases, developed for research in language typology and linguistics. Together they contain some 1200 different descriptive properties, with information about more than 1000 languages. Because of the heterogeneous nature of the collection, most properties are filled for only a fraction of the languages. Most of the data is in the form of high-level "analytical" properties, but there are also a few collections of example sentences (with glosses) illustrating particular phenomena.

Language typology, the study of the range of language variation and universals, is a data-intensive discipline that increasingly relies on electronic databases. Improved availability of the data collected in the TDS enhances its potential to support linguistic research. The TDS can be used to help answer questions such as "which languages have the basic word order Verb-Object-Subject", "what kind of phonological stress systems are common", and "are languages with subject-verb agreement more likely to allow null subjects than languages without it". The system is not an oracle: in all cases, only partial information is returned, as collected and deposited in the system by the creators of the component databases. But this information can be invaluable to other researchers, either as a complete answer to a specific question or as the starting point for further research.

Given that the collected data represents linguistic analysis and often novel theoretical approaches, it is impossible to map it to a single "consensus" standard. While in some limited cases it is possible to completely reconcile data from different sources, the system places a premium on preserving the theoretical orientations and analyses of the component databases, which are presented side by side as alternative datasets in the same topical group.

The TDS project was carried out by a research group of the Netherlands Graduate School of Linguistics (LOT), with members representing the University of Amsterdam, Leiden University, Radboud University Nijmegen, and Utrecht University. It was developed with support from NWO (Netherlands Organization for Scientific Research) grant 380-30-004 / INV-03-12 and from participating universities. The initial phase of the project started in September 2000, and the project entered the implementation phase on 1 May 2004. Originally scheduled to run for three years, it was extended until 31 December 2007. The TDS server and data collections continued to be augmented until 2009.

While the original TDS web server is still operational, web technologies evolve rapidly. The system had begun to show its age even before the end of the project in 2009, motivating migration of the data collection to an archival platform. But due to the complexity and diversity of the component databases, the data cannot be usefully navigated without specialized supporting software; useful archiving necessitates a software access point alongside the static data. Under the "TDS Curator" project, supported by a CLARIN-NL Call 1 grant, the TDS has migrated to a new platform, hosted by the Data Archiving and Networked Services (DANS), that conforms to CLARIN infrastructural requirements. Both versions of the system remain in operation.
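The side-by-side presentation described above can be pictured with a small Python sketch. Everything in it is a placeholder: the database names, language names, property names, and the toy `ONTOLOGY` mapping are hypothetical and do not reflect the actual TDS data or software. It only shows how a shared concept might be mapped to each database's local property while the answers are kept separate per source.

```python
# Hypothetical component databases; all names and values are placeholders.
COMPONENT_DATABASES = {
    "db_one": {
        "language_a": {"basic_word_order": "VOS"},
        "language_b": {"basic_word_order": "SVO"},
    },
    "db_two": {
        "language_a": {"word_order": "VOS"},
        "language_c": {"word_order": "VOS"},
    },
}

# Toy stand-in for an integrated ontology: one shared concept, mapped to the
# local property name that each component database uses for it.
ONTOLOGY = {
    "basic word order": {"db_one": "basic_word_order", "db_two": "word_order"},
}


def query(concept, value):
    """Return matching languages per database, without merging the sources."""
    results = {}
    for db_name, local_property in ONTOLOGY[concept].items():
        database = COMPONENT_DATABASES[db_name]
        results[db_name] = sorted(
            language
            for language, properties in database.items()
            if properties.get(local_property) == value
        )
    return results


vos_languages = query("basic word order", "VOS")
```

Each database contributes its own list, so a language covered by only one source (here `language_c`) still appears, and no source's analysis is overwritten by another's.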

Windhouwer, M, Dimitriadis, A and Akerman, V. 2017. Curating the Typological Database System. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 123–132. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.11. License: CC-BY 4.0

A. Dimitriadis, M. Windhouwer, A. Saulwick, R. Goedemans, T. Bíró. How to integrate databases without starting a typology war: The Typological Database System. In S. Musgrave, M. Everaert and A. Dimitriadis (eds.), The use of databases in cross-linguistic research, Mouton de Gruyter, March 2009.

M. Windhouwer, A. Dimitriadis. Sustainable operability: Keeping complex resources alive. In Proceedings of the LREC workshop on Sustainability of Language Resources and Tools for Natural Language Processing (SustainableNLP08), Marrakech, Morocco, May 31, 2008.

A. Dimitriadis. Managing Differences: The TDS Approach. In Proceedings of the E-MELD Workshop on Toward the Interoperability of Language Resources (E-MELD 2007), Stanford, CA, July 13-15, 2007. Position paper.

A. Dimitriadis, A. Saulwick, M. Windhouwer. Semantic relations in ontology mediated linguistic data integration. In Proceedings of the E-MELD Workshop on Morphosyntactic Annotation and Terminology: Linguistic Ontologies and Data Categories for Linguistic Resources (E-MELD 2005), Cambridge, Massachusetts, July 1-3, 2005.

A. Saulwick, M. Windhouwer, A. Dimitriadis, R. Goedemans. Distributed tasking in ontology mediated integration of typological databases for linguistic research. In J. Castro and E. Teniente, Proceedings of the CAiSE'05 Workshops (International Workshop on Data Integration and the Semantic Web (DISWeb'05) in conjunction with CAiSE'05), Volume I, pp 303-317, Porto, Portugal, June 14, 2005.

A. Dimitriadis, P. Monachesi. Integrating Different Data Types in a Typological Database System. In P. Austin, H. Dry and P. Wittenburg (eds.), Proceedings of the International Workshop on Resources and Tools in Field Linguistics, Las Palmas, Canary Islands, Spain, 2002.

P. Monachesi, A. Dimitriadis, R. Goedemans, A. Mineur, M. Pinto. A Unified System for Accessing Typological Databases. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 3), Las Palmas, Canary Islands, Spain, 2002.

Ucto Tokeniser

1 resource

Ucto tokenises text files: it separates words from punctuation and splits sentences. This is one of the first tasks for almost any Natural Language Processing application. Ucto also offers several other basic preprocessing steps, such as changing case, all of which you can use to make your text suited for further processing such as indexing, part-of-speech tagging, or machine translation. The tokeniser engine is language independent: by supplying language-specific tokenisation rules in an external configuration file, a tokeniser can be created for a specific language. Ucto comes with tokenisation rules for English, Dutch, French, Italian, and Swedish; it is easily extensible to other languages. It recognises dates, times, units, currencies, and abbreviations, as well as paired quote spans, sentences, and paragraphs. It produces UTF-8 encoded, NFC-normalised output and optionally accepts other encodings as input. Conversion to all lowercase or all uppercase is optional. Ucto supports FoLiA XML.
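The encoding and normalisation behaviour described above (other input encodings accepted, UTF-8 and NFC on output, optional case conversion) can be sketched with Python's standard library. The function name `normalize_text` and its parameters are assumptions made for this illustration; they are not Ucto options.

```python
import unicodedata


def normalize_text(raw_bytes, encoding="utf-8", case=None):
    """Decode input bytes, apply NFC normalisation, optionally change case,
    and return UTF-8 encoded output.

    `encoding` and `case` are hypothetical parameter names for this sketch.
    """
    text = raw_bytes.decode(encoding)
    # NFC composes base characters and combining marks where possible.
    text = unicodedata.normalize("NFC", text)
    if case == "lower":
        text = text.lower()
    elif case == "upper":
        text = text.upper()
    return text.encode("utf-8")


# Latin-1 input converted to lowercase UTF-8 output.
latin1_result = normalize_text("Café".encode("latin-1"), encoding="latin-1", case="lower")

# A decomposed "e" + combining acute accent becomes the single composed character.
composed_result = normalize_text("Cafe\u0301".encode("utf-8"))
```

NFC matters for downstream tools because visually identical strings can otherwise differ at the code-point level and fail to match in an index or lexicon lookup.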

ELAN Multimedia Annotator

1 resource

ELAN is a professional tool for the creation of complex annotations on video and audio resources. With ELAN a user can add an unlimited number of annotations to audio and/or video streams. An annotation can be a sentence, word or gloss, a comment, translation or a description of any feature observed in the media. Annotations can be created on multiple layers, called tiers. Tiers can be hierarchically interconnected. An annotation can either be time-aligned to the media or it can refer to other existing annotations. The textual content of annotations is always in Unicode and the transcription is stored in an XML format. ELAN provides several different views on the annotations; each view is connected and synchronized to the media playhead. Up to 4 video files can be associated with an annotation document. Each video can be integrated in the main document window or displayed in its own resizable window. ELAN delegates media playback to an existing media framework, like Windows Media Player, QuickTime or JMF (Java Media Framework). As a result, a wide variety of audio and video formats is supported and high-performance media playback can be achieved. ELAN is written in the Java programming language and the sources are available for non-commercial use. It runs on Windows, Mac OS X and Linux.

ELAN has been functionally extended with the help of the following CLARIN-NL-funded projects:
- ColTime: Collaboration on Time-Based Resources.
- EXILSEA: Exploiting ISOcat's Language Sections in ELAN and ANNEX.
- MultiCon: Multilayer Concordance Functions in ELAN and ANNEX.
- SignLinC: Linking lexical databases and annotated corpora of signed languages.

Over the years, many funders have contributed to the development of ELAN through several projects, including the Volkswagen Foundation, the Royal Netherlands Academy of Arts and Sciences, the Berlin-Brandenburg Academy of Sciences and Humanities, the German Federal Ministry of Education and Research, the Max Planck Society and the ARC Centre of Excellence for the Dynamics of Language.
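The tier model described in the ELAN entry (hierarchically connected tiers, with annotations that are either time-aligned or refer to another annotation) can be pictured as a small data structure. The Python sketch below is an illustration under that description only: the class and field names are hypothetical and do not correspond to ELAN's Java classes or to its XML file format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Annotation:
    """An annotation is either time-aligned or refers to a parent annotation."""
    value: str
    start_ms: Optional[int] = None           # set for time-aligned annotations
    end_ms: Optional[int] = None
    parent: Optional["Annotation"] = None    # set for referring annotations

    def time_span(self) -> Tuple[Optional[int], Optional[int]]:
        """Follow references upward until a time-aligned annotation is found."""
        node = self
        while node.start_ms is None and node.parent is not None:
            node = node.parent
        return (node.start_ms, node.end_ms)


@dataclass
class Tier:
    """A layer of annotations; a tier may depend on a parent tier."""
    name: str
    parent_tier: Optional["Tier"] = None
    annotations: List[Annotation] = field(default_factory=list)


# A time-aligned utterance and a translation that refers to it.
utterance_tier = Tier("utterance")
translation_tier = Tier("translation", parent_tier=utterance_tier)

utterance = Annotation("good morning", start_ms=1200, end_ms=2500)
translation = Annotation("goedemorgen", parent=utterance)

utterance_tier.annotations.append(utterance)
translation_tier.annotations.append(translation)

span = translation.time_span()
```

The referring annotation stores no times of its own, so moving the boundaries of the utterance would automatically move the translation with it.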

WebStylo

2 resources

Web-based, open stylometry system based on Multilevel Text Analysis. It runs the cluto and stylo (R system) clustering methods and is built on a Natural Language Processing workflow engine, which is included in the distribution.

The CLASSLA-Stanza model for lemmatisation of standard Slovenian 2.0

2 resources

This model for lemmatisation of standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training corpus (http://hdl.handle.net/11356/1747) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1204) expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). The estimated F1 of the lemma annotations is ~99.11. Compared with the previous version, this model was trained on the SUK training corpus and uses new embeddings and the new version of the Slovene morphological lexicon, Sloleks 3.0 (http://hdl.handle.net/11356/1745).
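The entry does not say how the lemma F1 was estimated. One common way to score lemmas when system and gold tokenisation may differ is to compare (start, end, lemma) triples, and the Python sketch below illustrates that approach under this assumption. The function name and the example spans are made up for illustration and are not taken from the CLASSLA-Stanza evaluation.

```python
def lemma_f1(gold, system):
    """F1 over (start, end, lemma) triples.

    A system lemma counts as correct only if a gold token with the same
    character span carries the same lemma.
    """
    gold_set = set(gold)
    system_set = set(system)
    correct = len(gold_set & system_set)
    if correct == 0:
        return 0.0
    precision = correct / len(system_set)
    recall = correct / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Toy example for the text "Mačke spijo": the second system lemma is wrong.
gold = [(0, 5, "mačka"), (6, 11, "spati")]
system = [(0, 5, "mačka"), (6, 11, "spijo")]
score = lemma_f1(gold, system)
```

When tokenisation is identical on both sides, precision and recall coincide and the F1 reduces to plain lemma accuracy.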

Toposław 2 (2016-05-31)

3 resources

Toposław 2 is an editor of multi-word unit inflection lexicons.

ENIAM

4 resources

ENIAM: Categorial Syntactic-Semantic Parser for Polish

Liner2.6 model NER NKJP

3 resources

The package contains a pre-trained Liner2 (https://github.com/CLARIN-PL/Liner2) model for recognising named entities according to the NKJP guidelines. The model was trained on the NKJP corpus (http://nkjp.pl/) and evaluated in PolEval 2018 Task 2 (http://poleval.pl/tasks/). The model took third place with the following results: Exact: 0.778, Overlap: 0.818, Final: 0.810.

References:
* NKJP corpus in TEI format: http://clip.ipipan.waw.pl/NationalCorpusOfPolish?action=AttachFile&do=view&target=NKJP-PodkorpusMilionowy-1.2.tar.gz
* PolEval 2018 Task 2 evaluation corpus: http://mozart.ipipan.waw.pl/~axw/poleval2018/
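The three scores can be related with a short calculation. The sketch below assumes that the final score is a weighted combination of 0.8 × Overlap and 0.2 × Exact; the entry itself does not state the weighting, but these weights reproduce the reported figure.

```python
# Scores reported in the entry above.
EXACT = 0.778
OVERLAP = 0.818

# Assumed weighting (not stated in the entry): overlap 0.8, exact 0.2.
OVERLAP_WEIGHT = 0.8
EXACT_WEIGHT = 0.2


def final_score(exact, overlap):
    """Weighted combination of the overlap and exact-match scores."""
    return OVERLAP_WEIGHT * overlap + EXACT_WEIGHT * exact


final = round(final_score(EXACT, OVERLAP), 3)
```

With these inputs the rounded result is 0.81, matching the reported Final value of 0.810.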

The CLASSLA-Stanza model for lemmatisation of non-standard Croatian 2.1

2 resources

The model for lemmatisation of non-standard Croatian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the hr500k training corpus (http://hdl.handle.net/11356/1792) and the ReLDI-NormTagNER-hr corpus (http://hdl.handle.net/11356/1793), using the hrLex inflectional lexicon (http://hdl.handle.net/11356/1232). To handle missing diacritics, these corpora were additionally augmented by repeating parts of them with the diacritics removed. The estimated F1 of the lemma annotations is ~94.23. Compared with the previous version, this model is trained on a combination of two corpora (hr500k, ReLDI-NormTagNER-hr).
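The diacritic augmentation mentioned above can be sketched in a few lines of Python. This is an illustration, not the procedure used to build the model: the function names, the `every` parameter that selects which sentences are repeated, and the choice to map "đ" to "d" (rather than, say, "dj") are all assumptions.

```python
import unicodedata


def strip_diacritics(text):
    """Remove diacritics from Croatian text (č, ć, š, ž, đ and uppercase forms)."""
    # "đ" has no Unicode decomposition, so it is mapped explicitly (assumed: đ -> d).
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits letters such as "č" into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def augment(sentences, every=2):
    """Append a diacritic-free copy of every `every`-th sentence to the data."""
    augmented = list(sentences)
    for index, sentence in enumerate(sentences):
        if index % every == 0:
            augmented.append(strip_diacritics(sentence))
    return augmented


plain = strip_diacritics("čćšžđ ČĆŠŽĐ")
corpus = augment(["Što je?", "Dobro."], every=2)
```

In a real training corpus the annotations of the repeated sentences would be kept unchanged, so the model sees the diacritic-free surface form paired with the correct lemma.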
