CLARIN Tool Portal

698 record(s) found

Search results

FLAT: FoLiA-Linguistic-Annotation-Tool

1 resources

FLAT is a web-based linguistic annotation environment built around the FoLiA format (http://proycon.github.io/folia), a rich XML-based format for linguistic annotation. FLAT allows users to view annotated FoLiA documents and enrich them with new annotations; a wide variety of linguistic annotation types is supported through the FoLiA paradigm. It is a document-centric tool that fully preserves and visualises document structure.

Features
- Web-based, multi-user environment
- Server-side document storage, divided into 'namespaces'; by default each user has their own namespace. Active documents are held in memory server-side. Read and write permissions to access namespaces are fully configurable.
- Concurrency (multiple users may edit the same document simultaneously)
- Full version control for documents (using git, in foliadocserve), allowing limitless undo operations
- Full annotator information, with timestamps, is stored in the FoLiA XML and can be displayed by the interface. The git log also contains verbose information on annotations.
- Annotators can indicate their confidence for each annotation
- Highly configurable interface; interface options may be disabled on a per-configuration basis. Multiple configurations can be deployed on a single installation.
- Displays and retains document structure (divisions, paragraphs, sentences, lists, etc.)
- Support for corrections, of text or linguistic annotations, and alternative annotations. Corrections are never mere substitutes; originals are always preserved. Spelling corrections for run-ons, splits, insertions and deletions are supported.
- Supports FoLiA Set Definitions to display label sets. Sets are not predefined in FoLiA and anybody can create their own.
- Supports Token Annotation and Span Annotation
- Supports complex span annotations such as dependency relations, syntactic units (constituents), predicates and semantic roles, sentiments, statements/attribution, and observations
- Simple metadata editor for editing/adding arbitrary metadata to a document. Selected metadata fields can be shown in the document index as well.
- User permission model featuring groups, group namespaces, and assignable permissions
- File/document management functions (copying, moving, deleting)
- Allows converter extensions to convert from other formats to FoLiA on upload
- In-document search (CQL or FQL); advanced searches can be predefined by administrators
- Morphosyntactic tree visualisation (constituency parses and morphemes)
- Higher-order annotation: associate features, comments, descriptions with any linguistic annotation
Web-based Annotation Explorer

1 resources

ANNEX (Annotation Explorer) is a web-based tool for exploring and viewing annotated multimedia recordings in an archive. ANNEX can play audio and video files in a web browser along with annotations in a variety of formats: ELAN (EAF), Shoebox/Toolbox text, CHAT (CHILDES annotation format), plain text, CSV, PDF, SubRip, Praat TextGrid, HTML and XML. ANNEX visualises the annotations in synchrony with the media files as long as time-alignment information is available. If no time-alignment information is available, a default segment duration is assumed. ANNEX has a graphical interface that resembles the interface of the ELAN annotation tool to some extent, with a number of different view modes such as subtitle view, timeline view and grid view. ANNEX runs in any modern web browser with the Adobe Flash plugin (later than version 10) installed.

ANNEX has been functionally extended with the help of the following CLARIN-NL-funded projects:
- ColTime: Collaboration on Time-Based Resources.
- EXILSEA: Exploiting ISOcat's Language Sections in ELAN and ANNEX.
- MultiCon: Multilayer Concordance Functions in ELAN and ANNEX.
- SignLinC: Linking lexical databases and annotated corpora of signed languages.

Over the years, many funders have contributed to the development of ANNEX in several projects, such as the Volkswagen Foundation, the Royal Netherlands Academy of Arts and Sciences, the Berlin-Brandenburg Academy of Sciences and Humanities, the German Federal Ministry of Education and Research and the Max Planck Society.
CLARIN Vocabulary Service

1 resources

The CLARIN Vocabulary Service is a running instance of the OpenSKOS exchange and publication platform for SKOS vocabularies. OpenSKOS offers several ways to publish SKOS vocabularies (upload a SKOS file, harvest from another OpenSKOS instance with OAI-PMH, construct using the RESTful API) and to use vocabularies (search and autocomplete using the API, harvest using OAI-PMH, inspect in the interactive Editor or consult as Linked Data). This CLARIN OpenSKOS instance is hosted by the Meertens Institute.

Contents
This OpenSKOS instance currently publishes SKOS versions of three vocabularies:
- ISO-639-3 language codes, as published by SIL.
- Closed and simple Data Categories from the ISOcat metadata profile.
- A manually constructed and curated list of Organizations, based on the CLARIN VLO.
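Harvesting with OAI-PMH, as mentioned above, comes down to sending HTTP requests with a `verb` parameter. The following minimal sketch only builds such a request URL. The base URL is a placeholder (the actual endpoint of this instance is not given here), the `set_spec` value in the usage line is made up, and `oai_dc` is used because the OAI-PMH protocol requires every repository to offer it; a given instance may expose richer prefixes.

```python
from urllib.parse import urlencode

# Placeholder endpoint (an assumption): replace with the real OAI-PMH base URL.
BASE_URL = "https://example.org/oai-pmh"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL.

    `set_spec` optionally restricts harvesting to one set, if the
    repository defines sets.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    print(list_records_url(BASE_URL))
    print(list_records_url(BASE_URL, set_spec="languages"))
```

A real harvester would also follow the `resumptionToken` that a repository returns when a result list is split over several responses; that step is left out of this sketch.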

Brugman, H. 2017. CLAVAS: A CLARIN Vocabulary and Alignment Service. In: Odijk J. & van Hessen A, CLARIN in the Low Countries, ch 5, pp 61-69. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.5
Ucto Tokeniser Engine

1 resources

The Ucto tokenisation engine is a language-independent engine that, given an external configuration file with tokenisation rules for a specific language, yields a tokeniser for that language. The tokeniser processes text files: it separates words from punctuation and splits sentences. This is one of the first tasks for almost any Natural Language Processing application. Ucto offers several other basic preprocessing steps, such as changing case, all of which you can use to make your text suited for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules for English, Dutch, French, Italian, and Swedish; it is easily extensible to other languages. It recognises dates, times, units, currencies, and abbreviations, as well as paired quote spans, sentences, and paragraphs. It produces UTF-8 output with NFC normalisation and optionally accepts other encodings as input. Optional conversion to all lowercase or uppercase is available. Ucto supports FoLiA XML.
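To make the task concrete, here is a minimal Python sketch of rule-based tokenisation: NFC normalisation, separating words from punctuation, and splitting sentences. It is not Ucto's implementation or API; the single regular expression and the `tokenize` function are illustrative assumptions, far simpler than Ucto's language-specific rule files.

```python
import re
import unicodedata

# Illustrative pattern, not Ucto's rule set: a token is either a run of word
# characters (optionally joined by an apostrophe or hyphen) or one single
# punctuation/symbol character.
TOKEN_RE = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")
SENTENCE_END = {".", "!", "?"}


def tokenize(text, lowercase=False):
    """Return a list of sentences, each a list of token strings."""
    text = unicodedata.normalize("NFC", text)
    if lowercase:
        text = text.lower()
    sentences, current = [], []
    for token in TOKEN_RE.findall(text):
        current.append(token)
        if token in SENTENCE_END:
            # Naive rule: every ., ! or ? closes the current sentence.
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    for sentence in tokenize("Hello, world! Ucto splits sentences."):
        print(" ".join(sentence))
```

The naive sentence rule breaks on abbreviations such as "e.g."; recognising abbreviations, dates, units and currencies is exactly what a dedicated rule set adds on top of a sketch like this.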
The Typological Database System (TDS)

1 resources

The Typological Database System (TDS) is a web-based service that provides integrated access to a collection of independently developed typological databases. Unified querying is supported with the help of an integrated ontology. The component databases of the TDS are cross-linguistic databases, developed for research in language typology and linguistics. Together they contain some 1200 different descriptive properties, with information about more than 1000 languages. (Because of the heterogeneous nature of the collection, most properties are only filled for a fraction of the languages.) Most of the data is in the form of high-level "analytical" properties, but there are also a few collections of example sentences (with glosses) illustrating particular phenomena. Language typology, the study of the range of language variation and universals, is a data-intensive discipline that increasingly relies on electronic databases. Improved availability of the data collected in the TDS enhances its potential to support linguistic research. The TDS can be used to help answer questions such as "which languages have the basic word order Verb-Object-Subject", "what kind of phonological stress systems are common", "are languages with subject-verb agreement more likely to allow null subjects than languages without it", etc. The system is not an oracle: in all cases, only partial information is returned, as collected and deposited in the system by the creators of the component databases. But this information can be invaluable to other researchers, either as a complete answer to a specific question or as the starting point for further research. Given that the collected data represents linguistic analysis and often novel theoretical approaches, it is impossible to map it to a single "consensus" standard. 
While in some limited cases it is possible to completely reconcile data from different sources, the system places a premium on preserving the theoretical orientations and analyses of the component databases, which are presented side by side as alternative datasets in the same topical group. The TDS project was carried out by a research group of the Netherlands Graduate School of Linguistics (LOT), with members representing the University of Amsterdam, Leiden University, Radboud University Nijmegen, and Utrecht University. It was developed with support from NWO (Netherlands Organization for Scientific Research) grant 380-30-004 / INV-03-12 and from participating universities. The initial phase of the project was started in September 2000, and the project entered the implementation phase on 1 May 2004. Originally scheduled to run for three years, it was extended until 31 December 2007. The TDS server and data collections continued to be augmented until 2009. While the original TDS web server is still operational, web technologies evolve rapidly. The system had begun to show its age even before the end of the project in 2009, motivating migration of the data collection to an archival platform. But due to the complexity and diversity of the component databases, the data cannot be usefully navigated without specialized supporting software; useful archiving necessitates a software access point alongside the static data. Under the "TDS Curator" project, supported by a CLARIN-NL Call 1 grant, the TDS has migrated to a new platform, hosted by the Data Archiving and Networked Services (DANS), that conforms to CLARIN infrastructural requirements. Both versions of the system remain in operation.

Windhouwer, M, Dimitriadis, A and Akerman, V. 2017. Curating the Typological Database System. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 123–132. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.11. License: CC-BY 4.0

A. Dimitriadis, M. Windhouwer, A. Saulwick, R. Goedemans, T. Bíró. How to integrate databases without starting a typology war: The Typological Database System. In S. Musgrave, M. Everaert and A. Dimitriadis (eds.), The use of databases in cross-linguistic research, Mouton de Gruyter, March 2009.

M. Windhouwer, A. Dimitriadis. Sustainable operability: Keeping complex resources alive. In Proceedings of the LREC workshop on Sustainability of Language Resources and Tools for Natural Language Processing (SustainableNLP08), Marrakech, Morocco, May 31, 2008.

A. Dimitriadis. Managing Differences: The TDS Approach. In Proceedings of the E-MELD Workshop on Toward the Interoperability of Language Resources (E-MELD 2007), Stanford, CA, July 13-15, 2007. Position paper.

A. Dimitriadis, A. Saulwick, M. Windhouwer. Semantic relations in ontology mediated linguistic data integration. In Proceedings of the E-MELD Workshop on Morphosyntactic Annotation and Terminology: Linguistic Ontologies and Data Categories for Linguistic Resources (E-MELD 2005), Cambridge, Massachusetts, July 1-3, 2005.

A. Saulwick, M. Windhouwer, A. Dimitriadis, R. Goedemans. Distributed tasking in ontology mediated integration of typological databases for linguistic research. In J. Castro and E. Teniente, Proceedings of the CAiSE'05 Workshops (International Workshop on Data Integration and the Semantic Web (DISWeb'05) in conjunction with CAiSE'05), Volume I, pp 303-317, Porto, Portugal, June 14, 2005.

A. Dimitriadis, P. Monachesi. Integrating Different Data Types in a Typological Database System. In P. Austin, H. Dry and P. Wittenburg (eds.), Proceedings of the International Workshop on Resources and Tools in Field Linguistics, Las Palmas, Canary Islands, Spain, 2002.

P. Monachesi, A. Dimitriadis, R. Goedemans, A. Mineur, M. Pinto. A Unified System for Accessing Typological Databases. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 3), Las Palmas, Canary Islands, Spain, 2002.
Ucto Tokeniser

1 resources

Ucto tokenises text files: it separates words from punctuation and splits sentences. This is one of the first tasks for almost any Natural Language Processing application. Ucto offers several other basic preprocessing steps, such as changing case, all of which you can use to make your text suited for further processing such as indexing, part-of-speech tagging, or machine translation. The tokeniser engine is language independent. By supplying language-specific tokenisation rules in an external configuration file, a tokeniser can be created for a specific language. Ucto comes with tokenisation rules for English, Dutch, French, Italian, and Swedish; it is easily extensible to other languages. It recognises dates, times, units, currencies, and abbreviations, as well as paired quote spans, sentences, and paragraphs. It produces UTF-8 output with NFC normalisation and optionally accepts other encodings as input. Optional conversion to all lowercase or uppercase is available. Ucto supports FoLiA XML.

ELAN Multimedia Annotator

1 resources

ELAN is a professional tool for the creation of complex annotations on video and audio resources. With ELAN a user can add an unlimited number of annotations to audio and/or video streams. An annotation can be a sentence, word or gloss, a comment, translation or a description of any feature observed in the media. Annotations can be created on multiple layers, called tiers. Tiers can be hierarchically interconnected. An annotation can either be time-aligned to the media or it can refer to other existing annotations. The textual content of annotations is always in Unicode and the transcription is stored in an XML format. ELAN provides several different views on the annotations; each view is connected and synchronized to the media playhead. Up to 4 video files can be associated with an annotation document. Each video can be integrated in the main document window or displayed in its own resizable window. ELAN delegates media playback to an existing media framework, like Windows Media Player, QuickTime or JMF (Java Media Framework). As a result, a wide variety of audio and video formats is supported and high-performance media playback can be achieved. ELAN is written in the Java programming language and the sources are available for non-commercial use. It runs on Windows, Mac OS X and Linux.

ELAN has been functionally extended with the help of the following CLARIN-NL-funded projects:
- ColTime: Collaboration on Time-Based Resources.
- EXILSEA: Exploiting ISOcat's Language Sections in ELAN and ANNEX.
- MultiCon: Multilayer Concordance Functions in ELAN and ANNEX.
- SignLinC: Linking lexical databases and annotated corpora of signed languages.

Over the years, many funders have contributed to the development of ELAN in several projects, such as the Volkswagen Foundation, the Royal Netherlands Academy of Arts and Sciences, the Berlin-Brandenburg Academy of Sciences and Humanities, the German Federal Ministry of Education and Research, the Max Planck Society and the ARC Centre of Excellence for the Dynamics of Language.
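The tier model described for ELAN (time-aligned annotations, plus annotations that refer to other annotations) can be illustrated by reading a small XML fragment. The sketch below uses a simplified, hand-written sample modelled on the EAF format; real EAF files carry additional headers and type definitions that are omitted here, and the function name `read_annotations` is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written sample modelled on EAF (an assumption, not a real file).
SAMPLE_EAF = """<ANNOTATION_DOCUMENT>
  <TIME_ORDER>
    <TIME_SLOT TIME_SLOT_ID="ts1" TIME_VALUE="0"/>
    <TIME_SLOT TIME_SLOT_ID="ts2" TIME_VALUE="1500"/>
  </TIME_ORDER>
  <TIER TIER_ID="utterance">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>hello there</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="translation" PARENT_REF="utterance">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a2" ANNOTATION_REF="a1">
        <ANNOTATION_VALUE>bonjour</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>"""


def read_annotations(eaf_text):
    """Return (tier id, text, start ms, end ms) for every annotation."""
    root = ET.fromstring(eaf_text)
    slots = {s.get("TIME_SLOT_ID"): int(s.get("TIME_VALUE"))
             for s in root.iter("TIME_SLOT") if s.get("TIME_VALUE") is not None}
    spans = {}    # annotation id -> (start, end) for time-aligned annotations
    parents = {}  # annotation id -> id of the annotation it refers to
    found = []    # (tier id, annotation id, text) in document order
    for tier in root.iter("TIER"):
        tier_id = tier.get("TIER_ID")
        for ann in tier.iter("ALIGNABLE_ANNOTATION"):
            ann_id = ann.get("ANNOTATION_ID")
            spans[ann_id] = (slots.get(ann.get("TIME_SLOT_REF1")),
                             slots.get(ann.get("TIME_SLOT_REF2")))
            found.append((tier_id, ann_id, ann.findtext("ANNOTATION_VALUE", default="")))
        for ann in tier.iter("REF_ANNOTATION"):
            ann_id = ann.get("ANNOTATION_ID")
            parents[ann_id] = ann.get("ANNOTATION_REF")
            found.append((tier_id, ann_id, ann.findtext("ANNOTATION_VALUE", default="")))

    def span_of(ann_id):
        # Follow references until a time-aligned annotation is reached
        # (assumes the references contain no cycles).
        while ann_id in parents:
            ann_id = parents[ann_id]
        return spans.get(ann_id, (None, None))

    return [(tier_id, text, *span_of(ann_id)) for tier_id, ann_id, text in found]


if __name__ == "__main__":
    for row in read_annotations(SAMPLE_EAF):
        print(row)
```

In this sketch a referring annotation simply inherits the time span of the annotation it points to, which is how the "translation" entry obtains its start and end times.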
RÚV-DI Speaker Diarization (21.10)

1 resources

This is a set of speaker diarization recipes that depend on the Kaldi speech toolkit. There are two types of recipes: recipes for decoding unseen audio, and recipes for training diarization models on the Rúv-di data. For most of the recipes, the tool also lists the diarization error rate (DER) on the Rúv-di dataset. All DERs reported by the tool are computed with no unscored collars and include overlapping speech.
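For readers unfamiliar with the metric, DER is conventionally the sum of missed speech, false-alarm speech and speaker-confusion time, divided by the total reference speech time. The sketch below only evaluates that formula from durations that are already known; it is not part of the recipes, and the function name and the example durations are made up for illustration.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Compute DER from error durations, all given in seconds.

    missed       -- reference speech that the system labelled as non-speech
    false_alarm  -- non-speech that the system labelled as speech
    confusion    -- speech attributed to the wrong speaker
    total_speech -- total duration of reference speech
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


if __name__ == "__main__":
    # Made-up durations, for illustration only.
    der = diarization_error_rate(missed=2.0, false_alarm=1.0,
                                 confusion=3.0, total_speech=60.0)
    print(f"DER = {der:.1%}")
```

Scoring without a collar means that errors close to reference speaker boundaries are counted too, and including overlapping speech means that regions with several simultaneous speakers are scored as well; both choices generally make the reported figure stricter.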

python-g419wikitools-1.0

1 resources

A set of Python scripts for generating a dictionary of inflected phrase forms based on Wikipedia's internal links. Analysing a Wikipedia dump produces a set of files:
A) wikilinks-difflen.txt: the phrases have different numbers of tokens,
B) wikilinks-samelen*: the phrases have the same number of tokens,
1. wikilinks-samelen-textbase.txt: every pair of tokens shares at least one base form,
1.a) wikilinks-samelen-textbase-one.txt: the phrases consist of a single token each,
1.b) wikilinks-samelen-textbase-multi.txt: the phrases consist of more than one token,
2. wikilinks-samelen-rules.txt: at least one pair of tokens was matched not through base forms but by applying suffix-substitution rules to the text form,
3. wikilinks-samelen-different.txt: the remaining phrases, which could not be matched.

Example contents of the file wikilinks-samelen-textbase-multi.txt:
Transformacja ustrojowa | transformacji ustrojowej | transformacji ustrojowych
Konstytucja ZSRR | Konstytucji ZSRR
Rajd Tatrzański | Rajdzie Tatrzańskim
Macierz dyskowa | macierzą dyskową | macierzy dyskowych
Osiedle Ptasie | Osiedle Ptasie
objaw Brudzińskiego | objawy Brudzińskiego
Chłopskie Stronnictwo Radykalne | Chłopskiego Stronnictwa Radykalnego
Melanie Klein | Melanią Klein
Jakub Sokołowski | Jakuba Sokołowskiego
Letnie Igrzyska Olimpijskie Młodzieży 2010 | Letnich Igrzysk Olimpijskich Młodzieży 2010
wyrabianie ciasta | wyrabiania ciasta
bitwa nad rzeką Czoroch | bitwie nad rzeką Czoroch
Nerw błędny | nerwu błędnego | nerwów błędnych
Pakt trzech | paktu trzech | Paktu Trzech | Paktu trzech
Komisja Episkopatu Polski ds. Ekumenizmu | Komisji Episkopatu Polski ds. Ekumenizmu
Flaga Albanii | flagę Albanii | flagi Albanii
Bitwa pod Chrobrzem | bitwie pod Chrobrzem
Patriarcha Indii Zachodnich | patriarchę Indii Zachodnich
procesy fizjologiczne | proces fizjologiczny
energetyka jądrowa | energetykę jądrową | energetyce jądrowej | energetyką jądrową | energetyki jądrowej
zdanie syntetyczne | zdania syntetyczne
Franciszek Ksawery | Franciszek Ksawery | Franciszka Ksawerego | Franciszkiem Ksawerym
Obwód Tirana | obwodzie Tirana
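The split of link pairs over the output files can be sketched as a small classification function. This is not the package's code: the function name, the returned labels and the `base_forms` lookup (a stand-in for a real morphological analyser) are all assumptions, and the suffix-rule stage is not modelled, so every same-length pair that fails the base-form test ends up in one "unmatched" bucket.

```python
def classify(target, anchor, base_forms):
    """Classify a (link target, anchor text) pair by token count and base forms.

    base_forms maps a lowercased token to the set of its base forms; tokens
    missing from the mapping are treated as their own base form.
    """
    target_tokens, anchor_tokens = target.split(), anchor.split()
    if len(target_tokens) != len(anchor_tokens):
        return "difflen"

    def bases(token):
        key = token.lower()
        return base_forms.get(key, {key})

    # Every aligned token pair must share at least one base form.
    if all(bases(t) & bases(a) for t, a in zip(target_tokens, anchor_tokens)):
        if len(target_tokens) == 1:
            return "samelen-textbase-one"
        return "samelen-textbase-multi"
    # The real scripts go on to try suffix-substitution rules here.
    return "samelen-unmatched"


if __name__ == "__main__":
    # Toy lookup, for illustration only.
    toy_bases = {"konstytucja": {"konstytucja"}, "konstytucji": {"konstytucja"}}
    print(classify("Konstytucja ZSRR", "Konstytucji ZSRR", toy_bases))
    print(classify("Pakt trzech", "pakt", toy_bases))
```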

VIADAT-GIS (2019-12-31)

2 resources

A VIADAT module; VIADAT-GIS connects the platform with maps. Developed in cooperation with ÚSD AV ČR and NFA.

