CLARIN Tool Portal

698 record(s) found

Search results

EdUKate translation software 1

2 resources

This software package includes three tools: a web frontend for machine translation featuring phonetic transcription of Ukrainian suitable for Czech speakers, an API server, and a tool for translating documents with markup (html, docx, odt, pptx, odp, ...). These tools are used in the Charles Translator service (https://translator.cuni.cz). The software was developed within the EdUKate project, which aims to reduce the language barriers that non-Czech-speaking children in the Czech Republic face in the Czech school system. The project focuses on the development and dissemination of multilingual digital learning materials for students in primary and secondary schools.

The CLASSLA-StanfordNLP model for lemmatisation of standard Slovenian

2 resources

The model for lemmatisation of standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the ssj500k training corpus (http://hdl.handle.net/11356/1210) and using the Sloleks inflectional lexicon (http://hdl.handle.net/11356/1230). The estimated F1 of the lemma annotations is ~99.0.

MorphoDiTa-based tagger for Polish language

4 resources

A MorphoDiTa-based tagger for the Polish language. It is a tool for morphosyntactic unification of Polish text according to the NKJP tagset.

ENIAMtoolkit

2 resources

ENIAMtoolkit is a collection of libraries that perform tokenization, lemmatization, and part-of-speech tagging; detect multiword expressions (MWE) and abbreviations; and split text into sentences.

CorpoGrabber

1 resource

CorpoGrabber: The Toolchain to Automatic Acquiring and Extraction of the Website Content
Jan Kocoń, Wroclaw University of Technology

CorpoGrabber is a pipeline of tools that extracts the most relevant content of a website, including all subsites (up to a user-defined depth). The toolchain can be used to build large Web corpora of text documents and requires only a list of root websites as input. The tools composing CorpoGrabber are adapted to Polish, but most subtasks are language independent. The whole process can be run in parallel on a single machine and includes the following tasks:

- downloading the HTML subpages of each input page URL [1];
- extracting plain text from each subpage by removing boilerplate content (such as navigation links, headers, footers, and advertisements) [2];
- deduplicating the plain text [2];
- removing low-quality documents using the Morphological Analysis Converter and Aggregator (MACA) [3];
- tagging documents using the Wrocław CRF Tagger (WCRFT) [4].

The last two steps are available only for Polish. The result is a corpus consisting of a set of tagged documents for each website.

References
[1] https://www.httrack.com/html/faq.html
[2] J. Pomikalek. 2011. Removing Boilerplate and Duplicate Content from Web Corpora. Ph.D. Thesis. Masaryk University, Faculty of Informatics. Brno.
[3] A. Radziszewski, T. Sniatowski. 2011. Maca – a configurable tool to integrate Polish morphological data. Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation. Barcelona, Spain.
[4] A. Radziszewski. 2013. A tiered CRF tagger for Polish. Intelligent Tools for Building a Scientific Information Platform: Advanced Architectures and Solutions. Springer Verlag.
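
As a rough illustration of the boilerplate-removal and deduplication steps in the CorpoGrabber description, the following Python sketch strips text nested in a few layout tags and drops exact duplicates. It is not CorpoGrabber's code and does not reproduce the methods of [2]: the tag list, the function names (extract_text, deduplicate, build_corpus) and the hash-based duplicate check are all assumptions chosen to keep the example short. Downloading, MACA filtering and WCRFT tagging are left out.

```python
import hashlib
from html.parser import HTMLParser

# Assumption: these tags stand in for "boilerplate" in this simplified sketch.
BOILERPLATE_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class TextExtractor(HTMLParser):
    """Collects text that is not nested inside a boilerplate tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a boilerplate element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible, non-boilerplate text of one HTML page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def deduplicate(texts):
    """Keep the first copy of each text, comparing case- and whitespace-normalized forms."""
    seen = set()
    unique = []
    for text in texts:
        normalized = " ".join(text.lower().split())
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


def build_corpus(pages):
    """Extract plain text from each page, then remove duplicate documents."""
    return deduplicate([extract_text(page) for page in pages])
```

Calling build_corpus on a list of HTML strings returns one plain-text document per distinct page. A production pipeline would typically use content-density heuristics and near-duplicate detection rather than a fixed tag list and exact hashes.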

The CLASSLA-Stanza model for UD dependency parsing of standard Bulgarian 2.1

3 resources

The model for UD dependency parsing of standard Bulgarian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the UD-parsed portion of the BulTreeBank training corpus (https://clarino.uib.no/korpuskel/corpora) and using the CLARIN.SI-embed.bg word embeddings (http://hdl.handle.net/11356/1796). The estimated LAS of the parser is ~91.18. This version differs from the previous version of the parser in that it was trained using the new version of the Bulgarian word embeddings.
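
LAS (labeled attachment score) is the percentage of tokens for which the parser predicts both the correct head and the correct dependency relation; UAS counts the head alone. The sketch below shows the arithmetic on a toy sentence. The function name and the (head, relation) tuple layout are assumptions made for illustration, and the listing does not say which evaluation script produced the reported figure.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) in percent.

    gold and predicted are equal-length lists with one (head_index, relation)
    tuple per token; this layout is an assumption made for the example.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    total = len(gold)
    # UAS: the predicted head matches the gold head.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: head and relation label both match.
    labeled_hits = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * head_hits / total, 100.0 * labeled_hits / total


gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
predicted = [(2, "nsubj"), (0, "root"), (2, "iobj"), (3, "punct")]
uas, las = attachment_scores(gold, predicted)
print(f"UAS={uas:.2f} LAS={las:.2f}")  # UAS=75.00 LAS=50.00
```

In the toy sentence three of four heads are correct, but only two tokens have both the correct head and the correct label, so LAS is lower than UAS.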

The CLASSLA-StanfordNLP model for UD dependency parsing of standard Croatian

3 resources

The model for UD dependency parsing of standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the UD-parsed portion of the hr500k training corpus (http://hdl.handle.net/11356/1183) and using the CLARIN.SI-embed.hr word embeddings (http://hdl.handle.net/11356/1205). The estimated LAS of the parser is ~85.9.

The CLASSLA-StanfordNLP model for named entity recognition of standard Croatian 1.0

3 resources

This model for named entity recognition of standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k training corpus (http://hdl.handle.net/11356/1183) and using the CLARIN.SI-embed.hr word embeddings (http://hdl.handle.net/11356/1205).

EdUKate Czech-Ukrainian translation model 2024

2 resources

This package includes a Czech-to-Ukrainian translation model adapted for the educational domain. The model is exported into the TensorFlow Serving format (using Tensor2tensor version 1.6.6), so it can be used in the Charles Translator service (https://translator.cuni.cz) and in the web portal Škola s nadhledem. The model was developed within the EdUKate project, which aims to reduce the language barriers that non-Czech-speaking children in the Czech Republic face in the Czech school system. The project focuses on the development and dissemination of multilingual digital learning materials for students in primary and secondary schools.

GlowTTS models for Talrómur1 (22.10)

6 resources

This release contains trained GlowTTS models for four different voices from the Talrómur 1 [1] corpus. The models were trained using the Coqui TTS library after it was adapted for Icelandic. Included for each voice are the model, its configuration file, the training log, and the recipe that was used.

[1] http://hdl.handle.net/20.500.12537/104


