CLARIN Tool Portal

The Trankit model for linguistic processing of standard Slovenian

2 resources

This is a retrained Slovenian standard model for Trankit v1.1.1 library (https://pypi.org/project/trankit/). It is able to predict sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tagging, feature prediction, and dependency parsing in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). The model was trained using a dataset published by Universal Dependencies in release 2.12 (https://github.com/UniversalDependencies/UD_Slovenian-SSJ/tree/r2.12). Due to the larger training dataset compared to the original Trankit v1.1.1 model, this version yields superior results and achieves state-of-the art parsing performance for Slovenian (https://slobench.cjvt.si/leaderboard/view/11). To utilize this model, please follow the instructions provided in our github repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base.

Use "The Trankit model for linguistic processing of standard Slovenian"

The Trankit model for linguistic process of standard written Slovenian 1.1

2 resources

This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the reference SSJ UD treebank featuring fiction, non-fiction, periodical and Wikipedia texts in standard modern Slovenian. It is able to predict sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tagging, morphological features, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). The model was trained using a dataset published by Universal Dependencies in release 2.14 (https://github.com/UniversalDependencies/UD_Slovenian-SSJ/tree/r2.14). To utilize this model, please follow the instructions provided in our github repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base. This version was trained on a newer, slightly improved version of the SSJ UD treebank (UD v2.14) than the previous version of the model and produces similar results.

Use "The Trankit model for linguistic process of standard written Slovenian 1.1"

Slovenian commonsense reasoning model SloMET-ATOMIC 2020

2 resources

The SloMET-ATOMIC 2020 is a Slovene commonsense reasoning model that is able to predict commonsense descriptions in a natural language for a given input sentence. The model is an adaptation of the Slovene GPT-2 model (https://huggingface.co/cjvt/gpt-sl-base) that has been finetuned using the SloATOMIC 2020 corpus (http://hdl.handle.net/11356/1724), consisting of 1.33M everyday interence knowledge tuples about entities and events. The released model is a pytorch neural network model, intended for usage with the transformers library (https://github.com/huggingface/transformers).

Use "Slovenian commonsense reasoning model SloMET-ATOMIC 2020"

Corpus extraction tool LIST 1.3

2 resources

The LIST corpus extraction tool is a Java program for extracting lists from text corpora on the levels of characters, word parts, words, and word sets. It supports VERT and TEI P5 XML formats and outputs .CSV files that can be imported into Microsoft Excel or similar statistical processing software. Version 1.3 adds support for the KOST 2.0 Slovene Learner Corpus (http://hdl.handle.net/11356/1887) in XML format. It also allows program execution using the command line (see 00README.txt for details), and uses a later version of Java (tested using JDK 21). In addition, Windows users no longer need to have Java installed on their computers to run the program.

Use "Corpus extraction tool LIST 1.3"

Text classification model fastText-Trendi-Topics 1.0

2 resources

The fastText-Trendi-Topics model is a text classification model for categorizing news texts with one of 13 topic labels. It was trained on a set of approx. 36,000 Slovene texts from various Slovene news sources included in the Trendi Monitor Corpus of Slovene (http://hdl.handle.net/11356/1590) such as "rtvslo.si", "sta.si", "delo.si", "dnevnik.si", "vecer.com", "24ur.com", "siol.net", "gorenjskiglas.si", etc. The texts were semi-automatically categorized into 13 categories based on the sections under which they were published (i.e. URLs). The set of labels was developed in accordance with related categorization schemas used in other corpora and comprises the following topics: "črna kronika" (crime and accidents), "gospodarstvo, posel, finance" (economy, business, finance), "izobraževanje" (education), "okolje" (environment), "prosti čas" (free time), "šport" (sport), "umetnost, kultura" (art, culture), "vreme" (weather), "zabava" (entertainment), "zdravje" (health), "znanost in tehnologija" (science and technology), "politika" (politics), and "družba" (society). The categorization process is explained in more detail in Kosem et al. (2022): https://nl.ijs.si/jtdh22/pdf/JTDH2022_Kosem-et-al_Spremljevalni-korpus-Trendi.pdf The model was trained on the labeled texts using the word embeddings CLARIN.SI-embed.sl 1.0 (http://hdl.handle.net/11356/1204) and validated on a development set of 1,293 texts using the fastText library, 1000 epochs, and default values for the rest of the hyperparameters (see https://github.com/TajaKuzman/FastText-Classification-SLED for the full code). The model achieves a macro-F1-score of 0.85 on a test set of 1,295 texts (best for "vreme" at 0.97, worst for "prosti čas" at 0.67). Please note that the SloBERTa-Trendi-Topics 1.0 text classification model is also available (http://hdl.handle.net/11356/1709) that achieves higher classification accuracy, but is slower and computationally more demanding.

Use "Text classification model fastText-Trendi-Topics 1.0"

ZRCola 2

2 resources

ZRCola is an input system designed mainly, although not exclusively, for linguistic use. It allows the user to combine basic letters with any diacritic marks and insert the resulting complex characters into the texts with ease. The system is comprised of an input program and a font, which can also be installed separately. The font is based on the Unicode standard and includes a vastly enlarged set of Latin, Cyrillic and other characters for Slavic writing systems in the Private Use Area.

Use "ZRCola 2"

ELMo embeddings model, Slovenian

3 resources

ELMo language model (https://github.com/allenai/bilm-tf) used to produce contextual word embeddings, trained on entire Gigafida 2.0 corpus (https://viri.cjvt.si/gigafida/System/Impressum) for 10 epochs. 1,364,064 most common tokens were provided as vocabulary during the training. The model can also infer OOV words, since the neural network input is on the character level.

Use "ELMo embeddings model, Slovenian"

Slovene Conformer CTC BPE E2E Automated Speech Recognition model PROTOVERB-ASR-E2E 1.0

2 resources

This Conformer CTC BPE E2E Automated Speech Recognition model was trained following the NVIDIA NeMo Conformer-CTC fine-tuning recipe (for details see the official NVIDIA NeMo NMT documentation, https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/intro.html, and NVIDIA NeMo GitHub repository https://github.com/NVIDIA/NeMo). It provides functionality for transcribing Slovene speech to text. The starting point was the Conformer CTC BPE E2E Automated Speech Recognition model RSDO-DS2-ASR-E2E 2.0, which was fine-tuned on the Protoverb closed dataset. The model was fine-tuned for 20 epochs, which improved the performance on the Protoverb test dataset for 9.8% relative WER, and for 3.3% relative WER on the Slobench dataset.

Use "Slovene Conformer CTC BPE E2E Automated Speech Recognition model PROTOVERB-ASR-E2E 1.0"

Slovene Punctuation and Capitalisation model RSDO-DS2-P&C 3.6

2 resources

This Punctuation and Capitalisation model was trained following the NVIDIA NeMo Punctuation and Capitalisation recipe (for details see the official NVIDIA NeMo P&C documentation, https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/nlp/punctuation_and_capitalization.html, and NVIDIA NeMo GitHub repository https://github.com/NVIDIA/NeMo). It provides functionality for restoring punctuation (,.!?) and capital letters in lowercased non-punctuated Slovene text. The training corpus was built from publicly available datasets, as well as a small portion of proprietary data. In total the training corpus consisted of 38.829.529 sentences and the validation corpus consisted of 2.092.497 sentences.

Use "Slovene Punctuation and Capitalisation model RSDO-DS2-P&C 3.6"

Corpus extraction tool LIST 1.0

2 resources

The LIST corpus extraction tool is a Java program for extracting lists from text corpora on the levels of characters, word parts, words, and word sets. It supports VERT and TEI P5 XML formats and outputs .CSV files that can be imported into Microsoft Excel or similar statistical processing software.

Use "Corpus extraction tool LIST 1.0"

Result filters

Metadata provider

Language

Resource type

Tool task

Availability

Organisation

Project

Keywords

Active filters:

Search results

The Trankit model for linguistic processing of standard Slovenian

The Trankit model for linguistic process of standard written Slovenian 1.1

Slovenian commonsense reasoning model SloMET-ATOMIC 2020

Corpus extraction tool LIST 1.3

Text classification model fastText-Trendi-Topics 1.0

ZRCola 2

ELMo embeddings model, Slovenian

Slovene Conformer CTC BPE E2E Automated Speech Recognition model PROTOVERB-ASR-E2E 1.0

Slovene Punctuation and Capitalisation model RSDO-DS2-P&C 3.6

Corpus extraction tool LIST 1.0

Result filters

Metadata provider

Language

Resource type

Tool task

Availability

Organisation

Project

Keywords

Active filters:

Search results

Session recording