CLARIN Tool Portal

Active filters:

Language: Slovenian

77 record(s) found

Search results

Slovene Text Normalizator RSDO-DS2-NORM 1.0

2 resources

This Text Normalisator converts Slovene text from written-form into its spoken-form. Traditionally it is an essential preprocessing step before text-to-speech (TTS). As input it accepts text as a string, and returns a dictionary with fields "input_text", "normalised_text", "status" and "logs". Example: normalize_text("Sodobna definicija Celzijeve temperaturne lestvice, ki velja od leta 1954, je, da je temperatura trojne točke vode enaka 0,01 °C.") {'input_text': 'Sodobna definicija Celzijeve temperaturne lestvice, ki velja od leta 1954, je, da je temperatura trojne točke vode enaka 0,01 °C.', 'normalized_text': 'Sodobna definicija Celzijeve temperaturne lestvice, ki velja od leta tisoč devetsto štiriinpetdeset, je, da je temperatura trojne točke vode enaka nič celih nič ena stopinje Celzija.', 'status': 1, 'logs': [('1954', 'tisoč devetsto štiriinpetdeset'), ('0,01', 'nič celih nič ena'), ('°C', 'stopinje Celzija')]} For further details see README.md.

Use "Slovene Text Normalizator RSDO-DS2-NORM 1.0"
Slovene Text Denormalizator RSDO-DS2-DENORM 1.0

2 resources

This Text Denormalisator converts Slovene spoken-form text into written-form text. Typically it is used as a post-processing step in Automatic Speech Recognition, which traditionally outputs spoken-form text. As input it accepts text in either string form, list of tokens, or a list of dictionaries with a mandatory "text" field. The output is a dictionary. Example of use: denormalize("Danes, osmega sedmega dva tisoč dvaindvajset, je lep sončen dan, saj je zunaj prijetnih petindvajset stopinj Celzija.") {'denormalized_content': [{'text': 'Danes', 'index': [0]}, {'text': ',', 'index': [1]}, {'text': '8.', 'index': [2]}, {'text': '7.', 'index': [3]}, {'text': '2022', 'index': [4, 5, 6]}, {'text': ',', 'index': [7]}, {'text': 'je', 'index': [8]}, {'text': 'lep', 'index': [9]}, {'text': 'sončen', 'index': [10]}, {'text': 'dan', 'index': [11]}, {'text': ',', 'index': [12]}, {'text': 'saj', 'index': [13]}, {'text': 'je', 'index': [14]}, {'text': 'zunaj', 'index': [15]}, {'text': 'prijetnih', 'index': [16]}, {'text': '25', 'index': [17]}, {'text': '°C', 'index': [18, 19]}, {'text': '.', 'index': [20]}], 'denormalized_string': 'Danes, 8. 7. 2022, je lep sončen dan, saj je zunaj prijetnih 25 °C.'}

Use "Slovene Text Denormalizator RSDO-DS2-DENORM 1.0"
Slovene Conformer CTC BPE E2E Automated Speech Recognition model RSDO-DS2-ASR-E2E 2.0

2 resources

This Conformer CTC BPE E2E Automated Speech Recognition model was trained following the NVIDIA NeMo Conformer-CTC recipe (for details see the official NVIDIA NeMo NMT documentation, https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/intro.html, and NVIDIA NeMo GitHub repository https://github.com/NVIDIA/NeMo). It provides functionality for transcribing Slovene speech to text. The training, development and test datasets were based on the Artur dataset and consisted of 630.38, 16.48 and 15.12 hours of transcribed speech in standardised form, respectively. The model was trained for 200 epochs and reached WER 0.0429 on the development and WER 0.0558 on the test dataset.

Use "Slovene Conformer CTC BPE E2E Automated Speech Recognition model RSDO-DS2-ASR-E2E 2.0"
Character-level part-of-speech tagger of Slovene language

15 resources

Part-of-speech tagger for Slovene language implemented using convolutional and LSTM neural networks. Tagger uses character-level representation of sentences. The tagger has been trained on the ssj500k 2.1 corpus, http://hdl.handle.net/11356/1181.

Use "Character-level part-of-speech tagger of Slovene language"
Slavic Forest, Norwegian Wood (scripts)

11 resources

Tools and scripts used to create the cross-lingual parsing models submitted to VarDial 2017 shared task (https://bitbucket.org/hy-crossNLP/vardial2017), as described in the linked paper. The trained UDPipe models themselves are published in a separate submission (https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1971). For each source (SS, e.g. sl) and target (TT, e.g. hr) language, you need to add the following into this directory: - treebanks (Universal Dependencies v1.4): SS-ud-train.conllu TT-ud-predPoS-dev.conllu - parallel data (OpenSubtitles from Opus): OpenSubtitles2016.SS-TT.SS OpenSubtitles2016.SS-TT.TT !!! If they are originally called ...TT-SS... instead of ...SS-TT..., you need to symlink them (or move, or copy) !!! - target tagging model TT.tagger.udpipe All of these can be obtained from https://bitbucket.org/hy-crossNLP/vardial2017 You also need to have: - Bash - Perl 5 - Python 3 - word2vec (https://code.google.com/archive/p/word2vec/); we used rev 41 from 15th Sep 2014 - udpipe (https://github.com/ufal/udpipe); we used commit 3e65d69 from 3rd Jan 2017 - Treex (https://github.com/ufal/treex); we used commit d27ee8a from 21st Dec 2016 The most basic setup is the sl-hr one (train_sl-hr.sh): - normalization of deprels - 1:1 word-alignment of parallel data with Monolingual Greedy Aligner - simple word-by-word translation of source treebank - pre-training of target word embeddings - simplification of morpho feats (use only Case) - and finally, training and evaluating the parser Both da+sv-no (train_ds-no.sh) and cs-sk (train_cs-sk.sh) add some cross-tagging, which seems to be useful only in specific cases (see paper for details). Moreover, cs-sk also adds more morpho features, selecting those that seem to be very often shared in parallel data. The whole pipeline takes tens of hours to run, and uses several GB of RAM, so make sure to use a powerful computer.

Use "Slavic Forest, Norwegian Wood (scripts)"
The Trankit model for linguistic processing of written and spoken Slovenian 1.2

2 resources

This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the concatenation of the SSJ UD treebank of written Slovenian (featuring fiction, non-fiction, periodicals and Wikipedia texts) and the SST UD treebank of spoken Slovenian (featuring transcriptions of spontaneous speech in various settings). It is able to predict sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tagging, morphological features, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). In comparison to its counterpart models trained on SSJ (http://hdl.handle.net/11356/1963) or SST datasets only, this model yields a significantly better performance on spoken transcripts and an identical state-of-the-art performance on written texts. The model can therefore be recommended as the default, 'universal' Trankit model for processing Slovenian, regardless of the data type. To utilize this model, please follow the instructions provided in our github repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base. In comparison to the previous version, this version was trained on a newer, slightly improved version of the SSJ UD treebank (UD v2.14, https://github.com/UniversalDependencies/UD_Slovenian-SSJ/tree/r2.14) and a substantially extended and improved version of the SST UD treebank (https://github.com/UniversalDependencies/UD_Slovenian-SST/tree/r2.15), thus producing significantly better results for spoken data. In contrast to the previous versions of this model (1.0, 1.1), the model 1.2 was trained on a new SST train-dev-test split introduced in UD v2.15.

Use "The Trankit model for linguistic processing of written and spoken Slovenian 1.2"
ELMo embeddings models for seven languages

8 resources

ELMo language model (https://github.com/allenai/bilm-tf) used to produce contextual word embeddings, trained on large monolingual corpora for 7 languages: Slovenian, Croatian, Finnish, Estonian, Latvian, Lithuanian and Swedish. Each language's model was trained for approximately 10 epochs. Corpora sizes used in training range from over 270 M tokens in Latvian to almost 2 B tokens in Croatian. About 1 million most common tokens were provided as vocabulary during the training for each language model. The model can also infer OOV words, since the neural network input is on the character level. Each model is in its own .tar.gz archive, consisting of two files: pytorch weights (.hdf5) and options (.json). Both are needed for model inference, using allennlp (https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md) python library.

Use "ELMo embeddings models for seven languages"

Result filters

Metadata provider

Language

Resource type

Tool task

Availability

Organisation

Project

Keywords

Active filters:

Search results

Slovene Text Normalizator RSDO-DS2-NORM 1.0

Slovene Text Denormalizator RSDO-DS2-DENORM 1.0

Slovene Conformer CTC BPE E2E Automated Speech Recognition model RSDO-DS2-ASR-E2E 2.0

Character-level part-of-speech tagger of Slovene language

Slavic Forest, Norwegian Wood (scripts)

The Trankit model for linguistic processing of written and spoken Slovenian 1.2

ELMo embeddings models for seven languages

Result filters

Metadata provider

Language

Resource type

Tool task

Availability

Organisation

Project

Keywords

Active filters:

Search results

Slovene Text Normalizator RSDO-DS2-NORM 1.0

Slovene Text Denormalizator RSDO-DS2-DENORM 1.0

Slovene Conformer CTC BPE E2E Automated Speech Recognition model RSDO-DS2-ASR-E2E 2.0

Character-level part-of-speech tagger of Slovene language

Slavic Forest, Norwegian Wood (scripts)

The Trankit model for linguistic processing of written and spoken Slovenian 1.2

ELMo embeddings models for seven languages

Session recording