Result filters

Metadata provider

Language

Resource type

Availability

Active filters:

  • Keywords: language model
Loading...
92 record(s) found

Search results

  • Trankit model for SST 2.15 1.1

    This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the SST treebank of spoken Slovenian (UD v2.15, https://github.com/UniversalDependencies/UD_Slovenian-SST/tree/r2.15) featuring transcriptions of spontaneous speech in various everyday settings. It is able to predict sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tagging, morphological feature prediction, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). Please note this model has been published for archiving purposes only. For production use, we recommend using the state-of-the art Trankit model available here: http://hdl.handle.net/11356/1965 (v1.2 or newest). The latter was trained on both spoken (SST) and written (SSJ) data, and demonstrates a significantly higher performance to the model featured in this submission. In comparison with version 1.0, this model was trained on a new train-dev-test split of the SST treebank introduced in release UD v2.15.
  • Slovenian RoBERTa contextual embeddings model: SloBERTa 1.0

    The monolingual Slovene RoBERTa (A Robustly Optimized Bidirectional Encoder Representations from Transformers) model is a state-of-the-art model representing words/tokens as contextually dependent word embeddings, used for various NLP tasks. Word embeddings can be extracted for every word occurrence and then used in training a model for an end task, but typically the whole RoBERTa model is fine-tuned end-to-end. SloBERTa model is closely related to French Camembert model https://camembert-model.fr/. The corpora used for training the model have 3.47 billion tokens in total. The subword vocabulary contains 32,000 tokens. The scripts and programs used for data preparation and training the model are available on https://github.com/clarinsi/Slovene-BERT-Tool The released model here is a pytorch neural network model, intended for usage with the transformers library https://github.com/huggingface/transformers.
  • Face-domain-specific automatic speech recognition models

    This entry contains all the files required to implement face-domain-specific automatic speech recognition (ASR) applications using the Kaldi ASR toolkit (https://github.com/kaldi-asr/kaldi), including the acoustic model, language model, and other relevant files. It also includes all the scripts and configuration files needed to use these models for implementing face-domain-specific automatic speech recognition. The acoustic model was trained using the relevant Kaldi ASR tools (https://github.com/kaldi-asr/kaldi) and the Artur speech corpus (http://hdl.handle.net/11356/1776; http://hdl.handle.net/11356/1772). The language model was trained using the domain-specific text data involving face descriptions obtained by translating the Face2Text English dataset (https://github.com/mtanti/face2text-dataset) into the Slovenian language. These models, combined with other necessary files like the HCLG.fst and decoding scripts, enable the implementation of face-domain-specific ASR applications. Two speech corpora ("test" and "obrazi") and two Kaldi ASR models ("graph_splosni" and "graph_obrazi") can be selected for conducting speech recognition tests by setting the variable "graph" and "test_sets" in the "local/test_recognition.sh" script. Acoustic speech features can be extracted and speech recognition tests can be conducted using the "local/test_recognition.sh" script. Speech recognition test results can be obtained using the "results.sh" script. The KALDI_ROOT environment variable also needs to be set in the script "path.sh" to set the path to the Kaldi ASR toolkit installation folder.
  • Trankit model for linguistic processing of spoken Slovenian

    This is a retrained Slovenian spoken language model for Trankit v1.1.1 library (https://pypi.org/project/trankit/). It is able to predict sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tagging, feature prediction, and dependency parsing in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). The model was trained using a combination of two datasets published by Universal Dependencies in release 2.12, the spoken SST treebank (https://github.com/UniversalDependencies/UD_Slovenian-SSJ/tree/r2.12) and the written SSJ treebank (https://github.com/UniversalDependencies/UD_Slovenian-SST/tree/r2.12). Its evaluation on the spoken SST test set yields an F1 score of 97.78 for lemmas, 97.19 for UPOS, 95.05 for XPOS and 81.26 for LAS, a significantly better performance in comparison to the counterpart model trained on written SSJ data only (http://hdl.handle.net/11356/1870). To utilize this model, please follow the instructions provided in our github repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base.
  • The Trankit model for linguistic processing of spoken and written Slovenian 1.1

    This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the concatenation of the SSJ UD treebank of written Slovenian (featuring fiction, non-fiction, periodicals and Wikipedia texts) and the SST UD treebank of spoken Slovenian (featuring transcriptions of spontaneous speech in various settings). It is able to predict sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tagging, morphological features, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). In comparison to its counterpart models trained on SSJ (http://hdl.handle.net/11356/1963) or SST datasets only, this model yields a significantly better performance on spoken transcripts and an almost identical state-of-the-art performance on written texts. The model can therefore be recommended as the default, 'universal' Trankit model for processing Slovenian, regardless of the data type. To utilize this model, please follow the instructions provided in our github repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base. In comparison to the previous version, this version was trained on a newer, slightly improved version of the SSJ UD treebank (UD v2.14, https://github.com/UniversalDependencies/UD_Slovenian-SSJ/tree/r2.14) and a substantially extended and improved version of the SST UD treebank (UD v2.15, https://github.com/UniversalDependencies/UD_Slovenian-SST/tree/dev), thus producing significantly better results for spoken data.
  • Trankit model for SST 2.15

    This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the SST treebank of spoken Slovenian (UD v2.15, https://github.com/UniversalDependencies/UD_Slovenian-SST/tree/dev) featuring transcriptions of spontaneous speech in various everyday settings. It is able to predict sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tagging, morphological feature prediction, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). Please note this model has been published for archiving purposes only. For production use, we recommend using the state-of-the art Trankit model available here: http://hdl.handle.net/11356/1965. The latter was trained on both spoken (SST) and written (SSJ) data, and demonstrates a significantly higher performance to the model featured in this submission.
  • The Trankit model for linguistic processing of standard Slovenian

    This is a retrained Slovenian standard model for Trankit v1.1.1 library (https://pypi.org/project/trankit/). It is able to predict sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tagging, feature prediction, and dependency parsing in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). The model was trained using a dataset published by Universal Dependencies in release 2.12 (https://github.com/UniversalDependencies/UD_Slovenian-SSJ/tree/r2.12). Due to the larger training dataset compared to the original Trankit v1.1.1 model, this version yields superior results and achieves state-of-the art parsing performance for Slovenian (https://slobench.cjvt.si/leaderboard/view/11). To utilize this model, please follow the instructions provided in our github repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base.
  • The Trankit model for linguistic process of standard written Slovenian 1.1

    This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the reference SSJ UD treebank featuring fiction, non-fiction, periodical and Wikipedia texts in standard modern Slovenian. It is able to predict sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tagging, morphological features, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). The model was trained using a dataset published by Universal Dependencies in release 2.14 (https://github.com/UniversalDependencies/UD_Slovenian-SSJ/tree/r2.14). To utilize this model, please follow the instructions provided in our github repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base. This version was trained on a newer, slightly improved version of the SSJ UD treebank (UD v2.14) than the previous version of the model and produces similar results.
  • Slovenian commonsense reasoning model SloMET-ATOMIC 2020

    The SloMET-ATOMIC 2020 is a Slovene commonsense reasoning model that is able to predict commonsense descriptions in a natural language for a given input sentence. The model is an adaptation of the Slovene GPT-2 model (https://huggingface.co/cjvt/gpt-sl-base) that has been finetuned using the SloATOMIC 2020 corpus (http://hdl.handle.net/11356/1724), consisting of 1.33M everyday interence knowledge tuples about entities and events. The released model is a pytorch neural network model, intended for usage with the transformers library (https://github.com/huggingface/transformers).