698 record(s) found

Search results

  • Universal Dependencies 2.6 models for UDPipe 2 (2020-08-31)

    Tokenizer, POS Tagger, Lemmatizer and Parser models for 99 treebanks of 63 languages of Universal Dependencies 2.6 Treebanks, created solely using UD 2.6 data (https://hdl.handle.net/11234/1-3226). The model documentation, including performance figures, can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_26_models . To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
  • The CLASSLA-Stanza model for morphosyntactic annotation of standard Croatian 2.1

    The model for morphosyntactic annotation of standard Croatian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the hr500k training corpus (http://hdl.handle.net/11356/1792) and using the CLARIN.SI-embed.hr word embeddings (http://hdl.handle.net/11356/1790). The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~94.87. Compared with the previous version, this model was trained on the new version of the hr500k corpus and the new version of the Croatian word embeddings.
  • The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Serbian 1.2

    The model for morphosyntactic annotation of standard Serbian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the SETimes.SR training corpus (http://hdl.handle.net/11356/1200) and using the CLARIN.SI-embed.sr word embeddings (http://hdl.handle.net/11356/1206). The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~95.2. Compared with the previous version, the pre-trained embeddings are limited to 250 thousand entries and the model is adapted to the new code base.
  • PyTorch model for Slovenian Coreference Resolution

    Slovenian model for coreference resolution: a neural network based on a customized transformer architecture, usable with the code published on https://github.com/matejklemen/slovene-coreference-resolution. The model is based on the Slovenian CroSloEngual BERT 1.1 model (http://hdl.handle.net/11356/1330). It was trained on the SUK 1.0 training corpus (http://hdl.handle.net/11356/1747), specifically the SentiCoref subcorpus. Using the evaluation setting where entity mentions are assumed to be correctly pre-detected, the model achieves the following metric values: MUC: precision = 0.931, recall = 0.957, F1 = 0.943; BCubed: precision = 0.887, recall = 0.947, F1 = 0.914; CEAFe: precision = 0.945, recall = 0.893, F1 = 0.916; CoNLL-12: precision = 0.921, recall = 0.932, F1 = 0.924.
  • SuperMatrix

    SuperMatrix is a system that supports the automatic extraction of semantic relations, based on the analysis of large text corpora. The system was developed as a tool for expanding the Polish wordnet (Słowosieć). Expansion consists of two steps: the system suggests potential links between lexical units, and linguists verify these suggestions and decide which ones go into the wordnet. This speeds up the work and preserves the integrity of data entry.
  • Grafon

    Representation of sentence semantics using deepened semantic graphs. The graphs are built from the output of the saper tool: https://clarin-pl.eu/dspace/handle/11321/278
  • CORDEX inflectional lookup data 1.0

    The inflectional data lookup module serves as an optional component within the cordex library (https://github.com/clarinsi/cordex/) that significantly improves the quality of the results. The module consists of a pickled dictionary of 111,660 lemmas and maps these lemmas to their corresponding word forms. Each word form in the dictionary is accompanied by its MULTEXT-East morphosyntactic descriptions, relevant features (custom features extracted from the morphosyntactic descriptions with the help of https://gitea.cjvt.si/generic/conversion_utils), and its frequency within the Gigafida 2.0 corpus (http://hdl.handle.net/11356/1320), or Gigafida 1.0 when other information is unavailable. The dictionary is used to select the most frequent word form of a lemma that satisfies additional filtering conditions (e.g. to find the most frequently used singular word form of the lemma "centralen", i.e. "centralni").
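
    The selection step described for the CORDEX lookup data can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the dictionary layout, the key names ("form", "features", "frequency"), the function name most_frequent_form and the frequency counts are all hypothetical, and none of it reflects the actual pickle format or the cordex API.

```python
from typing import Optional

# Hypothetical layout: lemma -> list of word-form entries.
# The frequency values are invented for illustration only.
LOOKUP = {
    "centralen": [
        {"form": "centralni", "features": {"number": "singular"}, "frequency": 120},
        {"form": "centralen", "features": {"number": "singular"}, "frequency": 45},
        {"form": "centralnih", "features": {"number": "plural"}, "frequency": 300},
    ]
}


def most_frequent_form(lookup: dict, lemma: str, required: dict) -> Optional[str]:
    """Return the most frequent word form of `lemma` whose features
    match every key/value pair in `required`, or None if nothing matches."""
    candidates = [
        entry
        for entry in lookup.get(lemma, [])
        if all(entry["features"].get(k) == v for k, v in required.items())
    ]
    if not candidates:
        return None
    # Pick the entry with the highest corpus frequency.
    return max(candidates, key=lambda entry: entry["frequency"])["form"]


if __name__ == "__main__":
    print(most_frequent_form(LOOKUP, "centralen", {"number": "singular"}))
```

    With the invented counts above, filtering on singular number excludes the plural form even though it is the most frequent overall, and the most frequent remaining form is returned.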