Active filters:

  • Project: Language Resources and Technologies for Slovene
105 record(s) found

Search results

  • Dependency tree extraction tool STARK 1.0

    STARK is a Python-based command-line tool for extracting dependency trees from parsed corpora, aimed at corpus-driven linguistic investigations of syntactic phenomena of various kinds. It takes the CoNLL-U format (https://universaldependencies.org/format.html) as input and returns a list of all relevant dependency trees, their frequencies, and other associated information as a tab-separated .tsv file. For installation, execution, and a description of the user-defined parameter settings, see the official project page at https://gitea.cjvt.si/lkrsnik/STARK. This entry corresponds to commit 421f12cac6 in the Git repository.
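To illustrate the kind of input and output described for STARK above, here is a minimal sketch that reads CoNLL-U text, counts the smallest possible dependency trees (one head and one dependent), and renders the counts as tab-separated text. It is not STARK's implementation and does not use its command-line parameters. The function names, the sample sentences, and the output column names are assumptions made for illustration only.

```python
import csv
import io
from collections import Counter

# Tiny made-up CoNLL-U sample (10 tab-separated columns per token line).
SAMPLE = """\
# text = Pes laja.
1\tPes\tpes\tNOUN\t_\t_\t2\tnsubj\t_\t_
2\tlaja\tlajati\tVERB\t_\t_\t0\troot\t_\t_
3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_

# text = Mačka spi.
1\tMačka\tmačka\tNOUN\t_\t_\t2\tnsubj\t_\t_
2\tspi\tspati\tVERB\t_\t_\t0\troot\t_\t_
3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_
"""


def read_sentences(text):
    """Yield each sentence as a list of 10-column token rows."""
    sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:  # a blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):  # sentence-level comment
            continue
        cols = line.split("\t")
        # Keep plain tokens only; skip multiword ranges (1-2) and empty nodes (1.1).
        if len(cols) == 10 and cols[0].isdigit():
            sentence.append(cols)
    if sentence:
        yield sentence


def count_head_dependent_pairs(text):
    """Count (head UPOS, relation, dependent UPOS) triples over all sentences."""
    counts = Counter()
    for sentence in read_sentences(text):
        upos_by_id = {cols[0]: cols[3] for cols in sentence}
        for cols in sentence:
            head, deprel = cols[6], cols[7]
            if head == "0" or head not in upos_by_id:
                continue  # the root has no head token
            counts[(upos_by_id[head], deprel, cols[3])] += 1
    return counts


def to_tsv(counts):
    """Render the counts as tab-separated text, most frequent first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["head_upos", "deprel", "dependent_upos", "frequency"])
    for (head_upos, deprel, dep_upos), freq in counts.most_common():
        writer.writerow([head_upos, deprel, dep_upos, freq])
    return buffer.getvalue()


pair_counts = count_head_dependent_pairs(SAMPLE)
print(to_tsv(pair_counts))
```

STARK itself extracts larger and more configurable tree structures; the sketch only shows the general shape of a CoNLL-U-in, frequency-table-out pipeline.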
  • Q-CAT Corpus Annotation Tool 1.2

    The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these annotations. The tool has been used in various annotation campaigns related to the ssj500k reference training corpus of Slovenian (http://hdl.handle.net/11356/1210), such as named entities, dependency syntax, semantic roles and multi-word expressions, but it can also be used for adding new annotation layers of various types to this or other language corpora. Q-CAT is a .NET application that runs on the Windows operating system. Version 1.1 enables the automatic attribution of token IDs and personalized font adjustments. Version 1.2 supports the CoNLL-U format and working with UD POS tags.
  • Q-CAT Corpus Annotation Tool 1.0

    The Q-CAT (Querying-Supported Corpus Annotation Tool) is a computational tool for manual annotation of language corpora, which also enables advanced queries on top of these annotations. The tool has been used in various annotation campaigns related to the ssj500k reference training corpus of Slovenian (http://hdl.handle.net/11356/1210), such as named entities, dependency syntax, semantic roles and multi-word expressions, but it can also be used for adding new annotation layers of various types to this or other language corpora. Q-CAT is a .NET application that runs on the Windows operating system.
  • The Orange workflow for observing collocation clusters ColEmbed 1.0

    ColEmbed is a workflow (.OWS file) for Orange Data Mining (an open-source machine learning and data visualization software: https://orangedatamining.com/) that allows the user to observe clusters of collocation candidates extracted from corpora. The workflow consists of a series of data filters, embedding processors, and visualizers. As input, the workflow takes a tab-separated file (.TSV/.TAB) with data on collocations extracted from a corpus, along with their relative frequencies by year of publication and other optional values (such as information on temporal trends). The user selects the features that are then used to cluster collocation candidates, along with the embeddings generated from the selected lemmas (either one lemma or both lemmas can be selected, depending on the clustering criteria; for instance, to cluster adjective+noun candidates by the similarity of their noun components, only the second lemma is selected for embedding generation). The resulting embedding clusters can be visualized and further processed (e.g. by finding the closest neighbors of a reference collocation). The workflow is described in more detail in the accompanying README file. The entry also contains three .TAB files that can be used to test the workflow. The files contain collocation candidates (along with their relative frequencies per year of publication and four measures describing their temporal trends; see http://hdl.handle.net/11356/1424 for more details) extracted from the Gigafida 2.0 Corpus of Written Slovene (https://viri.cjvt.si/gigafida/) with three different syntactic structures (as defined in http://hdl.handle.net/11356/1415): 1) p0-s0 (adjective + noun, e.g. rezervni sklad), 2) s0-s2 (noun + noun in the genitive case, e.g. ukinitev lastnine), and 3) gg-s4 (verb + noun in the accusative case, e.g. pripraviti besedilo). Note that only collocation candidates with an absolute frequency of 15 or above were extracted. The ColEmbed workflow requires the installation of the Text Mining add-on for Orange. For installation instructions as well as a more detailed description of the different phases of the workflow and the measures used to observe the collocation trends, please consult the README file.
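The lemma-selection step described for ColEmbed above can be sketched outside Orange as follows. This is an illustration only, not part of the workflow: the column names `lemma1` and `lemma2`, the sample rows, and the function name are assumptions, and the actual layout of the .TAB files is documented in the README. The sketch only prepares the text that would be passed to an embedder; it does not compute embeddings or clusters.

```python
import csv
import io

# Made-up tab-separated sample; the column names are hypothetical.
SAMPLE_TAB = (
    "lemma1\tlemma2\n"
    "rezerven\tsklad\n"
    "pokojninski\tsklad\n"
    "zaseben\tlastnina\n"
)


def select_embedding_text(tab_text, use_first=False, use_second=True):
    """Pair each collocation candidate with the text chosen for embedding.

    Selecting only the second lemma mirrors the adjective+noun case in
    which candidates are compared by their noun components.
    """
    if not (use_first or use_second):
        raise ValueError("select at least one lemma")
    reader = csv.DictReader(io.StringIO(tab_text), delimiter="\t")
    pairs = []
    for row in reader:
        selected = []
        if use_first:
            selected.append(row["lemma1"])
        if use_second:
            selected.append(row["lemma2"])
        candidate = row["lemma1"] + " " + row["lemma2"]
        pairs.append((candidate, " ".join(selected)))
    return pairs


noun_only = select_embedding_text(SAMPLE_TAB)
both_lemmas = select_embedding_text(SAMPLE_TAB, use_first=True)
for candidate, text in noun_only:
    print(candidate, "->", text)
```

With only the second lemma selected, the first two candidates receive the same embedding text, which is why they would end up close together in any embedding-based clustering.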
  • The Trankit model for linguistic processing of written and spoken Slovenian 1.2

    This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the concatenation of the SSJ UD treebank of written Slovenian (featuring fiction, non-fiction, periodicals and Wikipedia texts) and the SST UD treebank of spoken Slovenian (featuring transcriptions of spontaneous speech in various settings). It is able to predict sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tagging, morphological features, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). In comparison to its counterpart models trained only on the SSJ (http://hdl.handle.net/11356/1963) or SST datasets, this model yields significantly better performance on spoken transcripts and identical state-of-the-art performance on written texts. The model can therefore be recommended as the default, 'universal' Trankit model for processing Slovenian, regardless of the data type. To use this model, please follow the instructions provided in our GitHub repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base. In comparison to the previous version, this version was trained on a newer, slightly improved version of the SSJ UD treebank (UD v2.14, https://github.com/UniversalDependencies/UD_Slovenian-SSJ/tree/r2.14) and a substantially extended and improved version of the SST UD treebank (https://github.com/UniversalDependencies/UD_Slovenian-SST/tree/r2.15), thus producing significantly better results for spoken data. In contrast to the previous versions of this model (1.0, 1.1), model 1.2 was trained on a new SST train-dev-test split introduced in UD v2.15.