698 record(s) found

Search results

  • The Orange workflow for observing collocation clusters ColEmbed 1.0

    ColEmbed is a workflow (.OWS file) for Orange Data Mining (an open-source machine learning and data visualization software: https://orangedatamining.com/) that allows the user to observe clusters of collocation candidates extracted from corpora. The workflow consists of a series of data filters, embedding processors, and visualizers. As input, the workflow takes a tab-separated file (.TSV/.TAB) with data on collocations extracted from a corpus, along with their relative frequencies by year of publication and other optional values (such as information on temporal trends). The user selects the features that are then used to cluster collocation candidates, along with the embeddings generated from the selected lemmas. Either one lemma or both lemmas can be selected, depending on the clustering criteria; for instance, to cluster adjective+noun candidates based on the similarities of their noun components, only the second lemma is selected for embedding generation. The obtained embedding clusters can be visualized and further processed (e.g. by finding the closest neighbors of a reference collocation). The workflow is described in more detail in the accompanying README file. The entry also contains three .TAB files that can be used to test the workflow. The files contain collocation candidates (along with their relative frequencies per year of publication and four measures describing their temporal trends; see http://hdl.handle.net/11356/1424 for more details) extracted from the Gigafida 2.0 Corpus of Written Slovene (https://viri.cjvt.si/gigafida/) with three different syntactic structures (as defined in http://hdl.handle.net/11356/1415): 1) p0-s0 (adjective + noun, e.g. rezervni sklad), 2) s0-s2 (noun + noun in the genitive case, e.g. ukinitev lastnine), and 3) gg-s4 (verb + noun in the accusative case, e.g. pripraviti besedilo). Only collocation candidates with an absolute frequency of 15 or above were extracted. The ColEmbed workflow requires the installation of the Text Mining add-on for Orange. For installation instructions as well as a more detailed description of the different phases of the workflow and the measures used to observe the collocation trends, please consult the README file.
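The idea of finding the closest neighbors of a reference collocation can be illustrated outside Orange with a minimal Python sketch. This is not part of ColEmbed: the function names are hypothetical, the three-dimensional vectors are invented toy values rather than real embeddings, and cosine similarity is assumed as the measure, since the entry does not state which one the workflow uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_neighbors(reference, embeddings, k=2):
    """Return the k collocations most similar to `reference`, best first."""
    ref_vec = embeddings[reference]
    scored = [
        (name, cosine_similarity(ref_vec, vec))
        for name, vec in embeddings.items()
        if name != reference  # do not report the reference itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Invented toy vectors (assumption); real embeddings have hundreds of dimensions.
toy_embeddings = {
    "rezervni sklad": [0.9, 0.1, 0.0],
    "ukinitev lastnine": [0.1, 0.9, 0.2],
    "pripraviti besedilo": [0.0, 0.2, 0.9],
}

for name, score in closest_neighbors("rezervni sklad", toy_embeddings):
    print(f"{name}\t{score:.3f}")
```

With these toy values, "ukinitev lastnine" ranks ahead of "pripraviti besedilo" as a neighbor of "rezervni sklad"; with real data the ranking depends entirely on the embeddings generated from the selected lemmas.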
  • The Trankit model for linguistic processing of written and spoken Slovenian 1.2

    This is a retrained Slovenian model for the Trankit v1.1.1 library for multilingual natural language processing (https://pypi.org/project/trankit/), trained on the concatenation of the SSJ UD treebank of written Slovenian (featuring fiction, non-fiction, periodicals and Wikipedia texts) and the SST UD treebank of spoken Slovenian (featuring transcriptions of spontaneous speech in various settings). It predicts sentence segmentation, tokenization, lemmatization, language-specific morphological annotation (MULTEXT-East morphosyntactic tags), as well as universal part-of-speech tags, morphological features, and dependency parses in accordance with the Universal Dependencies annotation scheme (https://universaldependencies.org/). In comparison to its counterpart models trained on the SSJ (http://hdl.handle.net/11356/1963) or SST datasets only, this model yields significantly better performance on spoken transcripts and identical state-of-the-art performance on written texts. The model can therefore be recommended as the default, 'universal' Trankit model for processing Slovenian, regardless of the data type. To use this model, please follow the instructions provided in our GitHub repository (https://github.com/clarinsi/trankit-train) or refer to the Trankit documentation (https://trankit.readthedocs.io/en/latest/training.html#loading). This ZIP file contains models for both xlm-roberta-large (which delivers better performance but requires more hardware resources) and xlm-roberta-base. In comparison to the previous version, this version was trained on a newer, slightly improved version of the SSJ UD treebank (UD v2.14, https://github.com/UniversalDependencies/UD_Slovenian-SSJ/tree/r2.14) and a substantially extended and improved version of the SST UD treebank (https://github.com/UniversalDependencies/UD_Slovenian-SST/tree/r2.15), thus producing significantly better results for spoken data. In contrast to the previous versions of this model (1.0, 1.1), model 1.2 was trained on a new SST train-dev-test split introduced in UD v2.15.
  • ELMo embeddings models for seven languages

    ELMo language models (https://github.com/allenai/bilm-tf) used to produce contextual word embeddings, trained on large monolingual corpora for 7 languages: Slovenian, Croatian, Finnish, Estonian, Latvian, Lithuanian and Swedish. Each language's model was trained for approximately 10 epochs. The sizes of the training corpora range from over 270 M tokens for Latvian to almost 2 B tokens for Croatian. About 1 million of the most common tokens were provided as the vocabulary during training for each language model. The models can also infer embeddings for OOV words, since the neural network input is at the character level. Each model is in its own .tar.gz archive, consisting of two files: PyTorch weights (.hdf5) and options (.json). Both are needed for model inference using the allennlp (https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md) Python library.
  • VU University Diachronic News text Corpus

    The diachronic corpus has been brought in line with current standards and formats as used in the STEVIN Nederlandstalig Referentiecorpus (SoNaR, under development), which has been adapted to the more general FoLiA format (documented by Van Gompel, 2012). These standards and formats have been extended with new layers of annotation. As a result, the corpus adheres to the current-day CLARIN infrastructure.
  • OpenSONAR: a 500 MW reference corpus of Contemporary Written Dutch

    SoNaR is a 500-million-word reference corpus of contemporary written Dutch for use in different types of linguistic (incl. lexicographic) and HLT research and the development of applications. The STEVIN-funded SoNaR project (2008-2011) built on the results obtained in the D-Coi and Corea projects, which were awarded funding in the first call for proposals within the STEVIN programme. SoNaR contains over 500 million words (i.e. word tokens) of full texts from a wide variety of text types, including both texts from conventional media and texts from the new media. All texts except for texts from the social media (Twitter, Chat, SMS) have been tokenized, tagged for part of speech and lemmatized, while in the same set the Named Entities have been labelled. All annotations were produced automatically; no manual verification took place. The texts are enriched with several annotations (part-of-speech and lemma information) and are available as FoLiA XML files (folia.xml). OpenSONAR is an online application for exploring and searching the SoNaR corpus. The system relies on BlackLab Server as back-end and WhiteLab as user interface.
    van de Camp, M, Reynaert, M and Oostdijk, N. 2017. WhiteLab 2.0: A Web Interface for Corpus Exploitation. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 231–243. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.19. License: CC-BY 4.0
    de Does, J, Niestadt, J and Depuydt, K. 2017. Creating Research Environments with BlackLab. In: Odijk, J and van Hessen, A. (eds.) CLARIN in the Low Countries, Pp. 245–257. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi.20. License: CC-BY 4.0
    Oostdijk, N, Reynaert, M, Hoste, V and Schuurman, I. 2013. The Construction of a 500 Million Word Reference Corpus of Contemporary Written Dutch. In: Spyns, P and Odijk, J. (eds.) Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme. Springer Verlag.
  • CLARIN Concept Registry

    The CCR is a concept registry according to the W3C SKOS recommendation. It was chosen by CLARIN to serve as a semantic registry to overcome semantic interoperability issues with CMDI metadata and different annotation tag sets used for linguistic annotation. The CCR is part of the CMDI metadata infrastructure. The W3C SKOS recommendation, and the OpenSKOS implementation thereof, provides the means for ‘data-sharing, bridging several different fields of knowledge, technology and practice’. According to this model, each concept is assigned a unique administrative identifier, together with information on the status or decision-making process associated with the concept. In addition, concept specifications in the CCR contain linguistic descriptions, such as definitions and examples, and can be associated with a variety of labels.
  • CMDI to RDF conversion

    There is a growing amount of online information available in RDF format as Linked Open Data (LOD), and a strong community very actively promotes its use. The publication of information as LOD is also considered an important signal that the publisher is actively seeking to share information with a world full of new potential users. Added advantages of LOD, when well used, are the explicit semantics and high interoperability. But problematic modelling by non-expert users offsets these advantages, which is one reason why modelling systems such as CMDI are used. The CMDI2RDF project aims to bring the LOD advantages to the CMDI world and make the huge store of CMDI information available to new groups of users, and at the same time offer CLARIN a powerful tool to experiment with new metadata discovery possibilities. The CMD2RDF service was created to allow connection with the growing LOD world, and to facilitate experiments within CLARIN merging CMDI with other, RDF-based, information sources. One of the promises of LOD is the ease of linking data sets together and answering queries based on this ‘cloud’ of LOD datasets. Thus, in the enrichment and use cases part of the project, we looked at other datasets to link to the CLARIN joint metadata domain. We used the WALS N3 RDF dump for one of the use cases. Although it is in the end relatively easy to go from a specific typological feature to the CMD records via a shared URI, it still showcased a weakness of the Linked Data approach: one has to carefully inspect the property paths involved. In this case the path was broken, as there was no clear way to go from the WALS feature data to the WALS language info except by extracting the WALS language code from the feature URI pattern and inserting it into the language URI pattern. This shows that, although the big LOD cloud has potential for knowledge discovery by crossing dataset boundaries, design decisions in the individual datasets can still hamper algorithms, and manual inspection is needed. The CMD2RDF service was developed at the TLA/MPI for Psycholinguistics and DANS and later moved to the Meertens Institute, where the expertise remains.
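The workaround described above, extracting a language code from a feature URI and inserting it into a language URI, can be sketched in Python as follows. Both URI patterns here are hypothetical placeholders on example.org, not the actual patterns of the WALS RDF dump, and the function name is likewise an assumption made for illustration.

```python
import re

# Hypothetical feature-value URI pattern (assumption): .../feature/<feature id>/<language code>
FEATURE_URI_PATTERN = re.compile(
    r"^https://example\.org/wals/feature/(?P<feature>[0-9]+[A-Z])/(?P<code>[a-z]{2,3})$"
)

# Hypothetical language URI template (assumption).
LANGUAGE_URI_TEMPLATE = "https://example.org/wals/language/{code}"


def language_uri_from_feature_uri(feature_uri):
    """Derive the language URI by string manipulation of the feature URI.

    This is the manual bridge needed when no RDF property links the
    feature data to the language description.
    """
    match = FEATURE_URI_PATTERN.match(feature_uri)
    if match is None:
        raise ValueError(f"URI does not follow the expected pattern: {feature_uri}")
    return LANGUAGE_URI_TEMPLATE.format(code=match.group("code"))


print(language_uri_from_feature_uri("https://example.org/wals/feature/81A/abc"))
```

Because the link lives in string conventions rather than in the graph itself, a query engine cannot follow it automatically, and any change to either URI scheme silently breaks the join.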
  • Assamese POS Tagger

    Assamese POS Tagger is a CRF++-based POS tagger. CRF++ is a customizable open-source implementation of Conditional Random Fields for tagging/labeling continuous text. CRF++ is designed for generic use and can be applied to any natural language, given a tagset. The CRF++ tool is written in C++.
    1. These Assamese NLP resources, including the tools and applications, were developed during research and development projects as well as Master's and Ph.D. thesis work.
    2. They were mainly developed or generated at the Gauhati University Department of Computer Science and Department of Information Technology.
    3. These resources are used by students and researchers for further study and research, as well as for the design and development of tools and applications.
    4. Computational linguistics in Assamese is not rich: Natural Language Processing work mainly started during the last two decades, and most of the resources are first-generation resources with ample scope for upgrading, enriching, and purifying.
    5. These are very good and essential resources for all researchers in Assamese NLP, as the language requires more and more NLP work to make Assamese a rich medium for the digital world.
    6. Anyone interested in or in need of such resources may express their interest in the required resources, and the way of availability will be advised accordingly.
    7. These are purely research materials and may be used for further research only.
    8. Researchers may visit the NLP Lab of the Department of Information Technology, Gauhati University, Guwahati, India, or contact us.
    9. Researchers interested in collaborative work, and also students looking for project work, are welcome.
    10. The contact person is Professor Shikhar Kr. Sarma, Department of Information Technology, Gauhati University, Guwahati 781014, Assam, India. Email: sks@gauhati.ac.in