CLARIN Tool Portal

Lingua::Interset 2.026

2 resources

Lingua::Interset is a universal morphosyntactic feature set to which all tagsets of all corpora/languages can be mapped. Version 2.026 covers 37 different tagsets of 21 languages. Limited support of the older drivers for other languages (which are not included in this package but are available for download elsewhere) is also available; these will be fully ported to Interset 2 in future. Interset is implemented as Perl libraries. It is also available via CPAN.
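
The idea of mapping every tagset through one shared feature set can be illustrated with a toy sketch in Python. This is not the Lingua::Interset Perl API: the two tag tables and the function names `decode`, `encode`, and `convert` are hypothetical and invented for this example, and real drivers cover far more features. Each tagset is decoded into a shared feature dictionary and then encoded into the target tagset, so each tagset needs only one driver instead of a dedicated converter for every pair of tagsets.

```python
# Toy illustration of tagset conversion through a shared feature set.
# The tag tables below are invented; they are NOT real Interset drivers.

# Hypothetical tagset A: short tags.
TAGSET_A = {
    "NNS": {"pos": "noun", "number": "plur"},
    "NN": {"pos": "noun", "number": "sing"},
    "VBD": {"pos": "verb", "tense": "past"},
}

# Hypothetical tagset B: verbose tags.
TAGSET_B = {
    "Noun.Pl": {"pos": "noun", "number": "plur"},
    "Noun.Sg": {"pos": "noun", "number": "sing"},
    "Verb.Past": {"pos": "verb", "tense": "past"},
    "Verb": {"pos": "verb"},
}


def decode(tag, tagset):
    """Map a tagset-specific tag to a dictionary of shared features."""
    return dict(tagset[tag])


def encode(features, tagset):
    """Return the tag of `tagset` that best expresses `features`."""
    best_tag, best_score = None, float("-inf")
    for tag, tag_features in tagset.items():
        # Skip tags that contradict a feature value we already know.
        if any(k in features and features[k] != v for k, v in tag_features.items()):
            continue
        matches = sum(1 for k, v in tag_features.items() if features.get(k) == v)
        extras = len(tag_features) - matches  # features the tag would add
        score = matches - extras
        if score > best_score:
            best_tag, best_score = tag, score
    if best_tag is None:
        raise ValueError("no compatible tag found")
    return best_tag


def convert(tag, source_tagset, target_tagset):
    """Convert a tag by decoding it to features and encoding it again."""
    return encode(decode(tag, source_tagset), target_tagset)


if __name__ == "__main__":
    for tag in TAGSET_A:
        print(tag, "->", convert(tag, TAGSET_A, TAGSET_B))
```

When the target tagset cannot express a feature, the sketch falls back to the closest compatible tag; how a real driver resolves such cases is not described here.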


Paralela corpus and search engine

3 resources

Paralela is an open-ended, opportunistic parallel corpus of Polish-English and English-Polish translations. It currently contains 262 million words in 10,877,000 translation segments. The Paralela online search engine supports the SlopeQ query syntax for bilingual Polish-English corpus queries over the full dataset. Both the full texts and query results can be accessed and exported through the online application at http://paralela.clarin-pl.eu.


Long Context Translation Models for English-Icelandic translations (22.09)

10 resources

These models translate between English and Icelandic in both directions. They can translate several sentences at once and are robust to some input errors, such as spelling errors. The models are based on the pretrained mBART25 model (http://hdl.handle.net/20.500.12537/125, https://arxiv.org/abs/2001.08210) and finetuned on bilingual EN-IS data and backtranslated data (including http://hdl.handle.net/20.500.12537/260).

The full backtranslation data includes texts from the following sources: The Icelandic Gigaword Corpus (without sport) (IGC), The Icelandic Common Crawl Corpus (IC3), student theses (skemman.is), Greynir News, Wikipedia, Icelandic sagas, Icelandic e-books, Books3, NewsCrawl, Wikipedia, EuroPARL, Reykjavik Grapevine, Iceland Review. The true parallel long-context data comes from European Economic Area (EEA) regulations, the document-level Icelandic Student Theses Abstracts corpus (IPAC), Stúdentablaðið (university student magazine), the report of the Special Investigation Commission (Rannsóknarnefnd Alþingis), the Bible, and the Jehovah's Witnesses corpus (JW300).

Provided here are model files, a SentencePiece subword-tokenizing model, and dictionary files for running the model locally, along with scripts for translating sentences on the command line. See the included README for instructions on running inference.


MCSQ Translation Models (en-ru) (v1.0)

2 resources

En-Ru translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). The models were trained using the MCSQ social surveys dataset (available at https://repo.clarino.uib.no/xmlui/bitstream/handle/11509/142/mcsq_v3.zip). Their main use should be in-domain translation of social surveys. The models are compatible with Tensor2Tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer.

Evaluation on the MCSQ test set (BLEU):
en->ru: 64.3 (train: genuine in-domain MCSQ data)
ru->en: 74.7 (train: additional backtranslated in-domain MCSQ data)
(Evaluated using multeval: https://github.com/jhclark/multeval)


Universal Dependencies 2.6 models for UDPipe 2 (2020-08-31)

2 resources

Tokenizer, POS Tagger, Lemmatizer and Parser models for 99 treebanks of 63 languages of Universal Dependencies 2.6 Treebanks, created solely using UD 2.6 data (https://hdl.handle.net/11234/1-3226). The model documentation, including performance, can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_26_models . To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
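
UDPipe writes its analyses in the CoNLL-U format by default, with one token per line and ten tab-separated columns. As a minimal sketch of how such output might be read in Python, the function below collects each sentence as a list of dictionaries. The name `parse_conllu` and the sample sentence are assumptions for illustration; the sample is hand-written, not produced by one of these models.

```python
# Minimal CoNLL-U reader (a sketch, not part of UDPipe).
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def parse_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dictionaries."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError("expected 10 tab-separated columns: %r" % line)
        current.append(dict(zip(CONLLU_FIELDS, columns)))
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample for illustration only.
SAMPLE = "\n".join([
    "# text = Dogs bark.",
    "\t".join(["1", "Dogs", "dog", "NOUN", "_", "Number=Plur", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "bark", "bark", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
])

if __name__ == "__main__":
    for sentence in parse_conllu(SAMPLE):
        for token in sentence:
            print(token["form"], token["upos"], token["head"], token["deprel"])
```

The sketch keeps every column as a string and does not treat multiword-token ranges or empty nodes specially.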


CorpoGrabber

1 resource

CorpoGrabber: The Toolchain to Automatic Acquiring and Extraction of the Website Content
Jan Kocoń, Wroclaw University of Technology

CorpoGrabber is a pipeline of tools for extracting the most relevant content of a website, including all subsites (up to a user-defined depth). The toolchain can be used to build large Web corpora of text documents. It requires only a list of root websites as input. The tools composing CorpoGrabber are adapted to Polish, but most subtasks are language independent. The whole process can be run in parallel on a single machine and includes the following tasks:
- downloading the HTML subpages of each input page URL [1],
- extracting plain text from each subpage by removing boilerplate content (such as navigation links, headers, footers, and advertisements) [2],
- deduplicating the plain text [2],
- removing low-quality documents using the Morphological Analysis Converter and Aggregator (MACA) [3],
- tagging documents using the Wrocław CRF Tagger (WCRFT) [4].
The last two steps are available only for Polish. The result is a corpus in the form of a set of tagged documents for each website.

References
[1] https://www.httrack.com/html/faq.html
[2] J. Pomikalek. 2011. Removing Boilerplate and Duplicate Content from Web Corpora. Ph.D. Thesis. Masaryk University, Faculty of Informatics. Brno.
[3] A. Radziszewski, T. Sniatowski. 2011. Maca – a configurable tool to integrate Polish morphological data. Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation. Barcelona, Spain.
[4] A. Radziszewski. 2013. A tiered CRF tagger for Polish. Intelligent Tools for Building a Scientific Information Platform: Advanced Architectures and Solutions. Springer Verlag.
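
The deduplication step can be pictured with a minimal Python sketch. This is not CorpoGrabber's implementation, which relies on the tools cited above; it only shows exact-duplicate removal by hashing lower-cased, whitespace-normalized text. The function names `normalize` and `deduplicate` are hypothetical, and near-duplicate detection is out of scope.

```python
import hashlib


def normalize(text):
    """Lower-case the text and collapse all runs of whitespace."""
    return " ".join(text.lower().split())


def deduplicate(documents):
    """Keep the first occurrence of each document, comparing normalized text."""
    seen = set()
    unique = []
    for document in documents:
        digest = hashlib.sha1(normalize(document).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(document)
    return unique


if __name__ == "__main__":
    pages = [
        "Ala ma kota.",
        "ala  ma\nkota.",  # same text, different case and spacing
        "Inny tekst.",
    ]
    for page in deduplicate(pages):
        print(repr(page))
```

Hashing keeps memory use proportional to the number of distinct documents rather than their total length, which matters when the input is a large crawl.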


CorPipe 23 multilingual CorefUD 1.2 model (corpipe23-corefud1.2-240906)

2 resources

The `corpipe23-corefud1.2-240906` is a `mT5-large`-based multilingual model for coreference resolution usable in CorPipe 23 <https://github.com/ufal/crac2023-corpipe>. It is released under the CC BY-NC-SA 4.0 license. The model is language agnostic (no corpus id on input), so in theory it can be used to predict coreference in any `mT5` language. However, the model expects empty nodes to be already present in the input, predicted by https://www.kaggle.com/models/ufal-mff/crac2024_zero_nodes_baseline/. This model was presented in the CorPipe 24 paper as an alternative to a single-stage approach, where the empty nodes are predicted jointly with coreference resolution (via http://hdl.handle.net/11234/1-5672), an approach roughly twice as fast but of slightly worse quality.


Universal Dependencies 1.2 Models for UDPipe

2 resources

Tokenizer, POS Tagger, Lemmatizer and Parser models for all Universal Dependencies 1.2 Treebanks, created solely using UD 1.2 data (http://hdl.handle.net/11234/1-1548). To use these models, you need the UDPipe binary, which you can download from http://ufal.mff.cuni.cz/udpipe.


CroSloEngual BERT 1.1

4 resources

Trilingual BERT (Bidirectional Encoder Representations from Transformers) model, trained on Croatian, Slovenian, and English data. A state-of-the-art tool that represents words/tokens as contextually dependent word embeddings, used for various NLP classification tasks by fine-tuning the model end-to-end. CroSloEngual BERT consists of neural network weights and configuration files in PyTorch format (i.e. to be used with the PyTorch library). Changes in version 1.1: fixed the vocab.txt file, as the previous version had an error causing very bad results during fine-tuning and/or evaluation.


UPSKILLS Teaching and Learning Content

14 resources

This is a collection of modular teaching and learning content created in the UPSKILLS project (UPgrading the SKIlls of Linguistics and Language Students) and downloaded from the Moodle platform in .mbz format. The learning content can be reused and adapted by curriculum designers, lecturers, and instructors of courses in linguistics and language-related subjects. Different blocks or individual units within a block can be combined to create new learning paths at the BA and MA levels. Some of the learning content is also suitable for the PhD level. Students can also use the content for self-study, bearing in mind that this is not a MOOC (Massive Open Online Course).

Before downloading the files, it is recommended to:
- use the project URL to read the descriptions of each learning block on the UPSKILLS project website
- use the demo link to preview the learning content on the Moodle platform and decide which learning blocks you would like to download.

Each learning block in Moodle contains several units on different topics, including presentations, learning activities, assignments, and a final student project. Furthermore, we have included a short guide explaining how the materials are organised, and how they can be used and cited.

Please note that the .mbz files can be used exclusively on Moodle systems, version 3.8+. The material can be imported directly in MBZ format without changes. If help is required, please consult the Moodle User Guide > Course Restore: https://docs.moodle.org/402/en/Course_restore. The "Processing Texts and Corpora" and "Introduction to Language Data: Standards and Repositories" blocks contain interactive presentations and quizzes created in H5P, which means that the H5P plugin (both in code and as a plugin), tiles formats, stashes and badges should be available in your Moodle instance to view and reuse the content. The badges are given as a separate downloadable file.

Nevertheless, the H5P content can be downloaded directly from the UPSKILLS Moodle platform and reused outside Moodle. H5P is a format for rich HTML5 content, which has become popular for creating interactive learning objects (e.g. presentations, videos, gamified learning activities). It is a free and open format, which can be used as a plugin in Learning Management Systems, such as Moodle, Blackboard, Brightspace, OpenEdX, etc., and Content Management Systems, such as WordPress, Drupal, and Canvas. See the H5P administrators' guides for more information: https://help.h5p.com/hc/en-us/sections/7556764070429-Guides.

All UPSKILLS learning content is made available under the CC-BY 4.0 International license. This means you can copy and share it with others in any medium or format, even for commercial purposes. However, it is required that you give appropriate credit to the source, include the license link, and indicate whether any changes were made to the original content.

To learn more about the UPSKILLS project, please visit the project website and the following guides:
1. Research-Based Teaching: Guidelines and Best Practices
2. Integrating Research Infrastructures into Teaching (this guide is especially relevant if you are interested in reusing the learning content created by CLARIN, namely Introduction to Language Data: Standards and Repositories)
3. Integrating Industry-Based Research into Teaching

Finally, all project deliverables are accessible in the UPSKILLS Community on Zenodo: https://zenodo.org/communities/upskills/?page=1&size=20.

