CLARIN Tool Portal

Grafon

1 resource

Representation of sentence semantics as deepened semantic graphs. The graphs are built from the output of the saper tool (https://clarin-pl.eu/dspace/handle/11321/278).

MorphoDiTa-based tagger for Polish language

4 resources

A MorphoDiTa-based tagger for the Polish language: a tool for morphosyntactic unification of Polish text according to the NKJP tagset.

ENIAMtoolkit

2 resources

ENIAMtoolkit is a collection of libraries that:
- perform tokenization, lemmatization, and part-of-speech tagging;
- detect MWE and abbreviations;
- split text into sentences.

CorpoGrabber: The Toolchain to Automatic Acquiring and Extraction of the Website Content

Jan Kocoń, Wroclaw University of Technology

CorpoGrabber is a pipeline of tools that extracts the most relevant content of a website, including all subsites (up to a user-defined depth). The toolchain can be used to build large Web corpora of text documents and requires only a list of root websites as input. The tools composing CorpoGrabber are adapted to Polish, but most subtasks are language independent. The whole process can be run in parallel on a single machine and includes the following tasks:

- downloading the HTML subpages of each input page URL [1],
- extracting plain text from each subpage by removing boilerplate content (such as navigation links, headers, footers, and advertisements) [2],
- deduplicating the plain text [2],
- removing low-quality documents using Morphological Analysis Converter and Aggregator (MACA) [3],
- tagging documents using Wrocław CRF Tagger (WCRFT) [4].

The last two steps are available only for Polish. The result is a corpus, as a set of tagged documents, for each website.

References

[1] https://www.httrack.com/html/faq.html
[2] J. Pomikalek. 2011. Removing Boilerplate and Duplicate Content from Web Corpora. Ph.D. Thesis. Masaryk University, Faculty of Informatics. Brno.
[3] A. Radziszewski, T. Sniatowski. 2011. Maca – a configurable tool to integrate Polish morphological data. Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation. Barcelona, Spain.
[4] A. Radziszewski. 2013. A tiered CRF tagger for Polish. Intelligent Tools for Building a Scientific Information Platform: Advanced Architectures and Solutions. Springer Verlag.
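The boilerplate-removal and deduplication tasks in the CorpoGrabber description can be illustrated with a minimal Python sketch. This is not CorpoGrabber's code, and it does not implement the methods from reference [2]: the tag-based filter, the exact-match hashing, and all names below (`BOILERPLATE_TAGS`, `extract_text`, `deduplicate`, `build_corpus`) are simplifying assumptions made for illustration only.

```python
import hashlib
from html.parser import HTMLParser

# Assumption: content inside these tags counts as boilerplate.
# Real boilerplate removal (as in reference [2]) is far more elaborate.
BOILERPLATE_TAGS = {"nav", "header", "footer", "script", "style", "aside"}


class TextExtractor(HTMLParser):
    """Collects text that is not nested inside a boilerplate tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the plain text of one HTML page without boilerplate tags."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def deduplicate(texts):
    """Drop exact duplicates (ignoring case and whitespace), keeping the first."""
    seen = set()
    unique = []
    for text in texts:
        normalized = " ".join(text.lower().split())
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


def build_corpus(pages):
    """Chain the two steps over a list of already-downloaded HTML pages."""
    texts = [extract_text(html) for html in pages]
    return deduplicate([t for t in texts if t])
```

Calling `build_corpus` on a list of downloaded HTML strings returns the unique plain-text documents in their original order. The downloading, quality-filtering, and tagging steps are left out because they depend on external tools.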

Semi-supervised Icelandic-Polish Translation System (22.09)

8 resources

This bi-directional Icelandic-Polish translation model was trained with fairseq (https://github.com/facebookresearch/fairseq) using semi-supervised translation, starting from the mBART50 model. It was trained with a multi-task curriculum: first to denoise sentences, then to translate using aligned parallel texts. Finally, it was given monolingual texts in both Icelandic and Polish, from which it iteratively creates back-translations. For the PL-IS direction, the model achieves a BLEU score of 27.60 on held-out true parallel training data and 15.30 on the out-of-domain Flores devset. For the IS-PL direction, it achieves a score of 27.70 on the true data and 13.30 on the Flores devset.

CorPipe 23 multilingual CorefUD 1.2 model (corpipe23-corefud1.2-240906)

2 resources

The `corpipe23-corefud1.2-240906` is a `mT5-large`-based multilingual model for coreference resolution usable in CorPipe 23 <https://github.com/ufal/crac2023-corpipe>. It is released under the CC BY-NC-SA 4.0 license. The model is language agnostic (no corpus id on input), so it can in theory be used to predict coreference in any `mT5` language. However, the model expects empty nodes to be already present in the input, as predicted by https://www.kaggle.com/models/ufal-mff/crac2024_zero_nodes_baseline/. This model was presented in the CorPipe 24 paper as an alternative to a single-stage approach, where the empty nodes are predicted jointly with coreference resolution (via http://hdl.handle.net/11234/1-5672), an approach roughly twice as fast but of slightly worse quality.

Plumper

1 resource

Ontology mapper that maps plWordNet onto the SUMO ontology.

HaskPL

2 resources

HaskPL is a Polish phraseological database designed for language professionals including linguists, language teachers, lexicographers, language materials developers and translators. Query results can be visualised and exported as spreadsheets. A complementary tool is HaskProof (http://pelcra.clarin-pl.eu:9894/#/lang/pl) identifying potential collocations in any text inserted by the user.

Universal Dependencies 2.5 Models for UDPipe (2019-12-06)

97 resources

Tokenizer, POS Tagger, Lemmatizer and Parser models for 94 treebanks of 61 languages of Universal Dependencies 2.5 Treebanks, created solely using UD 2.5 data (http://hdl.handle.net/11234/1-3105). The model documentation, including performance, can be found at http://ufal.mff.cuni.cz/udpipe/models#universal_dependencies_25_models . To use these models, you need the UDPipe binary, version 1.2 or later, which you can download from http://ufal.mff.cuni.cz/udpipe . In addition to the models themselves, all additional data and the hyperparameter values used for training are available in the second archive, allowing reproducible training.
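As a rough illustration of working with these models, the Python sketch below assembles a UDPipe 1 command line and reads CoNLL-U output into dictionaries. The helper names (`udpipe_command`, `parse_conllu`), the file paths in the usage note, and the hand-written `SAMPLE` are assumptions for illustration; the flags shown follow the UDPipe 1 user manual as I understand it, so check them against the manual for your version.

```python
# The ten tab-separated columns of a CoNLL-U token line.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]

# Hand-written example in CoNLL-U format, NOT real UDPipe output.
SAMPLE = (
    "# text = Ala ma kota.\n"
    "1\tAla\tAla\tPROPN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tma\tmieć\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tkota\tkot\tNOUN\t_\t_\t2\tobj\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)


def udpipe_command(model_path, input_path):
    """Build an argument list for tokenizing, tagging and parsing a text file.

    Assumes a `udpipe` binary (version 1.2 or later) is on PATH.
    """
    return ["udpipe", "--tokenize", "--tag", "--parse", model_path, input_path]


def parse_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences = []
    current = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped here.
            continue
        else:
            columns = line.split("\t")
            if len(columns) == len(CONLLU_FIELDS):
                current.append(dict(zip(CONLLU_FIELDS, columns)))
    if current:
        sentences.append(current)
    return sentences
```

In practice, the list returned by `udpipe_command("model.udpipe", "input.txt")` (both paths hypothetical) could be passed to `subprocess.run(..., capture_output=True, text=True)`, and the captured standard output handed to `parse_conllu`.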

XLM-RoBERTa-LARGE events relation recognition

1 resource

A set of basic language tools for the Polish language. Z4.2a Improving the quality of recognition of relations between events using Transformer-type deep networks.
