CLARIN Tool Portal

MCSQ Translation Models (en-de) (v1.0)

2 resources

En-De translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). The models were trained on the MCSQ social surveys dataset (available at https://repo.clarino.uib.no/xmlui/bitstream/handle/11509/142/mcsq_v3.zip). They are intended mainly for in-domain translation of social surveys. The models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer.

Evaluation on the MCSQ test set (BLEU, evaluated using multeval: https://github.com/jhclark/multeval):
en->de: 67.5 (train: genuine in-domain MCSQ data only)
de->en: 75.0 (train: additional in-domain backtranslated MCSQ data)

The CLASSLA-StanfordNLP model for named entity recognition of non-standard Slovenian 1.0

2 resources

This model for named entity recognition of non-standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the ssj500k training corpus (http://hdl.handle.net/11356/1210) and the Janes-Tag training corpus (http://hdl.handle.net/11356/1238), using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1204). The training corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed.
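
The diacritic augmentation described above can be illustrated with a minimal Python sketch. It is not the authors' actual procedure: the function names, the `every` parameter, and the choice of which sentences to repeat are assumptions made for illustration only.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks, e.g. 'č' -> 'c', 'š' -> 's', 'ž' -> 'z'.

    Letters without a canonical decomposition (such as 'đ') are left unchanged.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def augment(sentences, every=2):
    """Return the sentences plus diacritic-free copies of every `every`-th one.

    Each sentence is a list of (token, label) pairs; labels are kept as they are.
    `every` is a hypothetical knob, the real sampling scheme is not documented here.
    """
    augmented = list(sentences)
    for index, sentence in enumerate(sentences):
        if index % every != 0:
            continue
        plain = [(strip_diacritics(token), label) for token, label in sentence]
        # Skip copies that would be identical to the original sentence.
        if plain != sentence:
            augmented.append(plain)
    return augmented


if __name__ == "__main__":
    corpus = [
        [("Škofja", "B-loc"), ("Loka", "I-loc")],
        [("Ljubljana", "B-loc")],
    ]
    for sentence in augment(corpus):
        print(sentence)
```

Keeping the labels attached to the stripped tokens means the repeated sentences teach the tagger that, for example, "Skofja Loka" should receive the same annotation as "Škofja Loka".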

The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Bulgarian 1.1

3 resources

This model for morphosyntactic annotation of standard Bulgarian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the BulTreeBank training corpus (http://hdl.handle.net/11495/D93F-C6E9-65D9-2) and using the CoNLL2017 word embeddings (http://hdl.handle.net/11234/1-1989). The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~96.8. The difference from the previous version of the model is that the pre-trained embeddings are limited to 250 thousand entries and adapted to the new code base.

The CLASSLA-Stanza model for lemmatisation of standard Macedonian 2.1

2 resources

The model for lemmatisation of standard Macedonian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the 1984 training corpus expanded with the Macedonian SETimes corpus (to be published). The estimated F1 of the lemma annotations is ~98.81. The difference from the previous version is that this version was trained using a larger training dataset.

Skiptir (20.10)

2 resources

A simple command-line tool that uses Pyphen (https://pyphen.org) to hyphenate text according to the newest hyphenation patterns from the Icelandic Hyphenation Dictionary (http://hdl.handle.net/20.500.12537/86). Can also be used as a module in Python.

The CLASSLA-StanfordNLP model for lemmatisation of standard Croatian 1.2

2 resources

The model for lemmatisation of standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k training corpus (http://hdl.handle.net/11356/1183) and using the hrLex inflectional lexicon (http://hdl.handle.net/11356/1232). The estimated F1 of the lemma annotations is ~97.6. The difference from the previous version is that it now relies solely on XPOS annotations, and not on a combination of UPOS, FEATS (lexicon lookup) and XPOS (lemma prediction) annotations.
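
As a rough illustration of what lemmatisation keyed on XPOS annotations can look like, the following Python sketch looks up a (word form, XPOS tag) pair in a lexicon and falls back to a separate predictor for unknown pairs. It is a simplified assumption, not the CLASSLA-StanfordNLP implementation: the `lemmatise` function, the `fallback` parameter, and the toy entries are hypothetical and are not taken from hrLex.

```python
# Toy (form, XPOS) -> lemma entries for illustration only; not taken from hrLex.
TOY_LEXICON = {
    ("kuće", "Ncfsg"): "kuća",
    ("kuće", "Ncfpn"): "kuća",
    ("vidio", "Vmp-sm"): "vidjeti",
}


def lemmatise(form, xpos, lexicon, fallback=lambda form: form):
    """Return the lemma for a word form given its XPOS (MULTEXT-East) tag.

    The lexicon is consulted first, using only the lowercased form and the XPOS tag.
    Unknown pairs go to `fallback`, a stand-in for a trained lemma predictor;
    by default it simply returns the form unchanged.
    """
    key = (form.lower(), xpos)
    if key in lexicon:
        return lexicon[key]
    return fallback(form)


if __name__ == "__main__":
    print(lemmatise("Kuće", "Ncfsg", TOY_LEXICON))
    print(lemmatise("Zagrebu", "Npmsl", TOY_LEXICON))
```

Using a single tag as the lookup key keeps the lexicon lookup and the fallback prediction conditioned on the same annotation layer.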

Information extraction from EIA documents

2 resources

Environmental impact assessment (EIA) is the formal process used to predict the environmental consequences of a plan. We present a rule-based extraction system to mine Czech EIA documents. The extraction rules work with a set of documents enriched with morphological information and with manually created vocabularies of the terms to be extracted from the documents, e.g. basic information about the project (address, company ID, ...), data on the impacts and outcomes (waste substances, endangered species, ...), and the final opinion. The Notice of Intent documents contain the section BI2 with information on the scope (capacity) of the plan.

Paralela corpus and search engine

3 resources

Paralela is an open-ended, opportunistic parallel corpus of Polish-English and English-Polish translations. It currently contains 262 million words in 10,877,000 translation segments. The Paralela online search engine supports the SlopeQ query syntax for bilingual Polish-English corpus queries over the full dataset. Both the full texts and query results can be accessed and exported through the online application at http://paralela.clarin-pl.eu.

PELCRA for National Corpus of Polish Search Engine 2

2 resources

The PELCRA for NKJP search engine 2 provides access to the full National Corpus of Polish dataset (over 1.5 billion word tokens). In addition to linguistically motivated corpus queries, it supports a number of data exploration and visualisation features. Most of the functionality of the search engine is available through a REST web service. Access to the API is available upon request.

MCSQ Translation Models (en-ru) (v1.0)

2 resources

En-Ru translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). The models were trained on the MCSQ social surveys dataset (available at https://repo.clarino.uib.no/xmlui/bitstream/handle/11509/142/mcsq_v3.zip). They are intended mainly for in-domain translation of social surveys. The models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer.

Evaluation on the MCSQ test set (BLEU, evaluated using multeval: https://github.com/jhclark/multeval):
en->ru: 64.3 (train: genuine in-domain MCSQ data)
ru->en: 74.7 (train: additional backtranslated in-domain MCSQ data)
