CLARIN Tool Portal

698 record(s) found

Search results

Multilingual text genre classification model X-GENRE

2 resources

The X-GENRE classifier is a text classification model for automatic genre identification. It assigns each text one of 9 genre labels: Information/Explanation, News, Instruction, Opinion/Argumentation, Forum, Prose/Lyrical, Legal, Promotion and Other (see the provided README file for details on the labels). The model was shown to provide high classification performance on Albanian, Catalan, Croatian, Greek, English, Icelandic, Macedonian, Slovenian, Turkish and Ukrainian, and zero-shot cross-lingual experiments indicate that it will likely provide comparable performance on all other languages supported by the XLM-RoBERTa model (see the Appendix of the following paper for the list of covered languages: https://arxiv.org/abs/1911.02116).

The model is based on the base-sized XLM-RoBERTa model (https://huggingface.co/FacebookAI/xlm-roberta-base). It was fine-tuned on the training split of an English-Slovenian X-GENRE dataset (http://hdl.handle.net/11356/1960), comprising around 1,800 instances of Slovenian and English texts. Fine-tuning was performed with the simpletransformers library (https://simpletransformers.ai/) with the following hyperparameters:

Train batch size: 8
Learning rate: 1e-5
Max. sequence length: 512
Number of epochs: 15

For optimum performance, the genre classifier should be applied to documents of sufficient length (as a rule of thumb, at least 75 words), predictions of the label "Other" should be disregarded, and only predictions made with confidence higher than 0.8 should be used. With these post-processing steps, the model was shown to reach macro-F1 scores of 0.92 and 0.94 on the English and Slovenian test sets respectively (cross-dataset scenario), macro-F1 scores between 0.88 and 0.95 on Croatian, Macedonian, Turkish and Ukrainian, and macro-F1 scores between 0.80 and 0.85 on Albanian, Catalan, Greek, and Icelandic (zero-shot cross-lingual scenario).
Refer to the provided README file for instructions with code examples on how to use the model.
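
The three post-processing rules in the description (minimum length, dropping "Other", confidence threshold) can be sketched as a small filter. This is only an illustration, not the code from the README: the `Prediction` class and the function names are hypothetical, and counting words by whitespace splitting is an assumption, since the description does not say how words are counted.

```python
from dataclasses import dataclass
from typing import List

# Thresholds taken from the description above.
MIN_WORDS = 75           # rule of thumb: at least 75 words
MIN_CONFIDENCE = 0.8     # keep only predictions with confidence higher than 0.8
DISCARDED_LABEL = "Other"


@dataclass
class Prediction:
    """Hypothetical container for one classified document."""
    text: str
    label: str
    confidence: float


def keep_prediction(pred: Prediction) -> bool:
    """Return True if the prediction passes all three post-processing rules."""
    # Assumption: words are counted by splitting on whitespace.
    long_enough = len(pred.text.split()) >= MIN_WORDS
    informative = pred.label != DISCARDED_LABEL
    confident = pred.confidence > MIN_CONFIDENCE
    return long_enough and informative and confident


def filter_predictions(preds: List[Prediction]) -> List[Prediction]:
    """Keep only the predictions that pass the post-processing rules."""
    return [p for p in preds if keep_prediction(p)]


if __name__ == "__main__":
    long_text = " ".join(["word"] * 80)
    short_text = " ".join(["word"] * 10)
    candidates = [
        Prediction(long_text, "News", 0.95),    # passes all rules
        Prediction(short_text, "News", 0.95),   # too short
        Prediction(long_text, "Other", 0.99),   # label "Other" is disregarded
        Prediction(long_text, "Forum", 0.60),   # confidence too low
    ]
    for p in filter_predictions(candidates):
        print(p.label, p.confidence)
```

Documents rejected by the filter are simply left without a genre label; whether that is acceptable depends on the downstream use.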

Keyword Extractor

1 resource

Tool for extracting key phrases from text using the TextRank algorithm.

Cinderella - tool for Clustering and Classifications of Texts in Polish

2 resources

System for clustering and classification of texts in Polish. Source code.

The CLASSLA-StanfordNLP model for lemmatisation of standard Serbian 1.2

2 resources

The model for lemmatisation of standard Serbian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the SETimes.SR training corpus (http://hdl.handle.net/11356/1200) and using the srLex inflectional lexicon (http://hdl.handle.net/11356/1233). The estimated F1 of the lemma annotations is ~97.9. The difference from the previous version is that the model now relies solely on XPOS annotations, rather than on a combination of UPOS, FEATS (lexicon lookup) and XPOS (lemma prediction) annotations.

EVALD 3.0 for Foreigners – Evaluator of Discourse

3 resources

EVALD 3.0 for Foreigners is software for automatic evaluation of surface coherence (cohesion) in Czech texts written by non-native speakers of Czech.

The CLASSLA-StanfordNLP model for lemmatisation of standard Macedonian 1.0

2 resources

The model for lemmatisation of standard Macedonian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the 1984 training corpus (to be published). The estimated F1 of the lemma annotations is ~99.1.

LINDAT Translation service

1 resource

Source code of the LINDAT Translation service frontend. The service provides a UI and a simple REST API that accesses machine translation models served by TensorFlow Serving. The most recent version of the code is available at https://github.com/ufal/lindat_translation.

MCSQ Translation Models (en-de) (v1.0)

2 resources

En-De translation models, exported via TensorFlow Serving, available in the LINDAT translation service (https://lindat.mff.cuni.cz/services/translation/). The models were trained using the MCSQ social surveys dataset (available at https://repo.clarino.uib.no/xmlui/bitstream/handle/11509/142/mcsq_v3.zip). Their main intended use is in-domain translation of social surveys. The models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer.

Evaluation on MCSQ test set (BLEU):
en->de: 67.5 (train: genuine in-domain MCSQ data only)
de->en: 75.0 (train: additional in-domain backtranslated MCSQ data)
(Evaluated using multeval: https://github.com/jhclark/multeval)

The CLASSLA-StanfordNLP model for named entity recognition of non-standard Slovenian 1.0

2 resources

This model for named entity recognition of non-standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the ssj500k training corpus (http://hdl.handle.net/11356/1210) and the Janes-Tag training corpus (http://hdl.handle.net/11356/1238), using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1204). The training corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed.
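
The diacritics augmentation mentioned above can be illustrated with a short sketch. It is not the procedure used for this model: the function names are hypothetical, and the choice to repeat every sentence that contains diacritics is an assumption, since the description only says that "parts of the corpora" were repeated.

```python
import unicodedata
from typing import List


def strip_diacritics(text: str) -> str:
    """Remove combining marks, e.g. turning 'č', 'š', 'ž' into 'c', 's', 'z'."""
    # NFD splits a letter such as 'č' into 'c' plus a combining caron.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Letters without a canonical decomposition (e.g. 'đ') are left unchanged.
    return unicodedata.normalize("NFC", stripped)


def augment_with_stripped_copies(sentences: List[str]) -> List[str]:
    """Append a diacritics-free copy of every sentence that contains diacritics.

    Assumption: all affected sentences are repeated; the original work may
    have repeated only a subset of the corpora.
    """
    copies = [strip_diacritics(s) for s in sentences if strip_diacritics(s) != s]
    return sentences + copies


if __name__ == "__main__":
    corpus = ["Žiga je šel v Črnomelj", "Ana gre domov"]
    for sentence in augment_with_stripped_copies(corpus):
        print(sentence)
```

In a real training corpus the annotations (tokens, tags, entity spans) would have to be copied along with the stripped text; the sketch works on raw sentences only.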

VIADAT-ANALYZE

2 resources

A VIADAT module; VIADAT-ANALYZE is an interactive environment that enables the end user to work with stored, annotated and indexed audio recordings, allowing visualization and extraction of results. Developed in cooperation with ÚSD AV ČR and NFA.


