CLARIN Tool Portal

698 record(s) found

Search results

Image Annotation Tool

1 resource

The Image Annotation Tool is a web application that lets users mark zones of interest in an image. These zones are then converted to a TEI P5 code snippet that can be used in a document to link the image and the text. The tool was developed to help students and teachers at the Faculty of Arts, Charles University, mark and annotate images of manuscripts.
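As a rough illustration of the conversion described above, the following Python sketch turns a rectangular zone into a TEI P5 <zone> element. The coordinate attributes (ulx, uly, lrx, lry) and xml:id are standard TEI P5; the function name, the pixel-coordinate convention, and the exact snippet layout are assumptions, since the portal entry does not show the tool's actual output.

```python
from xml.sax.saxutils import quoteattr


def zone_snippet(zone_id, ulx, uly, lrx, lry):
    """Return a TEI P5 <zone> element for a rectangular image region.

    Hypothetical helper: coordinates are assumed to be pixels, with
    (ulx, uly) the upper-left and (lrx, lry) the lower-right corner.
    """
    if lrx <= ulx or lry <= uly:
        raise ValueError("lower-right corner must lie below and to the right of upper-left")
    return (
        f"<zone xml:id={quoteattr(zone_id)} "
        f'ulx="{ulx}" uly="{uly}" lrx="{lrx}" lry="{lry}"/>'
    )


snippet = zone_snippet("zone1", 10, 20, 110, 60)
print(snippet)
```

In a TEI document, a text element could then point at the zone with facs="#zone1".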

Tagger SentiOne - version 2

2 resources

This is the second version of the morpho-syntactic tagger for the Polish language, adapted to processing user-generated content (UGC). It has been enriched with a tokenizer and with some heuristics that improve its accuracy.

CSTlemma version 8.1.2

2 resources

CSTlemma is a lemmatizer that treats prefixes, infixes, and suffixes alike. It can be (and already is) trained for tens of languages, including ones that require lemmatization rules that add or remove prefixes and/or infixes to obtain the lemma of a word. In Dutch, for example, the word "afgemaakt" has the lemma "afmaken", so the "ge" and one "a" have to be removed, and the "t" ending must be replaced by "en". New in version 8 of CSTlemma is the option to output the rule by which a given word is transformed to its lemma. It is also possible to output only a unique identifier for that rule; in practice, this identifier is a kind of pointer into the data structure that comprises the rule set. Rules for CSTlemma must be created with the affixtrain program (https://github.com/kuhumcst/affixtrain), but ready-made rules can be obtained online. For example, the https://github.com/kuhumcst/texton-linguistic-resources repo contains rules for about 30 languages. To build CSTlemma, you need not only the source code in https://github.com/kuhumcst/cstlemma, but also some source files from https://github.com/kuhumcst/letterfunc and https://github.com/kuhumcst/parsesgml. The easiest way to proceed is to copy https://github.com/kuhumcst/cstlemma/blob/master/doc/makecstlemma.bash to a folder (Linux, Mac?) and run that script, which fetches all the needed repositories and builds cstlemma.
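To make the Dutch example concrete, here is a toy Python sketch of a rewrite rule that changes a word in the middle as well as at the end. It is not CSTlemma's rule format or algorithm: the regular-expression pattern, the template, and the function name apply_rule are all assumptions chosen for illustration.

```python
import re


def apply_rule(word, pattern, template):
    """Apply one rewrite rule to a word; return None if the rule does not match.

    Hypothetical helper: `pattern` is a regular expression with capture
    groups for the parts of the word that are kept, and `template` says
    how to reassemble them into the lemma.
    """
    match = re.fullmatch(pattern, word)
    if match is None:
        return None
    return match.expand(template)


# Toy rule: drop the infix "ge", shorten "aa" to "a", replace final "t" by "en".
PATTERN = r"(.*)ge(.*)aa(.*)t"
TEMPLATE = r"\g<1>\g<2>a\g<3>en"

print(apply_rule("afgemaakt", PATTERN, TEMPLATE))
```

A real rule set would hold many such rules and pick the most specific one that matches; the sketch only shows why a single rule may need to touch several positions in the word.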

Binary Error Classifier for Icelandic Sentences (22.09)

6 resources

The model is a fine-tuned byT5-base Transformer model for error detection in natural language. It is tuned for sentence classification using parallel synthetic error data and real error data from the iceErrorCorpus (IceEC, http://hdl.handle.net/20.500.12537/73) and the three specialised error corpora (L2: http://hdl.handle.net/20.500.12537/131, dyslexia: http://hdl.handle.net/20.500.12537/132, child language: http://hdl.handle.net/20.500.12537/133). The synthetic error data (35M lines of parallel data) was created by filtering the Icelandic Gigaword Corpus (IGC, http://hdl.handle.net/20.500.12537/192) and then scrambling it with a variety of error patterns to simulate real grammatical and typographical errors. The pretrained byT5 model was trained on the synthetic data and then fine-tuned on the real error data from IceEC. The objective was to train a grammatical error detection model that classifies whether a sentence contains an error. The overall F1 score is 72.8% (precision: 76.3%, recall: 71.7%).

The CLASSLA-StanfordNLP model for UD dependency parsing of standard Slovenian

3 resources

The model for UD dependency parsing of standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the UD-parsed portion of the ssj500k training corpus (http://hdl.handle.net/11356/1210) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1204). The estimated LAS of the parser is ~92.7.

Samrómur-Children Demonstration Scripts 22.01

2 resources

"Samrómur-Children Demonstration Scripts 22.01" is a set of three code recipes that show how to combine the corpus "Samrómur Children's Icelandic Speech Data 21.09" with the "Icelandic Language Models with Pronunciations 22.01" to build automatic speech recognition systems using the Kaldi toolkit.

The CLASSLA-StanfordNLP model for lemmatisation of non-standard Slovenian 1.0

2 resources

The model for lemmatisation of non-standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the ssj500k training corpus (http://hdl.handle.net/11356/1210) and the Janes-Tag corpus (http://hdl.handle.net/11356/1238), using the Sloleks inflectional lexicon (http://hdl.handle.net/11356/1230). These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The estimated F1 of the lemma annotations is ~98.86.
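The diacritic augmentation mentioned above can be sketched in Python as follows. This is only an illustration under stated assumptions: the entry does not say which parts of the corpora were repeated or how diacritics were removed, so the `share` parameter, the choice of the leading sentences, and both function names are hypothetical.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks, e.g. turning "č" into "c"."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


def augment(sentences, share=0.5):
    """Return the corpus plus a diacritic-free copy of its first `share` part.

    Assumption: which sentences get repeated is arbitrary here; the first
    ones are taken so that the example is deterministic.
    """
    n = int(len(sentences) * share)
    return list(sentences) + [strip_diacritics(s) for s in sentences[:n]]


print(augment(["čaj", "šola"], share=0.5))
```

Training on both the original and the stripped copies lets a model see forms such as "caj" paired with the lemma it would assign to "čaj".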

Entailment

1 resource

Entailment is a tool for recognizing semantic relations between sentences.

The CLASSLA-Stanza model for lemmatisation of standard Bulgarian 2.1

2 resources

The model for lemmatisation of standard Bulgarian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the BulTreeBank training corpus (https://clarino.uib.no/korpuskel/corpora) and using the Bulgarian inflectional lexicon (Popov, Simov, and Vidinska 1998). The estimated F1 of the lemma annotations is ~98.93. Unlike the previous version of the lemmatizer, this version was trained using the new version of the Bulgarian word embeddings.

The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Croatian 1.2

3 resources

The model for morphosyntactic annotation of standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k training corpus (http://hdl.handle.net/11356/1183) and using the CLARIN.SI-embed.hr word embeddings (http://hdl.handle.net/11356/1205). The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~94.1. Compared with the previous version of the model, the pre-trained embeddings are limited to 250 thousand entries and adapted to the new code base.
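Limiting an embedding table to a fixed number of entries can be sketched in Python as below. The sketch assumes the common word2vec text layout (a header line with entry count and dimension, then one word per line) and that entries are ordered by frequency, so keeping the first ones keeps the most frequent words. Neither assumption is confirmed by the entry, and truncate_embeddings is a hypothetical name.

```python
def truncate_embeddings(lines, max_entries):
    """Keep at most `max_entries` vectors and rewrite the header to match.

    `lines` is an iterable of text lines: a "count dim" header followed
    by "word v1 v2 ..." rows (assumed word2vec text format).
    """
    it = iter(lines)
    _count, dim = map(int, next(it).split())
    kept = []
    for line in it:
        if len(kept) >= max_entries:
            break
        kept.append(line.rstrip("\n"))
    return [f"{len(kept)} {dim}"] + kept


sample = ["3 2", "i 0.1 0.2", "je 0.3 0.4", "u 0.5 0.6"]
print(truncate_embeddings(sample, 2))
```

For the model described here, max_entries would be 250000.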
