CLARIN Tool Portal

Terminal-based CoNLL-file viewer, v2

4 resources

A simple, fast, text-based way of browsing CoNLL-format files in your terminal. To open a CoNLL file, run:

    ./view_conll sample.conll

The output is piped through less, so you can use less commands to navigate the file. By default, less is set to search for sentence beginnings, so you can press "n" to go to the next sentence and "N" to go to the previous one. Press "q" to close. Trees with a high number of non-projective edges may be difficult to read, as I have not found a good way of displaying them intelligibly. If you are on Windows and don't have less (but have Python), run the viewer like this:

    python view_conll.py sample.conll

For complete instructions, see the README file. You need Python 2 to run the viewer.
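The sentence-by-sentence navigation described here relies on the CoNLL convention that tokens are tab-separated rows and sentences are separated by blank lines. The following is a minimal sketch of that reading step, not the viewer's own code: the function names and the sample data are hypothetical, and it assumes the HEAD and DEPREL columns sit at positions 7 and 8, as in CoNLL-X and CoNLL-U.

```python
def read_conll_sentences(text):
    """Split CoNLL text into sentences; each sentence is a list of column lists."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("#"):
            # Skip comment lines; split token rows on tabs.
            current.append(line.split("\t"))
    if current:
        sentences.append(current)
    return sentences


def format_sentence(rows):
    """Render one sentence as plain text: ID, FORM, then HEAD and DEPREL."""
    return "\n".join(
        f"{r[0]:>3} {r[1]:<12} <- head {r[6]} ({r[7]})" for r in rows
    )


# Hypothetical two-sentence sample in a 10-column layout.
SAMPLE = (
    "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\t_\n"
)

sentences = read_conll_sentences(SAMPLE)
for sentence in sentences:
    print(format_sentence(sentence))
    print()
```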

CombiTagger

3 resources

The main purpose of CombiTagger is to read data files generated by individual taggers and use them to develop a combined tagger according to a specified algorithm. The system provides algorithms for simple and weighted voting, but it is extensible, so other combination algorithms can be added easily. CombiTagger is implemented in Java.
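Simple and weighted voting can be illustrated for a single token as follows. This is a sketch of the general technique only, not CombiTagger's Java implementation: the function name, the tagger names, the weights, and the alphabetical tie-break are all assumptions made for the example.

```python
from collections import defaultdict


def combine_tags(proposals, weights=None):
    """Pick one tag for a token from several taggers' proposals.

    proposals: {tagger_name: proposed_tag}
    weights:   {tagger_name: weight}; None means simple (unweighted) voting.
    """
    scores = defaultdict(float)
    for tagger, tag in proposals.items():
        scores[tag] += 1.0 if weights is None else weights.get(tagger, 0.0)
    # Assumed tie-break: the alphabetically first tag among the top scorers.
    return max(sorted(scores), key=scores.get)


# Hypothetical proposals from three taggers for one token.
proposals = {"taggerA": "NOUN", "taggerB": "VERB", "taggerC": "NOUN"}

simple = combine_tags(proposals)
weighted = combine_tags(
    proposals, weights={"taggerA": 0.2, "taggerB": 0.9, "taggerC": 0.3}
)
print(simple, weighted)
```

With simple voting the majority tag wins; with these illustrative weights the single high-weight tagger outvotes the other two.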

Image Annotation Tool

1 resource

The Image Annotation Tool is a web application that allows users to mark zones of interest in an image. These zones are then converted to a TEI P5 code snippet that can be used in your document to connect the image and the text. The tool was developed to help students and teachers at the Faculty of Arts, Charles University, mark and annotate images of manuscripts.
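To give a rough idea of what such a snippet can look like, the sketch below turns rectangular zones into TEI P5 `<zone>` elements inside a `<facsimile>`. The element and attribute names (`facsimile`, `surface`, `graphic`, `zone`, `ulx`, `uly`, `lrx`, `lry`) come from the TEI P5 Guidelines; the exact output of the tool itself is not documented here, so the layout, the function names, the image URL, and the zone identifiers are assumptions.

```python
from xml.sax.saxutils import quoteattr


def zone_to_tei(zone_id, ulx, uly, lrx, lry):
    """Render one rectangular zone (upper-left and lower-right corners)."""
    return (
        f"<zone xml:id={quoteattr(zone_id)} "
        f'ulx="{ulx}" uly="{uly}" lrx="{lrx}" lry="{lry}"/>'
    )


def facsimile(image_url, zones):
    """Wrap zones and the image reference in a <facsimile> element."""
    body = "\n".join("    " + zone_to_tei(*z) for z in zones)
    return (
        "<facsimile>\n"
        "  <surface>\n"
        f"    <graphic url={quoteattr(image_url)}/>\n"
        f"{body}\n"
        "  </surface>\n"
        "</facsimile>"
    )


# Hypothetical image and zones, with coordinates in pixels.
snippet = facsimile(
    "manuscript_page1.jpg",
    [("zone1", 10, 20, 110, 60), ("zone2", 10, 70, 110, 110)],
)
print(snippet)
```

In the transcription, a text element can then point at a zone through the `facs` attribute, for example `facs="#zone1"`.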

Tagger SentiOne - version 2

2 resources

This is the second version of the morphosyntactic tagger for the Polish language, adapted to processing user-generated content (UGC). It has been enriched with a tokenizer and with heuristics that improve its accuracy.

CSTlemma version 8.1.2

2 resources

CSTlemma is a lemmatizer that treats prefixes, infixes, and suffixes alike. The CST lemmatizer can be (and already is) trained for tens of languages, including ones that require lemmatization rules that add or remove prefixes and/or infixes to obtain the lemma of a word. In Dutch, for example, the word "afgemaakt" has the lemma "afmaken", so the "ge" has to be removed, one "a" has to be removed, and the "t" ending must be replaced by "en".

New in version 8 of CSTlemma is the possibility to output the rule by which a given word is transformed to its lemma. It is also possible to output just a unique identifier for that rule; in practice, this identifier is a kind of pointer into the data structure that comprises the rule set. Rules for CSTlemma must be created with the affixtrain program (https://github.com/kuhumcst/affixtrain), but ready-made rules can be obtained from the net. For example, the https://github.com/kuhumcst/texton-linguistic-resources repo contains rules for about 30 languages.

If you want to build CSTlemma, you need not only the source code contained in https://github.com/kuhumcst/cstlemma, but also some source code files from https://github.com/kuhumcst/letterfunc and from https://github.com/kuhumcst/parsesgml. The easiest and best way to go forward is to copy https://github.com/kuhumcst/cstlemma/blob/master/doc/makecstlemma.bash to a (linux, Mac?) folder and run that script. That will fetch all needed repositories and build cstlemma.
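The idea of a rule that changes a word in several places at once, and of reporting which rule fired, can be sketched with regular expressions. This is only an illustration: CSTlemma's real rule format and rule identifiers are produced by affixtrain and look different, and the rule table, the identifier "r1", and the function name below are hypothetical.

```python
import re

# Hypothetical rule table: (rule_id, pattern, replacement).
# Rule r1 drops the infix "ge", shortens "aa" to "a", and replaces the
# final "t" with "en", as in the Dutch pair afgemaakt -> afmaken.
RULES = [
    ("r1", re.compile(r"^(\w+?)ge(\w*?)aa(\w)t$"), r"\1\2a\3en"),
]


def lemmatize(word):
    """Return (lemma, rule_id); fall back to the word itself if no rule fits."""
    for rule_id, pattern, replacement in RULES:
        if pattern.match(word):
            return pattern.sub(replacement, word), rule_id
    return word, None


print(lemmatize("afgemaakt"))
print(lemmatize("opgemaakt"))
print(lemmatize("huis"))
```

Returning the rule identifier alongside the lemma mirrors, in spirit, the version 8 option to output the rule that was applied.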

The CLASSLA-StanfordNLP model for UD dependency parsing of standard Slovenian

3 resources

The model for UD dependency parsing of standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the UD-parsed portion of the ssj500k training corpus (http://hdl.handle.net/11356/1210) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1204). The estimated LAS of the parser is ~92.7.
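LAS (labeled attachment score) is the percentage of tokens whose predicted head and dependency label both match the gold annotation; UAS counts matching heads only. A minimal sketch of the computation on hypothetical data follows (the function name and the four-token example are illustrative and unrelated to how the reported score was obtained):

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages.

    gold, predicted: lists of (head, deprel) pairs, one per token, aligned.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    total = len(gold)
    # UAS: the head matches. LAS: head and dependency label both match.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * uas_hits / total, 100.0 * las_hits / total


# Hypothetical four-token sentence.
gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "det"), (2, "iobj")]

uas, las = attachment_scores(gold, pred)
print(uas, las)
```

Here three of four heads are correct (UAS 75.0), but only two tokens also carry the correct label (LAS 50.0).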

The CLASSLA-StanfordNLP model for lemmatisation of non-standard Slovenian 1.0

2 resources

The model for lemmatisation of non-standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the ssj500k training corpus (http://hdl.handle.net/11356/1210) and the Janes-Tag corpus (http://hdl.handle.net/11356/1238), using the Sloleks inflectional lexicon (http://hdl.handle.net/11356/1230). These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The estimated F1 of the lemma annotations is ~98.86.
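The diacritic augmentation can be pictured as follows. This is a sketch under stated assumptions, not the actual training pipeline: it assumes training items are (form, lemma) pairs, that only the form loses its diacritics while the lemma stays intact, and that every affected pair is repeated once; the function names are hypothetical.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (č -> c)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def augment(pairs):
    """Append a diacritic-free copy of each (form, lemma) pair that has marks.

    Assumption: the lemma keeps its diacritics, so the model learns to map
    a form such as "ze" back to the lemma "že".
    """
    augmented = list(pairs)
    for form, lemma in pairs:
        stripped = strip_diacritics(form)
        if stripped != form:
            augmented.append((stripped, lemma))
    return augmented


pairs = [("že", "že"), ("in", "in"), ("čaše", "čaša")]
print(augment(pairs))
```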

Entailment

1 resource

Entailment is a tool for recognizing semantic relations between sentences.

The CLASSLA-Stanza model for lemmatisation of standard Bulgarian 2.1

2 resources

The model for lemmatisation of standard Bulgarian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the BulTreeBank training corpus (https://clarino.uib.no/korpuskel/corpora) and using the Bulgarian inflectional lexicon (Popov, Simov, and Vidinska 1998). The estimated F1 of the lemma annotations is ~98.93. Unlike the previous version of the lemmatizer, this version was trained using the new version of the Bulgarian word embeddings.

The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Croatian 1.2

3 resources

The model for morphosyntactic annotation of standard Croatian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the hr500k training corpus (http://hdl.handle.net/11356/1183) and using the CLARIN.SI-embed.hr word embeddings (http://hdl.handle.net/11356/1205). The model simultaneously produces UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~94.1. The difference from the previous version of the model is that the pre-trained embeddings are limited to 250 thousand entries and adapted to the new code base.
