Search results

703 record(s) found

  • Czech image captioning, machine translation, and sentiment analysis (Neural Monkey models)

    This submission contains trained end-to-end models for the Neural Monkey toolkit for Czech and English, solving three NLP tasks: machine translation, image captioning, and sentiment analysis. The models are trained on standard datasets and achieve state-of-the-art or near state-of-the-art performance on the tasks. The models are described in the accompanying paper, and the same models can also be invoked via the online demo: https://ufal.mff.cuni.cz/grants/lsd

    There are several separate ZIP archives here, each containing one model solving one of the tasks for one language. To use a model, you first need to install Neural Monkey: https://github.com/ufal/neuralmonkey To ensure the model functions correctly, please use the exact version of Neural Monkey specified by the commit hash stored in the 'git_commit' file in the model directory.

    Each model directory contains a 'run.ini' Neural Monkey configuration file, to be used to run the model; see the Neural Monkey documentation to learn how to do that (you may need to update some paths to match your filesystem organization). The 'experiment.ini' file, which was used to train the model, is also included, along with files containing the model itself, files containing the input and output vocabularies, etc.

    For the sentiment analyzers, you should tokenize your input data using the Moses tokenizer: https://pypi.org/project/mosestokenizer/ For machine translation, you do not need to tokenize the data, as this is done by the model.

    For image captioning, you need to:
    - download a trained ResNet: http://download.tensorflow.org/models/resnet_v2_50_2017_04_14.tar.gz
    - clone the git repository with TensorFlow models: https://github.com/tensorflow/models
    - preprocess the input images with the Neural Monkey 'scripts/imagenet_features.py' script (https://github.com/ufal/neuralmonkey/blob/master/scripts/imagenet_features.py), specifying the paths to the ResNet and to the TensorFlow models

    Feel free to contact the authors of this submission if you run into problems!
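To illustrate the kind of preprocessing the sentiment analyzers expect, here is a minimal, self-contained sketch of Moses-style tokenization, i.e. splitting punctuation off into separate tokens. This is a simplified stand-in, not the real `mosestokenizer` package mentioned above, which handles many more cases (abbreviations, URLs, language-specific rules); install that package for actual use.

```python
import re

def moses_style_tokenize(text):
    # Rough sketch: keep runs of word characters as tokens and split
    # each punctuation character off as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(moses_style_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```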
  • WebStylo

    A web-based, open-source stylometry system based on Multilevel Text Analysis. It runs the cluto and stylo (an R package) clustering methods and is built on a Natural Language Processing workflow engine (included in the distribution).
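The core step behind stylometric clustering of the kind WebStylo performs is measuring distances between word-frequency profiles of texts, which a clustering method then groups. A minimal, self-contained sketch of that step (not WebStylo's actual code; the word list and distance measure are simplified for illustration):

```python
from collections import Counter
import math

def freq_profile(text, top_words):
    # Relative frequencies of a fixed list of (typically very common) words.
    counts = Counter(text.lower().split())
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in top_words]

def euclidean(p, q):
    # Distance between two frequency profiles; clustering methods such as
    # those in cluto or stylo operate on a matrix of such distances.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

top = ["the", "and", "of"]
a = freq_profile("the cat and the dog", top)
b = freq_profile("the ship of the line and the sea", top)
print(round(euclidean(a, b), 3))
```

Real stylometric practice uses hundreds of most-frequent words and normalized measures such as Burrows's Delta, but the profile-then-distance structure is the same.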
  • Open morphology of Finnish

    Omorfi is a free and open-source project containing various tools and data for handling Finnish texts in a linguistically motivated manner. The main components of this repository are:
    1) a lexical database containing hundreds of thousands of words (cf. lexical statistics),
    2) a collection of scripts to convert the lexical database into formats used by upstream NLP tools (cf. lexical processing),
    3) an autotools setup to build and install (or package, or deploy) the scripts, the database, and simple APIs / convenience processing tools, and
    4) a collection of relatively simple APIs for a selection of languages, and scripts to apply the NLP tools and access the database.
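As a rough illustration of what an API over such a lexical database provides, here is a toy dictionary-based morphological lookup. The entry format, tags, and function name are invented for this sketch and are not omorfi's real data format or API.

```python
# Hypothetical toy lexicon: surface form -> list of
# (lemma, part of speech, number, case) analyses.
LEXICON = {
    "taloissa": [("talo", "N", "Pl", "Ine")],  # "in the houses"
    "talo": [("talo", "N", "Sg", "Nom")],      # "house"
}

def analyse(word):
    # Return all analyses of a surface form, or an empty list for
    # out-of-vocabulary words (a real analyzer would also guess).
    return LEXICON.get(word.lower(), [])

print(analyse("taloissa"))  # [('talo', 'N', 'Pl', 'Ine')]
```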
  • Word embeddings CLARIN.SI-embed.mk 2.0

    CLARIN.SI-embed.mk contains word embeddings induced from a large collection of Macedonian texts crawled from the .mk top-level domain. The embeddings are based on the skip-gram model of fastText, trained on 933,231,582 tokens of running text for 986,670 lowercased surface forms. The difference from the previous version of the embeddings is that this version was trained on the original dataset expanded with the MaCoCu-mk web crawl corpus (http://hdl.handle.net/11356/1512).
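A typical use of such embeddings is measuring word similarity as the cosine of the angle between vectors. A minimal self-contained sketch, using made-up 4-dimensional toy vectors in place of the real (much higher-dimensional) fastText embeddings; loading the actual model would be done with the fastText or gensim libraries:

```python
import math

# Toy vectors standing in for real fastText embeddings; the words and
# values are invented for illustration only.
vectors = {
    "skopje": [0.9, 0.1, 0.3, 0.0],
    "bitola": [0.8, 0.2, 0.4, 0.1],
}

def cosine(u, v):
    # Cosine similarity: dot product divided by the vector norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(round(cosine(vectors["skopje"], vectors["bitola"]), 3))
```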
  • The CLASSLA-StanfordNLP model for morphosyntactic annotation of standard Slovenian 1.2

    This model for morphosyntactic annotation of standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the ssj500k training corpus (http://hdl.handle.net/11356/1210) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1204). The model simultaneously produces UPOS, FEATS, and XPOS (MULTEXT-East) labels. The estimated F1 score of the XPOS annotations is ~97.06. The difference from the previous version of the model is that the pre-trained embeddings are limited to 250 thousand entries and adapted to the new code base.
  • The CLASSLA-Stanza model for morphosyntactic annotation of standard Croatian 2.1

    The model for morphosyntactic annotation of standard Croatian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the hr500k training corpus (http://hdl.handle.net/11356/1792) and using the CLARIN.SI-embed.hr word embeddings (http://hdl.handle.net/11356/1790). The model simultaneously produces UPOS, FEATS, and XPOS (MULTEXT-East) labels. The estimated F1 score of the XPOS annotations is ~94.87. The difference from the previous version of the model is that this version was trained on the new version of the hr500k corpus and the new version of the Croatian word embeddings.
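The F1 scores reported for these taggers are computed over predicted vs. gold tag sequences. A minimal self-contained sketch of that evaluation (the MULTEXT-East-style tags in the example are illustrative, not taken from the actual test sets; note that with one tag per token and identical tokenization, micro-F1 reduces to tagging accuracy):

```python
def tag_f1(gold, pred):
    # Micro-averaged F1 over two aligned tag sequences.
    correct = sum(g == p for g, p in zip(gold, pred))
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

# Illustrative MULTEXT-East-style morphosyntactic descriptors.
gold = ["Ncmsn", "Vmpr3s", "Ncfsa"]
pred = ["Ncmsn", "Vmpr3s", "Agpfsa"]  # one tag wrong
print(round(tag_f1(gold, pred), 2))  # 0.67
```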