CLARIN Tool Portal

Parallel Corpora from Comparable Corpora tool

2 resources

The script consists of 2 parts:
- article parser
- aligner

Required software (install before using the script):
- yalign
- additional Ubuntu packages: mongodb, ipython, python-nose, python-werkzeug

Wiki article parser

The article parser works in 2 steps:
1. Extracts articles from wiki dumps.
2. Saves the extracted articles to a local database (MongoDB).

Before using the parser, download the wiki dumps and extract them to a directory (the directory should contain *.xml and *.sql files). For each language, 2 dump files should be downloaded: the articles dump and the language links dump. Examples:

PL:
http://dumps.wikimedia.org/plwiki/latest/plwiki-latest-pages-articles.xml.bz2
http://dumps.wikimedia.org/plwiki/latest/plwiki-latest-langlinks.sql.gz
EN:
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-langlinks.sql.gz

IMPORTANT NOTE: After extraction, the English dumps require about 50 GB of free space. During parsing, the parser can require up to 8 GB of RAM.

The article parser has a "main language" option: articles in other languages are extracted only if the corresponding article exists in the main language. E.g. if the main language is PL, the extractor first extracts all articles for PL, then extracts articles for other languages only if they have a PL counterpart. This reduces space requirements.

For help use:
$ python parse_wiki_dumps.py -h
Example command:
$ python parse_wiki_dumps.py -d ~/temp/wikipedia_dump/ -l pl -v

Wikipedia aligner

The aligner can be used once the articles have been extracted from the dumps. It takes article pairs for a given language pair, aligns the text, and saves the parallel corpora to 2 files.
The "-s" option limits the number of symbols per file (the default size is 50000000 symbols, which is around 50-60 MB). By default the aligner tries to continue aligning where it was stopped; to force aligning from the beginning, use the "--restart" key.

For help use:
$ python align.py -h
Example command:
$ python align.py -o wikipedia -l en-pl -v

Euronews crawler

The crawler finds links to articles using the euronews archive (http://euronews.com/2004/) and, in parallel, extracts the article texts and saves them to the DB.

For help use:
$ python parse_euronews.py -h
Example command:
$ python parse_euronews.py -l en,pl -v

Euronews aligner

To start the aligner for euronews articles:
$ python align.py -o euronews -l en-pl -v

Saving articles in plain text

The script "save_plain_text.py" saves all articles in plain text format. It accepts the path for saving articles, the languages of the articles to be saved, and the source of the articles (euronews, wikipedia).

For help use:
$ python save_plain_text.py -h
Example command:
$ python save_plain_text.py -l en,pl -r [path] -o euronews

Yalign selection

This script tries random parameters for the yalign model in order to find the best parameters for aligning the provided text samples. Before using the yalign_selection script, you need to prepare article samples using the prepare_random_sampling.py script.
The folder with article samples can be created with this command:
$ python prepare_random_sampling.py -o wikipedia -c 10 -l ru-en -v

-o wikipedia - source of articles, can be wikipedia or euronews
-c 10 - number of articles to extract
-l ru-en - languages to extract

This script creates an "article_samples" folder with article files. You can then create manually aligned files (you need to align the articles of the second language); for this example you need to align the "en" files. Files named "_orig" should be left unmodified.

When the manual alignment is ready, you can run the selection script. Example:
$ python yalign_selection.py --samples article_samples/ --lang1 ru --lang2 en --threshold 0.1536422609112349e-6 --threshold_step 0.0000001 --threshold_step_count 10 --penalty 0.014928930455303857 --penalty_step 0.0001 --penalty_step_count 1 -m ru-en

Here is what each parameter means:
--samples article_samples/ - path to the article samples folder
--lang1 ru --lang2 en - languages to align (articles of the second language should be aligned manually; the script takes the "??_orig" files, aligns them automatically, and compares the result with the manually aligned files)
--threshold 0.1536422609112349e-6 - threshold value of the model; the selection is made around this value
--threshold_step 0.0000001 - step by which the value is changed
--threshold_step_count 10 - number of steps to check below and above the value, e.g. with value 10, step 1, and count 2, the script checks 8 9 10 11 12
--penalty, --penalty_step, --penalty_step_count - the same parameters for the penalty
-m ru-en - path to the yalign model

To tweak how text lines in the files are compared, you can also use --length and --similarity:
--length - minimum relative length for lines to be marked as similar: 1 - same length, 0.5 - at least half the length
--similarity - similarity of the text in lines: 1 - exactly the same, 0 - completely different. For the similarity check, sentences are compared as sequences of characters.

The script already supports multiprocessing.
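
The step-count rule and the line comparison described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code of yalign_selection.py: the function names values_to_check and lines_match, the default limits, and the use of difflib for character-level similarity are all hypothetical choices.

```python
from difflib import SequenceMatcher


def values_to_check(value, step, count):
    """Return the values tried around `value`: `count` steps below and above it."""
    return [value + i * step for i in range(-count, count + 1)]


def lines_match(auto_line, manual_line, min_length=0.5, min_similarity=0.8):
    """Hypothetical check that an automatically aligned line matches a manual one.

    min_length plays the role of --length (1 = same length, 0.5 = at least half
    the length); min_similarity plays the role of --similarity (1 = exactly the
    same, 0 = completely different). Lines are compared as character sequences.
    """
    a, b = auto_line.strip(), manual_line.strip()
    if not a or not b:
        return a == b
    length_ratio = min(len(a), len(b)) / max(len(a), len(b))
    if length_ratio < min_length:
        return False
    # difflib is an assumption here; the original script may measure similarity differently.
    return SequenceMatcher(None, a, b).ratio() >= min_similarity


print(values_to_check(10, 1, 2))  # [8, 9, 10, 11, 12]
```

With floating-point steps such as 0.0000001, the generated values are subject to the usual rounding of binary floats, so they should be compared with a tolerance rather than for exact equality.
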
Use the -t option of yalign_selection.py to set the number of threads; by default it equals the number of CPUs. For additional parameters, use the '-h' key.

When the yalign_selection.py script finishes, it produces a CSV file in which the first column is the threshold, the second column is the penalty, and the third is the similarity for these parameters.

Align with the hunalign method

To use hunalign, add the "--hunalign" option to the align.py script. Example:
$ python align.py -l li-hu -r align_result -o wikipedia --hunalign

In my empirical study, it provides better results when the articles are translations of each other or are similar in length and content.

Align from folder

To align already aligned texts using hunalign, use this example command:
$ python align_aligned_using_hunalign.py source/ target/

Final info

Wołk, K., & Marasek, K. (2015, September). Tuned and GPU-accelerated parallel data mining from comparable corpora. In International Conference on Text, Speech, and Dialogue (pp. 32-40). Springer International Publishing. http://arxiv.org/pdf/1509.08639

For more detailed usage instructions, see howto.pdf.

For any questions: Krzysztof Wolk, krzysztof@wolk.pl
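
As a rough sketch of how the CSV produced by yalign_selection.py could be used, the Python snippet below picks the row with the highest similarity. It assumes comma-separated values in the column order described above (threshold, penalty, similarity); the function name best_parameters and the sample numbers are made up for illustration.

```python
import csv
import io


def best_parameters(csv_file):
    """Return (threshold, penalty, similarity) for the row with the highest similarity."""
    best = None
    for row in csv.reader(csv_file):
        if len(row) < 3:
            continue
        try:
            threshold, penalty, similarity = (float(cell) for cell in row[:3])
        except ValueError:
            continue  # skip a header or malformed row, if any
        if best is None or similarity > best[2]:
            best = (threshold, penalty, similarity)
    return best


# Made-up sample data in the assumed layout, not real results.
sample = io.StringIO("1e-07,0.0149,0.61\n2e-07,0.0149,0.74\n3e-07,0.0150,0.69\n")
print(best_parameters(sample))
```

For a real result file, open it with open(path, newline="") and pass the file object to best_parameters.
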

CroSloEngual BERT

4 resources

Trilingual BERT (Bidirectional Encoder Representations from Transformers) model, trained on Croatian, Slovenian, and English data. It is a state-of-the-art tool that represents words/tokens as contextually dependent word embeddings and is used for various NLP classification tasks by fine-tuning the model end-to-end. CroSloEngual BERT is distributed as neural network weights and configuration files in PyTorch format (i.e. to be used with the PyTorch library).

Terminal-based CoNLL-file viewer, v2

4 resources

A simple way of browsing CoNLL format files in your terminal. Fast and text-based.

To open a CoNLL file, simply run:
./view_conll sample.conll

The output is piped through less, so you can use less commands to navigate the file. By default, less searches for sentence beginnings, so you can press "n" to go to the next sentence and "N" to go to the previous sentence. Press "q" to close. Trees with a high number of non-projective edges may be difficult to read, as I have not found a good way of displaying them intelligibly.

If you are on Windows and don't have less (but have Python), run it like this:
python view_conll.py sample.conll

For complete instructions, see the README file. You need Python 2 to run the viewer.
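
To see in advance which sentences may be hard to read, one could count crossing dependency arcs, a common sign of non-projectivity. The Python sketch below is not part of the viewer. It assumes a tab-separated file with at least 7 columns, the token ID in column 1 and the HEAD in column 7 (as in CoNLL-X and CoNLL-U); the function names and the placeholder tokens are hypothetical.

```python
def read_sentences(lines):
    """Group CoNLL token lines into sentences separated by blank lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
        elif not line.startswith("#"):  # skip comment lines
            sentence.append(line.split("\t"))
    if sentence:
        yield sentence


def count_crossing_arcs(sentence):
    """Count pairs of dependency arcs that cross (the root arc from 0 is included)."""
    arcs = []
    for cols in sentence:
        if len(cols) > 6 and cols[0].isdigit() and cols[6].isdigit():
            dep, head = int(cols[0]), int(cols[6])
            arcs.append((min(dep, head), max(dep, head)))
    return sum(1 for a, b in arcs for c, d in arcs if a < c < b < d)


def token_line(idx, head):
    """Build a 10-column line with placeholder values (illustration only)."""
    return "\t".join([str(idx), "w%d" % idx, "_", "_", "_", "_", str(head), "_", "_", "_"])


sample = [token_line(1, 2), token_line(2, 0), token_line(3, 2), "",
          token_line(1, 3), token_line(2, 4), token_line(3, 0), token_line(4, 3), ""]

for number, sentence in enumerate(read_sentences(sample), start=1):
    print(number, len(sentence), count_crossing_arcs(sentence))
# prints "1 3 0" then "2 4 2"
```
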

Image Annotation Tool

1 resource

Image Annotation Tool is a web application that allows users to mark zones of interest in an image. These zones are then converted to a TEI P5 code snippet that can be used in your document to connect the image and the text. The tool was developed to help students and teachers at the Faculty of Arts, Charles University, mark and annotate images of manuscripts.
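
To give an idea of what such a snippet can look like, the Python sketch below turns rectangular zones into TEI P5 facsimile markup using the standard facsimile, surface, graphic, and zone elements with ulx/uly/lrx/lry coordinates. It is an illustration only: the exact output of the tool is not documented here, the function name zones_to_tei and the sample values are hypothetical, and the TEI namespace declaration is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the attribute xml:id.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def zones_to_tei(image_url, zones):
    """Build a <facsimile> snippet from {zone_id: (ulx, uly, lrx, lry)} rectangles."""
    facsimile = ET.Element("facsimile")
    surface = ET.SubElement(facsimile, "surface")
    ET.SubElement(surface, "graphic", {"url": image_url})
    for zone_id, (ulx, uly, lrx, lry) in zones.items():
        ET.SubElement(surface, "zone", {
            XML_ID: zone_id,
            "ulx": str(ulx), "uly": str(uly),  # upper-left corner
            "lrx": str(lrx), "lry": str(lry),  # lower-right corner
        })
    return ET.tostring(facsimile, encoding="unicode")


# Hypothetical image name and pixel coordinates.
print(zones_to_tei("page1.jpg", {"zone1": (10, 20, 110, 220)}))
```

In the transcription, an element can then point at a zone through its facs attribute, for example facs="#zone1".
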

HaskEN

2 resources

HaskEN is an English phraseological database designed for language professionals including linguists, language teachers, lexicographers, language materials developers and translators. Query results can be visualised and exported as spreadsheets.

GreynirTranslate - mBART25 NMT models for Translations between Icelandic and English (1.0)

7 resources

Provided are general-domain IS-EN and EN-IS translation models developed by Miðeind ehf. They are based on a multilingual BART model (https://arxiv.org/pdf/2001.08210.pdf) and fine-tuned for translation on parallel and backtranslated data. The models are trained using the Fairseq sequence modeling toolkit, which is built on PyTorch. Provided here are the model files, a sentencepiece subword-tokenizing model, and dictionary files for running the models locally. You can run the scripts infer-enis.sh and infer-isen.sh to test the models by translating sentences on the command line. For translating documents and evaluating results, you will need to binarize the data using fairseq-preprocess and use fairseq-generate for translating. Please refer to the Fairseq documentation for further information on running a pre-trained model: https://fairseq.readthedocs.io/en/latest/

Universal Dependencies 2.15 models for UDPipe 2 (2024-11-21)

2 resources

Tokenizer, POS Tagger, Lemmatizer and Parser models for 147 treebanks of 78 languages of Universal Dependencies 2.15 Treebanks, created solely using UD 2.15 data (https://hdl.handle.net/11234/1-5787). The model documentation, including performance, can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_215_models . To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .

Czech image captioning, machine translation, sentiment analysis and summarization (Neural Monkey models)

9 resources

This submission contains trained end-to-end models for the Neural Monkey toolkit for Czech and English, solving four NLP tasks: machine translation, image captioning, sentiment analysis, and summarization. The models are trained on standard datasets and achieve state-of-the-art or near state-of-the-art performance in the tasks. The models are described in the accompanying paper. The same models can also be invoked via the online demo: https://ufal.mff.cuni.cz/grants/lsd

In addition to the models presented in the referenced paper (developed and published in 2018), we include models for automatic news summarization for Czech and English developed in 2019. The Czech models were trained using the SumeCzech dataset (https://www.aclweb.org/anthology/L18-1551.pdf), the English models were trained using the CNN-Daily Mail corpus (https://arxiv.org/pdf/1704.04368.pdf) using the standard recurrent sequence-to-sequence architecture.

There are several separate ZIP archives here, each containing one model solving one of the tasks for one language.

To use a model, you first need to install Neural Monkey: https://github.com/ufal/neuralmonkey
To ensure correct functioning of the model, please use the exact version of Neural Monkey specified by the commit hash stored in the 'git_commit' file in the model directory.

Each model directory contains a 'run.ini' Neural Monkey configuration file, to be used to run the model. See the Neural Monkey documentation to learn how to do that (you may need to update some paths to correspond to your filesystem organization). The 'experiment.ini' file, which was used to train the model, is also included. Then there are files containing the model itself, files containing the input and output vocabularies, etc.

For the sentiment analyzers, you should tokenize your input data using the Moses tokenizer: https://pypi.org/project/mosestokenizer/

For the machine translation, you do not need to tokenize the data, as this is done by the model.

For image captioning, you need to:
- download a trained ResNet: http://download.tensorflow.org/models/resnet_v2_50_2017_04_14.tar.gz
- clone the git repository with TensorFlow models: https://github.com/tensorflow/models
- preprocess the input images with the Neural Monkey 'scripts/imagenet_features.py' script (https://github.com/ufal/neuralmonkey/blob/master/scripts/imagenet_features.py); you need to pass the path to ResNet and to the TensorFlow models to this script

The summarization models require input that is tokenized with Moses Tokenizer (https://github.com/alvations/sacremoses) and lower-cased.

Feel free to contact the authors of this submission in case you run into problems!

Universal Dependencies 2.12 models for UDPipe 2 (2023-07-17)

2 resources

Tokenizer, POS Tagger, Lemmatizer and Parser models for 131 treebanks of 72 languages of Universal Dependencies 2.12 Treebanks, created solely using UD 2.12 data (https://hdl.handle.net/11234/1-5150). The model documentation, including performance, can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_212_models . To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .

Depfix: Automatic Post-editing of SMT

4 resources

Depfix is a tool for automatic post-editing of SMT (statistical machine translation) output. See the project website for more information.
