CLARIN Tool Portal

698 record(s) found

Search results

Byte-Level Neural Error Correction Model for Icelandic - Yfirlestur (22.09)

7 resources

This Byte-Level Neural Error Correction Model for Icelandic is a fine-tuned ByT5-base Transformer model for error correction in natural language. It works like a machine translation model in that it "translates" from erroneous Icelandic into correct Icelandic. The model is trained on parallel synthetic error data and on real error data from the iceErrorCorpus (IceEC, http://hdl.handle.net/20.500.12537/73) and the three specialised error corpora (L2: http://hdl.handle.net/20.500.12537/131, dyslexia: http://hdl.handle.net/20.500.12537/132, child language: http://hdl.handle.net/20.500.12537/133). The synthetic error data (35M lines of parallel data) was created by filtering the Icelandic Gigaword Corpus (IGC, http://hdl.handle.net/20.500.12537/192) and then scrambling it with a variety of error patterns to simulate real grammatical and typographical errors. The pretrained ByT5 model was first trained on the synthetic data and finally fine-tuned on the real error data from IceEC. It can correct a variety of textual errors, even in texts containing many errors, such as those written by people with dyslexia. Measured on the IceEC test data, the model scores 0.862917 on the GLEU metric (a modification of BLEU for grammatical error correction) and 0.06% in TER (translation error rate).
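
The synthetic-data step described above (scrambling clean text to obtain parallel noisy/clean pairs) can be sketched roughly as follows. This is only an illustration under stated assumptions: the three error patterns, the function names (`swap_adjacent`, `drop_char`, `double_char`, `corrupt`, `make_pairs`) and the `rate` parameter are hypothetical, since the entry does not list the patterns or filters actually applied to the IGC.

```python
import random

# Hypothetical error patterns; the catalogue entry does not say which
# patterns were actually used to scramble the corpus.
def swap_adjacent(word, rng):
    """Transpose two neighbouring characters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def drop_char(word, rng):
    """Delete one character (words of length 1 are left alone)."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]

def double_char(word, rng):
    """Repeat one character."""
    if not word:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i] + word[i:]

PATTERNS = (swap_adjacent, drop_char, double_char)

def corrupt(sentence, rng, rate=0.3):
    """Apply one random pattern to roughly `rate` of the words."""
    noisy = []
    for word in sentence.split():
        if rng.random() < rate:
            word = rng.choice(PATTERNS)(word, rng)
        noisy.append(word)
    return " ".join(noisy)

def make_pairs(clean_sentences, seed=0, rate=0.3):
    """Return (noisy source, clean target) pairs for seq2seq training."""
    rng = random.Random(seed)
    return [(corrupt(s, rng, rate), s) for s in clean_sentences]
```

Each pair keeps the untouched sentence as the target, so a sequence-to-sequence model trained on such data learns to map the noisy side back to the clean side. Character-level noise of this kind only covers typographical slips; grammatical errors would need different, language-aware patterns.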

EXMARaLDA

1 resource

**EXMARaLDA** is a system for working with oral corpora on a computer. It consists of a transcription and annotation tool ([Partitur-Editor](https://exmaralda.org/en/partitur-editor-2/ "Partitur Editor")), a tool for managing corpora ([Corpus-Manager](https://exmaralda.org/en/corpus-manager-coma-2/ "Corpus-Manager (Coma)")) and a query and analysis tool ([EXAKT](https://exmaralda.org/en/exakt-3/ "EXAKT")).

**EXMARaLDA's** features include, for instance:

- time-aligned transcription of digital audio or video,
- flexible annotation for freely choosable categories,
- systematic documentation of a corpus through metadata,
- flexible output of transcription data in various layouts and formats (notation, document),
- computer-assisted querying of transcription, annotation and metadata,
- interoperability: it works with XML-based data formats that allow for data exchange with other tools (such as Praat, ELAN and Transcriber) and enable flexible processing and sustainable use of the data.

**EXMARaLDA** is used by [researchers worldwide](https://exmaralda.org/en/projects/ "Projekte") in different contexts in which spoken language is analysed, including:

- conversation and discourse analysis,
- study of language acquisition and multilingualism,
- phonetics and phonology,
- dialectology and sociolinguistics.

**EXMARaLDA** was developed in the project "Computer assisted methods for the creation and analysis of multilingual data" at the Collaborative Research Center "Multilingualism" (Sonderforschungsbereich "Mehrsprachigkeit" – SFB 538) at the University of Hamburg. Since July 2011, development of EXMARaLDA has continued at the [Hamburg Centre for Language Corpora](https://corpora.uni-hamburg.de/drupal/en), and since November 2011 in cooperation with the [Archive for Spoken German](http://agd.ids-mannheim.de/index.shtml) at the Institute for the German Language in Mannheim.
DARIAH DKPro-Wrapper: POS-Tagging und Lemmatization DE

1 resource

The DARIAH DKPro Wrapper is a wrapper for DKPro Core, a tool for linguistic annotation.
WebLicht Tokenization TUR

1 resource

WebLicht Easy Chain for tokenization of Turkish texts. The pipeline makes use of WebLicht's TCF converter and the tokenizer from the OpenNLP project. The 'newlineBounds' parameter treats each newline as a hard break (a sentence boundary). WebLicht's built-in viewer for annotations can be used to visualize the processing result.
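
The effect of treating newlines as hard sentence boundaries can be illustrated with a minimal sketch. The function `split_sentences`, its `newline_bounds` flag and the naive punctuation rule are hypothetical stand-ins chosen for illustration; they are not the OpenNLP or WebLicht implementation.

```python
import re

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
# This is a stand-in for a real sentence detector, not OpenNLP's model.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def split_sentences(text, newline_bounds=False):
    """Split text into sentences.

    With newline_bounds=True every newline ends a sentence, even when the
    line has no final punctuation; otherwise newlines count as ordinary
    whitespace.
    """
    if newline_bounds:
        chunks = text.splitlines()
    else:
        chunks = [" ".join(text.split())]
    sentences = []
    for chunk in chunks:
        sentences.extend(s for s in _SENTENCE_END.split(chunk.strip()) if s)
    return sentences
```

With `newline_bounds=True`, the input `"Merhaba dünya\nNasılsın? İyiyim."` yields three sentences, because the line break closes the unpunctuated first line; with the flag off, the first line is merged with the following sentence and two sentences result. This matters for inputs such as headings or verse, where lines rarely end in punctuation.
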
WebLicht Lemmas DE

1 resource

WebLicht Easy Chain for Lemmatization (German). The pipeline makes use of WebLicht's TCF converter, the IMS tokenizer, and the IMS TreeTagger. WebLicht's Tundra can be used to visualize the result.
WebLicht POSTags Lemmas IT

1 resource

WebLicht Easy Chain for POS Tagging and Lemmatization (Italian). The pipeline makes use of WebLicht's TCF converter, the IMS tokenizer, and the POS Tagger from the OpenNLP project. The model for Italian is trained on a relatively small training corpus (MIDT) and should therefore be considered experimental. WebLicht's Tundra can be used to visualize the result.
WebLicht POSTags Lemmas DE

1 resource

WebLicht Easy Chain for POS Tagging and Lemmatization (German). The pipeline makes use of WebLicht's TCF converter, the IMS tokenizer, and the IMS TreeTagger. WebLicht's Tundra can be used to visualize the result.
iDAI.vocab FR

1 resource

The German Archaeological Institute is a federal research institution within the portfolio of the German Federal Foreign Office.
Opener Tokenizer

1 resource

Tokenizer for Dutch, English, German, French, Spanish and Italian. Consumes plain text and produces TCF.
Wikipedia Search EN

1 resource

Wikipedia is an online free-content encyclopedia that you can edit and contribute to. Wikipedia co-founder Jimmy Wales has described Wikipedia as "an effort to create and distribute a free encyclopedia of the highest possible quality to every single person on the planet in their own language." Wikipedia exists to bring knowledge to everyone who seeks it.
