Active filters:

  • Tool task: Machine translation
  • Language: English
22 record(s) found

Search results

  • Optimized Long Context Translation Models for English-Icelandic translations (22.09)

    ENGLISH: These models are optimized versions of the translation models released in http://hdl.handle.net/20.500.12537/278. Instead of the 24 layers used in the full model, they have been shrunk down to 7 layers. The computational resources required to run inference on the models are thus significantly lower than for the original models. Performance is comparable to the original models when evaluated on general topics such as news, but for expert knowledge from the training data (e.g. EEA regulations) the original models are more capable. The models can translate between English and Icelandic, in both directions. They can translate several sentences at once and are robust to some input errors such as spelling errors. The models are based on the pretrained mBART25 model (http://hdl.handle.net/20.500.12537/125, https://arxiv.org/abs/2001.08210) and finetuned on bilingual EN-IS data and backtranslated data (including http://hdl.handle.net/20.500.12537/260). The full backtranslation data used includes texts from the following sources: The Icelandic Gigaword Corpus (without sport) (IGC), The Icelandic Common Crawl Corpus (IC3), Student theses (skemman.is), Greynir News, Wikipedia, Icelandic sagas, Icelandic e-books, Books3, NewsCrawl, Wikipedia, EuroPARL, Reykjavik Grapevine, Iceland Review. The true parallel long context data used is from European Economic Area (EEA) regulations, the document-level Icelandic Student Theses Abstracts corpus (IPAC), Stúdentablaðið (university student magazine), the report of the Special Investigation Commission (Rannsóknarnefnd Alþingis), the Bible and the Jehovah's Witnesses corpus (JW300). Provided here are model files, a SentencePiece subword-tokenizing model and dictionary files for running the model locally, along with scripts for translating sentences on the command line. We refer to the included README for instructions on running inference.
    ÍSLENSKA (translated into English): These models are reduced versions of the models available at http://hdl.handle.net/20.500.12537/278. The original models have 24 layers, whereas these versions have 7 layers and are more efficient to run. Their performance is on par with the original models on general text, such as news. On more specialised text found in the training data, e.g. European regulations, they perform worse. The models can translate between English and Icelandic. They can translate several sentences at once and are tolerant of errors and minor deviations in the input. The models are fine-tuned translation models trained from the mBART25 model (http://hdl.handle.net/20.500.12537/125, https://arxiv.org/abs/2001.08210). The training data is parallel English-Icelandic data together with backtranslations (including http://hdl.handle.net/20.500.12537/260). The monolingual data that was backtranslated and used in training comes from: the Icelandic Gigaword Corpus (without sports news), the Icelandic Common Crawl Corpus (IC3), theses from skemman.is, news from the Greynir news database, Wikipedia, the Icelandic sagas, open Icelandic e-books, Books3, NewsCrawl, Wikipedia, EuroPARL, Reykjavik Grapevine, Iceland Review. The genuine parallel data is drawn from European Economic Area (EEA) regulations, aligned abstracts from student theses (IPAC), Stúdentablaðið, the report of the Special Investigation Commission (Rannsóknarnefnd Alþingis), the Bible and a parallel corpus derived from The Watchtower (JW300). Released are the models themselves, a subword tokenization model and a dictionary for the tokenization, along with scripts for running translations from the command line. Further instructions are in the README file.
  • Translation Models (en-ru) (v1.0)

    En-Ru translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). Models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer. Evaluation on newstest2020 (BLEU): en->ru: 18.0 ru->en: 30.4 (Evaluated using multeval: https://github.com/jhclark/multeval)
  • CUBBITT Translation Models (en-pl) (v1.0)

    CUBBITT En-Pl translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). Models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer. Evaluation on newstest2020 (BLEU): en->pl: 12.3 pl->en: 20.0 (Evaluated using multeval: https://github.com/jhclark/multeval)
  • GreynirTranslate - mBART25 NMT (with layer drop) models for Translations between Icelandic and English (1.0)

    ENGLISH: These are the models in http://hdl.handle.net/20.500.12537/125 trained with 40% layer drop. They are suitable for inference using only every other layer, which speeds up inference at the cost of lower translation quality. For usage, we refer to the prior submission and to the LayerDrop documentation at https://github.com/pytorch/fairseq/blob/fcca32258c8e8bcc9f9890bf4714fa2f96b6b3e1/examples/layerdrop/README.md.
    ÍSLENSKA (translated into English): These models are trained with 40% layer drop on the models in http://hdl.handle.net/20.500.12537/125. They are well suited to translation in which every other layer of the network has been discarded, which speeds up translation at the cost of quality. Instructions for using the networks can be found with the original models and in the Fairseq usage documentation at https://github.com/pytorch/fairseq/blob/fcca32258c8e8bcc9f9890bf4714fa2f96b6b3e1/examples/layerdrop/README.md.
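    The "every other layer" selection mentioned in this record can be illustrated with a minimal sketch. It only builds the comma-separated list of layer indices to keep. The layer count of 12 per stack is an assumption based on the published mBART architecture, and the helper name `layers_to_keep` is hypothetical; how such a list is passed to Fairseq should be checked against the LayerDrop README linked in the record.

```python
def layers_to_keep(num_layers: int, step: int = 2) -> str:
    """Return a comma-separated string of layer indices, keeping every `step`-th layer.

    Hypothetical helper for illustration; it is not part of the released package.
    """
    if num_layers <= 0 or step <= 0:
        raise ValueError("num_layers and step must be positive")
    return ",".join(str(i) for i in range(0, num_layers, step))


# Assumption: 12 encoder layers and 12 decoder layers (24 in total).
encoder_keep = layers_to_keep(12)
decoder_keep = layers_to_keep(12)

print("encoder layers kept:", encoder_keep)
print("decoder layers kept:", decoder_keep)
```

    Keeping indices 0, 2, 4, ... halves the depth of each stack, which is the trade-off the record describes: faster inference with lower translation quality.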
  • MCSQ Translation Models (en-de) (v1.0)

    En-De translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). The models were trained using the MCSQ social surveys dataset (available at https://repo.clarino.uib.no/xmlui/bitstream/handle/11509/142/mcsq_v3.zip). They are intended mainly for in-domain translation of social surveys. Models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer. Evaluation on MCSQ test set (BLEU): en->de: 67.5 (train: genuine in-domain MCSQ data only); de->en: 75.0 (train: additional in-domain backtranslated MCSQ data). (Evaluated using multeval: https://github.com/jhclark/multeval)
  • Long Context Translation Models for English-Icelandic translations (22.09)

    ENGLISH: These models can translate between English and Icelandic, in both directions. They can translate several sentences at once and are robust to some input errors such as spelling errors. The models are based on the pretrained mBART25 model (http://hdl.handle.net/20.500.12537/125, https://arxiv.org/abs/2001.08210) and finetuned on bilingual EN-IS data and backtranslated data (including http://hdl.handle.net/20.500.12537/260). The full backtranslation data used includes texts from the following sources: The Icelandic Gigaword Corpus (without sport) (IGC), The Icelandic Common Crawl Corpus (IC3), Student theses (skemman.is), Greynir News, Wikipedia, Icelandic sagas, Icelandic e-books, Books3, NewsCrawl, Wikipedia, EuroPARL, Reykjavik Grapevine, Iceland Review. The true parallel long context data used is from European Economic Area (EEA) regulations, the document-level Icelandic Student Theses Abstracts corpus (IPAC), Stúdentablaðið (university student magazine), the report of the Special Investigation Commission (Rannsóknarnefnd Alþingis), the Bible and the Jehovah's Witnesses corpus (JW300). Provided here are model files, a SentencePiece subword-tokenizing model and dictionary files for running the model locally, along with scripts for translating sentences on the command line. We refer to the included README for instructions on running inference.
    ÍSLENSKA (translated into English): These models can translate between English and Icelandic. They can translate several sentences at once and are tolerant of errors and minor deviations in the input. The models are fine-tuned translation models trained from the mBART25 model (http://hdl.handle.net/20.500.12537/125, https://arxiv.org/abs/2001.08210). The training data is parallel English-Icelandic data together with backtranslations (including http://hdl.handle.net/20.500.12537/260). The monolingual data that was backtranslated and used in training comes from: the Icelandic Gigaword Corpus (without sports news), the Icelandic Common Crawl Corpus (IC3), theses from skemman.is, news from the Greynir news database, Wikipedia, the Icelandic sagas, open Icelandic e-books, Books3, NewsCrawl, Wikipedia, EuroPARL, Reykjavik Grapevine, Iceland Review. The genuine parallel data is drawn from European Economic Area (EEA) regulations, aligned abstracts from student theses (IPAC), Stúdentablaðið, the report of the Special Investigation Commission (Rannsóknarnefnd Alþingis), the Bible and a parallel corpus derived from The Watchtower (JW300). Released are the models themselves, a subword tokenization model and a dictionary for the tokenization, along with scripts for running translations from the command line. Further instructions are in the README file.
  • MCSQ Translation Models (en-ru) (v1.0)

    En-Ru translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). The models were trained using the MCSQ social surveys dataset (available at https://repo.clarino.uib.no/xmlui/bitstream/handle/11509/142/mcsq_v3.zip). They are intended mainly for in-domain translation of social surveys. Models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer. Evaluation on MCSQ test set (BLEU): en->ru: 64.3 (train: genuine in-domain MCSQ data); ru->en: 74.7 (train: additional backtranslated in-domain MCSQ data). (Evaluated using multeval: https://github.com/jhclark/multeval)
  • Translation Models (en-de) (v1.0)

    En-De translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). Models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer. Evaluation on newstest2020 (BLEU): en->de: 25.9 de->en: 33.4 (Evaluated using multeval: https://github.com/jhclark/multeval)
  • TMODS:ENG-CZE -- query translation

    AMALACH project component TMODS:ENG-CZE; machine translation of queries from Czech to English. This archive contains models for the Moses decoder (binarized and pruned to allow for real-time translation) and configuration files for the MTMonkey toolkit. The aim of this package is to provide a full service for Czech->English translation that can easily be used as a component in a larger software solution. (The required tools are freely available and an installation guide is included in the package.) The translation models were trained on the CzEng 1.0 corpus and Europarl. Monolingual data for LM estimation additionally contains WMT news crawls until 2013.