Search results

703 records found

  • Slovenian RoBERTa contextual embeddings model: SloBERTa 2.0

    The monolingual Slovene RoBERTa (Robustly Optimized Bidirectional Encoder Representations from Transformers) model is a state-of-the-art model that represents words/tokens as contextually dependent word embeddings, used for various NLP tasks. Word embeddings can be extracted for every word occurrence and then used to train a model for an end task, but typically the whole RoBERTa model is fine-tuned end-to-end. SloBERTa is closely related to the French CamemBERT model (https://camembert-model.fr/). The corpora used for training the model contain 3.47 billion tokens in total, and the subword vocabulary contains 32,000 tokens. The scripts and programs used for data preparation and model training are available at https://github.com/clarinsi/Slovene-BERT-Tool. Compared with the previous version (1.0), this version was trained for a further 61 epochs (v1.0: 37 epochs, v2.0: 98 epochs), for a total of 200,000 iterations/updates. The model released here is a PyTorch neural network model, intended for use with the transformers library (https://github.com/huggingface/transformers; sloberta.2.0.transformers.tar.gz) or the fairseq library (https://github.com/pytorch/fairseq; sloberta.2.0.fairseq.tar.gz).
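
    As a minimal sketch, the transformers variant can be loaded like any RoBERTa checkpoint. The Hugging Face identifier "EMBEDDIA/sloberta" below is an assumption; a local path to the unpacked sloberta.2.0.transformers archive can be passed instead:

      # Extract contextual embeddings with the transformers library.
      # "EMBEDDIA/sloberta" is an assumed hub identifier; a local directory
      # containing the unpacked archive works the same way.
      import torch
      from transformers import AutoModel, AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("EMBEDDIA/sloberta")
      model = AutoModel.from_pretrained("EMBEDDIA/sloberta")

      inputs = tokenizer("Ljubljana je glavno mesto Slovenije.", return_tensors="pt")
      with torch.no_grad():
          outputs = model(**inputs)

      # One contextual embedding per subword token: (batch, seq_len, hidden_size).
      print(outputs.last_hidden_state.shape)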
  • Smashcima

    Smashcima is a library and framework for synthesizing images of handwritten music, intended for creating synthetic training data for OMR models. It is primarily meant to be used as part of optical music recognition workflows, especially with domain adaptation in mind. The target user is therefore a machine-learning, document-processing, library-science, or computational-musicology researcher with minimal skills in Python programming. Smashcima is the only tool that simultaneously:
    - synthesizes handwritten music notation,
    - produces not only raster images but also segmentation masks, classification labels, bounding boxes, and more,
    - synthesizes entire pages as well as individual symbols,
    - synthesizes background paper textures,
    - also synthesizes polyphonic and pianoform music images,
    - accepts plain MusicXML as input,
    - is written in Python, which simplifies its adoption and extensibility.
    Smashcima therefore brings a unique new capability to optical music recognition (OMR): synthesizing a near-realistic image of handwritten sheet music from just a MusicXML file. As opposed to notation editors, which work with a fixed set of fonts and a fixed set of layout rules, it can adapt handwriting styles from existing OMR datasets to arbitrary music (beyond the music encoded in those datasets) and randomize layout to simulate the imprecision of handwriting, while guaranteeing the semantic correctness of the output rendering. Crucially, the rendered image comes with the positions of all visual elements of the music notation, so both object-detection-based and sequence-to-sequence OMR pipelines can use Smashcima as a synthesizer of training data; a sketch of such annotated output follows below. (In combination with the LMX canonical linearization of MusicXML, one can imagine the endless possibilities of running Smashcima on inputs from a MusicXML generator.)
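
    To picture what "image plus annotations" means for a training-data consumer, here is a minimal illustrative sketch; the class and field names are assumptions for illustration only, not Smashcima's actual API:

      # Illustrative shape of synthesizer output: a raster image paired with
      # per-symbol annotations (labels and bounding boxes). These dataclasses
      # are hypothetical, not Smashcima's real classes.
      from dataclasses import dataclass

      @dataclass
      class SymbolAnnotation:
          label: str                                # e.g. "noteheadBlack"
          bbox: tuple[float, float, float, float]   # x, y, width, height in pixels

      @dataclass
      class SyntheticPage:
          image_path: str
          annotations: list[SymbolAnnotation]

      page = SyntheticPage(
          image_path="page_001.png",
          annotations=[SymbolAnnotation("noteheadBlack", (120.0, 340.0, 14.0, 11.0))],
      )
      print(len(page.annotations), "annotated symbols on", page.image_path)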
  • RÚV-DI Speaker Diarization v5 models (21.05)

    This archive contains files generated from the recipe in kaldi-speaker-diarization/v5/. Its contents should be placed in a directory with the same layout, with symbolic links to diarization/, sid/, steps/, etc. It was created when Kaldi's master branch was at git commit 321d3959dabf667ea73cc98881400614308ccbbb. The v5 models are trained on the Althingi Parliamentary Speech corpus available on malfong.is. They use MFCCs, x-vectors, PLDA, and AHC (agglomerative hierarchical clustering). The recipe uses the Icelandic RÚV-DI corpus as two held-out sets for tuning parameters; the RÚV-DI corpus is currently not publicly available.
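
    A minimal sketch of the symbolic-link setup the description asks for, assuming a standard Kaldi checkout; the recipe path and the exact set of linked directories are assumptions:

      # Link shared Kaldi recipe directories into the unpacked archive.
      # Paths are illustrative; adjust to your Kaldi checkout and recipe.
      from pathlib import Path

      kaldi_recipe = Path("kaldi/egs/callhome_diarization/v2")  # assumed source recipe
      workdir = Path("ruv-di-v5")
      workdir.mkdir(exist_ok=True)

      for name in ("diarization", "sid", "steps", "utils"):
          link = workdir / name
          if not link.exists():
              link.symlink_to(kaldi_recipe.resolve() / name)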
  • CORDEX inflectional lookup data 1.0

    The inflectional data lookup module is an optional component of the cordex library (https://github.com/clarinsi/cordex/) that significantly improves the quality of its results. The module consists of a pickled dictionary of 111,660 lemmas, mapping each lemma to its corresponding word forms. Each word form in the dictionary is accompanied by its MULTEXT-East morphosyntactic description, relevant features (custom features extracted from the morphosyntactic descriptions with the help of https://gitea.cjvt.si/generic/conversion_utils), and its frequency within the Gigafida 2.0 corpus (http://hdl.handle.net/11356/1320), or Gigafida 1.0 when other information is unavailable. The dictionary is used to select the most frequent word form of a lemma that satisfies additional filtering conditions (e.g. finding the most used singular word form of the lemma "centralen", i.e. "centralni").
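
    A hedged sketch of how such a pickled lookup might be queried; the file name and the record layout (word form, MSD, features, frequency) are assumptions, not the actual cordex data format:

      # Query a pickled lemma -> word-forms dictionary and pick the most
      # frequent form matching a filter. Structure is hypothetical.
      import pickle

      with open("inflectional_lookup.pickle", "rb") as f:
          lookup = pickle.load(f)

      # Assumed record: (word_form, msd, features, frequency)
      forms = lookup.get("centralen", [])
      singular = [entry for entry in forms if "singular" in entry[2]]
      if singular:
          most_frequent = max(singular, key=lambda entry: entry[3])
          print(most_frequent[0])  # expected: "centralni"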
  • The CLASSLA-StanfordNLP model for lemmatisation of standard Slovenian 1.4

    The model for lemmatisation of standard Slovenian was built with the CLASSLA-StanfordNLP tool (https://github.com/clarinsi/classla-stanfordnlp) by training on the ssj500k training corpus (http://hdl.handle.net/11356/1210) and using the Sloleks inflectional lexicon (http://hdl.handle.net/11356/1230). The estimated F1 of the lemma annotations is ~99.7. The difference from the previous version of the model is that the Sloleks inflectional lexicon has been moved into the morphosyntactic model.
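
    A minimal sketch of lemmatising Slovenian text with a CLASSLA pipeline, assuming the stanza-style API of the classla package; whether this exact call loads this particular model release is an assumption:

      # Lemmatise standard Slovenian with a CLASSLA pipeline.
      import classla

      classla.download("sl")  # fetch standard Slovenian models (first run only)
      nlp = classla.Pipeline("sl", processors="tokenize,pos,lemma")

      doc = nlp("Maribor je drugo največje mesto v Sloveniji.")
      for word in doc.sentences[0].words:
          print(word.text, word.lemma)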
  • Semi-supervised Icelandic-Polish Translation System (22.09)

    This bidirectional Icelandic-Polish translation model was trained with fairseq (https://github.com/facebookresearch/fairseq) using semi-supervised translation, starting from the mBART50 model. The model was trained with a multi-task curriculum: it first learned to denoise sentences, was then trained to translate using aligned parallel texts, and was finally given monolingual texts in both Icelandic and Polish, from which it iteratively created back-translations. For the PL-IS direction the model achieves a BLEU score of 27.60 on held-out true parallel data and 15.30 on the out-of-domain Flores dev set. For the IS-PL direction it achieves 27.70 on the true parallel data and 13.30 on the Flores dev set.
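
    A hedged sketch of loading a fairseq checkpoint through the hub interface and translating one sentence; the directory layout, checkpoint and SentencePiece file names, and the use of TransformerModel (an mBART-derived model may require a different model class and language-token handling) are all assumptions about this release:

      # Load a fairseq checkpoint and translate one sentence.
      from fairseq.models.transformer import TransformerModel

      model = TransformerModel.from_pretrained(
          "is-pl-model",                         # assumed unpacked model directory
          checkpoint_file="checkpoint_best.pt",  # assumed checkpoint file name
          bpe="sentencepiece",
          sentencepiece_model="is-pl-model/sentencepiece.bpe.model",  # assumed
      )
      print(model.translate("Halló heimur!"))    # Icelandic -> Polish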