CLARIN Tool Portal

OCR Post-Processing Tool for Icelandic 22.10

3 resources

This entry contains two trained transformer models for correcting OCR errors, along with approximately 50,000 line pairs of OCRed and corrected text. The models were trained on approximately 900,000 lines (~7,000,000 tokens), of which only 50,000 (~400,000 tokens) came from real OCRed texts. Increasing the amount of real OCR data can therefore be expected to improve the tool significantly. More information is in README.md.


Punctuation model (20.09)

9 resources

A Python package that punctuates Icelandic text. The input is unpunctuated text, and punctuated text is returned. The user can choose between two punctuation models: a BERT-based Transformer and a bidirectional RNN ([Punctuator 2](www.github.com/ottokart/punctuator2)) in TensorFlow 2.


Yfirlestur Word 22.10

2 resources

Yfirlestur Word is the source code for a spelling and grammar correction add-on for Icelandic, for use with Microsoft Word. The plugin annotates errors and applies replacements based on user interaction. The source code is intended for third-party development and can be installed and tested locally using Node.js. The plugin relies on third-party correction software for its functionality. For development and testing, the open-access Yfirlestur.is API produced by Miðeind was used (see https://github.com/icelandic-lt/Yfirlestur); that API is not intended for production use. This software is licensed under the MIT License. More information at https://github.com/icelandic-lt/Yfirlestur-Word.


GreynirSeq - A Natural Language Processing Toolkit for Icelandic (v0.2.0)

2 resources

GreynirSeq is a natural language processing toolkit for Icelandic, focused on sequence modeling with neural networks. The modeling part of GreynirSeq (nicenlp) is built on top of Fairseq from Meta, which is itself built on PyTorch. This version (v0.2.0) supports POS tagging, NER tagging and machine translation through a command-line interface. For updated versions of the software, see https://github.com/mideind/GreynirSeq


GreynirCorrect (1.0.0)

3 resources

GreynirCorrect is a Python package and command-line tool for checking and correcting context-independent spelling errors in Icelandic text. GreynirCorrect relies on the Tokenizer package, by the same authors, to tokenize text. More information at: https://github.com/mideind/GreynirCorrect


BinPackage (0.3.1)

3 resources

BinPackage is a Python package that embeds the vocabulary of the Database of Modern Icelandic Inflection (DMII/BÍN, bin.arnastofnun.is) and offers various lookups and queries of the data. The database, maintained by The Árni Magnússon Institute for Icelandic Studies, contains over 6.5 million entries, over 3.1 million unique word forms, and about 300,000 distinct lemmas. It has been encapsulated in an easy-to-install Python package and compressed from a 400+ megabyte CSV file to an indexed binary structure of about 80 megabytes. More information at: https://github.com/mideind/BinPackage


Alexia: Lexicon Acquisition Tool for Icelandic (Orðtökutól) 1.0

2 resources

The purpose of the lexicon acquisition tool is to facilitate the development and expansion of online dictionaries and glossaries, particularly the Database of Modern Icelandic Inflection (DMII/BÍN) and ISLEX. The tool is designed around the Icelandic Gigaword Corpus (IGC) and the information contained in its TEI-formatted documents, so it performs best when it can use the part-of-speech tags, lemmas and word forms defined in the IGC. It can, however, take any corpus as input, provided the corpus uses either the same TEI format as the IGC or a plain text format.

The output files, examples of which are included, are the following:

- Frequency per word form, with no extra information added. Useful for picking candidates for online dictionaries and glossaries.
- Frequency per lemma, with no extra information added. Useful for picking candidates for online dictionaries and glossaries.
- Frequency per word form, including all possible lemmas for each word form. Shows whether a word form can belong to more than one word class and whether the automatic lemmatization is working correctly.
- Frequency per lemma, including all possible word forms for each lemma. Shows whether a certain word form appears much more or less frequently than the others, and thus whether it is used mainly as part of a fixed expression.
- Frequency per lemma, including the text types in which the lemma appears (e.g. news, mathematics or football). The frequency for each text type can also be examined in descending order. Facilitates the creation of specialized glossaries (e.g. a glossary of sports-related words).

Also included is a list of approximately 60 thousand stop words, collected manually from the IGC. These include foreign words, typos, misspelled words, lemmatization errors and abbreviations.
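The first four frequency lists described above can be illustrated with a minimal Python sketch. This is not Alexia's own code or API: the function name `frequency_lists` and the toy (word form, lemma) pairs are hypothetical, and the sketch assumes the corpus has already been tokenized and lemmatized.

```python
from collections import Counter, defaultdict

# Hypothetical toy input: (word form, lemma) pairs, as they might be
# read from a tagged and lemmatized corpus. Not real IGC data.
tokens = [
    ("hestar", "hestur"),
    ("hest", "hestur"),
    ("hestar", "hestur"),
    ("færi", "fara"),
    ("færi", "færi"),
]


def frequency_lists(pairs):
    """Build four frequency tables from (word form, lemma) pairs."""
    form_freq = Counter(form for form, _ in pairs)      # frequency per word form
    lemma_freq = Counter(lemma for _, lemma in pairs)   # frequency per lemma
    lemmas_per_form = defaultdict(Counter)              # word form -> its lemmas
    forms_per_lemma = defaultdict(Counter)              # lemma -> its word forms
    for form, lemma in pairs:
        lemmas_per_form[form][lemma] += 1
        forms_per_lemma[lemma][form] += 1
    return form_freq, lemma_freq, lemmas_per_form, forms_per_lemma


form_freq, lemma_freq, lemmas_per_form, forms_per_lemma = frequency_lists(tokens)

# Word forms in descending order of frequency, with their candidate lemmas.
for form, count in form_freq.most_common():
    print(form, count, dict(lemmas_per_form[form]))
```

In this toy data the word form "færi" maps to two different lemmas, which is the kind of ambiguity the third output file is meant to surface.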


COMBO-based UD Parser for Icelandic 22.12

6 resources

This Universal Dependencies (UD) parser for Icelandic was trained with COMBO [1] on v2.11 of UD_Icelandic-IcePaHC [2] and UD_Icelandic-Modern [3]. Texts in UD_Icelandic-Modern [3] labeled RUV_TGS_2017 and RUV_ESP_2017 were excluded, because they were originally parsed with COMBO-based UD Parser 22.10 [4] and the output subsequently corrected. The parser uses information from an ELECTRA language model [5]. Its unlabeled attachment score (UAS) is 88.80 (89.00 on a pre-tokenized text file) and its labeled attachment score (LAS) is 85.52 (85.71 if pre-tokenized).

[1] COMBO: https://gitlab.clarin-pl.eu/syntactic-tools/combo/
[2] UD_Icelandic-IcePaHC: https://github.com/UniversalDependencies/UD_Icelandic-IcePaHC/
[3] UD_Icelandic-Modern: https://github.com/UniversalDependencies/UD_Icelandic-Modern/
[4] COMBO-based UD Parser 22.10: http://hdl.handle.net/20.500.12537/272
[5] electra-base-igc-is: https://huggingface.co/jonfd/electra-base-igc-is
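For readers unfamiliar with the two metrics, the following minimal Python sketch shows how UAS and LAS are commonly computed. It is not the evaluation script used for this parser: the function name `attachment_scores` and the toy data are hypothetical, and the sketch assumes that gold and predicted tokens are already aligned one to one, which roughly corresponds to the pre-tokenized setting.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) in percent.

    Both arguments are parallel lists of (head index, dependency label)
    pairs, one pair per token.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    # UAS: the predicted head matches the gold head.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: both the head and the dependency label match.
    full_hits = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * head_hits / total, 100.0 * full_hits / total


# Hypothetical four-token sentence; head 0 marks the root.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "punct")]

uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

In the toy sentence, three of four heads are correct and two of four tokens have both the correct head and the correct label. LAS can never exceed UAS, which is consistent with the scores reported for the parser.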


ABLTagger (PoS) - 3.0.0

3 resources

A part-of-speech (PoS) tagger for Icelandic. This submission contains pretrained models for ABLTagger v3.0.0 in two versions, small and large. Both work with the revised tagset and achieve an accuracy of ~96.7% and ~97.8%, respectively, on MIM-Gold (cross-validation, excluding "x" and "e" tags). For installation, usage, and other instructions, see https://github.com/icelandic-lt/POS. That repository also shows whether a newer version is available (see README.md, versions).

On CLARIN:
- Model files


Heyra (1.0)

2 resources

Heyra is an Android application that provides three loosely coupled components: an implementation of Android's speech recognition interface, an intent handler activity for speech recognition actions from other applications, and an input method service (i.e. a virtual keyboard) that can either be used on its own or launched by supported applications. Heyra can be downloaded from the Google Play Store at https://play.google.com/store/apps/details?id=is.tiro.heyra

