CLARIN Tool Portal

Tokenizer for Icelandic text (2.3.1)

3 resources

Tokenizer is a compact pure-Python package (Python 2.7 and 3) and command-line tool for tokenizing Icelandic text. It converts input text into a stream of tokens, where each token is a single word, punctuation mark, number or amount, date or time, e-mail address, URL/URI, etc. It also segments the token stream into sentences, handling corner cases such as abbreviations and dates in the middle of sentences. More information: https://github.com/mideind/Tokenizer
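To make the idea of a typed token stream with sentence segmentation concrete, here is a minimal standard-library sketch. It is not the Tokenizer package's code or API: the names `toy_tokenize`, `split_sentences`, and the three-entry `ABBREVIATIONS` set are hypothetical, and the rules are far cruder than a real tokenizer needs (no dates, amounts, or leading quotation marks).

```python
import re
from typing import Iterator, List, Tuple

Token = Tuple[str, str]  # (kind, text)

# Hypothetical, tiny abbreviation list; a real tokenizer needs a much larger one.
ABBREVIATIONS = {"t.d.", "o.s.frv.", "u.þ.b."}
TRAILING_PUNCT = ".,;:!?"


def classify(chunk: str) -> str:
    """Assign a coarse token kind to a chunk with no trailing punctuation."""
    if chunk.startswith(("http://", "https://")):
        return "URL"
    if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", chunk):
        return "EMAIL"
    if re.fullmatch(r"\d+(?:[.,]\d+)*", chunk):
        return "NUMBER"
    return "WORD"


def toy_tokenize(text: str) -> Iterator[Token]:
    """Yield (kind, text) pairs for whitespace-separated chunks."""
    for chunk in text.split():
        # A known abbreviation keeps its periods and stays one token.
        if chunk.lower() in ABBREVIATIONS:
            yield ("WORD", chunk)
            continue
        # Peel trailing punctuation off into separate tokens.
        trailing: List[str] = []
        while chunk and chunk[-1] in TRAILING_PUNCT:
            trailing.append(chunk[-1])
            chunk = chunk[:-1]
        if chunk:
            yield (classify(chunk), chunk)
        for mark in reversed(trailing):
            yield ("PUNCTUATION", mark)


def split_sentences(tokens: List[Token]) -> List[List[Token]]:
    """Close a sentence after each '.', '!' or '?' punctuation token."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for token in tokens:
        current.append(token)
        if token[0] == "PUNCTUATION" and token[1] in ".!?":
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


text = "Ég keypti t.d. 3 bækur. Sendu póst á jon@example.is!"
tokens = list(toy_tokenize(text))
sentences = split_sentences(tokens)
for sentence in sentences:
    print(sentence)
```

Because the abbreviation "t.d." is recognized before punctuation is peeled off, its periods do not end the first sentence, which is the kind of corner case the description refers to.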


Yfirlestur 1.0.1 (22.10)

2 resources

Yfirlestur.is is a public website where you can submit Icelandic text and have it checked for spelling and grammar errors. The tool also flags words and structures that might be inappropriate for the text's intended audience. The core spelling and grammar checking functionality of Yfirlestur.is is provided by the GreynirCorrect engine, by the same authors. This software is licensed under the MIT License. More information: https://github.com/icelandic-lt/Yfirlestur


UDConverter 22.01

2 resources

UDConverter is a tool for converting constituency treebanks in the PPCHE (Penn Parsed Corpora of Historical English) format to dependency treebanks that follow the Universal Dependencies framework. The tool is specifically configured to convert treebanks in the IcePaHC format. This version achieves a labeled attachment score (LAS) of 81.39.
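For readers unfamiliar with the metric, LAS is the percentage of tokens whose predicted head and dependency label both match the gold annotation. The sketch below illustrates that definition on a made-up four-token sentence; the function name and the data are hypothetical and have no connection to how UDConverter's 81.39 figure was computed.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def labeled_attachment_score(gold: List[Arc], predicted: List[Arc]) -> float:
    """Return LAS as a percentage of tokens with correct head AND label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return 100.0 * correct / len(gold)


# Hypothetical sentence; heads are 1-based token indices and 0 marks the root.
gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
predicted = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obl")]  # last label differs

las = labeled_attachment_score(gold, predicted)
print(f"LAS = {las:.2f}")  # 3 of 4 tokens match, so this prints LAS = 75.00
```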


Talrómur Utils

13 resources

This is a collection of utilities for text-to-speech (TTS) development using the Talrómur corpus. It contains what is needed to develop and run TTS systems built with Talrómur, including:
- Alignments for all the voices in Talrómur, created with the Montreal Forced Aligner
- Train, evaluation and test splits for all the voices in Talrómur
- Two baseline TTS models and vocoder models


Kaldi L2 Speakers Recipe 22.10

2 resources

This release includes a recipe showing how to combine the corpus "Samromur L2 22.09" [1] with the "Icelandic Language Models with Pronunciations 22.01" [2] to build automatic speech recognition systems using the Kaldi toolkit.
[1] http://hdl.handle.net/20.500.12537/263
[2] http://hdl.handle.net/20.500.12537/172


Grapheme-to-phoneme (g2p) module for Icelandic (22.10)

2 resources

Grapheme-to-phoneme (g2p) module for Icelandic. The module transcribes Icelandic in four pronunciation variants (standard pronunciation, north Iceland, north-east Iceland, south Iceland), at different levels of detail and in four different phonetic alphabets. By default, the output uses the X-SAMPA phonetic alphabet, without syllabification or stress labeling, following standard pronunciation. English words are transcribed using the Icelandic phone set while staying as close as possible to English transcription rules. A pronunciation dictionary is included in the package. The package can be installed from PyPI: pip install ice-g2p


Yfirlestur Docs 22.10

2 resources

Yfirlestur Docs is the source code for an Icelandic spelling and grammar correction add-on for Google Docs. The add-on includes a user interface that annotates errors in the document and lets the user decide whether to apply each suggested replacement. The source code is intended for third-party development and can be installed and tested locally using Node.js. The add-on requires third-party correction software to function. For development and testing, the open-access Yfirlestur.is API produced by Miðeind was used (see https://github.com/icelandic-lt/Yfirlestur), but that API is not intended for production use. This software is licensed under the MIT License. More information: https://github.com/icelandic-lt/Yfirlestur-Docs


GreynirCorrect (3.2.0)

3 resources

GreynirCorrect is a Python 3 package and command-line tool for checking and correcting various types of spelling and grammar errors in Icelandic text. GreynirCorrect relies on the Tokenizer package, by the same authors, to tokenize text. More information can be found at https://github.com/mideind/GreynirCorrect, and detailed documentation at https://yfirlestur.is/doc/.


Webrice extension (22.01)

2 resources

Webrice is a Chrome browser extension that lets users select text on a web page and listen to it instead of reading it. The extension converts Icelandic text to speech.


OCR Post-Processing Tool for Icelandic 22.10

3 resources

This entry consists of two trained transformer models for correcting OCR errors, along with approximately 50,000 line pairs of OCRed and corrected text. The models were trained on approximately 900,000 lines (~7,000,000 tokens), of which only about 50,000 (~400,000 tokens) came from real OCRed texts. Increasing the amount of such data can be expected to improve the tool significantly. More information in README.md.


