Result filters

Metadata provider

Language

  • English
  • Polish

Resource type

Availability

Active filters:

  • Language: English
  • Language: Polish
Loading...
19 record(s) found

Search results

  • Universal Dependencies 2.10 models for UDPipe 2 (2022-07-11)

    Tokenizer, POS Tagger, Lemmatizer and Parser models for 123 treebanks of 69 languages of Universal Depenencies 2.10 Treebanks, created solely using UD 2.10 data (https://hdl.handle.net/11234/1-4758). The model documentation including performance can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_210_models . To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
  • CUBBITT Translation Models (en-pl) (v1.0)

    CUBBITT En-Pl translation models, exported via TensorFlow Serving, available in the Lindat translation service (https://lindat.mff.cuni.cz/services/translation/). Models are compatible with Tensor2tensor version 1.6.6. For details about the model training (data, model hyper-parameters), please contact the archive maintainer. Evaluation on newstest2020 (BLEU): en->pl: 12.3 pl->en: 20.0 (Evaluated using multeval: https://github.com/jhclark/multeval)
  • CorPipe 23 multilingual CorefUD 1.1 model (corpipe23-corefud1.1-231206)

    The `corpipe23-corefud1.1-231206` is a `mT5-large`-based multilingual model for coreference resolution usable in CorPipe 23 (https://github.com/ufal/crac2023-corpipe). It is released under the CC BY-NC-SA 4.0 license. The model is language agnostic (no _corpus id_ on input), so it can be used to predict coreference in any `mT5` language (for zero-shot evaluation, see the paper). However, note that the empty nodes must be present already on input, they are not predicted (the same settings as in the CRAC23 shared task).
  • Paralela corpus and search engine

    Paralela is as an open-ended, opportunistic parallel corpus of Polish-English and English-Polish translations. It currently contains 262 million words in 10,877,000 translation segments. The Paralela online search engine supports the SlopeQ query syntax for bilingual Polish-English corpus queries for the full dataset. Both the full texts and query results can be accessed and exported through the online application at http://paralela.clarin-pl.eu.
  • Universal Dependencies 2.6 models for UDPipe 2 (2020-08-31)

    Tokenizer, POS Tagger, Lemmatizer and Parser models for 99 treebanks of 63 languages of Universal Depenencies 2.6 Treebanks, created solely using UD 2.6 data (https://hdl.handle.net/11234/1-3226). The model documentation including performance can be found at https://ufal.mff.cuni.cz/udpipe/2/models#universal_dependencies_26_models . To use these models, you need UDPipe version 2.0, which you can download from https://ufal.mff.cuni.cz/udpipe/2 .
  • CorpoGrabber

    CorpoGrabber: The Toolchain to Automatic Acquiring and Extraction of the Website Content Jan Kocoń, Wroclaw University of Technology CorpoGrabber is a pipeline of tools to get the most relevant content of the website, including all subsites (up to the user-defined depth). The proposed toolchain can be used to build a big Web corpora of text documents. It requires only the list of the root websites as the input. Tools composing CorpoGrabber are adapted to Polish, but most subtasks are language independent. The whole process can be run in parallel on a single machine and includes the following tasks: downloading of the HTML subpages of each input page URL [1], extracting of plain text from each subpage by removing boilerplate content (such as navigation links, headers, footers, advertisements from HTML pages) [2], deduplication of plain text [2], removing of bad quality documents utilizing Morphological Analysis Converter and Aggregator (MACA) [3], tagging of documents using Wrocław CRF Tagger (WCRFT) [4]. Last two steps are available only for Polish. The result is a corpora as a set of tagged documents for each website. References [1] https://www.httrack.com/html/faq.html [2] J. Pomikalek. 2011. Removing Boilerplate and Duplicate Content from Web Corpora. Ph.D. Thesis. Masaryk University, Faculcy of Informatics. Brno. [3] A. Radziszewski, T. Sniatowski. 2011. Maca – a configurable tool to integrate Polish morphological data. Proceedings of the Second International Workshop on Free/Open-Source Rule-Based Machine Translation. Barcelona, Spain. [4] A. Radziszewski. 2013. A tiered CRF tagger for Polish. Intelligent Tools for Building a Scientific Information Platform: Advanced Architectures and Solutions. Springer Verlag.
  • CorPipe 23 multilingual CorefUD 1.2 model (corpipe23-corefud1.2-240906)

    The `corpipe23-corefud1.2-240906` is a `mT5-large`-based multilingual model for coreference resolution usable in CorPipe 23 <https://github.com/ufal/crac2023-corpipe>. It is released under the CC BY-NC-SA 4.0 license. The model is language agnostic (no corpus id on input), so it can be in theory used to predict coreference in any `mT5` language. However, the model expects empty nodes to be already present on input, predicted by the https://www.kaggle.com/models/ufal-mff/crac2024_zero_nodes_baseline/. This model was present in the CorPipe 24 paper as an alternative to a single-stage approach, where the empty nodes are predicted joinly with coreference resolution (via http://hdl.handle.net/11234/1-5672), an approach circa twice as fast but of slightly worse quality.
  • Universal Dependencies 2.5 Models for UDPipe (2019-12-06)

    Tokenizer, POS Tagger, Lemmatizer and Parser models for 94 treebanks of 61 languages of Universal Depenencies 2.5 Treebanks, created solely using UD 2.5 data (http://hdl.handle.net/11234/1-3105). The model documentation including performance can be found at http://ufal.mff.cuni.cz/udpipe/models#universal_dependencies_25_models . To use these models, you need UDPipe binary version at least 1.2, which you can download from http://ufal.mff.cuni.cz/udpipe . In addition to models itself, all additional data and value of hyperparameters used for training are available in the second archive, allowing reproducible training.
  • Universal Dependencies 2.4 Models for UDPipe (2019-05-31)

    Tokenizer, POS Tagger, Lemmatizer and Parser models for 90 treebanks of 60 languages of Universal Depenencies 2.4 Treebanks, created solely using UD 2.4 data (http://hdl.handle.net/11234/1-2988). The model documentation including performance can be found at http://ufal.mff.cuni.cz/udpipe/models#universal_dependencies_24_models . To use these models, you need UDPipe binary version at least 1.2, which you can download from http://ufal.mff.cuni.cz/udpipe . In addition to models itself, all additional data and value of hyperparameters used for training are available in the second archive, allowing reproducible training.
  • WiKNN Text Classifier

    WiKNN is an online text classifier service for Polish and English texts. It supports hierarchical labelled classification of user-submitted texts with Wikipedia categories. WiKNN is available through a web-based interface (http://pelcra.clarin-pl.eu/tools/classifier/) and as a REST service with interactive documentation available at http://clarin.pelcra.pl/apidocs/wiknn.