Tools
Linguistic Preprocessing Tools for Ukrainian
Stanza - the Stanford library for language processing; it supports Ukrainian using the UD corpus. Features models for tokenization, lemmatization, POS and syntactic analysis.
spaCy for Ukrainian - Ukrainian pipeline optimized for CPU. Components: tok2vec, morphologizer, parser, senter, ner, attribute_ruler, lemmatizer.
sentence_boundary_detection_multilang segments a long, punctuated text into one or more constituent sentences. The key feature is that the model is multi-lingual and language-agnostic at inference time. Supports 49 common languages.
punct_cap_seg_47_language accepts as input lower-cased, unpunctuated, unsegmented text in 47 languages and performs punctuation restoration, true-casing (capitalization), and sentence boundary detection (segmentation).
xlm-roberta_punctuation_fullstop_truecase restores punctuation, applies true-casing (capitalization), and detects sentence boundaries (full stops) in 47 languages.
Pymorphy2 - a morphological analyzer without disambiguation; the Ukrainian language is supported via the old version of VESUM.
Pymorphy3 is the continuation of the unmaintained project [pymorphy2](https://github.com/kmike/pymorphy2), a morphological analyzer (POS tagger + inflection engine) for the Russian and Ukrainian languages.
LanguageTool - spelling, stylistic, and grammar checker, which helps to correct and paraphrase texts.
Stemmer for Ukrainian language - a new stemmer for the Ukrainian language (tree_stem) created via machine learning.
Ispell is an interactive spell-checking program for Unix which supports a large number of European languages. An Emacs interface is available as well as the standard command-line mode.
Text Tonsorium - a tool that automatically constructs and executes workflows, including workflows for normalisation.
tree_stem is a repository that introduces a new stemmer for the Ukrainian language created via machine learning. It outperforms all other stemmers available to date as well as some lemmatizers by the error rate relative to truncation (ERRT) (Paice 1994). It also has the lowest percentage of understemming errors compared to the available stemming algorithms. This repository also contains Python ports of some of the previously published stemmers.
uk-punctcase - an XLM-RoBERTa-Uk model fine-tuned on Ukrainian texts to recover punctuation and case.
ukrainian-word-stress - this package takes text in Ukrainian and adds the stress mark after an accented vowel. This is useful in speech synthesis applications and for preparing text for language learners.
Ukrainian model to restore punctuation and capitalization - a NeMo model that restores punctuation and capitalization in sentences, trained on 10M+ sentences from the UberText 2.0 corpus.
Nlp-uk is an instrument based on the VESUM dictionary and the LanguageTool engine. Supports tokenization, lemmatization, POS analysis, and basic disambiguation.
UDPipe 2 is a Python prototype, capable of performing tagging, lemmatization and syntactic analysis of CoNLL-U input.
NLP Cube is an open-source natural language processing framework with support for the languages included in the UD Treebanks. NLP-Cube performs the following tasks: sentence segmentation, tokenization, POS tagging (both language-independent (UPOS) and language-dependent (XPOS and ATTRs)), lemmatization, and dependency parsing.
Trankit is a light-weight transformer-based Python Toolkit for Multilingual Natural Language Processing.
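The stress marking described in the ukrainian-word-stress entry above (a mark placed after the accented vowel) can be pictured with a small standalone sketch. This is not that package's API: the function name add_stress, the vowel set, and the choice of the Unicode combining acute accent (U+0301) as the mark are assumptions made for illustration, and the stressed vowel is supplied by hand rather than predicted.

```python
# Illustrative sketch only: this is NOT the ukrainian-word-stress API.
COMBINING_ACUTE = "\u0301"  # Unicode combining acute accent (assumed stress mark)
UKRAINIAN_VOWELS = set("аеєиіїоуюяАЕЄИІЇОУЮЯ")


def add_stress(word: str, vowel_number: int) -> str:
    """Insert a stress mark after the n-th vowel (1-based) of `word`."""
    seen = 0
    for i, ch in enumerate(word):
        if ch in UKRAINIAN_VOWELS:
            seen += 1
            if seen == vowel_number:
                # The combining mark follows the vowel it modifies.
                return word[: i + 1] + COMBINING_ACUTE + word[i + 1 :]
    raise ValueError(f"{word!r} has fewer than {vowel_number} vowels")


# The stressed vowel is given manually here; a real tool would predict it.
print(add_stress("молоко", 3))
print(add_stress("мама", 1))
```

A real stress tool has to decide which vowel is stressed, which depends on the word form and sometimes on context; the sketch covers only the final text transformation.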
Word Embeddings & Lightweight NLP Models (Flair, FastText, etc.)
English-Ukrainian Legal Crosslingual Word Embeddings - embeddings trained on legal domain texts and aligned in the same vector space using Vecmap according to their similarity. The embeddings have been developed in the framework of the CEF project MT4ALL.
Ukrainian flair embeddings - a model trained for 25+ epochs on the texts from ubertext2.0 (WIP). Has forward and backward versions of the embeddings.
flair-uk-pos is a Flair model that is ready to use for part-of-speech (UPOS) tagging. It is based on Flair embeddings trained for the Ukrainian language and has superior performance and a very small size (just 72 MB).
fastText (Ukrainian) - an open-source, free, lightweight library for efficient learning of word representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to fit even on mobile devices.
skipgram.uk.300.bin contains pre-trained word vectors for the Ukrainian language, trained with fastText on the (as yet unreleased) UberText 2.0 dataset, collected and processed by lang-uk.
Word embeddings (Word2Vec, GloVe, LexVec) - separate models with 300d vectors for newswire, articles, fiction, juridical texts.
BPEmb - a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia.
fastText Common Crawl & Wikipedia contains pre-trained word vectors for 157 languages, trained on Wikipedia and the Common Crawl using fastText's CBOW model.
General Text Embedding Models - the GTE (General Text Embedding) family of models. It achieves state-of-the-art (SOTA) results in multilingual retrieval tasks and multi-task representation model evaluations when compared to models of similar size. Trained using an encoder-only transformer architecture, resulting in a smaller model size.
LEALLA is a collection of lightweight language-agnostic sentence embedding models supporting 109 languages, distilled from LaBSE. The model is useful for getting multilingual sentence embeddings and for bi-text retrieval.
SONAR_200_text_encoder supports the same 202 languages as NLLB-200. Its embeddings are expected to be equal to those of the official implementation, but the latter remains the source of truth.
Bedrock Titan Text Embeddings v2 - an embedding model that can be used either via the Bedrock InvokeModel API or via Bedrock's batch jobs. For RAG use cases, the former is recommended for embedding queries during search (latency-optimized) and the latter for indexing a corpus (throughput-optimized).
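Several entries above mention bi-text retrieval with multilingual sentence embeddings. As a rough sketch of what that means in practice, the example below picks the candidate whose vector has the highest cosine similarity to a query vector. The three-dimensional vectors are invented toy values, not the output of any model listed here, and the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(query_vec, candidates):
    """Return the key of the candidate vector closest to query_vec."""
    return max(candidates, key=lambda k: cosine_similarity(query_vec, candidates[k]))


# Toy vectors invented for illustration; real embeddings have hundreds of dimensions.
uk_sentence = [0.9, 0.1, 0.2]
en_candidates = {
    "translation": [0.8, 0.2, 0.1],
    "unrelated": [0.1, 0.9, 0.3],
}
print(best_match(uk_sentence, en_candidates))
```

With real models, the vectors would come from encoding a Ukrainian sentence and its candidate translations into the shared vector space; the ranking step stays the same.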
Language Models Supporting Ukrainian
UDify Pretrained Model - weights for the UDify model, and extracted BERT weights in pytorch-transformers format.
Passage Reranking Multilingual BERT is trained using the Microsoft MS Marco Dataset. This training dataset contains approximately 400M tuples of a query, relevant and non-relevant passages.
BERT is a transformers model pretrained on a large corpus of multilingual data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts.
DistilBERT is a distilled version of the BERT base multilingual model. The model is trained on the concatenation of Wikipedia in 104 different languages. The model has 6 layers, 768 dimensions and 12 heads, totaling 134M parameters (compared to 177M parameters for mBERT-base).
CANINE is pretrained on 104 languages using a masked language modeling (MLM) objective. Unlike other models such as BERT and RoBERTa, it doesn't require an explicit tokenizer (such as WordPiece or SentencePiece).
AviLaBSE is a unified model trained on top of Google's LaBSE to add other low-resource language dimensions and then converted to PyTorch. It can be used to map more than 250 languages to a shared vector space. The pre-training process combines masked language modeling with translation language modeling. The model is useful for getting multilingual sentence embeddings and for bi-text retrieval.
RemBERT is pretrained on 110 languages using a masked language modeling (MLM) objective. RemBERT uses small input embeddings and larger output embeddings.
LaBSE is a BERT-based model trained for sentence embedding for 109 languages. The pre-training process combines masked language modeling with translation language modeling. The model is useful for getting multilingual sentence embeddings and for bi-text retrieval.
TwHIN-BERT is a new multi-lingual Tweet language model that is trained on 7 billion Tweets from over 100 distinct languages. TwHIN-BERT differs from prior pre-trained language models as it is trained with not only text-based self-supervision (e.g., MLM), but also with a social objective based on the rich social engagements within a Twitter Heterogeneous Information Network (TwHIN).
LaBSE returns the sentence embeddings (pooler_output) and implements caching. Language-agnostic BERT Sentence Encoder (LaBSE) is a BERT-based model trained for sentence embedding for 109 languages. The pre-training process combines masked language modeling with translation language modeling. The model is useful for getting multilingual sentence embeddings and for bi-text retrieval.
LaBSE is a port of the LaBSE model to PyTorch. It can be used to map 109 languages to a shared vector space.
HPLT BERT for Ukrainian is one of the encoder-only monolingual language models trained as a first release by the HPLT project. It is a so-called masked language model. In particular, this model is a modification of the classic BERT model named LTG-BERT.
LiBERTa is a BERT-like model pre-trained from scratch exclusively for Ukrainian. It was presented at UNLP @ LREC-COLING 2024.
O3ap-sm is a Ukrainian news summarization model fine-tuned on the T5-small architecture. The model has been trained on the Ukrainian Corpus CCMatrix for text summarization tasks.
Ukrainian Roberta was trained with the code provided in a HuggingFace tutorial. The currently released model follows the roberta-base-cased model architecture (12-layer, 768-hidden, 12-heads, 125M parameters).
ukr-paraphrase-multilingual-mpnet-base is a sentence-transformers model fine-tuned for the Ukrainian language. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.
Multilingual_en_ru_uk is a sentence-transformers model. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search. The model is used by Virtual General Practice, a resource for multilingual analysis of patient complaints that determines which medical specialty is needed.
Ukrainian grammar correction - a model trained by the Pravopysnyk team for the Ukrainian NLP shared task in Ukrainian grammar correction. The model is MBart-50-large set to the ukr-to-ukr translation task and finetuned on UA-GEC augmented with a custom dataset generated using the team's synthetic error generation.
LLMs for Ukrainian
MamayLM is a new state-of-the-art LLM targeting the Ukrainian language (released in 2025).
Aya 23 is an open weights research release of an instruction fine-tuned model with highly advanced multilingual capabilities. Aya 23 focuses on pairing a highly performant pre-trained Command family of models with the recently released Aya Collection. The result is a powerful multilingual large language model serving 23 languages.
LLaMAX is a language model with powerful multilingual capabilities without loss of instruction-following capabilities.
Llama-2-7b-Ukrainian is a bilingual pre-trained model supporting Ukrainian and English. Continued pre-training from Llama-2-7b on 5B tokens consisting of 75% Ukrainian documents and 25% English documents from CulturaX.
LLaMAX3-8B is a multilingual language base model, developed through continued pre-training on Llama3, and supports over 100 languages. LLaMAX3-8B can serve as a base model to support downstream multilingual tasks but without instruct-following capability. The model is designed for Text Generation tasks.
aya-101 is a massively multilingual generative language model that follows instructions in 101 languages. Aya outperforms mT0 and BLOOMZ on a wide variety of automatic and human evaluations despite covering double the number of languages.
EuroLLM-1.7B is a model from the EuroLLM project, whose goal is to create a suite of LLMs capable of understanding and generating text in all European Union languages as well as some additional relevant languages. For pre-training, the authors use 256 Nvidia H100 GPUs of the Marenostrum 5 supercomputer, training the model with a constant batch size of 3,072 sequences, which corresponds to approximately 12 million tokens, using the Adam optimizer and BF16 precision.
EuroGPT2 - a model for European languages (EU-24 + Ukrainian). The model follows the same architecture as OpenAI's GPT2 apart from using rotary instead of learned positional embeddings. Training data - Wikimedia dumps (Wikipedia, Wikinews, Wikibooks, Wikisource, Wikivoyage; 20230301). Tokens: 75,167,662,080.
mGPT 13B - a multilingual language model trained on 61 languages from 25 language families. This model was pretrained on 600 GB of texts, mostly from MC4 and Wikipedia.
XLM model was proposed in Cross-lingual Language Model Pretraining by Guillaume Lample and Alexis Conneau, trained on Wikipedia text in 100 languages. The model is a transformer pretrained using a masked language modeling (MLM) objective.
Ukrainian mGPT 1.3B - one of the models derived from the base mGPT-XL (1.3B) model, which was originally trained on 61 languages from 25 language families using Wikipedia and the C4 corpus.
MiniLM-L12-v2 - a sentence-transformers model that maps sentences & paragraphs to a 384-dimensional dense vector space.
GPT2 124M Trained on Ukrainian Fiction is a model trained on a corpus of 4040 fiction books, 2.77 GiB in total. Evaluation on brown-uk gives a perplexity of 50.16.
Mistral 7B OpenOrca oasst Top1 contains AWQ, GPTQ, and GGUF model files. The model is designed for text generation tasks.
CodeKobzar13B is a generative model that was trained on Ukrainian Wikipedia data and Ukrainian language rules. It has knowledge of Ukrainian history, language, literature and culture.
uk4b - models pretrained on 4B tokens from UberText 2.0, designed for text generation and text-conditioned metadata prediction tasks.
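The EuroLLM-1.7B entry above equates a batch of 3,072 sequences with approximately 12 million tokens. A quick check of that arithmetic is sketched below. The sequence length of 4,096 tokens is an assumption that is not stated in the entry; it is simply the value that makes the two figures agree.

```python
def tokens_per_batch(sequences: int, sequence_length: int) -> int:
    """Number of tokens in one batch of fixed-length sequences."""
    return sequences * sequence_length


SEQUENCES_PER_BATCH = 3_072      # taken from the EuroLLM-1.7B entry
ASSUMED_SEQUENCE_LENGTH = 4_096  # assumption: not stated in the entry

total = tokens_per_batch(SEQUENCES_PER_BATCH, ASSUMED_SEQUENCE_LENGTH)
print(f"{total:,} tokens per batch")  # 12,582,912 tokens per batch
```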
Named Entity Recognition and Coreference Resolution for Ukrainian
GLiNER-X is a multilingual Named Entity Recognition (NER) model capable of identifying any entity type.
uk_ner_web_trf_base is a fine-tuned XLM-RoBERTa model that is ready to use for Named Entity Recognition and achieves performance close to the state of the art for the NER task for the Ukrainian language. It has been trained to recognize four types of entities: location (LOC), organizations (ORG), person (PERS) and Miscellaneous (MISC).
uk_core_news (spaCy model) is a Ukrainian pipeline optimized for CPU. Components: tok2vec, morphologizer, parser, senter, ner, attribute_ruler, lemmatizer.
coref-ua is trained on the silver Ukrainian coreference dataset using the F-Coref library. The model was trained on top of the XLM-RoBERTa-base model. According to the metrics retrieved from the evaluation dataset, the model is more precision-oriented.
MITIE NER Model - a model that automatically labels words in unfamiliar texts with the corresponding entities (name, geographical locations, company, etc.). The MITIE library was chosen for named entity recognition. MITIE also provides high quality by combining standard text features and CCA embeddings.
uk_ner_web_trf_large - a fine-tuned XLM-RoBERTa model that is ready to use for Named Entity Recognition and achieves state-of-the-art performance for the NER task for the Ukrainian language. It has been trained to recognize four types of entities: location (LOC), organizations (ORG), person (PERS) and Miscellaneous (MISC).
Flair-uk-ner - a model that is ready to use for Named Entity Recognition. Recognizes four types of entities: location (LOC), organizations (ORG), person (PERS) and Miscellaneous (MISC). The model was fine-tuned on the NER-UK dataset, released by the lang-uk.
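Several of the NER models above label text with four entity types (LOC, ORG, PERS, MISC). As a sketch of how token-level predictions are commonly turned into entity spans, the example below groups tokens by their tags. It assumes a BIO tagging scheme (B- for the first token of an entity, I- for a continuation, O for no entity), which is a convention chosen for illustration and not something the entries above specify; the function name and the sample tags are likewise made up.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []

    def flush():
        # Close the entity being built, if there is one.
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # A B- tag, or an I- tag of a different type, opens a new entity.
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:
            # "O": outside any entity.
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans


# Hand-written tags for illustration, not real model output.
tokens = ["Тарас", "Шевченко", "народився", "в", "Моринцях"]
tags = ["B-PERS", "I-PERS", "O", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))
```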
Sentiment Analysis & Opinion Mining Tools
Emotions detector in Ukrainian texts is the first emotion detection model for Ukrainian texts, fine-tuned on the ukr-emotions-binary dataset for multi-label classification. Based on the intfloat/multilingual-e5-large architecture, the model detects the presence or absence of six basic emotions — Joy, Anger, Fear, Disgust, Surprise, and Sadness — as well as the absence of any emotion.
HENSOLDT ANALYTICS - services for speech-to-text, language identification, sentiment analysis, named entity detection, keyword spotting, age detection, gender detection, and summarization.
Machine Translation
Dragoman is a sentence-level SOTA English-Ukrainian translation model. It's trained using a two-phase pipeline: pretraining on a cleaned Paracrawl dataset and an unsupervised data selection phase on turuta/Multi30k-uk.
OPUS-tools is a collection of tools for searching and downloading OPUS data.
Multilizer Localization Tools 1.0.0 are the easiest way to create and manage multilingual versions of software, documents, webpages and other content. With highly usable editor features, dictionaries and validations, the focus can be on the essential: translation.
Moses Web Demo is an interactive web demo of selected ÚFAL MT systems.
MTData automates the collection and preparation of machine translation (MT) datasets. It provides CLI and python APIs, which can be used for preparing MT experiments.
Tilde MT Machine Translation engine 1.0.0 - a custom neural machine translation engine.
The English-Ukrainian Legal Translation Model is a neural translation model trained via unsupervised machine translation using Monoses. The model has been developed in the framework of the CEF project MT4ALL.
HelsinkiNLP - OPUS-MT 1.0.0 provides multilingual machine translation using neural networks.
mBART is fine-tuned for multilingual machine translation. It was introduced in the paper Multilingual Translation with Extensible Multilingual Pretraining and Finetuning.
COMET receives a triplet (source sentence, translation, reference translation) and returns a score that reflects the quality of the translation compared to both source and reference. The model is intended to be used for MT evaluation.
SynEst Translation Models are machine translation models focused on translating from and into the Estonian language. The models are based on the NLLB-1.3B multilingual model.
OPUS-CAT MT Engine is a Windows-based machine translation system built on the Marian NMT framework. OPUS-CAT MT Engine makes it possible to use a large selection of advanced neural machine translation models natively on Windows computers. The primary purpose of OPUS-CAT Engine is to provide professional translators local, secure, and confidential neural machine translation in computer-assisted translation tools (CAT tools), which are usually Windows-based.
OPUS-MT - an app that integrates publicly available translation models from the OPUS-MT project to bring fast and secure machine translation to the desktop of end users.
EdUKate translation software 1 - a software package that includes three tools: web frontend for machine translation featuring phonetic transcription of Ukrainian suitable for Czech speakers, API server and a tool for translation of documents with markup (html, docx, odt, pptx, odp,...).
OPUS-MT models for Ukrainian - an app that integrates publicly available translation models from the OPUS-MT project to bring fast and secure machine translation to the desktop of end users.
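The COMET entry above describes its input as a triplet of source sentence, translation, and reference translation. The sketch below shows one way to hold such triplets before passing them to an evaluation model. The MTSample class is hypothetical, and the field names src, mt, and ref are an assumption about the expected input keys, so they should be checked against the COMET documentation. No scoring is performed here.

```python
from dataclasses import asdict, dataclass


@dataclass
class MTSample:
    """One evaluation triplet (hypothetical helper, not part of COMET)."""
    src: str  # source sentence
    mt: str   # machine translation to be scored
    ref: str  # human reference translation


samples = [
    MTSample(src="Добрий день!", mt="Good day!", ref="Good afternoon!"),
]

# A list of plain dicts, one per triplet; the key names are assumed.
payload = [asdict(sample) for sample in samples]
print(payload)
```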
Multilingual Speech Translation Models
Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting, without the need for fine-tuning.
Massively Multilingual Speech (MMS) is a model fine-tuned for multi-lingual ASR and part of Facebook's Massive Multilingual Speech project. This checkpoint is based on the Wav2Vec2 architecture and makes use of adapter models to transcribe 1000+ languages.
mHuBERT-147 is a family of compact and competitive multilingual HuBERT models trained on 90K hours of open-license data in 147 languages. Different from traditional HuBERTs, mHuBERT-147 models are trained using faiss IVF discrete speech units. Training employs two-level up-sampling, by language and by data source.
Chatbot Tools
HelpUkraineBot - this chatbot is Latvia’s assistance to Ukraine.
Tradukka translator (Spanish-X)
OPUS-MT Telegram Translation Bot
Ukrainian - Czech Telegram Translation Bot
Ukrainian - Czech Messenger Translation Bot
Charles Translator for Ukraine - a project whose primary objective is to help refugees from Ukraine by narrowing the communication gap between them and other people in the Czech Republic. This is a machine translation system for Czech-Ukrainian that aims to be of higher quality than Google Translate and is free to use through a web app, an Android app, and a REST API.