2015 — 2026

Model training

From MATLAB, manual features, and classic MLPs through BERT- and T5-era NLP to Polish LLMs, plus vision and audio stacks worth naming—what I have actually trained and shipped.

Deep Learning NLP Computer Vision Audio Open Source

Over the years I have trained and fine-tuned many models, not only quick experiments but also long-running research efforts and production releases. The sections below split that work by modality and task. Later on this page, the Hugging Face block lists the public VoiceLab catalog I contributed to with the team.

My earliest pipelines were in MATLAB: manual feature selection, classical descriptors, and plain MLPs. Then came CNN stacks for vision (ResNet-style backbones, dense blocks, whatever the benchmark called for). NLP moved through BERT-family encoders and T5-style text-to-text models, and on to today’s huge transformers and LLMs at scale: same scientific habits, different parameter counts and tooling.

Tooling, honestly: most custom training stays in PyTorch. I reach for Lightning for easy checkpointing and clean validation loops. After the LLM wave, the default is Hugging Face, with hosted APIs when latency or cost wins.

These days, model training has become less central than designing pipelines: ordering API calls, deciding what goes into them, and, of course, the prompts. I used to wire things in plain code; LangChain earned its place when retrieval, routing, or multi-tool orchestration was faster than writing HTTP clients and parsers myself, and these days I often skip that extra dependency again.

Computer vision

Image classification with CNN encoders: residual and efficient backbones, custom heads, and task-specific losses for biomedical settings (dermoscopy, microscopy, related benchmarks) plus other vision problems in my research line.

Bibliography and context live under biomedical imaging.

Object detection and instance segmentation on litter and merged waste benchmarks. Models I trained end-to-end for those benchmarks included:

  • YOLO-family single-shot detectors
  • Two-stage Faster R-CNN and Mask R-CNN for instance segmentation
  • EfficientDet and DETR

Training was in PyTorch with competition-style evaluation. Write-up and links: waste detection.

Generative image models in the GAN era meant convolutional generator–discriminator pairs, for example 64×64 pixel-character generation and dataset work on Tiny Hero.
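The generator half of such a pair can be sketched as a standard DCGAN-style transposed-convolution stack (layer widths here are illustrative, not the exact Tiny Hero models):

```python
import torch
import torch.nn as nn

# DCGAN-style generator: latent vector -> 64x64 RGB image.
class Generator(nn.Module):
    def __init__(self, z_dim=100, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim, ch * 8, 4, 1, 0, bias=False),  # 4x4
            nn.BatchNorm2d(ch * 8), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 8, ch * 4, 4, 2, 1, bias=False),  # 8x8
            nn.BatchNorm2d(ch * 4), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 4, ch * 2, 4, 2, 1, bias=False),  # 16x16
            nn.BatchNorm2d(ch * 2), nn.ReLU(True),
            nn.ConvTranspose2d(ch * 2, ch, 4, 2, 1, bias=False),      # 32x32
            nn.BatchNorm2d(ch), nn.ReLU(True),
            nn.ConvTranspose2d(ch, 3, 4, 2, 1, bias=False),           # 64x64
            nn.Tanh(),  # pixel values in [-1, 1]
        )

    def forward(self, z):
        return self.net(z)

g = Generator()
fake = g(torch.randn(2, 100, 1, 1))
print(fake.shape)  # torch.Size([2, 3, 64, 64])
```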

NLP and text

Along the way: HerBERT-style classification, T5-style keyword heads, and full decoder-only LLMs. Sequence modeling covered punctuation restoration, keyword-style text-to-text generation, and later large-scale Polish generative LMs (TRURL family) with quantized variants. Pointers: PolEval punctuation, Reedy, and the Hugging Face section on this page for public checkpoints.

Audio and video

Environmental sound and species ID meant turning field recordings into mel spectrograms, then CNN stacks over time–frequency tiles with aggregation across clips (the usual “treat audio like texture” move before audio transformers were everywhere). I curated baselines and wrote about the pipeline for bird song classification.
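A compressed sketch of that pipeline, using a plain STFT instead of mel filterbanks to stay self-contained (a real run would use mel spectrograms, e.g. torchaudio's MelSpectrogram; all sizes here are illustrative):

```python
import torch
import torch.nn as nn

# Waveform -> log spectrogram -> small CNN -> pool over time-frequency
# tiles -> clip-level logits: the "treat audio like texture" recipe.
def log_spectrogram(wave, n_fft=512, hop=256):
    window = torch.hann_window(n_fft)
    spec = torch.stft(wave, n_fft, hop_length=hop, window=window,
                      return_complex=True).abs()
    return torch.log1p(spec)  # (batch, freq_bins, frames)

class ClipClassifier(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # aggregate over time and frequency
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, wave):
        spec = log_spectrogram(wave).unsqueeze(1)  # add channel dim
        return self.head(self.cnn(spec).flatten(1))

model = ClipClassifier()
clip = torch.randn(2, 16000)  # two 1-second clips at 16 kHz
logits = model(clip)
print(logits.shape)  # torch.Size([2, 10])
```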

Sign language recognition pushed me into video and pose streams: 2D keypoints over time, temporal pooling or RNN-style heads on top of convolutional backbones, and the engineering pain of frame sampling and class imbalance. That line lives under HearAI.
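A toy version of the pose-stream idea, with made-up sizes rather than the actual HearAI architectures: per-frame 2D keypoints flattened into frame embeddings, a GRU over time, and a classification head.

```python
import torch
import torch.nn as nn

# Pose stream -> frame embeddings -> GRU -> sign-class logits.
class PoseGRU(nn.Module):
    def __init__(self, n_keypoints=42, hidden=128, n_classes=20):
        super().__init__()
        self.frame = nn.Linear(n_keypoints * 2, hidden)  # (x, y) per keypoint
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, kp):                 # kp: (batch, frames, keypoints, 2)
        b, t = kp.shape[:2]
        h = torch.relu(self.frame(kp.reshape(b, t, -1)))
        _, last = self.gru(h)              # final hidden state summarizes the clip
        return self.head(last[-1])

model = PoseGRU()
logits = model(torch.randn(2, 30, 42, 2))  # 2 clips, 30 frames, 42 keypoints
print(logits.shape)  # torch.Size([2, 20])
```

Temporal mean-pooling over the frame embeddings is the even simpler alternative mentioned above; the GRU just makes the "RNN-style head" variant concrete.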

Text-side work on conversational transcripts (punctuation, truecasing) sits next to audio in the research story, but the heavy lifting there was sequence models on text, not acoustic front ends. See PolEval punctuation above if you care about that thread.

Public Hugging Face models

At VoiceLab I worked with the NLP team on the public releases below. We trained and released Poland’s first large-scale generative model, TRURL (7B and 13B variants, 8-bit quantizations, and an academic edition), alongside production Polish NLP models still in daily use: vlt5-base-keywords for keyword extraction (11k+ downloads) and herbert-base-cased-sentiment (24k+). Everything is open on Hugging Face with accompanying datasets.
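Loading those two checkpoints takes a few lines with the transformers pipeline API. The model IDs are the real Hub repos named above; the sample sentence is illustrative, and first use downloads the weights, so the calls are wrapped in a function rather than run at import time.

```python
# Real public Hub repos; everything else here is a usage sketch.
SENTIMENT_ID = "Voicelab/herbert-base-cased-sentiment"
KEYWORDS_ID = "Voicelab/vlt5-base-keywords"

def demo():
    from transformers import pipeline
    sentiment = pipeline("text-classification", model=SENTIMENT_ID)
    keywords = pipeline("text2text-generation", model=KEYWORDS_ID)
    # "Bardzo dobry produkt, polecam!" ~ "Very good product, recommended!"
    print(sentiment("Bardzo dobry produkt, polecam!"))
    # See the model card for the exact prompt format the keyword
    # model was trained with.
    print(keywords("Christmas is celebrated on December 25th each year."))

# demo()  # uncomment to run; needs transformers installed and downloads weights
```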

The wider Voicelab catalog on Hugging Face counts close to fifty checkpoints — most kept private — covering Polish and multilingual NER, punctuation restoration, intent classification, paraphrasing, sentence embeddings (SBERT), and lemmatization, on HerBERT, BERT-multilingual, T5, plT5, and GPT-Neo backbones. The public slice above is the part anyone can download today.

← All projects