2021

Punctuation restoration for read Polish speech

WikiPunct dataset and PolEval 2021 Task 1: text with aligned audio for restoring punctuation and capitalization in ASR output.

NLP Datasets
Automatic speech recognition (ASR) output is often unpunctuated and uncapitalized, which hurts readability and downstream NLP. This line of work builds a redistributable benchmark of read Polish, drawn from news‑style articles and conversational Wikipedia talk pages, recorded by many speakers, with forced alignment between the audio and the reference text.

WikiPunct at a glance

  • WikiTalks: talk pages as conversational Polish (~20% of spoken hours).
  • WikiNews: cleaner news‑style articles (~80%).
  • Train/test splits with ASR‑style transcripts, time alignment, and reference corpora for model training.
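
As a rough illustration of how a punctuated reference text relates to an ASR‑style transcript, the sketch below strips punctuation and casing from a sentence and keeps them as per‑word labels. The function name `to_labeled_tokens`, the label layout, and the set of punctuation marks are assumptions made for this example; they are not the WikiPunct file format or the official task label inventory.

```python
# Punctuation marks treated as labels. This set is an assumption for
# illustration, not necessarily the inventory used in the shared task.
PUNCT = ".,?!:;-"


def to_labeled_tokens(reference):
    """Turn punctuated text into (asr_word, trailing_punct, was_capitalized) triples."""
    labeled = []
    for raw in reference.split():
        word = raw.rstrip(PUNCT)       # drop trailing punctuation
        punct = raw[len(word):][:1]    # first trailing mark, or "" if none
        if not word:                   # token was punctuation only, e.g. a lone dash
            continue
        labeled.append((word.lower(), punct, word[0].isupper()))
    return labeled


labels = to_labeled_tokens("Czy to prawda? Tak, to Warszawa.")
asr_style = " ".join(word for word, _, _ in labels)
print(asr_style)
print(labels)
```

A model for this task would take something like `asr_style` as input and try to predict the second and third elements of each triple.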

Task framing

Participants could fuse lexical cues with acoustic and prosodic features in a multimodal setup. The goal is robust punctuation and capitalization that matches how Polish is read aloud, while acknowledging that some boundaries in spontaneous speech are genuinely ambiguous.
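
One simple prosodic cue that can be derived from word‑level time alignment is the silence between consecutive words, since longer pauses often coincide with punctuation. The sketch below is a minimal example under stated assumptions: the `(word, start_sec, end_sec)` tuple layout and the helper names `pause_after` and `fuse_features` are hypothetical, and this is not the baseline or any participant's system.

```python
def pause_after(words):
    """Return the pause in seconds following each word.

    `words` is a time-sorted list of (word, start_sec, end_sec) tuples.
    The last word has no following word, so its pause is set to 0.0.
    """
    pauses = []
    for i, (_, _, end) in enumerate(words):
        if i + 1 < len(words):
            next_start = words[i + 1][1]
            # Clamp at zero in case aligned segments overlap slightly.
            pauses.append(max(0.0, next_start - end))
        else:
            pauses.append(0.0)
    return pauses


def fuse_features(words):
    """Pair each word (lexical cue) with its following pause (prosodic cue)."""
    return [
        {"word": word, "pause": pause}
        for (word, _, _), pause in zip(words, pause_after(words))
    ]


# Hypothetical alignment for illustration only.
aligned = [
    ("czy", 0.0, 0.25),
    ("to", 0.25, 0.5),
    ("prawda", 0.5, 1.0),
    ("tak", 1.75, 2.0),
]
print(fuse_features(aligned))
```

In a multimodal model, features like `pause` would be fed alongside the word representation, so the classifier can use both what was said and how it was timed.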

Publications

"PolEval 2021 Task 1: Punctuation restoration from read text" describes the shared task, WikiPunct construction, and baselines (Mikołajczyk et al., PolEval 2021 workshop proceedings). Workshop PDF (open at p. 21)

"Joint prediction of truecasing and punctuation for conversational speech in low‑resource scenarios" (Pappagari et al., ASRU 2021) extends the same problem setting to conversational ASR, predicting truecasing and punctuation jointly. It is adjacent to the PolEval read‑speech benchmark. IEEE Xplore
