2021
Punctuation restoration for read Polish speech
WikiPunct dataset and PolEval 2021 Task 1: text with aligned audio for restoring punctuation and capitalization in ASR output.
WikiPunct at a glance
- WikiTalks: talk pages as conversational Polish (~20% of spoken hours).
- WikiNews: cleaner news‑style articles (~80%).
- Train/test splits with ASR‑style transcripts, time alignment, and reference corpora for model training.
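To make the data shape concrete, here is a minimal sketch of how punctuated reference text can be reduced to ASR‑style tokens with per‑token targets. The label names, the punctuation set, and the function `to_asr_style` are illustrative assumptions, not the actual WikiPunct file format or the task's official label inventory.

```python
# Hypothetical label set for illustration only; the real task may use a
# different inventory of punctuation marks.
PUNCT = {".": "PERIOD", ",": "COMMA", "?": "QUESTION", "!": "EXCLAMATION"}


def to_asr_style(text):
    """Turn punctuated text into (token, punct_label, case_label) triples.

    The token is lowercased and stripped of one trailing punctuation mark,
    mimicking raw ASR output; the two labels are what a model would have
    to restore.
    """
    triples = []
    for raw in text.split():
        mark = raw[-1] if raw[-1] in PUNCT else ""
        word = raw[:-1] if mark else raw
        if not word:
            continue  # skip stand-alone punctuation
        case = "UPPER" if word[0].isupper() else "LOWER"
        triples.append((word.lower(), PUNCT.get(mark, "O"), case))
    return triples


if __name__ == "__main__":
    for triple in to_asr_style("Czy to Warszawa? Tak, to ona."):
        print(triple)
```

Each triple pairs the ASR‑style input token with the punctuation mark that follows it (`O` for none) and a flag for whether the word was originally capitalized.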
Task framing
Participants could combine lexical cues with acoustic and prosodic features in a multimodal setup. The goal is robust punctuation and capitalization that matches how Polish is read aloud, while acknowledging that some boundaries in spontaneous speech are genuinely ambiguous.
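One simple way to expose prosody to a text model is to derive the silence after each word from the time alignment and attach it to the token as an extra feature. The sketch below assumes word‑level alignments given as `(word, start_sec, end_sec)` tuples; that layout, the function name `pause_features`, and the 0.3 s threshold are hypothetical choices for illustration, not taken from the dataset or the task baselines.

```python
def pause_features(aligned, long_pause=0.3):
    """Attach the pause following each word to its token.

    `aligned` is a list of (word, start_sec, end_sec) tuples in time order.
    The pause after the last word is set to 0.0 because no following word
    is available to measure against.
    """
    feats = []
    for i, (word, _start, end) in enumerate(aligned):
        if i + 1 < len(aligned):
            next_start = aligned[i + 1][1]
            pause = max(0.0, next_start - end)  # clamp overlapping words
        else:
            pause = 0.0
        feats.append({
            "word": word,
            "pause_after": round(pause, 3),
            "long_pause": pause >= long_pause,
        })
    return feats


if __name__ == "__main__":
    example = [("tak", 0.0, 0.25), ("to", 0.75, 0.9), ("ona", 0.95, 1.3)]
    for row in pause_features(example):
        print(row)
```

A long pause is only a weak hint of a sentence or clause boundary, so a feature like this would normally be fed to a model alongside the lexical input rather than used as a rule by itself.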
Publications
"PolEval 2021 Task 1: Punctuation restoration from read text" describes the shared task, WikiPunct construction, and baselines (Mikołajczyk et al., PolEval 2021 workshop proceedings). Workshop PDF (open at p. 21) →
"Joint prediction of truecasing and punctuation for conversational speech in low‑resource scenarios" (Pappagari et al., ASRU 2021) extends the same problem setting to conversational ASR, predicting truecasing and punctuation jointly. It is adjacent to the PolEval read‑speech benchmark. IEEE Xplore →