15 May 2026 • Maarten Sukel

How Murmel performs on speakers from different provinces

Methodology of the Murmel ASR benchmark on Dutch parliamentary debate audio: which audio was used, which models were compared, how the metadata was collected, and the limitations of the resulting numbers.

This document describes how the numbers in the Murmel ASR benchmark were produced. It is meant as reference material: a researcher, journalist, or civil servant who wants to verify or reproduce the figures will find the audio source, the models, the evaluation procedure, and the limitations of the approach below.

1. Audio

The test set consists of recordings from publicly available debates of the Tweede Kamer (the Dutch House of Representatives). Sample properties:

  • 1,662 audio segments, 8.9 hours (32,200 seconds) of speech in total.
  • 225 unique speakers across 79 different debates.
  • Average segment duration: 19.4 seconds.
  • Original source: video and audio streams from tweedekamer.nl, segmented by speaker turn.
  • Selection: random sample from debates recorded after the training-data cut-off of the most recently published model in the comparison. None of the tested segments are present in the training data of any of the models.
  • No filtering on audio quality, speaker, party, gender, or birthplace. A minimum segment duration of 3 s and a maximum of 45 s were applied to avoid WER degeneracy on extremely short clips.
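
The duration window in the last bullet can be sketched as follows. This is a minimal illustration, not the benchmark's actual pipeline: the `Segment` class, its field names, and the choice of inclusive bounds are assumptions; only the 3 s and 45 s limits come from the text.

```python
from dataclasses import dataclass

MIN_SECONDS = 3.0   # lower bound stated in the text
MAX_SECONDS = 45.0  # upper bound stated in the text


@dataclass(frozen=True)
class Segment:
    """One speaker turn. All field names here are hypothetical."""
    segment_id: str
    speaker: str
    start: float  # seconds from the start of the debate recording
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def keep_segment(seg: Segment) -> bool:
    """True when the turn falls inside the duration window (bounds assumed inclusive)."""
    return MIN_SECONDS <= seg.duration <= MAX_SECONDS


def filter_segments(segments: list[Segment]) -> list[Segment]:
    """Drop turns that are too short or too long; nothing else is filtered."""
    return [s for s in segments if keep_segment(s)]
```

Very short turns matter because WER divides by the number of reference words: a single wrong word in a two-word clip already gives 50% WER.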

2. Models tested

Seven systems, all in their officially published weights and default configuration:

  • Parakeet TDT v3 (NVIDIA)
  • Qwen3-ASR 1.7B (Alibaba)
  • Voxtral Small (Mistral)
  • Whisper Large v3 (OpenAI)
  • Whisper Large v3 Turbo (OpenAI)
  • Whisper Large v3 Dutch fine-tuned (third-party Dutch fine-tune)
  • Murmel v1 (trained by The AI Factory)

Every model received the same audio in the same order and quality. No language prompt or system instruction was given; only the language code nl was passed where the model supports it.

3. Evaluation metric

  • Metric: Word Error Rate (WER), defined as (substitutions + insertions + deletions) / number of reference words.
  • Text normalisation before WER: lowercasing, removal of punctuation, expansion of common abbreviations and numbers, removal of filler tokens such as "eh" and "uh". The exact same routine is applied to every model and to the reference.
  • Reference transcripts: the official Handelingen of the Tweede Kamer, with manual correction of obvious transcription or OCR errors.
  • Common-sample filter: only segments where all seven models produced a non-empty output are included (n=1,662). This prevents one model being unfairly advantaged or disadvantaged by drop-outs.
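
The metric, the normalisation, and the common-sample filter above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the routine used for the benchmark: the filler list contains only the two examples named in the text, abbreviation and number expansion are left out, and all function names are hypothetical.

```python
import re

FILLERS = {"eh", "uh"}  # only the fillers named in the text; the full list is an unknown


def normalise(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, drop filler tokens.

    Abbreviation and number expansion are omitted in this sketch.
    """
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return [w for w in text.split() if w not in FILLERS]


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance: substitutions + insertions + deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """WER of one segment after applying the same normalisation to both sides."""
    ref_words = normalise(reference)
    if not ref_words:
        raise ValueError("reference is empty after normalisation")
    return edit_distance(ref_words, normalise(hypothesis)) / len(ref_words)


def common_segments(outputs: dict[str, dict[str, str]]) -> set[str]:
    """Segment ids for which every model produced a non-empty transcript.

    outputs maps model name -> {segment id -> transcript}.
    """
    per_model = [
        {seg_id for seg_id, text in hyps.items() if text.strip()}
        for hyps in outputs.values()
    ]
    return set.intersection(*per_model) if per_model else set()
```

For example, `wer("De minister, eh, heeft het woord.", "de minister heeft een woord")` gives 0.2: after normalisation the reference has five words and the hypothesis differs by one substitution.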

4. Important caveat: province of birth ≠ accent

The province-of-birth breakdown is often read as a measurement of accent. That is not a one-to-one mapping. Not everyone born in a given province speaks with the regional accent of that province. Many speakers:

  • moved to another region at an early age;
  • went through education or a career path that shifted their accent toward Standard Dutch;
  • carry a mix of multiple regional influences.

The province columns in the results table are therefore a proxy for the average accent variation within that birth group, not a measurement of the accent itself. A clean accent measurement would require human annotators to label every recording on phonetic features; that is outside the scope of this benchmark.

The numbers should be read as: "for speakers born in province X — a group whose audio on average contains certain accent features more often — WER is Y%." Not as: "the accent of province X yields WER Y%."

5. Results by province of birth

The table below shows WER per model, grouped by the speaker's province of birth. Lower is better.

Word Error Rate (WER) per province of birth

The table shows six of the seven tested systems (Parakeet TDT v3 is not listed). All models received exactly the same audio.

Province of birth            Murmel   Whisper Turbo   Whisper Large v3   Voxtral Small   Qwen3 1.7B   Whisper L-v3 FT
Groningen (n=41)              6.4%        12.3%            12.9%              9.3%          10.6%          17.1%
Zeeland (n=8)                 9.8%        20.7%            21.6%             20.7%          14.4%          23.9%
Overijssel (n=150)           13.9%        18.8%            18.5%             18.1%          19.0%          22.9%
Limburg (n=80)               14.6%        18.1%            18.8%             17.9%          19.7%          23.9%
Noord-Brabant (n=106)        15.3%        19.4%            20.5%             19.1%          20.3%          23.7%
Noord-Holland (n=257)        15.6%        19.9%            20.2%             20.3%          21.3%          26.8%
Zuid-Holland (n=342)         16.0%        19.2%            20.4%             19.2%          20.8%          24.8%
Drenthe (n=30)               16.8%        19.9%            20.4%             19.5%          22.0%          27.9%
Friesland / Fryslân (n=38)   19.5%        24.2%            23.2%             23.7%          23.4%          26.9%
Utrecht (n=117)              19.7%        23.1%            22.9%             22.8%          24.6%          27.8%
Gelderland (n=201)           21.2%        22.8%            21.5%             21.5%          22.1%          26.4%
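
A per-province figure like the ones in the table can be computed by pooling errors and reference words over all segments in a group. The text does not state whether the benchmark pools in this way or averages per-segment WER, so the sketch below is one plausible reading, with a hypothetical function name and made-up input numbers.

```python
from collections import defaultdict
from typing import Iterable


def pooled_wer_by_group(rows: Iterable[tuple[str, int, int]]) -> dict[str, float]:
    """Pooled WER per group.

    rows yields (group, errors, reference_words) for each segment, where
    errors is the word-level edit distance for that segment.
    """
    errors: dict[str, int] = defaultdict(int)
    words: dict[str, int] = defaultdict(int)
    for group, seg_errors, seg_words in rows:
        errors[group] += seg_errors
        words[group] += seg_words
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


# Illustrative numbers only, not taken from the benchmark.
example = [("Zeeland", 1, 10), ("Zeeland", 3, 30), ("Utrecht", 5, 20)]
result = pooled_wer_by_group(example)
```

Pooling weights each segment by its length, so a handful of very short, error-heavy turns cannot dominate a province's figure the way they can in an unweighted mean of per-segment WER.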

6. Limitations of this measurement

  • Province of birth ≠ accent. See section 4. The provincial numbers are a proxy, not a direct measurement of accent.
  • Sample size differs per province. Zeeland (n=8) and Drenthe (n=30) carry far wider uncertainty than Zuid-Holland (n=342). Comparisons between small-n provinces should be treated cautiously.
  • Code-switching, not an accent effect. The relatively high WER for Friesland / Fryslân is likely related to speakers mixing Dutch and Frisian within a single sentence. That is a language-mix effect, not an accent effect.
  • Procedural meetings. Short technical sentences and people talking over each other yield significantly higher WER for all seven models. These segments are in the test set; they raise the averages, but not selectively for one model.
  • Domain. Parliamentary audio is relatively formal and has reasonable microphone quality. Results on call-centre, field, or healthcare-context audio may differ.
  • Unknown metadata. For 17–22% of speaking time, country or province of birth is unknown. That portion sits in a separate "Unknown" category and is excluded from the provincial rows in the table.
  • Reference quality. The official Handelingen are manually compiled but occasionally contain transcription errors. Where clearly identifiable, they have been corrected; residual errors cannot be ruled out.
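
To make the sample-size caveat concrete, one option is a percentile bootstrap over the segments of a single province. The sketch below is not part of the published benchmark: the function name, the pooled-WER aggregation, and the default of 2,000 resamples are all assumptions chosen for illustration.

```python
import random


def bootstrap_wer_interval(
    segments: list[tuple[int, int]],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap interval for the pooled WER of one province.

    segments holds (errors, reference_words) per segment; every segment
    must have at least one reference word.
    """
    rng = random.Random(seed)
    n = len(segments)
    estimates = []
    for _ in range(n_resamples):
        # Draw n segments with replacement and recompute the pooled WER.
        sample = [segments[rng.randrange(n)] for _ in range(n)]
        total_errors = sum(e for e, _ in sample)
        total_words = sum(w for _, w in sample)
        estimates.append(total_errors / total_words)
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

With only eight segments, as for Zeeland, such an interval will typically be much wider than for a province with several hundred. Because segments from the same speaker are not independent, resampling whole speakers instead of segments would give a more conservative interval.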
