who spoke when
Did you notice the coloured bar with stripes under the video?
What is that bar?
Each coloured segment in that bar represents the time during which a different speaker spoke in the video. This is called "speaker diarisation": it identifies who spoke and when. This is a different problem from speech recognition; we are not trying to identify what is spoken.
Why do that?
Well, in this case it is useful for identifying the different segments of the talk, so I can skip the introduction, jump through the questions in a Q&A, identify a discussion between groups of speakers, get an idea of the general flow of the talk... Basically it gives a "bird's ear view" of the media and makes it easier to navigate. Yes, you can already click and drag along the video slider to see snapshots in YouTube videos, but you still have to browse slowly through all the snapshots. Also, audio files don't have snapshots.
This is great! Why is this feature not all over YouTube?
The implementation described below is not scalable as is. The computation to identify the different segments is extremely intensive: it can take almost as long as playing the media itself.
How to build your own diarisation bars:
download the audio of the video/podcast
youtube-dl -x 'http://www.youtube.com/watch?v=klZWuI6Fqgk' -o edge.of.sky.m4a
convert it to 16kHz 16bit mono PCM
ffmpeg -i edge.of.sky.m4a -acodec pcm_s16le -ac 1 -ar 16000 edge.of.sky.wav
apply the diarisation tool from LIUM (links below) to generate the segments file
java -Xmx2048m -jar ./lium_spkdiarization-8.4.1.jar --fInputMask=./edge.of.sky.wav --sOutputMask=./edge.of.sky.seg --doCEClustering edge.of.sky
run the R script to generate the bar
R CMD BATCH diarise.r
finally open edge.of.sky.png and insert/embed under videos/podcasts
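If you want to inspect the segments file before plotting it, a small helper can turn it into plain start/end times. This is only a sketch, not the diarise.r script used above. It assumes the LIUM .seg layout of eight whitespace-separated fields per line (show, channel, start, length, gender, band, environment, speaker), with start and length counted in 10 ms frames and lines starting with ";;" being comments. The function name seg_to_seconds is made up for this example.

```shell
#!/bin/sh
# Sketch: convert a LIUM .seg file into "start_seconds end_seconds speaker" rows.
# Assumptions (not from the steps above): 8 fields per data line, start/length
# given in 10 ms frames, and ";;" marking comment lines.
seg_to_seconds() {
    LC_ALL=C awk '
        /^;;/ || NF < 8 { next }   # skip cluster comments and malformed lines
        { printf "%.2f %.2f %s\n", $3 / 100, ($3 + $4) / 100, $8 }
    ' "$1" | LC_ALL=C sort -n      # order the segments by start time
}
```

For example, seg_to_seconds edge.of.sky.seg > edge.of.sky.txt would write one line per segment, ordered by start time.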
The chosen colour palette (Accent1), based on colorbrewer2, is optimised for categorical data, in this case different speakers: it provides maximum hue contrast between colours. It is also suitable for dichromats.
This speaker diarisation bar was inspired by the "moodbar" that I've been using for many years to navigate music files.
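To illustrate how a categorical palette maps onto speakers, here is a rough sketch that appends a colour to each segment row. It assumes the palette is the 8-class ColorBrewer "Accent" scheme (the hex values below are that scheme's, which may not match the exact palette used for the bar), that each input row ends with the speaker label, and that colours are handed out in order of first appearance and reused after eight speakers. The function name assign_colours is hypothetical.

```shell
#!/bin/sh
# Sketch: give each speaker label a colour from the 8-class ColorBrewer "Accent" palette.
# Assumption: rows arrive on stdin and end with the speaker label, e.g. "0.00 5.00 S0".
assign_colours() {
    awk '
        BEGIN { n = split("#7fc97f #beaed4 #fdc086 #ffff99 #386cb0 #f0027f #bf5b17 #666666", pal, " ") }
        !($NF in colour) { colour[$NF] = pal[(count++ % n) + 1] }   # first time we see this speaker
        { print $0, colour[$NF] }                                   # append the colour to the row
    '
}
```

A speaker keeps the same colour throughout, so a returning speaker is recognisable at a glance in the bar.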
other speaker diarisation tools: