Processing Graphical Subtitles

In the past, when I got hold of a video that has hdmv_pgs_subtitle subtitle streams, I have always ignored it. Instead I tried to find a compatible subtitle in .srt format on the opensubtitles.org website. Today I came across a video that I am trying to archive that does not have the appropriate subtitles that I wanted. All of this would not have been an issue if my preferred mp4 format actually supports the hdmv_pgs_subtitle format.

I know an OCR (Optical Character Recognition) technique for extracting the subtitles from the hdmv_pgs_subtitle stream, but I am always in a hurry. This time, I bit the bullet and went down on this path.

Below are the steps that I had to go through.

First I had to download and install ffmpeg and mkvtoolnix packages on my Linux machine, and then execute the following commands to extract the Chinese subtitles that I wanted.

ffmpeg -y -i archive.mkv -map 0:s:1 -c:s dvdsub -f matroska chi.mkv
mkvextract chi.mkv tracks 0:mysub

After the above commands, I will have mysub.idx and mysub.sup files. The first are the time index codes and the latter are the subtitle images.

On a Windows virtual machine, I had to download Subtitle Edit, a subtitle editor tool that has the OCR functionality, and convert the mysub.idx and mysub.sup into mysub.srt, which I can then later use to re-incorporate back into the archive video file.

After the OCR is completed.

Above is a screenshot of the application after the OCR is completed. I found that the engine mode of Tesseract + LSTM worked the best. Of course, I had to select the matching language that is befitting of the subtitle. Once I saved the finished product as mysub.srt I can then use this file to create archive.mp4 using ffmpeg.

ffmpeg -i archive.mkv -i mysub.srt -map 0:v -map 0:a -map 1:s -c copy -c:s mov_text -metadata:s:s:0 language=chi archive.mp4

Video file successfully archived!

Leave a Reply

Your email address will not be published. Required fields are marked *