In the past, whenever I got hold of a video with hdmv_pgs_subtitle subtitle streams, I simply ignored them and tried to find a compatible subtitle in .srt format on the opensubtitles.org website. Today I came across a video that I am trying to archive, and the subtitles that I wanted were not available there. None of this would have been an issue if my preferred mp4 format actually supported the hdmv_pgs_subtitle format.
I knew that OCR (Optical Character Recognition) could be used to extract the subtitles from an hdmv_pgs_subtitle stream, but I was always in too much of a hurry to try it. This time, I bit the bullet and went down this path.
Below are the steps that I had to go through.
First, I had to download and install the ffmpeg and mkvtoolnix packages on my Linux machine, and then run the following commands to extract the Chinese subtitles that I wanted.
ffmpeg -y -i archive.mkv -map 0:s:1 -c:s dvdsub -f matroska chi.mkv
mkvextract chi.mkv tracks 0:mysub
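
The 1 in -map 0:s:1 is the zero-based position of the Chinese track among the subtitle streams of my file, so it will differ for other videos. Below is a minimal sketch for looking that number up instead of guessing. The function name sub_position is hypothetical, and the sketch assumes that ffprobe (installed with ffmpeg) prints one index,codec_name,language line per subtitle stream when called as shown in the trailing comment, and that the track carries a language tag.

```shell
# sub_position LANG: read "index,codec_name,language" lines on stdin and
# print the 0-based position of the first subtitle stream tagged LANG.
# That position is the N to use in "-map 0:s:N".
# Exits non-zero when no stream matches.
sub_position() {
  awk -F, -v lang="$1" '
    BEGIN { n = 0 }
    NF == 0 { next }                          # ignore blank lines
    $3 == lang { print n; found = 1; exit }   # first match wins
    { n++ }                                   # count non-matching streams
    END { exit !found }
  '
}

# Assumed usage (requires ffprobe and the real file):
# ffprobe -v error -select_streams s \
#   -show_entries stream=index,codec_name:stream_tags=language \
#   -of csv=p=0 archive.mkv | sub_position chi
```

If the function prints nothing, the track probably has no language tag, and reading the plain ffprobe listing by eye is the fallback.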

The ffmpeg and mkvextract commands leave me with the mysub.idx and mysub.sub files. The former holds the time index codes and the latter holds the subtitle images.
On a Windows virtual machine, I had to download Subtitle Edit, a subtitle editor that has OCR functionality, and use it to convert mysub.idx and mysub.sub into mysub.srt, which I can later incorporate back into the archive video file.

Above is a screenshot of the application after the OCR completed. I found that the Tesseract + LSTM engine mode worked best. Of course, I had to select the language that matches the subtitle. Once I had saved the finished product as mysub.srt, I could use this file to create archive.mp4 using ffmpeg.
ffmpeg -i archive.mkv -i mysub.srt -map 0:v -map 0:a -map 1:s -c copy -c:s mov_text -metadata:s:s:0 language=chi archive.mp4
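
As an optional check before running the mux command above, it may be worth confirming that the OCR export is not empty or truncated. The sketch below counts the timing lines in the .srt file. The function name srt_cue_count is hypothetical, and the pattern assumes the standard SRT timing format of HH:MM:SS,mmm --> HH:MM:SS,mmm.

```shell
# srt_cue_count FILE: count SRT timing lines such as
# "00:00:01,000 --> 00:00:02,500". Each cue has exactly one such line,
# so the result is the number of cues in the file.
srt_cue_count() {
  grep -c -E '^[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}' "$1"
}

# Assumed usage:
# srt_cue_count mysub.srt
```

A count far below the number of subtitle images that Subtitle Edit listed would suggest that the export went wrong.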
Video file successfully archived!