{"id":3022,"date":"2025-03-12T16:19:34","date_gmt":"2025-03-12T20:19:34","guid":{"rendered":"https:\/\/blog.lufamily.ca\/kang\/?p=3022"},"modified":"2025-04-08T18:43:14","modified_gmt":"2025-04-08T22:43:14","slug":"processing-graphical-subtitles","status":"publish","type":"post","link":"https:\/\/blog.lufamily.ca\/kang\/2025\/03\/12\/processing-graphical-subtitles\/","title":{"rendered":"Processing Graphical Subtitles"},"content":{"rendered":"\n<p>In the past, when I got hold of a video that has <code>hdmv_pgs_subtitle<\/code> subtitle streams, I have always ignored it. Instead I tried to find a compatible subtitle in <code>.srt<\/code> format on the <a href=\"https:\/\/opensubtitles.org\" target=\"_blank\" rel=\"noreferrer noopener\">opensubtitles.org<\/a> website. Today I came across a video that I am trying to archive that does not have the appropriate subtitles that I wanted. All of this would not have been an issue if my preferred <code>mp4<\/code> format actually supported the <code>hdmv_pgs_subtitle<\/code> format.<\/p>\n\n\n\n<p>I knew of an OCR (Optical Character Recognition) technique for extracting the subtitles from the <code>hdmv_pgs_subtitle<\/code> stream, but I was always in too much of a hurry to try it. 
This time, I bit the bullet and went down this path.<\/p>\n\n\n\n<p>Below are the steps that I went through.<\/p>\n\n\n\n<p>First I had to download and install the ffmpeg and mkvtoolnix packages on my Linux machine, and then run the following commands to extract the Chinese subtitles that I wanted. In the first command, <code>-map 0:s:1<\/code> selects the second subtitle stream of the input (the Chinese one in my file) and <code>-c:s dvdsub<\/code> converts it to the DVD bitmap subtitle format.<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code>ffmpeg -y -i archive.mkv -map 0:s:1 -c:s dvdsub -f matroska chi.mkv\nmkvextract chi.mkv tracks 0:mysub<\/code><\/pre>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"alignright size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"142\" height=\"170\" src=\"https:\/\/blog.lufamily.ca\/kang\/wp-content\/uploads\/sites\/3\/2025\/03\/Screenshot-2025-03-12-at-4.07.01\u202fPM-1.png\" alt=\"\" class=\"wp-image-3035\" style=\"width:78px;height:auto\"\/><\/figure>\n<\/div>\n\n\n<p>After the above commands, I had the <code>mysub.idx<\/code> and <code>mysub.sub<\/code> files. The former holds the time index codes and the latter holds the subtitle images.<\/p>\n\n\n\n<p>On a Windows virtual machine, I downloaded <a href=\"https:\/\/www.nikse.dk\/subtitleedit\" target=\"_blank\" rel=\"noreferrer noopener\">Subtitle Edit<\/a>, a subtitle editor with OCR functionality, and used it to convert <code>mysub.idx<\/code> and <code>mysub.sub<\/code> into <code>mysub.srt<\/code>, which I could later incorporate back into the archived video file.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/blog.lufamily.ca\/kang\/wp-content\/uploads\/sites\/3\/2025\/03\/Screenshot-2025-03-12-at-4.06.48\u202fPM-1.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"573\" src=\"https:\/\/blog.lufamily.ca\/kang\/wp-content\/uploads\/sites\/3\/2025\/03\/Screenshot-2025-03-12-at-4.06.48\u202fPM-1-1024x573.png\" alt=\"\" class=\"wp-image-3036\" srcset=\"https:\/\/blog.lufamily.ca\/kang\/wp-content\/uploads\/sites\/3\/2025\/03\/Screenshot-2025-03-12-at-4.06.48\u202fPM-1-1024x573.png 1024w, 
https:\/\/blog.lufamily.ca\/kang\/wp-content\/uploads\/sites\/3\/2025\/03\/Screenshot-2025-03-12-at-4.06.48\u202fPM-1-300x168.png 300w, https:\/\/blog.lufamily.ca\/kang\/wp-content\/uploads\/sites\/3\/2025\/03\/Screenshot-2025-03-12-at-4.06.48\u202fPM-1-768x429.png 768w, https:\/\/blog.lufamily.ca\/kang\/wp-content\/uploads\/sites\/3\/2025\/03\/Screenshot-2025-03-12-at-4.06.48\u202fPM-1-1536x859.png 1536w, https:\/\/blog.lufamily.ca\/kang\/wp-content\/uploads\/sites\/3\/2025\/03\/Screenshot-2025-03-12-at-4.06.48\u202fPM-1-2048x1145.png 2048w, https:\/\/blog.lufamily.ca\/kang\/wp-content\/uploads\/sites\/3\/2025\/03\/Screenshot-2025-03-12-at-4.06.48\u202fPM-1-1200x671.png 1200w\" sizes=\"auto, (max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 1362px) 62vw, 840px\" \/><\/a><figcaption class=\"wp-element-caption\">After the OCR is completed.<\/figcaption><\/figure>\n\n\n\n<p>The screenshot above shows the application after the OCR completed. I found that the Tesseract + LSTM engine mode worked best. Of course, I had to select the language that matches the subtitles. Once I had saved the finished product as <code>mysub.srt<\/code>, I could use this file to create <code>archive.mp4<\/code> using <code>ffmpeg<\/code>. In the command below, <code>-c copy<\/code> copies the video and audio streams as they are, while <code>-c:s mov_text<\/code> converts the subtitles into the text format that <code>mp4<\/code> supports.<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code>ffmpeg -i archive.mkv -i mysub.srt -map 0:v -map 0:a -map 1:s -c copy -c:s mov_text -metadata:s:s:0 language=chi archive.mp4<\/code><\/pre>\n\n\n\n<p>Video file successfully archived!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the past, when I got hold of a video that has hdmv_pgs_subtitle subtitle streams, I have always ignored it. Instead I tried to find a compatible subtitle in .srt format on the opensubtitles.org website. 
Today I came across a video that I am trying to archive that does not have the appropriate subtitles that &hellip; <a href=\"https:\/\/blog.lufamily.ca\/kang\/2025\/03\/12\/processing-graphical-subtitles\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Processing Graphical Subtitles&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"everybody","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[111],"tags":[168,5,28,178],"class_list":["post-3022","post","type-post","status-publish","format-standard","hentry","category-tech","tag-ffmpeg","tag-nas","tag-technology","tag-video-processing"],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p7V6i8-MK","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/blog.lufamily.ca\/kang\/wp-json\/wp\/v2\/posts\/3022","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.lufamily.ca\/kang\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.lufamily.ca\/kang\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.lufamily.ca\/kang\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.lufamily.ca\/kang\/wp-json\/wp\/v2\/comments?post=3022"}],"version-history":[{"count":3,"href":"https:\/\/blog.lufamily.ca\/kang\/wp-json\/wp\/v2\/posts\/3022\/revisions"}],"predecessor-version":[{"id":3038,"href":"https:\/\/blog.lufamily.ca\/kang\/wp-json\/wp\/v2\/posts\/3022\/revisions\/3038"}],"wp:attachment":[{"href":"https:\/\/blog.lufamily.ca\/kang\/wp-json\/wp\/v2\/media?parent=3022"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"htt
ps:\/\/blog.lufamily.ca\/kang\/wp-json\/wp\/v2\/categories?post=3022"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.lufamily.ca\/kang\/wp-json\/wp\/v2\/tags?post=3022"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}