mirror of
https://github.com/CCExtractor/ccextractor.git
synced 2026-04-24 23:20:09 +00:00
[PR #962] [MERGED] [IMPROVEMENT] [FIX] Improve the start and end timestamps of extracted burned in captions #1784
📋 Pull Request Information
Original PR: https://github.com/CCExtractor/ccextractor/pull/962
Author: @saurabhshah0410
Created: 3/12/2018
Status: ✅ Merged
Merged: 3/12/2018
Merged by: @cfsmp3
Base: master ← Head: improvement
📝 Commits (1)
52b5a38 Improve the start and end timestamps of extracted burned in captions
📊 Changes
1 file changed (+50 additions, -23 deletions)
📝 src/lib_ccx/hardsubx_decoder.c (+50 -23)
📄 Description
In raising this pull request, I confirm the following (please check boxes):
My familiarity with the project is as follows (check one):
The start and end timestamps of extracted burned-in captions are flawed
and off by a large margin. Also, the start time of the first extracted
burned-in caption is always zero, which does not always match the video. And
the extracted captions always get contiguous timestamps, each one starting
right after the previous one ends.
To see that, you can download this file from the UK TV Samples in the samples repository:
https://drive.google.com/open?id=0B_61ywKPmI0TdlRWcVdnajVJUWs
Since the duration of that file is 15 minutes, we can trim it down to 30 seconds for our purposes:
ffmpeg -i BBC1.mp4 -acodec copy -vcodec copy -scodec copy -ss 00:00:00 -t 00:00:30 bbc.mp4
This will generate the first 30 seconds of BBC1.mp4 in bbc.mp4.
Now, before this commit, if I compile ccextractor with hard subs enabled and run the following command:
./ccextractor bbc.mp4 -hardsubx -sub_color yellow -conf_thresh 60
the generated bbc.srt (ignore the weird looking characters, that is just OCR not giving a good output I guess) is:
One can see that the timings are clearly flawed, and are off by a large margin.
Also, because of the way the code is written, the first extracted caption always has
a starting timestamp of 00:00:00. The extracted subtitles also always follow one
another back to back in time (i.e., there is only a 2 ms gap between two consecutive
captions), making it look like the whole video had hard subs in it throughout.
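The two symptoms described above can be checked mechanically on a parsed cue list. The following is only a minimal sketch for illustration: the `struct cue` type and the `cues_look_back_to_back` function are hypothetical names and are not part of ccextractor. It assumes the cue times have already been read from the .srt file and converted to milliseconds.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cue record (not a ccextractor type); times in milliseconds. */
struct cue {
    long start_ms;
    long end_ms;
};

/*
 * Returns true when the cue list shows both symptoms at once:
 * the first cue starts at 0, and every later cue begins no more than
 * max_gap_ms after the previous cue ends.
 */
bool cues_look_back_to_back(const struct cue *cues, size_t n, long max_gap_ms)
{
    if (n == 0 || cues[0].start_ms != 0)
        return false;

    for (size_t i = 1; i < n; i++) {
        long gap = cues[i].start_ms - cues[i - 1].end_ms;
        if (gap < 0 || gap > max_gap_ms)
            return false;
    }
    return true;
}
```

Calling it with `max_gap_ms` set to 2 matches the 2 ms gap mentioned above; a cue list whose first caption starts later than zero, or that has real pauses between captions, makes it return false.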
This commit improves the start and end timestamps of the extracted
burned in captions and reduces the error significantly, bringing the
timestamps fairly close to the actual timings as they appear in the
media file.
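One general way to get timestamps that follow the on-screen text is to start a cue when text first appears in a sampled frame and to close it when the text changes or disappears. The sketch below illustrates that idea only; it is not the code from this commit. The types `ocr_sample` and `timed_cue` and the function `build_cues` are hypothetical, and it compares text with an exact `strcmp`, whereas noisy OCR output would likely need a fuzzier comparison.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical input: one OCR result per sampled frame.
 * text is NULL or "" when no subtitle was detected in that frame. */
struct ocr_sample {
    long time_ms;
    const char *text;
};

/* Hypothetical output cue; text points into the input samples. */
struct timed_cue {
    long start_ms;
    long end_ms;
    const char *text;
};

static int has_text(const char *s)
{
    return s != NULL && s[0] != '\0';
}

/* Writes at most out_cap cues to out and returns how many were written. */
size_t build_cues(const struct ocr_sample *samples, size_t n,
                  struct timed_cue *out, size_t out_cap)
{
    size_t count = 0;
    const char *current = NULL; /* text of the cue being tracked, if any */
    long start = 0, last_seen = 0;

    for (size_t i = 0; i < n; i++) {
        const char *t = has_text(samples[i].text) ? samples[i].text : NULL;

        if (current != NULL && t != NULL && strcmp(current, t) == 0) {
            last_seen = samples[i].time_ms; /* same text still on screen */
            continue;
        }
        /* Text changed or disappeared: close the open cue, if any. */
        if (current != NULL && count < out_cap) {
            out[count].start_ms = start;
            out[count].end_ms = last_seen;
            out[count].text = current;
            count++;
        }
        current = t;
        if (t != NULL)
            start = last_seen = samples[i].time_ms; /* new cue starts here */
    }
    /* Close a cue that is still open when the samples run out. */
    if (current != NULL && count < out_cap) {
        out[count].start_ms = start;
        out[count].end_ms = last_seen;
        out[count].text = current;
        count++;
    }
    return count;
}
```

With this scheme the first cue starts at the time of the first frame that actually contains text rather than at zero, and frames without text leave a gap between cues. The end time is taken as the last frame where the text was still seen; using the first frame where it is gone would be an equally reasonable choice.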
With these changes included, and running the command:
./ccextractor bbc.mp4 -hardsubx -sub_color yellow -conf_thresh 60
the generated bbc.srt is:
One can see that while the OCR output is the same, the timings have improved and
are closer to the actual timings in the media file.
🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.