mirror of
https://github.com/CCExtractor/ccextractor.git
synced 2026-02-13 13:35:37 +00:00
[BUG] ocr.c writing outside allocated memory #573
Reference in New Issue
Block a user
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Originally created by @dcjm on GitHub (Apr 6, 2020).
I am raising this as an issue but I do have a fix for the problem so it could be a pull request. However, this is a full explanation of the problem in case you want to try a different approach.
When running ccextractor on some ts files I found two cases where malloc was reporting memory corruption. This was tested first with the 0.88 release and then with git master. I ran valgrind and got the following:
==32409== Invalid write of size 1
==32409== at 0x483CA14: memmove (vg_replace_strmem.c:1270)
==32409== by 0x13F423: ocr_bitmap (ocr.c:671)
==32409== by 0x13FEEC: ocr_rect (ocr.c:915)
==32409== by 0x17B828: write_dvb_sub (dvb_subtitle_decoder.c:1663)
==32409== by 0x17BBD9: dvbsub_handle_display_segment (dvb_subtitle_decoder.c:1726)
==32409== by 0x17C077: dvbsub_decode (dvb_subtitle_decoder.c:1832)
==32409== by 0x149BEA: process_data (general_loop.c:644)
==32409== by 0x14AB6E: general_loop (general_loop.c:1019)
==32409== by 0x137058: api_start (ccextractor.c:213)
==32409== by 0x137D7F: main (ccextractor.c:534)
==32409== Address 0x9b75967 is 0 bytes after a block of size 135 alloc'd
==32409== at 0x483577F: malloc (vg_replace_malloc.c:299)
==32409== by 0x13F07E: ocr_bitmap (ocr.c:595)
==32409== by 0x13FEEC: ocr_rect (ocr.c:915)
==32409== by 0x17B828: write_dvb_sub (dvb_subtitle_decoder.c:1663)
==32409== by 0x17BBD9: dvbsub_handle_display_segment (dvb_subtitle_decoder.c:1726)
==32409== by 0x17C077: dvbsub_decode (dvb_subtitle_decoder.c:1832)
==32409== by 0x149BEA: process_data (general_loop.c:644)
==32409== by 0x14AB6E: general_loop (general_loop.c:1019)
==32409== by 0x137058: api_start (ccextractor.c:213)
==32409== by 0x137D7F: main (ccextractor.c:534)
==32409==
==32409== Invalid write of size 1
==32409== at 0x13F466: ocr_bitmap (ocr.c:679)
==32409== by 0x13FEEC: ocr_rect (ocr.c:915)
==32409== by 0x17B828: write_dvb_sub (dvb_subtitle_decoder.c:1663)
==32409== by 0x17BBD9: dvbsub_handle_display_segment (dvb_subtitle_decoder.c:1726)
==32409== by 0x17C077: dvbsub_decode (dvb_subtitle_decoder.c:1832)
==32409== by 0x149BEA: process_data (general_loop.c:644)
==32409== by 0x14AB6E: general_loop (general_loop.c:1019)
==32409== by 0x137058: api_start (ccextractor.c:213)
==32409== by 0x137D7F: main (ccextractor.c:534)
==32409== Address 0x9b7596a is 3 bytes after a block of size 135 alloc'd
==32409== at 0x483577F: malloc (vg_replace_malloc.c:299)
==32409== by 0x13F07E: ocr_bitmap (ocr.c:595)
==32409== by 0x13FEEC: ocr_rect (ocr.c:915)
==32409== by 0x17B828: write_dvb_sub (dvb_subtitle_decoder.c:1663)
==32409== by 0x17BBD9: dvbsub_handle_display_segment (dvb_subtitle_decoder.c:1726)
==32409== by 0x17C077: dvbsub_decode (dvb_subtitle_decoder.c:1832)
==32409== by 0x149BEA: process_data (general_loop.c:644)
==32409== by 0x14AB6E: general_loop (general_loop.c:1019)
==32409== by 0x137058: api_start (ccextractor.c:213)
==32409== by 0x137D7F: main (ccextractor.c:534)
Adding assertions and then looking at the values with gdb showed that the assumption in the comment on line 635 is incorrect:
(gdb) print text_out
$1 = 0x5555570b91f0 "Of course. I'll do\nwhatever | can to h<font color="#ffff00">elp.\n"
(gdb) print text_out
$1 = 0x555556caa9a0 "Ford resented him.\nSo he cooked up <font color="#ffff00">a pack of lies\n"
The overflow comes about because the loop at line 650 leaves
last_font_tagpointing to the start of the line but the call tostrstron line 658 setslast_font_tag_endto the closing>on the subsequent line. The fix is to addif (last_font_tag_end > line_end) last_font_tag_end = NULL;after line 658.
If you want to investigate this yourselves I can put the subtitle streams on a public server. There may be a better way to fix this.
@cfsmp3 commented on GitHub (Apr 6, 2020):
@NilsIrl want to look into this?
@NilsIrl commented on GitHub (Apr 7, 2020):
That would be very helpful.
@dcjm commented on GitHub (Apr 7, 2020):
The streams are here. These streams contain only the subtitles. The video and audio have been stripped with ffmpeg.
Ccextractor was built on a Debian Buster (amd64) machine.
uname -a
Linux athena 4.19.0-8-amd64 #1 SMP Debian 4.19.98-1 (2020-01-26) x86_64 GNU/Linux
The current version, including my fix and the added assertions is:
CCExtractor detailed version info
Version: 0.88
Git commit: 6b6f548531b7b681d5bd99a563077e6fd80a8ba9
Compilation date: 2020-04-06
File SHA256: 51eef32d5888a585bc83ffcf8f9a7de914896feb87776be9d1cea5e18c74545d
Libraries used by CCExtractor
Tesseract Version: 4.0.0
Leptonica Version: leptonica-1.76.0
libGPAC Version: 0.7.2-DEV
zlib: 1.2.11
utf8proc Version: 2.4.0
protobuf-c Version: 1.3.1
libpng Version: 1.6.35
FreeType
libhash
nuklear
libzvbi
@NilsIrl commented on GitHub (Apr 8, 2020):
If you have already written the code needed, there's no reason to not send a pull request even if we find a better solution.
@NilsIrl commented on GitHub (Apr 8, 2020):
I was not able to reproduce this on master or v0.88 with stream1.ts.
Output
@dcjm commented on GitHub (Apr 9, 2020):
Did you try stream2.ts? This seems to show the problem quite quickly. As a further check I installed Debian Buster in a new virtual machine with the minimum necessary to build ccextractor. It processed stream1.ts without a problem. I have a feeling that only failed when I added explicit bounds checking assertions. The output for stream2.ts was:
Stream3.ts produced
@NilsIrl commented on GitHub (Apr 12, 2020):
Well, I've tried with
stream2.tsandstream3.tsbut to no avail, nothing wrong reported byvalgrindon my side.Could you send a diff/patch/branch of the checks you added?
@cfsmp3 commented on GitHub (Apr 12, 2020):
@dcjm what exact command line did you use?
@dcjm commented on GitHub (Apr 13, 2020):
The commit with both the assertions and the fix is here. The fix is the addition of lines 660 and 661 so that needs to be removed or commented out to trigger the assertions.
The command line was simply:
ccextractor/linux/ccextractor stream2.tsWhich Linux platform do you use for testing? I may be able to run some comparisons. The issue could come down to the version of tesseract. This is what is installed on my testing virtual machine:
@cfsmp3 commented on GitHub (Mar 22, 2023):
I couldn't reproduce (and I spent quite a bit of time today with those samples because they showed two memory leaks that took a bit to fix). Anyway, I've added those two lines.
Closing.