* feat: unpack gpac
* fix: linux ci
* fix: mac build
* fix: remove unused [no ci]
* fix: ignore config.h [no ci]
* temp commit, will drop this soon
* fix: install gpac
* fix: gpac
* fix: formatting
* fix: preproccessor directive
* fix: comment display version for now
* fix: display dlls code
* fix: bundle vcruntime in hardsubx windows
* fix: again
* fix: erros in ci
* fix: ci
* fix: add vcruntime in additional dependencies
* fix: try to copy vcruntime after build
* fix: space in runtime library
* fix: remove for now [no ci]
* fix: things in vcxproj
* fix: ci for leptonica sys
* fix: docs
* fix: copy dlls on post build event
* fix: copy vcruntime after build
* feat: add arguments through clap
* fix: type of some arguments
* fix: "-" and "--" in comments
* fix: format files
* fix: add argument parsing till mkvlang
* fix: one todo item
* chore: lint fixes
* fix: nocodec value
* fix: for nocodec
* fix: add cfg feature for hardsubx
* feat: complete till startcreditstext
* fix: add more notes, args: option affect processed
* feat: port all till network stuff
* fix: complete almost all argument parsing
* fix: error free code
* fix: complete params port
* fix: hardsubx erros
* feat: clean up main function
* fix: pr reviews
* fix: make input,output function better
* fix: variant not used warning
* fix: warnings
* fix: all clippy warnings
* feat: add tests
* feat: add tests
* chore: lint fixes
* fix: move unit tests to correct folder
* fix: remove unncessary files
* fix: make function for parse_args
* fix: review changes
* fix: Impl CcxOptions whenever I could
* fix: try to convert rust to c
* chore: push c code
* fix: add more rust to c conversions
* fix: use set methods for bitfield
* fix: errors
* fix: arguments parsing
* fix: all issues
* fix: many errors
* chore: lint fix
* fix: err
* fix: unsafe function error
* fix: unsafe warning
* fix: safety lint
* chore: add docs
* fix: windows build
* fix: function
* fix: dependencies
* fix: set_binary_mode
* chore: lint fix
* fix: set_binary_mode for windows
* fix: error
* fix: undefined reference error
* chore: remove comment
* fix: output field
* chore: fix lint
* fix: ru1, ru2, ru3
* fix: undef before
* fix: parameter and update deps
* chore: update vcpkg
* feat: add release-with-debug profile
* fix; uncomment code
* fix: update visual studio to 2022
* chore: update docs
* fix: use default vcpkg
* fix: caching logic on release ci
* fix: vcpkg caching
* fix: add setup vcpkg
* chore: remove unneccesary formatting
* fix: Always write 2 bytes for UTF-16BE
* fix: formatting
* feat: add rest of the notes to bring continuity
* fix: remove extra line
* fix: add hardsubx note
* fix: source code format error
* chore: lint fixes acc to rustfmt
* feat: add unit test ci
* fix: conversion of strings, add file queue handling
* fix: decoder cfg
* fix: update dependencies
* chore: lint fix
* chore: add safety doc
* fix: default value for CcxOptions
* fix(rust): default value for teletext
* fix: leptonica version for windows
* fix: format errors
* fix: workflow
* Revert "fix: leptonica version for windows"
This reverts commit 461ef55e7b.
* fix: pin ffmpeg to 6 for mac
* fix(parser): default values and unwrap's
* fix(parser): hardsubx fixes
* chore(parse): lint fixes
* fix(windows): switch back to sdk 2019
* fix(workflow): windows workflow revert
* fix(windows): revert to old files which were working before
* fix(workflow): pin vcpkg packages
* chore(rust): downgrade leptonica
* fix(windows): move vcpkg.json to correct place
* fix(windows): improve vcxproj
* fix(windows): workflow
* fix(windows): workflow
* fix(windows): workflow clone from vcpkg everytime
* fix(workflow): error
* fix(workflow): don't skip building vcpkg
* fix: remove depth from vcpkg
* temporary commit
* fix(windows): pin gpac and use local vcpkg manifest properly
* fix(windows): install vcpkg dependencies manually
* fix(windows): update dll names
* fix(windows); dependencies copy
* fix(windows): don't continue on error for release
* fix(macos): build ffmpeg for mac workflow
* fix: move ffmpeg to current workspace
* fix: re-add profile for windows
* fix: pkg config for mac
* fix(mac): use ffmpeg@6 from brew
* fix(macos): there is no ffmpeg_prebuilt
* fix(macos): specify ffmpeg pkg config
* fix(macos): globally define pkg config
* fix(macos): add ffmpeg include and libs dir
* fix(macos): include ffmpeg headers in makefile
* fix: include ffmpeg libraries and include directories
* fix: try to manually specify ffmpeg header in rust
* fix: also include leptonica headres
* fix: leptonica name
* fix: test
* fix: string null when output_filename is empty
* fix: error
* fix: remove cflgas
* fix(mac): disable cmake ocr hardsubx
* chore: update gitignore
* fix: null if string is empty
* fix: allow --in
* chore: bump version to 1.0 in rust
* chore: add space to trigger sp
* fix: don't panic with rust
* fix: add double dashes to indicate parameters
* chore: update CHANGES.txt
* fix: test
* fix(workflow): update workflow name
* fix(rust): linux output_filename in sampleplatform
* fix(rust): parser default values
* fix(rust): exit with MalformedParameter instead of panic
* fix(decoder): revert always write 2 bytes
* chore(rust): format
* chore: update lock file
* fix(test): test lib_ccxr and rename to test
* fix(mac): remove failing cmake_ocr test
* fix: ci errors
* fix: feature related changes
* fix: trim down default features
* fix: don't check clippy for all features
Overview
OCR (Optical Character Recognition) is a technique used to extract text from images. In the world of subtitles, subtitles stored in bitmap format are common and even necessary. OCR is used to convert these bitmap subtitles into text subtitles.
Dependencies
- Tesseract (OCR library by Google)
- Leptonica (Image processing library)
How to compile CCExtractor on Linux with OCR
Install dependencies
Using package manager
Ubuntu, Debian
sudo apt-get install libleptonica-dev libtesseract-dev tesseract-ocr-eng
SUSE
zypper install leptonica-devel
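Once the packages are installed, it can help to confirm that the development files are visible before compiling CCExtractor. The sketch below assumes that pkg-config is installed and that the libraries register under the module names tesseract and lept; both assumptions may not hold on every distribution. The function name check_ocr_deps and the use of the PKG_CONFIG variable to select the pkg-config binary are illustrative choices, not part of CCExtractor.

```shell
#!/bin/sh
# Sketch: report whether pkg-config can locate the two OCR dependencies.
# check_ocr_deps is a hypothetical helper; the module names "tesseract" and
# "lept" are assumed and may differ on some systems.
check_ocr_deps() {
    # Allow the pkg-config binary to be overridden through PKG_CONFIG.
    pc=${PKG_CONFIG:-pkg-config}
    missing=0
    for mod in tesseract lept; do
        if "$pc" --exists "$mod" 2>/dev/null; then
            printf '%s: found (version %s)\n' "$mod" "$("$pc" --modversion "$mod")"
        else
            printf '%s: NOT found\n' "$mod"
            missing=1
        fi
    done
    # Non-zero status when at least one module could not be located.
    return "$missing"
}
```

Calling check_ocr_deps prints one line per library and returns a non-zero status if either one is missing, so it can guard a later build step.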
Downloading the source code and compiling it
Leptonica
Leptonica is available as a package in most distributions; you need the development library, liblept-devel.
If Leptonica isn't available for your distribution, or you want to use a newer version than they offer, you can compile your own.
You can download the Leptonica source code from http://www.leptonica.com/download.html
Tesseract
Tesseract is available directly from many Linux distributions. The package is generally called 'tesseract' or 'tesseract-ocr'; search your distribution's repositories to find it. Packages are also generally available for language training data (search the repositories), but if not, you will need to download the appropriate training data, unpack it, and copy the .traineddata file into the 'tessdata' directory, probably /usr/share/tesseract-ocr/tessdata or /usr/share/tessdata.
If Tesseract isn't available for your distribution, or you want to use a newer version than they offer, you can compile your own.
If you compile Tesseract yourself, the following commands, run from its source directory, are sufficient:
./autogen.sh
./configure
make
sudo make install
sudo ldconfig
Note:
- CCExtractor is tested with Tesseract 3.04, but it also works with older versions.
- Useful download links:
  - Tesseract: https://github.com/tesseract-ocr/tesseract/archive/3.04.00.tar.gz
  - Tesseract training data: https://github.com/tesseract-ocr/tessdata/archive/3.04.00.tar.gz
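The manual training-data step described above (unpack the data, then copy the .traineddata file into the tessdata directory) could be scripted roughly as follows. This is only a sketch: the function names and the TESSDATA_CANDIDATES variable are hypothetical, and the default directories are the two probable locations mentioned in the text, which may not match your system. Copying into a system directory usually requires root privileges.

```shell
#!/bin/sh
# Sketch: copy a Tesseract language file into the first tessdata directory
# that exists. find_tessdata_dir, install_traineddata and TESSDATA_CANDIDATES
# are hypothetical names, not part of Tesseract or CCExtractor.

# Print the first existing directory from a space-separated candidate list.
find_tessdata_dir() {
    # Defaults are the two probable locations named in the text above.
    for dir in ${TESSDATA_CANDIDATES:-/usr/share/tesseract-ocr/tessdata /usr/share/tessdata}; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Copy the given .traineddata file into the detected tessdata directory.
install_traineddata() {
    src=$1
    [ -f "$src" ] || { echo "no such file: $src" >&2; return 1; }
    dest=$(find_tessdata_dir) || { echo "no tessdata directory found" >&2; return 1; }
    cp "$src" "$dest/" && printf 'installed %s into %s\n' "$(basename "$src")" "$dest"
}
```

For example, after unpacking the training data, `install_traineddata eng.traineddata` would copy the English data into whichever candidate directory exists. Candidate paths containing spaces are not handled by this sketch.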
Compilation
Using the build script
cd ccextractor/linux
./build
Passing flags to configure
cd ccextractor/linux
./autogen.sh
./configure --with-gui --enable-ocr
make
Passing flags to cmake
cd <CCExtractor cloned code>
mkdir build
cd build
cmake -DWITH_OCR=ON ../src
make
How to compile CCExtractor on Windows with OCR
Download the prebuilt Leptonica and Tesseract libraries from the following link:
https://drive.google.com/file/d/0B2ou7ZfB-2nZOTRtc3hJMHBtUFk/view?usp=sharing
Add the lib and include paths of Leptonica and Tesseract to the project's directories:
- In Visual Studio 2022, right-click the project and select Properties.
- Select Configuration Properties in the left panel of the Properties dialog.
- Select VC++ Directories.
- In the right pane, in the right-hand column of the Include Directories (or Library Directories) property, open the drop-down menu and choose Edit.
- Add the path of the directory where you extracted the Leptonica and Tesseract libraries.
Set the preprocessor flag ENABLE_OCR=1
- In Visual Studio 2022, right-click the project and select Properties.
- In the left panel, select Configuration Properties, C/C++, Preprocessor.
- In the right panel, in the right-hand column of the Preprocessor Definitions property, open the drop-down menu and choose Edit.
- In the Preprocessor Definitions dialog box, add ENABLE_OCR=1. Choose OK to save your changes.
Add the libraries to the linker
- Open the project's Properties.
- Select Configuration Properties.
- Select Linker in the left panel.
- Select Input.
- Select Additional Dependencies in the right panel.
- Add libtesseract304d.lib on a new line.
- Add liblept172.lib on a new line.
Download the language data from the following link:
https://code.google.com/p/tesseract-ocr/downloads/list
After downloading tesseract-ocr-3.02.eng.tar.gz, extract the archive and put the tessdata folder in the directory that contains the CCExtractor executable.
Copy the Tesseract and Leptonica DLLs from the lib folder of the prebuilt libraries downloaded earlier into the folder of the executable or into system32.