[PR #491] [CLOSED] Split to sentences implementation #1306

Closed
opened 2026-01-29 17:15:38 +00:00 by claunia · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/CCExtractor/ccextractor/pull/491
Author: @maxkoryukov
Created: 12/8/2016
Status: Closed

Base: master ← Head: master


📝 Commits (8)

  • eff5fe0 Proto for tests and proto for the sentence buf
  • c260158 Tests were added
  • 02915e4 Sentence buffer : some slight changes
  • 8239932 Break to sentences works good.
  • 57b397d Fix build. No extra comments, removed extra include instructions
  • 808d1f6 SBS: First working version with dup detection
  • 071b751 Additional tests for SBS
  • f1afd74 Final solution for sentence breaker

📊 Changes

10 files changed (+1051 additions, -285 deletions)

📝 .gitignore (+6 -0)
📝 src/lib_ccx/ccx_encoders_common.c (+180 -183)
📝 src/lib_ccx/ccx_encoders_common.h (+9 -11)
📝 src/lib_ccx/ccx_encoders_splitbysentence.c (+413 -91)
➕ src/lib_ccx/debug_def.h (+11 -0)
➕ tests/Makefile (+59 -0)
➕ tests/README.md (+43 -0)
➕ tests/ccx_encoders_splitbysentence_suite.c (+305 -0)
➕ tests/ccx_encoders_splitbysentence_suite.h (+4 -0)
➕ tests/runtest.c (+21 -0)

📄 Description

Hello!

This PR contains the implementation of the Sentence Buffer, which splits subtitles into sentences.

Usage:

./ccextractor -sbs ~/source.ts

Currently, it works only when sub->type == CC_BITMAP. Implementation details are in the comments on the PR.

Long example

New output

1
00:00:00,001 --> 00:00:00,189
Oleon costs.

2
00:00:00,191 --> 00:00:00,783
buried in the annex, 95 Oleon costs.

3
00:00:00,785 --> 00:00:05,159
Didn't want to acknowledge the pressures on hospitals, schools and infrastructure.

Old output

1
00:00:00,001 --> 00:00:00,000
Oleon

2
00:00:00,001 --> 00:00:00,189
Oleon costs.

3
00:00:00,190 --> 00:00:00,889
buried in the annex, 95 Oleon costs.
Didn't

4
00:00:00,890 --> 00:00:01,129
buried in the annex, 95 Oleon costs.
Didn't want

5
00:00:01,130 --> 00:00:01,359
buried in the annex, 95 Oleon costs.
Didn't want to

6
00:00:01,360 --> 00:00:02,059
buried in the annex, 95 Oleon costs.
Didn't want to acknowledge

7
00:00:02,060 --> 00:00:02,299
buried in the annex, 95 Oleon costs.
Didn't want to acknowledge the

8
00:00:02,300 --> 00:00:03,419
Didn't want to acknowledge the
pressures

9
00:00:03,420 --> 00:00:03,609
Didn't want to acknowledge the
pressures on

10
00:00:03,610 --> 00:00:04,029
Didn't want to acknowledge the
pressures on hospitals,

11
00:00:04,030 --> 00:00:04,779
Didn't want to acknowledge the
pressures on hospitals, schools

12
00:00:04,780 --> 00:00:05,019
Didn't want to acknowledge the
pressures on hospitals, schools and

13
00:00:05,020 --> 00:00:05,159
pressures on hospitals, schools and
infrastructure.
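
The difference between the two outputs above can be illustrated with a small sketch. It is not the code from this PR: the names `sbs_buffer`, `sbs_feed`, `sbs_pop` and `overlap_len` are hypothetical, timestamps are ignored, and it assumes that each caption repeats the tail of the previous one and grows by whole words, as in the old output. Duplicate detection here is a plain "longest suffix of the previous caption that is a prefix of the new one" check, which may differ from what the PR actually does.

```c
#include <stdio.h>
#include <string.h>

#define SBS_MAX 1024

/* Hypothetical state for this sketch; not the structure used in the PR. */
struct sbs_buffer {
    char prev[SBS_MAX];    /* previous caption, line breaks turned into spaces */
    char pending[SBS_MAX]; /* text received but not yet emitted as a sentence */
};

/* Length of the longest suffix of `a` that is also a prefix of `b`. */
static size_t overlap_len(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t n = la < lb ? la : lb;

    for (; n > 0; n--)
        if (strncmp(a + la - n, b, n) == 0)
            return n;
    return 0;
}

/* Feed one caption; only the part not already seen is added to `pending`. */
void sbs_feed(struct sbs_buffer *sb, const char *caption)
{
    char cur[SBS_MAX];
    const char *fresh;
    size_t i, used;

    /* Copy the caption and treat line breaks inside it as spaces. */
    snprintf(cur, sizeof cur, "%s", caption);
    for (i = 0; cur[i] != '\0'; i++)
        if (cur[i] == '\n')
            cur[i] = ' ';

    /* Duplicate detection: skip text repeated from the previous caption. */
    fresh = cur + overlap_len(sb->prev, cur);
    while (*fresh == ' ')
        fresh++;

    if (*fresh != '\0') {
        used = strlen(sb->pending);
        snprintf(sb->pending + used, sizeof sb->pending - used, "%s%s",
                 used > 0 ? " " : "", fresh);
    }
    snprintf(sb->prev, sizeof sb->prev, "%s", cur);
}

/* Move the first complete sentence out of `pending`. Returns 1 if one was
 * found, 0 otherwise. A sentence ends at '.', '!' or '?' followed by a space
 * or the end of the text, so abbreviations such as "Mr." split too early. */
int sbs_pop(struct sbs_buffer *sb, char *out, size_t out_size)
{
    char *p = sb->pending;
    size_t i, rest;

    for (i = 0; p[i] != '\0'; i++) {
        if (strchr(".!?", p[i]) != NULL &&
            (p[i + 1] == ' ' || p[i + 1] == '\0')) {
            snprintf(out, out_size, "%.*s", (int)(i + 1), p);
            rest = i + 1;
            while (p[rest] == ' ')
                rest++;
            memmove(p, p + rest, strlen(p + rest) + 1);
            return 1;
        }
    }
    return 0;
}
```

To use the sketch, zero-initialize a `struct sbs_buffer`, call `sbs_feed` once per caption in the order of the old output, and call `sbs_pop` in a loop after each feed until it returns 0. Assigning start and end times to each emitted sentence, which the new output shows, is left out here.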


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

claunia added the pull-request label 2026-01-29 17:15:38 +00:00