[Epic] Text Buffer rewrite #11107

Open
opened 2026-01-31 02:38:51 +00:00 by claunia · 0 comments

Originally created by @DHowett on GitHub (Oct 22, 2020).

This is the issue tracking the great buffer rewrite of 202x.

Aims

  • [x] Refactor to remove the need for UnicodeStorage (which is a lookup table keyed on row+column)
    • Removing this allows us to remove ROW::_id, ROW::_pParent, CharRow::_pParent
  • [x] Reduce the fiddliness of the DBCS attribute APIs
    • DBCS attributes are stored for every character when they could be easily inferred from column position
  • [x] Add support for the storage of surrogate pairs
    • Surrogate pairs work today as an accident of fate: a pair of UTF-16 code units encoding an EA=wide codepoint is seen as wide, which conveniently matches how many wchar_t it takes up.
    • We have little to no proper support for a codepoint requiring two UTF-16 code units that is only seen as one column wide (#6555 (master issue), #6162, #8709)
  • [x] Provide a platform on which to build full ZWJ support (#1472)
  • [x] Kill CharRow, CharRowCell, CharRowCellReference
  • [ ] Reduce the static storage required to store a row (eventually) by not storing space characters
    • This should make MeasureRight faster, and therefore help fix #32.

Notes

Surrogate Pairs

Work will be required to teach WriteCharsLegacy to measure codepoints in aggregate (for example, both UTF-16 code units of a surrogate pair together), rather than as individual code units.

I have done a small amount of work in WriteCharsLegacy. It is slow going.
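
As an illustration of what "measuring in aggregate" could mean, here is a minimal sketch that walks a UTF-16 string and yields whole codepoints instead of individual code units. The names `DecodedCodepoint` and `DecodeUtf16` are hypothetical and are not taken from WriteCharsLegacy; replacing unpaired surrogates with U+FFFD is likewise an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// One decoded codepoint plus the number of UTF-16 code units it occupied.
// (Hypothetical type, for illustration only.)
struct DecodedCodepoint
{
    char32_t value;
    std::size_t units; // 1 for BMP characters, 2 for a surrogate pair
};

constexpr bool IsHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool IsLowSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical helper: groups code units into codepoints so that a width
// lookup can be done once per codepoint rather than once per code unit.
std::vector<DecodedCodepoint> DecodeUtf16(std::u16string_view text)
{
    std::vector<DecodedCodepoint> out;
    for (std::size_t i = 0; i < text.size();)
    {
        const char16_t u = text[i];
        if (IsHighSurrogate(u) && i + 1 < text.size() && IsLowSurrogate(text[i + 1]))
        {
            // Combine the pair into a single supplementary-plane codepoint.
            const char32_t cp = 0x10000 + ((char32_t{ u } - 0xD800) << 10) + (char32_t{ text[i + 1] } - 0xDC00);
            assert(cp >= 0x10000 && cp <= 0x10FFFF);
            out.push_back({ cp, 2 });
            i += 2;
        }
        else if (IsHighSurrogate(u) || IsLowSurrogate(u))
        {
            // Assumption of this sketch: an unpaired surrogate becomes U+FFFD.
            out.push_back({ 0xFFFD, 1 });
            i += 1;
        }
        else
        {
            out.push_back({ u, 1 });
            i += 1;
        }
    }
    return out;
}
```

With this shape, the pair d83e dd76 is reported as the single codepoint U+1F976 occupying two code units, so its column width only has to be looked up once.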

Motivation

#8689 (IRM) requires us to be able to shift buffer contents rightward. I implemented it in a hacky way, but then realized that UnicodeStorage would need to be rekeyed.

Implementation

The buffer is currently stored as a vector (small_vector) of CharRowCell, each of which contains a DbcsAttribute and a wchar_t. Each cell takes 3 bytes (plus padding, if required).

In the common case (all narrow text), this is terribly wasteful.

To better support characters that are represented by one or more code units, we are going to move to a single wchar_t string combined with a column count table. The column count table holds one column count per code unit and will be stored compressed by way of til::rle (#8741).
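
A minimal sketch of how a per-code-unit column count table might be run-length encoded. The `Run` struct and both function names are hypothetical stand-ins; this is not the til::rle API, only an illustration of the idea.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative stand-in for a run-length encoded entry (NOT til::rle).
struct Run
{
    std::uint8_t columns; // column count shared by every code unit in the run
    std::size_t length;   // number of consecutive code units in the run
};

// Compresses a table holding one column count per code unit into runs.
std::vector<Run> CompressColumnCounts(const std::vector<std::uint8_t>& counts)
{
    std::vector<Run> runs;
    for (const auto c : counts)
    {
        if (!runs.empty() && runs.back().columns == c)
        {
            ++runs.back().length; // extend the current run
        }
        else
        {
            runs.push_back({ c, 1 }); // start a new run
        }
    }
    return runs;
}

// Expands the runs back into one column count per code unit.
std::vector<std::uint8_t> ExpandColumnCounts(const std::vector<Run>& runs)
{
    std::vector<std::uint8_t> counts;
    for (const auto& r : runs)
    {
        assert(r.length > 0);
        counts.insert(counts.end(), r.length, r.columns);
    }
    return counts;
}
```

Under this sketch, a row of 80 narrow characters collapses to a single run. A row made up entirely of surrogate pairs compresses poorly, because its counts alternate between a nonzero width and zero.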

Simple case - all glyphs narrow
 CHAR    A    B    C    D
UNITS 0041 0042 0043 0044
 COLS    1    1    1    1

Simple case - all glyphs wide
 CHAR   カ   タ   カ   ナ
UNITS 30ab 30bf 30ab 30ca
 COLS    2    2    2    2

Surrogate pair case - glyphs narrow
 CHAR         🕴        🕴        🕴
UNITS d83d dd74 d83d dd74 d83d dd74
 COLS    1    0    1    0    1    0

Surrogate pair case - glyphs wide
 CHAR        🥶        🥶        🥶
UNITS d83e dd76 d83e dd76 d83e dd76
 COLS    2    0    2    0    2    0

Representative complicated case
 CHAR        🥶    A    B         🕴
UNITS d83e dd76 0041 0042 d83d dd74
 COLS    2    0    1    1    1    0

Representative complicated case (huge character)
[FUTURE WORK]
 CHAR ﷽
UNITS         fdfd
 COLS           12

Representative complicated case (Emoji with skin tone variation)
[FUTURE WORK]
 CHAR                  👍🏼
UNITS d83d dc4d d83c dffc
 COLS    2    0    0    0

A column count of zero indicates a code unit that is a continuation of an existing glyph.

Since there is one column width for each code unit, it is trivial to match column offsets with character string indices by summation.
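
The summation can be sketched as follows, using an uncompressed column count table for clarity. All three function names are hypothetical. The last one illustrates how a leading/trailing distinction could be inferred from column position rather than stored per character, which is an assumption about how the DBCS aim might be met, not a description of the final design.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Returns the index of the first code unit of the glyph covering `column`,
// or std::nullopt if the column lies past the end of the row's text.
// `counts` holds one column count per code unit; 0 marks a continuation unit.
std::optional<std::size_t> CodeUnitIndexForColumn(const std::vector<std::uint8_t>& counts, std::size_t column)
{
    std::size_t firstColumn = 0; // column at which the current glyph starts
    for (std::size_t i = 0; i < counts.size(); ++i)
    {
        const std::size_t width = counts[i];
        if (width != 0 && column < firstColumn + width)
        {
            return i;
        }
        firstColumn += width;
    }
    return std::nullopt;
}

// Returns the column at which the glyph containing code unit `index` starts.
std::size_t ColumnForCodeUnitIndex(const std::vector<std::uint8_t>& counts, std::size_t index)
{
    assert(index < counts.size());
    // Step back over continuation units to the unit that begins the glyph.
    while (index > 0 && counts[index] == 0)
    {
        --index;
    }
    std::size_t column = 0;
    for (std::size_t i = 0; i < index; ++i)
    {
        column += counts[i];
    }
    return column;
}

// A column is a trailing cell when the glyph covering it started in an earlier column.
bool IsTrailingColumn(const std::vector<std::uint8_t>& counts, std::size_t column)
{
    const auto index = CodeUnitIndexForColumn(counts, column);
    return index.has_value() && ColumnForCodeUnitIndex(counts, *index) != column;
}
```

For the "representative complicated case" above (counts 2 0 1 1 1 0), column 2 maps to code unit index 2 (the letter A), and column 1 is reported as the trailing half of the wide glyph that starts at column 0.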

Work Log

  • [x] Add tests for reflow so that we can rewrite it (#8715)
  • [x] Hide more of CharRow/AttrRow's implementation details inside Row (#8446)
  • [ ] (from Michael) til::rle<T, S> - a run-length-encoded storage template, which we will use to store column counts

Other issues that might just be fixed by this

  • [x] #8839
  • [x] #11756
  • [x] #32
  • [x] #6987
  • [x] #30
  • [x] #4968

Reference: starred/terminal#11107