Half-width Katakana and (han)dakuten should not overlap/combine. #22434

Open
opened 2026-01-31 08:13:08 +00:00 by claunia · 7 comments

Originally created by @PhMajerus on GitHub (Oct 20, 2024).

Windows Terminal version

1.23.2913.0

Windows build number

10.0.26100.2033 ARM64

Explanations

I believe there is an error in the code for grapheme clusters text width computation in the current version of Windows Terminal (tested in Preview and Canary).

Japanese in the terminal can be tricky. For historical reasons there are two sets of katakana, a full-width that fits square / double-cells like hiragana and kanji, and a half-width set that fits single cells like ASCII text does.
The problem is how these handle dakuten (and handakuten, but I'll use dakuten to refer to both from now on), which are the Japanese equivalent of accents, and like other diacritical marks, can be combining or not… We have 3 sets of them, a non-combining half-width version, a non-combining full-width version, and a combining full-width version, plus precomposed characters as well.

U+3099 ◌゙ COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
U+309A ◌゚ COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
U+309B ゛ KATAKANA-HIRAGANA VOICED SOUND MARK
U+309C ゜ KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
U+FF9E ﾞ HALFWIDTH KATAKANA VOICED SOUND MARK
U+FF9F ﾟ HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
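
For anyone who wants to check these six code points themselves, here is a minimal sketch using Python's standard `unicodedata` module. The `describe` helper and the `MARKS` list are names made up for this illustration, and the values printed depend on the Unicode version bundled with your Python build.

```python
import unicodedata

# The six (han)dakuten code points listed above, written as escapes so the
# source is unambiguous even if an editor normalizes the text.
MARKS = ["\u3099", "\u309A", "\u309B", "\u309C", "\uFF9E", "\uFF9F"]


def describe(ch):
    """Return the Unicode properties relevant to this issue (hypothetical helper)."""
    return {
        "codepoint": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),          # Mn = nonspacing mark
        "combining_class": unicodedata.combining(ch),  # 0 = not reordered as a mark
        "east_asian_width": unicodedata.east_asian_width(ch),  # H = halfwidth
    }


if __name__ == "__main__":
    for mark in MARKS:
        print(describe(mark))
```

Only U+3099 and U+309A are reported as nonspacing marks; the half-width pair comes back as modifier letters with halfwidth East Asian width.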

Take Windows Terminal written in Japanese: ウィンドウズ・ターミナル.
The ド is ト with an extra ゛ mark, and ズ is ス with an extra ゛ mark.
There are 46 katakana, plus 9 small forms, which required their own glyphs in old terminals and PCs, and a large part of them can combine with ゛ and/or ゜, yielding an extra 30 common combined katakana. Some foreign sounds can be represented using less common combinations, for a total of 92 katakana glyph variations. Add the Japanese punctuation characters, and we reach over 100 symbols.
So while glyphs representing the combined katakana+dakuten are desirable and better looking, old systems didn't combine them and used the main katakana glyph followed by the (han)dakuten glyph. This worked pretty well for half-width katakana, as they felt squeezed, and the dakuten as a second character cell basically just made those square again.

So in half-width katakana, Windows Terminal is written ｳｨﾝﾄﾞｳｽﾞ･ﾀｰﾐﾅﾙ. Note how the ド is represented using the two glyphs ﾄﾞ, and ズ with the two glyphs ｽﾞ.

When handling them as grapheme clusters, it makes sense to handle the half-width katakana+dakuten as a single group, they should never be separated. But when displayed in a console or terminal, they are separate characters, and probably should be handled separately, as in legacy systems such as those using Shift-JIS (MS-DOS and Windows codepage 932), they were really separate characters and dakuten could be placed anywhere by themselves.
Even more important, when displaying them, they do not combine or overlap!

The following behavior is the correct and expected way to show them in a terminal:
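![Image](https://github.com/user-attachments/assets/247ae0e1-03db-41c0-abdb-8bdcea52c0b7)

This is how it works in Windows Terminal Canary with the wcswidth text measurement mode.

But when using the Grapheme clusters text measurement mode, half-width (han)dakuten are handled like combining diacritics, overlapping the previous katakana:
![Image](https://github.com/user-attachments/assets/d5b38661-6757-4b9c-9486-3bc4b463a806)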

So to be clear, U+3099 and U+309A are full-width combining, while U+309B and U+309C are full-width, U+FF9E and U+FF9F are half-width, all non-combining.

ウィンドウズ・ターミナル is full-width using precomposed characters, ウィンドウズ・ターミナル is full-width using combining dakuten, ウィント゛ウス゛・ターミナル is full-width using non-combining dakuten, and ｳｨﾝﾄﾞｳｽﾞ･ﾀｰﾐﾅﾙ is half-width, which is always non-combining.
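
The relationship between these four spellings can be sketched with Unicode normalization, using just ド as the sample. This is an illustration with Python's `unicodedata.normalize`; the constant names are made up here, and the strings are written as escapes so the forms cannot be silently converted into one another.

```python
import unicodedata

PRECOMPOSED = "\u30C9"         # ド as a single precomposed code point
COMBINING = "\u30C8\u3099"     # ト + combining voiced sound mark
SPACING_FULL = "\u30C8\u309B"  # ト + full-width non-combining mark
HALFWIDTH = "\uFF84\uFF9E"     # half-width ト + half-width voiced sound mark


def normalize(form, text):
    """Thin wrapper around unicodedata.normalize, for readability."""
    return unicodedata.normalize(form, text)


if __name__ == "__main__":
    # Canonical normalization links the precomposed and combining spellings.
    print(normalize("NFC", COMBINING) == PRECOMPOSED)     # canonical composition
    print(normalize("NFD", PRECOMPOSED) == COMBINING)     # canonical decomposition
    # The half-width pair is untouched by canonical normalization and only
    # folds into the precomposed form under compatibility normalization.
    print(normalize("NFC", HALFWIDTH) == HALFWIDTH)
    print(normalize("NFKC", HALFWIDTH) == PRECOMPOSED)
    # The full-width non-combining mark never folds into the precomposed form.
    print(normalize("NFKC", SPACING_FULL) == PRECOMPOSED)
```

In other words, only the combining spelling is canonically equivalent to the precomposed one; the half-width spelling is merely compatibility-equivalent, which is consistent with it being displayed as two separate cells.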

I think for Windows Terminal, the grapheme clusters code should not group half-width katakana with dakuten or handakuten. It would fix the text measurement issue, and users probably expect cursor navigation to treat those characters as completely separate.
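
To make the measurement difference concrete, here is a rough sketch of two width policies. Both functions are assumptions written for illustration: the per-code-point one is a simplified stand-in for a wcswidth-style table (real implementations use their own data), and the "first code point only" one is a hypothetical cluster policy, not a claim about what Windows Terminal actually does.

```python
import unicodedata


def codepoint_width(ch):
    """Simplified cell width: 0 for nonspacing/enclosing marks, 2 for wide, else 1."""
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def width_per_codepoint(text):
    """wcswidth-like policy: sum the width of every code point."""
    return sum(codepoint_width(ch) for ch in text)


def width_first_codepoint_only(cluster):
    """Hypothetical cluster policy: the whole cluster gets the base character's width."""
    return codepoint_width(cluster[0])


if __name__ == "__main__":
    halfwidth_zu = "\uFF7D\uFF9E"  # half-width ス + half-width voiced sound mark
    combining_zu = "\u30B9\u3099"  # full-width ス + combining voiced sound mark
    for label, text in (("half-width", halfwidth_zu), ("combining", combining_zu)):
        print(label, width_per_codepoint(text), width_first_codepoint_only(text))
```

Under these assumptions the two policies agree for the combining spelling but disagree for the half-width one, where the mark loses its own cell and has nowhere to go except on top of the base glyph.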

Expected Behavior

Half-width katakana shouldn't have dakuten overlapping them.

Actual Behavior

Half-width katakana have dakuten overlapping them as if they were combining diacritical marks.

claunia added the Area-Rendering, Issue-Bug, Product-Terminal, and Area-Fonts labels 2026-01-31 08:13:08 +00:00

@lhecker commented on GitHub (Oct 21, 2024):

First of all: Thank you so much for the detailed report! I learned a couple new things thanks to you.

Before anything else, I believe we should address the rendering of "ｽﾞ".
In Windows Terminal the ﾞ is placed at the wrong spot. It is possible that this is simply a flaw in the font files we're using, in which case this is something we can't fix. We'll have to investigate that.

Regarding the grapheme clustering, I believe the current implementation is correct: While I 100% believe you that they used to be separate entities, you yourself wrote:

> When handling them as grapheme clusters, it makes sense to handle the half-width katakana+dakuten as a single group, they should never be separated.

Since we use grapheme clusters, we should follow the grapheme clustering algorithm. The wcswidth measurement mode exists precisely to support older applications that assume the two characters occupy separate cells. And it is IMO the task of a potential readline implementation to allow the user to backspace-delete just the combining character.


@PhMajerus commented on GitHub (Oct 21, 2024):

You're probably right, it makes sense for grapheme clusters to keep them grouped as they are logical characters, even when non-combining.
It's just that we have been able to move the cursor between them individually for so long that it feels unnatural, while modernizing it using grapheme clusters is probably the right thing to do and get used to.

The main problem I wanted to report is the rendering, and I don't think it's the font because I tried with several of them: Cascadia Next JP, Consolas, MS Gothic, MS Mincho, and they all exhibit the same problem.


@o-sdn-o commented on GitHub (Oct 21, 2024):

Perhaps the following info may be useful.

Character | Codepoint | TR29 Grapheme Cluster Break Property
----------|-----------|-------------------------------------
ｽ | U+FF7D | Any
ﾞ | U+FF9E | Extend

Matches the GB9 clustering rule:

Any × (Extend | ZWJ)
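
As a toy illustration of that rule, the sketch below applies only GB9 and nothing else from UAX #29. The `is_extend_or_zwj` test is an approximation I made up for this example (nonspacing and enclosing marks, ZWJ, plus the two half-width sound marks); a real implementation would read the Grapheme_Cluster_Break property from the Unicode data files.

```python
import unicodedata

ZWJ = "\u200D"
HALFWIDTH_SOUND_MARKS = {"\uFF9E", "\uFF9F"}


def is_extend_or_zwj(ch):
    """Approximate 'Extend | ZWJ' (assumption: not the full property table)."""
    return (
        ch == ZWJ
        or ch in HALFWIDTH_SOUND_MARKS
        or unicodedata.category(ch) in ("Mn", "Me")
    )


def clusters_gb9_only(text):
    """Split text into clusters, never breaking before Extend or ZWJ (GB9 only)."""
    clusters = []
    for ch in text:
        if clusters and is_extend_or_zwj(ch):
            clusters[-1] += ch   # Any x (Extend | ZWJ): glue to the previous cluster
        else:
            clusters.append(ch)  # otherwise break (stand-in for the default rule)
    return clusters


if __name__ == "__main__":
    print(clusters_gb9_only("\uFF73\uFF7D\uFF9E"))  # half-width ウ, then ス + mark
```

With this rule the half-width ス and its sound mark end up in one cluster, which matches the table above even though the mark is a spacing character.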

Using the font Cascadia Next JP + DirectWrite API we can get:

  • IDWriteTextAnalyzer2::GetGlyphs("ｽﾞ") + IDWriteTextAnalyzer2::GetGlyphPlacements("ｽﾞ") returns a glyph run:

    • Glyph ｽ:
      Metric | Value
      ------------------|------
      leftSideBearing | 111
      advanceWidth | 1024
      rightSideBearing | 83
      topSideBearing | 361
      advanceHeight | 2000
      bottomSideBearing | 177
      verticalOriginY | 1760
    • Glyph ﾞ:
      Metric | Value
      ------------------|------
      leftSideBearing | 71
      advanceWidth | 1024
      rightSideBearing | 440
      topSideBearing | 23
      advanceHeight | 2000
      bottomSideBearing | 1500
      verticalOriginY | 1760
  • IDWriteGlyphRunAnalysis::CreateAlphaTexture(glyphrun{ｽ,ﾞ}, DWRITE_TEXT_ANTIALIAS_MODE_GRAYSCALE) returns an alpha mask:
    Mask bits:
    ![Image](https://github.com/user-attachments/assets/4232ff13-4326-4cb2-a9e5-f026c25e6bf9)


@lhecker commented on GitHub (Oct 21, 2024):

I suspect the issue is here: https://github.com/microsoft/terminal/blob/18098eca423f996695befd44bd2bd067002c7023/src/renderer/atlas/AtlasEngine.cpp#L904-L917
...because I'm not actually getting the glyph advances. Well, the good news is that adding support for that will allow us to center glyphs in their cell which will help with the appearance of font fallback (i.e. where the fallback font has a narrow appearance but a wide cell was allocated).
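
A quick back-of-the-envelope check of that hypothesis, using the horizontal metrics reported earlier in the thread for Cascadia Next JP (in font design units). This is only arithmetic on those numbers, not the AtlasEngine code; `GlyphMetrics`, `ink_extent`, and `overlaps` are names invented for the sketch.

```python
from dataclasses import dataclass


@dataclass
class GlyphMetrics:
    left_side_bearing: int
    advance_width: int
    right_side_bearing: int


def ink_extent(metrics, origin_x):
    """Horizontal range covered by the glyph's ink when drawn at origin_x."""
    left = origin_x + metrics.left_side_bearing
    right = origin_x + metrics.advance_width - metrics.right_side_bearing
    return (left, right)


def overlaps(a, b):
    """True if two (left, right) ranges intersect."""
    return a[0] < b[1] and b[0] < a[1]


# Metrics quoted in the thread for the half-width ス and its voiced sound mark.
SU = GlyphMetrics(left_side_bearing=111, advance_width=1024, right_side_bearing=83)
DAKUTEN = GlyphMetrics(left_side_bearing=71, advance_width=1024, right_side_bearing=440)

if __name__ == "__main__":
    base = ink_extent(SU, 0)
    ignoring_advance = ink_extent(DAKUTEN, 0)                # mark drawn at the same origin
    using_advance = ink_extent(DAKUTEN, SU.advance_width)    # mark drawn after the base
    print(base, ignoring_advance, overlaps(base, ignoring_advance))
    print(base, using_advance, overlaps(base, using_advance))
```

If the mark is drawn at the same origin as the base glyph, the two ink ranges intersect, which is what the screenshots show; offsetting it by the base glyph's advance separates them.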


@tats-u commented on GitHub (Sep 5, 2025):

Half-width katakana (semi-)voiced sound marks are the only exceptions to the rule that combining characters should be placed at the same place as the base character.
Grapheme clusters were not designed only, or even primarily, for terminals.
They are mainly designed for splitting text into characters and counting them.


@Wukuyon commented on GitHub (Sep 5, 2025):

> Half-width katakana (semi-)voiced sound marks are the only exceptions to the rule that combining characters should be placed at the same place as the base character.

Note that the half-width katakana sound marks U+FF9E and U+FF9F are technically not combining characters.

  • They are base characters: they belong to General Category Letter/L.
  • They are, uniquely, the only base characters that are also grapheme extenders (defined in Unicode D59). This is why they "extend" any preceding extended grapheme cluster, as per rule UAX29-GB9. But they do not create a combining character sequence.
  • As base characters, U+FF9E and U+FF9F definitely should have positive advance width, even in terminals, just like any other base character. They should each take up a terminal cell.
Aside about spacing marks: those combining characters that should also take up space

There are, however, combining characters that also should have positive advance width: the spacing marks. Pasting this from another comment I made in https://github.com/kovidgoyal/kitty/issues/8533#issuecomment-3202373085:

  • There are two subtypes of combining characters: nonspacing marks (D53) and spacing marks (D55).
    • Nonspacing marks include all characters with the General Category of Nonspacing Mark (Mn) or Enclosing Mark (Me) (D53).
    • All combining characters that are not nonspacing marks are spacing marks (D55), i.e., all characters with the General Category of Spacing Mark (Mc) (D52).
  • Rule UAX29-GB9a prevents grapheme cluster breaks before SpacingMark characters, which include characters with a General Category of Spacing_Mark (Mc).
  • UAX44, Table 12 describes Spacing_Mark (Mc) as "a spacing combining mark (positive advance width)". This implies that Spacing_Mark (Mc) characters should each take up a positive width, even though they are combining characters.
  • Furthermore, D55 in the Core Specification defines spacing marks (any Spacing_Mark/Mc character) and says, "In general, the behavior of spacing marks does not differ greatly from that of base characters."
  • In other words, according to D55, spacing marks should take up positive advance width, like base characters do (D51).
  • For example: U+093F DEVANAGARI VOWEL SIGN I ( ि) and U+0BCA TAMIL VOWEL SIGN O (◌ொ).
  • When you combine ि into कि or खि or ◌ொ into டொ or ணொ, each of those is a single extended grapheme cluster. That's useful for line breaking and text selection.
  • But putting each of those clusters in a single halfwidth terminal cell is probably a bad default. Imagine shrinking कि, खि, டொ, or even ணொ into a single halfwidth terminal cell each. They would become unreadable to any reader.
  • This is why the Unicode Standard defines ि and ◌ொ as spacing marks, which the Standard says have "positive advance width", like base characters.
  • And so, terminals should treat them as such and give them positive advance width, unlike nonspacing marks.
  • (Arguably, டொ and ணொ should take up three cells, since ◌ொ is a split spacing combining mark. See Core Specification, Section 4.3.1, "Split Class Zero Combining Marks". It's worth considering making spacing combining marks that have Indic_Positional_Category=Left_And_Right, like ◌ொ, have an advance width of two in terminals.)

So, not only U+FF9E and U+FF9F, which are grapheme-extender base characters, but also spacing marks should take up advance width in terminals. Although how much width each should take up might be a more difficult question.
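
The distinction drawn above can be checked mechanically from the General Category alone. A small sketch follows; `mark_kind` is a hypothetical helper, and the labels it returns are just the D53/D55 terms used in this comment.

```python
import unicodedata


def mark_kind(ch):
    """Classify a character by General Category, following D53 and D55."""
    category = unicodedata.category(ch)
    if category in ("Mn", "Me"):
        return "nonspacing mark"
    if category == "Mc":
        return "spacing mark"
    return "not a combining mark"


if __name__ == "__main__":
    samples = {
        "U+093F DEVANAGARI VOWEL SIGN I": "\u093F",
        "U+0BCA TAMIL VOWEL SIGN O": "\u0BCA",
        "U+3099 COMBINING VOICED SOUND MARK": "\u3099",
        "U+FF9E HALFWIDTH VOICED SOUND MARK": "\uFF9E",
    }
    for label, ch in samples.items():
        print(label, "->", mark_kind(ch), f"({unicodedata.category(ch)})")
```

A terminal could use a check like this as a first cut for deciding which members of a cluster get their own advance, though how many cells each should get is a separate question.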

[General Category Values]: https://www.unicode.org/reports/tr44/#General_Category_Values
[UAX29-GB9]: https://www.unicode.org/reports/tr29/#GB9
[UAX29-GB9a]: https://www.unicode.org/reports/tr29/#GB9a
[UAX29-GB9b]: https://www.unicode.org/reports/tr29/#GB9b
[UAX29-C1-1]: https://www.unicode.org/reports/tr29/#C1-1
[UAX11]: https://www.unicode.org/reports/tr11/
[SpacingMark]: https://www.unicode.org/reports/tr29/#SpacingMark
[Prepend]: https://www.unicode.org/reports/tr29/#Prepend
[Extend]: https://www.unicode.org/reports/tr29/#Extend
[D51]: https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G37171
[D52]: https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G1632
[D53]: https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G2459
[D55]: https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G62466
[D59]: https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G41165
[D66]: https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G751
[L2/03-106]: https://www.unicode.org/L2/L2003/03106-lb6-issue.htm
[L2/06-246]: https://www.unicode.org/L2/L2006/06246-halfwidth.html
[Unicode public mailing list]: https://www.unicode.org/consortium/distlist-unicode.html
[Section 4.3.1]: https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-4/#G129874

@tats-u commented on GitHub (Sep 5, 2025):

> Note that the half-width katakana sounds U+FF9E and U+FF9F are technically not combining characters.

I see.

> So, not only U+FF9E and U+FF9F, which are grapheme-extender base characters, but also spacing marks should take up advance width in terminals.

"खि", "டொ", and "ணொ" are rendered properly in Windows Terminal (though the widths they occupy are broken). We have to fix the voiced sound marks first.

Reference: starred/terminal#22434