Garbled Chinese characters, encoding options do not work #305

Closed
opened 2026-01-29 22:09:46 +00:00 by claunia · 2 comments
Owner

Originally created by @Wildcatii on GitHub (Jun 14, 2018).

I finally traced the error to this section:
`internal override void Read(BinaryReader reader)
{
    Version = reader.ReadUInt16();
    Flags = (HeaderFlags)reader.ReadUInt16();
    CompressionMethod = (ZipCompressionMethod)reader.ReadUInt16();
    LastModifiedTime = reader.ReadUInt16();
    LastModifiedDate = reader.ReadUInt16();
    Crc = reader.ReadUInt32();
    CompressedSize = reader.ReadUInt32();
    UncompressedSize = reader.ReadUInt32();
    ushort nameLength = reader.ReadUInt16();
    ushort extraLength = reader.ReadUInt16();
    byte[] name = reader.ReadBytes(nameLength);
    byte[] extra = reader.ReadBytes(extraLength);

    if (Flags.HasFlag(HeaderFlags.Efs))
    {
        Name = ArchiveEncoding.Decode(name);
    }
    else
    {
        // Use IBM Code Page 437 (IBM PC character encoding set)
        Name = ArchiveEncoding.Decode437(name);
    }`
  1. I set ArchiveEncoding.Default = Encoding.GetEncoding("GBK") for Chinese.

  2. The Flags value has only bit 1 set, not bit 11 (Efs), so the name comes from ArchiveEncoding.Decode437(name) instead of ArchiveEncoding.Decode(name). This garbles the characters. The name becomes
    ║═╞╜_3G_20180201_║═╞╜│╟╟°_124549_│ñ║⌠╙∩╥⌠_┴¬═¿_╓≈╜╨.rcu

I opened the zip file in UE. The lines below contain the local file header; the file data is the hex bytes between ':' and ';'.

00000000h: 50 4B 03 04 14 00 02 00 08 00 BD 6D 42 4C 86 7A ; PK........絤BL唞
00000010h: 36 26 75 65 6D 01 37 5C 6E 01 37 00 11 00 BA CD ; 6&uem.7\n.7...和
00000020h: C6 BD 5F 33 47 5F 32 30 31 38 30 32 30 31 5F BA ; 平_3G_20180201_?
00000030h: CD C6 BD B3 C7 C7 F8 5F 31 32 34 35 34 39 5F B3 ; 推匠乔鴂124549_?
00000040h: A4 BA F4 D3 EF D2 F4 5F C1 AA CD A8 5F D6 F7 BD ; ず粲镆鬫联通_主?
00000050h: D0 2E 72 63 75 55 54 0D 00 07 16 FB 73 5A C5 2C ; ?rcuUT....鹲Z?

The Flags value comes from the first line: it is 02 (bytes 02 00 at offset 6).
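To double-check that reading of the dump, here is a minimal sketch that parses the first 30 bytes shown above with a `BinaryReader`, in the same field order as the `Read` method quoted earlier. The class name `LocalHeaderDemo` and the constant `EfsBit` are made up for this illustration and are not part of SharpCompress. Unlike the quoted method, the sketch also consumes the 4-byte `PK` signature, because the dump starts at offset 0.

```csharp
using System;
using System.IO;

public static class LocalHeaderDemo
{
    // Fixed-size part of the local file header, copied from the hex dump (offsets 0x00-0x1D).
    public static readonly byte[] Header =
    {
        0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x02, 0x00,
        0x08, 0x00, 0xBD, 0x6D, 0x42, 0x4C, 0x86, 0x7A,
        0x36, 0x26, 0x75, 0x65, 0x6D, 0x01, 0x37, 0x5C,
        0x6E, 0x01, 0x37, 0x00, 0x11, 0x00,
    };

    // General purpose bit 11: language encoding flag (EFS). Name chosen for this sketch.
    public const ushort EfsBit = 1 << 11;

    public static (ushort Flags, ushort NameLength, ushort ExtraLength) Parse(byte[] header)
    {
        // BinaryReader reads little-endian values, which is what the zip format uses.
        using (var reader = new BinaryReader(new MemoryStream(header)))
        {
            reader.ReadUInt32();                 // signature "PK\x03\x04"
            reader.ReadUInt16();                 // version needed to extract
            ushort flags = reader.ReadUInt16();  // general purpose bit flags
            reader.ReadUInt16();                 // compression method
            reader.ReadUInt16();                 // last modified time
            reader.ReadUInt16();                 // last modified date
            reader.ReadUInt32();                 // CRC-32
            reader.ReadUInt32();                 // compressed size
            reader.ReadUInt32();                 // uncompressed size
            ushort nameLength = reader.ReadUInt16();
            ushort extraLength = reader.ReadUInt16();
            return (flags, nameLength, extraLength);
        }
    }

    public static void Main()
    {
        var h = Parse(Header);
        Console.WriteLine($"flags = 0x{h.Flags:X4}");                // flags = 0x0002
        Console.WriteLine($"EFS set = {(h.Flags & EfsBit) != 0}");   // EFS set = False
        Console.WriteLine($"name length = {h.NameLength}, extra length = {h.ExtraLength}");
    }
}
```

The name length comes out as 55 bytes, which matches the GBK-encoded file name in the dump, and bit 11 is clear, so the quoted code takes the `Decode437` branch.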

Other unzip software handles the file correctly and shows the name as
和平_3G_20180201_和平城区_124549_长呼语音_联通_主叫.rcu

SevenZipSharp also handles it correctly.
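The difference between the garbled and the correct name can be reproduced outside the library. The sketch below takes the first four name bytes from the dump (`BA CD C6 BD`) and decodes them once as code page 437 and once as code page 936, which is the Windows code page .NET uses for GBK. `MojibakeDemo` and `Decode` are names invented for this example. The `RegisterProvider` call is an assumption about the runtime: it is needed on .NET Core / .NET 5+ for these code pages, and should not be needed on .NET Framework, where they are built in.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // First four bytes of the file name in the hex dump: "和平" encoded in GBK.
    public static readonly byte[] NameBytes = { 0xBA, 0xCD, 0xC6, 0xBD };

    public static string Decode(byte[] bytes, int codePage)
    {
        // Makes legacy code pages such as 437 and 936 available on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(codePage).GetString(bytes);
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(Decode(NameBytes, 437)); // ║═╞╜  (what the Decode437 branch produces)
        Console.WriteLine(Decode(NameBytes, 936)); // 和平  (what other unzip tools show)
    }
}
```

The bytes are identical in both calls; only the encoding chosen for decoding differs.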

I checked http://www.pkware.com/documents/casestudies/APPNOTE.TXT, which describes bit 11 (called Efs in this project):
Bit 11: Language encoding flag (EFS). If this bit is set,
the filename and comment fields for this file
MUST be encoded using UTF-8. (see APPENDIX D)

Note that when bit 11 is set, the name must be encoded in UTF-8.
So I think there is a problem in this code:
`if (Flags.HasFlag(HeaderFlags.Efs))
{
    Name = ArchiveEncoding.Decode(name);
}
else
{
    // Use IBM Code Page 437 (IBM PC character encoding set)
    Name = ArchiveEncoding.Decode437(name);
}`
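One way to read the argument above is sketched here: decode as UTF-8 when bit 11 is set, and otherwise use an encoding supplied by the caller. This is only an illustration of the reporter's point. `NameDecodingSketch`, `DecodeName`, and the `fallback` parameter are hypothetical and are not the SharpCompress API, and the sketch does not claim to show how the library was eventually changed. APPNOTE only mandates UTF-8 when bit 11 is set; using a caller-chosen encoding in the other case is an assumption of this sketch.

```csharp
using System;
using System.Text;

public static class NameDecodingSketch
{
    // General purpose bit 11: language encoding flag (EFS).
    public const ushort EfsBit = 1 << 11;

    // Hypothetical helper: picks the decoder for a zip entry name.
    public static string DecodeName(byte[] name, ushort flags, Encoding fallback)
    {
        if ((flags & EfsBit) != 0)
        {
            // Bit 11 set: APPNOTE says the name MUST be UTF-8.
            return Encoding.UTF8.GetString(name);
        }

        // Bit 11 clear: use whatever the caller configured (assumption of this sketch).
        return fallback.GetString(name);
    }

    public static void Main()
    {
        // Needed on .NET Core / .NET 5+ to make code page 936 (GBK) available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding gbk = Encoding.GetEncoding(936);

        byte[] gbkName = { 0xBA, 0xCD, 0xC6, 0xBD };        // "和平" in GBK, from the dump
        byte[] utf8Name = Encoding.UTF8.GetBytes("和平");    // same text in UTF-8

        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(DecodeName(gbkName, 0x0002, gbk));   // flags from this report, EFS clear
        Console.WriteLine(DecodeName(utf8Name, 0x0800, gbk));  // EFS set, fallback ignored
    }
}
```

Both calls print 和平: the first because the caller's GBK setting is honored, the second because the EFS bit forces UTF-8 regardless of the fallback.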

Author
Owner

@sciarium commented on GitHub (Jul 9, 2018):

My pull request for a similar issue was approved:
https://github.com/adamhathcock/sharpcompress/pull/385

I think it will solve your issue.

Author
Owner

@Wildcatii commented on GitHub (Jul 10, 2018):

@sciarium Thank you. I downloaded the latest version and it works well; the bug has been fixed.

Reference: starred/sharpcompress#305