UTF-8 strings are corrupted when BOM is present
Original Reporter info from Mantis: sekelsenmat
- Reporter name: Felipe Monteiro de Carvalho
Description:
Hello,
UTF-8 string literals assigned to ansistrings (and shortstrings) are corrupted if the source file starts with a UTF-8 BOM.
I attached two test apps. They are identical, except that utftestbom.pas starts with a BOM and utftest.pas does not.
The application is trivial: it writes every byte of a UTF-8 string to the screen in hexadecimal:
program utftest;

{$mode objfpc}{$H+}

uses SysUtils;

var
  MyStr: string;
  i: Integer;
begin
  MyStr := 'Texto ł ñ ø ß á';
  WriteLn('Printing string values');
  WriteLn('Length: ', Length(MyStr));
  for i := 1 to Length(MyStr) do
    Write(IntToHex(Integer(MyStr[i]), 2) + ' ');
  WriteLn('');
end.
Only the application without the BOM prints the correct bytes:
felipe2:~/Programas/Teste/carbondemos felipemonteirodecarvalho$ ./utftest
Printing string values
Length: 20
54 65 78 74 6F 20 C5 82 20 C3 B1 20 C3 B8 20 C3 9F 20 C3 A1
felipe2:~/Programas/Teste/carbondemos felipemonteirodecarvalho$ ./utftestbom
Printing string values
Length: 15
54 65 78 74 6F 20 20 20 20 20 20 20 20 20 20
Every UTF-8 string with non-ASCII characters that I tested shows this problem.
Here is my FPC version (I built it myself, with 2.1.4 as the starting compiler):
felipe2:~/Programas/Teste/carbondemos felipemonteirodecarvalho$ fpc
Free Pascal Compiler version 2.3.1 [2007/06/10] for i386
Copyright (c) 1993-2007 by Florian Klaempfl
Jonas said that the compiler stores these kinds of strings internally as WideStrings and handles them differently depending on the codepage. So the problem probably lies in the UTF-8 handling.
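A minimal sketch of how that explanation could account for the utftestbom output. It assumes two things that are inferred from the dump, not confirmed from the compiler sources: the literal is decoded into one character per code point, and every character that cannot be represented in the target codepage is replaced by a space. The byte array is the dump printed by utftest; the program name and identifiers are only illustrative.

```pascal
program bomsketch;

{$mode objfpc}{$H+}

uses SysUtils;

const
  // The 20 bytes printed by utftest (no BOM), given as numeric values so
  // that this sketch does not depend on the encoding of its own source file.
  Utf8Bytes: array[0..19] of Byte = (
    $54, $65, $78, $74, $6F, $20, $C5, $82, $20, $C3,
    $B1, $20, $C3, $B8, $20, $C3, $9F, $20, $C3, $A1);

var
  i, Count: Integer;
  Line: string;
begin
  Count := 0;
  Line := '';
  for i := Low(Utf8Bytes) to High(Utf8Bytes) do
  begin
    // Continuation bytes (10xxxxxx) belong to the previous code point.
    if (Utf8Bytes[i] and $C0) = $80 then
      Continue;
    Inc(Count);
    if Utf8Bytes[i] < $80 then
      // ASCII characters survive the conversion unchanged.
      Line := Line + IntToHex(Utf8Bytes[i], 2) + ' '
    else
      // Assumption: a non-ASCII character is replaced by a space.
      Line := Line + '20 ';
  end;
  WriteLn('Length: ', Count);
  WriteLn(Line);
end.
```

Under those assumptions it prints Length: 15 followed by the same fifteen hex values as utftestbom, which is consistent with the literal being decoded correctly and then damaged when it is converted back to a single-byte string.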
thanks,
Mantis conversion info:
- Mantis ID: 9058
- OS: Mac OS X
- OS Build: 10.4.9
- Platform: MacBook
- Version: 2.3.1