This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: What is the intended behaviour of recoding characters outside the target range?


Alexander,

I think it would make sense to model the solution on uconv,
the ICU project's replacement for iconv. uconv lets the user
specify what to do with such data via the --to-callback option,
e.g. escape-xml, escape-unicode, stop, skip, etc.

To use your example: 

uconv -f UTF-8 -t ASCII plentitude --to-callback escape-xml
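
To make the callback idea concrete without relying on ICU, here is a
rough sketch that uses Python's built-in codec error handlers as a
stand-in. The mapping to uconv's callbacks is only an analogy on my part
(ignore ~ skip, xmlcharrefreplace ~ escape-xml, backslashreplace ~
escape-unicode), and the exact escape syntax uconv emits may differ. The
names SAMPLE and convert are made up for illustration.

```python
# The six code points from the quoted example, written as escapes.
SAMPLE = "\u0391\u05d0\u0410\u0627\u10a0\u0531"


def convert(text, policy):
    """Encode text to ASCII, handling unconvertible characters per policy.

    policy is one of Python's codec error handler names; "strict" raises
    UnicodeEncodeError, roughly like a "stop" callback.
    """
    return text.encode("ascii", errors=policy)


if __name__ == "__main__":
    for policy in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace"):
        print(policy, convert(SAMPLE, policy))
```

The point is that the caller, not the converter, chooses what happens to
characters the target encoding lacks.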





Troy Korjuslommi                Tksoft Inc.
tjk@tksoft.com





> 
> Hi guys,
> 
> What happens (and what should happen) when I try to recode a character
> existing in one encoding but missing in another?
> 
> This has two sides:
> 
> 1. What should happen when I try to recode a file with iconv 
>    for example?
> 2. What should happen when I try to use the file via the glibc library
>    in a different locale with a different encoding?
> 
> Let's take the following example:
> The file "plentitude" is encoded in UTF-8 and contains the following
> characters:
> ======
> ΑאАاႠԱ
> ======
> 
> These are as follows:
> 
> U+0391 GREEK CAPITAL LETTER ALPHA
> U+05D0 HEBREW LETTER ALEF
> U+0410 CYRILLIC CAPITAL LETTER A
> U+0627 ARABIC LETTER ALEF
> U+10A0 GEORGIAN CAPITAL LETTER AN
> U+0531 ARMENIAN CAPITAL LETTER AYB
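
For reference, the UTF-8 bytes behind each of these code points can be
listed with a few lines of Python. This is only a sketch for checking the
file contents; CODE_POINTS and describe are names I made up.

```python
import unicodedata

# The six code points listed in the quoted message.
CODE_POINTS = [0x0391, 0x05D0, 0x0410, 0x0627, 0x10A0, 0x0531]


def describe(code_point):
    """Return (U+XXXX label, Unicode name, UTF-8 bytes as hex) for one code point."""
    ch = chr(code_point)
    return f"U+{code_point:04X}", unicodedata.name(ch), ch.encode("utf-8").hex()


if __name__ == "__main__":
    for cp in CODE_POINTS:
        print(*describe(cp))
```

Five of the six characters take two bytes in UTF-8; the Georgian letter
takes three.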
> 
> If I try to convert it to some 8-bit encoding, most probably at least one
> of the characters will be missing. Sometimes all of them are missing, for
> example in the encoding of the C locale, which is ASCII.
> 
> In such cases:
> 
> 1. What I get when trying to convert the file via iconv is usually:
> iconv -f UTF-8 -t ASCII plentitude
> iconv: illegal input sequence at position 0
> or some other position, then iconv exits and the file is not recoded in
> its entirety.
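
The same "fail at the first unconvertible character" behaviour can be
reproduced with a strict conversion in Python. This is a sketch for
illustration, not a description of how iconv is implemented, and
first_unconvertible is a made-up name. Note that the index here counts
characters, whereas iconv's position is, as far as I know, a byte offset.

```python
def first_unconvertible(text, target="ascii"):
    """Return (index, character) of the first character target cannot encode.

    Returns None when the whole text converts cleanly.
    """
    try:
        text.encode(target)  # errors="strict" is the default
    except UnicodeEncodeError as err:
        return err.start, text[err.start]
    return None


if __name__ == "__main__":
    print(first_unconvertible("\u0391\u05d0\u0410"))
    print(first_unconvertible("abc\u0410"))
    print(first_unconvertible("plain ASCII"))
```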
> 
> 2. However when I change my locale to C
> export LANG=C
> I can still cat the file
> 
> 
> Now - suppose the behavior of iconv corresponds to the following:
> 
> The Unicode Standard states:
> http://www.unicode.org/versions/Unicode4.0.0/ch05.pdf
> Section 5.3 Unknown and Missing Characters
> Reserved and Private-Use Character Codes
> =================================================
> An implementation should not attempt to interpret such code points.
> However, in practice, applications must deal with unassigned code points
> or private use characters. This may occur, for example, when the
> application is handling text that originated on a system implementing a
> later release of the Unicode Standard, with additional assigned
> characters.
> Options for rendering such unknown code points include printing the code
> point as four to six hexadecimal digits, printing a black or white box,
> using appropriate glyphs such as ê for reserved and | for private use,
> or simply displaying nothing. An implementation should not blindly
> delete such characters, nor should it unintentionally transform them
> into something else.
> ==================================================
> 
> Since iconv can neither print boxes nor ignore characters, and
> simply deleting them or transforming them into another character is
> unacceptable, it simply fails the conversion.
> 
> Is there a rule stating that a nonexistent character should be recoded
> to some symbol, or should the conversion fail in such cases?
> For example this
> http://www.unicode.org/glossary/#replacement_character
> states
> Character used as a substitute for an uninterpretable character from
> another encoding. The Unicode Standard uses U+FFFD REPLACEMENT CHARACTER
> for this function.
> 
> However - in most encodings this character does not exist.
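
This asymmetry is easy to see in Python, used here only as an
illustration: its "replace" handler yields U+FFFD when decoding into
Unicode, but falls back to "?" when encoding, because the target charset
usually has no U+FFFD. decode_lossy and encode_lossy are names invented
for this sketch.

```python
def decode_lossy(raw, codec="utf-8"):
    """Decode bytes, substituting U+FFFD for undecodable input."""
    return raw.decode(codec, errors="replace")


def encode_lossy(text, codec):
    """Encode text, substituting '?' for characters the codec lacks."""
    return text.encode(codec, errors="replace")


if __name__ == "__main__":
    print(repr(decode_lossy(b"a\xffb")))         # invalid byte becomes U+FFFD
    print(encode_lossy("\u0391", "ascii"))       # Greek alpha becomes '?'
    print(encode_lossy("\ufffd", "latin-1"))     # U+FFFD itself is not in Latin-1
```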
> 
> What should be the behavior of programs trying to do such a conversion?
> 
> Kind regards:
> al_shopov
> 
> 

