UTF-8 in specs

Marcin 'Qrczak' Kowalczyk qrczak at knm.org.pl
Tue Feb 13 17:42:58 CET 2007


On Tue, 13-02-2007 at 16:14 +0100, Andrzej Krzysztofowicz
wrote:

> > I'm disappointed that fmt from coreutils assumes that 1 byte = 1 column.
> > I will try to fix fmt, and then someone might change adapter to use fmt.

Done (coreutils at HEAD). It might need more testing, especially in extreme
cases.

> Be careful as the old behaviour must remain for non-UTF locale.

fmt now processes the text using getwc & putwchar, which means that the
input must be decodable using the current locale encoding.
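
To illustrate why 1 byte = 1 column breaks for UTF-8, here is a rough
Python sketch (not the fmt code itself, which is C). The helper
display_width is a hypothetical name, and the width rule is only an
approximation of what wcwidth would report: it ignores control
characters and treats everything except combining marks and East Asian
wide characters as one column.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate the number of terminal columns occupied by text."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            # Combining marks attach to the previous character.
            continue
        # East Asian Wide / Fullwidth characters take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


sample = "zażółć gęślą jaźń"
# The byte count is larger than the column count for non-ASCII text.
print(len(sample.encode("utf-8")), display_width(sample))
```

A formatter that counts bytes would break Polish lines too early, and
one that counts characters would still break Chinese lines too late.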

fmt was not intended for languages which don't use spaces to delimit
words, like Chinese. For example e2fsprogs.spec / %description -l zh
uses single newlines to delimit one-line paragraphs, and both fmt and
the current adapter make a mess of that. Perhaps adapter should exclude
Chinese from reformatting (anything else?).
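
One possible shape for such an exclusion, sketched in Python rather
than awk. The names NO_REFORMAT, description_lang and should_reformat
are made up for illustration, and the set contains only "zh" because
that is the only language mentioned here; whether others belong in it
is an open question.

```python
import re

# Hypothetical list of language codes whose descriptions are left as-is.
NO_REFORMAT = {"zh"}


def description_lang(header: str):
    """Return the tag from a '%description ... -l <tag>' line, or None."""
    m = re.match(r"%description\s+(?:.*\s)?-l\s+(\S+)", header)
    return m.group(1) if m else None


def should_reformat(header: str) -> bool:
    """Decide whether the description under this header may be refilled."""
    tag = description_lang(header)
    if tag is None:
        return True  # untranslated description
    # Strip territory, codeset and modifier: zh_CN.UTF-8 -> zh
    base = re.split(r"[_.@]", tag)[0]
    return base not in NO_REFORMAT
```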

> BTW, what settings do you intend to use in adapter?

I'm leaving adapter.awk for somebody else to fix.

I have no idea whether fmt and adapter.awk use the same conventions to
delimit paragraphs, preserve existing indentation etc.

adapter.awk currently uses a width of 70; fmt's default is 75.

Languages which prefer single spaces between sentences should use the
new -n / --single-spaces option.

adapter.awk should run fmt in a UTF-8 locale for UTF-8 texts, and
should react somehow if fmt fails (e.g. because the text is malformed
UTF-8). It must set the locale to something using ISO-8859-x when
processing ISO-8859-x texts.
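
A minimal sketch of that calling convention, in Python for brevity
(adapter itself is awk). The locale names in LOCALE_FOR_ENCODING are
assumptions and must match locales actually installed on the system;
the function names are hypothetical. The sketch checks the input
before calling fmt and also treats a non-zero exit status as failure.

```python
import os
import subprocess

# Assumed locale names; the installed locales differ per system.
LOCALE_FOR_ENCODING = {
    "utf-8": "en_US.UTF-8",
    "iso-8859-1": "en_US.ISO-8859-1",
    "iso-8859-2": "pl_PL.ISO-8859-2",
}


def locale_for(encoding: str) -> str:
    """Pick the locale to run fmt under for a given text encoding."""
    return LOCALE_FOR_ENCODING[encoding.lower()]


def is_decodable(data: bytes, encoding: str) -> bool:
    """Check that data is well-formed in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


def run_fmt(data: bytes, encoding: str, width: int = 70) -> bytes:
    """Refill data with fmt under a locale matching its encoding."""
    if not is_decodable(data, encoding):
        raise ValueError("text is not valid " + encoding)
    env = dict(os.environ, LC_ALL=locale_for(encoding))
    result = subprocess.run(
        ["fmt", "-w", str(width)],
        input=data,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if result.returncode != 0:
        # React to failure instead of silently using the output.
        raise RuntimeError(result.stderr.decode(errors="replace"))
    return result.stdout
```

On failure the caller could keep the original text unchanged and print
a warning, which at least avoids writing back damaged descriptions.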

Or perhaps if AC's RPM was patched to convert texts to the locale
encoding like the newer RPMs, then adapter could convert descriptions
to UTF-8?

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/



More information about the pld-devel-en mailing list