


SED - Special Characters - sed behaving strangely in UTF-8 environment

Your problem

You are using a Linux distribution with UTF-8 encoding.

You are using sed to operate on files containing German umlauts or other non-ASCII characters.

sed is behaving quite strangely. An expression like the following should normally replace the whole content of each line with a single x. The dot, however, no longer matches non-ASCII characters!

sed 's/.*/x/'

The reason

The problem occurs when you operate on ISO-8859 (Latin) encoded files while your environment expects UTF-8.

sed reads the input as UTF-8, so the single byte of a non-ASCII Latin character is misinterpreted as the start of a multibyte sequence or, even worse, as an invalid UTF-8 string.

As a result, sed treats that byte as something the dot does not match. Strange and dangerous…
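
Before changing anything, it can help to confirm that a file really is not valid UTF-8. The following is a minimal sketch, assuming an iconv that stops with an error at the first invalid input sequence (as glibc iconv does). The function name is_valid_utf8 is a hypothetical helper, not part of sed or iconv.

```shell
#!/bin/sh
# is_valid_utf8 FILE
# Hypothetical helper (not from the original page).
# Returns 0 if FILE decodes cleanly as UTF-8, non-zero otherwise.
# Assumption: iconv exits with an error on the first byte sequence
# that is not valid in the source encoding.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}
```

For example, "is_valid_utf8 sourcefile || echo 'not UTF-8'" prints the message for a Latin-1 file containing umlauts. A pure ASCII file passes the check, because ASCII is a subset of UTF-8, and such a file does not trigger the problem anyway.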


Solution

Switching your system back from UTF-8 to ISO-8859 is not a good solution.

A similar problem would then occur whenever you work with UTF-8 files.

It is better to use iconv to convert the data on the fly:

iconv -f latin1 -t utf-8 sourcefile | sed 's/.*/x/' | iconv -f utf-8 -t latin1
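
If you need this pipeline often, it can be wrapped in a small shell function. This is only a sketch: the name sed_latin1 and its argument order (sed script first, file second) are assumptions made for illustration, and it covers a single sed script and a single input file.

```shell
#!/bin/sh
# sed_latin1 SCRIPT FILE
# Hypothetical wrapper around the pipeline shown above:
#   1. convert FILE from Latin-1 to UTF-8,
#   2. run sed with SCRIPT on the UTF-8 stream,
#   3. convert the result back to Latin-1 on standard output.
sed_latin1() {
    iconv -f latin1 -t utf-8 "$2" | sed "$1" | iconv -f utf-8 -t latin1
}
```

A call such as "sed_latin1 's/.*/x/' sourcefile > sourcefile.new" writes the result to a new file. Do not redirect the output to the input file itself, because the shell truncates it before iconv reads it. Note also that the final iconv fails if the sed script inserts characters that do not exist in Latin-1.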
sed/special_characters/sed_behaving_strangely_in_utf-8_environment.1597662231.txt.gz · Last modified: 2020/08/17 11:03 by 192.168.1.1
