[syslog-ng] Handle multiple encoding from multiple sources

Fri Jun 19 07:12:43 CEST 2009

On Thu, 2009-06-18 at 17:23 +0200, David Perret wrote:
> Hello,
> 
> I'm trying to understand how syslog-ng handles charset encoding.
> 
> I've seen that :
> 
> - without the "encoding("charset")" keyword in the source it just writes 
> to the output what it gets from the input.
> - with the "encoding("charset")" keyword in the source it converts what 
> it gets from the input from "charset" to UTF-8.
> 
> What is the best way to handle logs in different charsets? Do I really 
> have to create a separate source for each charset used?

Currently yes, but read on.

> Is there any way to 
> say that I always want the input's charset to be converted to UTF-8 
> before writing my logs?

I don't see any magic solution that could deduce which encoding a given
string uses before converting it to UTF-8. So in order to do that
conversion I need to know the source charset.
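
To illustrate why the encoding cannot simply be guessed, here is a
minimal Python sketch (not syslog-ng code). The byte string and the
list of candidate charsets are arbitrary examples chosen for this
illustration; the point is only that the same bytes decode without
error under several charsets and yield different text.

```python
# Sketch: one byte string, several equally "valid" interpretations.
raw = b"caf\xe9"  # example payload; 0xE9 is 'é' in ISO-8859-1

# Arbitrary example charsets; each one maps 0xE9 to some character.
candidates = ["iso-8859-1", "koi8_r", "cp437"]

# None of these decodes raises, so nothing signals the "wrong" guess.
decoded = {name: raw.decode(name) for name in candidates}

# Treating the bytes as UTF-8 does fail, but that only proves the input
# is not UTF-8; it does not say which charset it actually is.
try:
    raw.decode("utf-8")
    is_utf8 = True
except UnicodeDecodeError:
    is_utf8 = False

for name, text in decoded.items():
    print(name, repr(text))
```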

You need separate sources for each charset because that option
completely changes the way the incoming byte stream is processed. For
example, the NL character can look quite different from ASCII #10 if you
use UCS-2 (the two-byte Unicode encoding that Windows uses, for
example), so the code that parses records out of the byte stream would
not recognize the record separator.
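
The record separator problem can be shown with a short Python sketch.
This is only an illustration, not how syslog-ng is implemented: Python's
"utf-16-le" codec stands in for UCS-2 (the two are byte-identical for
the characters used here), and split_records is a hypothetical helper.

```python
# Sketch: the same newline is a different byte sequence per encoding.
NL_ASCII = "\n".encode("ascii")      # one byte: 0x0A
NL_UCS2 = "\n".encode("utf-16-le")   # two bytes: 0x0A 0x00


def split_records(stream, separator):
    """Hypothetical record splitter: cut the stream at each separator."""
    return [record for record in stream.split(separator) if record]


# Two log lines as a little-endian two-byte-per-character stream.
stream = "first\nsecond\n".encode("utf-16-le")

# Splitting on the ASCII newline cuts in the middle of a two-byte
# character, leaving stray 0x00 bytes and misaligned records.
naive = split_records(stream, NL_ASCII)

# Splitting on the encoded newline keeps the records intact.
aware = [r.decode("utf-16-le") for r in split_records(stream, NL_UCS2)]

print(naive)
print(aware)
```

A splitter that knows the encoding finds the two records, while the
ASCII-based one produces fragments that no longer decode cleanly. That
is why the charset has to be fixed for the whole source rather than
chosen per message.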

Thus it is not possible to set or discover the encoding on a per-message
basis.

What you could do to avoid lots of source() statements is to convert the
messages to UTF-8 on the relays/end hosts; this way the central server
would only need to process UTF-8.
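
As a rough sketch of what "convert at the edge" means, assuming each
relay knows the charset of the hosts it serves: in syslog-ng itself this
would be done with the encoding() option on the relay's source, and the
Python below only illustrates the transformation. The host names, the
HOST_CHARSETS table, and to_utf8 are hypothetical.

```python
# Hypothetical table: the charset each sending host is known to use.
HOST_CHARSETS = {
    "legacy-app01": "iso-8859-1",
    "win-dc01": "utf-16-le",
}


def to_utf8(raw, host):
    """Re-encode a raw message to UTF-8 using the host's known charset.

    Hosts missing from the table are assumed to send UTF-8 already.
    Undecodable bytes are replaced rather than dropping the message.
    """
    charset = HOST_CHARSETS.get(host, "utf-8")
    return raw.decode(charset, errors="replace").encode("utf-8")


# The central server then sees UTF-8 regardless of the original charset.
latin1_msg = "caf\u00e9 started".encode("iso-8859-1")
ucs2_msg = "service up".encode("utf-16-le")

print(to_utf8(latin1_msg, "legacy-app01"))
print(to_utf8(ucs2_msg, "win-dc01"))
```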

-- 
Bazsi