Handle multiple encoding from multiple sources
Hello,

I'm trying to understand how syslog-ng handles charset encoding. I've seen that:

- without the encoding("charset") keyword in the source, it just writes to the output what it gets from the input.
- with the encoding("charset") keyword in the source, it converts what it gets from the input from "charset" to UTF-8.

What is the best way to handle logs in different charsets? Do I really have to create a different source for each charset used? Is there any way to say that I always want the input's charset to be converted to UTF-8 before writing my logs?

Thanks,
David
On Thu, 2009-06-18 at 17:23 +0200, David Perret wrote:
> Hello,
>
> I'm trying to understand how syslog-ng handles charset encoding.
> I've seen that:
> - without the encoding("charset") keyword in the source, it just writes to the output what it gets from the input.
> - with the encoding("charset") keyword in the source, it converts what it gets from the input from "charset" to UTF-8.
>
> What is the best way to handle logs in different charsets? Do I really have to create a different source for each charset used?

Currently yes, but read on.

> Is there any way to say that I always want the input's charset to be converted to UTF-8 before writing my logs?
I don't see any magic solution that could deduce which encoding a given string uses before converting it to UTF-8, so in order to do that conversion I need to know the source charset.

You need separate sources for each charset because that option completely changes the way the incoming byte stream is processed. For example, the NL character can be quite different from ASCII #10 if you use UCS-2 (the two-byte Unicode encoding that Windows uses, for example); in that case the code that parses records out of the byte stream would not recognize the record separator. Thus it is not possible to set or discover the encoding on a per-message basis.

What you could do to avoid lots of source() statements is to convert the messages to UTF-8 on the relays/end hosts; this way the central server would only need to process UTF-8.

-- Bazsi
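
The point about record separators can be illustrated outside syslog-ng with a small Python sketch. It is not syslog-ng code: the function name to_utf8_records and the sample text are made up for the example, and UTF-16-LE stands in for UCS-2. It shows that splitting a two-byte-encoded stream on the single byte 0x0A cuts characters in half, whereas decoding with the declared charset first (roughly what a relay would have to do before forwarding UTF-8) gives clean records.

```python
# Illustration only, not syslog-ng internals.
sample = "first\nsecond\n"

# The same text as it would arrive in a two-byte encoding
# (UTF-16-LE is used here as a stand-in for UCS-2).
ucs2_stream = sample.encode("utf-16-le")

# A splitter that assumes NL is the single byte 0x0A cuts through
# the middle of two-byte characters.
naive_chunks = ucs2_stream.split(b"\n")


def to_utf8_records(stream, source_encoding):
    """Decode with the declared charset, split on NL, re-encode as UTF-8.

    Hypothetical helper: the charset has to be supplied by the caller,
    because it cannot be deduced reliably from the bytes alone.
    """
    text = stream.decode(source_encoding)
    return [record.encode("utf-8") for record in text.split("\n") if record]


clean_records = to_utf8_records(ucs2_stream, "utf-16-le")
latin1_records = to_utf8_records("caf\u00e9\n".encode("iso-8859-1"), "iso-8859-1")
```

Here naive_chunks contains a chunk with an odd number of bytes, so it can no longer be decoded as two-byte text, while clean_records holds the two messages as UTF-8. The charset argument is the piece of information that a separate source() per charset provides.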
Participants (2): Balazs Scheidler, David Perret