Re: [syslog-ng] Store syslog occurrence frequency instead of adding all of them to the DB

19 Aug 2011

My little schema was simply an example, and would need to be merged
with the full logs table.

You do lose detail when de-duplicating non-sequential logs.  I
typically find this to be an acceptable trade-off for most
circumstances, though of course situations will vary.  I myself
do no de-duplication because I don't want to lose the forensic value.
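
For illustration, here is a minimal sketch of frequency-style de-duplication keyed on a CRC32 of the message text. The function name `dedupe_counts` and the sample messages are hypothetical and are not taken from the schema discussed in this thread.

```python
import zlib
from collections import Counter


def dedupe_counts(messages):
    """Return (sample_message, occurrence_count) pairs, keyed by CRC32 of the text."""
    counts = Counter()
    samples = {}
    for msg in messages:
        key = zlib.crc32(msg.encode("utf-8"))  # unsigned 32-bit integer
        counts[key] += 1
        samples.setdefault(key, msg)  # keep only the first message seen per key
    return [(samples[key], n) for key, n in counts.items()]


# Hypothetical sample data
logs = [
    "sshd: Failed password for root",
    "cron: job started",
    "sshd: Failed password for root",
]
result = dedupe_counts(logs)
```

Only one copy of each message and a count survive, so per-event detail such as individual timestamps is gone, which is the forensic loss mentioned above.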

Regarding CRC32 vs other hashing functions: 32 bits is certainly very
small and collisions are very possible.  However, in practice, I find
them incredibly rare (much rarer than one would think, given the
existence of far superior hashing algorithms).  The primary advantage
of CRC32 is its speed and small size (it fits in a 32-bit integer),
versus the comparatively CPU-intensive and space-consuming
MD5/SHA1 algorithms.  I will put it this way: I recommend using CRC32
because if you have a low enough log volume that you can spare CPU
cycles for calculating MD5/SHA1, then you do not have enough logs to
worry about collisions becoming an issue.
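
To put rough numbers on the size and collision trade-off, the sketch below uses the standard birthday-bound approximation. It assumes hash values are uniformly distributed, which CRC32 does not guarantee for real log data, so treat the output as an estimate only. The function name `collision_probability` is hypothetical.

```python
import hashlib
import math


def collision_probability(n_items, bits):
    """Approximate chance that at least two of n_items distinct inputs
    share the same `bits`-bit hash value (birthday bound, uniform hashing assumed)."""
    space = 2.0 ** bits
    return -math.expm1(-n_items * (n_items - 1) / (2.0 * space))


# Storage per stored hash value, in bytes
crc32_bytes = 4
md5_bytes = hashlib.md5(b"sshd").digest_size
sha1_bytes = hashlib.sha1(b"sshd").digest_size

p_small = collision_probability(1_000, 32)     # e.g. distinct program names
p_large = collision_probability(100_000, 32)   # e.g. distinct message texts
```

Under that assumption, 1,000 distinct values hashed to 32 bits give roughly a 0.01% chance of any collision, while 100,000 distinct values give roughly 69%. How rare collisions are in practice therefore depends heavily on how many distinct values are being hashed.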

In my production schema (in ELSA), I calculate CRC32 on the program
field and insert a program ID instead of the program name.  This saves
an enormous amount of space.  Queries then filter on
program_id=CRC32("program name") instead of program_name="program
name", or program_id=(select id from programs where
program_name="program name"), which is less desirable because you have
to maintain a separate lookup table.
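
A minimal sketch of the program_id approach follows. It uses an in-memory SQLite database as a stand-in for MySQL: MySQL has CRC32() built in, whereas here a function of the same name is registered by hand. The table layout and sample rows are hypothetical and are not the actual ELSA schema.

```python
import sqlite3
import zlib


def crc32_text(value):
    """CRC32 of a string as an unsigned 32-bit integer."""
    return zlib.crc32(value.encode("utf-8"))


conn = sqlite3.connect(":memory:")
# SQLite has no built-in CRC32(), so register one for use inside queries.
conn.create_function("CRC32", 1, crc32_text)
conn.execute("CREATE TABLE logs (program_id INTEGER NOT NULL, msg TEXT NOT NULL)")
conn.execute("CREATE INDEX logs_program_id ON logs (program_id)")

# Hypothetical sample data: store the hash of the program name, not the name.
rows = [
    ("sshd", "Failed password for root"),
    ("cron", "job started"),
    ("sshd", "Accepted publickey for admin"),
]
conn.executemany(
    "INSERT INTO logs (program_id, msg) VALUES (?, ?)",
    [(crc32_text(program), msg) for program, msg in rows],
)

# Query by hashing the program name; no lookup table is involved.
sshd_msgs = [
    row[0]
    for row in conn.execute(
        "SELECT msg FROM logs WHERE program_id = CRC32(?) ORDER BY msg",
        ("sshd",),
    )
]
```

The program name itself is never stored, so if two program names ever hashed to the same value, a query for one would also return rows for the other.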

If you are more concerned with 100% query accuracy at great cost to
both performance and storage, versus 99.99999999% accuracy with large
cost savings, then I would say you want the less collision-prone
hashing functions.

On Fri, Aug 19, 2011 at 12:00 PM,  <syslogng@feystorm.net> wrote:
...
Sent: Fri Aug 19 2011 10:57:57 GMT-0600 (MST)
From: Jakub Jankowski <shasta@toxcorp.com>
To: Syslog-ng users' and developers' mailing list
<syslog-ng@lists.balabit.hu>
Subject: Re: [syslog-ng] Store syslog occurrence frequency instead of adding
all of them to the DB
On 2011-08-19, syslogng@feystorm.net wrote:
Secondly, using a 32-bit checksum of the message text to determine uniqueness
is risky. It would be fairly easy to end up with 2 different messages that
have the same checksum. An MD5 checksum would be much better, but I don't
believe syslog-ng has a function to compute MD5 sums.

One can delegate this task to the database itself. MySQL has MD5() as well
as SHA1() built in.

It's been a while since I've had syslog-ng talk to a database directly, but
when I did, you couldn't use database functions when storing data. You just
gave syslog-ng the names of the fields, and then the macros to stick in
those fields, and syslog-ng went and assembled the query for you. Does
syslog-ng let you construct the query yourself now? If so, then yes, using
the database's hashing functions would work fine.

Martin Holste