Store syslog occurrence frequency instead of adding all of them to the DB
Hi, I am using Syslog-NG which stores the Syslog to a remote MySQL server for central logging. Now, I am running into a situation a lot of records inside the MySQL database is "almost" exactly the same, just except the timestamp is different. For example, I get "file system is full" every minute for a host and this message repeats itself periodically. In order to better manage the MySQL database, do you have any suggestions how I can reduce the number of overall records inside the DB, but add the occurrence frequency instead? For example, syslog message "file system is full" occurs every minutes on the system. There will be 24 x 60 "file system is full" records on my MySQL DB after 1 day. If no one fix it, I will get 10 x 24 x 60 "file system is full" records for 10 hosts having the same problem for 1 day. Can those records being "processed" some how and when I search the MySQL DB, I only see ONE record list the following only? Total occurrenceMessage content ===================== 14,400File system is full If I can do that, it will definitely reduce a log of space of my DB and speed up the query. It seems this is a database problem as well. Please let me know your idea at anytime. Regards, Marcos
Am 18.08.2011 19:52, schrieb Marcos Tang:
Can those records being "processed" some how and when I search the MySQL DB, I only see ONE record list the following only?
Total occurrenceMessage content ===================== 14,400File system is full
can´t you use something like INSERT INTO tables VALUES (date,logstring) ON duplicate KEY UPDATE date; ? florian
I'd add on to this by using a crc function to hash the message and store that in a column to make the unique check very fast: CREATE TABLE mylogs ( id BIGINT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT, timestamp TIMESTAMP NOT NULL, crc INT UNSIGNED NOT NULL, count INT UNSIGNED NOT NULL DEFAULT 0, msg TEXT, UNIQUE KEY (crc) ); INSERT INTO mylogs (timestamp, crc, msg) values($timestamp, CRC32($msg), $msg) ON DUPLICATE KEY UPDATE count=count+1, timestamp=$timestamp; On Thu, Aug 18, 2011 at 1:33 PM, system@ra-schaal.de <system@ra-schaal.de> wrote:
Am 18.08.2011 19:52, schrieb Marcos Tang:
Can those records being "processed" some how and when I search the MySQL DB, I only see ONE record list the following only?
Total occurrenceMessage content ===================== 14,400File system is full
can´t you use something like
INSERT INTO tables VALUES (date,logstring) ON duplicate KEY UPDATE date;
?
florian ______________________________________________________________________________ Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng FAQ: http://www.balabit.com/wiki/syslog-ng-faq
Hi Martin & Florian, I am not good at SQL and I have the following table right now. Do you mean my current table also needs to merge with your table to form one single table? What is the corresponding syslog-ng.conf file which I should modify to fit for your table? Once again, thanks for your suggestions for me. mysql> desc logs; +----------+------------------+------+-----+---------+----------------+ | Field | Type | Null | Key | Default | Extra | +----------+------------------+------+-----+---------+----------------+ | host | varchar(32) | YES | MUL | NULL | | | facility | varchar(10) | YES | MUL | NULL | | | priority | varchar(10) | YES | MUL | NULL | | | level | varchar(10) | YES | | NULL | | | tag | varchar(10) | YES | | NULL | | | date | date | YES | MUL | NULL | | | time | time | YES | MUL | NULL | | | program | varchar(15) | YES | MUL | NULL | | | msg | text | YES | | NULL | | | seq | int(10) unsigned | NO | PRI | NULL | auto_increment | +----------+------------------+------+-----+---------+----------------+ 10 rows in set (0.00 sec) Regards, Marcos --- On Fri, 8/19/11, Martin Holste <mcholste@gmail.com> wrote: From: Martin Holste <mcholste@gmail.com> Subject: Re: [syslog-ng] Store syslog occurrence frequency instead of adding all of them to the DB To: "Syslog-ng users' and developers' mailing list" <syslog-ng@lists.balabit.hu> Cc: "Marcos Tang" <marcostang2002@yahoo.com> Date: Friday, August 19, 2011, 4:10 AM I'd add on to this by using a crc function to hash the message and store that in a column to make the unique check very fast: CREATE TABLE mylogs ( id BIGINT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT, timestamp TIMESTAMP NOT NULL, crc INT UNSIGNED NOT NULL, count INT UNSIGNED NOT NULL DEFAULT 0, msg TEXT, UNIQUE KEY (crc) ); INSERT INTO mylogs (timestamp, crc, msg) values($timestamp, CRC32($msg), $msg) ON DUPLICATE KEY UPDATE count=count+1, timestamp=$timestamp; On Thu, Aug 18, 2011 at 1:33 PM, system@ra-schaal.de <system@ra-schaal.de> wrote:
Am 18.08.2011 19:52, schrieb Marcos Tang:
Can those records being "processed" some how and when I search the MySQL DB, I only see ONE record list the following only?
Total occurrenceMessage content ===================== 14,400File system is full
can´t you use something like
INSERT INTO tables VALUES (date,logstring) ON duplicate KEY UPDATE date;
?
florian ______________________________________________________________________________ Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng FAQ: http://www.balabit.com/wiki/syslog-ng-faq
The idea that was mentioned is not without its flaws. Firstly say you receive a message on the 1st, then you receive the same message on the 10th, you would lose all record that the message was received on the 1st as date on it would be modified to the most recent occurance of the message. The ideal example would be to only merge multiple occurrences of the message when they appear sequentially with no other messages between them. Secondly using a 32-bit checksum of the message text to determine uniqueness is risky. It would be farily easy to end up with 2 different messages that have the same checksum. A md5 checksum would be much better, but I dont believe syslog-ng has a function to compute md5 sums. Lastly the table key needs to be modified so the host name is part of it. As it is with the way martin mentioned, if 2 hosts get the same message, the latter one will update the timestamp of the first host's message making it look like the second host never had such a message at all, and that the first host got it twice. The idea holds merit, and if these issues are fine with you, then it should work. But syslog-ng already merges sequential duplicate messages when logging to files, so it'd probably be fairly trivial to port this functionality over to the sql driver. I'd make a feature request on the bug tracker. -Patrick Sent: Fri Aug 19 2011 02:11:45 GMT-0600 (MST) From: Marcos Tang <marcostang2002@yahoo.com> To: Syslog-ng users' and developers' mailing list <syslog-ng@lists.balabit.hu>, Martin Holste <mcholste@gmail.com> Subject: Re: [syslog-ng] Store syslog occurrence frequency instead of adding all of them to the DB
Hi Martin & Florian, I am not good at SQL and I have the following table right now. Do you mean my current table also needs to merge with your table to form one single table? What is the corresponding syslog-ng.conf file which I should modify to fit for your table? Once again, thanks for your suggestions for me. mysql> desc logs; +----------+------------------+------+-----+---------+----------------+ | Field | Type | Null | Key | Default | Extra | +----------+------------------+------+-----+---------+----------------+ | host | varchar(32) | YES | MUL | NULL | | | facility | varchar(10) | YES | MUL | NULL | | | priority | varchar(10) | YES | MUL | NULL | | | level | varchar(10) | YES | | NULL | | | tag | varchar(10) | YES | | NULL | | | date | date | YES | MUL | NULL | | | time | time | YES | MUL | NULL | | | program | varchar(15) | YES | MUL | NULL | | | msg | text | YES | | NULL | | | seq | int(10) unsigned | NO | PRI | NULL | auto_increment | +----------+------------------+------+-----+---------+----------------+ 10 rows in set (0.00 sec)
Regards, Marcos
--- On *Fri, 8/19/11, Martin Holste /<mcholste@gmail.com>/* wrote:
From: Martin Holste <mcholste@gmail.com> Subject: Re: [syslog-ng] Store syslog occurrence frequency instead of adding all of them to the DB To: "Syslog-ng users' and developers' mailing list" <syslog-ng@lists.balabit.hu> Cc: "Marcos Tang" <marcostang2002@yahoo.com> Date: Friday, August 19, 2011, 4:10 AM
I'd add on to this by using a crc function to hash the message and store that in a column to make the unique check very fast: CREATE TABLE mylogs ( id BIGINT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT, timestamp TIMESTAMP NOT NULL, crc INT UNSIGNED NOT NULL, count INT UNSIGNED NOT NULL DEFAULT 0, msg TEXT, UNIQUE KEY (crc) ); INSERT INTO mylogs (timestamp, crc, msg) values($timestamp, CRC32($msg), $msg) ON DUPLICATE KEY UPDATE count=count+1, timestamp=$timestamp;
On Thu, Aug 18, 2011 at 1:33 PM, system@ra-schaal.de <http://us.mc1121.mail.yahoo.com/mc/compose?to=system@ra-schaal.de> <system@ra-schaal.de <http://us.mc1121.mail.yahoo.com/mc/compose?to=system@ra-schaal.de>> wrote: > Am 18.08.2011 19:52, schrieb Marcos Tang: > >> Can those records being "processed" some how and when I search the MySQL >> DB, I only see ONE record list the following only? >> >> Total occurrenceMessage content >> ===================== >> 14,400File system is full >> > > can´t you use something like > > INSERT INTO tables VALUES (date,logstring) ON duplicate KEY UPDATE date; > > ? > > florian > ______________________________________________________________________________ > Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng > Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng > FAQ: http://www.balabit.com/wiki/syslog-ng-faq > >
______________________________________________________________________________ Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng FAQ: http://www.balabit.com/wiki/syslog-ng-faq
On 2011-08-19, syslogng@feystorm.net wrote:
The idea that was mentioned is not without its flaws. [...] Secondly using a 32-bit checksum of the message text to determine uniqueness is risky. It would be farily easy to end up with 2 different messages that have the same checksum. A md5 checksum would be much better, but I dont believe syslog-ng has a function to compute md5 sums.
One can delegate this task to the database itself. MySQL has MD5() as well as SHA1() built in. -- Jakub Jankowski|shasta@toxcorp.com|http://toxcorp.com/ GPG: FCBF F03D 9ADB B768 8B92 BB52 0341 9037 A875 942D
Sent: Fri Aug 19 2011 10:57:57 GMT-0600 (MST) From: Jakub Jankowski <shasta@toxcorp.com> To: Syslog-ng users' and developers' mailing list <syslog-ng@lists.balabit.hu> Subject: Re: [syslog-ng] Store syslog occurrence frequency instead of adding all of them to the DB
On 2011-08-19, syslogng@feystorm.net wrote:
Secondly using a 32-bit checksum of the message text to determine uniqueness is risky. It would be farily easy to end up with 2 different messages that have the same checksum. A md5 checksum would be much better, but I dont believe syslog-ng has a function to compute md5 sums. One can delegate this task to the database itself. MySQL has MD5() as well as SHA1() built in. Its been a while since I've had syslog-ng talk to a database directly, but when I did, you couldnt use database functions when storing data. You just gave syslog-ng the names of the fields, and then the macros to stick in those fields and syslog-ng went and assembled the query for you. Does syslog-ng let you construct the query yourself now? If so, then yes, using the database's hashing functions would work fine.
My little schema was simply an example, and would need to be merged with the full logs table. You surely lose detail when doing de-duplication on non-sequential logs. However, I typically find this to be an acceptable trade-off for most circumstances, but of course situations will vary. I myself do no de-duplication because I don't want to lose the forensic value. Regarding CRC32 vs other hashing functions: 32 bits is certainly very small and collisions are very possible. However, in practice, I find them incredibly rare (much rarer than one would think, given the existence of far superior hashing algorithms). The primary advantage in CRC32 is its incredible speed and small size (fits in an integer), versus the extremely (comparatively) CPU-intensive and space-consuming MD5/SHA1 algorithms. I will put it this way: I recommend using CRC32 because if you have a low enough log volume that you can spare CPU cycles for calculating MD5/SHA1, then you do not have enough logs to worry about collisions becoming an issue. In my production schema (in ELSA), I calculate CRC32 on the program field and insert a program ID instead of the program name. This saves an enormous amount of space. Queries then query for program_id=CRC32("program name") instead of program_name="program name" or even program_id=(select id from programs where program_name="program name") which is less desirable because you have to maintain a separate lookup table. If you are more concerned with 100% query accuracy at great cost to both performance and storage versus %99.99999999 accuracy but large cost savings, then I would say you want the less collision-prone hashing functions. On Fri, Aug 19, 2011 at 12:00 PM, <syslogng@feystorm.net> wrote:
Sent: Fri Aug 19 2011 10:57:57 GMT-0600 (MST) From: Jakub Jankowski <shasta@toxcorp.com> To: Syslog-ng users' and developers' mailing list <syslog-ng@lists.balabit.hu> Subject: Re: [syslog-ng] Store syslog occurrence frequency instead of adding all of them to the DB
On 2011-08-19, syslogng@feystorm.net wrote:
Secondly using a 32-bit checksum of the message text to determine uniqueness is risky. It would be farily easy to end up with 2 different messages that have the same checksum. A md5 checksum would be much better, but I dont believe syslog-ng has a function to compute md5 sums.
One can delegate this task to the database itself. MySQL has MD5() as well as SHA1() built in.
Its been a while since I've had syslog-ng talk to a database directly, but when I did, you couldnt use database functions when storing data. You just gave syslog-ng the names of the fields, and then the macros to stick in those fields and syslog-ng went and assembled the query for you. Does syslog-ng let you construct the query yourself now? If so, then yes, using the database's hashing functions would work fine.
______________________________________________________________________________ Member info: https://lists.balabit.hu/mailman/listinfo/syslog-ng Documentation: http://www.balabit.com/support/documentation/?product=syslog-ng FAQ: http://www.balabit.com/wiki/syslog-ng-faq
participants (5)
-
Jakub Jankowski
-
Marcos Tang
-
Martin Holste
-
syslogng@feystorm.net
-
system@ra-schaal.de