Anyone got a well-performing search interface for syslog data?
We're generating around 4Gb syslog data per week, and I'm looking for a good search interface into it. I can cut my way through it with egrep/etc, but waiting 10-15min for a result really isn't going to break any speed records. Especially when I then need to re-run it with another "grep" on the end of it! ;-)

I have tried injecting it into a MySQL database using some schemas I've found on the Internet - but the performance didn't seem much better to me - and you lost the "free-text" attributes of grep (or more specifically, the sorts of searches I find I want to do aren't SQL-friendly).

Has anyone come up with a good speedy way of coping with Gbytes of syslog data? Or is it time to invest in some Appliance or the like?

--
Cheers

Jason Haar
Information Security Manager, Trimble Navigation Ltd.
Phone: +64 3 9635 377 Fax: +64 3 9635 417
PGP Fingerprint: 7A2E 0407 C9A6 CAF6 2B9F 8422 C063 5EBB FE1D 66D1
Jason,

I've been looking into php-syslog-ng (http://freshmeat.net/projects/phpsyslogng/), which, as the name might suggest, is a PHP/MySQL frontend for syslog-ng. For large amounts of data you can use the "logrotate" function that it provides to make a new database every day/week/whatever. This means that as long as you know the date of what you are looking for, the search stays small. If you're not sure of the date you can still search across all the databases, but be prepared to wait! The databases are indexed and optimized, which makes them a lot faster than grep.

Another alternative is to leave the data in text files but index them with something like "beagle" (http://beaglewiki.org/Main_Page) or "penetrator" (http://freshmeat.net/projects/penetrator/). You then just need to search the index, which will tell you exactly where to look.

Regards,
Jim

Jason Haar wrote:
We're generating around 4Gb syslog data per week, and I'm looking for a good search interface into it.
I can cut my way through it with egrep/etc, but waiting 10-15min for a result really isn't going to break any speed records. Especially when I then need to re-run it with another "grep" on the end of it! ;-)
I have tried injecting it into a MySQL database using some schemas I've found on the Internet - but the performance didn't seem much better to me - and you lost the "free-text" attributes of grep (or more specifically, the sorts of searches I find I want to do aren't SQL-friendly).
Has anyone come up with a good speedy way of coping with Gbytes of syslog data? Or is it time to invest in some Appliance or the like?
I rotate logs on my central server (for about 60 hosts) nightly into a simple Linux filesystem and compress them with gzip -9. I have about 5 years online and searchable. My web interface for searching is written in Perl and lets you search by day/month/year/time. It is quite fast. In fact, the main slowdown is the browser when using HTML tables for layout instead of straight text when there's a lot of data to display. I'll look into getting permission to post my viewer on the web. It's a few years old and a little crufty but could be cleaned up easily enough.

I just ran a test searching for cfengine security /cfengine.*SECURITY/ events over the last 30 days/files. It took about 1.5 minutes to run. The script only got about 20% CPU (dual Xeon 2800), so I'm betting most of the time was I/O, even though this was on some fast EVA disk.

The I/O advantage of staying with flat files is that it's a linear read through the file, which almost every OS does very fast. Technically, MySQL tables could be linear reads, but they will never match the raw speed of Perl doing regular expressions linearly through a file, especially when the REs are well written. By the way, compression may actually help I/O throughput if you have a fast CPU and slow disk, since the reads from disk will be smaller. Having extra system memory for buffer cache and readahead helps, too.

-Al Tobey
Senior Unix Engineer
Priority Health

On 9/6/05, Jason Haar <Jason.Haar@trimble.co.nz> wrote:
We're generating around 4Gb syslog data per week, and I'm looking for a good search interface into it.
I can cut my way through it with egrep/etc, but waiting 10-15min for a result really isn't going to break any speed records. Especially when I then need to re-run it with another "grep" on the end of it! ;-)
I have tried injecting it into a MySQL database using some schemas I've found on the Internet - but the performance didn't seem much better to me - and you lost the "free-text" attributes of grep (or more specifically, the sorts of searches I find I want to do aren't SQL-friendly).
Has anyone come up with a good speedy way of coping with Gbytes of syslog data? Or is it time to invest in some Appliance or the like?
-- Cheers
Jason Haar Information Security Manager, Trimble Navigation Ltd. Phone: +64 3 9635 377 Fax: +64 3 9635 417 PGP Fingerprint: 7A2E 0407 C9A6 CAF6 2B9F 8422 C063 5EBB FE1D 66D1
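For anyone who wants to try the flat-file approach Al describes above without his Perl viewer, here is a minimal shell sketch of the same idea. The archive root, file layout and search pattern are assumptions made for illustration, not his actual setup.

# Assumed layout: logs rotated nightly into /var/log/archive/YYYY/MM/DD/messages.gz
# and compressed with gzip -9.  zgrep decompresses on the fly, so each file is
# still read in one linear pass.
zgrep -h -E 'cfengine.*SECURITY' /var/log/archive/2005/0[89]/*/messages.gz

Narrowing the glob to the days you actually care about keeps the I/O, and therefore the runtime, roughly proportional to the search window.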
On 9/5/05, Jason Haar <Jason.Haar@trimble.co.nz> wrote:
We're generating around 4Gb syslog data per week, and I'm looking for a good search interface into it.
We generate around 1GB syslog data per hour (peak), and still haven't found a good interface to search archived data. I am seriously considering "PHPsyslogNG", but we are concerned about the security risks of installing mysql and php on our otherwise very locked down OpenBSD loggers. I am primarily worried about the integrity of the original log archives, so I may end up deploying a new server with either PHPsyslogNG or MARS, and feeding a copy of the log stream to that new host.
I can cut my way through it with egrep/etc, but waiting 10-15min for a
I've found that the fastest way to search large text files from the command line is to start with an 'fgrep' to get a broad match, then use egrep to look for specific information.
result really isn't going to break any speed records. Especially when I then need to re-run it with another "grep" on the end of it! ;-)
Another trick is to do this: fgrep -h "192.168.1." * |tee /tmp/temp192-168-1.log |egrep "(ftp|http|deny)" If you need to "re-run the command with another grep on the end", you can use /tmp/temp192-168-1.log as the source, instead of the complete logs. Just make sure /tmp has room to spare :)
Has anyone come up with a good speedy way of coping with Gbytes of syslog data?
I get "acceptable" search times when looking at short time ranges (usually just a couple of hours at a time) by coding common queries as Perl scripts. This also makes it easy to generate histograms and summary reports, in text, HTML, or both.
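Kevin's Perl scripts themselves aren't posted in this thread; as a stand-in, here is a rough shell sketch of the same "canned common query" idea, producing an hourly histogram. The log path, message format and address are made up for the example.

# Hypothetical example: hourly histogram of firewall deny events for one source
# address over a single day's log.  Syslog lines start with "Mon DD HH:MM:SS",
# so the first two characters of field 3 are the hour.
fgrep 'Deny' /var/log/pix/2005-09-06.log | fgrep '10.1.2.3' \
  | awk '{ print $1, $2, substr($3, 1, 2) ":00" }' | sort | uniq -c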
Or is it time to invest in some Appliance or the like?
We've considered appliances from companies such as LogLogic, and at one time had a budget to purchase a syslog appliance. As it turns out, most "appliances" are LAMP with a nice GUI, and usually either have limitations on the types and formats of the log source data they will accept, or charge a license fee for modules to process different event sources, or even a fee per source host!

Around 2001 NFR offered their "SLR" syslog appliance; they no longer sell it, but SLR may be available to existing customers. Another appliance option to consider is Cisco's MARS product (formerly Protego), which includes its own Oracle backend.

Kevin Kadow
I use a MySQL backend for all of my syslog collection, broken out into daily tables and tagged by application and rule, for fast indexing.

For those interested, I'm working on a release of my syslog analyzer, Hera, hopefully ready to go within the next few weeks. More details here: http://www.billn.net/?page=project_hera

- billn

On Tue, 6 Sep 2005, Kevin wrote:
On 9/5/05, Jason Haar <Jason.Haar@trimble.co.nz> wrote:
We're generating around 4Gb syslog data per week, and I'm looking for a good search interface into it.
We generate around 1GB syslog data per hour (peak), and still haven't found a good interface to search archived data.
I am seriously considering "PHPsyslogNG", but we are concerned about the security risks of installing mysql and php on our otherwise very locked down OpenBSD loggers.
I am primarily worried about the integrity of the original log archives, so I may end up deploying a new server with either PHPsyslogNG or MARS, and feeding a copy of the log stream to that new host.
I can cut my way through it with egrep/etc, but waiting 10-15min for a
I've found that the fastest way to search large text files from the command line is to start with an 'fgrep' to get a broad match, then use egrep to look for specific information.
result really isn't going to break any speed records. Especially when I then need to re-run it with another "grep" on the end of it! ;-)
Another trick is to do this:
fgrep -h "192.168.1." * |tee /tmp/temp192-168-1.log |egrep "(ftp|http|deny)"
If you need to "re-run the command with another grep on the end", you can use /tmp/temp192-168-1.log as the source, instead of the complete logs. Just make sure /tmp has room to spare :)
Has anyone come up with a good speedy way of coping with Gbytes of syslog data?
I get "acceptable" search times when looking at short time ranges (usually just a couple of hours at a time) by coding common queries as Perl scripts. This also makes it easy to generate histograms and summary reports, in text, HTML, or both.
Or is it time to invest in some Appliance or the like?
We've considered appliances from companies such as LogLogic, and at one time had a budget to purchase a syslog appliance. As it turns out, most "appliances" are LAMP with a nice GUI, and usually either have limitations on the types and formats of the log source data they will accept, or charge a license fee for modules to process different event sources, or even a fee per source host!
Around 2001 NFR offered their "SLR" syslog appliance; they no longer sell it, but SLR may be available to existing customers. Another appliance option to consider is Cisco's MARS product (formerly Protego), which includes its own Oracle backend.
Kevin Kadow
I don't want to spoil the party ...
We generate around 1GB syslog data per hour (peak), and still haven't found a good interface to search archived data.
We basically use grep, which we patched a bit to speed up the search ;).
I am seriously considering "PHPsyslogNG", but we are concerned about the security risks of installing mysql and php on our otherwise very locked down OpenBSD loggers.
MySQL shouldn't be a problem, and for PHP you can google for PHP hardening projects.

Over the various years we have been doing centralised log file analysis, we've come to realise that DBs just don't cut it, as strange as this may sound now. We make heavy use of macro expansion and build up a hierarchy of logfiles through simple filesystem directories. We simply had problems extracting important information from GBytes of log entries in a DB, either Postgres or MySQL. The current key to success is to write appropriate filters that dissect incoming log data in an intelligent way and store it in a directory structure using macro expansion.
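As a concrete (and entirely hypothetical) illustration of why the directory hierarchy helps: if the filter/destination layer expands macros such as host, year, month and day into the file path, a search over one device and one week only has to touch a handful of small files. The root path, host name and file name below are assumptions, not Roberto's actual layout.

# Assumed layout produced by macro expansion in the log destination:
#   /var/log/hosts/<host>/<year>/<month>/<day>/<facility>.log
# Searching one firewall for the first week of September reads only those files:
grep -h 'Deny' /var/log/hosts/fw01/2005/09/0[1-7]/local4.log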
I am primarily worried about the integrity of the original log archives,
Then a DBMS is the way to go.
so I may end up deploying a new server with either PHPsyslogNG or MARS, and feeding a copy of the log stream to that new host.
What kind of information exactly do you need to extract? That is perhaps the question most people need to ask themselves when deploying syslog servers. Do you simply want to browse through some logfiles to cherry-pick suspicious lines, or are you after correlated data for security information and event management?
Another trick is to do this:
fgrep -h "192.168.1." * |tee /tmp/temp192-168-1.log |egrep "(ftp|http|deny)"
If you need to "re-run the command with another grep on the end", you can use /tmp/temp192-168-1.log as the source, instead of the complete logs. Just make sure /tmp has room to spare :)
As a short side note: grep recently changed maintainers, and one development that seems strange to me was a change to the way fgrep and egrep are handled. Basically egrep is grep -E and fgrep is grep -F, and egrep resp. fgrep are normally symlinks to /bin/grep. This will change in future: those symlinks will become real files, which means you'll lose a little time if you use egrep/fgrep instead of grep -E or grep -F ;). With your pipeline I reckon that hardly matters, though. We had to patch grep heavily to reduce our '|' pipelines.
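For reference, the pipeline quoted earlier in the thread can be written with the spellings Roberto is recommending; it is functionally identical, it just avoids the egrep/fgrep wrappers:

# grep -F = fixed strings (what fgrep does), grep -E = extended REs (what egrep does)
grep -Fh "192.168.1." * | tee /tmp/temp192-168-1.log | grep -E "(ftp|http|deny)"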
We've considered appliances from companies such as LogLogic, and at one time had a budget to purchase a syslog appliance. As it turns out, most "appliances" are LAMP with a nice GUI, and usually either have limitations on the types and formats of the log source data they will accept, or charge a license fee for modules to process different event sources, or even a fee per source host!
Interesting...ly strange business model.
Around 2001 NFR offered their "SLR" syslog appliance; they no longer sell it, but SLR may be available to existing customers. Another appliance option to consider is Cisco's MARS product (formerly Protego), which includes its own Oracle backend.
Thanks for this input. Regards, Roberto Nibali, ratz -- echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc
I really don't recommend php-syslog-ng; I have been using it for almost a month now and it has been extremely slow. I would be interested to see these Perl scripts that Al Tobey talked about.

What I have done is set up SEC as a monitoring system, so I just receive notifications about the information I care about. Until I can come up with something quicker, we are still using php-syslog-ng to let management and controllers look up information from the logs.

- Ken

Jason Haar wrote:
We're generating around 4Gb syslog data per week, and I'm looking for a good search interface into it.
I can cut my way through it with egrep/etc, but waiting 10-15min for a result really isn't going to break any speed records. Especially when I then need to re-run it with another "grep" on the end of it! ;-)
I have tried injecting it into a MySQL database using some schemas I've found on the Internet - but the performance didn't seem much better to me - and you lost the "free-text" attributes of grep (or more specifically, the sorts of searches I find I want to do aren't SQL-friendly).
Has anyone come up with a good speedy way of coping with Gbytes of syslog data? Or is it time to invest in some Appliance or the like?
I just want to thank everyone for their responses. Very interesting stuff!

I think I can paraphrase it as: SQL backends don't give much advantage with large data sets due to the lack of relationships within syslog data, and the "fastest" solutions are going to be those that basically have custom-written "hot searches" pre-defined, so that the appropriate indexes/extra files are already created to speed things up. The comments about gzipping the files to speed up reads were interesting as well...

It's certainly an interesting problem. I want to do things like:

1. IDS event says IP 1.2.3.4 just did something bad against 3.4.5.6
2. I want to search the logs for the 7 days before this event for any other activity from IP address 1.2.3.4 (might be email, PIX ACL logs, etc) or from 3.4.5.6

or

1. User claims email never reached recipient
2. Search for the user's email address
3. Get a report of all SMTP connection attempts, delivery attempts, AV and antispam/RBL records associated with the path of the message through 'n' different systems

Those are all doable by hand - but very slowly, and basically you need someone who knows what they are doing. Being able to put that behind a Web interface and make it a few clicks would be wonderful.

--
Cheers

Jason Haar
Information Security Manager, Trimble Navigation Ltd.
Phone: +64 3 9635 377 Fax: +64 3 9635 417
PGP Fingerprint: 7A2E 0407 C9A6 CAF6 2B9F 8422 C063 5EBB FE1D 66D1
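Neither workflow exists as a ready-made tool in this thread, but here is a rough shell sketch of use case 1 against a date-organised flat-file archive. The archive layout, file names and use of GNU date are assumptions, not anything the posters actually run.

# Everything logged about 1.2.3.4 in the 7 days before an IDS event on 2005-09-06.
# Assumes one gzipped file per day under /var/log/archive/YYYY/MM/DD/.
IP='1.2.3.4'
EVENT_DATE='2005-09-06'
for d in $(seq 1 7); do
    date -d "$EVENT_DATE -$d day" +/var/log/archive/%Y/%m/%d/messages.gz
done | xargs zgrep -h -F "$IP"

A second grep on the end (or on a tee'd copy, as Kevin suggested earlier) then narrows the hits to a particular service or to the second address.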
php-syslog-ng might be what you are looking for if you want a simple interface for people to use for searching. I'd recommend using this site: http://www.phpwizardry.com/php-syslog-ng.php

Claus has re-written the project in his own release; it fixes many issues that have been brought up and includes some useful scripts as well.

Jason Haar wrote:
I just want to thank everyone for their responses. Very interesting stuff!
I think I can paraphrase that SQL-backends don't give much advantage with large data sets due to the lack of relationships within syslog data, and the "fastest" solutions are going to be those that basically have custom-written "hot searches" pre-defined so that the appropriate indexes/extra files are already created to speed things up.
The comments about gzipping the files to speed up reads were interesting as well...
It's certainly an interesting problem. I want to do things like:
1. IDS event that IP 1.2.3.4 just did something bad against 3.4.5.6 2. I want to search logs for 7 days before this event for any other activity from IP address 1.2.3.4 (might be email, PIX ACL logs, etc) or from 3.4.5.6
or
1. User claims email never reached recipient 2. search for users email address 3. get report of all SMTP connection attempts, delivery attempts, AV and antispam/RBL records associated with path of message through 'n' different systems
Those are all doable by hand - but very slowly, and basically you need someone who knows what they are doing. Being able to put that behind a Web interface and make it a few clicks would be wonderful.
On Thu, 8 Sep 2005, Jason Haar wrote:
I just want to thank everyone for their responses. Very interesting stuff!
I think I can paraphrase that SQL-backends don't give much advantage with large data sets due to the lack of relationships within syslog data, and the "fastest" solutions are going to be those that basically have custom-written "hot searches" pre-defined so that the appropriate indexes/extra files are already created to speed things up.
You're correct in that syslog, by itself, doesn't offer any relationships. This is what log analyzers are for.

mysql> select syslogRule.appSet, count(*) from syslog left join syslogRule on (syslog.syslogRule = syslogRule.id) group by appSet;
+---------------------+----------+
| appSet              | count(*) |
+---------------------+----------+
| NULL                |      235 |
| Alteon              |     2316 |
| Cisco IOS           |     1552 |
| Cron                |        6 |
| Linux Kernel        |   214689 |
| Linux PAM           |    13465 |
| logrotate           |        6 |
| named               |   157584 |
| PIX Firewall        |  3868906 |
| proftpd             |      112 |
| Snare Syslog Daemon |    91115 |
| Snort               |     7559 |
| sshd                |      103 |
| syslog-ng           |        7 |
| tacacs              |       95 |
+---------------------+----------+

The top 'NULL' set are entries I don't have rules for.

mysql> select eventDefinition.name, count(*) from syslog left join syslogRule on (syslog.syslogRule = syslogRule.id) left join eventDefinition on (eventDefinition.id = syslogRule.eventId) group by eventDefinition.name;
+---------------------------------------+----------+
| name                                  | count(*) |
+---------------------------------------+----------+
| NULL                                  |  2933114 |
| ACL Violation                         |     1033 |
| Attack Detected                       |     4599 |
| Configuration Change                  |  1192107 |
| Device Shutdown                       |   214805 |
| Failed login attempt                  |      384 |
| Interface State Change                |      332 |
| Load balanced device failure          |     1152 |
| Load balanced device restored         |     1143 |
| Promiscuous Network Interface         |        2 |
| Software reported an error            |        9 |
| Unexpected software termination       |       72 |
| Use of super-user privileges detected |       36 |
| User Login                            |    13514 |
| User Logout                           |     2878 |
| VLAN State Change                     |        2 |
+---------------------------------------+----------+

There are very few (free) packages that offer a panacea for syslog management. The problem with a lot of packages is that they simply aren't flexible enough to let you do what you want to do, and you still wind up modifying them, or worse, scrapping them for that reason.

I need to go find the 11,000 users who haven't logged out now. ;)

- billn
participants (7)
- Al Tobey
- Bill Nash
- Jason Haar
- Jim Leitch
- Ken Garland
- Kevin
- Roberto Nibali