The plan, which is mostly through the proof-of-concept phase, is to pipe the pre-parsed and pre-formatted logs from syslog-ng to a Perl script which inserts the log lines into a database.  On the surface, this sounds like a job for the built-in SQL destination, but the Perl script also handles inserts into special tables which are indexed for the individual custom macros that db-parser creates.  The templates are stored in the database as well, to keep track of how many and what kind of fields each log type (or class, in my system) has and to arrange them for easy retrieval.  This all sounds fairly complicated (and maybe it is overcomplicated), but the goal is to not need a separate table for each type of log class I want to collect.  <br>

<br>For instance, there are many different types of firewall logs I want to collect, and I want to index the source IP, destination IP, etc. fields from each message in real time.  I accomplish this by having four table templates in the schema: a main table with the log line metadata like timestamp, sender, etc., one which just contains the text of the message, one for integer fields (like IP and port), and one for strings (like interface).  So, if a log line comes in with 8 fields, 4 integers and 4 strings, then there is one insert into the main table, one insert into the message table, 4 inserts into the ints table, and 4 inserts into the strings table.  The tables themselves are split into a database per day and a table per hour per class.  That means that when a client queries a particular log class, only the tables for that class and the hours in question have to be searched.  That query pruning can make a huge difference.<br>
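<br>As a rough sketch of the pruning idea (in Python rather than the Perl the real system uses), the frontend only needs to work out which per-hour tables overlap the requested time range.  The naming scheme here (database "syslog_YYYYMMDD", table "CLASS_HH") and the function name are assumptions made up for illustration, not the actual schema:<br>
<pre>
```python
from datetime import datetime, timedelta


def tables_for_query(log_class, start, end):
    """Return (database, table) pairs covering start..end for one log class.

    Hypothetical naming: database "syslog_YYYYMMDD", table "CLASS_HH".
    """
    # Round the start down to the top of its hour, then walk hour by hour.
    hour = start.replace(minute=0, second=0, microsecond=0)
    pairs = []
    while end >= hour:
        pairs.append((hour.strftime("syslog_%Y%m%d"), f"{log_class}_{hour:%H}"))
        hour += timedelta(hours=1)
    return pairs


if __name__ == "__main__":
    # A query spanning midnight touches three hourly tables in two databases.
    start = datetime(2009, 5, 31, 22, 30)
    end = datetime(2009, 6, 1, 0, 15)
    for database, table in tables_for_query("firewall", start, end):
        print(f"{database}.{table}")
```
</pre>
Every other class and every other hour is simply never opened, which is where the savings come from.<br>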

<br>Perl has to arrange these transactions, and right now I&#39;m finding that by far the most efficient way of doing this is to write to tab-separated files on the hard disk (or RAM disk if necessary) and run &quot;LOAD DATA INFILE&quot; commands in MySQL from a second, concurrent Perl script.  This is incredibly efficient on the DB end, and it has the additional benefit of absorbing bursts from syslog-ng that the database would normally not be able to handle.  The only downside is a delay of a few seconds between when the log is received and when it is written to the database and available for query, but any alerting would have already taken place in the Perl script.  I implemented this extra step recently because we were experiencing log loss due to a database insert bottleneck.  The test system is an older 32-bit, 4-way Dell server, and it was attempting to handle about 15 Mbps of syslog from various firewall sources.  I think it was something in the 2-3k msg/sec range.  I&#39;m trying to narrow that down now.  <br>
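<br>To illustrate the spool-then-bulk-load step, here is a minimal sketch in Python (the real scripts are Perl).  The function names, the ".tmp" to ".tsv" rename convention, and the table name are assumptions for illustration only.  It writes a batch in the tab-separated text format that LOAD DATA reads by default and builds the statement a separate loader process would run; it does not connect to MySQL:<br>
<pre>
```python
import os
import tempfile


def escape_field(value):
    """Escape a value for the default LOAD DATA text format."""
    text = str(value)
    # Backslash first, so the escapes added below are not escaped again.
    return text.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n")


def write_batch(rows, directory):
    """Write rows to a new tab-separated file and return its path."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8", newline="\n") as handle:
        for row in rows:
            handle.write("\t".join(escape_field(v) for v in row) + "\n")
    # Rename once the file is complete, so a loader that only picks up
    # *.tsv files never reads a half-written batch.
    final_path = tmp_path[:-4] + ".tsv"
    os.rename(tmp_path, final_path)
    return final_path


def load_statement(path, table):
    """Build the statement the loader process would execute."""
    return f"LOAD DATA INFILE '{path}' INTO TABLE {table}"


if __name__ == "__main__":
    spool = tempfile.mkdtemp()
    batch = [(1243792380, "fw1", "deny tcp"), (1243792381, "fw2", "permit udp")]
    path = write_batch(batch, spool)
    print(load_statement(path, "meta_firewall_13"))
```
</pre>
Because the writer and the loader only share files, a burst just means more files waiting in the spool directory rather than dropped messages.<br>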

<br>In any case, each node will receive 1/Nth of the total logs and will write to its own local database.  The frontend will issue parallel queries to each node to achieve horizontal partitioning and will be responsible for collating the query results to deliver to the end user.  With just the one test node, I&#39;m finding that you can request arbitrary text from 150 million messages and get the results in less than 1 second.  Indexed fields are even faster (which is what I&#39;m counting on for writing reporting later).  The plan right now is to eventually put this framework and frontend up on SourceForge when it becomes alpha-quality.  It will also include a frontend for creating db-parser XML patterns from example log files with just point-and-click.  <br>
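<br>The fan-out and collation on the frontend could look roughly like this Python sketch.  Each &quot;node&quot; here is just a callable standing in for a database connection, and the function name and row layout (timestamp first) are assumptions for illustration:<br>
<pre>
```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def query_all_nodes(nodes, query, limit):
    """Run the query on every node in parallel and merge rows by timestamp.

    Each node is a callable returning rows already sorted by their first
    element (the timestamp); a real node would wrap a database handle.
    """
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        per_node = list(pool.map(lambda node: node(query), nodes))
    # The per-node results are sorted, so a k-way merge keeps global order
    # without re-sorting everything.
    return list(islice(heapq.merge(*per_node), limit))


if __name__ == "__main__":
    def node_a(query):
        return [(1, "node-a", "deny tcp"), (4, "node-a", "permit udp")]

    def node_b(query):
        return [(2, "node-b", "deny udp"), (3, "node-b", "deny tcp")]

    for row in query_all_nodes([node_a, node_b], "deny", limit=3):
        print(row)
```
</pre>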

<br>Suggestions/criticism are welcome for all I&#39;ve mentioned.<br><br>Thanks,<br><br>Martin<br><br><div class="gmail_quote">On Sun, May 31, 2009 at 1:53 PM, Jan Schaumann <span dir="ltr">&lt;<a href="mailto:jschauma@netmeister.org">jschauma@netmeister.org</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="im">Martin Holste &lt;<a href="mailto:mcholste@gmail.com">mcholste@gmail.com</a>&gt; wrote:<br>


&gt; Out of curiosity, how many messages per second was the stock syslogd able to<br>

&gt; process with minimal loss?<br>

<br>

</div>Between 15K and 18K / s.<br>

<div class="im"><br>

&gt; What method did you employ to determine loss?<br>

<br>

</div>Effectively:<br>

<br>

n1=$(netstat -s -p udp | awk &#39;/dropped due to full socket/ { print $1 }&#39;)<br>

sleep 5<br>

n2=$(netstat -s -p udp | awk &#39;/dropped due to full socket/ { print $1 }&#39;)<br>

<br>

(In reality, there&#39;s a tool that works much like sar(1) does and I can<br>

query it for stats, but underneath it happens to use the above logic.)<br>

<div class="im"><br>

&gt; I am setting up a similar logging solution with NG using the db-parser module<br>

&gt; which takes considerable CPU.  I plan on using Cisco server load balancing<br>

&gt; to round-robin load balance on N number of syslog nodes to achieve zero<br>

&gt; loss<br>

<br>

</div>What&#39;s your plan for handling the messages on the N nodes?  Will they<br>

all just log to their own filesystem, write to a shared filesystem,<br>

write into a database, forward to another system, ... ?<br>

<font color="#888888"><br>

-Jan<br>

</font><br>______________________________________________________________________________<br>

Member info: <a href="https://lists.balabit.hu/mailman/listinfo/syslog-ng" target="_blank">https://lists.balabit.hu/mailman/listinfo/syslog-ng</a><br>

Documentation: <a href="http://www.balabit.com/support/documentation/?product=syslog-ng" target="_blank">http://www.balabit.com/support/documentation/?product=syslog-ng</a><br>

FAQ: <a href="http://www.campin.net/syslog-ng/faq.html" target="_blank">http://www.campin.net/syslog-ng/faq.html</a><br>

<br>

<br></blockquote></div><br>