On Wed, 2010-07-07 at 13:37 +0200, Balazs Scheidler wrote:
On Mon, 2010-07-05 at 12:05 -0500, Martin Holste wrote:
A naive schema-based SQL destination would simply create as many tables as there are schemas. A better-optimized one would use the NV -> field mapping that you propose, and a NoSQL implementation would just scale to any number of NV pairs without having to rename the fields.
This mapping support would also be useful if we want to generate CEF/CEE formatted events.
Hm, so maybe we need to decouple the actual DB stuff from the XML schema and declare it out of scope, since it's really up to the implementer to figure that out, and the specific implementation will likely change for each setup. I think what's essential is providing the list of name-value pairs and whether they are integer or string. Maybe there could be a "contrib" section on your site with contributed scripts for stamping out the various configurations (e.g. multi-table SQL, NoSQL, etc.).
I'd like to create a generic SQL destination, which would magically work without having to explicitly configure the table schema (i.e. no need to generate the configuration).
If type information is present, then the field names for your condensed table could be generated on the fly. I think I'd leave this question open for a while, until we get that generic SQL destination.
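To illustrate what generating the condensed field names on the fly could look like, here is a minimal sketch. It assumes each schema lists its name-value pairs together with a type of "integer" or "string", as suggested above. The column naming (i0, s0, ...) and the field names are hypothetical and only serve the example; nothing like this exists in syslog-ng today.

```python
# Sketch only: derive generic condensed-table column names from typed
# name-value pairs. The "i"/"s" prefixes and the sample field names
# are assumptions made for this example.

def condensed_columns(fields):
    """Map (name, type) pairs to generic columns such as i0, s0, s1."""
    prefix = {"integer": "i", "string": "s"}
    counters = {"integer": 0, "string": 0}
    mapping = {}
    for name, ftype in fields:
        if ftype not in prefix:
            raise ValueError(f"unsupported type {ftype!r} for field {name!r}")
        # Each type has its own counter, so integers and strings are
        # numbered independently.
        mapping[name] = f"{prefix[ftype]}{counters[ftype]}"
        counters[ftype] += 1
    return mapping


# Hypothetical schema definition: field names are illustrative.
secevt_fields = [
    ("secevt.user", "string"),
    ("secevt.port", "integer"),
    ("secevt.host", "string"),
]

secevt_mapping = condensed_columns(secevt_fields)
for name, column in secevt_mapping.items():
    print(f"{name} -> {column}")
```

With a mapping like this, the destination would not need a hand-written table configuration; it would only need the typed field list from the schema.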
The problem is that I'd like to support the multiple tables idea as well, e.g. store each schema in a separate table. In this case you need a unique ID in order to join the tables. Also, if this were combined with the MSGID field of RFC5424, it could be used to fetch the original raw message easily.
It looks to me like MSGID is better suited for a tag than being part of the ID itself. From the RFC: "It is intended for filtering messages on a relay or collector." A unique ID across multiple tables is not a problem as long as there is one master table where you would put the syslog header fields with an auto-increment column to generate the ID. If you absolutely wanted Syslog-NG to generate the ID, I suppose you could append a CRC of the $MSG to the epoch timestamp, though that isn't foolproof.
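A rough sketch of both ideas, using Python's built-in sqlite3 and zlib modules. The table and column names (logs, secevt, log_id, username, port) are made up for the example, and the "epoch-crc" ID format is just one possible way to append the CRC. As noted, the CRC variant is not foolproof: two identical messages in the same second get the same ID.

```python
import sqlite3
import time
import zlib

# Master table holds the syslog header fields; its auto-increment
# column generates the ID that the per-schema tables refer to.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logs ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " host TEXT, program TEXT, msg TEXT)"
)
# One table per schema (hypothetical 'secevt' schema shown here).
conn.execute(
    "CREATE TABLE secevt ("
    " log_id INTEGER REFERENCES logs(id),"
    " username TEXT, port INTEGER)"
)


def insert_event(conn, host, program, msg, username, port):
    """Insert the header row first, then reuse its ID in the schema table."""
    cur = conn.execute(
        "INSERT INTO logs (host, program, msg) VALUES (?, ?, ?)",
        (host, program, msg),
    )
    log_id = cur.lastrowid
    conn.execute(
        "INSERT INTO secevt (log_id, username, port) VALUES (?, ?, ?)",
        (log_id, username, port),
    )
    return log_id


def crc_message_id(msg, epoch=None):
    """Alternative ID without a database: epoch seconds plus CRC32 of the message."""
    epoch = int(time.time()) if epoch is None else int(epoch)
    crc = zlib.crc32(msg.encode("utf-8")) & 0xFFFFFFFF
    return f"{epoch}-{crc:08x}"


log_id = insert_event(conn, "host1", "sshd", "Accepted password for bob", "bob", 22)
row = conn.execute(
    "SELECT l.host, l.msg, s.username, s.port"
    " FROM logs l JOIN secevt s ON s.log_id = l.id WHERE l.id = ?",
    (log_id,),
).fetchone()
print(log_id, row)
print(crc_message_id("Accepted password for bob", 1278502620))
```

The join at the end also shows how the original message can be fetched from the master table by ID.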
Right, I had the wrong impression of what MSGID is. Not that I understand or agree with the way it was defined, though.
Anyway, I wouldn't want to store the syslog message in the database only to get an ID, and the use of this ID would be optional.
Hmm, maybe "details" should be above all schemas, e.g. instead of calling it "secevt.details", it should be called "details". It is a single pattern that extracts all the fields, after all, so the pattern author can decide which information wouldn't fit into any of the schemas and put that in details.
Yep, I think details would be a good spot for all miscellany, as well as other meta-data that is inherent to a specific log class that doesn't fit in a predefined field.
Agreed.
Well, I believe that in SQL, the best we could probably come up with is a "list of tags field" and use free-text indexing.
Yes, for instance, the Sphinx full-text search engine has a Multi-Value Attribute (MVA) config attribute which is specifically designed for efficiently storing a list of any number of tag IDs for a given record.
That's what I thought.
I'm going to update the document with these decisions. Thanks for your feedback, I really appreciate it.
I've updated the patterndb policy document with the latest discussion points at http://git.balabit.hu/

I still have some open points:
* ruleset and rule IDs (UUID vs something else)
* ruleset organization

I'd appreciate feedback on the current policy.

-- 
Bazsi