MongoDB destination driver
Greetings!

I was contemplating switching one of the websites I maintain from PostgreSQL to one of these document store things, and while wondering how best to evaluate the options, I stumbled upon a short article on the MongoDB website: http://blog.mongodb.org/post/172254834/mongodb-is-fantastic-for-logging, which promptly led me to ponder how hard - or easy, as it turns out - it would be to write a MongoDB destination driver for syslog-ng.

Only four hours later, I can present you the mongodb destination driver, available from my git repo (note the -b option, it is important):

$ git clone -b algernon/dest/mongodb git://git.madhouse-project.org/syslog-ng/syslog-ng-3.2.git

Of course, one can browse the sources on the web too: http://git.madhouse-project.org/syslog-ng/syslog-ng-3.2/tree/modules/afmongo...

It uses the MongoDB C client library (http://www.mongodb.org/display/DOCS/C+Language+Center) - I simply embedded the sources for now, lacking a better option.

Once compiled, one can already begin using it with the default options:

destination d_mongodb { mongodb(); };

This will try to connect to localhost:27017, use the logs collection in the syslog-ng database, and log all the standard fields. Of course, all of those are configurable!

To demonstrate all the currently available options, the destination definition above is the same as the following:

destination d_mongodb {
  mongodb(
    host("localhost")
    port(27017)
    database("syslog-ng")
    collection("logs")
    keys("date", "facility", "level", "host", "program", "pid", "message")
    values("${R_YEAR}-${R_MONTH}-${R_DAY} ${R_HOUR}:${R_MIN}:${R_SEC}",
           "$FACILITY", "$LEVEL", "$HOST", "$PROGRAM", "$PID", "$MSGONLY")
  );
};

A few things, like authentication and some template options, are not configurable yet, partly because I haven't figured out what they're good for, or how they work. But I will get there at some point, especially if there's interest in said features.

All in all, I'm very happy that I could cook up a fairly simple destination driver within a few hours, having no prior experience writing one. The syslog-ng code is amazing, by the way; it was a breeze to navigate through and find the stuff I needed to make this driver work.

Mind you, this is very new code, and I haven't tested it extensively, but I do have some great plans involving syslog-ng and mongodb >;)

Hope you like the code, and perhaps find it useful!

-- |8]
On Thu, Dec 30, 2010 at 08:11:07PM +0100, Gergely Nagy wrote:
$ git clone -b algernon/dest/mongodb git://git.madhouse-project.org/syslog-ng/syslog-ng-3.2.git
It is using the MongoDB C client library (http://www.mongodb.org/display/DOCS/C+Language+Center) - I simply embedded the sources for now, lacking a better option. Once compiled, one can already begin using it with the default options:
destination d_mongodb { mongodb(); };
This will try to connect to localhost:27017, and use the logs collection in the syslog-ng database, and will log all the standard fields. Of course, all of those are configurable!
To demonstrate all the - currently - available options, the destination definition above is the same as the following:
destination d_mongodb {
  mongodb(
    host("localhost")
    port(27017)
    database("syslog-ng")
    collection("logs")
    keys("date", "facility", "level", "host", "program", "pid", "message")
    values("${R_YEAR}-${R_MONTH}-${R_DAY} ${R_HOUR}:${R_MIN}:${R_SEC}",
           "$FACILITY", "$LEVEL", "$HOST", "$PROGRAM", "$PID", "$MSGONLY")
  );
};
A few things, like authentication and some template options are not configurable yet, partly because I didn't figure out what they're good for, or how they work. But I will get there at some point, especially if there's interest in said features.
Hope you like the code, and perhaps find it useful!
Good work. I am wondering if support for MongoDB must be added to the core code or if it could also be added as a libdbi driver, which could be used in more than just syslog-ng.

I am also wondering if it would be possible to take advantage of MongoDB's dynamic nature, and log all of the defined name-value pairs in a message, or a list of name-value pairs. This would eliminate a common problem of trying to get the most possible message fields into the DB without wasting space on empty ones, which plagues many of us when using relational DBs to store large quantities of oftentimes dissimilar messages from different devices or software applications.

Matthew.
Good work. I am wondering if support for MongoDB must be added to the core code or if it could also be added as a libdbi driver which could be used in more than just syslog-ng.
While the inserting code is very similar, and could be added to libdbi, the query code is very different. I do not think that adding it to libdbi would work.

But to be honest, even the inserting code is different enough to make it tricky at best to add it to libdbi.
I am wondering if it would be possible to take advantage of MongoDB's dynamic nature, and log all of the defined name-value pairs in a message, or of a list of name-value pairs.
Not adding empty values is certainly possible, I'll add support for that shortly. Thanks for the suggestion!

It already is possible to log only selected name-value pairs:

destination d_mongodb { mongodb(keys("host", "message") values("$HOST", "$MSGONLY")); };

Though, if one is empty, it will still be added to the store at the moment, but like I said, I'll fix that shortly.
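To illustrate (a hypothetical document; the _id is added automatically and the message text is only an example), the two-key destination above would produce entries along the lines of:

{ "_id" : ObjectId("..."), "host" : "localhost", "message" : "Accepted publickey for algernon from ::1 port 59690 ssh2" }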
Very cool. As a stop-gap, one can always pipe to a program() to do the actual inserts. That gives you a chance to batch the logs as a TSV and then run mongoimport on the TSV for high-performance inserts. You should be able to get around 20k-50k inserts/sec that way.

The key thing to know when profiling MongoDB inserts is that you need to let everything run long enough for Mongo to fill RAM to capacity so that it is forced to begin using disk. Up until that point, everything is done in RAM, which means you're not seeing the long-term rates, only the burst rates.

On Thu, Dec 30, 2010 at 2:51 PM, Gergely Nagy <algernon@madhouse-project.org> wrote:
Good work. I am wondering if support for MongoDB must be added to the core code or if it could also be added as a libdbi driver which could be used in more than just syslog-ng.
While the inserting code is very similar, and could be added to libdbi, the query code is very different. I do not think that adding it to libdbi would work.
But to be honest, even the inserting code is different enough to make it tricky at best, to add it to libdbi.
I am wondering if it would be possible to take advantage of MongoDB's dynamic nature, and log all of the defined name-value pairs in a message, or of a list of name-value pairs.
Not adding empty values is certainly possible, I'll add support for that shortly. Thanks for the suggestion!
It already is possible to log only selected name-value pairs:
destination d_mongodb { mongodb(keys("host", "message") values("$HOST", "$MSGONLY")); };
Though, if one is empty, it will still be added to the store at the moment, but like I said, I'll fix that shortly.
Very cool. As a stop-gap, one can always pipe to a program() to do the actual inserts. That gives you a chance to batch the logs as a TSV and then run mongoimport on the TSV for high-performance inserts. You should be able to get around 20k-50k inserts/sec that way.
I plan to do bulk inserts within the driver - much like how the sql driver does bulk commits with explicit-commit turned on. The plan is to make a writer thread, which will combine a (configurable) set of documents and insert them in bulk. I'm halfway through implementing that, should be done during the weekend.
The key thing to know when profiling MongoDB inserts is that you need to let everything run long enough for Mongo to fill RAM to capacity so that it is forced to begin using disk. Up until that point, everything is done in RAM, which means you're not seeing the long-term rates, only the burst rates.
Oh, that's nice to know, thanks!
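As a side note, a rough way to tell when the working set has outgrown RAM during such a test is to compare the server's resident memory against the data size from the mongo shell. A minimal sketch - the commands exist in the shell, though the exact field layout may differ between server versions:

db.serverStatus().mem   // resident / virtual / mapped memory used by mongod, in MB
db.stats()              // dataSize and storageSize of the current database, in bytes

Once dataSize comfortably exceeds the resident memory reported above, the measured insert rate should be the long-term, disk-bound one rather than the in-RAM burst rate.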
I plan to do bulk inserts within the driver - much like how the sql driver does bulk commits with explicit-commit turned on.
If that's the equivalent of insert () values (), (), () batching, then that's not what I mean. In MySQL, and similarly though not to the same degree in other DBMS's, bulk insert is very different from batched inserts. Specifically, a LOAD DATA in MySQL will yield upwards of 100k inserts/sec because it gets a full table write lock and buffers the inserts at the columnar data level, not the statement level. It's closer to batch writing sectors of disk, not rows of data. MS-SQL bcp and mongoimport behave similarly, though the difference isn't quite as pronounced as it is with mysqlimport on a MyISAM table.
If that's the equivalent of insert () values (), (), () batching, then that's not what I mean.
It's probably better than that - mongodb has a bulk insert command, I expect it to work like mongoimport, but I'll take a look next year.
Specifically, a LOAD DATA in MySQL will yield upwards of 100k inserts/sec because it gets a full table write lock and buffers the inserts at the columnar data level, not the statement level. It's closer to batch writing sectors of disk, not rows of data.
... mongoimport behave[s] similarly, though the difference isn't quite as pronounced as it is with mysqlimport on a MyISAM table.
We should also point out that grabbing these kinds of locks and making these kinds of manipulations should be done as part of careful planning, since it can render the table inaccessible for long-ish periods through normal means such as queries, and could require some potentially time-intensive index rebuilding, since indexing is turned off during some of these manipulations. (Not sure what percentage of this applies to MongoDB since it's a bit unique.)

Perhaps it would be good if we could work together (several of us have been experimenting with optimum buffering, database and index setups, etc.) to figure out what the best practices are in terms of initial storage, indexing, retention, archiving, etc.

Matthew.
We should also point out that grabbing these kinds of locks and making these kinds of manipulations should be done as part of careful planning since it can render the table inaccessible for long-ish periods through normal means such as queries and could require some potentially time intensive index rebuilding since indexing is turned off during some of these manipulations. (Not sure what percentage of this applies to MongoDB since it's a bit unique).
For instance, using "LOAD DATA CONCURRENT INFILE" will allow reads to occur while doing the bulk imports in MySQL. The manual says there is a slight performance hit, but it is unnoticeable in my experience. I haven't tested to see what actual locking occurs during mongoimport.
Perhaps it would be good if we could work together (several of us have been experimenting with optimum buffering, database and index setups, etc.) to figure out what the best practices are in terms of initial storage, indexing, retention, archiving, etc.
Absolutely. The biggest challenge I've come across is how to properly do archiving. I've been using the ARCHIVE storage engine in MySQL because the compact row format actually compresses blocks of rows, not columnar data, giving you a 10:1 (or more) compression ratio on log data while still maintaining all of the metadata. The main drawback is that the archive storage engine is poorly documented: specifically, if MySQL crashes while an archive table is open, it will mark that table as crashed and rebuild the entire table on startup. It will usually have to do this for all archive tables under normal operation, which means that the time to recover is on the order of many hours on even a modest number of tables. There is no (documented) way to configure this or to change the table status, since it's not actually "marked" crashed.

Then there's the challenge of performing the conversion from normal log table to compressed log table. I found that it takes so long to compress large tables that it's better just to record everything twice: once to the short-term, uncompressed tables, once to the compressed tables. Obviously, that situation is non-optimal, and I am all for suggestions as to how bulk data should be handled and welcome discussions on the topic.
Super cool! At those rates, I think few will benefit from the bulk insert benefits, so I'd put that low on the feature priority list, especially with the opportunity to create bugs with the complexity.

My main feature to add (aside from the two you mentioned already on the roadmap) would be a way to use the keys from a patterndb database so that the db and collection in Mongo stay the same, but the key names change with every patterndb rule. That's really the big payoff with Mongo--you don't have to define a rigid schema, so you don't have to know the column names ahead of time. That's a big deal considering that the patterndb can change on the fly. Being confined to predefined templates in the config limits the potential.

Bazsi, any idea how to do this?

On Sat, Jan 1, 2011 at 2:18 PM, Martin Holste <mcholste@gmail.com> wrote:
We should also point out that grabbing these kinds of locks and making these kinds of manipulations should be done as part of careful planning since it can render the table inaccessible for long-ish periods through normal means such as queries and could require some potentially time intensive index rebuilding since indexing is turned off during some of these manipulations. (Not sure what percentage of this applies to MongoDB since it's a bit unique).
For instance, using "LOAD DATA CONCURRENT INFILE" will allow reads to occur while doing the bulk imports in MySQL. The manual says there is a slight performance hit, but it is unnoticeable in my experience. I haven't tested to see what actual locking occurs during mongoimport.
Perhaps it would be good if we could work together (several of us have been experimenting with optimum buffering, database and index setups, etc.) to figure out what the best practices are in terms of initial storage, indexing, retention, archiving, etc.
Absolutely. The biggest challenge I've come across is how to properly do archiving. I've been using the ARCHIVE storage engine in MySQL because the compact row format actually compresses blocks of rows, not columnar data, giving you a 10:1 (or more) compression ratio on log data while still maintaining all of the metadata. The main drawback is that the archive storage engine is poorly documented: specifically, if MySQL crashes while an archive table is open, it will mark that table as crashed and rebuild the entire table on startup. It will usually have to do this for all archive tables under normal operation, which means that the time to recover is on the order of many hours on even a modest number of tables. There is no (documented) way to configure this or to change the table status, since it's not actually "marked" crashed.
Then there's the challenge of performing the conversion from normal log table to compressed log table. I found that it takes so long to compress large tables that it's better just to record everything twice: once to the short-term, uncompressed tables, once to the compressed tables. Obviously, that situation is non-optimal, and I am all for suggestions as to how bulk data should be handled and welcome discussions on the topic.
On Sat, Jan 01, 2011 at 02:24:10PM -0600, Martin Holste wrote:
My main feature to add (aside from the two you mentioned already on the roadmap) would be a way to use the keys from a patterndb database so that the db and collection in Mongo stay the same, but the key names change with every patterndb rule. That's really the big payoff with Mongo-- you don't have to define a rigid schema, so you don't have to know the column names ahead of time. That's a big deal considering that the patterndb can change on the fly. Being confined to predefined templates in the config limits the potential.
This is why I asked in my earlier mail if it's possible to set up the mongo driver to log all vars in a message or a subset of vars in a message. I was hoping it'd be possible for the schema to change somewhat dynamically based on what's present in the messages. Matthew.
This is why I asked in my earlier mail if it's possible to set up the mongo driver to log all vars in a message or a subset of vars in a message. I was hoping it'd be possible for the schema to change somewhat dynamically based on what's present in the messages.
You can set it up to log a set of vars, and it will only actually insert the non-empty values.

Say, if you have something like this:

destination d_mongo {
  mongodb(
    keys("host", "program", "pid", "message")
    values("$HOST", "$PROGRAM", "$PID", "$MSGONLY")
  );
};

If a message does not contain a PID, then that will not be added to the document, only the rest.

Thus, if you set up a maximum set of vars, that'll do just what you need, and only add those that do have a value.

To the best of my knowledge it is not possible to log all available variables (that would be bad too, since there are overlapping macros), but you can set up a selected maximum set, and the driver will Do The Right Thing, and only store those parts of it that are set.
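To make that concrete (hypothetical output; the second message is assumed to come from something without a PID, and the _id fields are omitted), the destination above would yield documents like:

{ "host" : "localhost", "program" : "sshd", "pid" : "12674", "message" : "Accepted publickey for algernon from ::1 port 59690 ssh2" }
{ "host" : "localhost", "program" : "kernel", "message" : "eth0: link becomes ready" }

i.e. the second document simply has no "pid" key at all, rather than carrying an empty one.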
On 01/02/2011 12:51 AM, Gergely Nagy wrote:
This is why I asked in my earlier mail if it's possible to set up the mongo driver to log all vars in a message or a subset of vars in a message. I was hoping it'd be possible for the schema to change somewhat dynamically based on what's present in the messages.

You can set it up to log a set of vars, and it will only actually insert the non-empty values.
Say, if you have something like this:
destination d_mongo { mongodb( keys("host", "program", "pid", "message") values("$HOST", "$PROGRAM", "$PID", "$MSGONLY") ); };
If a message does not contain a PID, then that will not be added to the document, only the rest.
Thus, if you set a maximum of vars, that'll do just what you need, and only add those that do have a value.
To the best of my knowledge it is not possible to log all available variables (that would be bad too, since there are overlapping macros), but you can set up a selected maximum set, and the driver will Do The Right Thing, and only store those parts of it, that are set.
Hi,

first of all, thanks for the great work. I agree with Matthew that it would be really important to make this driver "dynamic", as it would be a great tool combined with patterndb for reporting, without the need to pre-define fields and a dozen destination statements.

It is actually not that hard to achieve (again, syslog-ng is a breeze): pdbtool does quite the same when emitting all variables, and the nv_table_foreach() function is there to iterate over all of the name-value pairs. However, the NVTable struct stores the builtin and dynamic values separately, and with a small bit of copy-paste coding in nvtable.c you can grab only the dynamic values.

Please find a patch attached that introduces the flags() option for the mongodb driver and the auto_nvpairs flag, which inserts all dynamic name-value pairs into the DB as well. I'm sure that there's a better way to implement some parts of it, so please somebody review and clean up if possible :)

Usage would look something like this:

destination d_mongo {
  mongodb(
    database("logs")
    keys("host", "program", "pid", "message")
    values("$HOST", "$PROGRAM", "$PID", "$MSGONLY")
    flags(auto_nvpairs)
  );
};

No performance measurements were done yet, I would be glad to see it on the same box and same settings as the previous ones. Of course this will be a bit slower, and it makes sense only if you use it in conjunction with patterndb, but I expect no drastic drop in performance.

(Disclaimer: I am not a developer, this code is far from being ready for production, may leak, etc, etc)

Balint
I agree with Matthew, that it would be really important to make this driver "dynamic", as it would be a great tool combined with patterndb for reporting without the need to pre-define fields and a dozen of destination statements.
Aha! Apologies for being confused before: I have to admit, I never used patterndb before, and totally forgot about it.
It is actually not that hard to achieve (again, syslog-ng is a breeze), pdbtool does quite the same when emitting all variables, the nv_table_foreach() function is there to iterate over all of the name-value pairs.
However the NVTable struct stores the builtin and dynamic values separately and with a small copy-paste coding in nvtable.c you can grab only the dynamic values.
Please find a patch attached that introduces the flags() option for the mongodb driver and the auto_nvpairs flag, that inserts all dynamic name-value pairs into the DB as well. I'm sure that there's a better way to implement some parts of it, so please somebody review and clean up if possible :)
The patch looks good on first read, but I'll have a closer look tonight, and run a quick benchmark as well, if all goes well. Thanks!
The patch looks good on first read, but I'll have a closer look tonight, and run a quick benchmark as well, if all goes well.
The patch looked fine on the second read too, and I integrated it, with a few changes:

Instead of using a flag, I introduced a patterndb_key("foo") setting, which, if turned on, will put the patterndb results under the specified key, as a sub-document. If not specified, it will do nothing extra.

In my opinion, this solution is clearer, and results in a better structured log entry.

Usage is like this:

destination d_mongo { mongodb( patterndb_key("patterndb") ); };

The resulting log entry in mongodb looks something like this:
db.logs.find()
{ "_id" : ObjectId("4d2235525edd07af78f648f9"), "date" : "2011-01-03 21:45:06", "facility" : "auth", "level" : "info", "host" : "localhost", "program" : "sshd", "pid" : "12674", "message" : "Accepted publickey for algernon from ::1 port 59690 ssh2", "patterndb" : { ".classifier.class" : "system", ".classifier.rule_id" : "4dd5a329-da83-4876-a431-ddcb59c2858c", "usracct.authmethod" : "publickey for algernon from ::1 port 59690 ssh2", "usracct.username" : "algernon from ::1 port 59690 ssh2", "usracct.device" : "::1 port 59690 ssh2", "usracct.service" : "ssh2", "usracct.type" : "login", "usracct.sessionid" : "12674", "usracct.application" : "sshd", "secevt.verdict" : "ACCEPT" } }
{ "_id" : ObjectId("4d2235525edd07af78f648fa"), "date" : "2011-01-03 21:45:06", "facility" : "authpriv", "level" : "info", "host" : "localhost", "program" : "sshd", "pid" : "12674", "message" : "pam_unix(sshd:session): session opened for user algernon by (uid=0)", "patterndb" : { ".classifier.class" : "unknown" } }
As you can see, the second log entry is not recognised by patterndb, thus only an unknown classifier.class is logged, and nothing else.

It also highlights a few problems in the patterndb I used for sshd, namely that it doesn't like ipv6 all that much.

The changes are now pushed to my repository. I'll do a couple of benchmarks later tonight.

-- |8]
Great idea to have a dedicated, user-configurable sub-key. One suggestion: I think that key names cannot contain dots in Mongo. They don't really make sense because this:

"patterndb" : {
  ".classifier.class" : "system",
  ".classifier.rule_id" : "4dd5a329-da83-4876-a431-ddcb59c2858c",
  "usracct.authmethod" : "publickey for algernon from ::1 port 59690 ssh2",
  "usracct.username" : "algernon from ::1 port 59690 ssh2",
  "usracct.device" : "::1 port 59690 ssh2",
  "usracct.service" : "ssh2",
  "usracct.type" : "login",
  "usracct.sessionid" : "12674",
  "usracct.application" : "sshd",
  "secevt.verdict" : "ACCEPT"
}

should really look like this:

"patterndb" : {
  "classifier" : {
    "class" : "system",
    "rule_id" : "4dd5a329-da83-4876-a431-ddcb59c2858c"
  },
  "usracct" : {
    "authmethod" : "publickey for algernon from ::1 port 59690 ssh2",
    "username" : "algernon from ::1 port 59690 ssh2",
    "device" : "::1 port 59690 ssh2",
    "service" : "ssh2",
    "type" : "login",
    "sessionid" : "12674",
    "application" : "sshd"
  },
  "secevt" : {
    "verdict" : "ACCEPT"
  }
}

I recognize, however, that this is not a trivial conversion. As a start, just doing a simple substitution of "." for "_" on keys would probably work just fine.

On Mon, Jan 3, 2011 at 3:02 PM, Gergely Nagy <algernon@balabit.hu> wrote:
The patch looks good on first read, but I'll have a closer look tonight, and run a quick benchmark as well, if all goes well.
The patch looked fine on the second read too, and I integrated it, with a few changes:
Instead of using a flag, I introduced a patterndb_key("foo") setting, which, if turned on, will put the patterndb results under the specified key, as a sub-document. If not specified, it will do nothing extra.
In my opinion, this solution is clearer, and results in a better structured log entry.
Usage is like this:
destination d_mongo { mongodb( patterndb_key("patterndb") ); };
The resulting log entry in mongodb looks something like this:
db.logs.find() { "_id" : ObjectId("4d2235525edd07af78f648f9"), "date" : "2011-01-03 21:45:06", "facility" : "auth", "level" : "info", "host" : "localhost", "program" : "sshd", "pid" : "12674", "message" : "Accepted publickey for algernon from ::1 port 59690 ssh2", "patterndb" : { ".classifier.class" : "system", ".classifier.rule_id" : "4dd5a329-da83-4876-a431-ddcb59c2858c", "usracct.authmethod" : "publickey for algernon from ::1 port 59690 ssh2", "usracct.username" : "algernon from ::1 port 59690 ssh2", "usracct.device" : "::1 port 59690 ssh2", "usracct.service" : "ssh2", "usracct.type" : "login", "usracct.sessionid" : "12674", "usracct.application" : "sshd", "secevt.verdict" : "ACCEPT" } } { "_id" : ObjectId("4d2235525edd07af78f648fa"), "date" : "2011-01-03 21:45:06", "facility" : "authpriv", "level" : "info", "host" : "localhost", "program" : "sshd", "pid" : "12674", "message" : "pam_unix(sshd:session): session opened for user algernon by (uid=0)", "patterndb" : { ".classifier.class" : "unknown" } }
As you can see, the second log entry is not recognised by patterndb, thus only an unknown classifier.class is logged, and nothing else.
It also highlights a few problems in the patterndb I used for sshd, namely that it doesn't like ipv6 all that much.
The changes are now pushed to my repository. I'll do a couple of benchmarks later tonight.
-- |8]
On Mon, 2011-01-03 at 15:14 -0600, Martin Holste wrote:
Great idea to have a dedicated, user-configurable sub-key. One suggestion: I think that key names cannot contain dots in Mongo.
They can. Database names can't contain dots, but collection and key names can contain pretty much anything. The example I posted earlier was taken from my mongodb directly, I only changed the formatting - so yeah, it does allow dots, however surprising that may be :)
They don't really make sense because this:
"patterndb" : { ".classifier.class" : "system", ".classifier.rule_id" : "4dd5a329-da83-4876-a431-ddcb59c2858c", "usracct.authmethod" : "publickey for algernon from ::1 port 59690 ssh2", "usracct.username" : "algernon from ::1 port 59690 ssh2", "usracct.device" : "::1 port 59690 ssh2", "usracct.service" : "ssh2", "usracct.type" : "login", "usracct.sessionid" : "12674", "usracct.application" : "sshd", "secevt.verdict" : "ACCEPT" }
should really look like this:
"patterndb" : { "classifier": { "class" : "system", "rule_id" : "4dd5a329-da83-4876-a431-ddcb59c2858c" }, "usracct": { "authmethod" : "publickey for algernon from ::1 port 59690 ssh2", "username" : "algernon from ::1 port 59690 ssh2", "device" : "::1 port 59690 ssh2", "service" : "ssh2", "type" : "login", "sessionid" : "12674", "application" : "sshd", }, "secevt":{ "verdict" : "ACCEPT" } }
I agree, that would be awesome to have, and I might just go ahead and implement it, but only as a togglable option (since it requires additional processing).
I recognize, however, that this is not a trivial conversion. As a start, just doing a simple substitution of "." for "_" on keys would probably work just fine.
No need to, dots are fine with mongo. -- |8]
On Mon, 2011-01-03 at 22:28 +0100, Gergely Nagy wrote:
On Mon, 2011-01-03 at 15:14 -0600, Martin Holste wrote:
Great idea to have a dedicated, user-configurable sub-key. One suggestion: I think that key names cannot contain dots in Mongo.
They can. Database names can't contain dots, but collection and key names can contain pretty much anything.
Actually, never mind that. It appears mongodb will happily store key names with dots, but we can't query them.

I'll see what I can do: will probably go with splitting into sub-documents.

-- |8]
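(The reason, as far as I understand the query language: a dot in a query key is always treated as sub-document traversal. So with a literal dotted key such as "usracct.application" stored inside the patterndb sub-document,

db.logs.find({ "patterndb.usracct.application" : "sshd" })

looks for patterndb -> usracct -> application as nested documents, and never matches the literal "usracct.application" key - which is why splitting into real sub-documents is the clean fix.)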
I see the confusion now. What I did was this:
db.createCollection("test"); { "ok" : 1 } db.getCollection("test").insert({"some.key": 1}); Mon Jan 3 15:10:09 uncaught exception: can't have . in field names [some.key]
This was discussed (and resolved under the following bug): http://jira.mongodb.org/browse/SERVER-1988

It was actually a bug that you could insert keys that contained a dot, and it now errors on it as of Dec 17th 2010.

On Mon, Jan 3, 2011 at 3:38 PM, Gergely Nagy <algernon@balabit.hu> wrote:
On Mon, 2011-01-03 at 22:28 +0100, Gergely Nagy wrote:
On Mon, 2011-01-03 at 15:14 -0600, Martin Holste wrote:
Great idea to have a dedicated, user-configurable sub-key. One suggestion: I think that key names cannot contain dots in Mongo.
They can. Database names can't contain dots, but collection and key names can contain pretty much anything.
Actually, nevermind that. It appears mongodb will happily store key names with dots, but we can't query them.
I'll see what I can do: will probably go with splitting into sub-documents.
-- |8]
On Mon, 2011-01-03 at 15:53 -0600, Martin Holste wrote:
I see the confusion now. What I did was this:
db.createCollection("test"); { "ok" : 1 } db.getCollection("test").insert({"some.key": 1}); Mon Jan 3 15:10:09 uncaught exception: can't have . in field names [some.key]
This was discussed (and resolved under the following bug): http://jira.mongodb.org/browse/SERVER-1988
It was actually a bug that you could insert keys that contained a dot, and it now errors on it as of Dec 17th 2010.
Aha, I see. Guess my mongodb (1.4.4) from debian squeeze is a bit old then. Well, all the better: I'll do the splitting into sub-documents trick. That will result in a nicer structure anyway. -- |8]
"patterndb" : { ".classifier.class" : "system", ".classifier.rule_id" : "4dd5a329-da83-4876-a431-ddcb59c2858c", "usracct.authmethod" : "publickey for algernon from ::1 port 59690 ssh2", "usracct.username" : "algernon from ::1 port 59690 ssh2", "usracct.device" : "::1 port 59690 ssh2", "usracct.service" : "ssh2", "usracct.type" : "login", "usracct.sessionid" : "12674", "usracct.application" : "sshd", "secevt.verdict" : "ACCEPT" }
should really look like this:
"patterndb" : { "classifier": { "class" : "system", "rule_id" : "4dd5a329-da83-4876-a431-ddcb59c2858c" }, "usracct": { "authmethod" : "publickey for algernon from ::1 port 59690 ssh2", "username" : "algernon from ::1 port 59690 ssh2", "device" : "::1 port 59690 ssh2", "service" : "ssh2", "type" : "login", "sessionid" : "12674", "application" : "sshd", }, "secevt":{ "verdict" : "ACCEPT" } }
I recognize, however, that this is not a trivial conversion. As a start, just doing a simple substitution of "." for "_" on keys would probably work just fine.
For the time being, the current tip of my branch converts . and $ to _ in dynamic key names (which is stricter than what mongodb allows, but it was simpler to implement it this way).

I also have an idea about how to convert the stuff to a well structured format. Actually, I have a few ideas, all with pros and cons:

#1: Insert the root document, update with dynamic values

We would insert the root document first, up to and including the patterndb: {} sub document. Then we'd iterate over the keys, and use mongodb's update method to add the rest of the stuff:
db.logs.update({_id: <id>}, {$set: {"patterndb.classifier.class": "system"}})
This has the upside of being almost trivial to implement, but has three notable flaws: it will result in more network traffic, and inserting a log message will not be atomic, since the dynamic values are added one at a time. It also has a good chance of fragmenting the database (though, mongodb is said to be clever enough to leave some padding space for objects to grow, which might save us in this case).

It is also possible to do bulk updates, like this:
db.logs.update({_id: <id>}, {$set: {"patterndb.classifier.class": "system", "patterndb.classifier.rule_id" : "4dd5a329-da83-4876-a431-ddcb59c2858c"}, "patterndb.secevt.verdict": "ACCEPT"} })
With this, we can reduce the whole operation to two steps: inserting the first, static content, then the dynamic values. However, all of the mentioned flaws remain even with this, they're just not as serious as if we'd insert one by one.

#2: Construct the whole document within syslog-ng

This has the upside of keeping network traffic to a minimum, and inserts will remain atomic. The downside is that I have no idea how to implement this properly and reliably yet. And my gut feeling is, that whatever solution I end up with, this method will be considerably slower and would require more processing power.

#3: Keep the status quo and leave it unstructured

No extra work required on either side, and the values are still reasonably easily queryable.

I'll implement #1 tonight, and make it so that one can choose between that and #3, for example with a flag(dynamic_values_restructure) or somesuch option. Gotta find a decent name for the flag, though.

-- |8]
#1: Insert the root document, update with dynamic values
We would insert the root document first, up to and including the patterndb: {} sub document. Then we'd iterate over the keys, and use mongodb's update method to add the rest of the stuff:
db.logs.update({_id: <id>}, {$set: {"patterndb.classifier.class": "system"}})
This has the upside of being almost trivial to implement, but has three notable flaws: it will result in more network traffic, and inserting a log message will not be atomic, since the dynamic values are added one at a time. It also has a good chance of fragmenting the database (though, mongodb is said to be clever enough to leave some padding space for objects to grow, which might save us in this case).
It is also possible to do bulk updates, like this:
db.logs.update({_id: <id>}, {$set: {"patterndb.classifier.class": "system", "patterndb.classifier.rule_id" : "4dd5a329-da83-4876-a431-ddcb59c2858c"}, "patterndb.secevt.verdict": "ACCEPT"} })
With this, we can reduce the whole operation to two steps: inserting the first, static content, then the dynamic values. However, all of the mentioned flaws remain even with this, they're just not as serious as if we'd insert one by one.
Good news: we can use upserts and get rid of all the flaws:
db.logs.update({_id: <id>}, {$set: {message: "some message", <rest of the static keys>, "patterndb.classifier.class": "system", "patterndb.classifier.rule_id": "0xdeadbeef", "patterndb.secevt.verdict": "ACCEPT"} }, true)
We just have to pre-generate the ID, which is luckily easy, as the mongodb driver has a function to do just that. In return, we get an atomic insert, only one message towards the mongodb server, and no fragmentation.

And it's dead easy to add this to my mongodb destination, since the dynamic values are already dot-separated, just the way we want them (I only have to strip the leading dots).

This will hit my branch sometime tonight, at which point I'll redo the benchmark tests.

-- |8]
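For the record, a mongo shell sketch of the same trick (hypothetical field values; the driver would do the equivalent through the C client, generating the ObjectId on the client side as mentioned above):

var id = new ObjectId();                       // pre-generate the document id
db.logs.update(
    { _id: id },                               // no document with this id exists yet
    { $set: { "message" : "some message",
              "patterndb.classifier.class" : "system",
              "patterndb.secevt.verdict" : "ACCEPT" } },
    true                                       // upsert: insert if nothing matched
);
db.logs.findOne({ _id: id });                  // the whole document shows up in one piece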
This is very good, especially since Mongo really simplifies creating indexes. Specifically, you can create an index on the "patterndb" key in your example message, and it will automatically index all subkeys and values. See the manual page here: http://www.mongodb.org/display/DOCS/Using+Multikeys+to+Simulate+a+Large+Numb... for the specific example.

Moreover, you can choose to index only certain subkeys to save inserting effort and disk space. All of this lends itself very nicely to patterndb.

On Tue, Jan 4, 2011 at 6:39 AM, Gergely Nagy <algernon@balabit.hu> wrote:
#1: Insert the root document, update with dynamic values
We would insert the root document first, up to and including the patterndb: {} sub document. Then we'd iterate over the keys, and use mongodb's update method to add the rest of the stuff:
db.logs.update({_id: <id>}, {$set: {"patterndb.classifier.class": "system"}})
This has the upside of being almost trivial to implement, but has three notable flaws: it will result in more network traffic, and inserting a log message will not be atomic, since the dynamic values are added one at a time. It also has a good chance of fragmenting the database (though, mongodb is said to be clever enough to leave some padding space for objects to grow, which might save us in this case).
It is also possible to do bulk updates, like this:
db.logs.update({_id: <id>}, {$set: {"patterndb.classifier.class": "system", "patterndb.classifier.rule_id" : "4dd5a329-da83-4876-a431-ddcb59c2858c"}, "patterndb.secevt.verdict": "ACCEPT"} })
With this, we can reduce the whole operation to two steps: inserting the first, static content, then the dynamic values. However, all of the mentioned flaws remain even with this, they're just not as serious as if we'd insert one by one.
Good news: we can use upserts and get rid of all the flaws:
db.logs.update({_id: <id>}, {$set: {message: "some message", <rest of the static keys>, "patterndb.classifier.class": "system", "patterndb.classifier.rule_id": "0xdeadbeef", "patterndb.secevt.verdict": "ACCEPT"} }, true)
We just have to pre-generate the ID, which is luckily easy, as the mongodb driver has a function to do just that. In return, we get an atomic insert, only one message towards the mongodb server, and no fragmentation.
And it's dead easy to add this to my mongodb destination, since the dynamic values are already dot-separated, just the way we want them (I only have to strip the leading dots).
This will hit my branch sometime tonight, at which point I'll redo the benchmark tests.
-- |8]
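Following up on the indexing note above, a sketch of how per-subkey indexes could be created from the shell, using the patterndb sub-document layout shown earlier in the thread (ensureIndex being the index-creation helper of that era):

// index only the subkeys you expect to query on, to save disk space and insert time
db.logs.ensureIndex({ "patterndb.classifier.class" : 1 });
db.logs.ensureIndex({ "patterndb.secevt.verdict" : 1 });

// queries on the same dotted paths can then use those indexes
db.logs.find({ "patterndb.secevt.verdict" : "ACCEPT" });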
On Tue, 2011-01-04 at 13:39 +0100, Gergely Nagy wrote:
#1: Insert the root document, update with dynamic values
This will hit my branch sometime tonight, at which point I'll redo the benchmark tests.
Done! And implemented in such a way that the static keys which one can specify in the keys() option can also contain dots, and they'll be handled properly (ie, turned into neat sub-documents).

Thus, with a block like this:

destination d_mongodb {
  mongodb(
    dynamic_values("dyn")
    keys("date", "host", "log.facility", "log.level", "program.name", "program.pid", "message")
    values("$DATE", "$HOST", "$FACILITY", "$LEVEL", "$PROGRAM", "$PID", "$MSGONLY")
  );
};

We can end up with a log entry like this:

{ "_id" : ObjectId("4d2370879d864e560000000a"),
  "date" : "Jan 4 20:09:59",
  "dyn" : {
    "classifier" : {
      "class" : "system",
      "rule_id" : "4dd5a329-da83-4876-a431-ddcb59c2858c"
    },
    "secevt" : { "verdict" : "ACCEPT" },
    "usracct" : {
      "application" : "sshd",
      "authmethod" : "publickey for algernon from 127.0.0.1 port 33659 ssh2",
      "device" : "127.0.0.1 port 33659 ssh2",
      "service" : "ssh2",
      "sessionid" : "10424",
      "type" : "login",
      "username" : "algernon from 127.0.0.1 port 33659 ssh2"
    }
  },
  "host" : "localhost",
  "log" : { "facility" : "auth", "level" : "info" },
  "message" : "Accepted publickey for algernon from 127.0.0.1 port 33659 ssh2",
  "program" : { "name" : "sshd", "pid" : "10424" }
}

Beautiful, isn't it? (And yes, my patterndb rules are still horrid; I'll fix them before I run the benchmarks)

And to show you the queries:
db.logs.find().count()
4
db.logs.find({"dyn.usracct.application": "sshd"}, {date: 1, host: 1, log: 1, "dyn.classifier.class": 1, message: 1, "dyn.secevt": 1})
{ "_id" : ObjectId("4d2370879d864e560000000a"), "date" : "Jan 4 20:09:59", "dyn" : { "classifier" : { "class" : "system" }, "secevt" : { "verdict" : "ACCEPT" } }, "host" : "localhost", "log" : { "facility" : "auth", "level" : "info" }, "message" : "Accepted publickey for algernon from 127.0.0.1 port 33659 ssh2" }
{ "_id" : ObjectId("4d2371689d864e560000000d"), "date" : "Jan 4 20:13:44", "dyn" : { "classifier" : { "class" : "system" } }, "host" : "localhost", "log" : { "facility" : "authpriv", "level" : "info" }, "message" : "pam_unix(sshd:session): session closed for user algernon" }
Simply awesome. Thanks to everyone who contributed ideas and nudged me into the right direction! -- |8]
Wow, this is amazing! Now the following is possible: a web interface drives patterndb XML file creation, the created file is automatically loaded by syslog-ng and the new patterns are implemented in parsing, and new key/value pairs are automatically logged correctly into mongo.

So, a fully dynamic parsing solution now exists with a database backend which requires no destination configuration changes. Even the column indexes are dynamic so that new keys are automatically indexed. Not too shabby!

On Tue, Jan 4, 2011 at 1:20 PM, Gergely Nagy <algernon@balabit.hu> wrote:
On Tue, 2011-01-04 at 13:39 +0100, Gergely Nagy wrote:
#1: Insert the root document, update with dynamic values
This will hit my branch sometime tonight, at which point I'll redo the benchmark tests.
Done! And implemented in such a way that the static keys which one can specify in the keys() option can also contain dots, and they'll be handled properly (ie, turned into neat sub-documents).
Thus, with a block like this:
destination d_mongodb { mongodb( dynamic_values("dyn") keys("date", "host", "log.facility", "log.level", "program.name", "program.pid", "message") values("$DATE", "$HOST", "$FACILITY", "$LEVEL", "$PROGRAM", "$PID", "$MSGONLY") ); };
We can end up with a log entry like this:
{ "_id" : ObjectId("4d2370879d864e560000000a"), "date" : "Jan 4 20:09:59", "dyn" : { "classifier" : { "class" : "system", "rule_id" : "4dd5a329-da83-4876-a431-ddcb59c2858c" }, "secevt" : { "verdict" : "ACCEPT" }, "usracct" : { "application" : "sshd", "authmethod" : "publickey for algernon from 127.0.0.1 port 33659 ssh2", "device" : "127.0.0.1 port 33659 ssh2", "service" : "ssh2", "sessionid" : "10424", "type" : "login", "username" : "algernon from 127.0.0.1 port 33659 ssh2" } }, "host" : "localhost", "log" : { "facility" : "auth", "level" : "info" }, "message" : "Accepted publickey for algernon from 127.0.0.1 port 33659 ssh2", "program" : { "name" : "sshd", "pid" : "10424" } }
Beautiful, isn't it? (And yes, my patterndb rules are still horrid; I'll fix them before I run the benchmarks)
And to show you the queries:
db.logs.find().count() 4 db.logs.find({"dyn.usracct.application": "sshd"}, {date: 1, host: 1, log: 1, "dyn.classifier.class": 1, message: 1, "dyn.secevt": 1}) { "_id" : ObjectId("4d2370879d864e560000000a"), "date" : "Jan 4 20:09:59", "dyn" : { "classifier" : { "class" : "system" }, "secevt" : { "verdict" : "ACCEPT" } }, "host" : "localhost", "log" : { "facility" : "auth", "level" : "info" }, "message" : "Accepted publickey for algernon from 127.0.0.1 port 33659 ssh2" } { "_id" : ObjectId("4d2371689d864e560000000d"), "date" : "Jan 4 20:13:44", "dyn" : { "classifier" : { "class" : "system" } }, "host" : "localhost", "log" : { "facility" : "authpriv", "level" : "info" }, "message" : "pam_unix(sshd:session): session closed for user algernon" }
Simply awesome. Thanks to everyone who contributed ideas and nudged me into the right direction!
-- |8]
On Tue, 2011-01-04 at 20:20 +0100, Gergely Nagy wrote:
On Tue, 2011-01-04 at 13:39 +0100, Gergely Nagy wrote:
#1: Insert the root document, update with dynamic values
This will hit my branch sometime tonight, at which point I'll redo the benchmark tests.
Done! And implemented in such a way that the static keys which one can specify in the keys() option can also contain dots, and they'll be handled properly (ie, turned into neat sub-documents).
destination d_mongodb { mongodb( dynamic_values("dyn") keys("date", "host", "log.facility", "log.level", "program.name", "program.pid", "message") values("$DATE", "$HOST", "$FACILITY", "$LEVEL", "$PROGRAM", "$PID", "$MSGONLY") ); };
Using this block, a completely non-scientific test:

* Inserting sshd login messages:
  + non-capped, non-indexed collection: 12k msg/sec
  + capped (10Gb, 1k msgs), non-indexed collection: 3k msg/sec
  + capped (1Mb, 100msg), non-indexed collection: 10k msg/sec
* Inserting loggen generated messages:
  + non-capped, non-indexed collection: 12.2k msg/sec
  + capped (10Gb, 1k msgs), non-indexed collection: 3.1k msg/sec
  + capped (1Mb, 100msg), non-indexed collection: 10.1k msg/sec
  + capped (10k, 100msgs), non-indexed collection: 15k msg/sec (this one is terribly useless, but included for the sake of it)

They're within a margin of error, which means that dynamic values do not add a significant overhead by the looks of it. The numbers are pretty much the same as when I tested without dyn. values a few days ago.

Mind you, these benchmarks are completely non-scientific, done on my desktop, running a ton of other things at the same time.

Do note how the cap sizes affect performance: you gotta choose the appropriate one, if you're going for capping, otherwise performance plummets.

My 100 loggen messages used up roughly 42k space, so setting the cap at 50k/100msgs would yield the best results, I suppose. But, that's mongodb tuning, which - thankfully - is none of my business :D

-- |8]
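For completeness, a capped collection of the kind used above can be created along these lines from the mongo shell (a sketch; size is in bytes, and the numbers are simply the ones mentioned above):

// roughly the 1Mb / 100 message cap from the test
db.createCollection("logs", { capped : true, size : 1048576, max : 100 });

// the ~50k / 100 message variant suggested above
db.createCollection("logs", { capped : true, size : 51200, max : 100 });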
Cool. The code must indeed be pretty low-overhead. I suspect the main difference in insert rates is probably that one amount fits in RAM while larger message counts do not.

On Tue, Jan 4, 2011 at 5:05 PM, Gergely Nagy <algernon@balabit.hu> wrote:
On Tue, 2011-01-04 at 20:20 +0100, Gergely Nagy wrote:
On Tue, 2011-01-04 at 13:39 +0100, Gergely Nagy wrote:
#1: Insert the root document, update with dynamic values
This will hit my branch sometime tonight, at which point I'll redo the benchmark tests.
Done! And implemented in such a way that the static keys which one can specify in the keys() option can also contain dots, and they'll be handled properly (ie, turned into neat sub-documents).
destination d_mongodb { mongodb( dynamic_values("dyn") keys("date", "host", "log.facility", "log.level", "program.name", "program.pid", "message") values("$DATE", "$HOST", "$FACILITY", "$LEVEL", "$PROGRAM", "$PID", "$MSGONLY") ); };
Using this block, a completely non-scientific test:
* Inserting sshd login messages: + non-capped, non-indexed collection: 12k msg/sec + capped (10Gb, 1k msgs), non-indexed collection: 3k msg/sec + capped (1Mb, 100msg), non-indexed collection: 10k msg/sec * Inserting loggen generated messages: + non-capped, non-indexed collection: 12.2k msg/sec + capped (10Gb, 1k msgs), non-indexed collection: 3.1k msg/sec + capped (1Mb, 100msg), non-indexed collection: 10.1k msg/sec + capped (10k, 100msgs), non-indexed collection: 15k msg/sec (this one is terribly useless, but included for the sake of it)
They're within a margin of error, which means that dynamic values do not add a significant overhead by the looks of it. The numbers are pretty much the same as when I tested without dyn. values a few days ago.
Mind you, these benchmarks are completely non-scientific, done on my desktop, running a ton of other things at the same time.
Do note how the cap sizes affect performance: you gotta choose the appropriate one, if you're going for capping, otherwise performance plummets.
My 100 loggen messages used up roughly 42k space, so setting the cap at 50k/100msgs would yield the best results, I suppose. But, that's mongodb tuning, which - thankfully - is none of my business :D
-- |8]
On Tue, 2011-01-04 at 13:39 +0100, Gergely Nagy wrote:
#1: Insert the root document, update with dynamic values
We would insert the root document first, up to and including the patterndb: {} sub document. Then we'd iterate over the keys, and use mongodb's update method to add the rest of the stuff:
db.logs.update({_id: <id>}, {$set: {"patterndb.classifier.class": "system"}})
This has the upside of being almost trivial to implement, but has three notable flaws: it will result in more network traffic, and inserting a log message will not be atomic, since the dynamic values are added one at a time. It also has a good chance of fragmenting the database (though, mongodb is said to be clever enough to leave some padding space for objects to grow, which might save us in this case).
It is also possible to do bulk updates, like this:
db.logs.update({_id: <id>}, {$set: {"patterndb.classifier.class": "system", "patterndb.classifier.rule_id" : "4dd5a329-da83-4876-a431-ddcb59c2858c"}, "patterndb.secevt.verdict": "ACCEPT"} })
With this, we can reduce the whole operation to two steps: inserting the first, static content, then the dynamic values. However, all of the mentioned flaws remain even with this, they're just not as serious as if we'd insert one by one.
Good news: we can use upserts and get rid of all the flaws:
db.logs.update({_id: <id>}, {$set: {message: "some message", <rest of the static keys>, "patterndb.classifier.class": "system", "patterndb.classifier.rule_id": "0xdeadbeef", "patterndb.secevt.verdict": "ACCEPT"} }, true)
We just have to pre-generate the ID, which is luckily easy, as the mongodb driver has a function to do just that. In return, we get an atomic insert, only one message towards the mongodb server, and no fragmentation.
And it's dead easy to add this to my mongodb destination, since the dynamic values are already dot-separated, just the way we want them (I only have to strip the leading dots).
This will hit my branch sometime tonight, at which point I'll redo the benchmark tests.
It is not just patterndb that can generate dynamic values in a log message, so I'd prefer it to be put on the same level as all the other values.

I understand that the user would need some means to select which nv pairs need to be added to the document, and also with this operation she needs a means to select a whole set, not just single values, which syslog-ng doesn't have right now. This would be useful for both mongodb, SQL and probably some other formats too.

Any ideas?

-- Bazsi
It is not just patterndb that can generate dynamic values in a log message, so I'd prefer it to be put in the same level as all the other values.
I understand that the user would need some means to select which nv pairs need to be added to the document and also with this operation she also needs a means to select a whole set, not just single values, which syslog-ng doesn't have right now. This would be useful for both mongodb, SQL and probably some other formats too.
Any ideas?
This was covered in one of your earlier mails as well, but since then, I had a few more ideas, so I'll reiterate (mind you, my knowledge of nvtables and what they're used for is lacking):

One can already easily 'filter' the standard stuff with templates (as there's not many of them, and the set is known and finite - thus listing all the ones one wants is an acceptable option).

What is missing is a way to globally restrict what nvtable pairs the various driver instances see.

The most flexible solution - in my opinion - would be to have an iterator function in nvtables that can filter keys, based on various user-settable criteria.

Then, one could have a config like the one you mentioned earlier:

filter-keys('.snmp.*' ltrim('.snmp.') prefix('foo.'));

This would get parsed into an appropriate structure, and would get passed down to the filtering function.

The various drivers - mongodb, SQL and whatever else where it's appropriate - could then use this filtering mechanism.

The filtering itself could be implemented with either fnmatch(), which is reasonably fast I believe, or some other way. It'd even be possible to add flags later on, so one can choose between shell glob based filtering, regexes, or whatever else we come up with in the future.

The hardest part, in my opinion, is to figure out how to parse and store what actions one needs to take on the keys.

As a first step, I'd suggest doing only filtering, and adding the rest of the features gradually.

I'm fairly certain I can come up with a prototype by Monday (provided nothing unexpected happens during the mongodb driver porting - but that is almost complete by now).

-- |8]
On Fri, 2011-01-14 at 14:40 +0100, Gergely Nagy wrote:
It is not just patterndb that can generate dynamic values in a log message, so I'd prefer it to be put in the same level as all the other values.
I understand that the user would need some means to select which nv pairs need to be added to the document and also with this operation she also needs a means to select a whole set, not just single values, which syslog-ng doesn't have right now. This would be useful for both mongodb, SQL and probably some other formats too.
Any ideas?
This was covered in one of your earlier mails aswell, but since then, I had a few more ideas, so I'll reiterate (mind you, my knowledge of nvtables and what it is used for is lacking):
One can already easily 'filter' the standard stuff with templates (as there's not many of them, and the set is known and finite - thus listing all the ones one wants is an acceptable option).
What is missing, is a way to globally restrict what nvtable pairs the various driver instances see.
The most flexible solution - in my opinion - would be to have an iterator function in nvtables that can filter keys, based on various user-settable criteria.
Then, one could have a config like the one you mentioned earlier:
filter-keys('.snmp.*' ltrim ('.snmp.') prefix('foo.'));
This would get parsed into an appropriate structure, and would get passed down to the filtering function.
The various drivers - mongodb, SQL and whatever else where it's appropriate - could then use this filtering mechanism.
The filtering itself could be implemented with either fnmatch(), which is reasonably fast I believe, or whatever other way. It'd even be possible to add flags later on, so one can choose between shell glob based filtering, regexes, or whatever else we come up with in the future.
The hardest part in my opinion, is to figure out how to parse and store what actions one needs to take on the keys.
As a first step, I'd suggest doing only filtering, and adding the rest of the features gradually.
I'm fairly certain I can come up with a prototype by monday (provided nothing unexpected happens during the mongodb driver porting - but that is almost complete by now).
I might give up on the ability to exchange prefixes of name-value pairs when expanding them into mongodb documents. Without that requirement, this feature could be fairly simple, and would be quite easy to adapt to SQL and welf and probably many other things we come up with in the future.

What this boils down to is that in addition to providing the current keys/values options in mongodb (and the similar columns/values in SQL), we could have a combination of the two:

destination d_mongo {
  mongodb(...
    value-pairs(('host', '$HOST'), 'PROGRAM', '*')
  );
};

There are 3 forms of pairs supported:

* (name, value): traditional syslog-ng templating, name specifies the key, value is a syslog-ng template (containing macros)
* name: name is both the name of the key and the name of the nv-pair in syslog-ng; in essence, equivalent to ('name', '${name}') described in the first syntax
* glob: in this case the result is all the name-value pairs matched by the glob string, the name of the key is the same as the nvpair in syslog-ng, e.g. it produces a series of ('name', '${name}') pairs matching the specified glob.

I don't expect regexps would be needed in this case. NVPairs are supposed to have structured names after all.

This would be somewhat similar to my $(format-welf) template function idea described earlier on this mailing list. Although that too provided exchange-prefix functionality (and assuming we come to a feasible model that gets adapted to mongodb and/or sql, I'd probably change the idea of format-welf too, that's especially easy as it isn't implemented yet :) ).

-- Bazsi
I might give up on the ability to exchange prefixes of name-value pairs when expanding them into mongodb documents.
If implemented right, changing prefixes (or any other part of the key names) isn't particularly hard. The hard part is designing good syntax for that in the config file, and storing that information. I mean, once a filtering function can iterate over nvtable, and has access to a set of rules describing what to do with each key, we're pretty much done. Deciding what the rules can be, and designing the syntax for them, that's the tougher cookie, as far as I'm concerned.
Without that requirement, this feature could be fairly simple, and would be quite easy to adapt to SQL and welf and probably many other things we come up with in the future.
Aye.
What this boils down to, is that in addition to providing the current keys/values options in mongodb (and the similar columns/values in SQL), we could have a combination of the two:
destination d_mongo { mongodb(... value-pairs(('host', '$HOST'), 'PROGRAM', '*') ); };
There are 3 forms of pairs supported:

* (name, value): traditional syslog-ng templating; name specifies the key, value is a syslog-ng template (containing macros)
* name: name is both the name of the key and the name of the nv-pair in syslog-ng; in essence, equivalent to ('name', '${name}') described in the first syntax
* glob: in this case the result is all the name-value pairs matched by the glob string, and the name of the key is the same as the nv-pair name in syslog-ng, i.e. it produces a series of ('name', '${name}') pairs matching the specified glob
Heh, that's a very elegant way. I'll get right on that during the weekend, unless someone beats me to it. -- |8]
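For illustration, the glob form could be expanded into ('name', '${name}') pairs with something as simple as a GLib glob match. The sketch below is purely illustrative: vp_expand_glob(), vp_add_pair() and the list of available names are placeholders, not actual syslog-ng code.

/* Rough sketch of expanding the glob form of value-pairs(); everything
 * except g_pattern_match_simple() is a placeholder for illustration. */
#include <glib.h>

static void
vp_expand_glob(const gchar *glob, gchar **available_names, guint name_count,
               void (*vp_add_pair)(const gchar *key, const gchar *value_template))
{
  guint i;

  for (i = 0; i < name_count; i++)
    {
      if (g_pattern_match_simple(glob, available_names[i]))
        {
          /* the key is the nv-pair name itself, the value is '${name}' */
          gchar *value_template = g_strdup_printf("${%s}", available_names[i]);

          vp_add_pair(available_names[i], value_template);
          g_free(value_template);
        }
    }
}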
On Sat, 2011-01-01 at 14:24 -0600, Martin Holste wrote:
Super cool! At those rates, I think few will benefit from bulk inserts, so I'd put that low on the feature priority list, especially since the added complexity is an opportunity to create bugs. My main feature to add (aside from the two you mentioned already on the roadmap) would be a way to use the keys from a patterndb database, so that the db and collection in Mongo stay the same, but the key names change with every patterndb rule. That's really the big payoff with Mongo - you don't have to define a rigid schema, so you don't have to know the column names ahead of time. That's a big deal considering that the patterndb can change on the fly. Being confined to predefined templates in the config limits the potential. Bazsi, any idea how to do this?
sorry for not answering any sooner, I was skimming through these emails, but never had the time to actually think about this stuff.

we would definitely need a way to query the contents of a message in a structured way. e.g. if a message is a set of name-value pairs, it'd be nice to select a subset of those NV pairs in a single operation, in order to put them to a structured output format.

for instance with either mongodb or sql, it'd make sense to put all name-value pairs starting with a given prefix to the output in a single operation. for example:

mongodb(nv-pairs(".snmp.*"))

Which would select a set of nv pairs from the message and put them in keys. A kind of name-transformation would be useful too:

mongodb(nv-pairs(".snmp.*" ltrim('.snmp.') prefix('foo.')))

Which would result in all NV pairs with a name beginning with .snmp. becoming foo.-prefixed.

the same could be applied when formatting WELF logs, perhaps would also be useful in rewrite rules.

hmm.. maybe I should refresh my XSLT memories to see how this looks in XPath/XQuery.

-- Bazsi
On Fri, 2011-01-14 at 12:56 +0100, Balazs Scheidler wrote:
On Sat, 2011-01-01 at 14:24 -0600, Martin Holste wrote:
Super cool! At those rates, I think few will benefit from bulk inserts, so I'd put that low on the feature priority list, especially since the added complexity is an opportunity to create bugs. My main feature to add (aside from the two you mentioned already on the roadmap) would be a way to use the keys from a patterndb database, so that the db and collection in Mongo stay the same, but the key names change with every patterndb rule. That's really the big payoff with Mongo - you don't have to define a rigid schema, so you don't have to know the column names ahead of time. That's a big deal considering that the patterndb can change on the fly. Being confined to predefined templates in the config limits the potential. Bazsi, any idea how to do this?
sorry for not answering any sooner, I was skimming through these emails, but never had the time to actually think about this stuff.
we would definitely need a way to query the contents of a message in a structured way.
e.g. if a message is a set of name-value pairs, it'd be nice to select a subset of those NV pairs in a single operation, in order to put them to a structured output format.
for instance with either mongodb or sql, it'd make sense to put all name-value pairs starting with a given prefix to the output in a single operation.
for example:
mongodb(nv-pairs(".snmp.*"))
Which would select a set of nv pairs from the message and put them in keys. A kind of name-transformation would be useful too:
mongodb(nv-pairs(".snmp.*" ltrim('.snmp.') prefix('foo.')))
Which would result in all NV pairs with a name beginning with .snmp. becoming foo.-prefixed.
Ugh, that'd be a bit tricky to implement, but if shell glob-like syntax is an acceptable solution, I can make that work with relative ease - at least the selection part. Rewriting the keys would be another thing, something much trickier.

However, both of these should - in my opinion - be implemented outside of the mongodb driver, so other destinations can use them in the future (if someone comes up with a driver for, say, couchdb, that could use this too; I can even imagine the sql destination using this...). A general filtering solution for nvtables would probably do the trick.

The pattern matching can be done by, say, fnmatch(), but a better option would be to allow filtering nvtable keys by an arbitrary function - I'm not familiar with nvtables however, so I will take a closer look as soon as possible. That same function could be used to rewrite keys... Let's say we have a function like this:

gchar *nvtables_filter_cb (const gchar *key, gpointer user_data);

We could then implement a filter that filters out (returns NULL) anything that doesn't match .snmp.*, and if a key does match, it strips .snmp. and prefixes the rest with 'foo.'. Wouldn't even be too hard to accomplish, I believe.

But alas, this is in my opinion outside of the scope of the mongodb driver, and would be an independent feature of syslog-ng. One that the driver would certainly benefit from. I'll happily sit down and code a few ideas up, once I'm comfortable with the mongodb driver itself (adding extra features once the driver is stable would be my preferred course of action :).

-- |8]

PS: Funny how I get from "tricky to implement" to "hey, I'll code it up in a weekend" within 2 minutes. I love how syslog-ng makes this possible =)
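A rough sketch of the callback idea above, assuming the hypothetical contract that returning NULL drops the key and returning a newly allocated string renames it (none of this is existing nvtable API):

/* Sketch of the filter callback idea: drop keys not matching .snmp.*,
 * and rewrite matching ones from ".snmp.foo" to "foo.foo". The callback
 * contract is assumed for illustration, not existing nvtable API. */
#include <fnmatch.h>
#include <string.h>
#include <glib.h>

gchar *
snmp_to_foo_filter_cb(const gchar *key, gpointer user_data)
{
  if (fnmatch(".snmp.*", key, 0) != 0)
    return NULL;                               /* filtered out */

  /* drop the ".snmp." prefix and replace it with "foo." */
  return g_strconcat("foo.", key + strlen(".snmp."), NULL);
}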
A little update on the state of the driver: last night, I arrived at a state where I consider it good enough for my own purposes (I'm already using it in production), and today I did some benchmarking (completely unscientific, mind you) to see if and where I can improve the driver.

A standard setup, logging to a file, resulted in 24k messages/sec; we'll use that for comparison. Logging the same data to a capped (at 1000 messages) mongodb collection netted 18k messages/sec, while logging to an uncapped and unindexed mongodb collection is around 13k messages/sec. All tests were run on the same computer, using the same loggen command line; the only change is the destination in the syslog-ng config. Each test ran for 10 minutes.

The numbers could probably be upped with suitable configuration and a more appropriate test environment, but I'm not really into that stuff; the current performance fits my needs perfectly well. I haven't tested an SQL destination, but my gut feeling is that mongodb's a lot faster already. And there are obviously a lot of cases I haven't tested: query speed while writes are flowing in, how indexing affects it all, and so on, since those scenarios are either not part of my use case, or I don't feel knowledgeable enough to draw the proper conclusions. I'll let someone else do proper benchmarking, I'll stick to coding :)

Now, the next thing I explored is whether I can speed things up easily: for this reason, I had a look at callgrind's output, and concluded that most of the CPU time is spent outside of the mongodb driver; speeding up the driver would be possible, but it'd need some nasty tricks I'm not too keen on implementing. For the record, most of the time was spent in template resolution (resolving the collection name and the values to log) - there's not much I can do to speed those up.

Another way to speed things up, especially when network speed starts to matter, would be to push the syslog-ng<->mongodb communication into a writer thread, much like the SQL driver is doing. I attempted to do that, but ran into a few blocking problems: the original idea was to collect a set amount of log messages and insert them in bulk. MongoDB has support for this, so that part is trivial. The problem is that even with bulk insert, I can only insert into a single collection at a time. Since the collection name can contain macros, in order to do bulk inserts, I'd have to store the queue on a per-collection basis, and that would make things trickier, and more than likely would negate all the benefits of inserting in bulk. It would also be an option to disable macro support in collection(), but that has quite a few negative consequences, and I really like this functionality anyway.

However, splitting the writing out to a thread, but skipping the bulk insert part, is still preferable, due to mongo_insert() being a blocking call. Implementing this is my current plan for today.

So, to sum it up, the current state is like this:

* The driver works reasonably well, at an - in my opinion - good speed
* It handles error cases reasonably gracefully: it detects network errors, and will try to reconnect after time_reopen seconds. In the meantime, messages are dropped, though.
* Supports authentication
* The key-value pairs to log can be configured (and the values can contain macros, obviously)
* The collection name can contain macros as well
* Empty values are not stored in the database

My TODO list for the driver at the moment is something along these lines:

* Better error handling, preferably with no message dropping
* mongodb communication moved to a separate thread

I have set up a small project page with some rudimentary documentation at http://asylum.madhouse-project.org/projects/syslog-ng/mongodb/ if anyone's interested in trying out the driver.
Another little update: I ported the mongodb destination driver from using the mongodb C driver to the C++ driver, for a few reasons:

* The C driver had to be bundled with the source, which I dislike with a passion.
* The C++ driver is much more mature, and a lot more tested as well.

At the moment, there's a small bridge between the C and the C++ code, neatly separated into two little files. Functionality remained the same, stability hopefully improved, and there's less code to maintain within the syslog-ng driver.

It's available on the algernon/dest/mongodb-cpp branch in my repository - I haven't merged it onto the main algernon/dest/mongodb branch just yet, there are a few little things I want to iron out first. Not to mention that I'm not really sure about introducing a (partly) C++ module to syslog-ng (even if it's optional, and not compiled by default).
It's been a while since I spam^Wnotified the list with mongodb updates, so the recent news from the past few days is as follows:

* The syslog-ng 3.2 version is abandoned, obsolete.
* The driver has been ported to 3.3, and the database writer was split out into a separate thread.
* The underlying mongodb connector was replaced: I threw out the former, and implemented my own (for various reasons, which are too numerous to list here). The new mongodb connector will be maintained as a separate project, but a version of it will be embedded into syslog-ng for convenience's sake. The canonical repo for it is over at github: https://github.com/algernon/libmongo-client

While this might not seem much, the above changes brought a few good side effects:

* If we lose the connection to MongoDB, for one reason or the other, the driver does not drop messages immediately anymore: we have a queue (configurable via log_fifo_size()).
* Database writes are in a separate thread, thus network latency does not affect the speed at which the driver can accept messages.
* Due to the new underlying connector, we handle error cases a lot better. As in, we don't throw up and abort(). Error handling still has to improve, but the driver's in a much better state.
* There's support for $SEQNUM in templates now. I can't really imagine a situation where I could use it, but I needed it for ObjectID generation anyway, so why not expose it to the templates as well?

However, there's one downside too: we lost authentication support (temporarily, until I re-implement that in the connector).

The speed is about the same as 3.2's; threading didn't improve our performance (well, it did, if we consider network latency, but my tests are local) - I didn't expect that to improve to begin with. I'm actually surprised that despite changing the connector library, and porting over to threads, the driver still maintained its speed.

My project page should be (mostly) up to date, but for the adventurous:

git clone -b modules/afmongodb git://git.madhouse-project.org/syslog-ng/syslog-ng-3.3.git

The modules/afmongodb branch will hopefully be quiet during the weekend.

-- |8]
Hi, I just wanted to let you know that I can pull your patches any time, but I'd like an explicit "pull" request with proper repo/branch information to tell me that it is in a state which can be pulled.

Also, before pulling I'd like to ask you to:

* fold related changes into a single patch in its final form (e.g. I wouldn't want to add the old mongo-c-client based implementation if possible),
* please split off the patch into layers (e.g. one for the mongo client lib, and one for the mongodb destination; it probably doesn't make sense to separate the syslog-ng plugin glue code, but that could come 3rd),
* please remember to sign off each individual patch.

And now some review comments (I did the review on your HEAD, i.e. not doing a patch-by-patch review).

* some of my comments would apply to the SQL driver as well; it is somewhat in need of cleanup, as my focus during 3.3 development was more on the traditional input/output plugins. These are not release critical (e.g. the code could still be pulled); I'm marking the comments below if this is the case, and in case a better solution is found for MongoDB, a similar approach should be taken in SQL as well.

* in syslog-ng 3.3, LogQueue is taking care of its own locking, so no need to separately lock it, at least for _push/pop operations. I see that you are using an SQL-like wakeup mechanism, which requires you to know the length of the queue atomically, which still requires this external lock. A better solution would probably be to use the log_queue_check_items() API and standard ivykis threads for the writer [applies to SQL, not RC]

* when naming/defining mutexes, please try to name it according to the data it protects, and not code (writer_thread_mutex should probably be named something else. In SQL this is using a similar name, but that has historic roots). [applies to SQL, not RC]

* afmongodb_dd_free() doesn't free keys/values [RC]

* start/stop_thread() creates/frees mutexes/condvars, is there a reason not to do this once in dd_new() and free them in free() ? [RC]

* is there a reason mongodb_dd_connect() is protected by a mutex? it is only protected from one code path, but it is only called from the writer thread, so it should be single threaded anyway. [RC]

* worker_insert():
  * it queries the time for each invocation. time(NULL) is dreadfully slow and we're trying to phase it out, especially in slow paths. cached_g_current_time_sec() should be used instead. [RC]
  * [oid generation] srand() should not be called here, if you absolutely must rely on srand/rand, then srand() should be called once at startup. maybe it would be better to use RAND_bytes(), although that would pull in a libcrypto as a requirement for mongodb as well (non-RC if well hidden behind a function)
  * [value, coll variables] if at all possible, please reuse GString based buffers in @self, especially if you don't need to protect them via a mutex. allocation heavy code can really affect performance (and although with SQL the numbers are quite low, the numbers you mentioned with mongodb, this may not be the case). This may apply to bson_new_sized() too (e.g. reuse the bson object header after a call similar to g_string_truncate(0)) (non-RC, but preferable)
  * [formatting collection name] g_string_prepend() is slow, add the literal to the beginning, and then use log_template_append_format() to append to the end (no need to move strings around). [RC]

* [av_foreach_buffer] please don't define a struct for this as its name is just not covering what it does, I usually use an array of pointers in this case. [non-RC]

* [dynamic values] the discussion about globbing on the name-value pairs should probably be closed before the autovalues stuff can go in. [RC]

* bson stuff: looks nice. you mentioned you had some unit tests to cover this, can you add that to modules/afmongodb/tests ?

* mongo-client.c:
  * perhaps a mongo state struct wouldn't hurt even if currently it would be an fd there.
  * mongo_packet_send(): perhaps you could use writev() to combine the header/data parts, instead of sending them in two chunks and two syscall roundtrips

* mongo-wire.c:
  * struct _mongo_packet, please don't use #pragma there, just add a comment that new fields should be added with care. You perfectly filled all padding bytes with fields anyway.
  * perhaps using GByteArray here is an overkill, you size the array by calculating the required space anyway, this way you could hang the data array right at the end of struct mongodb_packet. [non-RC]
  * wouldn't there be a way to avoid copying bson structs around? e.g. you currently [non-RC, just food for thought]:
    * build 3 BSON objects (selector, update, set), possibly containing further nested ones because of subdocuments,
    * then build a mongodb_packet which again copies the whole structure, essentially building everything and then copying everything again.
    * perhaps it'd be possible to reference the bson objects directly from the mongo_packet where they should be embedded? and then mongo_packet_send() could use writev() to send out the whole thing, without having to copy them to a linear buffer?

Even though this message is lengthy, I really like the code here, and would put it into my tree immediately. But I'll let you do the stuff still open, and take the fame for your efforts :)
-- Bazsi
Also, before pulling I'd like to ask you to:

* fold related changes into a single patch in its final form (e.g. I wouldn't want to add the old mongo-c-client based implementation if possible),
* please split off the patch into layers (e.g. one for the mongo client lib, and one for the mongodb destination; it probably doesn't make sense to separate the syslog-ng plugin glue code, but that could come 3rd),
* please remember to sign off each individual patch.
A'ight! Once I've fixed the problems below, I'll prepare the pull branch accordingly. One note, though: the client lib is 'embedded' at the moment by copying the source files over into the modules/afmongodb/ directory. It's not the best, and I'll see if I can do better (more about this down below). In its current form, separating the destination driver from the client lib would be tricky at best.
* in syslog-ng 3.3, LogQueue is taking care of its own locking, so no need to separately lock it, at least for _push/pop operations. I see that you are using an SQL-like wakeup mechanism, which requires you to know the length of the queue atomically, which still requires this external lock. A better solution would probably be to use the log_queue_check_items() API and standard ivykis threads for the writer [applies to SQL, not RC]
Noted, I'll have a closer look at ivykis.
* when naming/defining mutexes, please try to name it according to the data it protects, and not code (writer_thread_mutex should probably be named something else. In SQL this is using a similar name, but that has historic roots). [applies to SQL, not RC]
Ok.
* afmongodb_dd_free() doesn't free keys/values [RC]
Off the top of my head, I don't see which key/values should be free'd there: self->fields is cleaned up as far as I see.
* start/stop_thread() creates/frees mutexes/condvars, is there a reason not to do this once in dd_new() and free them in free() ? [RC]
No reason that I can think of, will move them over.
* is there a reason mongodb_dd_connect() is protected by a mutex? it is only protected from one code path, but it is only called from the writer thread, so it should be single threaded anyway. [RC]
Good catch, will fix, thanks!
* worker_insert(): * it queries the time for each invocation. time(NULL) is dreadfully slow and we're trying to phase it out, especially in slow paths. cached_g_current_time_sec() should be used instead. [RC]
Partially fixed: the OID generation was moved to the mongo client library. Though, that still calls time() for now. I'll add a way to supply our own time, and then the mongodb destination can pass in the cached time.
* [oid generation] srand() should not be called here, if you absolutely must rely on srand/rand, then srand() should be called once at startup. maybe it would be better to use RAND_bytes(), although that would pull in a libcrypto as a requirement for mongodb as well (non-RC if well hidden behind a function)
This is partially fixed too, as OID generation was moved to the mongo client (I just haven't pushed those changes yet, wanted to keep my HEAD stable for a few days). The rand() is used to generate the machine ID... I think I have an acceptable way to solve that. As for the libcrypto dependency: I will most likely end up having to depend on it, as mongodb authentication relies on MD5, and I don't feel like embedding a random md5 lib, so I will probably turn to libcrypto anyway. And by doing so, I get RAND_bytes() for free, too.
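For reference, the RAND_bytes() route could be as small as the sketch below; this is only an illustration of the idea, not the libmongo-client code (the function name is made up):

/* Sketch only: filling the 3-byte ObjectID machine id from libcrypto's
 * RAND_bytes() instead of rand(); not the actual libmongo-client code. */
#include <openssl/rand.h>
#include <glib.h>

static gboolean
oid_init_machine_id(guint8 machine_id[3])
{
  /* RAND_bytes() returns 1 on success */
  return RAND_bytes(machine_id, 3) == 1;
}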
* [value, coll variables] if at all possible, please reuse GString based buffers in @self, especially if you don't need to protect them via a mutex. allocation heavy code can really affect performance (and although with SQL the numbers are quite low, the numbers you mentioned with mongodb, this may not be the case). This may apply to bson_new_sized() too (e.g. reuse the bson object header after a call similar to g_string_truncate(0)) (non-RC, but preferable)
Noted, will look into it.
* [formatting collection name] g_string_prepend() is slow, add the literal to the beginning, and then use log_template_append_format() to append to the end (no need to move strings around). [RC]
D'oh. Serves me right for not checking the headers: I didn't know about log_template_append_format().
* [av_foreach_buffer] please don't define a struct for this as its name is just not covering what it does, I usually use an array of pointers in this case. [non-RC]
Noted, will fix.
* [dynamic values] the discussion about globbing on the name-value pairs should probably be closed before the autovalues stuff can go in. [RC]
Agreed. Would you prefer a pull request without the autovalues first, and another one after the value-pairs() stuff is implemented, or one pull request after value-pairs()?
* bson stuff: looks nice. you mentioned you had some unit tests to cover this, can you add that to modules/afmongodb/tests ?
In the long run, I'd like to figure out a way to properly embed my mongo-client library. The current way of copying over the sources is a bit... lame. I'm not quite sure about what the best way would be, though. I would like to embed the client lib along with the test cases & whatnot, but copying over isn't going to work then in the long run.

Anyway: technically, yes, I could add it, wouldn't even be hard. But I'd prefer finding an embedding solution that's easier to handle.

An option would be to use git submodules, and teach the syslog-ng build system to recurse into the submodule directory if it exists, and skip the mongodb driver if it doesn't: this would mean that git checkouts continue to work, even if one doesn't update the submodules, but also, once the client lib is checked out, make dist and similar will pick it up.

However, the above idea might be a bit too intrusive. A compromise might be to copy my whole lib to lib/mongo-client/ for example, along with test cases.
* mongo-client.c: * perhaps a mongo state struct wouldn't hurt even if currently it would be an fd there.
Partially fixed: it's on my TODO list for the client lib for today.
* mongo_packet_send(): perhaps you could use writev() to combine the header/data parts, instead of sending them in two chunks and two syscall roundtrips
Was on my TODO list, it climbed a bit higher.
* mongo-wire.c: * struct _mongo_packet, please don't use #pragma there, just add a comment that new fields should be added with care. You perfectly filled all padding bytes with fields anyway.
Noted, will fix.
* perhaps using GByteArray here is an overkill, you size the array by calculating the required space anyway, this way you could hang the data array right at the end of struct mongodb_packet. [non-RC]
Noted, will fix.
* wouldn't there be a way to avoid copying bson structs around? e.g. you currently [non-RC, just food for thought]:
  * build 3 BSON objects (selector, update, set), possibly containing further nested ones because of subdocuments,
  * then build a mongodb_packet which again copies the whole structure, essentially building everything and then copying everything again.
  * perhaps it'd be possible to reference the bson objects directly from the mongo_packet where they should be embedded? and then mongo_packet_send() could use writev() to send out the whole thing, without having to copy them to a linear buffer?
This is another thing on my TODO list, just at a lower priority: the first priority was to get the lib working, optimisation came lagging behind (and I'm not quite there yet).
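To illustrate the food-for-thought part: with scatter-gather I/O, the header and the already-built BSON buffers could be handed to the kernel in one writev() call, without first copying them into a linear buffer. The sketch below is illustrative only (the parameter layout is not the real libmongo-client API), and a real implementation would also have to loop on short writes:

/* Food-for-thought sketch of the writev() idea: send the packet header
 * and an already-built BSON buffer in one syscall, without copying them
 * into a linear buffer first. Names are illustrative, not real API. */
#include <sys/types.h>
#include <sys/uio.h>

static ssize_t
send_insert_packet(int fd,
                   const void *header, size_t header_len,
                   const void *bson_data, size_t bson_len)
{
  struct iovec iov[2];

  iov[0].iov_base = (void *) header;
  iov[0].iov_len = header_len;
  iov[1].iov_base = (void *) bson_data;
  iov[1].iov_len = bson_len;

  /* NOTE: writev() may write less than requested; a real implementation
   * would loop until everything is sent. */
  return writev(fd, iov, 2);
}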
Even though this message is lengthy, I really like the code here, and would put it into my tree immediately. But I'll let you do the stuff still open, and take the fame for your efforts :)
\o/ -- |8]
Most of the mentioned issues are fixed in git, one way or the other (details below). However, I'm seeing some weird problems from time to time... Namely, when I shut down the mongodb server to test reconnect and perhaps a few other things: with the mongodb server down, when I do an msg_error() from the writer thread, syslog-ng ends up segfaulting, due to indirectly calling cached_g_current_time() (via log_msg_init()) - which, as far as I have experienced so far, shouldn't be called from anywhere but the main thread. I'm not quite sure how to get around that, except for not generating messages in the worker thread, which would be a bit painful. (This issue was present for a while, I think; I will check if I can bisect it.)

I probably screwed up something around the mutexes too, as I threw loggen at syslog-ng and after a while it just stopped, probably in a neat little deadlock.

Current status as of this writing:
* start/stop_thread() creates/frees mutexes/condvars, is there a reason not to do this once in dd_new() and free them in free() ? [RC]
No reason that I can think of, will move them over.
Moved the mutex/condvar stuff to dd_init() and dd_deinit() respectively. The reason they weren't moved to _new/_free() is that the worker thread depends on stuff that is set up by dd_init(), so it can't start at dd_new() time. I might be missing something though, in which case please bonk me on the head. (One option would be to have a wakeup cond for the thread: I would set up the cond vars and mutexes from _new(), but only issue the wakeup from _init() - problem solved. At the moment I'm having a few issues grasping the process of module starting & shutdown. I'll have another go at it tomorrow.)
* is there a reason mongodb_dd_connect() is protected by a mutex? it is only protected from one code path, but it is only called from the writer thread, so it should be single threaded anyway. [RC]
Good catch, will fix, thanks!
Fixed in git.
* worker_insert(): * it queries the time for each invocation. time(NULL) is dreadfully slow and we're trying to phase it out, especially in slow paths. cached_g_current_time_sec() should be used instead. [RC]
Partially fixed: the OID generation was moved to the mongo client library. Though, that still calls time() for now. I'll add a way to supply our own time, and then the mongodb destination can pass in the cached time.
Fixed in git. I added a self->last_msg_stamp, which is updated under a lock in dd_queue(), and passed to the OID generator in the writer thread.
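In other words, the pattern is simply: the queue path records the newest timestamp under a lock, and the writer thread reads it back when generating the ObjectID. A generic sketch (struct, field and function names are stand-ins, not the actual driver code):

/* Sketch of the last_msg_stamp idea; ExampleDriver is a stand-in, not
 * the real driver struct. */
#include <glib.h>
#include <time.h>

typedef struct
{
  GMutex *suspend_mutex;
  time_t last_msg_stamp;
} ExampleDriver;

/* queue path (main thread): remember the newest message timestamp */
static void
example_note_msg_stamp(ExampleDriver *self, time_t stamp)
{
  g_mutex_lock(self->suspend_mutex);
  self->last_msg_stamp = stamp;
  g_mutex_unlock(self->suspend_mutex);
}

/* writer thread: read it back for ObjectID generation */
static time_t
example_get_msg_stamp(ExampleDriver *self)
{
  time_t stamp;

  g_mutex_lock(self->suspend_mutex);
  stamp = self->last_msg_stamp;
  g_mutex_unlock(self->suspend_mutex);
  return stamp;
}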
* [oid generation] srand() should not be called here, if you
absolutely must rely on srand/rand, then srand() should be called once at startup. maybe it would be better to use RAND_bytes(), although that would pull in a libcrypto as a requirement for mongodb as well (non-RC if well hidden behind a function)
This is partially fixed too, as OID generation was moved to the mongo client (I just haven't pushed those changes yet, wanted to keep my HEAD stable for a few days).
The rand() is used to generate the machine ID... I think I have an acceptable way to solve that.
Fixed in git. The client library now has a mongo_util_oid_init() method which sets up the machine id and the pid. This is called from dd_new(); neither srand, nor rand, nor getpid, nor time is called directly from afmongodb anymore - getpid/srand/rand only get called indirectly, once, during dd_new().
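For context, assuming the usual 12-byte ObjectID layout (4-byte timestamp, 3-byte machine id, 2-byte pid, 3-byte counter), assembling an OID from the pre-initialised parts is straightforward; the sketch below is an illustration, not the actual libmongo-client implementation:

/* Sketch of assembling a 12-byte ObjectID from pre-initialised machine
 * id / pid plus a supplied timestamp and counter, assuming the usual
 * big-endian layout. Not the actual libmongo-client code. */
#include <stdint.h>
#include <string.h>

static void
oid_generate(uint8_t oid[12], uint32_t timestamp,
             const uint8_t machine_id[3], uint16_t pid, uint32_t counter)
{
  oid[0] = (timestamp >> 24) & 0xff;     /* 4-byte timestamp */
  oid[1] = (timestamp >> 16) & 0xff;
  oid[2] = (timestamp >> 8) & 0xff;
  oid[3] = timestamp & 0xff;

  memcpy(oid + 4, machine_id, 3);        /* 3-byte machine id */

  oid[7] = (pid >> 8) & 0xff;            /* 2-byte pid */
  oid[8] = pid & 0xff;

  oid[9] = (counter >> 16) & 0xff;       /* 3-byte counter */
  oid[10] = (counter >> 8) & 0xff;
  oid[11] = counter & 0xff;
}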
* [formatting collection name] g_string_prepend() is slow, add the literal to the beginning, and then use log_template_append_format() to append to the end (no need to move strings around). [RC]
D'oh. Serves me right for not checking the headers: I didn't know about log_template_append_format().
Fixed in git.
* mongo-client.c:
* perhaps a mongo state struct wouldn't hurt even if currently it would be an fd there.
Partially fixed: it's on my TODO list for the client lib for today.
Fixed in git.
* mongo_packet_send(): perhaps you could use writev() to combine the
header/data parts, instead of sending them in two chunks and two syscall roundtrips
Was on my TODO list, it climbed a bit higher.
Fixed in git.
* mongo-wire.c:
* struct _mongo_packet, please don't use #pragma there, just add a comment that new fields should be added with care. You perfectly filled all padding bytes with fields anyway.
Noted, will fix.
Fixed in git. Though, the client library grew another struct (which is not used by syslog-ng yet) which still uses #pragma, to avoid padding. I'll see if I can get around that easily.
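For the packet header itself, no #pragma should be needed: the standard wire-protocol header is four 32-bit fields, so a plain struct has no padding to fill. A sketch (the struct name is illustrative):

/* Sketch of the point above: four naturally aligned 32-bit fields, so
 * the struct is padding-free without any #pragma. */
#include <glib.h>

typedef struct
{
  gint32 length;       /* total message length, including this header */
  gint32 request_id;
  gint32 response_to;
  gint32 opcode;
  /* NOTE: add new fields with care, keeping the struct padding-free. */
} ExampleMongoHeader;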
* perhaps using GByteArray here is an overkill, you size the array by
calculating the required space anyway, this way you could hang the data array right at the end of struct mongodb_packet. [non-RC]
Noted, will fix.
Fixed in git.
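The GByteArray replacement boils down to a C99 flexible array member sized up front, roughly like the sketch below (names are illustrative, not the actual libmongo-client structs):

/* Sketch of the GByteArray alternative: the data buffer hangs off the
 * end of the packet struct as a flexible array member, allocated in one
 * go since the required size is already known. */
#include <glib.h>

typedef struct
{
  gint32 length;
  gint32 request_id;
  gint32 response_to;
  gint32 opcode;
  guint8 data[];       /* flexible array member */
} ExamplePacket;

static ExamplePacket *
example_packet_new(gsize data_size)
{
  ExamplePacket *p = g_malloc0(sizeof(ExamplePacket) + data_size);

  p->length = (gint32) (sizeof(ExamplePacket) + data_size);
  return p;
}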
* [av_foreach_buffer] please don't define a struct for this as its
name is just not covering what it does, I usually use an array of pointers in this case. [non-RC]
Noted, will fix.
Fixed in git.
* [value, coll variables] if at all possible, please reuse GString based buffers in @self, especially if you don't need to protect them via a mutex. allocation heavy code can really affect performance (and although with SQL the numbers are quite low, the numbers you mentioned with mongodb, this may not be the case). This may apply to bson_new_sized() too (e.g. reuse the bson object header after a call similar to g_string_truncate(0)) (non-RC, but preferable)
Noted, will look into it.
value & coll moved to self (as self->current_value and self->current_namespace); they're allocated by the worker thread upon startup, and freed when shutting down. And since the "db." prefix doesn't change, ever, I also store the namespace prefix length in self, and truncate to that point during insert. The bson objects will be moved there too later tonight - I will have to update the client lib for that, and I am knee-deep in syslog-ng at the moment and don't want to switch out.
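The reuse pattern described above, as a generic sketch (the struct and field names are stand-ins, not the real driver code):

/* Sketch of reusing GString buffers kept in self: reset them with
 * g_string_truncate() instead of allocating new buffers per message. */
#include <glib.h>

typedef struct
{
  GString *current_namespace;   /* constant "db." prefix + collection */
  gsize ns_prefix_len;          /* length of the constant prefix */
  GString *current_value;
} ExampleWorkerState;

static void
example_reset_namespace(ExampleWorkerState *self, const gchar *collection)
{
  /* drop everything after the constant prefix, then append the
   * (template-formatted) collection name for this message */
  g_string_truncate(self->current_namespace, self->ns_prefix_len);
  g_string_append(self->current_namespace, collection);
}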
* when naming/defining mutexes, please try to name it according to the data it protects, and not code (writer_thread_mutex should probably be named something else. In SQL this is using a similar name, but that has historic roots). [applies to SQL, not RC]
Ok.
(Partly) fixed in git. I removed the writer_thread_mutex, and added suspend_mutex (held when checking/setting the suspend state) and queue_mutex (held when checking the queue length). I also got rid of explicit mutex locking around the log_queue_push & _pop calls. Once I figure out how to use log_queue_check_items() and ivykis threads, queue_mutex will be gone too.
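The suspend_mutex plus wakeup-condition pattern, sketched generically (names are illustrative, not the actual driver fields):

/* Sketch of the condvar wakeup pattern: the writer thread sleeps until
 * it is woken (new message, or shutdown) instead of polling. */
#include <glib.h>

typedef struct
{
  GMutex *suspend_mutex;
  GCond *writer_thread_wakeup_cond;
  gboolean writer_thread_terminate;
} ExampleWriterState;

/* writer thread: wait until there is something to do */
static void
example_writer_wait(ExampleWriterState *self)
{
  g_mutex_lock(self->suspend_mutex);
  while (!self->writer_thread_terminate /* && the queue is empty */)
    g_cond_wait(self->writer_thread_wakeup_cond, self->suspend_mutex);
  g_mutex_unlock(self->suspend_mutex);
}

/* main thread: wake the writer after pushing a message */
static void
example_writer_wakeup(ExampleWriterState *self)
{
  g_mutex_lock(self->suspend_mutex);
  g_cond_signal(self->writer_thread_wakeup_cond);
  g_mutex_unlock(self->suspend_mutex);
}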
* in syslog-ng 3.3, LogQueue is taking care of its own locking, so no need to separately lock it, at least for _push/pop operations. I see that you are using an SQL-like wakeup mechanism, which requires you to know the length of the queue atomically, which still requires this external lock. A better solution would probably be to use the log_queue_check_items() API and standard ivykis threads for the writer [applies to SQL, not RC]
Noted, I'll have a closer look at ivykis.
I think I have a rough idea how this should work. If all goes well, I'll push this out tonight as well.
* afmongodb_dd_free() doesn't free keys/values [RC]
Off the top of my head, I don't see which key/values should be free'd there: self->fields is cleaned up as far as I see.
Haven't looked further, I'll have a look at it tomorrow. -- |8]
participants (6):
- Balazs Scheidler
- Balint Kovacs
- Gergely Nagy
- Gergely Nagy
- Martin Holste
- Matthew Hall