[syslog-ng] MongoDB destination driver

Gergely Nagy algernon at balabit.hu
Sun Jan 16 16:58:09 CET 2011


> Also, I before pulling I'd like to ask you to
>   * fold related changes into a single patch in its final form (e.g. I
> wouldn't want to add the old mongo-c-client based implementation if
> possible), 
>   * please split off the patch on layers (e.g. one for mongo client lib,
> and one for the mongodb destination, it probably doesn't make sense to
> separate the syslog-ng plugin glue code, but that could come 3rd).
>   * please remember to sign off each individual patch
> 

A'ight! Once I fixed the problems below, I'll prepare the pull branch
accordingly.

One note, though: the client lib is 'embedded' at the moment by copying
the sources files over, into the modules/afmongodb/ directory. It's not
the best, I'll see if I can do better (more about this down below).

In the current form, separating the destination driver from the client
lib would be tricky at best.

> * in syslog-ng 3.3, LogQueue is taking care of its own locking, so no
> need to separately lock it, at least for _push/pop operations. I see
> that you are using SQL-like wakeup mechanism, which requires you to know
> the length of the queue atomicaly, which still requires this external
> lock. A better solution would probably be to use log_queue_check_items()
> API and standard ivykis threads for the writer [applies to SQL, not RC]

Noted, I'll have a closer look at ivykis.

> * when naming/defining mutexes, please try to name it according to the
> data it protects, and not code (writer_thread_mutex should probably be
> named something else. In SQL this is using a similar name, but that has
> historic roots). [applies to SQL, not RC]

Ok.

> * afmongodb_dd_free() doesn't free keys/values [RC]

Off the top of my head, I don't see which key/values should be free'd
there: self->fields is cleaned up as far as I see.

> * start/stop_thread() creates/frees mutexes/condvars, is there a reason
> not to do this once in dd_new() and free them in free() ? [RC]

No reason that I can think of, will move them over.

> * is there a reason mongodb_dd_connect() is protected by a mutex? it is
> only protected from one code path, but it is only called from the writer
> thread, so it should be single threaded anyway. [RC]

Good catch, will fix, thanks!

> * worker_insert():
>   * it queries the time for each invocation. time(NULL) is dreadfully
> slow and we're trying to phase it out, especially in slow paths.
> cached_g_current_time_sec() should  be used instead. [RC]

Partially-fixed: The OID generation was moved to the mongo client
library. Though, that still calls time() for now. I'll add a way to
supply our own time, and then the mongodb destination can pass the
cached time.

> * [oid generation] srand() should not be called here, if you
> absolutely must rely on srand/rand, then srand() should be called once
> at startup. maybe it would be better to use RAND_bytes(), although that
> would pull in a libcrypto as a requirement for mongodb as well (non-RC
> if well hidden behind a function)

This is partially fixed too, as OID generation was moved to the mongo
client (I just haven't pushed those changes yet, wanted to keep my HEAD
stable for a few days).

The rand() is used to generate the machine ID... I think I have an
acceptable way to solve that.

As for librypto dependency: I will most likely end up having to depend
on it, as mongodb authentication relies on MD5, and I don't feel like
embedding a random md5 lib, so will probably turn to libcrypto anyway.
And by doing so, I get RAND_bytes() for free, too.

> * [value, coll variables] if at all possible, please reuse GString
> based buffers in @self, especially if you don't need to protect them via
> a mutex. allocation heavy code can really affect performance (and
> although with SQL the numbers are quite low, the numbers you mentioned
> with mongodb, this may not be the case). This may apply to
> bson_new_sized() too (e.g. reuse the bson object header after a call
> similar to g_string_truncate(0)) (non-RC, but preferable)

Noted, will look into it.

> * [formatting collection name] g_string_prepend() is slow, add the
> literal to the beginning, and then use log_template_append_format() to
> append to the end (no need to move strings around). [RC]

D'oh. Suits me for not checking the headers: didn't know about
log_template_append_format().

> * [av_foreach_buffer] please don't define a struct for this as its
> name is just not covering what it does, I usually use an array of
> pointers in this case. [non-RC]

Noted, will fix.

> * [dynamic values] the discussion about globbing on the name-value
> pairs should probably be closed before the autovalues stuff can go in.
> [RC]

Agreed. Would you prefer a pull request without the autovalues first,
and another one after the value-pairs() stuff is implemented, or one
pull request after value-pairs()?

> * bson stuff: looks nice. you mentioned you had some unit tests to cover
> this, can you add that to modules/afmongodb/tests ?

In the long run, I'd like to figure out a way to properly embed my
mongo-client library. The current way of copying over the sources is a
bit... lame.

I'm not quite sure about what the best way would be, though. I would
like to embed the client lib along with the test cases & whatnot, but
copying over isn't going to work then in the long run.

Anyway: technically, yes, I could add it, wouldn't even be hard. But I'd
prefer finding an embedding solution that's easier to handle.

An option would be to use git submodules, and teach the syslog-ng build
system to recurse into the submodule directory if it exists, and skip
the mongodb driver if it doesn't: this would mean that git checkouts
continue to work, even if one doesn't update the submodules, but also,
once the client lib is checked out, make dist and similar will pick it
up.

However, the above idea might be a bit too intrusive. A compromise might
be to copy my whole lib to lib/mongo-client/ for example, along with
test cases.

> * mongo-client.c:
>   * perhaps a mongo state struct wouldn't hurt even if currently it
> would be an fd there. 

Partially fixed: it's on my TODO list for the client lib for today.

> * mongo_packet_send(): perhaps you could use writev() to combine the
> header/data parts, instead of sending them in two chunks and two syscall
> roundtrips

Was on my TODO list, it climbed a bit higher.

> * mongo-wire.c:
>   * struct _mongo_packet, please don't use #pragma there, just add a
> comment that new fields should be added with care. You perfectly filled
> all padding bytes with fields anyway.

Noted, will fix.

> * perhaps using GByteArray here is an overkill, you size the array by
> calculating the required space anyway, this way you could hang the data
> array right the end of struct mongodb_packet. [non-RC]

Noted, will fix.

> * wouldn't there be a way to avoid copying bson structs around? e.g.
> you currently [non-RC, just food for thought):
>      * build 3 BSON objects (selector, update, set), possibly containing
> further nested because of subdocuments
>      * then build a mongodb_packet which again copies the whole
> structure, essentially building everything and then copying everything
> again.
>      * perhaps it'd be possible to reference the bson objects directly
> from the mongo_packet where they should be embedded? and then
> mongo_packet_send() could use writev() to send out the whole thing,
> without having to copy them to a linear buffer?

This is another thing on my TODO list, just at a lower priority: the
first priority was to get the lib working, optimisation came lagging
behind (and I'm not quite there yet).


> Even though this message is lengthy, I really like the code here, and
> would put that into my tree immediately. But I let you do the stuff
> still open and to take the fame for your efforts :)

\o/

-- 
|8]





More information about the syslog-ng mailing list