Re: [syslog-ng] MongoDB destination driver

16 Jan 2011

      Hi,

I just wanted to let you know that I can pull your patches any time, but
I'd like an explicit "pull" request with a proper repo/branch
information to tell me that it is a state which can be pulled.

Also, I before pulling I'd like to ask you to
  * fold related changes into a single patch in its final form (e.g. I
wouldn't want to add the old mongo-c-client based implementation if
possible), 
  * please split off the patch on layers (e.g. one for mongo client lib,
and one for the mongodb destination, it probably doesn't make sense to
separate the syslog-ng plugin glue code, but that could come 3rd).
  * please remember to sign off each individual patch

And now some review comments (I did the review on your HEAD, e.g. not
doing a patch-by-patch review).

* some of my comments would apply to the SQL driver as well, it is
somewhat in a need of cleanup, as my focus during 3.3 development was
more the traditional input/output plugins. These are not release
critical (e.g. the code could still be pulled) I'm marking the comments
below if this is the case and in case a better solution is found for
MongoDB, a similar approch should be taken in SQL as well.

* in syslog-ng 3.3, LogQueue is taking care of its own locking, so no
need to separately lock it, at least for _push/pop operations. I see
that you are using SQL-like wakeup mechanism, which requires you to know
the length of the queue atomicaly, which still requires this external
lock. A better solution would probably be to use log_queue_check_items()
API and standard ivykis threads for the writer [applies to SQL, not RC]

* when naming/defining mutexes, please try to name it according to the
data it protects, and not code (writer_thread_mutex should probably be
named something else. In SQL this is using a similar name, but that has
historic roots). [applies to SQL, not RC]

* afmongodb_dd_free() doesn't free keys/values [RC]

* start/stop_thread() creates/frees mutexes/condvars, is there a reason
not to do this once in dd_new() and free them in free() ? [RC]

* is there a reason mongodb_dd_connect() is protected by a mutex? it is
only protected from one code path, but it is only called from the writer
thread, so it should be single threaded anyway. [RC]

* worker_insert():
  * it queries the time for each invocation. time(NULL) is dreadfully
slow and we're trying to phase it out, especially in slow paths.
cached_g_current_time_sec() should  be used instead. [RC]

  * [oid generation] srand() should not be called here, if you
absolutely must rely on srand/rand, then srand() should be called once
at startup. maybe it would be better to use RAND_bytes(), although that
would pull in a libcrypto as a requirement for mongodb as well (non-RC
if well hidden behind a function)

  * [value, coll variables] if at all possible, please reuse GString
based buffers in @self, especially if you don't need to protect them via
a mutex. allocation heavy code can really affect performance (and
although with SQL the numbers are quite low, the numbers you mentioned
with mongodb, this may not be the case). This may apply to
bson_new_sized() too (e.g. reuse the bson object header after a call
similar to g_string_truncate(0)) (non-RC, but preferable)

  * [formatting collection name] g_string_prepend() is slow, add the
literal to the beginning, and then use log_template_append_format() to
append to the end (no need to move strings around). [RC]

  * [av_foreach_buffer] please don't define a struct for this as its
name is just not covering what it does, I usually use an array of
pointers in this case. [non-RC]

  * [dynamic values] the discussion about globbing on the name-value
pairs should probably be closed before the autovalues stuff can go in.
[RC]

* bson stuff: looks nice. you mentioned you had some unit tests to cover
this, can you add that to modules/afmongodb/tests ?

* mongo-client.c:
  * perhaps a mongo state struct wouldn't hurt even if currently it
would be an fd there. 

  * mongo_packet_send(): perhaps you could use writev() to combine the
header/data parts, instead of sending them in two chunks and two syscall
roundtrips

* mongo-wire.c:
  * struct _mongo_packet, please don't use #pragma there, just add a
comment that new fields should be added with care. You perfectly filled
all padding bytes with fields anyway.

  * perhaps using GByteArray here is an overkill, you size the array by
calculating the required space anyway, this way you could hang the data
array right the end of struct mongodb_packet. [non-RC]

  * wouldn't there be a way to avoid copying bson structs around? e.g.
you currently [non-RC, just food for thought):
     * build 3 BSON objects (selector, update, set), possibly containing
further nested because of subdocuments
     * then build a mongodb_packet which again copies the whole
structure, essentially building everything and then copying everything
again.
     * perhaps it'd be possible to reference the bson objects directly
from the mongo_packet where they should be embedded? and then
mongo_packet_send() could use writev() to send out the whole thing,
without having to copy them to a linear buffer?

Even though this message is lengthy, I really like the code here, and
would put that into my tree immediately. But I let you do the stuff
still open and to take the fame for your efforts :)

On Sat, 2011-01-15 at 00:34 +0100, Gergely Nagy wrote:
...
It's been a while I spam^Wnotified the list with mongodb updates, so the
recent news from the past few days are as follows:
* The syslog-ng 3.2 version is abandoned, obsolete.
* The driver has been ported to 3.3, and the database writer was split
out into a separate thread.
* The underlying mongodb connector was replaced: threw out the former,
and implemented my own (for various reasons, which are too numerous to
list here). The new mongodb connector will be maintained as a separate
project, but a version of it will be embedded into syslog-ng for
convenience's sake. The canonical repo for it is over at github:
https://github.com/algernon/libmongo-client
While this might not seem much, the above changes brought a few good
side effects:
* If we loose connection to MongoDB, for one reason or the other, the
driver does not drop messages immediately anymore: we have a queue
(configurable via log_fifo_size()).
* Database writes are in a separate thread, thus, network latency does
not affect the speed at which the driver can accept messages.
* Due to the new underlying connector, we handle error cases a lot
better. As in, we don't throw up and abort(). Though, error handling
still has to improve, but the driver's in a much better state.
* There's support for $SEQNUM in templates now. Though I can't really
imagine a situation where I could use it, but I needed them for ObjectID
generation anyway, so why not expose them to the templates aswell?
However, there's one downside too: we lost authentication support
(temporarily, until I re-implement that in the connector).
The speed is about the same as 3.2's, threading didn't improve our
performance (well, it did, if we consider network latency, but my tests
are local) - I didn't expect that to improve to begin with.
I'm actually surprised that despite changing the connector library, and
porting over to threads, the driver still maintained its speed.
My project page should be (mostly) up to date, but for the adventurous:
git clone -b modules/afmongodb git://git.madhouse-project.org/syslog-ng/syslog-ng-3.3.git
The modules/afmongodb branch will hopefully be quiet during the weekend.
-- 
Bazsi

Re: [syslog-ng] MongoDB destination driver

Balazs Scheidler