Hello,
I am running syslog-ng 1.4.17 on Linux 2.4.21 and received an error
when the disk partition became full. Looking at the code, the error does
not appear to be handled very well. After the error, syslog-ng
started using 26% of my CPU time and stopped writing newly logged
messages to the file it had been writing when the error occurred,
even though space had since been freed on the partition.
The specifics are:
I received this error:
Jun 25 13:16:57 port1-1 syslog-ng[147]: io.c: do_write: write() failed (errno 28), No space left on device
So do_write() correctly identifies the error and logs a clear
message that the disk is full:
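int res = write(self->fd, data, length);
if (res < 0)
  {
    switch (errno)
      {
      case EINTR:
      case EWOULDBLOCK:
        return 0;
      default:
        /* arguments not shown in the original excerpt; presumably the
           errno value and its message text */
        werror("io.c: do_write: write() failed (errno %i), %z\n",
               errno, strerror(errno));
      }
    ...
    return (res);
  }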
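For comparison, here is a minimal sketch (not syslog-ng's code) of a write wrapper that keeps transient errors (EINTR, EAGAIN/EWOULDBLOCK) apart from persistent ones such as ENOSPC, so a caller can tell "try again later" from "stop retrying". The names checked_write() and enum write_status are hypothetical. In the excerpt above, the transient case returns 0, which a caller cannot distinguish from a write of zero bytes, and what the error branch finally returns is hidden behind the "...".

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical result codes; not part of syslog-ng's API. */
enum write_status {
    WRITE_DONE,   /* some bytes were written (possibly a short write) */
    WRITE_AGAIN,  /* transient condition: retry on the next event */
    WRITE_FATAL   /* persistent error such as ENOSPC: stop retrying */
};

/* Write once and report how the caller should proceed.
 * On WRITE_DONE, *written holds the byte count.
 * On WRITE_FATAL, *saved_errno holds the errno value. */
static enum write_status checked_write(int fd, const void *data, size_t length,
                                       size_t *written, int *saved_errno)
{
    ssize_t res = write(fd, data, length);

    *written = 0;
    *saved_errno = 0;
    if (res >= 0) {
        *written = (size_t)res;
        return WRITE_DONE;
    }
    switch (errno) {
    case EINTR:
    case EAGAIN:
#if EWOULDBLOCK != EAGAIN
    case EWOULDBLOCK:   /* same value as EAGAIN on Linux */
#endif
        return WRITE_AGAIN;
    default:
        *saved_errno = errno;   /* save before stdio can clobber errno */
        fprintf(stderr, "write() failed (errno %d), %s\n",
                *saved_errno, strerror(*saved_errno));
        return WRITE_FATAL;
    }
}
```

On Linux, writing to /dev/full fails with ENOSPC, which makes it possible to exercise the WRITE_FATAL path without actually filling a partition.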
The only place I can see where do_write() is called is write_callback(),
where I would have expected something like "if (w == -1) do_something", but
w is simply passed as an argument to BUF_FLUSH():
static void write_callback(struct nonblocking_fd *fd)
{
  CAST(io_fd, self, fd);
  int res;
  struct fd_write w =
    { { STACK_HEADER, do_write }, fd->fd, self->fsync };

  assert(self->buffer);
  res = BUF_FLUSH(self->buffer, &w.super);
  ...
}
So if w is an error (-1), what does &w.super refer to? Nothing
good, I suspect.
Maybe I have this wrong, and there is another place where
do_write() is called. Regardless, the value of w is not being checked,
so what is the proper course of action if w is -1?
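One possible answer, sketched under assumptions: when a write fails with a persistent error, keep the unwritten data, stop retrying immediately, and back off for a fixed interval before trying again. The high CPU usage described above might come from the event loop retrying a failing write every time the descriptor is reported writable, but that is a guess. The struct log_output, flush_output() and retry_secs names below are hypothetical and do not correspond to syslog-ng's buffer or BUF_FLUSH() API.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical output state; names are illustrative, not syslog-ng's. */
struct log_output {
    int fd;
    char buf[4096];
    size_t used;            /* bytes waiting to be written */
    time_t suspended_until; /* 0 = not suspended */
};

/* Try to drain the buffer. Returns 0 on progress (or nothing to do),
 * -1 if the output is suspended after a persistent error. */
static int flush_output(struct log_output *out, time_t now, int retry_secs)
{
    if (out->suspended_until && now < out->suspended_until)
        return -1;              /* still backing off: skip this attempt */
    out->suspended_until = 0;

    while (out->used > 0) {
        ssize_t res = write(out->fd, out->buf, out->used);
        if (res < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry now */
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                return 0;       /* would block: wait for the next writable event */
            /* Persistent error (e.g. ENOSPC): keep the data and do not
             * retry until retry_secs have passed, instead of spinning. */
            out->suspended_until = now + retry_secs;
            return -1;
        }
        if (res == 0)
            return 0;           /* no progress: avoid looping forever */
        /* Drop the bytes that were written and keep the remainder. */
        memmove(out->buf, out->buf + res, out->used - (size_t)res);
        out->used -= (size_t)res;
    }
    return 0;
}
```

Keeping the buffered data means nothing is lost if space is freed within the back-off window; a real implementation would also have to bound the buffer and decide what to drop once it fills.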
Note: my syslog-ng is running on an embedded system that does
not have much flash memory for storing these logs. The system has
a process that truncates the log files when they exceed specific limits,
but because that process checks file sizes only periodically, one or more
processes can exceed their limits and fill the disk before
the checker process wakes up and does the cleanup, as happened in
this case.
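Independently of the error handling, the window described in the note could be narrowed by checking free space before writing. Below is a minimal sketch using POSIX statvfs(); the helper names bytes_available() and write_allowed() and the reserve policy are assumptions for illustration, not something syslog-ng 1.4 provides. The check is inherently racy (another process can fill the disk between the check and the write), so it complements handling ENOSPC rather than replacing it.

```c
#include <stddef.h>
#include <sys/statvfs.h>

/* Hypothetical helper: free bytes available to unprivileged processes
 * on the filesystem holding `path`, or -1 on error. */
static long long bytes_available(const char *path)
{
    struct statvfs sv;

    if (statvfs(path, &sv) != 0)
        return -1;
    return (long long)sv.f_bavail * (long long)sv.f_frsize;
}

/* Hypothetical policy: refuse a write that would leave fewer than
 * `reserve` bytes free, so the cleanup process still has room to work.
 * Returns 1 if the write may proceed, 0 otherwise. */
static int write_allowed(const char *path, size_t length, long long reserve)
{
    long long avail = bytes_available(path);

    if (avail < 0)
        return 1;   /* cannot tell: let write() report any error itself */
    return avail - (long long)length >= reserve;
}
```

f_bavail excludes blocks reserved for root; if syslog-ng runs as root it can also consume those, in which case f_bfree may be the more relevant figure.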
Thanks.
John Feeney