Hello,

The attached patch comes from http://dev.riseup.net/patches/syslog-ng. It provides a simple filter that strips text matching unwanted regular expressions from logs, as well as an "ips" alias that strips IP addresses from your logs. This patch has been applied to the Debian package of syslog-ng; I am writing here to let people know about it and to submit it for consideration for inclusion in syslog-ng.
From the README:
This patch adds the capability to syslog-ng that allows you to strip out any given regexp or all IP addresses from log messages before they are written to disk. The goal is to give the system administrator the means to implement site logging policies, by allowing them easy control over exactly what data they retain in their logfiles, regardless of what a particular daemon might think is best.

Background:

Data retention has become a hot legal topic for ISPs and other Online Service Providers (OSPs). There are many instances where it is preferable to keep less information on users than is collected by default on many systems. In the United States it is not currently required to retain data on users of a server, but you may be required to provide all data on a user which you have retained. OSPs can protect themselves from legal hassles and added work by choosing what data they wish to retain.
From "Best Practices for Online Service Providers" (http://www.eff.org/osp):
As an intermediary, the OSP finds itself in a position to collect and store detailed information about its users and their online activities that may be of great interest to third parties. The USA PATRIOT Act also provides the government with expanded powers to request this information. As a result, OSP owners must deal with requests from law enforcement and lawyers to hand over private user information and logs. Yet, compliance with these demands takes away from an OSP's goal of providing users with reliable, secure network services. In this paper, EFF offers some suggestions, both legal and technical, for best practices that balance the needs of OSPs and their users' privacy and civil liberties.

Rather than scrubbing the information you don't want in logs, this patch ensures that the information is never written to disk. Also, for those daemons which log through syslog facilities, this patch provides a convenient single configuration to limit what you wish to log.

Here are some related links:

  Best Practices for Online Service Providers
  http://www.eff.org/osp
  http://www.eff.org/osp/20040819_OSPBestPractices.pdf

  EPIC International Data Retention Page
  http://www.epic.org/privacy/intl/data_retention.html

  Working Paper on Usage Log Data Management (from Computer, Freedom, and Privacy conference)
  http://cryptome.org/usage-logs.htm

How to use it

This patch adds the filter "strip". For example:

  filter f_strip {strip(<regexp>);};

This will strip out all matches of the regular expression on logs to which the filter is applied and replaces all matches with the fixed length four dashes ("----").

In place of a regular expression, you can put "ips", which will replace all internet addresses with 0.0.0.0. For example:

  filter f_strip {strip(ips);};

You can alter what the replacement strings are by using replace:

  replace(ips,"0.0.0.0")    <--- this is the same as strip(ips)
  replace(<regex>,"----")   <--- this is the same as strip(regex)

We provide a debian package of 1.6.7 with this patch added (the repository is http://deb.riseup.net/debian unstable main), or you can retrieve the patch yourself from http://dev.riseup.net/websvn/listing.php?repname=syslog-ng-anon&path=%2F&sc=... and apply it with:

  # patch -p1 < syslog-ng-anon.diff
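For readers who want to see the strip()/replace() semantics in isolation, here is a minimal, self-contained sketch in C using POSIX regcomp()/regexec(). It is not the patch's code: strip_pattern() and append_bounded() are hypothetical helper names, and the sketch assumes that truncating to a caller-supplied buffer is acceptable. As far as I can tell, the loop in the patch would spin forever on a pattern that can match the empty string (such as "x*"), so this sketch steps past empty matches.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Append at most n bytes of s to dst, always leaving room for a NUL.
 * Returns the new number of bytes used. Requires dstsize >= 1. */
static size_t append_bounded(char *dst, size_t dstsize, size_t used,
                             const char *s, size_t n)
{
    size_t room = dstsize - 1 - used;
    if (n > room)
        n = room;
    memcpy(dst + used, s, n);
    return used + n;
}

/* Hypothetical helper, not the patch's code: copy msg into dst, replacing
 * every match of the POSIX extended regex `pattern` with `repl`.
 * Returns the number of replacements, or -1 if the pattern does not
 * compile or dstsize is 0. Output is truncated to fit and NUL-terminated. */
static int strip_pattern(const char *pattern, const char *msg,
                         const char *repl, char *dst, size_t dstsize)
{
    regex_t re;
    regmatch_t m;
    size_t used = 0, repl_len = strlen(repl);
    int count = 0, eflags = 0;

    if (dstsize == 0 || regcomp(&re, pattern, REG_EXTENDED | REG_ICASE) != 0)
        return -1;

    while (regexec(&re, msg, 1, &m, eflags) == 0) {
        /* text before the match, then the replacement */
        used = append_bounded(dst, dstsize, used, msg, (size_t)m.rm_so);
        used = append_bounded(dst, dstsize, used, repl, repl_len);
        count++;
        msg += m.rm_eo;
        if (m.rm_eo == m.rm_so) {
            /* Empty match: step over one character so the loop terminates. */
            if (*msg == '\0')
                break;
            used = append_bounded(dst, dstsize, used, msg, 1);
            msg++;
        }
        eflags = REG_NOTBOL;   /* the remainder no longer starts the line */
    }
    /* whatever is left after the last match */
    used = append_bounded(dst, dstsize, used, msg, strlen(msg));
    dst[used] = '\0';
    regfree(&re);
    return count;
}
```

With the pattern "user=[a-z]+" and the replacement "----", the message "login user=alice from host1, user=bob from host2" comes out as "login ---- from host1, ---- from host2".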
On Mon, 30 May 2005 19:12:51 CDT, micah milano said:
The attached patch comes from http://dev.riseup.net/patches/syslog-ng what it does is provide a simple filter to strip out unwanted regular expressions from logs, as well as an IP alias that enables you to strip out IP addresses from your logs.
Interesting. Does it apply the regexp to *the entire message* (a quick read of the code indicates so)?

Also, I see in make_filter_replace:

  if (strcasecmp(re,"ips") == 0) {
    re = "...([\\.\\-](25

Was the \\- intended?

And just to *ensure* that rocks and rotten tomatoes are heaved at me: Any plans to expand that RE to cover IPv6 addresses? ;)
Valdis.Kletnieks@vt.edu said:
On Mon, 30 May 2005 19:12:51 CDT, micah milano said:
The attached patch comes from http://dev.riseup.net/patches/syslog-ng what it does is provide a simple filter to strip out unwanted regular expressions from logs, as well as an IP alias that enables you to strip out IP addresses from your logs.
Interesting. Does it apply the regexp to *the entire message* (a quick read of the code indicates so)?
yes. perhaps it should not?
Also, I see in make_filter_replace:
if (strcasecmp(re,"ips") == 0) { re = "...([\\.\\-](25
Was the \\- intended?
Many ISPs set the reverse dns to include the ip address in the form 69-90-134-155-myisp.com, so I thought it would be useful to remove those as well.
Any plans to expand that RE to cover IPv6 addresses? ;)
Yes. Alas, IPv6 is complicated. I had a pcre which worked, but had some difficulty converting it to regexp. Eventually, I plan to do so. Any suggestions for what the regexp should be? -elijah
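To make the dotted-versus-dashed question concrete, here is a small C sketch that reports what the "ips" pattern matches in a message. The pattern is the one from make_filter_replace with one change, which is my assumption and not something the patch does: the separator is written [.-] instead of [\\.\\-]. If I read POSIX correctly, a backslash inside a bracket expression is an ordinary character, so the original form would also accept a backslash as a separator. find_ip() is a hypothetical helper name.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* One IPv4 octet, 0..255, as written in the patch. */
#define OCTET "(25[0-5]|2[0-4][0-9]|[0-1]?[0-9]?[0-9])"
/* Four octets separated by '.' or '-' (assumption: [.-] replaces [\.\-]). */
#define IPS_PATTERN OCTET "([.-]" OCTET "){3}"

/* Hypothetical helper: copy the first substring of msg that matches
 * IPS_PATTERN into out (truncated to fit, NUL-terminated).
 * Returns 1 if a match was found, 0 if not, -1 on error. */
static int find_ip(const char *msg, char *out, size_t outsize)
{
    regex_t re;
    regmatch_t m;
    int found = 0;

    if (outsize == 0 || regcomp(&re, IPS_PATTERN, REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, msg, 1, &m, 0) == 0) {
        size_t len = (size_t)(m.rm_eo - m.rm_so);
        if (len >= outsize)
            len = outsize - 1;
        memcpy(out, msg + m.rm_so, len);
        out[len] = '\0';
        found = 1;
    }
    regfree(&re);
    return found;
}
```

On "connect from 69-90-134-155-myisp.com" this reports 69-90-134-155, and on "client 195.197.6.73 connected" it reports 195.197.6.73, so both the plain address and the address embedded in a reverse-DNS name are caught.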
On Tue, 31 May 2005 11:28:29 PDT, Elijah said:
Valdis.Kletnieks@vt.edu said:
Interesting. Does it apply the regexp to *the entire message* (a quick read of the code indicates so)?
yes. perhaps it should not?
That's fine, as long as that's the documented and understood behavior. It occurred to me that probably some explicit decision should be made and documented regarding $HOST/$MACHINE - it's reasonable to *not* filter those, because if you're running a central syslog server, you probably want to *keep* the information that the message came from your NNTP server, but *redact* the end user's IP address in the NNTP server's logs. However, this may come as a surprise if a site has end-user IP addresses syslog()ing to the central server (no, I don't know why you'd do that, but it could happen ;)
Also, I see in make_filter_replace:
if (strcasecmp(re,"ips") == 0) { re = "...([\\.\\-](25
Was the \\- intended?
Many ISPs set the reverse dns to include the ip address in the form 69-90-134-155-myisp.com, so I thought it would be useful to remove those as well.
OK.. I can see why you'd want to do that. However, I'm not convinced that it's a good idea to try to clean up the text strings of PTR entries, as that's just providing a false sense of security. Consider these hosts: % host 195.197.6.1 % host 195.197.6.73 % host 195.197.6.74 You'll almost certainly end up with this in the logs. ;)
Any plans to expand that RE to cover IPv6 addresses? ;)
Yes. Alas, IPv6 is complicated. I had a pcre which worked, but had some difficulty converting it to regexp. Eventually, I plan to do so. Any suggestions for what the regexp should be?
No.. not at this time of the morning, sorry.. ;)
Hi, I'll just throw in my 2 cents ...
The attached patch comes from http://dev.riseup.net/patches/syslog-ng
Gives you a 404 at first until you click on login.
what it does is provide a simple filter to strip out unwanted regular expressions from logs, as well as an IP alias that enables you to strip out IP addresses from your logs. This patch has been applied to the Debian package of syslog-ng, I am writing here to let people know about it and submit it for consideration into syslog-ng.
Basically, despite all the written data retention documents for ISPs and OSPs, I think this is a bad idea. However, I also see that this patch is rather non-intrusive and might indeed help a couple of people working in these business fields. It is a bad idea not least because the logic of hiding data should be in the frontend and/or the extraction process (ETL) and not in the data storage. On a central syslog server you might, for example, want to apply data mining techniques, for which you need the whole set of raw data, unfiltered. Well, only partially unfiltered, since one will certainly apply filters in their log statements. But as I said above, the patch is non-intrusive and has some merit.
Data retention has become a hot legal topic for ISPs and other Online Service Providers (OSPs). There are many instances where it is preferable to keep less information on users than is collected by default on many systems. In the United States it is not currently required to retain data on users of a server, but you may be required to provide all data on a user which you have retained.
This is a hot topic in Switzerland, for example, where legislative reforms have taken place that might require precisely that stripping not be allowed.
Rather than scrubbing the information you don't want in logs, this patch ensures that the information is never written to disk. Also, for those daemons which log through syslog facilities, this patch provides a convenient single configuration to limit what you wish to log.
This is not entirely true. With your patch you add a third method of dealing with information, but it is not on the same level as the other two.

Method 1: have log statements which omit certain log lines, and don't set a catchall log statement.
Method 2: build a filter for lines you'd like to match and forget. Add a destination statement with /dev/null as file destination.
Method 3: strip the lines.

Methods 1 and 2 drop information, but what remains keeps its truth value. Method 3 changes the information gain and thus, strictly speaking, dilutes the truth. Dealing with the legal aspects of information gain/loss with regard to dilution is a delicate matter.
This patch adds the filter "strip". For example:
filter f_strip {strip(<regexp>);};
I don't see the necessity to provide a keyword strip as a subset of replace. Please drop it, while referring to the equivalent lines below, written by you.
replace(ips,"0.0.0.0")    <--- this is the same as strip(ips)
replace(<regex>,"----")   <--- this is the same as strip(regex)
We provide a debian package of 1.6.7 with this patch added (the repository is http://deb.riseup.net/debian unstable main), or you can retrieve the patch yourself from http://dev.riseup.net/websvn/listing.php?repname=syslog-ng-anon&path=%2F&sc=0 and apply it with:
# patch -p1 < syslog-ng-anon.diff
------------------------------------------------------------------------
diff -uNr orig/syslog-ng-1.6.7/doc/Makefile.am new/syslog-ng-1.6.7/doc/Makefile.am
--- orig/syslog-ng-1.6.7/doc/Makefile.am 2005-03-04 09:58:08.000000000 -0600
+++ new/syslog-ng-1.6.7/doc/Makefile.am 2005-05-30 18:26:29.986769706 -0500
@@ -4,7 +4,7 @@

 EXTRA_DIST = $(man_MANS) stresstest.sh syslog-ng.old.txt \
         syslog-ng.conf.demo syslog-ng.conf.sample \
-        syslog-ng.conf.solaris
-
+        syslog-ng.conf.solaris README.syslog-ng-anon \
+        syslog-ng-anon.conf

diff -uNr orig/syslog-ng-1.6.7/doc/Makefile.in new/syslog-ng-1.6.7/doc/Makefile.in
--- orig/syslog-ng-1.6.7/doc/Makefile.in 2005-04-09 05:50:58.000000000 -0500
+++ new/syslog-ng-1.6.7/doc/Makefile.in 2005-05-30 18:29:45.194741054 -0500
@@ -116,7 +116,9 @@

 EXTRA_DIST = $(man_MANS) stresstest.sh syslog-ng.old.txt \
         syslog-ng.conf.demo syslog-ng.conf.sample \
-        syslog-ng.conf.solaris
+        syslog-ng.conf.solaris README.syslog-ng-anon \
+        syslog-ng-anon.conf
+

 subdir = doc
 ACLOCAL_M4 = $(top_srcdir)/aclocal.m4
diff -uNr orig/syslog-ng-1.6.7/doc/README.syslog-ng-anon new/syslog-ng-1.6.7/doc/README.syslog-ng-anon
--- orig/syslog-ng-1.6.7/doc/README.syslog-ng-anon 1969-12-31 18:00:00.000000000 -0600
+++ new/syslog-ng-1.6.7/doc/README.syslog-ng-anon 2005-05-30 18:25:40.828858265 -0500
@@ -0,0 +1,93 @@
+syslog-ng-anon
+
+  This patch adds the capability to syslog-ng that allows you to strip
+  out any given regexp or all IP addresses from log messages before
+  they are written to disk. The goal is to give the system administrator
+  the means to implement site logging policies, by allowing them easy
+  control over exactly what data they retain in their logfiles,
+  regardless of what a particular daemon might think is best.
+
+Background:
+
+  Data retention has become a hot legal topic for ISPs and other Online
+  Service Providers (OSPs). There are many instances where it is preferable
+  to keep less information on users than is collected by default on many
+  systems. In the United States it is not currently required to retain
+  data on users of a server, but you may be required to provide all data
+  on a user which you have retained. OSPs can protect themselves from legal
+  hassles and added work by choosing what data they wish to retain.
+
+  From "Best Practices for Online Service Providers"
+  (http://www.eff.org/osp):
+
+    As an intermediary, the OSP [Online Service Provider] finds itself in
+    a position to collect and store detailed information about its users
+    and their online activities that may be of great interest to third
+    parties. The USA PATRIOT Act also provides the government with
+    expanded powers to request this information. As a result, OSP owners
+    must deal with requests from law enforcement and lawyers to hand over
+    private user information and logs. Yet, compliance with these demands
+    takes away from an OSP's goal of providing users with reliable,
+    secure network services. In this paper, EFF offers some suggestions,
+    both legal and technical, for best practices that balance the needs
+    of OSPs and their users' privacy and civil liberties.
+
+  Rather than scrubbing the information you don't want in logs, this patch
+  ensures that the information is never written to disk. Also, for those
+  daemons which log through syslog facilities, this patch provides a
+  convenient single configuration to limit what you wish to log.
+
+  Here are some related links:
+
+    Best Practices for Online Service Providers
+    http://www.eff.org/osp
+    http://www.eff.org/osp/20040819_OSPBestPractices.pdf
+
+    EPIC International Data Retention Page
+    http://www.epic.org/privacy/intl/data_retention.html
+
+    Working Paper on Usage Log Data Management (from Computer, Freedom, and
+    Privacy conference) http://cryptome.org/usage-logs.htm
+
+
+Installing syslog-ng-anon
+
+  Applying the patch
+
+    This patch has been tested against the following versions of syslog-ng:
+      . version 1.6.7
+      . Debian package syslog-ng_1.6.7-2
+
+
+    To use this patch, obtain the source for syslog-ng
+    (http://www.balabit.com/downloads/syslog-ng/1.6/src/) and the latest
+    syslog-ng-anon patch (http://dev.riseup.net/patches/syslog-ng/).
+    Uncompress the syslog-ng source and then apply the patch:
+
+      % tar -zxvf syslog-ng.tar.gz
+      % cd syslog-ng
+      % patch -p1 < syslog-ng-anon.diff
+
+    Then compile and install syslog-ng as normal.
+
+  Debian package
+
+    Alternately, you can install syslog-ng-anon from this repository:
+    deb http://deb.riseup.net/debian unstable main
+
+  How to use it
+
+    This patch adds the filter "strip". For example:
+
+      filter f_strip {strip(<regexp>);};
+
+    This will strip out all matches of the regular expression on logs to
+    which the filter is applied and replaces all matches with the fixed length
+    four dashes ("----").
+
+    In place of a regular expression, you can put "ips", which will replace all
+    internet addresses with 0.0.0.0. For example:
+
+      filter f_strip {strip(ips);};
+
+    You can alter what the replacement strings are by using replace:
diff -uNr orig/syslog-ng-1.6.7/doc/syslog-ng-anon.conf new/syslog-ng-1.6.7/doc/syslog-ng-anon.conf
--- orig/syslog-ng-1.6.7/doc/syslog-ng-anon.conf 1969-12-31 18:00:00.000000000 -0600
+++ new/syslog-ng-1.6.7/doc/syslog-ng-anon.conf 2005-05-30 18:25:40.828858265 -0500
@@ -0,0 +1,243 @@
+#
+# Configuration file for syslog-ng under Debian.
+# Customized for riseup.net using syslog-ng-anon patch
+# (http://dev.riseup.net/patches/syslog-ng/)
+#
+# see http://www.campin.net/syslog-ng/expanded-syslog-ng.conf
+# for examples.
+#
+# levels: emerg alert crit err warning notice info debug
+#
+
+############################################################
+## global options
+
+options {
+        chain_hostnames(0);
+        time_reopen(10);
+        time_reap(360);
+        sync(0);
+        log_fifo_size(2048);
+        create_dirs(yes);
+        group(adm);
+        perm(0640);
+        dir_perm(0755);
+        use_dns(no);
+};
+
+############################################################
+## universal source
+
+source s_all {
+        internal();
+        unix-stream("/dev/log");
+        file("/proc/kmsg" log_prefix("kernel: "));
+};
+
+############################################################
+## generic destinations
+
+destination df_facility_dot_info { file("/var/log/$FACILITY.info"); };
+destination df_facility_dot_notice { file("/var/log/$FACILITY.notice"); };
+destination df_facility_dot_warn { file("/var/log/$FACILITY.warn"); };
+destination df_facility_dot_err { file("/var/log/$FACILITY.err"); };
+destination df_facility_dot_crit { file("/var/log/$FACILITY.crit"); };
+
+############################################################
+## generic filters
+
+filter f_strip { strip(ips); };
+filter f_at_least_info { level(info..emerg); };
+filter f_at_least_notice { level(notice..emerg); };
+filter f_at_least_warn { level(warn..emerg); };
+filter f_at_least_err { level(err..emerg); };
+filter f_at_least_crit { level(crit..emerg); };
+
+############################################################
+## auth.log
+
+filter f_auth { facility(auth, authpriv); };
+destination df_auth { file("/var/log/auth.log"); };
+log {
+        source(s_all);
+        filter(f_auth);
+        destination(df_auth);
+};
+
+############################################################
+## daemon.log
+
+filter f_daemon { facility(daemon); };
+destination df_daemon { file("/var/log/daemon.log"); };
+log {
+        source(s_all);
+        filter(f_daemon);
+        destination(df_daemon);
+};
+
+############################################################
+## kern.log
+
+filter f_kern { facility(kern); };
+destination df_kern { file("/var/log/kern.log"); };
+log {
+        source(s_all);
+        filter(f_kern);
+        destination(df_kern);
+};
+
+############################################################
+## user.log
+
+filter f_user { facility(user); };
+destination df_user { file("/var/log/user.log"); };
+log {
+        source(s_all);
+        filter(f_user);
+        destination(df_user);
+};
+
+############################################################
+## sympa.log
+
+filter f_sympa { program("^(sympa|bounced|archived|task_manager)"); };
+destination d_sympa { file("/var/log/sympa.log"); };
+log {
+        source(s_all);
+        filter(f_sympa);
+        destination(d_sympa);
+        flags(final);
+};
+
+############################################################
+## wwsympa.log
+
+filter f_wwsympa { program("^wwsympa"); };
+destination d_wwsympa { file("/var/log/wwsympa.log"); };
+log {
+        source(s_all);
+        filter(f_wwsympa);
+        filter(f_strip);
+        destination(d_wwsympa);
+        flags(final);
+};
+
+############################################################
+## ldap.log
+
+filter f_ldap { program("slapd"); };
+destination d_ldap { file("/var/log/ldap.log"); };
+log {
+        source(s_all);
+        filter(f_ldap);
+        destination(d_ldap);
+        flags(final);
+};
+
+############################################################
+## postfix.log
+
+# special source because of chroot jail
+#source s_postfix { unix-stream("/var/spool/postfix/dev/log" keep-alive(yes)); };
+filter f_postfix { program("^postfix/"); };
+destination d_postfix { file("/var/log/postfix.log"); };
+log {
+        source(s_all);
+        filter(f_postfix);
+        filter(f_strip);
+        destination(d_postfix);
+        flags(final);
+};
+
+############################################################
+## courier.log
+
+filter f_courier { program("courier|imap|pop"); };
+destination d_courier { file("/var/log/courier.log"); };
+log {
+        source(s_all);
+        filter(f_courier);
+        filter(f_strip);
+        destination(d_courier);
+        flags(final);
+};
+
+############################################################
+## maildrop.log
+
+filter f_maildrop { program("^maildrop"); };
+destination d_maildrop { file("/var/log/maildrop.log"); };
+log {
+        source(s_all);
+        filter(f_maildrop);
+        destination(d_courier);
+        flags(final);
+};
+
+############################################################
+## mail.log
+
+filter f_mail { facility(mail); };
+destination df_mail { file("/var/log/mail.log"); };
+
+log {
+        source(s_all);
+        filter(f_mail);
+        destination(df_mail);
+};
+
+############################################################
+## messages.log
+
+filter f_messages {
+        level(debug,info,notice)
+        and not facility(auth,authpriv,daemon,mail,user,kern);
+};
+destination df_messages { file("/var/log/messages.log"); };
+log {
+        source(s_all);
+        filter(f_messages);
+        destination(df_messages);
+};
+
+############################################################
+## errors.log
+
+filter f_errors {
+        level(warn,err,crit,alert,emerg)
+        and not facility(auth,authpriv,daemon,mail,user,kern);
+};
+destination df_errors { file("/var/log/errors.log"); };
+log {
+        source(s_all);
+        filter(f_errors);
+        destination(df_errors);
+};
+
+############################################################
+## emergencies
+
+filter f_emerg { level(emerg); };
+destination du_all { usertty("*"); };
+log {
+        source(s_all);
+        filter(f_emerg);
+        destination(du_all);
+};
+
+############################################################
+## console messages
+
+filter f_xconsole {
+        facility(daemon,mail)
+        or level(debug,info,notice,warn)
+        or (facility(news)
+        and level(crit,err,notice));
+};
+destination dp_xconsole { pipe("/dev/xconsole"); };
+log {
+        source(s_all);
+        filter(f_xconsole);
+        destination(dp_xconsole);
+};
+
diff -uNr orig/syslog-ng-1.6.7/src/cfg-grammar.y new/syslog-ng-1.6.7/src/cfg-grammar.y
--- orig/syslog-ng-1.6.7/src/cfg-grammar.y 2004-09-17 04:21:06.000000000 -0500
+++ new/syslog-ng-1.6.7/src/cfg-grammar.y 2005-05-30 18:25:40.826858634 -0500
@@ -89,7 +89,7 @@
 %token KW_REMOVE_IF_OLDER KW_LOG_PREFIX KW_PAD_SIZE
 /* filter items*/
-%token KW_FACILITY KW_LEVEL KW_NETMASK KW_HOST KW_MATCH
+%token KW_FACILITY KW_LEVEL KW_NETMASK KW_HOST KW_MATCH KW_STRIP KW_REPLACE

 /* yes/no switches */
 %token KW_YES KW_NO
@@ -669,6 +669,8 @@
         | KW_NETMASK '(' string ')' { $$ = make_filter_netmask($3); free($3); }
         | KW_HOST '(' string ')' { $$ = make_filter_host($3); free($3); }
         | KW_MATCH '(' string ')' { $$ = make_filter_match($3); free($3); }
+        | KW_STRIP '(' string ')' { $$ = make_filter_strip($3); free($3); }
+        | KW_REPLACE '(' string string ')' { $$ = make_filter_replace($3,$4); free($3); free($4); }
         | KW_FILTER '(' string ')' { $$ = make_filter_call($3); free($3); }
         ;

diff -uNr orig/syslog-ng-1.6.7/src/cfg-lex.l new/syslog-ng-1.6.7/src/cfg-lex.l
--- orig/syslog-ng-1.6.7/src/cfg-lex.l 2005-05-30 18:27:50.829842715 -0500
+++ new/syslog-ng-1.6.7/src/cfg-lex.l 2005-05-30 18:25:40.827858450 -0500
@@ -140,6 +140,8 @@
         { "netmask", KW_NETMASK },
         { "host", KW_HOST },
         { "match", KW_MATCH },
+        { "strip", KW_STRIP },
+        { "replace", KW_REPLACE },

         /* on/off switches */
         { "yes", KW_YES },
diff -uNr orig/syslog-ng-1.6.7/src/filters.c new/syslog-ng-1.6.7/src/filters.c
--- orig/syslog-ng-1.6.7/src/filters.c 2004-01-13 12:08:02.000000000 -0600
+++ new/syslog-ng-1.6.7/src/filters.c 2005-05-30 18:25:40.827858450 -0500
@@ -163,6 +163,7 @@
      (name filter_expr_re)
      (super filter_expr_node)
      (vars
+       (replace string)
        (regex special-struct regex_t #f free_regexp)))
 */

@@ -226,6 +227,78 @@
   return &self->super;
 }

+struct filter_expr_node *make_filter_strip(const char *re)
+{
+  if (strcasecmp(re,"ips") == 0)
+    return make_filter_replace(re,"0.0.0.0");
+  else
+    return make_filter_replace(re,"----");
+}
+
+#define FMIN(a,b) (a)<(b) ? (a):(b)
+
+static int do_filter_replace(struct filter_expr_node *c,
+                             struct log_filter *rule UNUSED,
+                             struct log_info *log)
+{
+  CAST(filter_expr_re, self, c);
+  char * buffer = log->msg->data;
+  int snippet_size;
+  regmatch_t pmatch;
+  char new_msg[2048];
+  char * new_msg_max = new_msg+2048;
+  char * new_msg_ptr = new_msg;
+  int replace_length = strlen(self->replace->data);
+
+  int error = regexec(&self->regex, buffer, 1, &pmatch, 0);
+  if (error != 0) return 1;
+  while (error==0) {
+    /* copy string snippet which preceeds matched text */
+    snippet_size = FMIN(pmatch.rm_so, new_msg_max-new_msg_ptr);
+    memcpy(new_msg_ptr, buffer, snippet_size);
+    new_msg_ptr += snippet_size;
+
+    /* copy replacement string */
+    snippet_size = FMIN(replace_length, new_msg_max-new_msg_ptr);
+    memcpy(new_msg_ptr, self->replace->data, snippet_size);
+    new_msg_ptr += snippet_size;
+
+    /* search for next match */
+    buffer += pmatch.rm_eo;
+    error = regexec (&self->regex, buffer, 1, &pmatch, REG_NOTBOL);
+  }
+  /* copy the rest of the old msg */
+  snippet_size = FMIN(strlen(buffer),new_msg_max-new_msg_ptr);
+  memcpy(new_msg_ptr, buffer, snippet_size);
+  new_msg_ptr += snippet_size;
+
+  ol_string_free(log->msg);
+  log->msg = c_format_cstring("%s", new_msg_ptr-new_msg,new_msg);
+  return 1;
+}
+
+struct filter_expr_node *make_filter_replace(const char *re, const char *replacement)
+{
+  int regerr;
+  NEW(filter_expr_re, self);
+  self->super.eval = do_filter_replace;
+  self->replace = format_cstring(replacement);
+
+  if (strcasecmp(re,"ips") == 0) {
+    re = "(25[0-5]|2[0-4][0-9]|[0-1]?[0-9]?[0-9])([\\.\\-](25[0-5]|2[0-4][0-9]|[0-1]?[0-9]?[0-9])){3}";
+  }
+  regerr = regcomp(&self->regex, re, REG_ICASE | REG_EXTENDED);
+  if (regerr) {
+    char errorbuf[256];
+    regerror(regerr, &self->regex, errorbuf, sizeof(errorbuf));
+    werror("Error compiling regular expression: \"%z\" (%z)\n", re, errorbuf);
+    KILL(self);
+    return NULL;
+  }
+
+  return &self->super;
+}
+
 static int do_filter_prog(struct filter_expr_node *c,
                           struct log_filter *rule UNUSED,
                           struct log_info *log)
diff -uNr orig/syslog-ng-1.6.7/src/filters.h new/syslog-ng-1.6.7/src/filters.h
--- orig/syslog-ng-1.6.7/src/filters.h 2002-02-04 10:07:50.000000000 -0600
+++ new/syslog-ng-1.6.7/src/filters.h 2005-05-30 18:25:40.827858450 -0500
@@ -66,6 +66,8 @@
 struct filter_expr_node *make_filter_netmask(const char *nm);
 struct filter_expr_node *make_filter_host(const char *re);
 struct filter_expr_node *make_filter_match(const char *re);
+struct filter_expr_node *make_filter_strip(const char *re);
+struct filter_expr_node *make_filter_replace(const char *re, const char *replacement);
 struct filter_expr_node *make_filter_call(const char *name);

 #endif
--
-------------------------------------------------------------
addr://Rathausgasse 31, CH-5001 Aarau  tel://++41 62 823 9355
http://www.terreactive.com             fax://++41 62 823 9356
-------------------------------------------------------------
terreActive AG                       Wir sichern Ihren Erfolg
-------------------------------------------------------------
ADDENDUM (hit the wrong button, sorry)
I don't see the necessity to provide a keyword strip as a subset of replace. Please drop it, while referring to the equivalent lines below, written by you.
replace(ips,"0.0.0.0")    <--- this is the same as strip(ips)
replace(<regex>,"----")   <--- this is the same as strip(regex)
That's the place.
+  This patch adds the capability to syslog-ng that allows you to strip
+  out any given regexp or all IP addresses from log messages before
+  they are written to disk. The goal is to give the system administrator
+  the means to implement site logging policies, by allowing them easy
+  control over exactly what data they retain in their logfiles,
+  regardless of what a particular daemon might think is best.
This can also be done with a match and a /dev/null destination. Please be specific in what your patch achieves.
+  Data retention has become a hot legal topic for ISPs and other Online
+  Service Providers (OSPs). There are many instances where it is preferable
+  to keep less information on users than is collected by default on many
+  systems.
Over here it's more an issue of showing less information on users than is collected. When you work for the state, for banks or insurance companies, you'll notice that the wind is blowing in the other direction there. All data is to be stored, without loss, and this even under penalty. At least here in Switzerland. If you lose a message while a potential "break-in" has occurred or can be correlated, it might cost you your head :).
diff -uNr orig/syslog-ng-1.6.7/doc/syslog-ng-anon.conf new/syslog-ng-1.6.7/doc/syslog-ng-anon.conf
--- orig/syslog-ng-1.6.7/doc/syslog-ng-anon.conf 1969-12-31 18:00:00.000000000 -0600
+++ new/syslog-ng-1.6.7/doc/syslog-ng-anon.conf 2005-05-30 18:25:40.828858265 -0500
@@ -0,0 +1,243 @@
I don't think this sample file is needed.
+## sympa.log
+
+filter f_sympa { program("^(sympa|bounced|archived|task_manager)"); };
+destination d_sympa { file("/var/log/sympa.log"); };
+log {
+        source(s_all);
+        filter(f_sympa);
+        destination(d_sympa);
+        flags(final);
+};
+
+############################################################
+## wwsympa.log
+
+filter f_wwsympa { program("^wwsympa"); };
+destination d_wwsympa { file("/var/log/wwsympa.log"); };
+log {
+        source(s_all);
+        filter(f_wwsympa);
+        filter(f_strip);
+        destination(d_wwsympa);
+        flags(final);
+};
Too specific to be in a package config file as a skeleton but this is only my view. Cynically I could argue that by skimming through your sample syslog-ng.conf file you don't seem to have any of the daemons chroot()'d, yet you
+ | KW_STRIP '(' string ')' { $$ = make_filter_strip($3); free($3); }
remove
+        | KW_REPLACE '(' string string ')' { $$ = make_filter_replace($3,$4); free($3); free($4); }
         | KW_FILTER '(' string ')' { $$ = make_filter_call($3); free($3); }
         ;
diff -uNr orig/syslog-ng-1.6.7/src/cfg-lex.l new/syslog-ng-1.6.7/src/cfg-lex.l
--- orig/syslog-ng-1.6.7/src/cfg-lex.l 2005-05-30 18:27:50.829842715 -0500
+++ new/syslog-ng-1.6.7/src/cfg-lex.l 2005-05-30 18:25:40.827858450 -0500
@@ -140,6 +140,8 @@
         { "netmask", KW_NETMASK },
         { "host", KW_HOST },
         { "match", KW_MATCH },
+        { "strip", KW_STRIP },
remove
+struct filter_expr_node *make_filter_strip(const char *re)
+{
+  if (strcasecmp(re,"ips") == 0)
+    return make_filter_replace(re,"0.0.0.0");
+  else
+    return make_filter_replace(re,"----");
+}
+
remove
+#define FMIN(a,b) (a)<(b) ? (a):(b)
+
+static int do_filter_replace(struct filter_expr_node *c,
+                             struct log_filter *rule UNUSED,
+                             struct log_info *log)
+{
+  CAST(filter_expr_re, self, c);
+  char * buffer = log->msg->data;
+  int snippet_size;
+  regmatch_t pmatch;
+  char new_msg[2048];
+  char * new_msg_max = new_msg+2048;
+  char * new_msg_ptr = new_msg;
+  int replace_length = strlen(self->replace->data);
+
+  int error = regexec(&self->regex, buffer, 1, &pmatch, 0);
+  if (error != 0) return 1;
+  while (error==0) {
+    /* copy string snippet which preceeds matched text */
+    snippet_size = FMIN(pmatch.rm_so, new_msg_max-new_msg_ptr);
+    memcpy(new_msg_ptr, buffer, snippet_size);
+    new_msg_ptr += snippet_size;
+
+    /* copy replacement string */
+    snippet_size = FMIN(replace_length, new_msg_max-new_msg_ptr);
+    memcpy(new_msg_ptr, self->replace->data, snippet_size);
+    new_msg_ptr += snippet_size;
+
+    /* search for next match */
+    buffer += pmatch.rm_eo;
+    error = regexec (&self->regex, buffer, 1, &pmatch, REG_NOTBOL);
+  }
+  /* copy the rest of the old msg */
+  snippet_size = FMIN(strlen(buffer),new_msg_max-new_msg_ptr);
+  memcpy(new_msg_ptr, buffer, snippet_size);
+  new_msg_ptr += snippet_size;
+
+  ol_string_free(log->msg);
+  log->msg = c_format_cstring("%s", new_msg_ptr-new_msg,new_msg);
+  return 1;
+}
+
+struct filter_expr_node *make_filter_replace(const char *re, const char *replacement)
+{
+  int regerr;
+  NEW(filter_expr_re, self);
+  self->super.eval = do_filter_replace;
+  self->replace = format_cstring(replacement);
+
+  if (strcasecmp(re,"ips") == 0) {
+    re = "(25[0-5]|2[0-4][0-9]|[0-1]?[0-9]?[0-9])([\\.\\-](25[0-5]|2[0-4][0-9]|[0-1]?[0-9]?[0-9])){3}";
+  }
remove, also because not all IPs are logged in dotted decimals for example.
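As a rough illustration of the point about non-dotted addresses, the C sketch below renders one IPv4 address as a dotted quad, as hexadecimal and as a plain decimal integer, then checks each form against the "ips" pattern. Only the dotted form matches, so the other two would pass through the filter untouched. The helper names matches_ips() and render_forms() and the sample address are illustrative assumptions, and the separator is written [.-] here rather than [\\.\\-].

```c
#include <inttypes.h>
#include <regex.h>
#include <stdint.h>
#include <stdio.h>

/* One IPv4 octet, 0..255, as written in the patch. */
#define OCTET "(25[0-5]|2[0-4][0-9]|[0-1]?[0-9]?[0-9])"
#define IPS_PATTERN OCTET "([.-]" OCTET "){3}"

/* Hypothetical helper: 1 if IPS_PATTERN matches anywhere in s,
 * 0 if it does not, -1 if the pattern fails to compile. */
static int matches_ips(const char *s)
{
    regex_t re;
    int hit;

    if (regcomp(&re, IPS_PATTERN, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    hit = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return hit;
}

/* Hypothetical helper: write a host-order IPv4 address in three common
 * textual forms. Each output buffer must hold at least n bytes. */
static void render_forms(uint32_t addr, char *dotted, char *hex, char *dec,
                         size_t n)
{
    snprintf(dotted, n, "%u.%u.%u.%u",
             (unsigned)(addr >> 24) & 0xffu, (unsigned)(addr >> 16) & 0xffu,
             (unsigned)(addr >> 8) & 0xffu, (unsigned)addr & 0xffu);
    snprintf(hex, n, "0x%08" PRIx32, addr);   /* e.g. 0xc3c50649 */
    snprintf(dec, n, "%" PRIu32, addr);       /* plain 32-bit integer */
}
```

For the address 0xC3C50649, render_forms() produces "195.197.6.73" as the dotted form, and matches_ips() returns 1 for that string but 0 for the hexadecimal and decimal renderings.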
+  regerr = regcomp(&self->regex, re, REG_ICASE | REG_EXTENDED);
+  if (regerr) {
+    char errorbuf[256];
+    regerror(regerr, &self->regex, errorbuf, sizeof(errorbuf));
+    werror("Error compiling regular expression: \"%z\" (%z)\n", re, errorbuf);
+    KILL(self);
+    return NULL;
+  }
+
+  return &self->super;
+}
+
 static int do_filter_prog(struct filter_expr_node *c,
                           struct log_filter *rule UNUSED,
                           struct log_info *log)
diff -uNr orig/syslog-ng-1.6.7/src/filters.h new/syslog-ng-1.6.7/src/filters.h
--- orig/syslog-ng-1.6.7/src/filters.h 2002-02-04 10:07:50.000000000 -0600
+++ new/syslog-ng-1.6.7/src/filters.h 2005-05-30 18:25:40.827858450 -0500
@@ -66,6 +66,8 @@
 struct filter_expr_node *make_filter_netmask(const char *nm);
 struct filter_expr_node *make_filter_host(const char *re);
 struct filter_expr_node *make_filter_match(const char *re);
+struct filter_expr_node *make_filter_strip(const char *re);
remove
+struct filter_expr_node *make_filter_replace(const char *re, const char *replacement);
 struct filter_expr_node *make_filter_call(const char *name);
Best regards, Roberto Nibali, ratz -- ------------------------------------------------------------- addr://Rathausgasse 31, CH-5001 Aarau tel://++41 62 823 9355 http://www.terreactive.com fax://++41 62 823 9356 ------------------------------------------------------------- terreActive AG Wir sichern Ihren Erfolg -------------------------------------------------------------
Roberto Nibali wrote:
The attached patch comes from http://dev.riseup.net/patches/syslog-ng
Gives you a 404 at first until you click on login.
Sorry, this was temporarily misdirected.
what it does is provide a simple filter to strip out unwanted regular expressions from logs...
.... Bad idea not least because the logic of hiding data should be in the frontend and/or the extraction process (ETL) and not in the data storage. On a central syslog server you'd like to have data mining theories applied for example, where you need the whole set of raw data, unfiltered. Well, only partially unfiltered, since one will certainly apply filters in their log statements.
I very much agree, it would be ideal to handle this problem elsewhere--but it would be a lot more work. The problem with the front end approach is that it would be very difficult to write patches for all the many daemons one might run. The problem with the post-processing and log scrubbing approach is that the data will likely sit around for many hours or days. You are right: this patch hurts log processing. You lose data. It is a trade-off between privacy and analysis. However, an administrator should be able to make this choice if they feel that it is more important to not retain sensitive data than it is to have a full history of everything logged.
Method 1: have log statements which omit certain log lines, and don't set a catchall log statement
Method 2: build a filter for lines you'd like to match and forget. Add a destination statement with /dev/null as file destination.
Method 3: strip the lines.
Methods 1 and 2 drop information, but basically maintain their value of truth. Method 3 changes the information gain and thus, strictly speaking, dilutes the truth. Dealing with the legal aspects of information gain/loss with regard to dilution is a delicate matter.
[snip]... When you work for the state, for banks or insurance companies, you'll notice that the wind is blowing in the other direction there. All data is to be stored, without loss, and under penalty even. At least here in Switzerland. If you lose a message while a potential "break-in" has occurred or can be correlated, it might cost you your head :).
A delicate matter indeed! It is my understanding that there are legal problems with such modification of logs in France, the UK, and maybe Switzerland(?). I defer to the lawyers. The EFF seems to think that this 'dilution' is (a) legal in the U.S. and (b) advisable. (http://eff.org is the major civil liberties internet watchdog in the US). Method 1 and 2 are great, but most of the time there is still very useful information in logs even after extensive stripping. For example, suppose a log file of login attempts: username, ip, and if the attempt was successful. Even if you removed username and ip, it is very useful to know if there is a spike in failed login attempts, for example.
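To make the login example concrete, here is a minimal standalone sketch of the replace idea using only POSIX regex.h. It is not the patch's code: the names redact, append_bounded and IPV4_PATTERN are made up for illustration, and the output is written to a caller-supplied buffer instead of a syslog-ng message object. The pattern is the IPv4 expression from the patch with the separator class written as [.-].

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical name: the patch's IPv4 expression, separator written as [.-]. */
const char IPV4_PATTERN[] =
    "(25[0-5]|2[0-4][0-9]|[0-1]?[0-9]?[0-9])"
    "([.-](25[0-5]|2[0-4][0-9]|[0-1]?[0-9]?[0-9])){3}";

/* Append up to n bytes of src at dst[len], always leaving room for the
 * terminating NUL inside a buffer of cap bytes. Returns the new length. */
static size_t append_bounded(char *dst, size_t len, size_t cap,
                             const char *src, size_t n)
{
    size_t room = cap - 1 - len;
    if (n > room)
        n = room;               /* truncate silently instead of overflowing */
    memcpy(dst + len, src, n);
    return len + n;
}

/* Replace every match of pattern (POSIX extended, case-insensitive) in msg
 * with replacement. The result in out is always NUL-terminated.
 * Returns 0 on success, -1 if the pattern does not compile. */
int redact(const char *pattern, const char *replacement,
           const char *msg, char *out, size_t out_size)
{
    regex_t re;
    regmatch_t m;
    size_t len = 0;
    int eflags = 0;

    assert(out != NULL && out_size > 0);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE) != 0)
        return -1;

    while (*msg != '\0' && regexec(&re, msg, 1, &m, eflags) == 0) {
        /* text before the match, then the replacement */
        len = append_bounded(out, len, out_size, msg, (size_t)m.rm_so);
        len = append_bounded(out, len, out_size,
                             replacement, strlen(replacement));
        if (m.rm_eo > m.rm_so) {
            msg += m.rm_eo;
        } else if (msg[m.rm_so] == '\0') {
            msg += m.rm_so;     /* empty match at the very end: done */
            break;
        } else {
            /* empty match: copy one byte so the loop always advances */
            len = append_bounded(out, len, out_size, msg + m.rm_so, 1);
            msg += m.rm_so + 1;
        }
        eflags = REG_NOTBOL;    /* we are no longer at the start of the line */
    }
    len = append_bounded(out, len, out_size, msg, strlen(msg));
    out[len] = '\0';
    regfree(&re);
    return 0;
}
```

Calling redact(IPV4_PATTERN, "0.0.0.0", "sshd: failed login for alice from 192.168.10.25 port 22", buf, sizeof buf) leaves "sshd: failed login for alice from 0.0.0.0 port 22" in buf, so the failed attempt can still be counted although the address is gone.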
I don't see the necessity to provide a keyword strip as a subset of replace. Please drop it.
ok. It was included for historical reasons (a previous patch only did 'strip').
I don't think this sample file is needed.
I agree, it is incomplete and should not be included.
+  if (strcasecmp(re,"ips") == 0) {
+    re = "(25[0-5]|2[0-4][0-9]|[0-1]?[0-9]?[0-9])([\\.\\-](25[0-5]|2[0-4][0-9]|[0-1]?[0-9]?[0-9])){3}";
+  }
remove, also because not all IPs are logged in dotted decimals for example.
Do you mean that it should also support IPv6? I am happy to include this in an update to the patch. It can get complex. Here is an example IPv6 regexp: http://blogs.msdn.com/mpoulson/archive/2005/01/10/350037.aspx
Const strIPv6Pattern as string = "\A(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}\z"
Const strIPv6Pattern_HEXCompressed as string = "\A((?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4})*)?)::((?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4})*)?)\z"
Const StrIPv6Pattern_6Hex4Dec as string = "\A((?:[0-9A-Fa-f]{1,4}:){6,6})(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}\z"
Const StrIPv6Pattern_Hex4DecCompressed as string = "\A((?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4})*)?) ::((?:[0-9A-Fa-f]{1,4}:)*)(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}\z"
The tricky part is that you can mix decimal IPv4 with hex IPv6, and leave out multiple blocks of 0's, but not more than once. Anyone have a more elegant expression? -elijah
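One possible alternative to a single large expression, offered only as a sketch and not as anything the patch does: split the message into candidate tokens and let inet_pton() decide whether a token is an address. inet_pton() already understands compressed and mixed IPv4-in-IPv6 notation. The names redact_addresses, address_family and append_bounded are hypothetical, and replacing IPv6 addresses with "::" is an assumption made here by analogy with 0.0.0.0. A known limitation of this sketch: an IPv4 address glued to a port, such as 10.0.0.1:22, is not recognized.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <ctype.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Append up to n bytes of src at dst[len], keeping room for the NUL. */
static size_t append_bounded(char *dst, size_t len, size_t cap,
                             const char *src, size_t n)
{
    size_t room = cap - 1 - len;
    if (n > room)
        n = room;
    memcpy(dst + len, src, n);
    return len + n;
}

/* A candidate token is a maximal run of letters, digits, ':' and '.'. */
static int is_token_char(unsigned char ch)
{
    return isalnum(ch) || ch == ':' || ch == '.';
}

/* Returns AF_INET or AF_INET6 if the n bytes at tok parse as an address,
 * 0 otherwise. */
static int address_family(const char *tok, size_t n)
{
    char tmp[INET6_ADDRSTRLEN];
    struct in6_addr addr;       /* large enough for an IPv4 result as well */

    if (n == 0 || n >= sizeof tmp)
        return 0;
    memcpy(tmp, tok, n);
    tmp[n] = '\0';
    if (inet_pton(AF_INET, tmp, &addr) == 1)
        return AF_INET;
    if (inet_pton(AF_INET6, tmp, &addr) == 1)
        return AF_INET6;
    return 0;
}

/* Copy msg to out, replacing IPv4 addresses with "0.0.0.0" and IPv6
 * addresses with "::". Returns the number of addresses replaced. */
size_t redact_addresses(const char *msg, char *out, size_t out_size)
{
    size_t len = 0, count = 0;

    assert(out != NULL && out_size > 0);
    while (*msg != '\0') {
        size_t n = 0;
        while (is_token_char((unsigned char)msg[n]))
            n++;
        if (n == 0) {           /* delimiter: copy it through unchanged */
            len = append_bounded(out, len, out_size, msg, 1);
            msg++;
            continue;
        }
        /* Try the whole token, then retry without trailing '.' or ':'
         * so that an address ending a sentence is still found. */
        size_t keep = n;
        int family = address_family(msg, keep);
        while (family == 0 && keep > 0 &&
               (msg[keep - 1] == '.' || msg[keep - 1] == ':')) {
            keep--;
            family = address_family(msg, keep);
        }
        if (family != 0) {
            const char *zero = (family == AF_INET) ? "0.0.0.0" : "::";
            len = append_bounded(out, len, out_size, zero, strlen(zero));
            len = append_bounded(out, len, out_size, msg + keep, n - keep);
            count++;
        } else {
            len = append_bounded(out, len, out_size, msg, n);
        }
        msg += n;
    }
    out[len] = '\0';
    return count;
}
```

Because whole tokens are tested, strings such as "std::string" or a timestamp like "12:30:45" are left alone, while "2001:db8::1" and "::ffff:192.0.2.7" are each replaced with "::".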
Hello
.... Bad idea not least because the logic of hiding data should be in the frontend and/or the extraction process (ETL) and not in the data storage. On a central syslog server you'd like to have data mining theories applied for example, where you need the whole set of raw data, unfiltered. Well, only partially unfiltered, since one will certainly apply filters in their log statements.
I very much agree, it would be ideal to handle this problem elsewhere--but it would be a lot more work.
I don't know, really. From your webpage I learn that you also have similar patches for other system "close" tools. So my first thought was: "is he really going to patch each and every tool out there that stores malign data"?
The problem with the front end approach is that it would be very difficult to write patches for all the many daemons one might run.
See, this is called problem shifting. It is not the responsibility of the different tools' authors but that of the company gluing them together into a product they sell.

Example: Say you are an ISP and want to provide your customers with a simple monitoring framework where they can observe their servers, browse certain post-processed log files and generate alerts or pager alarms based on configurable triggers. This is a fairly common ISP service nowadays. From the ISP's point of view, you've got all the data to provide and to help with eventual forensics. As the provider of the monitoring software, you are responsible for stripping out the information that has legal impact when presented to your customers. As such, the application running as the front end must have the appropriate means to instrument the information. This solves two issues from a business point of view:

o You have a certain base USP in that you can sell a product which does something more than just display data in a 1:1 mapping
o You, as the business, are responsible for complying with certain acts, laws and regulations given by the authoritative force in your geographical location.

This means the ISP, in our case, is responsible for data integrity and for information handling and disclosure. This takes the responsibility away from the tools' developers, who most of the time are not under the direct control of the company. There are more points to consider, but they are far too off-topic for this mailing list. You can contact me privately regarding those points.
The problem with the post-processing and log scrubbing approach is that the data will likely sit around for many hours or days.
It's part of the security concept of OSPs/ISPs to maintain an accurate enough security policy regarding data handling and disclosure. It's not the task of each individual tool to define and adapt corporate governance in the field of IT security.
You are right: this patch hurts log processing. You lose data. It is a
Losing data is one thing, yes, but intended obfuscation is a legal matter ;). I know that my statement is maybe a bit too strong an argument to have practical consequences.
trade-off between privacy and analysis. However, an administrator should be able to make this choice if they feel that it is more important to not retain sensitive data than it is to have a full history of everything logged.
The driving force behind those "papers of suggestion or common practice" regarding data retention was not administrators but companies running a business in these fields. As such, the administrator is only one part of the decision chain in a firm and will certainly have to comply with corporate security guidelines, where data protection and disclosure must be handled.
[snip]... When you work for the state, for banks or insurance companies, you'll notice that the wind is blowing in the other direction there. All data is to be stored, without loss, and under penalty even. At least here in Switzerland. If you lose a message while a potential "break-in" has occurred or can be correlated, it might cost you your head :).
A delicate matter indeed! It is my understanding that there are legal problems with such modification of logs in France, the UK, and maybe Switzerland(?).
I would assume so, but I'd need to ask a lawyer.
I defer to the lawyers. The EFF seems to think that this 'dilution' is (a) legal in the U.S. and (b) advisable.
From the information point of view this makes sense; from a business model point of view this is a drawback.
(http://eff.org is the major civil liberties internet watchdog in the US).
... with far too little money to have an important influence on the IT market in the US, I believe ...
Method 1 and 2 are great, but most of the time there is still very useful information in logs even after extensive stripping. For example, suppose a log file of login attempts: username, ip, and if the attempt was successful. Even if you removed username and ip, it is very useful to know if there is a spike in failed login attempts, for example.
Absolutely, but what are you going to write in your executive summary? Last month we observed an unusual spike in failed login attempts to our foobar server (used for financial transactions) in week 19, between Friday and Saturday night. Due to data retention reasons (EFF) we do not have any IPs logged. We are thus not certain whether this constitutes a criminal act (a hacking attempt) or whether our application's unit tests, which also need to connect to this live database container, have gone wild.
ok. It was included for historical reasons (a previous patch only did 'strip').
Excellent. Redo your patch and I'd say this has a good chance of inclusion because it does have a valid use case, at least in the US and for people who see data retention from the administrator's point of view.
I agree, it is incomplete and should not be included.
You have excellent documentation online anyway. Debian folks will probably take your sample file :).
remove, also because not all IPs are logged in dotted decimals for example.
Do you mean that it should also support IPv6? I am happy to include this in an update to the patch.
Excellent.
It can get complex. Here is an example IPv6 regexp: http://blogs.msdn.com/mpoulson/archive/2005/01/10/350037.aspx
Const strIPv6Pattern as string = "\A(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}\z"
Const strIPv6Pattern_HEXCompressed as string = "\A((?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4})*)?)::((?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4})*)?)\z"
Const StrIPv6Pattern_6Hex4Dec as string = "\A((?:[0-9A-Fa-f]{1,4}:){6,6})(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}\z"
Const StrIPv6Pattern_Hex4DecCompressed as string = "\A((?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4})*)?) ::((?:[0-9A-Fa-f]{1,4}:)*)(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}\z"
To be honest, I cannot verify the correctness of those regexps, partly because I am unwilling to spend the necessary time and partly because I'm not that proficient with regexps.
The tricky part is that you can mix decimal IPv4 with hex IPv6, and leave out multiple blocks of 0's, but not more than once. Anyone have a more elegant expression?
Thank you for your valuable comments. Best regards, Roberto Nibali, ratz -- echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc