Roberto Nibali wrote:
The attached patch comes from http://dev.riseup.net/patches/syslog-ng
Gives you a 404 at first until you click on login.
Sorry, this was temporarily misdirected.
What it does is provide a simple filter to strip text matching unwanted regular expressions out of logs...
... Bad idea, not least because the logic of hiding data should be in the frontend and/or the extraction process (ETL) and not in the data storage. On a central syslog server you'd want to apply data mining techniques, for example, and those need the whole set of raw data, unfiltered. Well, only partially unfiltered, since one will certainly apply filters in their log statements.
I very much agree, it would be ideal to handle this problem elsewhere, but it would be a lot more work. The problem with the front end approach is that it would be very difficult to write patches for all of the many daemons one might run. The problem with the post-processing and log scrubbing approach is that the data will likely sit around for many hours or days. You are right: this patch hurts log processing. You lose data. It is a trade-off between privacy and analysis. However, an administrator should be able to make this choice if they feel that it is more important to not retain sensitive data than it is to have a full history of everything logged.
Method 1: have log statements which omit certain log lines, and don't set a catchall log statement
Method 2: build a filter for lines you'd like to match and forget. Add a destination statement with /dev/null as file destination.
Method 3: strip the lines.
Methods 1 and 2 drop information, but basically maintain their value of truth. Method 3 changes the information gain and thus, strictly speaking, dilutes the truth. Dealing with the legal aspects of information gain/loss with regard to dilution is a delicate matter.
[snip]... When you work for the state, for banks or insurance companies, you'll notice that there the wind is blowing in the other direction. All data is to be stored, without loss, and this under penalty even. At least here in Switzerland. If you lose a message while a potential "break-in" has occurred or can be correlated it might cost you your head :).
A delicate matter indeed! It is my understanding that there are legal problems with such modification of logs in France, the UK, and maybe Switzerland(?). I defer to the lawyers. The EFF seems to think that this 'dilution' is (a) legal in the U.S. and (b) advisable. (http://eff.org is the major civil liberties internet watchdog in the US). Methods 1 and 2 are great, but they discard whole lines, and most of the time there is still very useful information in a log line even after extensive stripping. For example, take a log file of login attempts that records username, IP, and whether the attempt was successful. Even with username and IP removed, it is still very useful to be able to see a spike in failed login attempts.
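To illustrate the point about stripped lines still being countable, here is a minimal sketch of in-place redaction. It is not the code from the patch: redact() is a hypothetical helper built on the POSIX regcomp()/regexec() calls, and the pattern and placeholder in the usage note are made up for the example.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not the patch's code: copy `line` into `out`,
 * replacing every match of the POSIX extended regex `pattern` with `repl`.
 * Returns the number of replacements, or -1 if the pattern does not
 * compile or the result does not fit in `out`. */
int redact(const char *pattern, const char *repl,
           const char *line, char *out, size_t outsz)
{
    regex_t re;
    regmatch_t m;
    size_t used = 0, rest, rlen = strlen(repl);
    int count = 0;

    if (outsz == 0 || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    while (*line != '\0' &&
           regexec(&re, line, 1, &m, count ? REG_NOTBOL : 0) == 0) {
        size_t pre = (size_t)m.rm_so;
        size_t mlen = (size_t)(m.rm_eo - m.rm_so);

        if (mlen == 0)          /* stop on an empty match to avoid looping */
            break;
        if (used + pre + rlen >= outsz) {
            regfree(&re);
            return -1;
        }
        memcpy(out + used, line, pre);   /* text before the match */
        used += pre;
        memcpy(out + used, repl, rlen);  /* placeholder instead of the match */
        used += rlen;
        line += m.rm_eo;
        count++;
    }

    rest = strlen(line);
    if (used + rest >= outsz) {
        regfree(&re);
        return -1;
    }
    memcpy(out + used, line, rest + 1);  /* remaining text plus terminator */
    regfree(&re);
    return count;
}
```

Called with the pattern "[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+" and the placeholder "[ip]", the line "login failed for bob from 10.0.0.7" becomes "login failed for bob from [ip]". The line itself survives, so failed logins can still be counted per hour even though the address is gone.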
I don't see the necessity to provide a keyword strip as a subset of replace. Please drop it.
OK. It was included for historical reasons (a previous patch only did 'strip').
I don't think this sample file is needed.
I agree, it is incomplete and should not be included.
+ if (strcasecmp(re,"ips") == 0) {
+   re = "(25[0-5]|2[0-4][0-9]|[0-1]?[0-9]?[0-9])([\\.\\-](25[0-5]|2[0-4][0-9]|[0-1]?[0-9]?[0-9])){3}";
+ }
Remove, also because not all IPs are logged in dotted decimal, for example.
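The objection can be checked directly. The sketch below compiles the "ips" expression from the patch and tries it on a few address notations. It assumes the expression ends up in POSIX regcomp() with REG_EXTENDED; contains_ipv4() is a hypothetical helper for this example, not a function from the patch.

```c
#include <regex.h>
#include <stddef.h>

/* The "ips" expression from the patch, exactly as the C string literal
 * would be handed to regcomp(). Note that inside a POSIX bracket
 * expression a backslash is literal, so under regcomp() the separator
 * class accepts '.', '-' and '\'. */
static const char IPS_RE[] =
    "(25[0-5]|2[0-4][0-9]|[0-1]?[0-9]?[0-9])"
    "([\\.\\-](25[0-5]|2[0-4][0-9]|[0-1]?[0-9]?[0-9])){3}";

/* Hypothetical helper: returns 1 if `text` contains a match, 0 if it
 * does not, and -1 if the pattern fails to compile. */
int contains_ipv4(const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, IPS_RE, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Dotted and dashed forms such as "192.168.0.1" and "192-168-0-1" match. The same address written as the hex value "0xC0A80001" or the plain integer "3232235521" does not, so a daemon that logs addresses that way would slip past the filter.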
Do you mean that it should also support IPv6? I am happy to include this in an update to the patch. It can get complex. Here is an example IPv6 regexp: http://blogs.msdn.com/mpoulson/archive/2005/01/10/350037.aspx
Const strIPv6Pattern as string = "\A(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}\z"
Const strIPv6Pattern_HEXCompressed as string = "\A((?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4})*)?)::((?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4})*)?)\z"
Const StrIPv6Pattern_6Hex4Dec as string = "\A((?:[0-9A-Fa-f]{1,4}:){6,6})(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}\z"
Const StrIPv6Pattern_Hex4DecCompressed as string = "\A((?:[0-9A-Fa-f]{1,4}(?::[0-9A-Fa-f]{1,4})*)?)::((?:[0-9A-Fa-f]{1,4}:)*)(25[0-5]|2[0-4]\d|[0-1]?\d?\d)(\.(25[0-5]|2[0-4]\d|[0-1]?\d?\d)){3}\z"
The tricky part is that you can mix decimal IPv4 with hex IPv6, and leave out multiple blocks of 0's, but not more than once. Anyone have a more elegant expression? -elijah
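One possible alternative to a single large expression, offered only as a sketch: cut the line into runs of characters that can occur in an address and let inet_pton() decide whether each run is a valid IPv6 address. inet_pton() already handles the mixed IPv4 notation and the rule that "::" may appear only once. contains_ipv6() and is_addr_char() are hypothetical names, and nothing here comes from the patch.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* True for characters that can appear in a textual IPv4 or IPv6 address. */
static int is_addr_char(char c)
{
    return c != '\0' && strchr("0123456789abcdefABCDEF:.", c) != NULL;
}

/* Hypothetical helper: returns 1 if `text` contains a run of address
 * characters that inet_pton() accepts as an IPv6 address, else 0. */
int contains_ipv6(const char *text)
{
    char token[INET6_ADDRSTRLEN];
    struct in6_addr addr;
    const char *p = text;

    while (*p != '\0') {
        size_t len = 0;

        while (*p != '\0' && !is_addr_char(*p))
            p++;                         /* skip to the next candidate run */
        while (is_addr_char(p[len]))
            len++;                       /* measure the run */

        if (len > 0 && len < sizeof token) {
            memcpy(token, p, len);
            token[len] = '\0';
            if (inet_pton(AF_INET6, token, &addr) == 1)
                return 1;
        }
        p += len;
    }
    return 0;
}
```

This accepts "2001:db8::1" and "::ffff:192.0.2.1" and rejects "1::2::3". It is deliberately naive: an address immediately followed by punctuation such as ':' or '.' ends up in the same run and is rejected, so a real filter would need to trim or retry such tokens.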