Question
I want to monitor a specific folder. Every new file in this folder should be scanned for URLs. These URLs should be edited, if the domain is not in a defined whitelist.
Example:
blabla http://www.black.com/green/yellow.html blabla
sdfsdfsdfsdf http://www.white.com/red.html
Whitelist:
http://www.white.com
Result:
blabla httx://www.black.com/green/yellow.html blabla
sdfsdfsdfsdf http://www.white.com/red.html
What I have tried so far is iwatch with this XML:
<?xml version="1.0" ?>
<!DOCTYPE config SYSTEM "/etc/iwatch/iwatch.dtd" >
<config>
<guard email="root@localhost" name="IWatch"/>
<watchlist>
<title>URL_Filter</title>
<contactpoint email="admin@test.com" name="Administrator"/>
<path type="single" syslog="on" alert="off" events="create" exec="sed -i 's/http/httx/g' %f">/var/test</path>
</watchlist>
</config>
So with iwatch I can watch the folder "/var/test" for new files. With the sed command I can replace every "http" with "httx". But I have no idea how to add a whitelist so that some URLs are not replaced.
Edit: Additional information: I want to edit all incoming Postfix mails so that they contain no clickable links, except links to domains that are on the whitelist. The reason is to protect against phishing mails.
Return-Path: <example@gmail.com>
X-Original-To: example@test.de
Delivered-To: example@test.de
Received: from mail-lf0-x236.google.com (mail-lf0-x236.google.com [IPv6:2a00:1450:4010:c07::236])
by xxxxxxx.hosteurope.de (Postfix) with ESMTPS id D255223CB59
for <example@test.de>; Mon, 11 Apr 2016 14:44:10 +0200 (CEST)
Received: by mail-lf0-x236.google.com with SMTP id c126so154788483lfb.2
for <example@test.de>; Mon, 11 Apr 2016 05:39:20 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
d=gmail.com; s=20120113;
h=mime-version:date:message-id:subject:from:to;
bh=WwH+NIkCWDEoIkwbeCI4pf0jP0ya/ctbQ81pUsA4G7s=;
b=ZS3Uo/cpVGNw3k38Js2+/DxVda0y2136oy4D4hsR0G25x2UjhyVU/yUcPl6qEdxt8i
CQXZHQbaf8pzCdDaSq4VL9RC/sIgZy3PQzj6Cyrp3WTi6SMmQ65NwNBWLVGnpPcuzNW1
IGC5N3rjj96ndYUAxia/tTcBX7ajS3Tw9Mc8yIaO13hSXMUCrTDIFZNzHR1ib7tLDpmX
6EVyFhquhIfJVOhcuPgWUUxHly/FmZ++ucoHR0Yozj+dc1GJ6/ZYzUAPdGICelDY7ieG
nvA7KH6+v6/zoWlbfkO9BmGzAPs6M4LGHilOjpMf/09Z2oMiV/WRDxe0WrCebQptpm2c
xHPg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
d=1e100.net; s=20130820;
h=x-gm-message-state:mime-version:date:message-id:subject:from:to;
bh=WwH+NIkCWDEoIkwbeCI4pf0jP0ya/ctbQ81pUsA4G7s=;
b=hAOSzKjertcsQIT/PHoZKsiKxLba8gaKOCmyNg7nmiPJjCWqobNvM5nf3sZP1Xhysi
gGdvk9mmMugII8dsjc7mRhDkbCT1QKVz/0UBQ+CaP6sK7kGdWfdarphGgzUGA6Il5JZi
lP4DpEQHUpG1wJ1r+dN2f+UT8tyfIwapXwo3g7FnkPLxmCq9CeqJeRlagL6vAacon8z7
CjdTHB7fzEtYToSp+cDi3+yK4zS9p4rwF4H4Ds3bJqwM/PrcFJW0YYncDHdra5TwYf6U
K6VRX19iUhQT4kTVFCtoNW9SU8Ri+Rc5VfvVTKRh4KwZ2uW5x8y07ucB0vZcAQdEnms4
AWnQ==
X-Gm-Message-State: AD7BkJJEDmk9P+Kzcn1MT4lQxpU1aYU6x8uABSpohCbT7EeOFAXjT1y6n3sFcRj7tcfWc6eBAOL6bJ78jvVOlQ==
MIME-Version: 1.0
X-Received: by 10.112.63.196 with SMTP id i4mr8426739lbs.93.1460378359811;
Mon, 11 Apr 2016 05:39:19 -0700 (PDT)
Received: by 10.114.66.51 with HTTP; Mon, 11 Apr 2016 05:39:19 -0700 (PDT)
Date: Mon, 11 Apr 2016 14:39:19 +0200
Message-ID: <CADF5gVU+C4BZCSFSiWeiBipBnDu5jTU+FVmLJbSQSbtMM9JZcQ@mail.gmail.com>
Subject: test
From: Example <example@gmail.com>
To: example@test.de
Content-Type: multipart/alternative; boundary=001a1133d4405fd878053034d55a
X-Scanned-By: MIMEDefang 2.71 on 5.38.258.144
--001a1133d4405fd878053034d55a
Content-Type: text/plain; charset=UTF-8
http://www.example.com
http://www.white.com
--001a1133d4405fd878053034d55a
Content-Type: text/html; charset=UTF-8
<div dir="ltr"><div><a href="http://www.example.com">http://www.example.com</a><br></div><a href="http://www.white.com">http://www.white.com</a><br></div>
--001a1133d4405fd878053034d55a--
Answer 1:
Just realized the bash script is unnecessary. The following one-liner does the job, although it is really cryptic to read:
Input data:
$ cat data
sdfsdfsdfsdf http://www.whitedomain.com/red.html
bla http://www.black.com/green/yellow.html blabla
sdfsdfsdfsdf http://www.white.com/red.html
$ cat whitelist
http://www.white.com
http://www.whitedomain.com
$
Final Output:
$ sed -r '/'"$(sed -r 's/\\/\\\\/g;s/\//\\\//g;s/\^/\\^/g;s/\[/\\[/g;s/'\''/'\'"\\\\"\'\''/g;s/\]/\\]/g;s/\*/\\*/g;s/\$/\\$/g;s/\./\\./g' whitelist | paste -s -d '|')"'/! s/http/httx/g' data
sdfsdfsdfsdf http://www.whitedomain.com/red.html
bla httx://www.black.com/green/yellow.html blabla
sdfsdfsdfsdf http://www.white.com/red.html
$
Explanation:
The output of the inner subshell command is a regex (used to filter out lines during the sed substitution command):
$ sed -r 's/\\/\\\\/g;s/\//\\\//g;s/\^/\\^/g;s/\[/\\[/g;s/'\''/'\'"\\\\"\'\''/g;s/\]/\\]/g;s/\*/\\*/g;s/\$/\\$/g;s/\./\\./g' whitelist | paste -s -d '|'
http:\/\/www\.white\.com|http:\/\/www\.whitedomain\.com
Flow:
- Form the regex dynamically with the inner subshell command, escaping all metacharacters with sed and then piping the result to paste to add the alternations.
- Use that output in the outer sed command to select the lines that contain none of the whitelisted domains, and on those lines substitute http with httx.
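The two steps above can be sketched as a pair of small shell functions, which may be easier to follow than the one-liner. This is only a sketch under some assumptions: the names build_whitelist_regex and defang_unlisted_lines are made up for illustration, sed -E is used in place of -r, and a single bracket expression escapes a slightly different set of characters than the chain of substitutions in the original command.

```shell
#!/usr/bin/env bash
# Sketch only: both function names are hypothetical.

# Step 1: escape regex metacharacters (and the "/" delimiter) on every
# whitelist line, then join the lines with "|" to form one alternation.
build_whitelist_regex() {
  sed -e 's,[][\\.^$*+?(){}|/],\\&,g' "$1" | paste -s -d '|' -
}

# Step 2: on lines that match none of the whitelist entries, turn every
# "http" into "httx". Usage: defang_unlisted_lines WHITELIST DATA
defang_unlisted_lines() {
  local whitelist_file=$1 data_file=$2 regex
  regex=$(build_whitelist_regex "$whitelist_file")
  if [ -z "$regex" ]; then
    # Empty whitelist: nothing is exempt, so defang every line.
    sed 's/http/httx/g' "$data_file"
    return
  fi
  sed -E "/${regex}/"'! s/http/httx/g' "$data_file"
}
```

With the data and whitelist files shown above, `defang_unlisted_lines whitelist data` should print the same three lines as the one-liner.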
Edit 1: Since sed is line-oriented, you will have to transform the data into lines of text like this:
$ cat data1
<div dir="ltr"><div><a href="http://www.white.com">http://www.white.com</a><br></div><a href="http://www.example.com">http://www.example.com</a><br></div>
$ cat whitelist
http://www.white.com
http://www.whitedomain.com
$ sed 's/</\n</g' data1 | sed -r '/'"$(sed -r 's/\\/\\\\/g;s/\//\\\//g;s/\^/\\^/g;s/\[/\\[/g;s/'\''/'\'"\\\\"\'\''/g;s/\]/\\]/g;s/\*/\\*/g;s/\$/\\$/g;s/\./\\./g' whitelist | paste -s -d '|')"'/! s/http/httx/g'
<div dir="ltr">
<div>
<a href="http://www.white.com">http://www.white.com
</a>
<br>
</div>
<a href="httx://www.example.com">httx://www.example.com
</a>
<br>
</div>
$
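Because the /regex/! address skips a whole line as soon as it contains one whitelisted entry, a non-whitelisted URL that shares a line with a whitelisted one is left untouched. One possible alternative, which is not part of the answer above, is to defang every URL first and then restore only the whitelisted ones. The sketch below assumes one whitelist entry per line, each starting with http, and a sed that supports -E. The names escape_ere, build_defang_script and defang_urls are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: all three function names are hypothetical helpers.

# Escape ERE metacharacters and the "/" delimiter in the text read from stdin.
escape_ere() {
  sed -e 's,[][\\.^$*+?(){}|/],\\&,g'
}

# Print a sed script: defang every scheme, then restore whitelisted prefixes.
build_defang_script() {
  local whitelist_file=$1 entry rest
  printf '%s\n' 's/http(s?:\/\/)/httx\1/g'
  while IFS= read -r entry || [ -n "$entry" ]; do
    [ -n "$entry" ] || continue
    # Drop the leading "http" so the pattern can match the defanged "httx".
    rest=$(printf '%s\n' "${entry#http}" | escape_ere)
    # Restore only when the prefix is not followed by more host characters,
    # ":" or "@", so that www.white.com.evil.example stays defanged.
    printf 's/httx(%s)([^A-Za-z0-9.:@-]|$)/http\\1\\2/g\n' "$rest"
  done < "$whitelist_file"
}

# Usage: defang_urls WHITELIST FILE   (writes the result to stdout)
defang_urls() {
  sed -E -e "$(build_defang_script "$1")" "$2"
}
```

The boundary check is deliberately conservative: a whitelisted URL with an explicit port, or one followed directly by a period, stays defanged. Text that already contains httx:// followed by a whitelisted host would be turned back into http://, so this is a starting point rather than a complete filter.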
Answer 2:
You can use Perl to do that. I recommend installing the Regexp::Common package from CPAN and using Regexp::Common::URI to find the URIs, then maintain a whitelist of host names and check those. It's a bit long for a one-liner though.
use strict;
use warnings;
use Regexp::Common qw /URI/;
my %whitelist = (
'http://www.white.com' => 1,
'http://www.example.org' => 1,
);
while (my $line = <>) {
    MATCH: foreach my $match ($line =~ /($RE{URI}{HTTP})/g) {
        # check the whitelist (keys only; \Q...\E treats each entry as literal text)
        next MATCH if grep { $match =~ /^\Q$_\E/i } keys %whitelist;
        # no whitelist entry, replace
        my $match_updated = $match;
        $match_updated =~ s/^http/httx/;
        # \Q...\E keeps regex metacharacters in the URL (".", "?", "+") literal
        $line =~ s/\Q$match\E/$match_updated/;
    }
    print $line;
}
Save that under a meaningful name, maybe remove_phishing_links.pl, in a directory that iwatch can access. I'm using ~ here, but I have no clue if that would work. You would then call it in your iwatch file with something like this:
<path
type="single"
syslog="on"
alert="off"
events="create"
exec="perl -i ~/remove_phishing_links.pl %f">/var/test</path>
It will, just like the sed command, edit the file in %f in place. It reads line by line, finds http URIs, checks if they start with any of the whitelist entries, and if not, replaces the http with httx.
Note that this will not work for base64 encoded MIME emails, or if there are line breaks within the URIs.
If you don't want to install Regexp::Common, you can also borrow the regular expression for URIs from the URI module documentation on CPAN and alter it to only find https?.
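To get a feel for what a pattern restricted to https? picks up, a rough check can be run from the shell before touching any mail. The pattern below is a simplified approximation chosen for illustration, not the regular expression from the URI module documentation, and list_urls is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: list_urls is a hypothetical helper.
# Print each http/https URL found on stdin, one per line. A URL is taken to
# end at whitespace, a double quote, "<" or ">", which is only an approximation.
list_urls() {
  grep -oE 'https?://[^[:space:]"<>]+'
}
```

Running something like `list_urls < /var/test/somefile` (the file name is a placeholder) lists the candidates that a whitelist check would then have to look at.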
Source: https://stackoverflow.com/questions/36543377/find-and-replace-urls-in-postfix-files-linux-ubuntu