问题
I have written a class, ProxyFinder which connects to random ips and first pings them, and if they respond, attempts to create a http proxy connection through common proxy ports.
Currently, it is set up just connecting to random ips. This is relatively fast, discovering a few proxys an hour. However, I would like to somehow check if I have already previously connected to an ip. First I tried keeping them in a list, but that was using over 10GB of ram.. I included a method that I tried in the code below which writes the data to a cache using a RandomAccessFile, but this is incredibly slow to search through the entire file for each connection as it gets larger.
I am storing the data in as small of format as possible, simply four bytes for each ip. Even though, this is 4 * 256 * 256 *256 * 256 bytes.. = 16gb of raw ram.. or a 16gb file to search each time you want to test another ip.
I also tried creating a separate thread to generate ips, check them against the file, and then add them to a queue that the probe threads could pull from. It could not keep up with the probe threads either.
How can I quickly check if I have already connected to an IP or not, without being incredibly slow or using ridiculous amounts of memory?
package net;
import java.io.File;
import java.io.RandomAccessFile;
import java.net.HttpURLConnection;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;
/**
*
* @author Colby
*/
public class ProxyFinder {
/**
* @param args the command line arguments
*/
public static void main(String[] args) throws Exception {
int[] ports = {
1080, 3128, 3128, 8080
};
System.out.println("Starting network probe");
AtomicInteger counter = new AtomicInteger();
for (int i = 0; i < 500; i++) {
new Thread(() -> {
do {
try {
byte[] addrBytes = randomAddress();//could be getNextAddress also
if (addrBytes == null) {
break;
}
InetAddress addr = InetAddress.getByAddress(addrBytes);
if (ping(addr)) {
float percent = (float) ((counter.get() / (256f * 256f * 256f * 256f)) * 100F);
if (counter.incrementAndGet() % 10000 == 0) {
System.out.println("Searching " + percent + "% network search");
}
for (int port : ports) {
try {
Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(addr, port));
HttpURLConnection con = (HttpURLConnection) new URL("http://google.com").openConnection(proxy);
con.setConnectTimeout(1000);
con.setReadTimeout(1000);
con.setRequestMethod("GET");
con.setRequestProperty("User-Agent", "Mozilla/5.0");
con.getContent();
con.disconnect();
System.out.println("Proxy found!" + addr.getHostAddress() + ":" + port + " Found at " + percent + "% network search");
} catch (Exception e) {
}
}
//
//System.out.println("Ping response: --" + addr.getHostAddress() + "-- Attempt: " + counter.get() + " Percent: " + percent + "%");
} else {
//System.out.println("Ping response failed: " + addr.getHostAddress() + " attempt " + counter.incrementAndGet());
}
} catch (Exception e) {
//e.printStackTrace();
}
} while (true);
}).start();
}
}
private static RandomAccessFile cache;
private static byte[] getNextAddress() throws Exception {
if (cache == null) {
cache = new RandomAccessFile(File.createTempFile("abc", ".tmp"), "rw");
}
byte[] check;
checkFile:
{
byte[] addr = new byte[4];
do {
check = randomAddress();
inner:
{
cache.seek(0);
while (cache.length() - cache.getFilePointer() > 0) {
cache.readFully(addr);
if (Arrays.equals(check, addr)) {
break inner;
}
}
cache.write(check);
break checkFile;
}
} while (true);
}
return check;
}
private static byte[] randomAddress() {
return new byte[]{(byte) (Math.random() * 256), (byte) (Math.random() * 256), (byte) (Math.random() * 256), (byte) (Math.random() * 256)};
}
private static boolean ping(InetAddress addr) throws Exception {
return addr.isReachable(500);
}
}
Also in case anyone is wondering, I've had this running for 12 hours now and it's discovered about 50 proxys, and pinged about 2.09664E-4% of the ip range which is about 1.2 million ips. not bad for the bandwidth allocated (0.5Mbps)
EDIT: I am starting to think that maybe the overhead of storing and checking all of these IPs would be even greater than simply connecting to many duplicates near the end of searching the ip range..
回答1:
I would not store the whole IP address because of the amount of data. To store them in a array of BitSet
would consume less memory.
edit previous code version removed, it was not correct
The version below generates random addresses and persist them in a file. If the persistence file of a previous run is found, the information of the seen addresses is restored from that file.
Following case was not handled correctly in the initial version:
assuming that no address was already seen
1.0.0.1 - seen false
2.0.0.2 - seen false
2.0.0.1 - seen true, which was wrong and is correctly handled by code below
See the comments in the code for further information.
public class KeepSeenAddresses {
static final int FILE_BUFFER_SIZE = 81_920;
static final int RANGES_SIZE = 256;
// to store 256 ranges of 255*255*255+1 addresses
static BitSet[] ranges;
// Random(1) is taken only for demonstration purpose, so the second
// application run will find the same seen addresses from previous run
static Random random = new Random(1);
// for normal use it's better to have better randomness
//static Random random = new Random(System.currentTimeMillis());
public static void main(String[] args)
throws IOException, ClassNotFoundException {
if (!readRanges()) {
initRanges();
}
// this case was failing in the initial solution
// uncomment this block to see how all edge cases
// which where mentioned in other comments are handled
/*
byte[][] addresses = {
{1, 0, 0, 1},
{2, 0, 0, 2},
{2, 0, 0, 1},
{1, 2, 3, 4},
{4, 3, 2, 1},
{(byte)128, 0, 0, 0},
{(byte)255, (byte)255, (byte)255, (byte)255}
};
seenAddress(addresses[0]);
seenAddress(addresses[1]);
seenAddress(addresses[3]);
seenAddress(addresses[5]);
seenAddress(addresses[6]);
for (byte[] addressBytes : addresses) {
System.out.printf("seen %s before: %s%n",
prettyAddress(addressBytes),
seenBefore(addressBytes)
);
}
*/
processAddresses();
persistRanges();
}
/**
* Read the seen addresses from a file.
*
* @return <code>true</code> if the file was found and has the expected
* number of ranges, otherwise <code>false</code>
* @throws IOException
* @throws ClassNotFoundException
*/
private static boolean readRanges() throws IOException, ClassNotFoundException {
File rangesStore = new File("addresses.bin");
if (!rangesStore.exists()) {
return false;
}
System.out.print("found previous rangesStore... ");
try (ObjectInputStream ois = new ObjectInputStream(
new BufferedInputStream(
new FileInputStream(rangesStore), FILE_BUFFER_SIZE
)
)) {
ranges = (BitSet[]) ois.readObject();
}
if (ranges.length != RANGES_SIZE) {
System.out.printf("wrong size of rangesStore: expected %d"
+ " found: %d%n", RANGES_SIZE, ranges.length);
return false;
} else {
System.out.printf("restored ranges: %d%n", ranges.length);
return true;
}
}
/**
* Initialize the address ranges array. All address flags will be set to
* <code>false</code>.
*/
private static void initRanges() {
System.out.print("initialize new rangesStore... ");
ranges = new BitSet[RANGES_SIZE];
for (int i = 0; i < RANGES_SIZE; i++) {
BitSet bitSet = new BitSet(255 * 255 * 255 + 1);
for (int j = 0; j < 255 * 255 * 255 + 1; j++) {
bitSet.clear(j);
}
ranges[i] = bitSet;
}
System.out.printf("initialized ranges: %d%n", RANGES_SIZE);
}
/**
* For demonstration purpose.<br>
* Generates some random IPv4 addresses. If the address was not seen before
* the flag for this address will be set to <code>true</code>.
*/
private static void processAddresses() {
for (int i = 0; i < 10; i++) {
byte[] addrBytes = randomAddress();
boolean seenBefore = seenBefore(addrBytes);
if (!seenBefore) {
seenAddress(addrBytes);
seenBefore = false;
}
System.out.printf("seen %s before: %s%n",
prettyAddress(addrBytes),
seenBefore
);
}
}
/**
* Persist the address ranges array. The file size is around 500MB.
*
* @throws IOException
*/
private static void persistRanges() throws IOException {
System.out.print("persist rangesStore... ");
try (ObjectOutputStream oos = new ObjectOutputStream(
new BufferedOutputStream(
new FileOutputStream("addresses.bin"), FILE_BUFFER_SIZE)
)) {
oos.writeObject(ranges);
}
System.out.printf("written ranges: %d%n", ranges.length);
}
/**
* Keep a flag which address has been seen already.
*
* @param addrBytes IPv4 address in four bytes
*/
static void seenAddress(byte[] addrBytes) {
int rangeIndex = (int) addrBytes[0] & 0xff;
int rangeOffset = ((int) addrBytes[1] & 0xff * 0xffff)
+ ((int) addrBytes[2] & 0xff * 0xff)
+ ((int) addrBytes[3] & 0xff);
ranges[rangeIndex].set(rangeOffset);
}
/**
* Check if the passed address was seen before.
*
* @param addrBytes IPv4 address in four bytes
* @return <code>true</code> if the address was seen before, otherwise
* <code>false</code>
*/
static boolean seenBefore(byte[] addrBytes) {
int rangeIndex = (int) addrBytes[0] & 0xff;
int rangeOffset = ((int) addrBytes[1] & 0xff * 0xffff) + ((int) addrBytes[2] & 0xff * 0xff) + ((int) addrBytes[3] & 0xff);
return ranges[rangeIndex].get(rangeOffset);
}
/**
* Convert the IPv4 address into pretty string.
*
* @param addrBytes IPv4 address in four bytes
* @return pretty String of the IPv4 address
*/
static String prettyAddress(byte[] addrBytes) {
return String.format("%03d.%03d.%03d.%03d",
(int) addrBytes[0] & 0xff,
(int) addrBytes[1] & 0xff,
(int) addrBytes[2] & 0xff,
(int) addrBytes[3] & 0xff);
}
/**
* Generate a random IPv4 address.
*
* @return four bytes of a random generated IPv4 address
*/
private static byte[] randomAddress() {
byte[] bytes = new byte[4];
for (int i = 0; i < bytes.length; i++) {
bytes[i] = (byte) random.nextInt(256);
}
return bytes;
}
}
回答2:
I have ported code from another solution here to fit this problem: Java- Mapping multi-dimensional arrays to single
The answer to the above question gives an in depth explanation of how the following code works. If anyone else would like to post a more in depth answer on this thread I will award it the answer.
static BitSet set;
static int pos(int i, int j, int k, int m) {
return ((256*256*256) * i) + ((256*256) * j) + (256 * k) + m;
}
static boolean get(byte[] addr) {
return set.get(pos(addr[0], addr[1], addr[2], addr[3]));
}
static void set(byte[] addr, boolean flag) {
set.set(pos(addr[0], addr[1], addr[2], addr[3]), flag);
}
回答3:
Use a data base like MySql and hibernarte with level 1 & 2 cache.
It will be faster than RAM if u configure a cache with hibernate and tune ur db to use few gb of cache too. i think they all do. can configure an external cache like ehcahe when an be configured to live on another process + file with limits on size and time. Db knows how to index and seek things faster than even pure RAM - at such sizes as your IP
Plus you can improve by partitioning table data & index by first char, 2nd char etc
来源:https://stackoverflow.com/questions/29009921/java-proxy-discovering-bot