I have a table with the following structure:
CREATE TABLE `geo_ip` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`start_ip` int(10) unsigned NOT NULL,
`end_ip
I've just run into the same problem. Since nobody answered the "WHY", and I figured it out, I'll write here an explanation for all future readers.
First, let's dissect the query.
where 2393196360 between start_ip and end_ip
really means
where start_ip <= C and end_ip >= C
(C being the constant, 2393196360 here), so the engine will first use the index on (start_ip, end_ip) to fetch all rows for which start_ip is less than or equal to C, and then keep only those rows for which end_ip is also greater than or equal to C.
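As a quick sanity check of that rewrite, here is a minimal sketch that runs both forms of the predicate against a toy table. It uses Python's built-in sqlite3 module rather than MySQL, and the three ranges and the value of C are made up for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE geo_ip ("
    "id INTEGER PRIMARY KEY, start_ip INTEGER NOT NULL, end_ip INTEGER NOT NULL)"
)
# Hypothetical toy data: three non-overlapping ranges, not real GeoIP rows.
conn.executemany(
    "INSERT INTO geo_ip (start_ip, end_ip) VALUES (?, ?)",
    [(0, 99), (100, 199), (200, 299)],
)

C = 150  # stand-in for a constant such as 2393196360

# Form 1: the BETWEEN predicate as written in the original query.
between_rows = conn.execute(
    "SELECT id FROM geo_ip WHERE ? BETWEEN start_ip AND end_ip", (C,)
).fetchall()

# Form 2: the pair of inequalities it is equivalent to.
explicit_rows = conn.execute(
    "SELECT id FROM geo_ip WHERE start_ip <= ? AND end_ip >= ?", (C, C)
).fetchall()

print(between_rows, explicit_rows)
```

Both queries return the same single row (the 100-199 range), which is all the rewrite claims.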
When the engine looks for start_ip <= C, and C is big enough that most or all start_ip values are less than or equal to C, this "first pass" returns a lot of rows. This happens every time C is an IP at the higher end of the IP range.
Now, here's the main thing to realise: our dataset is built so that each start_ip has exactly one end_ip value, and that end_ip is guaranteed to be lower than the next record's start_ip. We are partitioning a range, and the partitions do not overlap. But in the general case, for two arbitrary table columns, this does not have to be the case!
So, after the "first pass", the engine has to look through ALL records that match start_ip <= C to make sure that they also match end_ip >= C, despite the index. Having end_ip as part of the compound index does not do much in our case; it would help only if we had multiple end_ip values for each start_ip value, but we only have one.
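To put rough numbers on the "first pass", the sketch below counts how many rows satisfy start_ip <= C alone, and how many survive once end_ip >= C is applied too. It is only an illustration: it uses SQLite through Python's sqlite3 module, the 10,000 ranges are synthetic, and the helper name first_pass_and_final is hypothetical. It counts candidate rows; it does not measure what the MySQL optimizer actually does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE geo_ip ("
    "id INTEGER PRIMARY KEY, start_ip INTEGER NOT NULL, end_ip INTEGER NOT NULL)"
)
# Hypothetical data: 10,000 non-overlapping ranges of width 10
# (0-9, 10-19, ..., 99990-99999), mimicking the shape of an IP-range table.
conn.executemany(
    "INSERT INTO geo_ip (start_ip, end_ip) VALUES (?, ?)",
    [(i * 10, i * 10 + 9) for i in range(10_000)],
)


def first_pass_and_final(c):
    """Return (rows with start_ip <= c, rows that also have end_ip >= c)."""
    candidates = conn.execute(
        "SELECT COUNT(*) FROM geo_ip WHERE start_ip <= ?", (c,)
    ).fetchone()[0]
    final = conn.execute(
        "SELECT COUNT(*) FROM geo_ip WHERE start_ip <= ? AND end_ip >= ?",
        (c, c),
    ).fetchone()[0]
    return candidates, final


low = first_pass_and_final(5)        # C near the bottom of the range
high = first_pass_and_final(99_995)  # C near the top of the range
print(low, high)
```

With C near the bottom, one row passes the first condition; with C near the top, all 10,000 do, even though only one row is the answer in both cases.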
To give you an example, pretend that the columns were populated with the following data:
start_ip end_ip
1 10001
1 10002
1 10003
------------
2 10001
2 10002
2 10003
------------
...
------------
9999 10001
9999 10002
9999 10003
If you ran a query with start_ip <= 10000 AND end_ip >= 10000, notice that ALL rows match the expression.
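The example data can be rebuilt and checked in a few lines. This sketch assumes the "..." in the listing stands for every start_ip from 3 to 9998, each paired with the same three end_ip values; the table name ranges is made up, and SQLite (via Python's sqlite3) stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ranges (start_ip INTEGER NOT NULL, end_ip INTEGER NOT NULL)"
)
# Assumption: start_ip runs over 1..9999 and each value is paired with
# end_ip 10001, 10002 and 10003, as the elided listing suggests.
conn.executemany(
    "INSERT INTO ranges (start_ip, end_ip) VALUES (?, ?)",
    [(s, e) for s in range(1, 10_000) for e in (10001, 10002, 10003)],
)

total = conn.execute("SELECT COUNT(*) FROM ranges").fetchone()[0]
matching = conn.execute(
    "SELECT COUNT(*) FROM ranges WHERE start_ip <= 10000 AND end_ip >= 10000"
).fetchone()[0]

print(total, matching)
```

Under that assumption, the two counts are equal: every row satisfies both conditions, so neither column of the index can narrow the result.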
On the other hand, in our case, with our IP-ranges dataset, we have the guarantee that at most ONE record will match any start_ip <= C AND end_ip >= C expression, thanks to the way the IP data is structured: specifically, the record with the biggest value for start_ip among all those that match start_ip <= C. That's why adding ORDER BY and LIMIT 1 works in this case, and is the cleanest solution, in my opinion.
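For reference, here is a minimal sketch of what the ORDER BY plus LIMIT 1 form could look like. The data is synthetic, the index name idx_start_ip and the helper lookup are hypothetical, and SQLite (through Python's sqlite3) is used in place of MySQL, so it shows the shape and result of the query rather than MySQL's execution plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE geo_ip ("
    "id INTEGER PRIMARY KEY, start_ip INTEGER NOT NULL, end_ip INTEGER NOT NULL)"
)
# Hypothetical index name; the point is only that start_ip is indexed.
conn.execute("CREATE INDEX idx_start_ip ON geo_ip (start_ip)")
# Synthetic non-overlapping ranges: 0-9, 10-19, ..., 9990-9999.
conn.executemany(
    "INSERT INTO geo_ip (start_ip, end_ip) VALUES (?, ?)",
    [(i * 10, i * 10 + 9) for i in range(1_000)],
)


def lookup(c):
    """Return the (id, start_ip, end_ip) row covering c, or None."""
    return conn.execute(
        "SELECT id, start_ip, end_ip FROM geo_ip "
        "WHERE start_ip <= ? AND end_ip >= ? "
        "ORDER BY start_ip DESC LIMIT 1",
        (c, c),
    ).fetchone()


row = lookup(4242)
print(row)
```

Walking start_ip in descending order means the first row that also satisfies end_ip >= C is the answer, and LIMIT 1 lets the engine stop there.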
Edit: I've just noticed that adding the ORDER BY start_ip DESC and LIMIT 1 clauses may not be enough in some cases. If you run the query with a value that is not covered by any range in your data, for instance a loopback or private IP like 127.0.0.1 or 192.168.*, the engine will still look at all records that match the start_ip <= C expression, and the query will be slow. That's because no record matches the second part of the expression (end_ip >= C), so the LIMIT 1 clause never kicks in.
The solution I've found is to construct the query with a join, so as to force the engine to first grab the record with the biggest value for start_ip where start_ip <= C, and only then check whether end_ip is also >= C. Like this:
SELECT *
FROM
( SELECT id FROM geo_ip WHERE start_ip <= C ORDER BY start_ip DESC LIMIT 1 ) limit_ip
INNER JOIN geo_ip ON limit_ip.id = geo_ip.id
WHERE geo_ip.end_ip >= C
This query performs a single lookup, whether or not the specific IP C is covered by the ranges in the table, and it only requires a single index on start_ip (as well as id as the primary key).
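Here is a small sketch that exercises the join form against a table with a deliberate gap, to show that it returns the right row for a covered value and nothing for an uncovered one. The three ranges, the index name idx_start_ip and the helper lookup are made up for illustration, and SQLite (via Python's sqlite3) stands in for MySQL, so this checks the results, not the timing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE geo_ip ("
    "id INTEGER PRIMARY KEY, start_ip INTEGER NOT NULL, end_ip INTEGER NOT NULL)"
)
conn.execute("CREATE INDEX idx_start_ip ON geo_ip (start_ip)")  # hypothetical name
# Hypothetical toy ranges with a gap: nothing covers 200-299.
conn.executemany(
    "INSERT INTO geo_ip (start_ip, end_ip) VALUES (?, ?)",
    [(0, 99), (100, 199), (300, 399)],
)


def lookup(c):
    """Pick the row with the biggest start_ip <= c, then check its end_ip."""
    return conn.execute(
        "SELECT geo_ip.start_ip, geo_ip.end_ip "
        "FROM (SELECT id FROM geo_ip WHERE start_ip <= ? "
        "      ORDER BY start_ip DESC LIMIT 1) limit_ip "
        "INNER JOIN geo_ip ON limit_ip.id = geo_ip.id "
        "WHERE geo_ip.end_ip >= ?",
        (c, c),
    ).fetchone()


covered = lookup(150)    # inside the 100-199 range
uncovered = lookup(250)  # falls in the gap between 199 and 300
print(covered, uncovered)
```

For the uncovered value, the subquery still picks exactly one candidate (the 100-199 row), and the outer end_ip check rejects it, so no further rows need to be examined.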