You are given an image of a surface photographed by a satellite. The image is a bitmap where water is marked by '.' and land is marked by '*'. Adjacent groups of '*' cells form an island; the task is to count the islands.
Without any a priori knowledge about the nature of the islands, the algorithm cannot be made more efficient than O(n) in time. Memory-wise, however, your algorithm can be improved: the visited array is simply redundant. Here is a quick attempt (pardon the use of ASCII arithmetic; not very readable, but quicker to code):
#include <stdio.h>

/* Grid is COLS cells wide; each row string also holds a terminating NUL. */
#define COLS 11

int main(void)
{
    char map[][COLS + 1] = {
        "*..........",
        "**........*",
        "...........",
        "...*.......",
        "*........*.",
        "..........*"
    };
    int rows = sizeof(map) / sizeof(map[0]);
    int i, j;
    int main_count = 0;

    /* ASCII arithmetic: '.' == 46 and '*' == 42, so
       '.' - '*' == 4 detects "water here, land to the right" (a run start),
       '*' - '.' - '.' == -50 means both cells above are water, and
       '*' - '.' - '.' - '.' == -96 means all three cells above are water. */
    if (map[0][0] == '*') {
        main_count++;
    }
    for (j = 0; j < COLS - 1; j++) {
        /* first row: every run of land is a new island */
        if ((map[0][j] - map[0][j + 1]) == 4) {
            main_count++;
        }
    }
    for (i = 1; i < rows; ++i) {
        /* land in column 0 with water above and above-right */
        if (map[i][0] == '*' && (map[i][0] - map[i - 1][0] - map[i - 1][1]) == -50) {
            main_count++;
        }
        for (j = 0; j < COLS - 1; j++) {
            if ((map[i][j] - map[i][j + 1]) == 4) {  /* a run starts at column j+1 */
                if (j == COLS - 2 && (map[i][j + 1] - map[i - 1][j] - map[i - 1][j + 1]) == -50) {
                    main_count++;  /* run starts in the last column: only two cells above */
                }
                if (j != COLS - 2 && (map[i][j + 1] - map[i - 1][j] - map[i - 1][j + 1] - map[i - 1][j + 2]) == -96) {
                    main_count++;  /* water above-left, above and above-right */
                }
            }
        }
    }
    printf("Number of islands: %d\n", main_count);
    return 0;
}
I think the confusion here is that your algorithm does actually run in linear time, not quadratic time.
When using big-O notation, n stands for the input size. Your input here is not just r or just c, but rather r * c, as it is a grid of cells. Your algorithm runs in O(r * c), as you said in your question... thus your algorithm runs in O(n), which is linear time.
It seems to me that any algorithm that solves this problem will have to read each input cell at least once in the worst case, so the best running time you can hope for is O(n). As your algorithm runs in O(n), no algorithm can run in a faster order, in the worst case, than the one you proposed.
I can think of some clever tricks. For example, if you have a block of *s, you could, in certain cases, only check the diagonals. That is, if you have
......
.****.
.****.
.****.
.****.
......
it won't matter if you only read these cells:
......
.*.*..
..*.*.
.*.*..
..*.*.
......
unless, for example, you have something in the bottom-left-most corner, in which case you would need to read that bottom-left-most *. So maybe in certain cases your algorithm can run more quickly, but for the worst case (which is what O measures), it will have to be O(n).
EDIT: Also, even in the case where you only read half the nodes, the run-time would be O(n/2), which is still of the same order (O(n)).
Asymptotically, your approach is the best possible: O(n).
However, I noticed a couple of things:
First:
inside the function markVisited you check a cell's neighbors in the order:
down, right, up, left, down-right, up-left, up-right, down-left
A better order would be:
left, right, up-left, up, up-right, down-left, down, down-right
This will play nicer with the cache, since it starts by reading values directly next to the current position, and sequentially within a given row.
(Note that starting with the diagonals would mark a larger number of locations as visited, but since the visited check is only the last check, it wouldn't save any time.)
Secondly:
In the worst case of a satellite image that contains only land, your algorithm will visit each cell multiple times (roughly once for each neighbor the cell has).
This means you are doing approximately eight times more array accesses than necessary.
I believe that you can solve this problem with a more or less linear pass over the data.
I'm currently thinking of an approach; if it works, I'll add the code as an edit.
1. Create an undirected graph where each island node connects to its neighboring island nodes.
2. While there are unvisited nodes: pick an unvisited node, traverse its whole connected component (e.g. with DFS or BFS), mark each node in it as visited, and add one to the island count.

Done.
Both (1) and (2) take O(n) time.
This is highly related to connected component labeling; the number of connected components is just a byproduct of the labeling. The algorithm described in the linked Wikipedia article runs in linear time.