First let me phrase the proper question:
Q: There is a file containing more than a million points (x,y) each of which represents a star. There is a planet earth
Your algorithm is correct. Just remember that time complexity of your program is O(n . log 100 ) = O(n), unless number of closest points to find can vary.
To elaborate on the MaxHeap solution you would build a max-heap with the first k elements from the file ( k = 100 in this case ).
The key for the max-heap would be its distance from Earth (a,b). Distance between 2 points on a 2d plane can be calculated using:
dist = (x1,y1) to (x2,y2) = square_root((x2 - x1)^2 + (y2 - y1)^2);
This would take O(k) time to construct. For every subsequent element from k to n. ie (n - k) elements you need to fetch its distance from earth and compare it with the top of max-heap. If the new element to be inserted is closer to earth than the top of the max-heap, replace the top of the max-heap and call heapify on the new root of the heap.
This would take O((n-k)logk) time to complete. Finally we would be left with just the k elements in the max-heap. You can call heapify k times to return all these k elements. This is another O(klogk).
Overall time complexity would be O(k + (n-k)logk + klogk).
It's a famous question and there has been lot's of solution for that: http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
if you did not find it useful, there are some other resources such as Rurk's computational geometry book.
import sys,os,csv
iFile=open('./file_copd.out','rU')
earth = [0,0]
##getDistance return distance given two stars
def getDistance(star1,star2):
return sqrt((star1[0]-star2[0])**2 +(star1[1]-star2[1])**2 )
##diction dict_galaxy looks like this {key,distance} key is the seq assign to each star, value is a list [distance,its cordinance]
##{1,[distance1,[x,y]];2,[distance2,[x,y]]}
dict_galaxy={}
#list_galaxy=[]
count = 0
sour=iFile.readlines()
for line in sour:
star=line.split(',') ##Star is a list [x,y]
dict_galaxy[count]=[getDistance(earth,star),star]
count++
###Now sort this dictionary based on their distance, and return you a list of keys.
list_sorted_key = sorted(dict_galaxy,key=lambda x:dict_galaxy[x][0])
print 'is this what you want %s'%(list_sorted_key[:100].to_s)
iFile.close()
You can actually do this in time O(n) and space O(k), where k is the number of closest points that you want, by using a very clever trick.
The selection problem is as follows: Given an array of elements and some index i, rearrange the elements of the array such that the ith element is in the right place, all elements smaller than the ith element are to the left, and all elements greater than the ith element are to the right. For example, given the array
40 10 00 30 20
If I tried to select based on index 2 (zero-indexed), one result might be
10 00 20 40 30
Since the element at index 2 (20) is in the right place, the elements to the left are smaller than 20, and the elements to the right are greater than 20.
It turns out that since this is a less strict requirement than actually sorting the array, it's possible to do this in time O(n), where n is the number of elements of the array. Doing so requires some complex algorithms like the median-of-medians algorithm, but is indeed O(n) time.
So how do you use this here? One option is to load all n elements from the file into an array, then use the selection algorithm to select the top k in O(n) time and O(n) space (here, k = 100).
But you can actually do better than this! For any constant k that you'd like, maintain a buffer of 2k elements. Load 2k elements from the file into the array, then use the selection algorithm to rearrange it so that the smallest k elements are in the left half of the array and the largest are in the right, then discard the largest k elements (they can't be any of the k closest points). Now, load k more elements from the file into the buffer and do this selection again, and repeat this until you've processed every line of the file. Each time you do a selection you discard the largest k elements in the buffer and retain the k closest points you've seen so far. Consequently, at the very end, you can select the top k elements one last time and find the top k.
What's the complexity of the new approach? Well, you're using O(k) memory for the buffer and the selection algorithm. You end up calling select on a buffer of size O(k) a total of O(n / k) times, since you call select after reading k new elements. Since select on a buffer of size O(k) takes time O(k), the total runtime here is O(n + k). If k = O(n) (a reasonable assumption), this takes time O(n), space O(k).
Hope this helps!