> I had a job interview today and was asked about the complexity of `std::set_intersection`. When I was answering I mentioned that O(n+m) is equal to O(max(n,m)), and was told that this is incorrect.
We'll show by rigorous Big-O analysis that you are indeed correct, given one possible choice of parameter of growth in your analysis. However, this does not necessarily mean that the viewpoint of the interviewer is incorrect, rather that his/her choice of parameter of growth differs. His/her claim that your answer was outright incorrect, however, is questionable: you've possibly simply used two slightly different approaches to analyzing the asymptotic complexity of `std::set_intersection`, both leading to the general consensus that the algorithm runs in linear time.
Let's start by looking at the reference for `std::set_intersection` at cppreference (emphasis mine):
> **Parameters**
>
> `first1`, `last1` - the first range of elements to examine
>
> `first2`, `last2` - the second range of elements to examine
>
> **Complexity**
>
> At most **2·(N1+N2-1)** comparisons, where
> N1 = std::distance(first1, last1)
> N2 = std::distance(first2, last2)
`std::distance` itself is naturally linear (worst case: no random-access iterators):
> **std::distance**
>
> ...
>
> Returns the number of elements between `first` and `last`.
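To make the documented bound concrete, here's a minimal sketch (the specific ranges and the counting comparator are our own illustration, not part of the standard) that counts the comparisons `std::set_intersection` actually performs and checks them against the guaranteed maximum of 2·(N1+N2-1):

```cpp
#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>

int main() {
    // Two sorted ranges, with N1 = 4 and N2 = 5 elements.
    std::vector<int> range1{1, 3, 5, 7};
    std::vector<int> range2{2, 3, 5, 8, 9};
    std::vector<int> out;

    std::size_t comparisons = 0;
    auto counting_less = [&comparisons](int a, int b) {
        ++comparisons;  // count every comparison the algorithm makes
        return a < b;
    };

    std::set_intersection(range1.begin(), range1.end(),
                          range2.begin(), range2.end(),
                          std::back_inserter(out), counting_less);

    // cppreference guarantees at most 2 * (N1 + N2 - 1) = 16 comparisons here.
    std::cout << "comparisons: " << comparisons
              << ", bound: " << 2 * (range1.size() + range2.size() - 1) << '\n';
}
```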
We'll proceed to briefly recall the basics of Big-O notation.
We loosely state the definition of a function or algorithm `f` being in `O(g(n))` (to be picky, `O(g(n))` being a set of functions, hence `f ∈ O(...)`, rather than the commonly misused `f(n) ∈ O(...)`).

If a function `f` is in `O(g(n))`, then `c · g(n)` is an upper bound on `f(n)`, for some non-negative constant `c` such that `f(n) ≤ c · g(n)` holds, for sufficiently large `n` (i.e., `n ≥ n0` for some constant `n0`).
Hence, to show that `f ∈ O(g(n))`, we need to find a pair of (non-negative) constants `(c, n0)` that fulfils

    f(n) ≤ c · g(n), for all n ≥ n0,    (+)

We note, however, that this pair is not unique; the problem of finding the constants `(c, n0)` such that `(+)` holds is degenerate. In fact, if any such pair of constants exists, there will exist an infinite number of different such pairs.
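As a small illustration of this non-uniqueness (using, for concreteness, the `f(n) = 4n - 2` and `g(n) = n` that appear in the analysis below), the sketch here brute-force checks several distinct candidate pairs `(c, n0)` against `(+)` over a finite prefix of `n`; note that a finite check like this can only falsify an asymptotic claim, never prove it:

```cpp
#include <iostream>

// f and g from the analysis below: f(n) = 4n - 2, g(n) = n.
long long f(long long n) { return 4 * n - 2; }
long long g(long long n) { return n; }

// Check whether f(n) <= c * g(n) for all n in [n0, limit].
bool holds_up_to(long long c, long long n0, long long limit) {
    for (long long n = n0; n <= limit; ++n)
        if (f(n) > c * g(n)) return false;
    return true;
}

int main() {
    // Several distinct witness pairs (c, n0) all satisfy (+),
    // illustrating that the pair is not unique.
    std::cout << holds_up_to(4, 1, 1000000) << '\n';   // c = 4,  n0 = 1
    std::cout << holds_up_to(5, 1, 1000000) << '\n';   // c = 5,  n0 = 1
    std::cout << holds_up_to(10, 7, 1000000) << '\n';  // c = 10, n0 = 7
}
```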
We proceed with the Big-O analysis of `std::set_intersection`, based on the already known worst-case number of comparisons of the algorithm (we'll consider one such comparison a basic operation).
**`set_intersection` example**

Now consider two ranges of elements, say `range1` and `range2`, and assume that the numbers of elements contained in these two ranges are `m` and `n`, respectively.
We could equally well let `k = m+n` be the parameter of choice: we would still conclude that `std::set_intersection` is of linear-time complexity, but in terms of `k` (which is `m+n`, not `max(m, n)`) rather than the largest of `m` and `n`. These are simply the preconditions we freely choose to set prior to proceeding with our Big-O notation/asymptotic analysis, and it's quite possible that the interviewer had a preference for analyzing the complexity using `k` as parameter of growth rather than the largest of its two components.

Now, from above, we know that in the worst case `std::set_intersection` will run `2 · (m + n - 1)` comparisons/basic operations. Let

    h(n, m) = 2 · (m + n - 1)
Since the goal is to find an expression of the asymptotic complexity in terms of Big-O (upper bound), we may, without loss of generality, assume that `n > m`, and define

    f(n) = 2 · (n + n - 1) = 4n - 2 > h(n, m)    (*)
We proceed to analyze the asymptotic complexity of `f(n)` in terms of Big-O notation. Let

    g(n) = n

and note that

    f(n) = 4n - 2 < 4n = 4 · g(n)
Now (choose to) let `c = 4` and `n0 = 1`, and we can state the fact that

    f(n) < 4 · g(n) = c · g(n), for all n ≥ n0,    (**)
Given `(**)`, we know from `(+)` that we've now shown that

    f ∈ O(g(n)) = O(n)

Furthermore, since `(*)` holds, naturally

    h ∈ O(g(n)) = O(n), assuming n > m    (i)

holds.
If we switch our initial assumption and assume that `m > n`, re-tracing the analysis above will, conversely, yield the similar result

    h ∈ O(g(m)) = O(m), assuming m > n    (ii)
Hence, given two ranges `range1` and `range2` holding `m` and `n` elements, respectively, we've shown that the asymptotic complexity of `std::set_intersection` applied to these two ranges is indeed

    O(max(m, n))

where we've chosen the largest of `m` and `n` as the parameter of growth in our analysis.
This is, however, not really valid (or at least not common) Big-O notation: when we use Big-O notation to describe the asymptotic complexity of some algorithm or function, we do so with regard to a single parameter of growth (not two of them).
Rather than answering that the complexity is `O(max(m, n))`, we may, without loss of generality, assume that `n` describes the number of elements in the range with the most elements, and, given that assumption, simply state that an upper bound for the asymptotic complexity of `std::set_intersection` is `O(n)` (linear time).
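The two parameter choices are, moreover, reconcilable: since `max(m, n) ≤ m + n ≤ 2 · max(m, n)` for all non-negative `m` and `n`, a function is in `O(max(m, n))` exactly when it is in `O(m + n)`. A trivial numeric illustration of that sandwich inequality (the checked ranges are arbitrary):

```cpp
#include <algorithm>
#include <cassert>

int main() {
    // max(m, n) <= m + n <= 2 * max(m, n) for all non-negative m, n,
    // which is why O(max(m, n)) and O(m + n) describe the same class
    // of functions.
    for (long long m = 0; m <= 1000; ++m)
        for (long long n = 0; n <= 1000; ++n) {
            const long long mx = std::max(m, n);
            assert(mx <= m + n && m + n <= 2 * mx);
        }
}
```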
A speculation as to the interview feedback: as mentioned above, it's possible that the interviewer simply had a firm view that the Big-O notation/asymptotic analysis should've been based on `k = m+n` as parameter of growth rather than the largest of its two components. Another possibility could, naturally, be that the interviewer was really asking about the actual worst-case number of comparisons of `std::set_intersection`, mixing this up with the separate matter of Big-O notation and asymptotic complexity.
Finally, note that the analysis of the worst-case complexity of `std::set_intersection` is not at all representative of the commonly studied non-ordered set intersection problem: the former is applied to ranges that are already sorted (see the quote from Boost's `set_intersection` below, the origin of `std::set_intersection`), whereas in the latter, we study the computation of the intersection of non-ordered collections.
> **Description**
>
> `set_intersection` constructs a sorted range that is the intersection of the sorted ranges `rng1` and `rng2`. The return value is the end of the output range.
As an example of the latter, the `intersection` set method of Python applies to non-ordered collections: applied to sets `s` and `t`, it has an average-case complexity of O(min(len(s), len(t))) and a worst-case complexity of O(len(s) · len(t)). The huge difference between average and worst case in this implementation stems from the fact that hash-based solutions generally work very well in practice, but can, for some applications, theoretically have very poor worst-case performance.
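For comparison, here's a sketch of such a hash-based intersection in C++ terms (using `std::unordered_set`; the helper function `intersect` is our own naming, not a standard facility), iterating over the smaller collection and probing the larger one to obtain the average-case O(min(m, n)) behaviour mentioned above:

```cpp
#include <iostream>
#include <unordered_set>

// Hash-based intersection of two non-ordered collections. Average case:
// O(min(m, n)) lookups; the worst case degrades towards O(m * n) if many
// elements collide into the same hash bucket.
std::unordered_set<int> intersect(const std::unordered_set<int>& s,
                                  const std::unordered_set<int>& t) {
    const auto& smaller = s.size() <= t.size() ? s : t;
    const auto& larger  = s.size() <= t.size() ? t : s;
    std::unordered_set<int> result;
    for (int x : smaller)           // iterate over the smaller set ...
        if (larger.count(x) != 0)   // ... and probe the larger one
            result.insert(x);
    return result;
}

int main() {
    std::unordered_set<int> s{1, 2, 3, 5, 8};
    std::unordered_set<int> t{2, 3, 7};
    for (int x : intersect(s, t)) std::cout << x << ' ';  // prints 2 and 3
}
```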
For additional details of the latter problem, see e.g.