I am trying to understand how the Leveled Compaction Strategy in Cassandra works that guarantees 90% of all reads will be satisfied from a single sstable.
From
LeveledCompactionStrategy (LCS) in Cassandra implements the internals of LevelDB. You can check the exact implementation details in LevelDB implementation doc.
In order to give you a simple explanation take into account the following points:
In bold are the relevant details that justify the 90% reads from the same file (sstable). Let's do the math together and everything will become clearer (I hope :)
Imagine you have keys A,B,C,D,E in L0, and each keys takes 1MB of data.
Next we insert key F. Because level 0 is filled a compaction will create a file with [A,B,C,D,E] in level 1, and F will remain in level 0.
That's ~83% of data in 1 file in L1
Next we insert G,H,I,J and K. So L0 fills up again, L1 gets a new sstable with [I,G,H,I,J].
By now we have K in L0, [A,B,C,D,E] and [I,G,H,I,J] in L1
And that's ~90% of data in L1 :)
If we continue inserting keys we will get around the same behavior so, that's why you get 90% of reads served from roughly the same file/sstable.
A more in depth and detailed (what happens with updates and tombstones) info is given in this paragraph on the link I mentioned (the sizes for compaction election are different because they are LevelDB defaults, not C*s):
When the size of level L exceeds its limit, we compact it in a background thread. The compaction picks a file from level L and all overlapping files from the next level L+1. Note that if a level-L file overlaps only part of a level-(L+1) file, the entire file at level-(L+1) is used as an input to the compaction and will be discarded after the compaction. Aside: because level-0 is special (files in it may overlap each other), we treat compactions from level-0 to level-1 specially: a level-0 compaction may pick more than one level-0 file in case some of these files overlap each other.
A compaction merges the contents of the picked files to produce a sequence of level-(L+1) files. We switch to producing a new level-(L+1) file after the current output file has reached the target file size (2MB). We also switch to a new output file when the key range of the current output file has grown enough to overlap more then ten level-(L+2) files. This last rule ensures that a later compaction of a level-(L+1) file will not pick up too much data from level-(L+2).
Hope this helps!