Which data structure/s is used in implementation of editors like notepad. This data structure should be extensible, and should support various features like edition, deletion, s
We wrote an editor for an old machine (keep in mind that this was a while ago, about 1986, so this is from memory, and the state of the art may have advanced somewhat since then) which we managed to get to scream along, performance wise, by using fixed memory blocks from self-managed pools.
It had two pools, each containing a fixed number of specific-sized blocks (one pool was for line structures, the other for line-segment structures). It was basically a linked list of linked lists.
Memory was pre-allocated (for each region) from a 'malloc()
'-like call, and we used 65,535 blocks (0 through 65,534 inclusive, block number 65,535 was considered the null block, an end-of-list indicator).
This allowed each for 65, 535 lines (384K or 512K for the padded version) and about 1.6G of file size (taking 2G of allocated space), which was pretty big back then. That was the theoretical file size limit - I don't think we ever approached that in reality since we never allocated the full set of line segment structures.
Not having to call malloc()
for every little block of memory gave us a huge speed increase, especially as we could optimise our own memory allocation routines for fixed size blocks (including inlining the calls in the final optimised version).
The structures in the two pools were as follows, with each line being a single byte):
Line structure (6/8 bytes) Line-segment structure (32 bytes)
+--------+ +--------+
|NNNNNNNN| |nnnnnnnn|
|NNNNNNNN| |nnnnnnnn|
|PPPPPPPP| |pppppppp|
|PPPPPPPP| |pppppppp|
|bbbbbbbb| |LLLLLLLL|
|bbbbbbbb| |LLLLLLLL|
|........| |xxxxxxxx|
|........| :25 more :
+--------+ : x lines:
+--------+
where:
x
point to the line segment pool.N
was a block number for the next line (null meaning this was the last line in the file).P
the the block number for the previous line (null meaning this was the first line in the file).b
was the block number for the first line segment in that line (null meaning the line was empty)..
was reserved padding (to bump the structure out to 8 bytes).n
was the block number for the next line segment (null meaning this was the last segment in the line).p
was the block number for the previous line segment (null meaning this was the first segment in the line).L
was the block number for the segment's line block.x
was the 26 characters in that line segment.The reason the line structure was padded was to speed up the conversion of block numbers into actual memory locations (shifting left by 3 bits was much faster than multiplying by 6 in that particular architecture and extra memory used was only 128K, minimal compared to the total storage used) although we did provide the slower version for those who cared more about memory.
We also had an array of 100 16-bit values which contained the line segment (and line number so we could quickly go to specific lines) at roughly that percentage (so that array[7] was the line that was roughly 7% into the file) and two free pointers to maintain the free list in each pool (this was a very simple one way list where N
or n
in the structure indicated the next free block and free blocks were allocated from, and put back to, the front of these lists).
There was no need to keep a count of the characters in each line segment since 0-bytes were not valid in files. Each line segment was allowed to have 0-bytes at the end that were totally ignored. Lines were compressed (i.e., line segments were combined) whenever they were modified. This kept block usage low (without infrequent and lengthy garbage collection) and also greatly sped up search-and-replace operations.
The use of these structures allowed very fast editing, insertion, deletion, searching and navigation around the text, which is where you're likely to get most of your performance problems in a simple text editor.
The use of selections (we didn't implement this as it was a text mode editor that used vi-like commands such as 3d
to delete 3 lines or 6x
to delete 6 characters) could be implemented by having a {line#/block, char-pos}
tuple to mark positions in the text, and use two of those tuples for a selection range.
Wikipedia says many editors use a Gap Buffer. It is basically an array with an unused space in the middle. The cursor sits just before the gap, so deletion and insertion at the cursor is O(1). It should be pretty easy to implement.
Looking at the source code of Notepad++ (as Chris Ballance suggested in this thread here) shows that they also use a gap buffer. You could get some implementation ideas from that.
Check out Ropes. Handles fast insert/delete/edit of strings. Ranges are usually supported in Rope implementations, and scrolling can be done with an inverted index into the rope.
There is an excellent article about Piece Chains by James Brown, author of HexEdit.
In a nutshell: Piece chains allow you to record the changes made to the text. After loading, you have a piece chain that spans the entire text. Now you insert somewhere in the middle.
Instead of allocating a new buffer, copying the text around, etc., you create two new pieces and modify the existing one: The existing one now contains the text up to the insertion point (i.e. you just change the length of the piece), then you have a piece with the new text and after that a new piece with all the text after the insertion. The original text is left unchanged.
For undo/redo, you simple remember which pieces you added/removed/changed.
The most complex area when using piece chains is that there is no longer a 1:1 mapping between an offset in the visible text and the memory structure. You either have to search the chain or you must maintain a binary tree structure of some kind.
The usual thing is to have something like a list or array of arrays of characters. There has been a lot of stuff done on this over the years: you might have a look at this google search.
Check out the implementation of Notepad++, you can view the source on SourceForge