Data Structures
A data structure represents a particular way of organizing data in computer memory, together with a set of operations on this data. The problem of designing efficient data structures is tightly linked with algorithm design. While some data structures are very specific to particular algorithms, there exists a set of data structures that are so common that it is important to know them and to understand their pros and cons.
In this chapter we will recall the most common data structures, which you should have seen in introductory programming courses, and their runtime complexity for common operations. We will see that for a given set of operations, we usually have several possible implementations, each having its own benefits and limitations.
Lists
A list is a very common data structure used to store values, where each value is associated with an index between 0 and \(n-1\), with \(n\) the number of elements in the list (indices could of course start at 1 instead of 0; this is just a convention).
Common operations on lists are the following:
| Operation | Description |
|---|---|
| get(i) | get the element at index i |
| previous(e) | get the element before the given element e |
| next(e) | get the element after the given element e |
| insert_front(ne) | insert the new element ne at the beginning of the list |
| insert_back(ne) | insert the new element ne at the end of the list |
| insert(e, ne) | insert the new element ne after the element e |
| remove_front() | remove the element at the beginning of the list |
| remove_back() | remove the element at the end of the list |
| remove(e) | remove the given element e |
| find(e) | find if the given element e is in the list |
| extend(L2) | insert all the elements of the list L2 at the end of the list |
The most common ways to implement those operations are dynamic arrays and linked lists.
Note
In object-oriented programming languages, the List data structure is often represented by an interface or an abstract class. Dynamic arrays and linked lists are then two different classes that implement this interface or inherit from this abstract class. For example, in Java, List is the interface, ArrayList is a class implementing this interface as a dynamic array, and LinkedList is a class implementing this interface as a linked list.
Dynamic array
The dynamic array is a very common implementation of lists which is readily available in the standard library of many programming languages: ArrayList in Java, std::vector in C++, list in Python, List in C#, Array in JavaScript…
In this case, the list elements are simply stored in an array. The size of the underlying array is called the capacity; it is always greater than or equal to the number of elements \(n\) in the list: the additional free space is used when new elements are added to the list.
The declaration of a dynamic array usually looks like this:
```python
class List:
    def __init__(self):
        self.array = [None]  # underlying array, the current capacity is the length of the array
        self.size = 0        # number of elements in the list
```
Elements in a dynamic array can be accessed in constant time \(O(1)\) (get, previous, next).
Adding/removing an element at the front or somewhere in the middle of the array requires moving all the elements after the position of the insertion or deletion, leading to a linear worst-case runtime complexity \(\Theta(n)\).
For example, the pseudocode for removing the front element can be:
```python
class List:
    ...
    def remove_front(self):
        # move all the elements to the left, the first one is discarded
        for i in range(1, self.size):
            self.array[i - 1] = self.array[i]
        self.size = self.size - 1
```
Because the elements of a dynamic array are packed at the beginning, removing and adding elements at the end
of the array is simpler.
However, a difficulty arises when the size of the list reaches its capacity: inserting a new element then requires increasing the size of the underlying array.
The pseudocode of the insert_back
function can be:
```python
class List:
    ...
    def insert_back(self, new_element):
        if self.size == len(self.array):     # underlying array is full
            self.increase_array_size()       # reallocate a new array: complexity O(self.size), defined below
        self.array[self.size] = new_element  # put the new element at the end
        self.size = self.size + 1            # increase the size
```
In practice, these two operations, insert_back
and remove_back
, are done in constant amortized time \(\Theta(1)\).
The term amortized here means that, over a sequence of \(k\) insertions at the back of the array, the average complexity per insertion is constant, but a single insertion might indeed be worse (linear in this case).
Resizing dynamic arrays
In practice, “resizing the underlying array” means that we must allocate a new array and copy the old one into the new one, so we must copy the \(n\) elements of the array, which obviously has a linear time complexity \(\Theta(n)\). The question is then, when we allocate a new array: what should its new capacity be? If the new capacity is too small, we might have to reallocate a new one quickly if several new elements are inserted (and thus recopy all the elements). If the new capacity is too large, then we will waste memory.
A nice result shows that, if the capacity is multiplied by a constant \(k > 1\) called the growth factor, then the array reallocations happen rarely enough that they become negligible on average. The increase_array_size function of the dynamic array class could then be:
```python
class List:
    ...
    def increase_array_size(self):
        # allocate a new array
        newCapacity = math.ceil(len(self.array) * growth_factor)  # growth_factor > 1
        newArray = [None] * newCapacity  # new array of size newCapacity
        # copy the old array into the new array
        for i in range(self.size):
            newArray[i] = self.array[i]
        # replace the old array by the new one
        self.array = newArray
```
Let’s consider a simple example, where we start with an empty list of capacity 1 and a growth factor of 2. Assume that we have a sequence of \(2^p\) insertions, with \(p>1\). Reallocations occur when the capacity is equal to 1, 2, 4, 8, 16, …, \(2^{p-1}\) and at each reallocation we have to copy all the elements of the array (so we have to copy 1, 2, 4, 8, 16 elements…). Thus, to insert our \(2^p\) elements, the total number of element copies will be:

\[1 + 2 + 4 + \dots + 2^{p-1} = \sum_{i=0}^{p-1} 2^i = 2^p - 1.\]

As we have made \(2^p\) insertions, on average, we made:

\[\frac{2^p - 1}{2^p} < 1\]

copies per insertion. Therefore, on average, the number of copies per insertion is constant. Some insertions still take a linear time \(\Theta(n)\), but those costly insertions are rare and their cost is amortized by all the insertions that are done in constant time \(\Theta(1)\). This result can be generalized to any sequence of insertions and removals.
The growth factor is a trade-off between memory and runtime: a large growth factor will lead to fewer reallocations but will waste more memory.
Finding an element in a dynamic array can be done in linear time \(\Theta(n)\) with the linear search algorithm that was described in the introductory example Algorithm complexity.
Merging two dynamic arrays is also done in linear time with respect to the size of the two lists.
Dynamic arrays are extremely common in practice because they are simple to implement and efficient for most operations. It is thus essential to be comfortable with using them correctly in algorithms.
Assume we have a list of numbers A.
The list A can contain duplicated elements and the goal is to find a list B, denoted unique(A)
, containing all unique elements in A.
For example, a possible solution of unique([4,3,22,3,11,4,-1,3,-1]) is [-1,3,4,11,22].
Note that the order of the elements in unique(A) does not matter.
A simple strategy to do this is to sort the elements of A and then browse the elements of the sorted list: whenever an element is different from the previous one, it means that we have found a new unique element.
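A minimal sketch of this strategy, relying on Python's built-in sorted for the sorting step (the function name unique follows the problem statement):

```python
def unique(A):
    # sort a copy of A, then keep each element that differs from its predecessor
    B = []
    for x in sorted(A):
        if len(B) == 0 or x != B[-1]:
            B.append(x)
    return B

print(unique([4, 3, 22, 3, 11, 4, -1, 3, -1]))  # [-1, 3, 4, 11, 22]
```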
Programming exercise : Implement unique in the
python notebook
Give the worst-case runtime complexity of your algorithm.
Consider the following problem: given two lists of integers A and B, we want to compute the intersection of the two lists, i.e. the list of integers that are present in both lists. The intersection list should not contain duplicates.
For example, if we have the two lists [3, 1, 5, 2, 4]
and [2, 7, 8, 5, 3]
, the intersection list, denoted
inter([3, 1, 5, 2, 4], [2, 7, 8, 5, 3])
, should be [3, 5, 2]
.
Note that the order of the elements in the intersection list is not important.
A simple strategy to do this is to sort the elements of A and B, and then to browse the elements of the two sorted arrays in parallel:
if the current elements in A and B are the same, we have found a new common element; if this element is different from the last found element, then we add it to the intersection list;
if the current element of A (resp. B) is smaller than the current element of B (resp. A), then we move to the next element in A (resp. B).
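A possible sketch of this two-pointer strategy, again relying on Python's built-in sorted (the output order differs from the example above, which is allowed since order does not matter):

```python
def inter(A, B):
    # sort both lists, then browse them in parallel with two indices
    A, B = sorted(A), sorted(B)
    result = []
    i = j = 0
    while i < len(A) and j < len(B):
        if A[i] == B[j]:
            # common element found; skip duplicates of an already added element
            if len(result) == 0 or result[-1] != A[i]:
                result.append(A[i])
            i += 1
            j += 1
        elif A[i] < B[j]:
            i += 1
        else:
            j += 1
    return result

print(inter([3, 1, 5, 2, 4], [2, 7, 8, 5, 3]))  # [2, 3, 5]
```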
Programming exercise : Implement inter in the python notebook
Give the worst-case runtime complexity of your algorithm in terms of the size of the two lists \(n\) and \(m\).
Linked list
Linked lists do not rely on a contiguous array to store elements; instead, they store each element in an individual node. In order to keep track of all the nodes in the list, each node also stores a link (pointer or reference) toward the next node in the list, and possibly toward the previous one. A linked list that only stores a link toward the next node is called a singly linked list; a list where each node contains a link toward both the previous and the next node is called a doubly linked list.
The first node of a linked list is called the head and the last one the tail. We assume that it is possible to know the size of singly and doubly linked lists in constant time \(\Theta(1)\) (in practice, the linked list data structure stores an integer which is incremented/decremented whenever an element is inserted/removed). Contrary to dynamic arrays, it is possible to insert an element anywhere in a linked list in constant time \(\Theta(1)\) because there is no need to reorganize the other elements in the list: only a few links need to be updated. Removing an element can be done in constant time anywhere in a doubly linked list; however, in a singly linked list, we need to know the previous node in order to update its reference to the next node: if this node is known, the removal can be done in constant time, otherwise it takes linear time.
However, with linked lists, we lose efficient random access: to get the \(k\)-th element of a linked list, one must browse the first \(k\) nodes of the list from its head, leading to a linear time \(\Theta(n)\) access. Accessing the next element is done in constant time by following the link to the next node. With a doubly linked list, accessing the previous element is also done in constant time; however, with a singly linked list, as we don't have a link to the previous node, one has to browse the list from its head to find the previous element, leading to a linear time operation.
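To make the node structure concrete, here is a minimal sketch of a singly linked list with only a head pointer (the class names and the two operations shown are illustrative, not a full List implementation):

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None  # link to the next node (None for the tail)

class SinglyLinkedList:
    def __init__(self):
        self.head = None
        self.size = 0

    def insert_front(self, value):
        # constant time: only the head link changes
        node = Node(value)
        node.next = self.head
        self.head = node
        self.size += 1

    def get(self, k):
        # linear time: browse the first k nodes from the head
        node = self.head
        for _ in range(k):
            node = node.next
        return node.value
```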
Finding an element in a linked list can be done in linear time \(\Theta(n)\) with the linear search algorithm that was described in the introductory example Algorithm complexity.
Extending a linked list can be done in constant time by just gluing the tail of the first list with the head of the second list.
Summary
In summary, the runtime complexities of the operations for these 3 possible implementations of lists are:
| list \ operation | get | previous | next | insert_front | remove_back | remove_front | remove | find | extend |
|---|---|---|---|---|---|---|---|---|---|
| Dynamic array | \(\Theta(1)\) | \(\Theta(1)\) | \(\Theta(1)\) | \(\Theta(n)\) | \(\Theta(1)^*\) | \(\Theta(n)\) | \(\Theta(n)\) | \(\Theta(n)\) | \(O(n+m)\) |
| Singly linked list | \(\Theta(n)\) | \(\Theta(n)\) | \(\Theta(1)\) | \(\Theta(1)\) | \(\Theta(n)\) | \(\Theta(1)\) | \(\Theta(n)\) | \(\Theta(n)\) | \(\Theta(1)\) |
| Doubly linked list | \(\Theta(n)\) | \(\Theta(1)\) | \(\Theta(1)\) | \(\Theta(1)\) | \(\Theta(1)\) | \(\Theta(1)\) | \(\Theta(1)\) | \(\Theta(n)\) | \(\Theta(1)\) |
with \(n\) the size of the list (and \(m\) the size of the second list in extend); \(*\) denotes amortized complexity.
The following algorithms take a list as parameter:
```
def algorithm_1(list):
    for e in list:
        if list.previous(e) is defined:
            list.previous(e) = e

def algorithm_2(list):
    for e in list:
        if list.next(e) is defined:
            list.next(e) = e / 2

def algorithm_3(list):
    for i in range(len(list) // 2):
        list[i] = list[list.size() - 1 - i]

def algorithm_4(list):
    for e in list:
        if e < 0:
            list.remove(e)

def algorithm_5(list):
    x = list[0]
    while x is defined:
        if x > 0:
            list.insert_front(x)
        x = list.next(x)
```
Give the worst case complexity of each of these algorithms, for each of the three list implementations described above:
| Algorithm | Dynamic array | Singly linked list | Doubly linked list |
|---|---|---|---|
| algorithm_1 | | | |
| algorithm_2 | | | |
| algorithm_3 | | | |
| algorithm_4 | | | |
| algorithm_5 | | | |
Containers based on lists
Lists are very generic data structures that allow many operations. In algorithms, we often like to use more specialized containers, like stacks or queues, whose stronger semantics help to better convey the intent of the data structure.
Stacks
A stack, or LIFO (Last In First Out) data structure, is a container with only 2 operations:
- push(ne): insert a new element at the top of the stack
- pop(): remove and return the element at the top of the stack
A third operation called top(), which returns the top element without removing it, is also often provided for efficiency (note that it can be obtained by combining pop and push).
A stack can therefore only be accessed from one end.
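For illustration, a stack can be sketched on top of a Python list (a dynamic array), pushing and popping at the back so that both operations run in amortized constant time:

```python
class Stack:
    def __init__(self):
        self.elements = []          # dynamic array holding the stack content

    def push(self, ne):
        self.elements.append(ne)    # insert at the top (back of the array)

    def pop(self):
        return self.elements.pop()  # remove and return the top element

    def top(self):
        return self.elements[-1]    # read the top element without removing it

s = Stack()
s.push(1)
s.push(2)
print(s.pop(), s.pop())  # 2 1 (last in, first out)
```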
Consider the following problem: You are given an expression in reverse Polish notation (RPN), also known as postfix notation.
In RPN, operators follow their operands, such as 3 4 +
instead of 3 + 4
. The goal is to evaluate the expression and return the final value.
You can assume that the expression is valid and contains only integers with a single digit and operators +
and *
.
For example, given the expression 3 4 + 5 *
, you should return 35 (since 3 + 4 = 7
and 7 * 5 = 35
).
A simple algorithm can solve this problem in linear time \(\Theta(n)\) with a stack. The idea is to browse the expression from left to right, then:
each time you encounter a number, push it on the stack;
each time you encounter an operator, pop the last two numbers from the stack, perform the operation on these two numbers and push the result back on the stack;
at the end, pop and return the last value remaining on the stack.
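A possible sketch of this algorithm, assuming the expression is given as a string of space-separated tokens:

```python
def reverse_polish_evaluation(expression):
    stack = []
    for token in expression.split():
        if token == "+" or token == "*":
            # pop the two operands, apply the operator, push the result back
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b if token == "+" else a * b)
        else:
            stack.append(int(token))  # a number: push it on the stack
    return stack.pop()

print(reverse_polish_evaluation("3 4 + 5 *"))  # 35
```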
Programming exercise : Implement reverse_polish_evaluation in the python notebook
Consider the following problem: we have as input an array of characters (a string) of size \(n\), consisting of the characters ()[]
, and the goal is to determine whether
its parentheses ()
and square brackets []
are “balanced” or not.
For example:
“[()[]]” is balanced
“[()” is not balanced (the first left square bracket has no corresponding right square bracket)
“[(])” is not balanced (although the numbers of left and right parentheses and square brackets match, they are interleaved incorrectly)
A simple algorithm can solve this problem with a stack. The idea is the following one: we scan the string from left to right, and each time we encounter a left parenthesis or a left square bracket, we push it on the stack. Each time we encounter a right parenthesis or a right square bracket, we pop the stack and check that the popped element is the corresponding left parenthesis or left square bracket.
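A possible sketch of this idea (the helper dictionary matching is an illustrative choice):

```python
def is_balanced(s):
    matching = {")": "(", "]": "["}
    stack = []
    for c in s:
        if c in "([":
            stack.append(c)  # push every opening symbol
        else:
            # a closing symbol must match the most recent opening symbol
            if len(stack) == 0 or stack.pop() != matching[c]:
                return False
    return len(stack) == 0   # every opening symbol must have been closed

print(is_balanced("[()[]]"), is_balanced("[("), is_balanced("[(])"))  # True False False
```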
Programming exercise : Implement the function is_balanced in the python notebook
Give the worst-case runtime complexity of your algorithm.
Queues
A queue, or FIFO (First In First Out) data structure, is a container with only 3 operations:
- enqueue(ne): insert a new element at the back of the queue
- peek(): return the element at the front of the queue
- dequeue(): remove the element at the front of the queue
Elements are thus inserted at one end of the queue and retrieved at the other end.
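For illustration, Python's collections.deque can serve as such a queue (a plain Python list would make dequeue linear, since removing the front element shifts all the others):

```python
from collections import deque

queue = deque()
queue.append("a")       # enqueue at the back
queue.append("b")
print(queue[0])         # peek: "a", the element at the front
print(queue.popleft())  # dequeue: removes and returns "a"
```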
Consider the following problem: given a positive integer \(n\), print all the binary numbers between 1 and \(n\) in increasing order.
For example, for \(n=10\), the algorithm should print: 1, 10, 11, 100, 101, 110, 111, 1000, 1001, 1010
A simple algorithm can solve this problem with a queue. The idea is the following one: we start with the number 1, and we enqueue it. Then, we dequeue the first number \(x\) in the queue, print it, and enqueue its two successors \(x0\) and \(x1\) in the queue (for example, if \(x=10\), we enqueue 100 and 101). We repeat this operation until we have printed \(n\) numbers.
Write this algorithm in pseudocode.
Deques
A deque (double-ended queue), pronounced “deck” (like “cheque”), is a container with 6 operations:
- insert_front(ne): insert a new element at the front of the deque
- remove_front(): remove the element at the front of the deque
- peek_front(): return the element at the front of the deque
- insert_back(ne): insert a new element at the back of the deque
- remove_back(): remove the element at the back of the deque
- peek_back(): return the element at the back of the deque
Elements can therefore be accessed from both ends of the deque.
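For illustration, Python's collections.deque provides exactly this kind of container:

```python
from collections import deque

d = deque([2, 3])
d.appendleft(1)     # insert_front
d.append(4)         # insert_back
print(d[0], d[-1])  # peek_front, peek_back -> 1 4
d.popleft()         # remove_front
d.pop()             # remove_back
print(list(d))      # [2, 3]
```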
Deques can be implemented with dynamic arrays or linked lists; an efficient implementation is proposed in an optional exercise below.
Priority queues
In a priority queue, each element is associated with a priority, i.e. a number which indicates the importance of the element. Instead of being retrieved in the order of insertion, like in a classical queue, the elements are retrieved in order of priority: the most important element is served first. By convention, we assume that the lower the priority value, the more important the element (for example, an element with priority 2 comes before an element with priority 8).
Common operations on priority queues are the following:
| Operation | Description |
|---|---|
| find_min() | get the element of minimum priority |
| remove_min() | remove the element of minimum priority |
| insert(ne, p) | insert the new element ne with priority p |
| decrease_priority(e, p) | reduce the priority of the element e to p |
| merge(PQ2) | merge the priority queue PQ2 into this one |
Priority queues are usually implemented with heaps, which can themselves rely on a partially sorted contiguous array (they are like a dynamic array with an additional heap property) or on a generalization of linked lists where nodes can have several successors. The most common variants of heaps and the worst-case runtime complexities of their associated priority queue operations are the following:
| heap \ operation | find_min | remove_min | insert | decrease_priority | merge |
|---|---|---|---|---|---|
| Binary | \(\Theta(1)\) | \(\Theta(\log(n))\) | \(O(\log(n))\) | \(O(\log(n))\) | \(\Theta(n)\) |
| Binomial | \(\Theta(1)\) | \(\Theta(\log(n))\) | \(\Theta(1)^*\) | \(\Theta(\log(n))\) | \(O(\log(n))\) |
| Fibonacci | \(\Theta(1)\) | \(O(\log(n))^*\) | \(\Theta(1)\) | \(\Theta(1)^*\) | \(\Theta(1)\) |
Where \(n\) is the size of the heap and * indicates amortized time complexity.
While Fibonacci heaps have the best runtime complexity, they are also the most complex to implement, with a large runtime and memory overhead.
That’s why, in practice, it may be more interesting to use simpler heap implementations, especially the binary heap, which only requires a dynamic array, in particular when the merge operation is not needed.
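For example, Python's heapq module implements a binary heap on top of a list; a common way to use it as a priority queue is to store (priority, element) pairs:

```python
import heapq

pq = []                                  # the underlying dynamic array
heapq.heappush(pq, (2, "write report"))  # insert with priority 2
heapq.heappush(pq, (8, "coffee break"))
heapq.heappush(pq, (1, "fix bug"))
print(pq[0])              # find_min: (1, 'fix bug'), in constant time
print(heapq.heappop(pq))  # remove_min: (1, 'fix bug'), in logarithmic time
```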
In the previous chapter we have seen two sorting algorithms with quadratic worst case runtime complexity. With the heap data structure we can define a new very simple sorting algorithm (assuming that the heap data structure is already implemented):
Insert all the elements of the unsorted array in a heap
Retrieve all the elements of the heap and put them in the order of arrival in a new list
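For illustration, a sketch of this idea using Python's heapq binary heap (the notebook exercise may expect a different heap interface):

```python
import heapq

def heap_sort(array):
    heap = []
    for x in array:              # step 1: insert all the elements in the heap
        heapq.heappush(heap, x)
    # step 2: repeatedly retrieve the minimum, in order of arrival
    return [heapq.heappop(heap) for _ in range(len(heap))]

print(heap_sort([5, 1, 4, 2, 3]))  # [1, 2, 3, 4, 5]
```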
Programming exercise : Implement heap sort in the python notebook
and verify that the experimental runtime matches the theoretical complexity.
Assume that we use a binary heap in the above algorithm, what is the worst case runtime complexity of this algorithm ? (\(n\) is the size of the array)
Can we improve the worst case runtime complexity of heap-sort by using the Binomial or the Fibonacci heaps?
Given an array of numbers of size \(n\), print the \(k\leq n\) smallest elements of the array in increasing order (we assume that printing is done in constant time).
Propose a first algorithm in pseudocode with a linearithmic runtime \(\Theta(n\log(n))\) using heap-sort.
Propose a second algorithm in pseudocode with a runtime \(\Theta(n + k\log(n))\) using a binomial heap.
Propose a third algorithm in pseudocode with a runtime \(\Theta(n\log(k))\) using a binary heap (you can consider that the heap gives access to the maximal element instead of the minimal one). Hint: the heap doesn't need to hold more than \(k\) elements.
Dictionaries
A dictionary is a data structure where each stored element is associated with a unique key (no two elements can have the same key). In some way, dictionaries are a generalization of lists: in a list, each element is associated with a unique integer, its index, between 0 and the length of the list minus 1; in a dictionary, indices can be (mostly) anything and are called keys. Dictionaries are also called associative arrays or maps.
Common operations on dictionaries are:
| Operation | Description |
|---|---|
| put(k, e) | insert the new element e with the key k |
| get(k) | get the element associated with the key k |
| remove(k) | remove the element associated with the key k |
There are mainly two efficient ways to implement dictionaries (with many variants): hash maps or search trees.
Hash maps rely on an auxiliary hash function that associates an index to any key, which allows storing each element in a classical array at the index given by the hash of its associated key. However, it is possible that two different keys have the same hash value and are hence associated with the same index: in this case, we say that there is a collision between the two keys, and special techniques must be used to handle this problem. If we assume that the hash function is good (i.e., it produces few collisions), then the three operations put, get and remove have a constant average runtime \(\Theta(1)\). However, the worst case is linear \(O(n)\) with respect to the size \(n\) of the hash map.
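To make the idea concrete, here is a toy sketch of a hash map that handles collisions by chaining (each cell of the array holds a small list of key/element pairs); real implementations are more elaborate:

```python
class HashMap:
    def __init__(self, capacity=8):
        self.buckets = [[] for _ in range(capacity)]  # one bucket (list of pairs) per index

    def _index(self, key):
        return hash(key) % len(self.buckets)          # hash of the key -> array index

    def put(self, key, element):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                              # key already present: replace the element
                bucket[i] = (key, element)
                return
        bucket.append((key, element))                 # empty bucket or collision: append the pair

    def get(self, key):
        for k, e in self.buckets[self._index(key)]:
            if k == key:
                return e
        return None

m = HashMap()
m.put("alice", 3)
m.put("bob", 5)
print(m.get("alice"), m.get("carol"))  # 3 None
```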
Search trees, on the other hand, assume that the keys are totally ordered: for any two keys \(k_1\) and \(k_2\), we have either \(k_1\leq k_2\) or \(k_2\leq k_1\). The three operations put, get and remove have a logarithmic average and worst-case runtime \(O(\log(n))\).
Search trees are thus less efficient than hash maps on average, but they behave better in the worst case.
Warning
Dictionaries have become very popular as general purpose containers with high level interpreted languages such as Python and JavaScript. Existing implementations of dictionaries manage to have good runtime complexities on average, but they are complex data structures with a significant constant overhead both in terms of runtime and memory. While this overhead can be ignored in high level interpreted scripts, it becomes important when we consider the efficient implementation of algorithms; it is thus a good idea to avoid dictionaries when simpler data structures can be used.
Check your understanding
In the following, efficient must be understood as anything with an (amortized) runtime complexity strictly better than linear (\(o(n)\)).
- A queue is a FIFO container
- A stack is a FIFO container
- A doubly linked list is a good container to store a static (that does not change over time) collection of items
- A dynamic array is a good container to store a static (that does not change over time) collection of items
- Any element can be efficiently accessed in a doubly linked list
- Finding the smallest element of a dictionary is efficient
- Finding the smallest element of a heap is efficient
- Finding the smallest element of a list is efficient
- We can efficiently access an element of a given priority in a priority queue
- The insertion of new elements in a dictionary is efficient
- Searching if a key exists in a dictionary is efficient
- The insertion of new elements in the middle of a dynamic array is efficient
- The insertion of new elements at the front of a dynamic array is efficient
- The insertion of new elements at the back of a dynamic array is efficient
- In a queue, we can insert elements at the back and at the front
- We can insert/remove elements efficiently at the front and at the back of a deque
- Searching if an element exists in a list is efficient
- In a dynamic array, swapping two elements is efficient
Going further
Compare the effect of two different resizing strategies for dynamic arrays:
Linear: the capacity is increased by a constant \(k=5\) each time the array is full.
Geometric: the capacity is multiplied by a constant \(k=2\) each time the array is full.
Programming exercise : Implement the two functions new_size and insert_back of the dynamic array class in the python notebook and compare the runtime of the two strategies for a sequence of insertions.
We have seen that deques are containers where elements can be inserted and removed at the front and at the back. Because the elements in classical dynamic arrays are packed at the front, they are not suitable for a deque implementation: insertion and deletion at the front are not efficient. A simple way to alleviate this issue is to put the elements in the middle of the array instead of the front. It is then possible to insert new elements in the free space at the front and at the back.
If the space becomes insufficient at the front or at the back, a larger array is allocated and the elements of the old array are copied into the middle of the new one. If the new capacity is \(c\) and the number of elements in the deque is \(n\), the new start index is simply \(\lfloor (c-n)/2 \rfloor\).
Programming exercise : Implement a deque container based on a dynamic array in the python notebook; all operations must have a constant (amortized) runtime.
For the array container, we will use a Numpy array (to avoid using Python lists which are already dynamic arrays). We will use a growth factor of 2. For simplicity we never shrink the array when elements are removed.
Singly linked lists are elegant and light data structures but the algorithms for common operations can become tricky with them!
Propose a non-recursive algorithm that reverses a singly linked list of \(n\) elements in linear time \(\Theta(n)\) using only a constant amount of memory \(O(1)\) (you cannot, for example, allocate an array with the same size as the list).
Programming exercise : Implement the singly linked list reverse function in the python notebook