No one can ever master a programming language (PL) by studying it only without looking into and comparing it with other ones. Modern application software often requires a variety of components written in different PLs. Most importantly, language itself is not important at all; at least not important when compared to the fundamental ideas on architectures, frameworks, the design of the PL. A competing programmer can always get the hang of any PLs quickly.
Consequently, I make a list of PLs I would study seriously in the future.
C# or Java
A serious pure OOP and enterprise level PL is always necessary for the reason of productivity.
C/C++
The most powerful PL, which can be used to create almost everything.
Help to understand things under the hood.
C++: semi-OOP version of C, a horrible language which requires a life long time to study.
Bash/Python
Scripts make your life easier.
Javascript
Everything could be written in Javascript will be eventually written in Javascript.
Lisp
Help to understand the beauty of functional PL.
The following part is the first part of my series notes on the book Essential C#, which is redeemed as the best book for C# learners.
Dynamo successfully builds a highly available and scalable data store, which sacrifices consistency under certain failure scenarios but the data storage is still eventually consistent. The system is complex and thus this paper involves a huge number of details. However, the paper could not specify all of them. Some interesting parts seem to be missing. What are the supporting techniques used in Dynamo? How is it compared to other existing KV distributed store systems? Are there any possible extensions or future works?
This paper assumes that security concerns can be ignored for its internal use. Theoretically, security problems is still possible. At least, how can the administrator detect sybil attacks?
The author tried three partition schemes. The latter versions decouple partitioning and partition placement. The placement is changeable at run time. Consequently, the strategy 3 has a better efficiency. And the system recovers faster and it is easier to archive. However, the key space is partitioned equally into Q partitions, and every node assume Q/S tokens per node. The strategy 3 in Figure 7 is a little confusing. The author should specify that the span each node is responsible for is SIMILAR IN SIZE to each other, instead of exactly EQUAL IN SIZE to each other.
Integrate the file system for local database with HDFS. In the design of HadoopDB, the file system and the database are separated, which means that the more space one part is preallocated and the less space the other part will occupy. This design is not scalable because the system administrator has to partition the space by hand. We could investigate a way to let the database run on the HDFS directly. The database’s file descriptor can be adapted to HDFS directly. Or, even better, we can add an extra virtualized layer between the database and the HDFS.
Set up a better scheduler. It is mentioned in the paper that the original version of hadoop fair scheduler is not so good. It is queue-based and locally greedy and simply does not consider the big picture much. “Quincy: Fair Scheduling for Distributed Computing Clusters” advances a flow-based algorithm, which has a very good performance by increasing the throughput up to 40%.
classBook{booleancheckedOut=false;Book(booleancheckOut){checkedOut=checkOut;}voidcheckIn(){checkedOut=false;}publicvoidfinalize(){if(checkedOut)System.out.println("Error: checked out.");// Normally, you’ll also do this:// super.finalize(); // Call the base-class version}}publicclassTerminationCondition{//static Test monitor = new Test();publicstaticvoidmain(String[]args){Booknovel=newBook(true);// Proper cleanup:novel.checkIn();// Drop the reference, forget to clean up:newBook(true);// Force garbage collection & finalization:System.gc();}}/* output: Error: checked out. */
//Input:// A array// size size of A// p starting index of the span waiting to be sorted// r ending index of the span waiting to be sorted//Procedure:// sort left// sort right// merge by comparing two numbers from the sorted left and right pair by pair//Output:// A sorted array voidmergeSort(int*a,intb,inte){if(b<e){intmid=(b+e)/2;mergeSort(a,b,mid);mergeSort(a,mid+1,e);// mergeint*left=newint[mid-b+1+1];int*right=newint[e-mid+1];for(inti=b;i<=mid;i++){left[i-b]=a[i];}left[mid+1-b]=INT_MAX;for(inti=mid+1;i<=e;i++){right[i-mid-1]=a[i];}right[e-mid]=INT_MAX;inti=0,j=0;for(intk=b;k<=e;k++){a[k]=(left[i]<=right[j])?(left[i++]):(right[j++]);}delete[]left;delete[]right;}}
Θ(n lg n) but not operate in place
3 Growth of Functions 43
3.1 Asymptotic notation 43
O(). When we have only an asymptotic upper bound, we use O-notation.
O(g(n)) = {f(n): there exist positive constants c and n0 such that 0 <= f(n) <= c*g(n) for all n >= n0}.
if f(n) == c*g(n) is not possible, we use o()
Θ(). When we have both an asymptotic upper bound and an asymptotic lower bound, we use Θ-notation.
Θ(g(n)) = {f(n): there exist positive constants c1, c2, and n0 such that 0 <= c1g(n) <= f(n) <= c2g(n) for all n >= n0}
Ω(), When we have only an asymptotic lower bound, we use omega-notation.
uppercase_omega(g(n)) = {f(n): there exist positive constants c and n0 such that 0 <= c*g(n) <= f(n) for all n >= n0}
if f(n) == c*g(n) is not possible, we use ω()
3.2 Standard notations and common functions 53
4 Divide-and-Conquer 65
Three ways for solving recurrences
substitution method
recursive-tree method
master method
T(n) = a*T(n/b) + f(n), a>=1, b>1. It characterizes a divide- and-conquer algorithm that creates a subproblems, each of which is 1/b the size of the original problem, and in which the divide and combine steps together take f(n) time.
4.1 The maximum-subarray problem 68
In the stock market, you need to buy low and sell high. But the maximum profit does not always start at the lowest price or end at the highest price.
A brute-force solution
Choose 2 from n pairs of (date, price) and calculate the difference. Θ(n2)
A solution using divide-and-conquer
A transformation: create an array of the daily change in price, and find the maximum subarray.
find the max subarray on the left
find the max subarray on the right
find the max subarray crossing the middle
find max left, max right, and their sum
take the max of the above
TODO: codes
Θ(n*lg(n))
4.2 Strassen’s algorithm for matrix multiplication 75
O(n3) to O(n2.81)
4.3 The substitution method for solving recurrences
Guess the form of the solution.
Use mathematical induction to find the constants and show that the solution works.
4.4 The recursion-tree method for solving recurrences 88
depth * height
4.5 The master method for solving recurrences 93
5 Probabilistic Analysis and Randomized Algorithms 114
5.1 The hiring problem 114
5.2 Indicator random variables 118
5.3 Randomized algorithms 122
5.4 Probabilistic analysis and further uses of indicator random variables
II Sorting and Order Statistics Introduction 147
worst-case running time
average-cased running time
Insertion sort
Θ(n^2)
Θ(n^2)
Merge sort
Θ(n lg n)
Θ(n lg n)
Heapsort
O(n lg n)
-
Quicksort
Θ(n^2)
Θ(n lg n) (*)
Counting sort
Θ(k+n)
Θ(k+n)
Radix sort
Θ(d(n+k))
Θ(d(n+k))
Bucket sort
Θ(n^2)
Θ(n)
6 Heapsort 151
O(n lg n),Our heap data structure is not garbage-collected storage.
#include <iostream>#include <cstdlib>template<classT>classHeap{public:Heap():m_array(0),m_size(0),m_maxSize(0){}Heap(intmaxSize);~Heap(){delete[]m_array;}boolinsert(Titem);boolisEmpty();TfetchMax();voiddeleteMax();voiddisplay();//Transfer Array<T> to a heapvoidarrayToHeap(T*array,intsize);//Sort the heapvoidheapSort();private:voidlocate_down(intstart,intend);voidlocate_up(intstart,intend);private:T*m_array;intm_size;//num of elem storedintm_maxSize;//allocated memory};template<classT>Heap<T>::Heap(intmaxSize){m_array=newT[maxSize+1];m_size=0;m_maxSize=maxSize;}template<classT>THeap<T>::fetchMax(){returnm_array[1];}template<classT>boolHeap<T>::isEmpty(){returnm_size==0;}template<classT>voidHeap<T>::locate_down(intstart,intend){intc=start*2;inttemp;while(c<=end){if((c<end)&&(m_array[c]<m_array[c+1]))//Pick the larger one in both childs. ATTENTION: c==end is not approved!!c=c+1if(m_array[start]<m_array[c]){temp=m_array[start];m_array[start]=m_array[c];m_array[c]=temp;}start=c;c=2*c;}}template<classT>//overlap the top with the bottom, decrease the size, and locate_downvoidHeap<T>::deleteMax(){m_array[1]=m_array[m_size];--m_size;locate_down(1,m_size);}template<classT>voidHeap<T>::locate_up(intstart,intend){intloc=end;intparent=end/2;inttemp;while(parent>=start&&m_array[loc]<m_array[parent]){temp=m_array[loc];m_array[loc]=m_array[parent];m_array[parent]=temp;loc=parent;parent=loc/2;}}template<classT>boolHeap<T>::insert(Titem){if(m_size==m_maxSize){std::cerr<<"Heap is full";this->~Heap();return0;}++m_size;m_array[m_size]=item;locate_up(1,m_size);returntrue;}template<classT>voidHeap<T>::display(){for(inti=1;i<=m_size;i++)std::cout<<m_array[i]<<" ";std::cout<<std::endl;}template<classT>voidHeap<T>::arrayToHeap(T*array,intsize){m_maxSize=size;m_array=newT[m_maxSize+1];for(inti=1;i<=size;++i)this->insert(array[i-1]);this->heapSort();for(inti=1;i<size;++i)array[i-1]=m_array[i];}template<classT>voidHeap<T>::heapSort(){Ttemp;inti;for(i=m_size;i>=2;--i){temp=m_array[1];m_array[1]=m_array[i];m_array[i]=temp;locate_down(1,i-1);}}
6.1 Heaps 151
The (binary) heap data structure is an array object that we can view as a nearly complete binary tree.
12345678910
// calculate by shifting the binary representationintparent(i){returni/2;}intleft(i){return2*i;}intright(i){return2*i+1;}
max-heap: for every node i other than the root, A[parent(i)]>=A[i].
min-heap: for every node i other than the root, A[parent(i)]<=A[i].
height of a node: number of edges on the longest simple downward path from the node to a leaf.
height of the heap: height of its root. Θ(lg n)
6.2 Maintaining the heap property 154
12345678910111213
// Input:// A array// i index// Binary trees rooted at left(i) and right(i) are max-heaps// but A[i] might be smaller than its children. // Procedure:// exchange A[i] with the larger one between A[left[i]] and A[right[i]]// maxHeapify the new sub tree// Output:// This routine lets the value at A[i] float down in the max-heap. maxHeapify(A,i){TODO;}
conduct maxHeapify upon the array from the rear to the head
O(n)
6.4 The heapsort algorithm 159
build max-heap, exchange the root with the end of heap, and maxHeapify the rest of the heap.
6.5 Priority queues 162
one of the most popular applications of a heap: as an efficient priority queue. It can be used to schedule jobs on a shared computer. Highest-priority job first.
Despite this slow worst-case running time Θ(n2) , quicksort is often the best practical choice for sorting because it is remarkably efficient on the average: its expected running time is Θ(n lg n), and the constant factors hidden in the Θ(n lg n) notation are quite small.
7.1 Description of quicksort 170
Divide-and-conquer paradigm
Divide: element A[q] and subarrays on the left and right, s.t. left <= A[q] <= right
Worst-case partitioning: the partitioning routine produces one subproblem with n-1 elements and one with 0 elements.
T(n) = T(n-1) + T(0) + Θ(n)
Θ(n2)
Best-case partitioning: even possible split
T(n) = 2T(n/2) + Θ(n)
Θ(n lg n)
Balanced partitioning: Any split of constant proportionality yields a recursion tree of depth Θ(lg n), where the cost at each level is O(n).
Average case: PARTITION produces a mix of “good” and “bad” splits.
TODO
7.3 A randomized version of quicksort 179
pick the flag in the partition routine randomly (exchange with the one at the rear)
7.4 Analysis of quicksort 180
8 Sorting in Linear Time 191
comparison sorts: merge sort, heapsort, and quicksort sharing, must make Ω(n lg n) comparisons in the worst case to sort n elements.
However, counting sort, radix sort, and bucket sort…
8.1 Lower bounds for sorting 191
8.2 Counting sort 194
counting sort assumes that the input consists of integers in a small range
Create an extra array to record the occurrence of these number and put them back into a new array as the result. We usually use counting sort when we have k = O(n), in which case the running time is Θ(n).
dest[256]: 256 lists of bytes. each list should have enough space to hold source_n elements.
a pseudo-code would sort the list this way:
for i=0 to source_n do
dest[source[i]].append(source[i])
8.4 Bucket sort 200
bucket sort assumes that the input is generated by a random process that distributes elements uniformly and independently over the interval [0, 1).
Bucket sort divides the interval [0, 1) into n equal-sized subintervals, or buckets, and then distributes the n input numbers into the buckets. Since the inputs are uniformly and independently distributed over [0, 1), we do not expect many numbers to fall into each bucket. To produce the output, we simply sort the numbers in each bucket (e.g. insertion sort) and then go through the buckets in order, listing the elements in each.
Θ(n)
9 Medians and Order Statistics 213
Selection problem selects i-st smallest number among a set A of n (distinct) numbers
9.1 Minimum and maximum 214
How to find the min in A? Best way is to iterate it once (n-1 comparisons).
How to find min and max simultaneously? In fact, we can find both the minimum and the maximum using at most 3n/2 comparisons. 每次取一对,自己先比较,然后大的跟max比,小的跟min比,总共只需要比3n/2次。
#include <iostream>usingnamespacestd;intrandomizedSelect(double*A,intp,intr,inti);intrandomizedPatition(double*A,intp,intr);intpartition(double*A,intp,intr);// ----------------------------------------------------------------------------// select i-st largest number from A[p] to A[r]intrandomizedSelect(double*A,intp,intr,inti){if(p==r)returnA[p];intq=randomizedPatition(A,p,r);intk=q-p+1;if(i==k)// the pivot value is the answerreturnA[q];elseif(i<k)// on the leftreturnrandomizedSelect(A,p,q-1,i);else// on the rightreturnrandomizedSelect(A,q+1,r,i-k);}// ----------------------------------------------------------------------------// get the index of the number in the middle // s.t. subarray on the left <= A[index] <= subarray on the rightintrandomizedPatition(double*A,intp,intr){inti=random()%(p-r+1)+p;doubletmp=A[i];A[i]=A[r];A[r]=tmp;returnpartition(A,p,r);}// ----------------------------------------------------------------------------// pick A[r] as the flag// i for the left array <= A[r]// j for iterating the whole array // finally, exchange A[r] in the middle// return the new middle indexintpartition(double*A,intp,intr){doublex=A[r];inti=p-1;for(intj=p;j<=r-1;++j)if(A[j]<=x){++i;doubletmp=A[i];A[i]=A[j];A[j]=tmp;}doubletmp=A[i+1];A[i+1]=A[r];A[r]=tmp;returni+1;}// ----------------------------------------------------------------------------intmain(){doubleA[10]={8,1,3,2,4,7,5,6,9,0};for(inti=0;i<10;++i)cout<<A[i];cout<<endl;cout<<randomizedSelect(A,0,9,5)<<endl;return0;}
E[T(n)] = O(n)
9.3 Selection in worst-case linear time 220
O(n)
III Data Structures
Data structures: techniques for representing finite dynamic sets and manipulating them on a computer.
#include "stack.h"template<classT>classSeqStack:publicStack<T>{public:SeqStack(intmSize);~SeqStack(){delete[]s;}boolIsEmpty()const{returntop==-1;}boolIsFull()const{returntop==maxTop;}boolTop(T&x)const;//return the top element in xboolPush(Tx);TPop();voidClear(){top=-1;}private:inttop;intmaxTop;T*s;};template<classT>SeqStack<T>::SeqStack(intmSize){maxTop=mSize-1;s=newT[mSize];top=-1;}template<classT>boolSeqStack<T>::Top(T&x)const{if(IsEmpty()){std::cout<<"Empty"<<std::endl;returnfalse;}x=s[top];returntrue;}template<classT>boolSeqStack<T>::Push(Tx){if(IsFull()){std::cout<<"Overflow"<<std::endl;returnfalse;}s[++top]=x;returntrue;}template<classT>TSeqStack<T>::Pop(){if(IsEmpty()){std::cout<<"Underflow"<<std::endl;returnfalse;}returns[top--];}
template<classT>classLinearList{public:virtualboolIsEmpty()const=0;virtualintGetLength()const=0;virtualboolFind(inti,T&x)const=0;//return element x with index ivirtualintSearch(Tx)const=0;//if x is found, return its index, else -1virtualboolInsert(inti,Tx)=0;//insert x at element Ai.virtualboolDelete(inti)=0;//delete element AivirtualboolUpdate(inti,Tx)=0;//update element Ai with xvirtualvoidOutput(std::ostream&out)const=0;//ouput the list to the streamprotected:intElemNum;//element number in the list};
How do we implement pointers and objects in languages that do not provide them?
A multiple-array representation of objects
for example, next[], key[], prev[] can store the values and indexes
A single-array representation of objects
for example, store triple < key, next, prev > in one single array
Allocating and freeing objects
a garbage collector is responsible for determining which objects are unused.
We keep the free objects in a singly linked list, which we call the free list. Each object in the representation is either in list L or in the free list, but not in both. Free list acts like a stack. 在free list的head操作
10.4 Representing rooted trees 246
Binary trees
Rooted trees with unbounded branching
left-child, right-sibling representation
x.left-child points to the leftmost child of node x
x.right-sibling points to the sibling of x immediately to its right.
If node x has no children, then x.left-child = NIL, and if node x is the rightmost child of its parent, then x.right-sibling = NIL.
A hash table is an effective data structure for implementing dictionaries.
searching worst-case Θ(n)
expected O(1)
Instead of using the key as an array index directly, the array index is computed from the key.
11.1 Direct-address tables 254
Assume that universe U of keys is reasonably small, and no two elements have the same key. Store x at T[x.key] or at the place which T[x.key] points to. (Even storing the key is not necessary.)
slot: position in the array (direct-address table)
11.2 Hash tables 256
What if universe U is large? namely, K of keys stored in a dictionary is much smaller than the universe U of all possible keys?
hash function! element is stored in slot h(k) instead of key directly
h: U -> {0, 1, …, m-1}
An element with key khashes to slot h(k).
Problem: collision: two keys may hash to the same slot.
Collision resolution by chaining, insert at the head
Given a hash table T with m slots that stores n elements, we define the load factor α for T as n/m, that is, the average number of elements stored in a chain.
Worst-case: Θ(n) + T(compute hash)
Average-case: simple uniform hashing Any given element is equally likely to hash into any of the m slots, independently of where any other element has hashed to.
Unsuccessful, Θ(1+α) n个元素m个slot,期望是每个slot有n/m个元素
Successful, Θ(1+α)
11.3 Hash functions 262
What makes a good hash function? less collisions
satisfies (approximately) the assumption of simple uniform hashing.
Hash values are independent of any patterns that might exist in the data.
Interpreting keys as natural numbers
Different ways:
division method. h(k) = k mod m.
multiplication method. h(k) = ⦏m(kA mod 1)⦎, “kA mod 1” means the fractional part of kA, that is, kA - ⦏kA⦎. (A is a constant, 0 < A < 1) Knuth suggests that A = 0.618
universal hashing. select the hash function at random from a carefully designed class of functions.
11.4 Open addressing 269
Open addressing: all elements occupy the hash table itself. No chaining. load factor α can never exceed 1.
We compute the sequence of slots to be examined by probing until we find an empty slot. The probing sequence has not to be < 0, 1, …, m-1 > with Θ(n), but depends upon the key being inserted. To determine which slots to probe, we extend the hash function to include the probe number (starting from 0) as a second input. < h(k,0), h(k,1), …, h(k,m-1) >.
#include"dynamic.h"#include"binarytree.h"//二叉搜索树://所有节点的关键字不同,此树要么为空,要么://1. 左不空,则左所有<根//2. 右不空,则右所有>根//3. 左右子树均为二叉搜索树template<classT>classBSTree:publicDynamicSet<T>{public:BSTree(){root=NULL;}ResultCodeSearch_recursion(T&x)const;ResultCodeSearch_iteration(T&x)const;ResultCodeInsert_recursion(T&x);ResultCodeInsert_iteration(T&x);ResultCodeRemove(T&x);protected:BTNode<T>*root;private:ResultCodeSearch(BTNode<T>*p,T&x)const;ResultCodeInsert(BTNode<T>*p,T&x);};template<classT>ResultCodeBSTree<T>::Search_recursion(T&x)const{returnSearch(root,x);}template<classT>ResultCodeBSTree<T>::Search(BTNode<T>*p,T&x)const{if(!p)returnNotPresent;elseif(x>p->element)returnSearch(p->rChild,x);elseif(x<p->element)returnSearch(p->lChild,x);elsereturnSuccess;}template<classT>ResultCodeBSTree<T>::Search_iteration(T&x)const{BTNode<T>*p=root;while(p){if(x<p->element)p=p->lChild;elseif(x>p->element)p=p->rChild;elsereturnSuccess;}returnNotPresent;}template<classT>ResultCodeBSTree<T>::Insert_recursion(T&x)//ATTENTION: situation when root is NULL{if(root)returnInsert(root,x);elseroot=newBTNode<T>(x);}template<classT>ResultCodeBSTree<T>::Insert(BTNode<T>*p,T&x)//ATTENTION: situation when root is NULL{if(x>p->element){if(p->rChild==NULL){p->rChild=newBTNode(x);returnSuccess;}elsereturnInsert(p->rChild,x);}elseif(x<p->element){if(p->lChild==NULL){p->lChild=newBTNode(T);returnSuccess;}elsereturnInsert(p->lChild,x);}elsereturnDuplicate;}template<classT>ResultCodeBSTree<T>::Insert_iteration(T&x){BTNode<T>*p=root,*q=NULL;if(!p){q=p;if(x>p->element)p=p->rChild;elseif(x<p->element)p=p->lChild;elsereturnDuplicate;}p=newBTNode<T>(x);if(!root)root=p;elseif(x<q->element)q->lChild=p;elseq->rChild=p;returnSuccess;}template<classT>ResultCodeBSTree<T>::Remove(T*x){// TODO}
12.1 What is a binary search tree? 286
binary-search-tree property: left <= middle <= right: Let x be a node in a binary search tree. If y is a node in the left subtree of x, then y.key <= x.key. If y is a node in the right subtree of x, then y.key >= x.key.
How to print out keys in sorted order? inorder tree walk (other walks: preorder, postorder)
Iteration is faster in Java, C, Python… Recursion is fairly expensive compared to iteration (in general) because it requires the allocation of a new stack frame.
Recursion might be faster in functional programming language.
Searching
Recursive version: Me? search left? search right?
Iteration version: Use pointer p. p->key? p=p->left? p=p->right?
Min and Max
left-most leaf and right-most leaf
Successor and predecessor
Successor: If x has right subtree, return max(x.right), else find in parents.
Predecessor: If x has left subtree, return max(x.left), else find in parents.
12.3 Insertion and deletion 294
12.4 Randomly built binary search trees 299
13 Red-Black Trees 308
13.1 Properties of red-black trees 308
A BST with one extra color bit of storage per node.
Red-black properties:
Every node is either red or black.
The root is black.
Every leaf (NIL) is black.
If a node is red, then both its children are black.
For each node, all simple paths from the node to descendant leaves contain the same number of black nodes.
sentinel T.nil for all all pointers to NIL (replacing leaves and parent of root)
black-height of node x: number of black nodes on the path from (not including) x to a leaf
black-height of a tree: black-height of its root
A red-black tree with n internal nodes has height at most 2*log(n+1)
13.2 Rotations 312
Rotation: A local operation in a search tree that preserves the binary-search-tree property.
left-rotation: counter-clockwise
set y
turn y’s left subtree into x’s right subtree
link x’s parent to y
put x on y’s left
right-rotation: clockwise
13.3 Insertion 315
insert node redz into tree T (Property 5)
auxiliary procedure RBinsertFixup(T,z) to re-color and rotate.Ω
Property 2 is violated if z is the root
Property 4 is violated if z’s parent is red.
Case 1: z’s uncle y is red: z上两层反色
Case 2: z’s uncle y is black and z is a right child: around z left-rotate
Case 3: z’s uncle y is black and z is a left child: around z’s parent right-rotate + z上两层反色
O(lgn)
13.4 Deletion 323
RBTransplant(T, u, v): replace subtree u with subtree v. mind NIL node as parent and leaves
TODO
14 Augmenting Data Structures 339
14.1 Dynamic order statistics 339
Rank of an element: its position in the linear order of the set
Order-statistic tree: RBT + every node has x.size = number of nodes in its subtree, including the NIL sentinel
x.size = x.left.size + x.right.size + 1
14.2 How to augment a data structure 345
Choose an underlying data structure.
Determine additional information to maintain in the underlying data structure.
Verify that we can maintain the additional information for the basic modifying operations on the underlying data structure.