¹ | Ñëàéä | Òåêñò |
1 |
 |
Suffix Arrays: A new method for on-line string searchesUdi Manber Gene Myers May 1989 Presented by: Oren Weimann 1 |
2 |
 |
Introduction - Problem definitionW= badgfbb A= abccbbadgfbbcahgjf A= abccbbadgfbbcahgjf “Is W a substring of A?” |A|=N and |W|=P A = a0a1…aN-1 Ai = suffix beginning at index i = aiai+1…aN-1 2 |
3 |
 |
Introduction – what is a suffix arrayExample: A = assassin Pos[2] = 6 (A6 = in) Pos 3 0 1 2 3 4 5 6 7 |
4 |
 |
Introduction – what is a suffix arrayA lexicographically sorted array- Pos[N], of all the suffixes of A: Pos[k] = i ? Ai is the kth smallest suffix in the set {A0, A1, A2…… AN-1} 4 |
5 |
 |
Introduction – what is a suffix treeExample: A = assassin A trie that contains all suffixes of A: 5 s a s s a s s i n s 1 i i a n i n s a n i s s 6 s 5 4 i n i n n 0 3 2 0 1 2 3 4 5 6 7 |
6 |
 |
The Article OverviewA search algorithm In O(P+logN) (assuming we already computed Pos[ ] and the longest common prefix (lcp) information). How to construct Pos[ ] in O(NlogN) time and O(N) space. (assuming lcp info is known) An Algorithm for computing the lcp information in O(NlogN). Algorithms for Expected-time improvement. 6 |
7 |
 |
The Search algorithm - DefinitionsFor any string u, up = u1u2u3…….up (or u if |u| p) Let “ “ denote a Lexicographical order, We say u v ? up vp Note that for any choice of p: Note that W is a substring of A ? there is an i such that W 7 |
8 |
 |
The Search algorithm – how does the array help us know if W is asubstring of A? We define a search interval: LW = min {k | W APos[k] or k = N} RW = max {k | W APos[k] or k = -1} W matches ai ai+1 ...ai+P-1 ? i=Pos[k] for some k [LW, RW] 8 |
9 |
 |
Example:A = assassin Pos 9 Option 1 Option 2 Option 3 0 1 2 3 4 5 6 7 |
10 |
 |
Why finding LW, RW == Finding the matches: If LW > RW => W is not asubstring of A. Else: there are (RW-LW+1) matches - APos[LW],…, APos[RW] 10 Pos W>APos[k] W<APos[k] LW RW |
11 |
 |
The Search algorithm – The easy way - O(PlogN)W=“abcx” 11 Log(N) iterations, each iteration sets new L,R bonds (initially L=0, R=N-1) according to a comparison of W with APos[M] , where M=(L+R)/2. In the end LW R M R L Pos abcde... abcdf... abd... |
12 |
 |
The Search algorithm using lcp values in O(P+logN) – Definitions:Speedup using precomputed lcp Values, for now We assume lcp is known. Each iteration We define: l = lcp(APos[L], W) r=lcp(W, APos[R]) Llcp[M] = lcp(APos[L] APos[M]) Rlcp[M] = lcp(APos[M], APos[R]) 12 |
13 |
 |
The Search algorithm using lcp values in O(P+logN) Example: A=“abcx”Note that Llcp[M] is well defined because every midpoint M has one LM and one RM 13 M R L l = 3 r = 2 Pos Llcp[M]=4 Rlcp[M]=2 abcde... abcdf... abd... |
14 |
![So how do we use l,r,Llcp[M]](/up/thumbs/240140/014.jpg) |
So how do we use l,r,Llcp[M] Example: W=abcx 14 Case 1: Llcp[M] > l (Llcp[M]=4 and l=3 ) W>APos[L] W>APos[M] Go right l is unchanged = 3 R L M Llcp[M]=4 l=3 r=2 abcde... abc... abc... abcdf… abd… |
15 |
 |
Example: W=abcx (cont15 Case 2: Llcp[M] < l (Llcp[M]=2 and l=3 ) APos[L] <APos[M] W<APos[M] Go left r = Llcp[M] = 2 M L R Llcp[M]=2 l=3 r=2 abcde... abdf… abd… |
16 |
 |
Example: W=abcx (cont16 Case 3: Llcp[M] = l (Llcp[M]=3 and l=3 ) Compare Wl and APos[M]l until Wl+j APos[M]l+j Go right or left according to Wl+j, APos[M]l+j new l or r = (l+j) Number of comparisons = j+1 L M R Llcp[M]=3 r=2 l=3 abcde... abc... abc... abcp… abd… |
17 |
 |
The Search algorithm using lcp values- complexityIn each iteration there are maximum j+1 comparisons, when in total Total comparisons (P + #Iterations) O(P+logN) running time Requires only 3N-sized arrays 17 |
18 |
 |
The Article OverviewA search algorithm In O(P+logN) (assuming we already computed Pos[ ] and the longest common prefix (lcp) information). How to construct Pos[ ] in O(NlogN) time and O(N) space. (assuming lcp info is known) An Algorithm for computing the lcp information in O(NlogN). Algorithms for Expected-time improvement. 18 |
19 |
 |
Construction of suffix array in O(NlogN)Sorting the suffixes in a unique Radix sort – We Will have O(logN) stages (numbered 1,2,4,8,16…) In stage H the suffixes are sorted in buckets called H Buckets, according to the first H characters. (next stage is 2H– thus, in stage H the suffixes are sorted by ) 19 |
20 |
 |
Construction of suffix array – The general ideaIf Ai, Aj H-bucket we Sort them by the Next H symbols, but: Their next H symbols = first H symbols of Ai+H and Aj+H which are already sorted in phase H. 20 H=2: first bucket second bucket third bucket fourth bucket Ai Aj Aj+H Ai+H abef… abcd… ab… bb... bb… cd… cd… ef… |
21 |
 |
Construction of suffix array – The general idea (contLet Ai be in first H-bucket after stage H Ai starts with smallest H-symbol string Ai-H should be first in its H-bucket 21 H=2: Ai Ai-H abef… abcd… ab… bb... bb… cdef… cdab… ef… |
22 |
 |
Construction of suffix array – The algorithm22 Go over the suffix array: For each Ai: Move Ai-H to next available place in its H-bucket The suffixes are now sorted according to -order Go over the array again, and decide which suffix opens a new 2H-bucket, use lcs knowledge (described later) |
23 |
 |
Construction of suffix array – The algorithm Example:A = assassin H=1 23 A2 A3 Ai sets Ai-1 assin assassin in n sin ssin sassin ssassin 0 1 2 3 4 5 6 7 |
24 |
 |
Construction of suffix array – The algorithm Example:A = assassin H=1 24 A0 Ai sets Ai-1 assin assassin in n sassin ssin sin ssassin 0 1 2 3 4 5 6 7 |
25 |
 |
Construction of suffix array – The algorithm Example:A = assassin H=1 25 A6 A5 Ai sets Ai-1 assin assassin in n sassin ssin sin ssassin 0 1 2 3 4 5 6 7 |
26 |
 |
Construction of suffix array – The algorithm Example:A = assassin H=1 26 A6 A7 Ai sets Ai-1 assin assassin in n sassin sin ssin ssassin 0 1 2 3 4 5 6 7 |
27 |
 |
Construction of suffix array – The algorithm Example:A = assassin H=1 27 A2 A1 Ai sets Ai-1 assin assassin in n sassin sin ssin ssassin 0 1 2 3 4 5 6 7 |
28 |
 |
Construction of suffix array – The algorithm Example:A = assassin H=1 28 A4 A5 Ai sets Ai-1 assin assassin in n sassin sin ssassin ssin 0 1 2 3 4 5 6 7 |
29 |
 |
Construction of suffix array – The algorithm Example:A = assassin H=1 29 A0 A1 Ai sets Ai-1 assin assassin in n sassin sin ssassin ssin 0 1 2 3 4 5 6 7 |
30 |
 |
Construction of suffix array – The algorithm Example:A = assassin H=1 30 A3 A4 Ai sets Ai-1 assassin assin in n sassin sin ssassin ssin 0 1 2 3 4 5 6 7 |
31 |
 |
Construction of suffix array – The algorithm Example:A = assassin H=1 Go over array to get new 2-buckets 31 lcs(sassin,sin)= 1+ lcs(assin,in)= 1+0=1 so “sin” opens a new 2-bucket Ai sets Ai-1 back assassin assin in n sassin sin ssassin ssin 0 1 2 3 4 5 6 7 |
32 |
 |
Construction of suffix array – The algorithm Example:A = assassin H=2 32 A0 Ai sets Ai-2 assassin assin in n sassin sin ssassin ssin 0 1 2 3 4 5 6 7 |
33 |
 |
Construction of suffix array – The algorithm Example:A = assassin H=2 33 A1 A3 Ai sets Ai-2 assassin assin in n sassin sin ssassin ssin 0 1 2 3 4 5 6 7 |
34 |
 |
Construction of suffix array – The algorithm Example:A = assassin H=2 34 A4 A6 Ai sets Ai-2 assassin assin in n sassin sin ssassin ssin 0 1 2 3 4 5 6 7 |
35 |
 |
Construction of suffix array – The algorithm Example:A = assassin H=2 35 A7 A5 Ai sets Ai-2 assassin assin in n sassin sin ssassin ssin 0 1 2 3 4 5 6 7 |
36 |
 |
Construction of suffix array – The algorithm Example:A = assassin H=2 36 A2 A0 Ai sets Ai-2 assassin assin in n sassin sin ssassin ssin 0 1 2 3 4 5 6 7 |
37 |
 |
Construction of suffix array – The algorithm Example:A = assassin H=2 37 A3 A5 Ai sets Ai-2 assassin assin in n sassin sin ssassin ssin 0 1 2 3 4 5 6 7 |
38 |
 |
Construction of suffix array – The algorithm Example:A = assassin H=2 38 A1 Ai sets Ai-2 assassin assin in n sassin sin ssassin ssin 0 1 2 3 4 5 6 7 |
39 |
 |
Construction of suffix array – The algorithm Example:A = assassin H=2 39 A2 A4 Ai sets Ai-2 assassin assin in n sassin sin ssassin ssin 0 1 2 3 4 5 6 7 |
40 |
 |
Construction of suffix array – The algorithm Example:A = assassin H=2 Go over array to get new 4-buckets 40 Ai sets Ai-2 assassin assin in n sassin sin ssassin ssin 0 1 2 3 4 5 6 7 |
41 |
 |
Construction of suffix array – The algorithm Example:A = assassin H=4 That’s it, we are sorted! 41 assassin assin in n sassin sin ssassin ssin 0 1 2 3 4 5 6 7 |
42 |
 |
Construction of suffix array – Complexity Summary42 Sorting by first char – O(N) O(logN) stages of O(N) operations = O(NlogN) Total - time: O(NlogN) - space: 2 integer arrays of size N back |
43 |
 |
The Article OverviewA search algorithm In O(P+logN) (assuming we already computed Pos[ ] and the longest common prefix (lcp) information). How to construct Pos[ ] in O(NlogN) time and O(N) space. An Algorithm for computing the lcp information in O(NlogN). Algorithms for Expected-time improvement. 43 |
44 |
 |
How to find Longest Common Prefixes – the general ideaWe don’t care what is the lcp between suffixes in the same H-bucket. For Ap, Aq in the same H-bucket but different 2H-buckets: H lcp(Ap, Aq) < 2H lcp(Ap, Aq) = H + lcp(Ap+H, Aq+H) lcp(Ap+H, Aq+H) < H ? that is why Ap+H,Aq+H Are in different H-buckets, but which ones? 44 |
45 |
 |
How to find Longest Common Prefixes – the general ideaIf Ap+H and Aq+H were in adjacent H-buckets then lcp is known. how? If not, Then: lcp(APos[i], APos[j]) = {lcp(APos[k],APos[k+1])} 45 |
46 |
 |
How to find Longest Common Prefixes – the general idealcp(Ap+H, Aq+H) = min{1,1,2} = 1 46 H=2 Ap+h Aq+h Notice that if 2 neighbors are in the same H-bucket we can consider there lcp to be H, since lcp(Ap+H, Aq+H) < H 1 1 2 assassin assin in n sassin sin ssassin ssin |
47 |
![How to find lcp – algorithm and data structures – Hgt[]](/up/thumbs/240140/047.jpg) |
How to find lcp – algorithm and data structures – Hgt[]During the construction stage, we build an array Called Hgt[N]: Hgt(i)=lcp(APos[i-1], APos[i]), initialized so that Hgt[i]=N+1 for every i. In stage H=1: Hgt(i)=0 for APos[i] that are first in their buckets. In stage 2H: we update every Hgt(i) that APos[i] is the first in a newly created 2H bucket 47 |
48 |
![How to find lcp – Hgt[] example:](/up/thumbs/240140/048.jpg) |
How to find lcp – Hgt[] example:48 lcp(ssin,sin)=1+lcp(sin,in)=1+min{lcp(in,n),lcp(sin, n)}=1 |
49 |
![How to find lcp – Hgt[] example (cont](/up/thumbs/240140/049.jpg) |
How to find lcp – Hgt[] example (contlcp(assassin,assin)=2+lcp(sin, sassin)=2+1=3 lcp(ssin, ssassin)=2+lcp(in, assin)=2+0=2 49 |
50 |
 |
How to find lcp – data structuresWe need a data structure that will contain lcp(APos[j], APos[i]) between any i and j (not just i and i+1 which Hgt[] supplies) Hgt[] will become the leaves of a binary balanced tree called the Interval tree. 50 |
51 |
 |
How to find lcp – example of Interval tree51 0 9 0 0 0 9 9 0 0 0 9 9 9 |
52 |
 |
How to find lcp – ComplexityEach time a leaf opens a new bucket we change Hgt[i] for that leaf. That change requires O(logN) changes in the interval tree There are O(N) leaves opening new bucket In total we get O(NlogN) to get all lcp values 52 |
53 |
 |
The Article OverviewA search algorithm In O(P+logN) (assuming we already computed Pos[ ] and the longest common prefix (lcp) information). How to construct Pos[ ] in O(NlogN) time and O(N) space. An Algorithm for computing the lcp information in O(NlogN). Algorithms for Expected-time improvement. 53 |
54 |
![Time Expected-case Improvement of the construction of pos[]](/up/thumbs/240140/054.jpg) |
Time Expected-case Improvement of the construction of pos[]54 Assumptions: - All N-symbol strings are equally likely. Under this assumption: Expected length of longest repeated substring = O(log| |N) This immediately implies that construction of pos[] is reduced to O(NLogLogN). why? Next is a way to reduce it to O(N). |
55 |
![Time Expected-case Improvement of the construction of pos[]](/up/thumbs/240140/055.jpg) |
Time Expected-case Improvement of the construction of pos[]55 Let T = We encode each possible T length string to an integer with the isomorphism IntT(u) Map each AP to IntT(AP) [0,| |T-1] : IntT(AP) = ap| |T-1 + |
56 |
 |
Example of the mapping56 IntT(AP) = ap| |T-1 + 0*4^0 + 0 0 3*4^0 + 0 3 3*4^0 + 0 3 0*4^0 + 0 0 3*4^0 + 0 3 3*4^0 + 0 3 1*4^0 + 0 1 2*4^0 + 0 2 | |= 4 , a=0, i=1, n=2, s=3 N=8 T= =1 |
57 |
![Time Expected-case Improvement of the construction of pos[]](/up/thumbs/240140/057.jpg) |
Time Expected-case Improvement of the construction of pos[]57 By the definition of IntT(AP) it takes O(N) to compute all IntT(AP) values of all suffixes. So now instead of starting with H=1 we start with H= But since the longest repeated substring length is O(log| |N) we will have O(1) stages of the radix sort. Thus, the total time for constructing pos[] = O(N) |
58 |
 |
So is a suffix array better then a suffix treeSuffix array Suffix tree 58 Construction time O(NlogN) - for small | | O(N) – needs additional space O(N) Time Complexity O(P+logN) – good for large alphabets O(Plog| |) Space Complexity requires 2N integers – this is the main advantage. O(N) dependent on | | ? No Yes |
59 |
 |
The End59 |
«Suffix Arrays: A new method for on-line string searches» |
http://900igr.net/prezentacija/anglijskij-jazyk/suffix-arrays-a-new-method-for-on-line-string-searches-240140.html