Today目标–如虎添翼(虎指suffix
Algorithms in Bioinformatics91/10/30筆記
Today目標–如虎添翼(虎指suffix tree)
* An fundamental query that significantly strengthens suffix tree
--Range Minima Query (RMQ)
-->>前翼: RMQ for ?sequences.
-->>後翼: RMQ for general sequences.
*今天介紹內容的Hierarchy
--+/-RMQ(最底層)
--LCA(第二層)
--LCE、RMQ(第三層分兩邊)
--Wildcard matching、Fuzzy matching (建構在LCE的第四層) --Document listing (建構在RMQ的第四層)
Part 0: 簡介RMQ
1. RMQ: Range Minima Query
*S: a sequence of numbers.
*小(S, i, j) = k if.(k是位置i到位置j中子字串最小值的位置) --i ? k ? j, and
--S[k] = min(S[i], S[i+1], …, S[j]).
*例題:
123456789
S = 340141932
--小(S, 2, 6) = 3
--小(S, 4, 9) = 4 (or 6).
2. The RMQ challenge
*Input: a sequence S of numbers.
*Output: a data structure D for S.(建造一個D,使其可以滿足時間複雜度的要求)
*Time complexity(要求要達成的目標)
--Constant query time
-->>Each query 小(S, i, j) for S can be answered from D and S in O(1) time.
--Linear preprocessing time
-->>D can be computed in O(|S|) time.
3. Naive approach
*Storing the answer of 小(S, i, j) in a table for all index pairs i and j with 1 ? i ? j ?
|S|.(直覺想法:將所有的”小(S, i, j)”算出來存著) *Query time = O(1).
*Preprocessing time = Ω(|S|?).(不符合要求)
4. Faster Preprocessing(讓Preprocessing time縮短一點)
*Assumption (without loss of generality).--|S| = 2^k for some positive integer k.(Sequence的長度必須為2的次方)
*Idea:
--Precomputing the values of 小(S, i, j) only for those indices i and j with j–i + 1 = 1,
2, 4, 8, …, 2^k = |S|.(不算出全部的答案,只求出2的次方長度的小(S, i, j)來儲存)
*Preprocessing time
--O(|S| log |S|).(還沒有達到目標,但是比Ω(|S|?)短了一些) 5. 小(S, i, j) still in O(1) time
*Let k be the (unique) integer that satisfies 2^k ? j – i + 1 < 2^ (k+1).
*Then, 小(S, i, j) is
--x = 小(S, i, i + 2^k–1) or
--y = 小(S, j–2^k + 1, j).(因為j – i + 1在2^k和2^ (k+1)之間,所以由i開始算到i+2^k-1的”小”和由j往回算2^k-1的”小”都已經在Preprocessing被算出來了,所以只要比較兩者誰小,就是i到j的小(S, i, j)了,且這兩個小所比較的範圍會將整個i到j 中所有的數值全部包括下去)
--因為所有2^k的小(S, i, j)已經記住,所以query可以在constant time完成.
Part 1: 前翼
*The RMQ Challenge for ?sequeneces(只針對特殊的sequences) 1. ?sequeneces
*S is a ?sequence if S[i] – S[i – 1] = ?1 for each index i with 2 ? i ? |S|.(前後數值的關係不是1就是-1)
*For example,(允許負數在sequence裡面)
--S = 5 6 5 4 3 2 3 2 3 4 5 6 5 6 7
-- + - - - - + - + + + + - + +
--S = 3 4 3 2 1 0 -1 -2 -1 0 1 2 1
-- + - - - - - - + + + + -
2. 前翼: The RMQ Challenge for ?sequeneces *Input: a ?sequence S of numbers
*Output: a data structure D for S
*Time complexity
--Constant query time
-->>Each query 小(S, i, j) for S can be answered from D and S in O(1) time.
--Linear preprocessing time
-->>D can be computed in O(|S|) time under the unit-cost RAM model.
-->>全部假設在Unit-Cost RAM model下才能達到此time complexity的目標.
3. Unit-Cost RAM model
*Operations such as add, minus, comparison on consecutive O(log n) bits can be
performed in O(1) time.
--在排序演算法中需要O(n*log n)的時間才能排好,但若把bit=log n的時間考慮下去,則需要O(n*(log n)?)的時間,所以一般都是省略bit的時間.
4. Idea: compression
*Breaking S into blocks of length L = ? log |S|. (any constant c < 1 is OK,L的長度要為log|S|*(小於1的數值)才可以)
--There are B = 2|S|/log |S| blocks.(將S切開成B個blocks,每個blocks長度為? log |S|)
*Let 縮[t] be the minimum of the i-th block of S. --縮[t] = min {S[j] | (t–1) L < j ? tL} for t = 1, 2, …, B.(縮[t]記住第t個block最小的數值)
--Computable in O(|S|) time.
*RMQ on縮: 小(縮, x, y)
--O(1) query time.
--O(|S|) preprocessing time. (Why?)(所有的縮有(|S|/log |S|)個,比較彼此之間的大小需要花log(|S|/log |S|)的時間,所以處理所花的時間為O{(|S|/log |S|)*log(|S|/log
|S|)}會在O(|S|)之中).
5.小(S, i, j) via 小(縮, s, t)
*Suppose S[i] is in the α-th block of S.
--(α–1) L>If (++error > k) then return “no”;(進入while loop則error要加1,若error大於k則表示P在此沒有fussy occurrences)
-->>j += 1 + 延(i + j + 1, j + 2);(error沒有超過k,則繼續比對下去)
--return “yes”.
--loop只花k次的時間,所以全部只花k*|S|的時間 3. O (k|S|) time
*O(|P|+|S|) = O(|S|) time: preprocessing for supporting each 延(i, j) query in O(1)
time.
*O(|S|) iterations, each takes time O(k).
Part 6: 後翼: The RMQ (i.e., 小(S, i, j)) challenge for general sequences
*Another application of lowest common ancestor 1. The RMQ challenge
*Input: a sequence S of numbers
*Output: a data structure D for S
*Time complexity
--Constant query time
-->>Each query 小(S, i, j) for S can be answered from D and S in O(1) time.
--Linear preprocessing time
-->>D can be computed in O(|S|) time.
2. Idea: Minima Tree
*建構Minima Tree
123456789
S=432417363
--先找整個string中最小值的當root,由這個位置分左半邊和右半邊,再分別尋找最小值來當root,依此類推將整個tree建構起來. *小(S,i,j)=祖(i,j).
--若i和j為不同邊,不管怎麼挑i和j,則root一定是它們的最小值,若在同一邊,根據每次都挑最小的當root,則一定會找到最小的值. 3. Homework 1 (Optional)(兩週之內交)
*Show how to construct a minima tree for any sequence S of numbers in O(|S|) time.
*Due in two weeks.
--Email your typesetted solutions to TAs, or hand in your hand-written solutions to me
before the end of next lecture (in two weeks.)
Part 7: Listing source strings that contains a pattern string [Muthukrishnan,
SODA’02]
*An application of RMQ for general sequences
1. The problem
*Input:
--Strings S1, S2, …, Sm, which can be preprocessed in linear time. --A string P.
*Output: (問的問題可以有許多種,這只是其中的一種) --The index j of each Sj that contains P.(P若出現在Sj中,則將j output出來)
2. Preliminary attempts
*Obtaining the suffix tree for S1#S2#…#Sm$.
--Find all occurrences of P.
-->>i.e., exact string matching for S1#S2#…#Sm$ and P.
-->>Time = O(|P| + total number of occurences of P).(加上P在這個tree中出現的次數,但是這樣的時間複雜度不佳,因為若P只出現在S2中,但卻重複出現了100次,則時間的複雜度要為O(|P| + 100),但output卻只有”2”一個index,實在是太浪費時間了)
*Obtaining the suffix tree for each Si.--Determining whether P occurs in Si.
-->>i.e., substring problem for each pair Si and P. -->>Time = O(|P| * m).(每個S都去建一個tree,到每個tree去詢問P是否在裡面,則需要O(|P| * m)的時間)
3. The challenge
*Input:
--Strings S1, S2, …, Sm, which can be preprocessed in linear time. --A string P.
*Output:
--The index j of each Sj that contains P.
*Objective
--O(|P| + 現(P)) time, where 現(P) is the number of output indices.(P出現在幾個S中,則為現(P)的值)
4. The second attempt
*Constructing the suffix tree for S1#S2#…#Sm$. *Keeping the distinct descendant leaf colors for each internal node.(在internal node中記住自這個點以下所包含的葉子顏色)
*Query time? 符合我們所要求的目標
*Preprocessing time? 會花太多的時間
*Each query takes O(|P|+現(P)) time. (Why?)
--因為P出現在哪裡,代表以下所有包含的葉子顏色都出現了P.
*The preprocessing may need Ω(m|S1#S2#…#Sm$|) time. (Why?) --因為internal node要記住的顏色太多.
*Q: Any suggestions for resolving this problem?
--像之前一樣,將所有葉子的顏色由左至右排列,以方便查詢. 5. Compact Representation
*Keeping the list 彩 of leaf colors from left to right.(將全部的葉子由左至右排列並用一個”彩”來記住所有葉子的顏色)
*Each internal keeps the indices of leftmost and rightmost descendant leaves.(internal
node只要記住葉子最左邊和最右邊的兩個index) *例題中共有8個葉子,所以彩也有8個,但是顏色只有3種,output時間中的現(P)希望和出現的顏色種類成正比.
6. The challenge of listing distinct colors
*Input: a sequence 彩 of colors.
*Output: a data structure D for 彩 such that
--D is computable in O(|彩|) time.
--Each 顏(i, j) = {彩(i), …, 彩(j)} query can be answered from D in O(|顏(i, j)|) time.(顏(i, j)是指在彩(i)到彩(j)中出現幾種顏色) 7. An auxiliary index array
*Let 前[i] = 0 if 彩[j] ?彩[i] for all j < i. (多一個array來幫忙,若第i個彩在之前的i-1個彩中都沒有出現過,則前[i]的值給0) *Let 前[i] be the largest index j with j < i such that 彩[i] = 彩[j].(若在i之前已經出現過這個顏色,則前[i]為前一個出現這個顏色的位置) 8. An observation
*A color c is in 顏(i, j) if and only there is an index k in [i, j] such that 彩[k] = c, and
前[k] < i.
--在i到j中,若有一個顏色c在位置k,要output出去,則表示這個彩[k]=c且前[k]會小於i,因為若前[k]大於i則表示在i到j之間已經出現過顏色c,所以不應該再output一次.
9. The algorithm 解(i, j)
*Just recursively call 破(i, j, i);
*Subroutine 破(p, q, 左界): (左界為i,且自始至終都沒有改變) --If (p > q) then return; (p ? q才能往下一步走) --Let k = 小(前, p, q);(請問p到q之間最小的前在哪個位置) --If (前[k] ? 左界) then return;(若最小的前值已經在左界的右邊,則表示p到q之間已經沒有新的顏色可以output了)
--Output 彩[k];(前[k]<左界表示為第一次出現的顏色) --Call 破(p, k–1, 左界);(因k而分兩邊來繼續找,所以左右邊分別呼叫破)
--Call 破(k + 1, q, 左界);
10. Tracing 破(4, 8, 4)…
*例題:
1 2 3 4 5 6 7 8
彩 黃紅綠紅綠黃綠黃
前 0 0 0 2 3 1 5 6
--判斷p是否小於等於q,因為4<8,成立往下計算 --小(前,4,8)=6
--判斷前[6]=1<左界=4,繼續往下計算
--輸出彩[6]=黃色
--call破(4,5,4) (4=p ? q=5成立,找小(前,4,5)=4,且前[4]=2<左界=4,輸出彩[4]=紅色,呼叫破(4,3,4)和破(5,5,4);破(4,3,4)因為4=p > q=3所以return,破(5,5,4) 則因5=p ? q=5成立,找小(前,5,5)=5,且前[5]=3<左界=4,輸出彩[5]=綠色,呼叫破(5,4,4)和破(6,5,4);因這兩個5=p > q=4和6=p > q=5,所以皆return)
--call破(7,8,4)(p左界=4,所以return不用再找了)
11. Time = O(|顏(i, j)|)
*Why?