Suffix automaton

From Justapedia, unleashing the power of collective wisdom
Jump to navigation Jump to search

Suffix automaton
Suffix automaton bold.svg
TypeSubstring index
Invented1983
Invented byAnselm Blumer; Janet Blumer; Andrzej Ehrenfeucht; David Haussler; Ross McConnell
Time complexity in big O notation
Algorithm Average Worst case
Space

In computer science, a suffix automaton is an efficient data structure for representing the substring index of a given string which allows the storage, processing, and retrieval of compressed information about all its substrings. The suffix automaton of a string is the smallest directed acyclic graph with a dedicated initial vertex and a set of "final" vertices, such that paths from the initial vertex to final vertices represent the suffixes of the string.

In terms of automata theory, a suffix automaton is the minimal partial deterministic finite automaton that recognizes the set of suffixes of a given string . The state graph of a suffix automaton is called a directed acyclic word graph (DAWG), a term that is also sometimes used for any deterministic acyclic finite state automaton.

Suffix automata were introduced in 1983 by a group of scientists from the University of Denver and the University of Colorado Boulder. They suggested a linear time online algorithm for its construction and showed that the suffix automaton of a string having length at least two characters has at most states and at most transitions. Further works have shown a close connection between suffix automata and suffix trees, and have outlined several generalizations of suffix automata, such as compacted suffix automaton obtained by compression of nodes with a single outgoing arc.

Suffix automata provide efficient solutions to problems such as substring search and computation of the largest common substring of two and more strings.

History

Anselm Blumer with a drawing of generalized CDAWG for strings ababc and abcab

The concept of suffix automaton was introduced in 1983[1] by a group of scientists from University of Denver and University of Colorado Boulder consisting of Anselm Blumer, Janet Blumer, Andrzej Ehrenfeucht, David Haussler and Ross McConnell, although similar concepts had earlier been studied alongside suffix trees in the works of Peter Weiner,[2] Vaughan Pratt[3] and Anatol Slissenko.[4] In their initial work, Blumer et al. showed a suffix automaton built for the string of length greater than has at most states and at most transitions, and suggested a linear algorithm for automaton construction.[5]

In 1983, Mu-Tian Chen and Joel Seiferas independently showed that Weiner's 1973 suffix-tree construction algorithm[2] while building a suffix tree of the string constructs a suffix automaton of the reversed string as an auxiliary structure.[6] In 1987, Blumer et al. applied the compressing technique used in suffix trees to a suffix automaton and invented the compacted suffix automaton, which is also called the compacted directed acyclic word graph (CDAWG).[7] In 1997, Maxime Crochemore and Renaud Vérin developed a linear algorithm for direct CDAWG construction.[1] In 2001, Shunsuke Inenaga et al. developed an algorithm for construction of CDAWG for a set of words given by a trie.[8]

Definitions

Usually when speaking about suffix automata and related concepts, some notions from formal language theory and automata theory are used, in particular:[9]

  • "Alphabet" is a finite set that is used to construct words. Its elements are called "characters";
  • "Word" is a finite sequence of characters . "Length" of the word is denoted as ;
  • "Formal language" is a set of words over given alphabet;
  • "Language of all words" is denoted as (where the "*" character stands for Kleene star), "empty word" (the word of zero length) is denoted by the character ;
  • "Concatenation of words" and is denoted as or and corresponds to the word obtained by writing to the right of , that is, ;
  • "Concatenation of languages" and is denoted as or and corresponds to the set of pairwise concatenations ;
  • If the word may be represented as , where , then words , and are called "prefix", "suffix" and "subword" (substring) of the word correspondingly;
  • If and (with ) then is said to "occur" in as a subword. Here and are called left and right positions of occurrence of in correspondingly.

Automaton structure

Formally, deterministic finite automaton is determined by 5-tuple , where:[10]

  • is an "alphabet" that is used to construct words,
  • is a set of automaton "states",
  • is an "initial" state of automaton,
  • is a set of "final" states of automaton,
  • is a partial "transition" function of automaton, such that for and is either undefined or defines a transition from over character .

Most commonly, deterministic finite automaton is represented as a directed graph ("diagram") such that:[10]

  • Set of graph vertices corresponds to the state of states ,
  • Graph has a specific marked vertex corresponding to initial state ,
  • Graph has several marked vertices corresponding to the set of final states ,
  • Set of graph arcs corresponds to the set of transitions ,
  • Specifically, every transition is represented by an arc from to marked with the character . This transition also may be denoted as .

In terms of its diagram, the automaton recognizes the word only if there is a path from the initial vertex to some final vertex such that concatenation of characters on this path forms . The set of words recognized by an automaton forms a language that is set to be recognized by the automaton. In these terms, the language recognized by a suffix automaton of is the language of its (possibly empty) suffixes.[9]

Automaton states

"Right context" of the word with respect to language is a set that is a set of words such that their concatenation with forms a word from . Right contexts induce a natural equivalence relation on the set of all words. If language is recognized by some deterministic finite automaton, there exists unique up to isomorphism automaton that recognizes the same language and has the minimum possible number of states. Such an automaton is called a minimal automaton for the given language . Myhill–Nerode theorem allows it to define it explicitly in terms of right contexts:[11][12]

Theorem — Minimal automaton recognizing language over the alphabet may be explicitly defined in the following way:

  • Alphabet stays the same,
  • States correspond to right contexts of all possible words ,
  • Initial state corresponds to the right context of the empty word ,
  • Final states correspond to right contexts of words from ,
  • Transitions are given by , where and .

In these terms, a "suffix automaton" is the minimal deterministic finite automaton recognizing the language of suffixes of the word . The right context of the word with respect to this language consists of words , such that is a suffix of . It allows to formulate the following lemma defining a bijection between the right context of the word and the set of right positions of its occurrences in :[13][14]

Theorem — Let be the set of right positions of occurrences of in .

There is a following bijection between and :

  • If , then ;
  • If , then .

For example, for the word and its subword , it holds and . Informally, is formed by words that follow occurrences of to the end of and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle endpos(ab)} is formed by right positions of those occurrences. In this example, the element corresponds with the word Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle s_3 s_4 s_5 s_6 s_7 = acaba \in [ab]_R} while the word Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle a \in [ab]_R} corresponds with the element Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle 7-|a|=6 \in endpos(ab)} .

It implies several structure properties of suffix automaton states. Let Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle |\alpha| \leq |\beta|} , then:[14]

  • If Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [\alpha]_R} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [\beta]_R} have at least one common element Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle x} , then Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle endpos(\alpha)} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle endpos(\beta)} have a common element as well. It implies Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha} is a suffix of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \beta} and therefore Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle endpos(\beta) \subset endpos(\alpha)} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [\beta]_R \subset [\alpha]_R} . In aforementioned example, Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle a \in [ab]_R \cap [cab]_R} , so Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle ab} is a suffix of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle cab} and thus Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [cab]_R=\{a\} \subset \{a,acaba\} = [ab]_R} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle endpos(cab) = \{6\} \subset \{2,6\} = endpos(ab)} ;
  • If Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [\alpha]_R=[\beta]_R} , then Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle endpos(\alpha)=endpos(\beta)} , thus Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha} occurs in Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle S} only as a suffix of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \beta} . For example, for Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha=b} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \beta = ab} it holds that Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [b]_R = [ab]_R = \{a,acaba\}} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle endpos(b)=endpos(ab)=\{2,6\}} ;
  • If Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [\alpha]_R=[\beta]_R} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \gamma} is a suffix of such that , then . In the example above and it holds for "intermediate" suffix that .

Any state of the suffix automaton recognizes some continuous chain of nested suffixes of the longest word recognized by this state.[14]

"Left extension" of the string is the longest string that has the same right context as . Length of the longest string recognized by is denoted by . It holds:[15]

Theorem — Left extension of may be represented as , where is the longest word such that any occurrence of in is preceded by .

"Suffix link" of the state is the pointer to the state that contains the largest suffix of that is not recognized by .

In this terms it can be said recognizes exactly all suffixes of that is longer than and not longer than . It also holds:[15]

Theorem — Suffix links form a tree that may be defined explicitly in the following way:

  1. Vertices of the tree correspond to left extensions of all substrings,
  2. Edges of the tree connect pairs of vertices , such that and .

Connection with suffix trees

Relationship of the suffix trie, suffix tree, DAWG and CDAWG

A "prefix tree" (or "trie") is a rooted directed tree in which arcs are marked by characters in such a way no vertex of such tree has two out-going arcs marked with the same character. Some vertices in trie are marked as final. Trie is said to recognize a set of words defined by paths from its root to final vertices. In this way prefix trees are a special kind of deterministic finite automata if you perceive its root as an initial vertex.[16] The "suffix trie" of the word is a prefix tree recognizing a set of its suffixes. "A suffix tree" is a tree obtained from a suffix trie via the compaction procedure, during which consequent edges are merged if the degree of the vertex between them is equal to two.[15]

By its definition, a suffix automaton can be obtained via minimization of the suffix trie. It may be shown that a compacted suffix automaton is obtained by both minimization of the suffix tree (if one assumes each string on the edge of the suffix tree is a solid character from the alphabet) and compaction of the suffix automaton.[17] Besides this connection between the suffix tree and the suffix automaton of the same string there is as well a connection between the suffix automaton of the string and the suffix tree of the reversed string .[18]

Similarly to right contexts one may introduce "left contexts" , "right extensions" corresponding to the longest string having same left context as and the equivalence relation . If one considers right extensions with respect to the language of "prefixes" of the string it may be obtained:[15]

Theorem — Suffix tree of the string may be defined explicitly in the following way:

  • Vertices of the tree correspond to right extensions of all substrings,
  • Edges correspond to triplets such that and .

Here triplet means there is an edge from to with the string written on it

, which implies the suffix link tree of the string and the suffix tree of the string are isomorphic:[18]

Suffix structures of words "abbcbc" and "cbcbba" 

Similarly to the case of left extensions, the following lemma holds for right extensions:[15]

Theorem — Right extension of the string may be represented as Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \overrightarrow{\gamma} = \gamma\alpha} , where Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha} is the longest word such that every occurrence of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \gamma} in Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle S} is succeeded by Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \beta} .

Size

A suffix automaton of the string Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle S} of length Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle n>1} has at most Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle 2n-1} states and at most Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle 3n-4} transitions. These bounds are reached on strings Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle abb\dots bb=ab^{n-1}} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle abb\dots bc=ab^{n-2}c} correspondingly.[13] This may be formulated in a stricter way as Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle |\delta| \leq |Q| + n - 2} where Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle |\delta|} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle |Q|} are the numbers of transitions and states in automaton correspondingly.[14]

Maximal suffix automata

Construction

Initially the automaton only consists of a single state corresponding to the empty word, then characters of the string are added one by one and the automaton is rebuilt on each step incrementally.[19]

State updates

After a new character is appended to the string, some equivalence classes are altered. Let Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [\alpha]_{R_\omega}} be the right context of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha } with respect to the language of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \omega} suffixes. Then the transition from Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [\alpha]_{R_\omega}} to Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [\alpha]_{R_{\omega x}}} after Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle x} is appended to Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \omega} is defined by lemma:[14]

Theorem — Let Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha, \omega \in \Sigma^*} be some words over Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \Sigma} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle x \in \Sigma} be some character from this alphabet. Then there is a following correspondence between Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [\alpha]_{R_\omega}} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [\alpha]_{R_{\omega x} }} :

  • Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [\alpha]_{R_{\omega x} } = [\alpha]_{R_\omega}x \cup \{\varepsilon\}} if Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha} is a suffix of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \omega x} ;
  • Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [\alpha]_{R_{\omega x} } = [\alpha]_{R_\omega}x} otherwise.

After adding Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle x} to the current word Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \omega} the right context of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha} may change significantly only if Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha} is a suffix of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \omega x} . It implies equivalence relation is a refinement of . In other words, if , then . After the addition of a new character at most two equivalence classes of will be split and each of them may split in at most two new classes. First, equivalence class corresponding to empty right context is always split into two equivalence classes, one of them corresponding to itself and having as a right context. This new equivalence class contains exactly and all its suffixes that did not occur in , as the right context of such words was empty before and contains only empty word now.[14]

Given the correspondence between states of the suffix automaton and vertices of the suffix tree, it is possible to find out the second state that may possibly split after a new character is appended. The transition from to corresponds to the transition from to in the reversed string. In terms of suffix trees it corresponds to the insertion of the new longest suffix into the suffix tree of . At most two new vertices may be formed after this insertion: one of them corresponding to , while the other one corresponds to its direct ancestor if there was a branching. Returning to suffix automata, it means the first new state recognizes and the second one (if there is a second new state) is its suffix link. It may be stated as a lemma:[14]

Theorem — Let , be some word and character over . Also let be the longest suffix of , which occurs in , and let . Then for any substrings of it holds:

  • If and , then ;
  • If and , then ;
  • If and , then .

It implies that if (for example, when Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle x} didn't occur in Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \omega} at all and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha=\beta=\varepsilon} ), then only the equivalence class corresponding to the empty right context is split.[14]

Besides suffix links it is also needed to define final states of the automaton. It follows from structure properties that all suffixes of a word Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha} recognized by Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle q=[\alpha]_R} are recognized by some vertex on suffix path Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (q, link(q), link^2(q), \dots)} of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle q} . Namely, suffixes with length greater than Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle len(link(q))} lie in Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle q} , suffixes with length greater than Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle len(link(link(q))} but not greater than Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle len(link(q))} lie in Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle link(q)} and so on. Thus if the state recognizing Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \omega} is denoted by Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle last} , then all final states (that is, recognizing suffixes of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \omega} ) form up the sequence Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (last, link(last), link^2(last), \dots)} .[19]

Transitions and suffix links updates

After the character Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle x} is appended to Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \omega} possible new states of suffix automaton are Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [\omega x]_{R_{\omega x} }} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [\alpha]_{R_{\omega x} }} . Suffix link from Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [\omega x]_{R_{\omega x} }} goes to Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [\alpha]_{R_{\omega x} }} and from Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [\alpha]_{R_{\omega x} }} it goes to Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle link([\alpha]_{R_{\omega} })} . Words from Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [\omega x]_{R_{\omega x} }} occur in Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \omega x} only as its suffixes therefore there should be no transitions at all from Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [\omega x]_{R_{\omega x}}} while transitions to it should go from suffixes of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \omega} having length at least Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha} and be marked with the character Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle x} . State Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [\alpha]_{R_{\omega x}}} is formed by subset of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [\alpha]_{R_\omega}} , thus transitions from Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [\alpha]_{R_{\omega x}}} should be same as from Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [\alpha]_{R_\omega}} . Meanwhile, transitions leading to Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [\alpha]_{R_{\omega x}}} should go from suffixes of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \omega} having length less than Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle |\alpha|} and at least Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle len(link([\alpha]_{R_{\omega} }))} , as such transitions have led to Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [\alpha]_{R_\omega}} before and corresponded to seceded part of this state. States corresponding to these suffixes may be determined via traversal of suffix link path for Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle [\omega]_{R_\omega}} .[19]

Construction of the suffix automaton for the word abbcbc 
∅ → a
Single letter SA.svg Single letter SA.svg
After first character is appended, only one state is created in suffix automaton. Similarly, only one leaf is added to the suffix tree.
a → ab
Ab SA.svg Ba ST.svg
New transitions are drawn from all previous final states as b didn't appear before. For the same reason another leaf is added to the root of the suffix tree.
ab → abb
Abb SA.svg Bba ST.svg
The state 2 recognizes words ab and b, but only b is the new suffix, therefore this word is separated into state 4. In the suffix tree it corresponds to the split of the edge leading to the vertex 2.
abb → abbc
Abbc SA.svg Cbba ST.svg
Character c occurs for the first time, so transitions are drawn from all previous final states. Suffix tree of reverse string has another leaf added to the root.
abbc → abbcb
Abbcb SA.svg Bcbba ST.svg
State 4 consists of the only word b, which is suffix, thus the state is not split. Correspondingly, new leaf is hanged on the vertex 4 in the suffix tree.
abbcb → abbcbc
Suffix Automaton extension.svg Suffix Tree extension.svg
The state 5 recognizes words abbc, bbc, bc and c, but only last two are suffixes of new word, so they're separated into new state 8. Correspondingly, edge leading to the vertex 5 is split and vertex 8 is put in the middle of the edge.

Construction algorithm

Theoretical results above lead to the following algorithm that takes character Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle x} and rebuilds the suffix automaton of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \omega} into the suffix automaton of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \omega x} :[19]

  1. The state corresponding to the word Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \omega} is kept as Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle last} ;
  2. After Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle x} is appended, previous value of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle last} is stored in the variable Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle p} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle last} itself is reassigned to the new state corresponding to ;
  3. States corresponding to suffixes of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \omega} are updated with transitions to Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle last} . To do this one should go through Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle p, link(p), link^2(p),\dots} , until there is a state that already has a transition by Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle x} ;
  4. Once the aforementioned loop is over, there are 3 cases:
    1. If none of states on the suffix path had a transition by Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle x} , then Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle x} never occurred in Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \omega} before and the suffix link from Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle last} should lead to Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle q_0} ;
    2. If the transition by Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle x} is found and leads from the state Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle p} to the state Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle q} , such that Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle len(p)+1=len(q)} , then does not have to be split and it is a suffix link of ;
    3. If the transition is found but , then words from having length at most should be segregated into new "clone" state ;
  5. If the previous step was concluded with the creation of , transitions from it and its suffix link should copy those of , at the same time is assigned to be common suffix link of both and ;
  6. Transitions that have led to before but corresponded to words of the length at most are redirected to . To do this, one continues going through the suffix path of until the state is found such that transition by from it doesn't lead to .

The whole procedure is described by the following pseudo-code:[19]

function add_letter(x):
    define p = last
    assign last = new_state()
    assign len(last) = len(p) + 1
    while δ(p, x) is undefined:
        assign δ(p, x) = last, p = link(p)
    define q = δ(p, x)
    if q = last:
        assign link(last) = q0
    else if len(q) = len(p) + 1:
        assign link(last) = q
    else:
        define cl = new_state()
        assign len(cl) = len(p) + 1
        assign δ(cl) = δ(q), link(cl) = link(q)
        assign link(last) = link(q) = cl
        while δ(p, x) = q:
            assign δ(p, x) = cl, p = link(p)

Here is the initial state of the automaton and is a function creating new state for it. It is assumed , , and are stored as global variables.[19]

Complexity

Complexity of the algorithm may vary depending on the underlying structure used to store transitions of the automaton. It may be implemented in with memory overhead or in with memory overhead if one assumes that memory allocation is done in . To obtain such complexity, one has to use the methods of amortized analysis. The value of strictly reduces with each iteration of the cycle while it may only increase by as much as one after the first iteration of the cycle on the next add_letter call. Overall value of never exceeds and it is only increased by one between iterations of appending new letters that suggest total complexity is at most linear as well. The linearity of the second cycle is shown in a similar way.[19]

Generalizations

The suffix automaton is closely related to other suffix structures and substring indices. Given a suffix automaton of a specific string one may construct its suffix tree via compacting and recursive traversal in linear time.[20] Similar transforms are possible in both directions to switch between the suffix automaton of and the suffix tree of reversed string .[18] Other than this several generalizations were developed to construct an automaton for the set of strings given by trie,[8] compacted suffix automation (CDAWG),[7] to maintain the structure of the automaton on the sliding window,[21] and to construct it in a bidirectional way, supporting the insertion of a characters to both the beginning and the end of the string.[22]

Compacted suffix automaton

As was already mentioned above, a compacted suffix automaton is obtained via both compaction of a regular suffix automaton (by removing states which are non-final and have exactly one out-going arc) and the minimization of a suffix tree. Similarly to the regular suffix automaton, states of compacted suffix automaton may be defined in explicit manner. A two-way extension of a word is the longest word , such that every occurrence of in is preceded by Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \beta} and succeeded by Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha} . In terms of left and right extensions it means that two-way extension is the left extension of the right extension or, which is equivalent, the right extension of the left extension, that is Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\textstyle \overset{\scriptstyle\longleftrightarrow}{\gamma} = \overset{\scriptstyle\leftarrow}{\overset{\rightarrow}{\gamma}} = \overset{\rightarrow}{\overset{\scriptstyle\leftarrow}{\gamma}}} . In terms of two-way extensions compacted automaton is defined as follows:[15]

Theorem — Compacted suffix automaton of the word Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle S} is defined by a pair Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (V, E)} , where:

  • Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle V = \{\overleftrightarrow \omega : \omega \in \Sigma^*\}} is a set of automaton states;
  • Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle E = \{(\overleftrightarrow \omega, x \alpha, \overleftrightarrow {\omega x}) : x \in \Sigma, \alpha \in \Sigma^*, \overleftrightarrow{\omega x} = \overleftrightarrow{\omega} x\alpha\}} is a set of automaton transitions.

Two-way extensions induce an equivalence relation Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\textstyle \overset{\scriptstyle\longleftrightarrow}{\alpha} = \overset{\scriptstyle\longleftrightarrow}{\beta}} which defines the set of words recognized by the same state of compacted automaton. This equivalence relation is a transitive closure of the relation defined by Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\textstyle (\overset{\scriptstyle{\rightarrow}}{\alpha\,}=\overset{\scriptstyle{\rightarrow}}{\beta\,}) \vee (\overset{\scriptstyle{\leftarrow}}{\alpha} = \overset{\scriptstyle{\leftarrow}}{\beta})} , which highlights the fact that a compacted automaton may be obtained by both gluing suffix tree vertices equivalent via Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \overset{\scriptstyle{\leftarrow}}{\alpha} = \overset{\scriptstyle{\leftarrow}}{\beta}} relation (minimization of the suffix tree) and gluing suffix automaton states equivalent via Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \overset{\scriptstyle{\rightarrow}}{\alpha\,}=\overset{\scriptstyle{\rightarrow}}{\beta\,}} relation (compaction of suffix automaton).[23] If words Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \beta} have same right extensions, and words Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \beta} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \gamma} have same left extensions, then cumulatively all strings Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha} , Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \beta} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \gamma} have same two-way extensions. At the same time it may happen that neither left nor right extensions of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \gamma} coincide. As an example one may take Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle S=\beta=ab} , Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \alpha=a} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \gamma=b} , for which left and right extensions are as follows: Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \overset{\scriptstyle{\rightarrow}}{\alpha\,}=\overset{\scriptstyle{\rightarrow}}{\beta\,}=ab=\overset{\scriptstyle{\leftarrow}}{\beta} = \overset{\scriptstyle{\leftarrow}}{\gamma}} , but Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \overset{\scriptstyle{\rightarrow}}{\gamma\,}=b} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \overset{\scriptstyle{\leftarrow}}{\alpha}=a} . That being said, while equivalence relations of one-way extensions were formed by some continuous chain of nested prefixes or suffixes, bidirectional extensions equivalence relations are more complex and the only thing one may conclude for sure is that strings with the same two-way extension are substrings of the longest string having the same two-way extension, but it may even happen that they don't have any non-empty substring in common. The total number of equivalence classes for this relation does not exceed Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle n+1} which implies that compacted suffix automaton of the string having length Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle n} has at most Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle n+1} states. The amount of transitions in such automaton is at most Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle 2n-2} .[15]

Suffix automaton of several strings

Consider a set of words Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle T=\{S_1, S_2, \dots, S_k\}} . It is possible to construct a generalization of suffix automaton that would recognize the language formed up by suffixes of all words from the set. Constraints for the number of states and transitions in such automaton would stay the same as for a single-word automaton if you put Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle n = |S_1| + |S_2| + \dots + |S_k|} .[23] The algorithm is similar to the construction of single-word automaton except instead of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle last} state, function add_letter would work with the state corresponding to the word Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \omega_i} assuming the transition from the set of words Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \{\omega_1, \dots, \omega_i, \dots, \omega_k\}} to the set Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \{\omega_1, \dots, \omega_i x, \dots, \omega_k\}} .[24][25]

This idea is further generalized to the case when Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle T} is not given explicitly but instead is given by a prefix tree with Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle Q} vertices. Mohri et al. showed such an automaton would have at most Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle 2Q-2} and may be constructed in linear time from its size. At the same time, the number of transitions in such automaton may reach Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle O(Q|\Sigma|)} , for example for the set of words Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle T=\{\sigma_1, a\sigma_1, a^2\sigma_1, \dots, a^n \sigma_1, a^n \sigma_2, \dots, a^n \sigma_k \}} over the alphabet Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \Sigma=\{a,\sigma_1,\dots,\sigma_k\}} the total length of words is equal to Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\textstyle O(n^2+nk)} , the number of vertices in corresponding suffix trie is equal to Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle O(n+k)} and corresponding suffix automaton is formed of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle O(n+k)} states and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle O(nk)} transitions. Algorithm suggested by Mohri mainly repeats the generic algorithm for building automaton of several strings but instead of growing words one by one, it traverses the trie in a breadth-first search order and append new characters as it meet them in the traversal, which guarantees amortized linear complexity.[26]

Sliding window

Some compression algorithms, such as LZ77 and RLE may benefit from storing suffix automaton or similar structure not for the whole string but for only last Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle k} its characters while the string is updated. This is because compressing data is usually expressively large and using Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle O(n)} memory is undesirable. In 1985, Janet Blumer developed an algorithm to maintain a suffix automaton on a sliding window of size Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle k} in Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle O(nk)} worst-case and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle O(n \log k)} on average, assuming characters are distributed independently and uniformly. She also showed Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle O(nk)} complexity cannot be improved: if one considers words construed as a concatenation of several Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (ab)^m c (ab)^m d} words, where Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle k=6m+2} , then the number of states for the window of size Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle k} would frequently change with jumps of order Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle m} , which renders even theoretical improvement of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle O(nk)} for regular suffix automata impossible.[27]

The same should be true for the suffix tree because its vertices correspond to states of the suffix automaton of the reversed string but this problem may be resolved by not explicitly storing every vertex corresponding to the suffix of the whole string, thus only storing vertices with at least two out-going edges. A variation of McCreight's suffix tree construction algorithm for this task was suggested in 1989 by Edward Fiala and Daniel Greene;[28] several years later a similar result was obtained with the variation of Ukkonen's algorithm by Jesper Larsson.[29][30] The existence of such an algorithm, for compacted suffix automaton that absorbs some properties of both suffix trees and suffix automata, was an open question for a long time until it was discovered by Martin Senft and Tomasz Dvorak in 2008, that it is impossible if the alphabet's size is at least two.[31]

One way to overcome this obstacle is to allow window width to vary a bit while staying Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle O(k)} . It may be achieved by an approximate algorithm suggested by Inenaga et al. in 2004. The window for which suffix automaton is built in this algorithm is not guaranteed to be of length Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle k} but it is guaranteed to be at least Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle k} and at most Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle 2k+1} while providing linear overall complexity of the algorithm.[32]

Applications

Suffix automaton of the string may be used to solve such problems as:[33][34]

  • Counting the number of distinct substrings of in on-line,
  • Finding the longest substring of occurring at least twice in ,
  • Finding the longest common substring of and in ,
  • Counting the number of occurrences of in in ,
  • Finding all occurrences of in in , where is the number of occurrences.

It is assumed here that is given on the input after suffix automaton of is constructed.[33]

Suffix automata are also used in data compression,[35] music retrieval[36][37] and matching on genome sequences.[38]

References

Bibliography

  • Lua error in Module:Cite_Q at line 13: attempt to index a nil value.
  • Lua error in Module:Cite_Q at line 13: attempt to index a nil value.
  • Lua error in Module:Cite_Q at line 13: attempt to index a nil value.
  • Lua error in Module:Cite_Q at line 13: attempt to index a nil value.
  • Lua error in Module:Cite_Q at line 13: attempt to index a nil value.
  • Lua error in Module:Cite_Q at line 13: attempt to index a nil value.
  • Lua error in Module:Cite_Q at line 13: attempt to index a nil value.
  • Lua error in Module:Cite_Q at line 13: attempt to index a nil value.
  • Lua error in Module:Cite_Q at line 13: attempt to index a nil value.
  • Lua error in Module:Cite_Q at line 13: attempt to index a nil value.
  • Lua error in Module:Cite_Q at line 13: attempt to index a nil value.
  • Lua error in Module:Cite_Q at line 13: attempt to index a nil value.
  • Lua error in Module:Cite_Q at line 13: attempt to index a nil value.
  • Lua error in Module:Cite_Q at line 13: attempt to index a nil value.
  • Lua error in Module:Cite_Q at line 13: attempt to index a nil value.
  • Lua error in Module:Cite_Q at line 13: attempt to index a nil value.
  • Lua error in Module:Cite_Q at line 13: attempt to index a nil value.
  • Lua error in Module:Cite_Q at line 13: attempt to index a nil value.
  • Lua error in Module:Cite_Q at line 13: attempt to index a nil value.
  • Lua error in Module:Cite_Q at line 13: attempt to index a nil value.
  • Lua error in Module:Cite_Q at line 13: attempt to index a nil value.
  • Lua error in Module:Cite_Q at line 13: attempt to index a nil value.
  • Lua error in Module:Cite_Q at line 13: attempt to index a nil value.
  • Lua error in Module:Cite_Q at line 13: attempt to index a nil value.
  • Lua error in Module:Cite_Q at line 13: attempt to index a nil value.
  • Lua error in Module:Cite_Q at line 13: attempt to index a nil value.
  • Lua error in Module:Cite_Q at line 13: attempt to index a nil value.

External links