COMPARING COMPLEXITY FUNCTIONS OF A LANGUAGE AND ITS EXTENDABLE PART



Introduction
The combinatorial complexity of a language (simply complexity throughout the paper) is a function defined for an arbitrary language $L$ over a finite alphabet $\Sigma$ by the rule $C_L(n) = |L \cap \Sigma^n|$. This is the most natural counting function associated with the language. The complexity has been intensively studied for many languages and particular classes of languages. Probably the first results in this direction were obtained by Morse and Hedlund [10]. A systematic study of combinatorial complexity was initiated by Ehrenfeucht and Rozenberg in [4]; they focused mostly on an important but narrow class of D0L-languages. A representative selection of results on complexity can be found in Section 9 of [1]. Besides this, some general results on the complexity of arbitrary regular (rational) languages can be found in [6,11]; the sets of possible polynomial complexity functions coincide for regular and context-free languages, and explicit formulas for such complexity functions can be effectively constructed [3].
On the other hand, there exist only a few results on "general" complexity properties, which can be applied to arbitrary languages, or at least to wide classes

Preliminaries
Recall some notions on words, languages, automata, complexity, and graphs. We consider a finite alphabet $\Sigma$ and finite words over it. The length of the word $W$ is denoted by $|W|$. A word $U$ is a factor of the word $W$ if $W$ can be written as $PUQ$ for some (possibly empty) words $P$ and $Q$. The reversal of a word is obtained by writing its letters in the reversed order. We write $\Sigma^n$ ($\Sigma^{\le n}$) for the set of all words of length $n$ (resp. of length at most $n$) over $\Sigma$. As usual, $\Sigma^*$ denotes the set of all words over $\Sigma$. The subsets of $\Sigma^*$ are called languages. A language is factorial if it is closed under taking factors of its words. The reversal of a language consists of the reversals of all its elements.
As usual, we call a complexity function polynomial if it is $O(n^p)$ for some $p \ge 0$ (bounded from above by a polynomial of degree $p$), and exponential if its fastest growing infinite subsequence is $\Omega(\alpha^n)$ for some $\alpha > 1$ (bounded from below by an exponential function with base $\alpha$). A complexity function which is superpolynomial and subexponential is called intermediate. We write $\Theta(n^p)$ for a function which is bounded from above and from below by two polynomials of degree $p$.
The complexity of a language can be coarsely described by the growth rate $\alpha(L) = \limsup_{n\to\infty} (C_L(n))^{1/n}$. For factorial languages, the following theorem holds.
Theorem 1.1 [7]. For an arbitrary factorial language $L$, […] $\alpha(L) > 1$ if $L$ has exponential complexity, and $\alpha(L) = 1$ otherwise.
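As a small illustration of these definitions, the following sketch counts $C_L(n)$ by brute force and computes the roots $C_L(n)^{1/n}$ that approximate the growth rate. The example language (binary words avoiding the factor bb) and all helper names are assumptions chosen for illustration; they do not come from the paper.

```python
from itertools import product


def complexity(in_language, alphabet, n):
    """Brute-force C_L(n): count the words of length n accepted by the predicate."""
    return sum(1 for w in product(alphabet, repeat=n) if in_language("".join(w)))


def avoids_bb(word):
    """Assumed example of a factorial language: binary words without the factor bb."""
    return "bb" not in word


# C_L(n) for n = 1..12 and the corresponding n-th roots.
counts = [complexity(avoids_bb, "ab", n) for n in range(1, 13)]
roots = [c ** (1.0 / n) for n, c in enumerate(counts, start=1)]

if __name__ == "__main__":
    for n, (c, r) in enumerate(zip(counts, roots), start=1):
        print(n, c, round(r, 4))
```

For this language the counts are Fibonacci numbers, so the roots slowly approach the golden ratio; brute force is only practical for small $n$.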
We consider deterministic finite automata (dfa's) with a partial transition function, and identify a dfa with a digraph, which contains states as vertices and transitions as directed labeled edges. A dfa is consistent if each of its vertices is contained in some accepting path.
The adjacency matrix of a digraph is nonnegative, whence it has a nonnegative real eigenvalue of maximum absolute value, called the Frobenius root. This number is usually referred to as the index of the digraph [2]. We denote the index of a dfa $\mathcal{A}$ by $r(\mathcal{A})$.
A strongly connected component (scc) of a digraph $G$ is a subgraph $G'$, maximal with respect to inclusion, such that there exists a (directed) path from any vertex of $G'$ to any other vertex of $G'$. A well-known result (see [2]) states that the index of a digraph equals the maximum of the indices of its scc's. The scc's of index 0 (singletons) and of index 1 (simple cycles) are called trivial. The index of any nontrivial scc is strictly greater than 1.
The connection between growth rates of regular languages and Frobenius roots of some matrices is well known. In the most general form, this connection is expressed in the following theorem.
Theorem 1.2. Let a language $L$ be recognized by a consistent dfa $\mathcal{A}$. Then the growth rate of $L$ coincides with the index of $\mathcal{A}$.
As far as we know, this theorem has not yet been published in this form (a restricted version is proved, for example, in [7]). Since we need such a general form to prove further results, we give the proof here.
Proof. Let $A = (a_{ij})$ be the adjacency matrix of $\mathcal{A}$, $m$ be its size, and $|A|$ be the sum of all elements of $A$. For any $n$, consider the matrix $A^n = (a^{(n)}_{ij})$. One of the properties of the Frobenius root (see [5]) is the equality
$$\lim_{n\to\infty} |A^n|^{1/n} = r(\mathcal{A}).$$
Note that $a^{(n)}_{ij}$ is the number of paths of length $n$ in $\mathcal{A}$ from the state $q_i$ to $q_j$. Hence, $|A^n|$ is the total number of paths of length $n$ in $\mathcal{A}$, and $P_i(n) = \sum_{j=1}^{m} a^{(n)}_{ij}$ is the number of paths of length $n$ in $\mathcal{A}$ starting at $q_i$. Suppose that the vertex $q_1$ is initial, and denote $R_j(n) = a^{(n)}_{1j}$; we call the functions $R_j$ reading functions. Then the complexity $C_L(n)$ equals the sum of the reading functions $R_j$ over the set of terminal states. Thus, the maximum growth rate of the functions $R_j$ over the set of terminal states is $\alpha(L)$.
Suppose that $\mathcal{A}$ contains an edge $(q_i, q_j)$. Then $R_j(n+1) \ge R_i(n)$, so the growth rate of the function $R_j$ is greater than or equal to that of $R_i$. Therefore, the growth rates of the reading functions can only increase along a path in the automaton. Since $\mathcal{A}$ is consistent, for every state there exists a path from it to some terminal state. Thus, the overall maximum of the growth rates of the reading functions is achieved on a terminal state. We obtain that the function $P_1(n) = \sum_{j=1}^{m} R_j(n)$ has the growth rate $\alpha(L)$.
There exists a path from the initial vertex to any vertex $q_i$, because $\mathcal{A}$ is consistent. Dually to the above argument on reading functions, we conclude that $P_1$ has at least the same growth rate as $P_i$. Since $|A^n| = \sum_{i=1}^{m} P_i(n)$, the growth rate of $|A^n|$ is equal to the maximum of the growth rates of the $P_i$, that is, to the growth rate of $P_1$. This gives us the required equality $r(\mathcal{A}) = \alpha(L)$.
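The path-counting argument of the proof can be checked numerically. The sketch below computes the reading functions $R_j(n)$ by dynamic programming, which amounts to taking the row of $A^n$ that corresponds to the initial state, and sums them over the terminal states to get $C_L(n)$. The two-state dfa used here (binary words without the factor bb, all states terminal) and the function names are assumptions made for illustration.

```python
def reading_functions(transitions, initial, n):
    """Return {state j: number of paths of length n from `initial` to j}."""
    counts = {initial: 1}
    for _ in range(n):
        nxt = {}
        for state, c in counts.items():
            # Partial transition function: a missing letter means no edge.
            for target in transitions.get(state, {}).values():
                nxt[target] = nxt.get(target, 0) + c
        counts = nxt
    return counts


def complexity(transitions, initial, terminal, n):
    """C_L(n) as the sum of the reading functions over the terminal states."""
    r = reading_functions(transitions, initial, n)
    return sum(r.get(q, 0) for q in terminal)


# Assumed example: consistent dfa for binary words avoiding bb.
# Its adjacency matrix is [[1, 1], [1, 0]], whose Frobenius root is the golden ratio.
GOLDEN = {0: {"a": 0, "b": 1}, 1: {"a": 0}}
values = [complexity(GOLDEN, 0, {0, 1}, n) for n in range(1, 31)]
ratio = values[-1] / values[-2]          # converges to the index quickly
root = values[-1] ** (1.0 / len(values))  # converges to the index slowly

if __name__ == "__main__":
    print(values[:8], ratio, root)
```

Both `ratio` and `root` tend to the index of the automaton, in line with Theorem 1.2; the ratio converges much faster but need not exist for every dfa, which is why the growth rate is defined through roots.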

Extendable parts of a language
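For a language $L$ over $\Sigma$ we consider three subsets of extendable words: the right extendable words $re(L)$, the left extendable words $le(L)$, and the two-sided extendable words $e(L)$. Obviously, $re(L) \cap le(L) \supseteq e(L)$. Actually, this inclusion is often strict; the following example involves well-known combinatorial objects.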
Example 2.1. Recall that a word is overlap-free if it contains no factors of the form $XXc$, where $c$ is the first letter of the word $X$. Let $OF \subset \{a, b\}^*$ denote the language of all binary overlap-free words. The infinite Thue-Morse word $T = abba\,baab\,baab\,abba\,b\ldots$ over the same alphabet is a fixed point of the morphism $\varphi$ defined by $\varphi(a) = ab$, $\varphi(b) = ba$. We write $\tilde{T}$ for the reversal of $T$. It is well known that the word $T$ is overlap-free; hence, so is $\tilde{T}$.
From the definition of the Thue-Morse word it is easy to see that $T$ (and also $\tilde{T}$) contains no factor $bbabb$. Then, the infinite words $bbabbaT$ and $\tilde{T}abbabb$ are overlap-free, and $bbabb \in re(OF) \cap le(OF)$. On the other hand, any word $PbbabbQ$ with nonempty $P, Q$ contains either $b^3$ or $abbabba$, whence it is not overlap-free. Thus, $bbabb \notin e(OF)$.
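The finite part of Example 2.1 can be checked mechanically. The sketch below tests overlap-freeness through periods (a word contains an overlap iff some factor of length $2p+1$ has period $p$) and builds prefixes of the Thue-Morse word by iterating the morphism. The function names are hypothetical, and only finite prefixes of the infinite words $bbabbaT$ and $\tilde{T}abbabb$ are examined, so this is a sanity check rather than a proof.

```python
def has_overlap(word):
    """True iff `word` contains a factor XXc with c the first letter of X."""
    n = len(word)
    for p in range(1, n // 2 + 1):
        run = 0  # number of consecutive positions j with word[j] == word[j + p]
        for j in range(n - p):
            run = run + 1 if word[j] == word[j + p] else 0
            if run > p:  # p + 1 matches in a row: a factor of length 2p + 1 with period p
                return True
    return False


def thue_morse_prefix(length):
    """Prefix of the fixed point of a -> ab, b -> ba that starts with a."""
    word = "a"
    while len(word) < length:
        word = "".join("ab" if c == "a" else "ba" for c in word)
    return word[:length]


if __name__ == "__main__":
    t = thue_morse_prefix(16)
    print(has_overlap("bbabba" + t))        # right extension of bbabb
    print(has_overlap(t[::-1] + "abbabb"))  # left extension of bbabb
    print([has_overlap(p + "bbabb" + q) for p in "ab" for q in "ab"])
```

The last line examines every extension of $bbabb$ by one letter on each side; all four contain an overlap, which is the reason why $bbabb$ is not two-sided extendable.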
The following observation is simple but very useful.

Comparing growth rates
In this section we study how the growth rate of a language relates to the growth rates of its extendable subsets. The answers are quite different in the case of factorial languages and in the case of arbitrary ones.
Proof. By Observation 2.2, it is sufficient to prove the statement for $re(L)$, because the result for $le(L)$ can be obtained in the same way, considering the reversal of $L$ instead of $L$.
Consider a consistent dfa $\mathcal{A}$ recognizing $L$. Then $\alpha(L) = r(\mathcal{A})$ by Theorem 1.2. Since $r(\mathcal{A}) > 1$, the automaton contains a nontrivial scc; thus, removing trivial scc's from it does not affect the index.
Since $L$ is factorial and $\mathcal{A}$ is consistent, we see that all vertices of $\mathcal{A}$ are terminal. Let us partition these vertices into two groups, $Q_1$ and $Q_2$. A vertex $q$ belongs to $Q_1$ iff $\mathcal{A}$ contains a cycle which is attainable from $q$. It is easy to see that a word $W \in L$ belongs to $re(L)$ iff the reading of $W$ by the automaton $\mathcal{A}$ terminates in some vertex of $Q_1$. Now we remove all vertices of $Q_2$ to obtain a consistent dfa $\mathcal{A}'$ recognizing $re(L)$. Note that each vertex of $Q_2$ forms a singleton scc in $\mathcal{A}$. Hence, $r(\mathcal{A}') = r(\mathcal{A})$, and $\alpha(L) = \alpha(re(L))$, as desired.
Now turn to the general case. A factorial language $L$ over $\Sigma$ possesses an antidictionary $M$, which is the set of minimal forbidden words, defined by the formula
$$M = \{W \in \Sigma^* \setminus L \mid \text{every proper factor of } W \text{ belongs to } L\}.$$
It is easy to see that $L = \Sigma^* \setminus \Sigma^* M \Sigma^*$, whence $L$ is regular iff $M$ is. In particular, we may assume for the rest of the proof that $M$ is infinite. Consider the sequence $\{M_i\}$ of finite antidictionaries, where $M_i = M \cap \Sigma^{\le i}$. Let $L_i$ be the factorial language over $\Sigma$ with the antidictionary $M_i$. One has […] (1) and […] (2). Now repeat this argument for right extendable parts of the languages $L_i$. From (1) and (2) we have […] and finally […]. Arguing as above, we obtain that $\{\alpha(re(L_i))\}$ converges to $\alpha(re(L))$. Now it remains to note that each language $L_i$ is regular, whence $\alpha(re(L_i)) = \alpha(L_i)$. The result now follows.
Remark 3.2. If we consider arbitrary languages instead of factorial ones, the statement of Theorem 3.1 fails even for regular languages. The dfa in Figure 1 recognizes the language $a^* + a^*b(a+ba)^*bb$, which has exponential complexity but only one right extendable word of each length. Indeed, the two "middle" vertices of the automaton constitute a nontrivial scc, whose index is equal to the golden ratio. At the same time, only the words from $a^*$ are right extendable in this language.
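The counts in Remark 3.2 can be reproduced with a short script. The sketch below encodes one dfa for $a^* + a^*b(a+ba)^*bb$ (the state names q0..q3 are hypothetical labels and the table may differ cosmetically from the automaton drawn in Figure 1), finds the states from which a cycle is attainable, as in the proof of Theorem 3.1, and counts accepted words ending in such states. Treating "accepted and ending in a state that sees a cycle" as the test for right extendability is an assumption that relies on the dfa being consistent.

```python
# Assumed transition table for a* + a*b(a+ba)*bb; q0 is initial.
REMARK_DFA = {
    "q0": {"a": "q0", "b": "q1"},
    "q1": {"a": "q1", "b": "q2"},
    "q2": {"a": "q1", "b": "q3"},
    "q3": {},
}
INITIAL, TERMINAL = "q0", {"q0", "q3"}


def reachable(transitions, start):
    """States reachable from `start` by a path of length at least 1."""
    seen, stack = set(), list(transitions[start].values())
    while stack:
        q = stack.pop()
        if q not in seen:
            seen.add(q)
            stack.extend(transitions[q].values())
    return seen


def states_seeing_a_cycle(transitions):
    """The group Q_1: states from which some cycle is attainable."""
    on_cycle = {q for q in transitions if q in reachable(transitions, q)}
    return {q for q in transitions
            if q in on_cycle or reachable(transitions, q) & on_cycle}


def count_words(transitions, initial, targets, n):
    """Number of paths of length n from `initial` ending in `targets`."""
    counts = {initial: 1}
    for _ in range(n):
        nxt = {}
        for state, c in counts.items():
            for target in transitions[state].values():
                nxt[target] = nxt.get(target, 0) + c
        counts = nxt
    return sum(counts.get(q, 0) for q in targets)


Q1 = states_seeing_a_cycle(REMARK_DFA)
all_words = [count_words(REMARK_DFA, INITIAL, TERMINAL, n) for n in range(1, 11)]
extendable = [count_words(REMARK_DFA, INITIAL, TERMINAL & Q1, n) for n in range(1, 11)]

if __name__ == "__main__":
    print(sorted(Q1), all_words, extendable)
```

The complexity grows like the Fibonacci numbers while the second list stays constant, which matches the gap described in the remark.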

Subexponential gaps
According to Theorem 3.1, the gaps between the complexities of $L$, $re(L)$ and $e(L)$ are always subexponential. From known results it follows that all these complexity functions can be polynomials of different degrees. For example, the already mentioned language $OF$ of binary overlap-free words has complexity $\Omega(n^{1.217})$ [9], while $C_{re(OF)}(n) = \Theta(n^{\alpha})$ with $\alpha \approx 1.155$ [8]. The language $e(OF)$ coincides with the set of all finite factors of the Thue-Morse word $T$, and $C_{e(OF)}(n) = \Theta(n)$ (folklore). We now show that such gaps can be much bigger. The language defined below has a superpolynomial gap.
Let $K \subset \{a, b\}^*$ be the language consisting of all words of the form $U = c_1^{t_1} \cdots c_m^{t_m}$ such that $m \in \mathbb{N}$, $c_i \in \{a, b\}$, $c_i \ne c_{i+1}$ for all $i$, and $t_1 < \ldots < t_{m-1}$. Thus, the powers of letters in $U$ are strictly increasing, with the last letter being the only possible exception. This exception is necessary to make $K$ factorial. We note that $K$ is very close to the family of languages of intermediate complexity introduced in [12]. The binary language of that family is defined in the same way as $K$, with the only difference that the inequalities for the $t$'s are not strict. So, we adapt some ideas of [12] to estimate the complexity of $K$.
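The definition of $K$ translates directly into a membership test on run lengths. The following sketch, with hypothetical function names, decides membership, counts $C_K(n)$ by brute force for small $n$, and checks on short words that every factor of a word of $K$ is again in $K$. The empty word is left out, which is an assumption about how $m \in \mathbb{N}$ is meant.

```python
from itertools import groupby, product


def run_lengths(word):
    """Exponents t_1, ..., t_m of the maximal blocks of equal letters."""
    return [len(list(block)) for _, block in groupby(word)]


def in_K(word):
    """Membership in K: t_1 < ... < t_{m-1}, with no condition on t_m."""
    if not word or set(word) - {"a", "b"}:
        return False
    t = run_lengths(word)
    return all(t[i] < t[i + 1] for i in range(len(t) - 2))


def words_of_K(n):
    """All words of K of length n (brute force over 2^n candidates)."""
    return [w for w in ("".join(p) for p in product("ab", repeat=n)) if in_K(w)]


def factors(word):
    """All nonempty factors of `word`."""
    return {word[i:j] for i in range(len(word)) for j in range(i + 1, len(word) + 1)}


complexity_K = [len(words_of_K(n)) for n in range(1, 9)]
factorial_up_to_8 = all(in_K(f) for n in range(1, 9) for w in words_of_K(n) for f in factors(w))

if __name__ == "__main__":
    print(complexity_K, factorial_up_to_8)
```

For instance, `in_K("abba")` holds (exponents 1, 2, 1) while `in_K("aba")` fails, because the first two exponents are equal.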

It is clear that the length of a left extension of $U$ does not exceed the value […]. By Theorems 1.1 and 3.1, the language $K$ has subexponential complexity also. Thus, it remains to show that the complexity of $K$ is superpolynomial.
Let $K_m$ denote the subset of $K$ consisting of all words $U = c_1^{t_1} \cdots c_m^{t_m}$. We show that for $n$ large enough the function $C_{K_m}(n)$ is $\Theta(n^{m-1})$. Since $K = \bigcup_{m=1}^{\infty} K_m$, we then obtain that $C_K(n)$ is not bounded from above by any polynomial.
Note that $C_{K_m}(n)$ equals the number of positive solutions of the diophantine equation
$$(m-1)\xi_1 + (m-2)\xi_2 + \cdots + \xi_{m-1} + \xi_m = n \qquad (7)$$
multiplied by two. Indeed, taking $\xi_1 = t_1$, $\xi_i = t_i - t_{i-1}$ for $1 < i < m$, and $\xi_m = t_m$, we get the required correspondence between words of $K_m$ and positive solutions of (7) (the multiplication by two is due to the choice of the letter $c_1$). There are $(m-1)$ free variables in (7), and their values are bounded from above by $n$. Hence, the number of positive solutions of (7) is $O(n^{m-1})$. To get the lower bound on the number of these solutions, consider an auxiliary equation
$$\bar{\xi}_1 + \bar{\xi}_2 + \cdots + \bar{\xi}_m = n_1. \qquad (8)$$
There is a one-to-one correspondence between $(m-1)$-element subsets of the set $\{1, \ldots, n_1-1\}$ and sequences of partial sums $\bar{\xi}_1,\ \bar{\xi}_1 + \bar{\xi}_2,\ \ldots,\ \bar{\xi}_1 + \cdots + \bar{\xi}_{m-1}$. Each of these sequences uniquely determines a positive solution of (8), whence (8) has $\binom{n_1-1}{m-1}$ positive solutions. Now represent $n$ in the form $n = (m-1)! \cdot n_1 + n_2$, where $0 \le n_2 < (m-1)!$. Any positive solution $(\bar{\xi}_1, \ldots, \bar{\xi}_m)$ of (8) generates some positive solution $(\xi_1, \ldots, \xi_m)$ of (7) […]. Thus, the number of solutions of (7) is at least $\binom{n_1-1}{m-1}$. Since $n_1$ is obtained by dividing $n$ by a constant, we obtain that (7) has $\Omega(n^{m-1})$ positive solutions for any $n$ satisfying the condition $n_1 \ge m$. Thus, we get $C_{K_m}(n) = \Theta(n^{m-1})$ for all $n \ge m!$, whence the result.
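The counting argument can be tested on small parameters. The sketch below assumes that equation (7) reads $(m-1)\xi_1 + (m-2)\xi_2 + \cdots + \xi_{m-1} + \xi_m = n$, which is what the substitution $\xi_1 = t_1$, $\xi_i = t_i - t_{i-1}$, $\xi_m = t_m$ yields from the definition of $K_m$. The map `lift` from solutions of (8) to solutions of (7) is a hypothetical choice that makes the lower bound work; the source does not preserve the map actually used. All function names are illustrative.

```python
from itertools import product
from math import comb, factorial


def count_Km(n, m):
    """C_{K_m}(n): twice the number of exponent tuples with t_1 < ... < t_{m-1}."""
    tuples = [t for t in product(range(1, n + 1), repeat=m)
              if sum(t) == n and all(t[i] < t[i + 1] for i in range(m - 2))]
    return 2 * len(tuples)


def solutions_of_7(n, m):
    """Positive solutions of the assumed equation (7)."""
    coeffs = [m - i for i in range(1, m)] + [1]  # (m-1, m-2, ..., 1, 1)
    return [xi for xi in product(range(1, n + 1), repeat=m)
            if sum(c * x for c, x in zip(coeffs, xi)) == n]


def solutions_of_8(n1, m):
    """Positive solutions of the auxiliary equation (8)."""
    return [xi for xi in product(range(1, n1 + 1), repeat=m) if sum(xi) == n1]


def lift(xi_bar, n, m):
    """Hypothetical injective map from solutions of (8) to solutions of (7)."""
    f = factorial(m - 1)
    xi = [f // (m - i) * x for i, x in enumerate(xi_bar[:-1], start=1)]
    xi.append(f * xi_bar[-1] + n % f)  # absorb the remainder n_2 in the last variable
    return tuple(xi)


if __name__ == "__main__":
    m, n = 3, 12
    n1 = n // factorial(m - 1)
    print(count_Km(n, m), 2 * len(solutions_of_7(n, m)), comb(n1 - 1, m - 1))
```

Brute-force enumeration is exponential in $m$, so the check is meant for small values only; it confirms the bijection and the binomial lower bound on those values, not the asymptotic statement.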
We conclude the paper with a few notes on possible applications of the given results. The extendable parts of a language can have a clearer structure than the language itself. For example, the structure of the language $OF$ of binary overlap-free words is rather complicated, while the extendable words in this language are just the factors of a single infinite word $T$, which has a simple and regular form. Thus, if we want to estimate the complexity of some language, we may study its extendable part instead. Theorem 3.1 guarantees that we can find the growth rate of the target language in this way (and, in particular, decide whether the target language is exponential or subexponential). On the other hand, Theorem 4.1 shows that this method does not allow one to distinguish between different low (i.e. subexponential) complexities.

Figure 1. This dfa recognizes an exponential language with only a constant number of right extendable words. The bigger circle denotes the initial vertex; the terminal vertices are filled.