We present a novel information theoretic algorithm for unsupervised segmentation of sequences into alternating Variable Memory Markov sources. The algorithm is based on competitive learning between Markov models, when implemented as Prediction Suffix Trees (Ron et al., 1996) using the MDL principle. By applying a model clustering procedure, based on rate distortion theory combined with deterministic annealing, we obtain a hierarchical segmentation of sequences
between alternating Markov sources. The algorithm seems to be self regulated and automatically avoids over segmentation. The method is applied successfully to unsupervised segmentation of multilingual texts into languages where it is able to infer correctly
both the number of languages and the language switching points. When applied to protein sequence families, we demonstrate the method‘s ability to identify biologically meaningful sub-sequences within the proteins, which correspond to important functional sub-units called domains.