Remove duplicated sequences from an alignment

The omit_duplicated app removes redundant sequences from a sequence collection (aligned or unaligned).

Let’s create sample data with duplicated sequences.
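As a minimal sketch of what such data might look like, the plain-Python dict below holds five aligned sequences in which c and d are exact copies of each other. The names a, b and e and all sequence strings are illustrative assumptions, and the grouping step only shows which sequences count as duplicates; it is not the app itself.

```python
from collections import defaultdict

# Illustrative data (assumed): 'c' and 'd' are exact duplicates.
data = {
    "a": "ACGT",
    "b": "ACG-",  # differs from 'a' only by a gap
    "c": "ACGG",  # duplicate of 'd'
    "d": "ACGG",  # duplicate of 'c'
    "e": "AGTC",  # unique
}

# Group sequence names by their exact sequence string.
groups = defaultdict(list)
for name, seq in data.items():
    groups[seq].append(name)

# Any group with more than one name is a set of duplicates.
duplicated = [names for names in groups.values() if len(names) > 1]
print(duplicated)  # [['c', 'd']]
```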

Creating the omit_duplicated app with the argument choose="longest" retains, from each set of duplicated sequences, the one with the fewest gaps and ambiguous characters. In the sample data, only one of the duplicates c and d will be retained.
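The effect of keeping a single representative per duplicated set can be sketched in plain Python. This is an illustration of the behaviour described, not the library's implementation: the helper name keep_one_per_duplicate and the sample data are assumptions. Exact copies tie on gap and ambiguity counts, so this sketch simply keeps the first one it encounters.

```python
def keep_one_per_duplicate(data: dict[str, str]) -> dict[str, str]:
    """Keep the first sequence seen for each distinct sequence string."""
    seen = set()
    kept = {}
    for name, seq in data.items():
        if seq not in seen:
            seen.add(seq)
            kept[name] = seq
    return kept


# Illustrative data (assumed): 'c' and 'd' are exact duplicates.
data = {"a": "ACGT", "b": "ACG-", "c": "ACGG", "d": "ACGG", "e": "AGTC"}

kept = keep_one_per_duplicate(data)
print(list(kept))  # ['a', 'b', 'c', 'e']
```

Which of c or d the app itself returns is not stated here; the sketch only guarantees that exactly one of them survives.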

Creating the omit_duplicated app with the argument choose=None discards every member of a duplicated set, so only sequences that have no duplicate are retained.
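A plain-Python sketch of this stricter rule, again using assumed sample data rather than the app itself: every sequence that occurs more than once is dropped entirely.

```python
from collections import Counter

# Illustrative data (assumed): 'c' and 'd' are exact duplicates.
data = {"a": "ACGT", "b": "ACG-", "c": "ACGG", "d": "ACGG", "e": "AGTC"}

# Count how many names share each exact sequence string.
counts = Counter(data.values())

# Retain only sequences that occur exactly once.
unique_only = {name: seq for name, seq in data.items() if counts[seq] == 1}
print(list(unique_only))  # ['a', 'b', 'e']
```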

The mask_degen argument specifies how to treat matches between sequences with degenerate characters.

Let’s create sample data that has a DNA ambiguity code.
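A minimal sketch of such data as a plain dict, assuming three short aligned sequences where s2 carries the ambiguity code Y. The sequence strings and the name s3 are illustrative assumptions. The last step just reports which sequences contain non-canonical characters.

```python
# Illustrative data (assumed): 's2' contains the IUPAC ambiguity code 'Y'.
data = {
    "s1": "ATCG",
    "s2": "ATYG",
    "s3": "GGTA",
}

CANONICAL = set("ACGT")

# Map each sequence name to the non-canonical characters it contains, if any.
ambiguous = {
    name: sorted(set(seq) - CANONICAL)
    for name, seq in data.items()
    if set(seq) - CANONICAL
}
print(ambiguous)  # {'s2': ['Y']}
```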

Since “Y” denotes a pyrimidine, meaning the site can be either “C” or “T”, s1 matches s2 and one of them will be removed.
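The matching logic can be sketched in plain Python under stated assumptions: two aligned sequences are treated as matching when every position shares at least one possible base, and the sequence with more unambiguous bases is the one kept. The helper names, the partial IUPAC table and the sample data are all hypothetical, and the greedy loop is a simplification, not the library's algorithm.

```python
# Partial IUPAC table (assumed subset, enough for this illustration).
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"},            # purine
    "Y": {"C", "T"},            # pyrimidine
    "N": {"A", "C", "G", "T"},  # any base
}


def compatible(seq1: str, seq2: str) -> bool:
    """True if every aligned position shares at least one possible base."""
    return len(seq1) == len(seq2) and all(
        IUPAC[a] & IUPAC[b] for a, b in zip(seq1, seq2)
    )


def num_canonical(seq: str) -> int:
    """Count unambiguous bases."""
    return sum(ch in "ACGT" for ch in seq)


def drop_degenerate_matches(data: dict[str, str]) -> dict[str, str]:
    """Greedily keep one sequence per set of compatible sequences."""
    kept: dict[str, str] = {}
    for name, seq in data.items():
        match = next((k for k, s in kept.items() if compatible(seq, s)), None)
        if match is None:
            kept[name] = seq
        elif num_canonical(seq) > num_canonical(kept[match]):
            # Replace the earlier match with the less ambiguous sequence.
            del kept[match]
            kept[name] = seq
    return kept


data = {"s1": "ATCG", "s2": "ATYG", "s3": "GGTA"}
result = drop_degenerate_matches(data)
print(list(result))  # ['s1', 's3']
```

Here s2 is dropped because it matches s1 at every position and has fewer unambiguous bases. Compatibility is not transitive in general, so a greedy pass like this can depend on input order for larger data sets.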