Basic Principles of Sequence Alignment and Analysis
Sequence alignment is crucial in any analysis of evolutionary relationships, and in extracting functional and even tertiary structure information from a protein amino acid sequence. Since evolutionary relationships assume that a certain percentage of the amino acid residues in a protein sequence are conserved, the simplest way to assess the relationship between two sequences is to count the number of identical and similar amino acids. This is done by sequence alignment. The number of identical or similar amino acid residues may then be compared to the total number of amino acids in the protein; the resulting number is called the percentage of sequence identity or sequence similarity, depending on whether we count the identical or the similar amino acids. By similar I mean amino acids with similar chemical characteristics, like the positively charged Lys and Arg, or the hydrophobic Leu and Val. Substitution of an amino acid by a chemical equivalent often has no dramatic consequences as far as the 3D structure or the protein function is concerned. For example, Leu and Val will be equally tolerated within a hydrophobic core, assuming that there is room for the slightly longer side chain of leucine. The same applies to Lys and Arg, which are usually located on the surface of proteins and primarily interact with solvent or with the acidic side chains of Glu or Asp, and to other amino acids of similar physicochemical characteristics.
However, to be able to count the number of identities and similarities, we first need to align the sequences against each other, and we also need some rules describing how this alignment should be done. The computer program that makes the sequence alignment, following a certain algorithm, will try to align the maximum number of identical or similar amino acid residues against each other.
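The counting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the similarity groups are a hypothetical simplification (real programs use substitution matrices), the toy sequences are invented, the percentage is taken over the alignment length, and similarity is counted as identical plus similar positions. Other conventions exist for both choices.

```python
# Hypothetical similarity groups, chosen only for illustration.
SIMILAR_GROUPS = [
    set("KR"),    # positively charged
    set("DE"),    # acidic
    set("ILVM"),  # aliphatic hydrophobic
    set("ST"),    # small, hydroxyl-containing
    set("FYW"),   # aromatic
    set("NQ"),    # amides
]


def are_similar(a, b):
    """Return True if both residues fall into the same (assumed) group."""
    return any(a in group and b in group for group in SIMILAR_GROUPS)


def percent_identity_and_similarity(aligned_a, aligned_b, gap="-"):
    """Return (% identity, % similarity) for two already aligned sequences.

    Assumptions: both strings have the same length, gaps are written as
    '-', and the denominator is the number of alignment columns.
    """
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    identical = similar = 0
    for a, b in zip(aligned_a, aligned_b):
        if gap in (a, b):
            continue  # a gapped column is neither identical nor similar
        if a == b:
            identical += 1
        elif are_similar(a, b):
            similar += 1
    columns = len(aligned_a)
    return 100 * identical / columns, 100 * (identical + similar) / columns


# Toy example (invented sequences): K/R, V/I and S/T are similar, L/L is identical.
print(percent_identity_and_similarity("KLVS", "RLIT"))  # (25.0, 100.0)
```

Note how the same pair of sequences is only 25% identical but 100% similar under these assumed groups, which is why the two percentages should always be reported separately.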
Nevertheless, one should be aware that an alignment generated by a computer program represents only one of many possibilities. One reason is that identical amino acids are easy to recognize and align against each other, while the alignment of similar amino acids is not that straightforward. For example, how should we score substitutions such as Val-Leu, Leu-Ile, Ser-Thr or Lys-Arg? Clearly, the score (or weight) we give to each of these substitutions will affect how the sequences are aligned.
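To make the effect of the weights concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the weight values are not taken from any published matrix, the sequences are invented, and the search simply slides a short peptide along a longer sequence without gaps rather than running a real alignment algorithm.

```python
# The four substitutions mentioned in the text, treated as "similar" pairs.
CONSERVATIVE_PAIRS = {
    frozenset("VL"), frozenset("LI"), frozenset("ST"), frozenset("KR"),
}


def identity_weight(a, b):
    """Scheme 1 (assumed): only identical residues score."""
    return 1 if a == b else 0


def similarity_weight(a, b):
    """Scheme 2 (assumed): identical = 2, similar = 1, anything else = -1."""
    if a == b:
        return 2
    if frozenset((a, b)) in CONSERVATIVE_PAIRS:
        return 1
    return -1


def best_ungapped_offset(target, query, weight):
    """Slide query along target (no gaps) and return (offset, score) of the best fit."""
    best = None
    for offset in range(len(target) - len(query) + 1):
        window = target[offset:offset + len(query)]
        total = sum(weight(a, b) for a, b in zip(query, window))
        if best is None or total > best[1]:
            best = (offset, total)
    return best


# Invented toy sequences.
target, query = "KGDRVT", "KLS"
print(best_ungapped_offset(target, query, identity_weight))    # (0, 1)
print(best_ungapped_offset(target, query, similarity_weight))  # (3, 3)
```

With identity-only weights the peptide is placed at the start, where a single Lys matches. Once similar residues are rewarded, the placement against RVT (Lys-Arg, Leu-Val, Ser-Thr) wins instead, so the choice of weights alone changes the alignment.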
In other words, we need some rules that allow us to assess the importance of different replacements, for example, when counting the percentage of sequence similarity. In addition, it is quite common that sequences, when compared to other members of a family, have some extra residues (insertions) or lack some residues (deletions). This can be seen, for example, when a group of bacterial sequences is compared against a group of eukaryotic sequences. Sometimes even larger segments or a whole domain may be inserted into or deleted from a protein. Depending on how we handle these insertions and deletions, different sequence alignments may be generated. The computer program that generates the alignment therefore needs some criteria to distinguish between the different possible alignments in order to choose the best one. To illustrate the concept, I show below an example of a simple alignment of a short stretch of two sequences. This was extracted from a sequence alignment generated by ClustalW on the EBI (European Bioinformatics Institute) server:

The amino acid residues that are identical in the two sequences are marked in the third row by their one-letter names (GCP and P), while the positions of those that differ are marked by x. One of the residues (a cysteine) in the second sequence does not seem to have a corresponding mate in the first. This position is marked by a dash. The percentage of identity for this sequence alignment is simply 4/12, or about 33%. The score of the alignment can then be assessed, for example, by a simple expression:
(Score) S = number of matches - number of mismatches = 4 - 8 = -4 (counting the gapped position among the 8 non-identical positions)
Everything looks fine, except that to maximize the number of matches we introduced a gap (marked by a dash in the first sequence). A gap in one of the sequences simply means that one or more amino acid residues have been deleted from that sequence or, equivalently, that there is an insertion in the other sequence. When introducing a gap, several questions arise: How many gaps can we introduce? How do we decide where to place them? How long can they be? Clearly, by introducing a large number of gaps here and there we could keep increasing the identity, but would that be biologically relevant? Intuitively one would think that something must be wrong with this approach, and a correct answer is crucial for a correct sequence alignment. For example, in homology modeling the correct placement of gaps is one of the factors that ensure the correctness of the model. A badly placed gap may result in a totally meaningless model. Normally, when we run sequence alignment software, we notice that the number of gaps is limited. Evidently, the program has some instructions on how to limit the number of gaps and where to place them. What are these instructions?
They are called gap penalties. Each time the program introduces a gap it incurs a penalty, which reduces the total score of the alignment. A gap is therefore worth introducing only if it raises the score by more than the penalty costs. In this simple way we can limit the number of gaps and increase their significance. The values of the gap penalties are parameters that can be changed for an alignment, thus controlling the number, length and position of the gaps. On the next page we will continue the discussion of how a sequence alignment is constructed.
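The trade-off between matches gained and penalties paid can be sketched as follows. This is a simplified illustration, not the scoring used by any particular program: the sequences are invented, the default values for match, mismatch and gap_penalty are assumptions, and the same penalty is charged for every gapped position (alignment programs commonly distinguish between opening a gap and extending it).

```python
def alignment_score(aligned_a, aligned_b, match=1, mismatch=-1, gap_penalty=2):
    """Score two already aligned sequences column by column.

    Assumptions: gaps are written as '-', every gapped column costs
    gap_penalty, and the match/mismatch values are arbitrary toy values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            score -= gap_penalty   # a gap reduces the total score
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


# Two candidate alignments of the same invented sequences.
ungapped = ("ACDEFGH", "ACEFGHK")    # 2 matches, 5 mismatches
gapped = ("ACDEFGH-", "AC-EFGHK")    # 6 matches, 2 gapped positions

print(alignment_score(*ungapped))               # -3
print(alignment_score(*gapped, gap_penalty=2))  # 2
print(alignment_score(*gapped, gap_penalty=5))  # -4
```

With a penalty of 2 the gapped alignment scores higher than the ungapped one (2 versus -3), so the gaps pay for themselves. With a penalty of 5 it scores lower (-4 versus -3), and a program using these values would prefer the ungapped alignment.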
