Introduction to Protein Sequence Alignment and Analysis
Amino acid sequence alignment and analysis is a central component of most biochemical and molecular biology applications. Sequence analysis addresses questions such as: How can we reveal and understand an observed conservation pattern? What is a consensus sequence in a protein family, and how is it related to function? How can we locate important functional residues? What are the relationships between sequence and 3D structure, and what kind of 3D structural information can we extract from the amino acid sequence? There are many more. This chapter provides an overview of the basic concepts in protein amino acid sequence analysis. It also provides two guided examples to help you make simple sequence alignments using resources available on the Internet. Since this site is focused on structural bioinformatics, the sequence alignments will be interpreted in terms of structure. We will discuss what structural information may be found in a sequence alignment and how to use available structural information to make a better sequence alignment.
When putting together a set of sequences for an alignment, we assume that they are evolutionarily related. Since an evolutionary relationship implies that a certain number of the amino acid residues within a protein family are conserved, we need tools to assess the degree of conservation of the amino acid sequences. To assist us in this process, scoring schemes for sequence alignment have been developed. Here we will discuss the basic concepts behind them.
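To make the idea of "degree of conservation" concrete, here is a minimal Python sketch that measures conservation in an alignment that has already been made. The function names (`column_conservation`, `percent_identity`) and the toy sequences are illustrative assumptions, not part of any tool used later in this chapter, and real programs use more refined measures.

```python
from collections import Counter


def column_conservation(alignment):
    """For each alignment column, return the fraction of sequences that
    share the most common residue. Gaps ('-') never count as a match."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            scores.append(0.0)
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


def percent_identity(seq_a, seq_b):
    """Percentage of positions, ignoring any position with a gap,
    at which two aligned sequences hold the same residue."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Toy alignment (made-up sequences, for illustration only).
toy_alignment = [
    "MKV-LA",
    "MKI-LA",
    "MRVGLS",
]

if __name__ == "__main__":
    print(column_conservation(toy_alignment))
    print(percent_identity(toy_alignment[0], toy_alignment[1]))
```

Fully conserved columns score 1.0, while variable columns and columns dominated by gaps score lower. Simple identity counts treat every substitution as equally bad, which is why the scoring schemes discussed in this chapter go further.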
When making a sequence alignment, we need to understand the effect of amino acid substitutions, that is, cases where one amino acid is replaced by another in the sequence. This must be taken into account when calculating the alignment score. Some substitutions are conservative, i.e., they do not introduce any substantial disturbance in the protein structure. Other substitutions may have a dramatic effect on the structure and function of the protein, and for this reason they are rare. Structural information can help us understand the effect of amino acid substitutions. There are also specially calculated substitution matrices, which can be used to score an alignment and to help find the best one.
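The following Python sketch illustrates how substitution scores feed into an alignment score. The residue grouping and the score values (4 for identity, 1 for a substitution within the same class, -2 otherwise, -4 per gap) are hypothetical numbers chosen for illustration only; they are not taken from any published substitution matrix.

```python
# Crude physicochemical grouping of the 20 amino acids (an assumption
# made for this example; other groupings are equally defensible).
RESIDUE_CLASSES = {
    "hydrophobic": "AVLIM",
    "aromatic": "FWY",
    "positive": "KRH",
    "negative": "DE",
    "polar": "STNQ",
    "special": "CGP",
}
CLASS_OF = {
    residue: name
    for name, members in RESIDUE_CLASSES.items()
    for residue in members
}


def toy_substitution_score(a, b):
    """Hypothetical scores: identity 4, conservative substitution 1,
    non-conservative substitution -2."""
    if a == b:
        return 4
    # C, G and P have unique structural roles, so swaps among them
    # are not treated as conservative here.
    if CLASS_OF[a] == CLASS_OF[b] and CLASS_OF[a] != "special":
        return 1
    return -2


def score_alignment(seq_a, seq_b, gap_penalty=-4):
    """Sum the substitution scores over all columns of a pairwise
    alignment, charging gap_penalty for every column with a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            total += gap_penalty
        else:
            total += toy_substitution_score(a, b)
    return total


if __name__ == "__main__":
    print(score_alignment("LKDE", "IKDE"))  # conservative L -> I
    print(score_alignment("LKDE", "DKDE"))  # non-conservative L -> D
```

With these toy values, replacing leucine with isoleucine lowers the score less than replacing it with aspartate, which mirrors the distinction between conservative and non-conservative substitutions described above.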
There will also be two guided examples showing how to use the resources available at the Expasy server for sequence alignment, together with tools for analyzing the resulting alignments. In some cases the alignment is easy to make, while in others extra attention is needed, for example when we align multidomain proteins. Three-dimensional structural information may also be used to correct the sequence alignment. In the second example we will discuss the use of structural information, particularly protein secondary structure, in the alignment of a multidomain protein. The results obtained in these tutorials will be used later in the homology modeling tutorial, in the chapter dedicated to modeling. Although this chapter is rather basic and not very detailed, I hope you will find it useful and enjoy it.
And a few more words about the technique...
Although it should be possible to retrieve all the information we need directly from the protein sequence, looking at a sequence without prior knowledge and experience is like reading a text in a foreign language: we may recognize the letters, but we do not understand the meaning and cannot extract the information. Still, where proteins are concerned, we have learned to extract a substantial part of the information through detailed sequence analysis, for example by using multiple sequence alignment. In a multiple sequence alignment, a given sequence is compared to a group of evolutionarily related sequences from other organisms. The pleasant fact is that we can usually find a related protein in some other organism. When we say "related" we mean that the proteins belong to the same family, the members of which usually perform a similar function in different organisms. We know that the main characteristic features of a protein sequence and the protein tertiary structure are often conserved within a protein family. Furthermore, the degree of conservation of the tertiary structure is much higher than that of the protein sequence. And this is because STRUCTURE IS FUNCTION! The three-dimensional protein structure controls, for example, the interactions of the protein with its partners, its enzymatic activity, etc.
For sequence analysis we first need to make a multiple sequence alignment and ask some basic questions: Since two protein sequences can be aligned in many different ways, how do we score the different alignments to identify the best one? What types of changes in the sequences of related proteins should we expect, and how do we account for them when calculating the alignment score? We also know that during evolution some short (sometimes even long!) segments of a sequence may be added to a protein or deleted from it. In a sequence alignment, this is accounted for by introducing so-called "gaps". The questions to ask then are: How many gaps can we introduce into a sequence alignment? How do we optimize their positions along the alignment? What factors do we need to take into consideration when we place a gap? From available tertiary protein structures we have learned that the positions of insertions and deletions in the protein sequence are closely related to the secondary structure of the protein, and that they are usually found in regions outside secondary structure elements. Correct gap placement is crucial for obtaining a correct sequence alignment, for example in homology modeling. In other words, the combination of structural information with sequence analysis provides the most powerful way to analyze protein structure and function.
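One classic way to answer the questions about scoring alignments and placing gaps is dynamic programming. The Python sketch below implements Needleman-Wunsch global alignment for two sequences. It is not taken from this chapter and is not how the tools in the tutorials are implemented. It uses assumed parameter values (match 2, mismatch -1, gap -2) and a simple linear gap penalty, whereas practical programs typically use a substitution matrix and separate penalties for opening and extending a gap.

```python
def needleman_wunsch(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Global alignment of two sequences with a linear gap penalty.
    Returns (best_score, aligned_a, aligned_b)."""
    n, m = len(seq_a), len(seq_b)

    # score[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align the two residues
                score[i - 1][j] + gap,      # gap in seq_b
                score[i][j - 1] + gap,      # gap in seq_a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                aligned_a.append(seq_a[i - 1])
                aligned_b.append(seq_b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(seq_b[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


if __name__ == "__main__":
    best, top, bottom = needleman_wunsch("ACDEFG", "ACDFG")
    print(best)
    print(top)
    print(bottom)
```

In this toy case the second sequence lacks one residue, and the algorithm places a single gap opposite it. Note that the procedure knows nothing about secondary structure, so an optimal score does not guarantee that gaps fall in structurally sensible positions, which is where structural information helps.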
Alignment
Substitution matrices
Tutorial 1 - Retrieve a sequence from the Expasy server, make a simple alignment
Tutorial 2 - Alignment and secondary structure prediction