Protein Databases: Short Overview
Where can I find a protein structure I am interested in? Which protein database should I use?
There are many protein and structural bioinformatics resources on the Internet. Some of them are general in character, while others are dedicated to specific aspects of protein structure, to specific protein families, to specific metabolic pathways, etc. Here I will discuss just a few general-purpose databases. For beginners, and for our purpose of creating a homology model, I find it more efficient to limit the number of databases in use, since otherwise things may become too confusing. At a later stage, when new questions arise, it will be easy to find the required resources.
Probably the first question when working with a protein structure is where to find the structure of interest. Another question, which many people ask once they get access to a protein structure file, is: what is actually inside that file? What information can be found there apart from the structure as such? And by the way, what is that structure? How do we get it into a file? Is it a collection of separate small pictures of various parts of the protein, which are put together by a computer program? Or is it something else?
The primary database for protein structure information is the Protein Data Bank (PDB), created in the early 1970s. Believe it or not, databases already existed at that time, even if it almost feels like the Stone Age. Only a few structures existed then, and the only experimental method available for protein structure determination was X-ray crystallography. The real structural revolution, as you may see from the figure below, started in the early 1990s:
Nowadays every newly determined protein structure has to be deposited with the Protein Data Bank before the scientific paper describing the structure can be published. Currently there are more than 80 000 structures in the PDB and the number is increasing very rapidly, as can be seen from the figure. However, one should remember that this number does not represent unique structures. In many cases there are many entries for the same protein in the database: some are variants with amino acid replacements, some are complexes with ligands (substrate analogues, inhibitors, co-factors), etc. This may be a source of confusion when you try to fetch a structure from the PDB: which one should you choose if there are many entries for the same protein? This will be discussed later in the chapter on homology modeling. For modeling it is important to choose a suitable structure of the best available quality. In other words, we need to learn to distinguish among the entries to be able to identify the most suitable structure for a homology modeling project.
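To make the idea of distinguishing among entries concrete, here is a small Python sketch. It is only an illustration under assumptions of mine: the selection rule (prefer entries without amino acid replacements, then the lowest resolution value) is a common rule of thumb and not the procedure of the homology modeling chapter, and the entry IDs and numbers are placeholders, not real PDB entries.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    pdb_id: str        # placeholder ID, not a real PDB entry
    resolution: float  # in angstroms; a lower value means finer detail
    is_variant: bool   # True if the entry carries amino acid replacements
    has_ligand: bool   # True if the entry is a complex with a ligand


def rank_entries(entries):
    """Sort entries so that non-variant entries come first, then by resolution.

    This ordering is an assumed rule of thumb for illustration only.
    """
    return sorted(entries, key=lambda e: (e.is_variant, e.resolution))


# Three made-up entries for the same hypothetical protein.
entries = [
    Entry("AAAA", 2.8, False, False),
    Entry("BBBB", 1.9, True, False),
    Entry("CCCC", 2.1, False, True),
]
best = rank_entries(entries)[0]
```

With these placeholder values the variant entry is ranked last despite having the lowest resolution value, which shows that a single number is not enough to pick a template.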
So, coming back to our initial questions: how do we fetch a structure, and what is inside the PDB file? First we need to check if there is a structure for the protein we are interested in. This part is easily done: all you need to do is go to the PDB and type the name of the protein you are looking for into the search window. For example, I typed in the name of a protein called magnesium chelatase. If you do the same you will get a few hits; however, in this case there is only one X-ray structure, for one of the subunits of magnesium chelatase. Some other proteins are listed there as well, and they are not magnesium chelatase. This is something I don't like about the Protein Data Bank. Sometimes PDBsum gives clearer results: when I type in the name of the same protein, I get a single hit (PDB ID 1g8p). Of course, you may refine your search using the options provided on the PDB page, which show up when you enter the name of the protein:
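Once the PDB ID is known, the file can also be fetched from a script instead of the browser. The following Python sketch is not part of the original walkthrough. The function names are made up for illustration, and it assumes the RCSB file server URL pattern `https://files.rcsb.org/download/<ID>.pdb` and the classic four-character ID format.

```python
import re
import urllib.request

# Assumed URL pattern of the RCSB PDB file download service.
RCSB_DOWNLOAD = "https://files.rcsb.org/download/{pdb_id}.pdb"


def pdb_download_url(pdb_id: str) -> str:
    """Build the download URL for a classic four-character PDB ID."""
    pdb_id = pdb_id.strip().upper()
    # Classic IDs are one digit followed by three letters or digits.
    if not re.fullmatch(r"[0-9][A-Z0-9]{3}", pdb_id):
        raise ValueError(f"not a four-character PDB ID: {pdb_id!r}")
    return RCSB_DOWNLOAD.format(pdb_id=pdb_id)


def fetch_pdb(pdb_id: str, timeout: float = 30.0) -> str:
    """Download the entry and return the file content as text (needs network)."""
    url = pdb_download_url(pdb_id)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_pdb("1g8p")` should return the text of the magnesium chelatase entry mentioned above, provided the server is reachable. A protein name such as "magnesium chelatase" is rejected by the ID check, because searching by name is a separate step from downloading by ID.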
Both PDB and PDBsum provide additional data on the entry, including links to other databases, where more information can be found. Here is an example from the PDBsum links page:
For our purposes we may, for example, be interested in the links to CATH and SCOP, which deal with the structural classification of proteins. The Protein Quaternary Structure (PQS) database is also of interest to us. However, when you click on this link the database will inform you that it has not been updated since 2009. I think the reason is that the information that could be found in PQS is now generated by the PISA server (Protein Interfaces, Surfaces and Assemblies).
Why is this information needed? PDB files usually contain the crystallographic unit or, as it is called in crystallography, the asymmetric unit. The biological unit in solution may often contain several subunits of the same protein, arranged as dimers, trimers or higher-order oligomers. In these oligomers the subunits are usually related by some kind of rotational symmetry: a two-fold rotation for dimers, a three-fold rotation for trimers, a four-fold rotation for tetramers, etc. When the molecules are crystallized, they are arranged in certain types of space lattices, within which all molecules are ordered and related to each other by symmetry operations (the symmetry groups listed in the International Tables for Crystallography). The symmetry axes present in the molecule in solution, which could be 2-, 3- or 4-fold, may become part of the crystallographic symmetry. In such cases one unit within, for example, a trimer becomes the asymmetric unit of the crystal. Crystallography operates with asymmetric units, since the other units are exactly the same and are related by the symmetry operations of the crystal. This is reflected in the content of the PDB files: in such cases they contain coordinates for the atoms of one subunit only, the asymmetric unit. For this reason the PISA server reconstructs the biological unit in cases when it is known to be different from the asymmetric unit, or when there are other indications that need to be taken into account.
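The geometric idea behind rebuilding an oligomer from one subunit can be sketched in a few lines of Python. This is a toy illustration only and not how PISA works internally: it assumes the rotation axis runs along z through the origin, it ignores the translations that real crystallographic symmetry operations may include, and the coordinates and function names are made up.

```python
import math


def rotate_about_z(point, angle_deg):
    """Rotate an (x, y, z) point about the z axis by angle_deg degrees."""
    x, y, z = point
    a = math.radians(angle_deg)
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a),
            z)


def build_oligomer(subunit, n_fold):
    """Return n_fold copies of a subunit related by an n-fold axis along z.

    subunit is a list of (x, y, z) atom positions; copy k is the
    subunit rotated by k * 360 / n_fold degrees (copy 0 is unrotated).
    """
    step = 360.0 / n_fold
    return [[rotate_about_z(atom, k * step) for atom in subunit]
            for k in range(n_fold)]


# Toy "subunit" of two atoms (made-up coordinates, in angstroms).
subunit = [(10.0, 0.0, 0.0), (12.0, 1.0, 3.0)]
trimer = build_oligomer(subunit, 3)
```

Here `trimer` holds three copies of the two-atom subunit, rotated by 0, 120 and 240 degrees. Only the first copy needs to be stored, which is the same economy that lets a PDB file hold just the asymmetric unit.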
The file generated by the PISA server may also be downloaded from the PDB; I will discuss this in more detail on the next page. The whole concept is illustrated in the figure below:
In the left figure the asymmetric unit of the crystal is just one subunit, and all molecules in the lattice are related to each other by simple translation. In the middle figure there are two subunits in the unit cell, related by a two-fold axis. There is a good chance that the biological unit in solution is a dimer. In the figure on the right the molecules within the unit cell are related by a four-fold crystallographic symmetry axis. Again, it cannot be excluded that the biological unit is a tetramer.
The information on this page is useful for quickly identifying the position of amino acids within the structure and for getting an idea of the type of the protein (all alpha, alpha/beta, etc.). There is also information on the publication that describes the structure (rather detailed in PDBsum, with links to citing papers). On the next page we will continue with the discussion of the PDB and the information that can be found in PDB files.