Protein Databases: Short Overview

Where can I find a protein structure I am interested in? Which protein database should I use?
There are many protein and structural bioinformatics resources on the Internet. Some are general in character, while others are dedicated to specific aspects of protein structure, to specific protein families, to specific metabolic pathways, etc. Here I will discuss just a few general-purpose databases. For beginners, and for our purpose, which is creating a homology model, I find it more efficient to limit the number of databases in use; otherwise it may become too confusing. At a later stage, when new questions arise, it will be easy to find the required resources.
Probably the first question when working with a protein structure is where to find the structure of interest. Another question, which many people need to ask once they get access to a protein structure file, is: What is actually inside that file? What information can be found there apart from the structure as such? And by the way, what is that structure? How do we get it into a file? Is it a collection of separate small pictures of various parts of the protein, which are put together by a computer program? Or is there something else?

The primary database for protein structure information is the Protein Data Bank (PDB), created at the beginning of the 1970s. Believe it or not, databases already existed at that time, even if it almost feels like the Stone Age. Only a few structures existed then, and the only experimental method available for protein structure determination was X-ray crystallography. The real structural revolution, as you may see from the figure below, started at the beginning of the 1990s:

PDB growth
One of the reasons for this structural revolution was that cloning techniques started to enter the lab, and both the number and the amount of proteins available for crystallization increased substantially. Before the cloning era people had to purify proteins from cells, and apparently cells do not express large quantities of a protein just because we need it for crystallization. Therefore, to obtain a few milligrams of a protein for crystallization one would need a lot of cells. Cloning solved the problem: proteins could be expressed in large quantities and purified for crystallization.
Another important factor was the introduction of synchrotron radiation. Synchrotrons, like MAX Lab in Lund, Sweden, ESRF in Grenoble, France, DESY in Hamburg, Germany, and many others around the world, provide very high intensity X-rays, which may be used for collecting high-quality X-ray diffraction data from crystals.
The third factor was probably the introduction of personal computers, relatively cheap and with ever-increasing power. As usual Apple was of course first, but the cheaper DOS-based machines took over the market very soon, especially after the introduction of Windows at the beginning of the 1990s. Cheaper computers also meant new software, which also started to become user-friendly. That was when the number of protein structures started to increase dramatically. Then came the era of structural genomics: large consortia were formed with the aim of developing new technology for solving large numbers of protein structures. One such consortium is the Structural Genomics Consortium (SGC). And with a larger number of structures available, the number of protein databases started to increase, and new tools for the analysis of protein sequence and structure were rapidly developed.

Nowadays every newly determined protein structure has to be deposited in the Protein Data Bank before the scientific paper describing the structure can be published. Currently there are more than 80 000 structures in the PDB and the number is increasing very rapidly, as can be seen from the figure. However, one should remember that this number does not represent unique structures. In many cases there are many entries for the same protein in the database - some variants with amino acid replacements, some complexes with ligands (substrate analogues, inhibitors, co-factors), etc. This may be a source of confusion when one tries to fetch a structure from the PDB - which one should be chosen if there are many entries for the same protein? This will be discussed later in the chapter on homology modeling. For modeling it is important to choose the right structure of the best available quality. In other words, we need to learn to distinguish among the entries to be able to identify the most suitable structure for a homology modeling project.

So, coming back to our initial questions: how do we fetch a structure, and what is inside the PDB file? First we need to check if there is a structure for the protein we are interested in. This part is easily done; all you need to do is go to the PDB and type the name of the protein you are looking for into the search window. For example, I typed in the name of a protein called magnesium chelatase. If you do the same you will get a few hits; however, in this case there is only one X-ray structure, for one of the subunits of magnesium chelatase. The other proteins listed there are not magnesium chelatase. This is something I don't like about the Protein Data Bank. Sometimes PDBsum gives clearer results: when I type in the name of the same protein, I get a single hit (PDB ID 1g8p). Of course you may refine your search using the options provided on the PDB page that show up when you enter the name of the protein:

Magnesium-chelatase-PDB-search

Among the options we may use to refine our search, we can specify the organism from which the protein originates, choose the subunit, the experimental method, etc.
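Once the entry of interest is known (1g8p in the example above), the coordinate file can also be fetched programmatically. The following is a minimal sketch, not an official client: it assumes that RCSB serves legacy PDB-format files under `https://files.rcsb.org/download/`, and the function names `pdb_url` and `fetch_pdb` are made up for this illustration.

```python
from urllib.request import urlopen

# Assumption: RCSB serves PDB-format files at this address.
RCSB_DOWNLOAD = "https://files.rcsb.org/download/{pdb_id}.pdb"


def pdb_url(pdb_id: str) -> str:
    """Build the download URL for a classic four-character PDB ID."""
    pdb_id = pdb_id.strip().upper()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError(f"not a four-character PDB ID: {pdb_id!r}")
    return RCSB_DOWNLOAD.format(pdb_id=pdb_id)


def fetch_pdb(pdb_id: str, timeout: float = 30.0) -> str:
    """Download the entry and return its text (requires network access)."""
    with urlopen(pdb_url(pdb_id), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With network access, `text = fetch_pdb("1g8p")` would return the file content as a string, which can then be saved or inspected line by line.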
Both PDB and PDBsum provide additional data on the entry, including links to other databases where more information can be found. Here is an example from the PDBsum links page:

PDBsum-BchI-search

For our purposes we may, for example, be interested in the links to CATH and SCOP (related to structural classification of proteins). The PQS database, the Protein Quaternary Structure database, is also of interest to us. However, when you click on this link the database will inform you that it has not been updated since 2009. I think the reason is that the information which can be found in PQS is currently generated by the PISA server (Protein Interfaces, Surfaces and Assemblies).
Such a service is needed because PDB files usually contain the crystallographic unit or, as it is called in crystallography, the asymmetric unit. The biological unit in solution may often contain several subunits of the same protein; they may be arranged as dimers, trimers or higher-order oligomers. In these oligomers the subunits are usually related by some kind of rotational symmetry - two-fold rotation for dimers, three-fold rotation for trimers, four-fold rotation for tetramers, etc. When the molecules are crystallized, they get arranged in certain types of space lattices, within which all molecules are ordered and related to each other by symmetry operations (the symmetry groups are listed in the International Tables for Crystallography). The symmetry axes present in the molecule in solution, which could be 2-, 3-, or 4-fold, may become part of the crystallographic symmetry. In such cases one subunit of, for example, a trimer becomes the asymmetric unit of the crystal. Crystallography operates with asymmetric units, since the other units will be exactly the same and related by the symmetry operations of the crystal. This is reflected in the content of the files in the PDB: in such cases they contain coordinates for the atoms of the asymmetric unit only, here a single subunit. For this reason the PISA server reconstructs the biological unit in cases when it is known to be different from the asymmetric unit or when there are other indications which need to be taken into account.
The file generated by the PISA server may also be downloaded from the PDB; I will discuss this in more detail on the next page. The whole concept is illustrated in the figure below:

In the left figure the asymmetric unit of the crystal is just one subunit, and all molecules in the lattice are related to each other by simple translation. In the middle figure there are two subunits in the unit cell, related by a two-fold axis. There is a good chance that the biological unit in solution is a dimer. In the figure on the right the molecules within the unit cell are related by a four-fold crystallographic symmetry axis. Again, it cannot be excluded that the biological unit is a tetramer.

Asymmetric-unit
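The idea of rebuilding an oligomer from a single subunit can be sketched in a few lines. This is only an illustration of the principle and not how PISA works internally: it assumes a pure n-fold rotation axis running along z through the origin, whereas real crystallographic operators have arbitrary axis orientations and may include translations. The coordinates and the function names `rotate_z` and `build_oligomer` are hypothetical.

```python
import math


def rotate_z(point, angle_deg):
    """Rotate an (x, y, z) point about the z axis by angle_deg degrees."""
    x, y, z = point
    a = math.radians(angle_deg)
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a),
            z)


def build_oligomer(subunit, n_fold):
    """Return n_fold copies of a subunit related by an n-fold axis along z."""
    step = 360.0 / n_fold
    return [[rotate_z(atom, k * step) for atom in subunit]
            for k in range(n_fold)]


# Hypothetical coordinates (in angstroms) for three atoms of one subunit.
subunit = [(10.0, 0.0, 1.0), (11.5, 0.5, 1.2), (12.0, -1.0, 2.0)]

# A three-fold axis turns the single subunit into a trimer.
trimer = build_oligomer(subunit, 3)
```

The first copy is the original subunit (rotation by 0 degrees), and the other two are rotated by 120 and 240 degrees. Every atom keeps its distance from the axis, which is what makes the copies "exactly the same" in the crystallographic sense.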

Both PDB and PDBsum provide a description of the amino acid sequence in relation to the secondary structure of the protein. The image below shows the respective page from PDBsum:

pdbsum2

The information on this page is useful for quickly identifying the position of amino acids within the structure and for getting an idea of the type of the protein (all alpha, alpha/beta, etc.). There is also information on the publication which describes the structure (rather detailed in PDBsum, with links to citing papers). On the next page we will continue with the discussion of the PDB and the information which can be found in PDB files.
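The kind of lookup that a sequence-to-secondary-structure view supports can be sketched as follows. Everything here is an assumption made for illustration: the 20-residue sequence and its annotation are invented and not taken from any PDB entry, the one-letter codes (H for helix, E for strand, - for everything else) are a simplified convention, and `element_at` and `composition` are hypothetical helper names.

```python
from collections import Counter

# Hypothetical 20-residue example; not taken from any real PDB entry.
sequence = "MKTAYIAKQRQISFVKSHFS"
secondary = "--HHHHHHHHH--EEEEE--"  # H = helix, E = strand, - = other


def element_at(seq, ss, position):
    """Return (residue, secondary-structure code) for a 1-based position."""
    if not 1 <= position <= len(seq):
        raise IndexError(f"position {position} is outside 1..{len(seq)}")
    return seq[position - 1], ss[position - 1]


def composition(ss):
    """Return the fraction of residues assigned to each code."""
    counts = Counter(ss)
    return {code: counts[code] / len(ss) for code in "HE-"}


residue, code = element_at(sequence, secondary, 5)
fractions = composition(secondary)
```

A high helix fraction with almost no strand would hint at an all-alpha protein, while comparable helix and strand fractions would suggest a mixed alpha/beta type. Where exactly to draw the line is a matter for the structural classification databases, not for this sketch.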
