Introduction to protein databases
Which protein database should I use? Where can I find a protein structure I am interested in? There are many protein databases on the Internet. Some are general in character, while others are dedicated to specific protein families, specific metabolic pathways, etc. Here I will discuss some of the general ones. The first question that may arise when working with a protein structure is something like: Where can I find the protein structure I am interested in? Another question, which many people don't ask once they get access to a protein structure file, is: What is actually inside that file? This last question is essentially about how to put all the beautiful 3D structures, with all their helices, strands, loops, etc., into a single file.
The main database for protein structure information is the Protein Data Bank (PDB), created in 1971. Believe it or not, databases already existed at that time, even if it almost feels like the Stone Age. Only a few structures existed in the PDB back then, and the only experimental method available for protein structure determination was X-ray crystallography. The real structural revolution, as you may see from the figure below, started in the early 1990s:
Why then? One of the reasons was that cloning techniques started to enter the lab, and the amount of protein available for crystallization increased substantially. Before the cloning era people had to purify proteins from cells, and the cells apparently had no need to express large quantities of the proteins we wanted to crystallize. Therefore, to obtain a few milligrams of protein for crystallization one needed a huge amount of cells. Cloning solved the problem. Another important factor was the introduction of synchrotron radiation. Synchrotrons, like MAX Lab in Lund, Sweden, ESRF in Grenoble, France, or DESY in Hamburg, Germany, provide very high intensity X-rays, which may be used for collecting high-quality X-ray diffraction data from crystals. The third factor was probably the introduction of personal computers, relatively cheap and with ever-increasing power. As usual Mac was of course first, but the cheaper DOS-based machines took over the market very soon, especially after the introduction of Windows in the early 1990s. And cheaper computers mean new software. That was when the number of protein structures started to increase dramatically. Then came the era of structural genomics: large consortia were formed with the aim of developing new technology for solving huge numbers of protein structures. One such consortium is the SGC. And with a larger number of structures available, the number of protein databases started to increase, and new tools for the analysis of protein sequence and structure developed rapidly.
Currently every new structure has to be deposited with the Protein Data Bank before the research group can publish a paper based on that structure. So, how do you get a structure, and what is inside the PDB file? Getting a structure is very easy. All you need to do is go to the PDB and type in the name of the structure you are looking for:
I typed in the name of a protein I know, magnesium chelatase. We should have got only a single hit, since there is only one X-ray structure of a magnesium chelatase (actually of one of its subunits). However, several other hits are listed there, and they have very little to do with magnesium chelatase. This is something I don't like about the Protein Data Bank. For this kind of search I prefer to use PDBsum or PDBe (PDB Europe). When I type in the name of the same protein there, I get a single hit. The search function in the original PDB does not seem to work properly; however, they have great educational material on using the Protein Data Bank. I recommend having a look at it. You may find the link to the educational material at the bottom of the sidebar menu.
Let us use PDBsum as the entry point for our exploration of interesting protein databases. The main page looks something like this:
And the search result I get from typing "magnesium chelatase" into the text search area looks like this:
All I need to do now is click on 1g8p on the left; this is the PDB code for this particular structure. I use the same structure in the homology modeling tutorial. What we get is this:
But this is just the top of the page. If you click here, you will get the page with all the links to other protein databases containing information related to this structure, information about the article in which the structure was published, references that cite that publication, etc. I like it, and I think it is wonderful to be able to squeeze so much information into a single page.
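Since every entry is identified by its four-character PDB code, the same file can also be fetched without clicking through a web page. The following is only a minimal sketch: the helper names `pdb_file_url` and `fetch_pdb` are made up for illustration, the ID check assumes the classic four-character code format (a digit followed by three letters or digits), and the URL pattern is the RCSB file-download address as I know it at the time of writing, which may change.

```python
import re
import urllib.request


def pdb_file_url(pdb_id: str) -> str:
    """Build the download URL for a PDB-format file (hypothetical helper).

    Assumes a classic four-character PDB code such as "1g8p".
    """
    code = pdb_id.strip().upper()
    # Classic PDB codes: one digit followed by three letters or digits.
    if not re.fullmatch(r"[0-9][A-Z0-9]{3}", code):
        raise ValueError(f"not a four-character PDB code: {pdb_id!r}")
    # Assumed URL pattern of the RCSB file server.
    return f"https://files.rcsb.org/download/{code}.pdb"


def fetch_pdb(pdb_id: str, timeout: float = 30.0) -> str:
    """Download the entry and return its text (requires network access)."""
    with urllib.request.urlopen(pdb_file_url(pdb_id), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the address; call fetch_pdb("1g8p") to download the file.
    print(pdb_file_url("1g8p"))
```

The URL construction is kept separate from the download so that it can be checked without a network connection.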
Here you can see the secondary structure along the amino acid sequence, which is important information when, for example, doing amino acid sequence alignment and homology modeling.
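A PDB-format file can carry similar information itself, in fixed-column HELIX (and SHEET) records. As a rough sketch of how a helix assignment could be laid along the residue numbering, the code below reads HELIX records and marks helical residues with "H". The function names and the sample record are made up for illustration (the record is not taken from 1g8p), the column positions are those of the PDB file format description as I understand them, and PDBsum may well compute its own assignment differently.

```python
from typing import Iterable, List, Tuple

# Made-up HELIX record for illustration only (not taken from 1g8p).
SAMPLE_RECORD = "HELIX    1  HA GLY A   86  GLY A   94  1"


def parse_helices(lines: Iterable[str]) -> List[Tuple[str, int, int]]:
    """Return (chain, first residue, last residue) for each HELIX record."""
    helices = []
    for line in lines:
        if line.startswith("HELIX"):
            chain = line[19]            # column 20: chain of the first residue
            start = int(line[21:25])    # columns 22-25: first residue number
            end = int(line[33:37])      # columns 34-37: last residue number
            helices.append((chain, start, end))
    return helices


def annotate(first: int, last: int,
             helices: List[Tuple[str, int, int]], chain: str = "A") -> str:
    """Mark residues first..last of one chain: 'H' in a helix, '-' otherwise."""
    marks = []
    for resnum in range(first, last + 1):
        in_helix = any(c == chain and s <= resnum <= e for c, s, e in helices)
        marks.append("H" if in_helix else "-")
    return "".join(marks)


if __name__ == "__main__":
    helices = parse_helices([SAMPLE_RECORD])
    print(annotate(84, 96, helices))
```

This ignores insertion codes and strands; it is only meant to show that the secondary structure track is a simple per-residue annotation.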
On the right side you will find links to several protein databases: