Chapter II – Protein Structure & Databases

Protein Fold Classification: The CATH Database

In the previous section, we discussed the definition of a domain, along with some examples of domains and folds. We now continue with fold classification. An in-depth analysis of a protein fold can reveal important insights into its function and evolutionary history, insights that may be difficult to obtain from amino acid sequence analysis alone. Exploring the relationships between amino acid sequences and folds deepens our understanding of the principles governing protein structure and function, and this knowledge can also help in designing new proteins with specific structures and activities.

The first step in classifying protein domains is to define their secondary structure. This is routinely done when a new structure is deposited in the Protein Data Bank (PDB). Each PDB entry includes a detailed description of the protein’s secondary structure, specifying the first and last amino acid residues of each helix and β-strand. Graphics programs can be used to visualize the secondary structure, usually by giving each type of structural element its own color. The second requirement for classification is to define the domains within a protein. A domain is the primary “unit” of classification. It is important to note that the same domain type can be found in many unrelated proteins, for example, the 4-helix bundle domain or the nucleotide-binding Rossmann fold domain. Therefore, classification cannot be based on the entire protein if it contains more than one domain.
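To make the secondary structure description concrete, the sketch below reads helix and strand boundaries from the HELIX and SHEET records of a legacy PDB-format file. It is a minimal illustration, not a full parser: the column positions follow the fixed-width PDB format as we understand it, the two sample records are invented for the example rather than copied from a real entry, and the names Element and parse_secondary_structure are our own.

```python
from typing import List, NamedTuple


class Element(NamedTuple):
    kind: str       # "helix" or "strand"
    chain: str      # chain identifier of the first residue
    start_res: str  # three-letter name of the first residue
    start_num: int  # sequence number of the first residue
    end_res: str    # three-letter name of the last residue
    end_num: int    # sequence number of the last residue


# 0-based slices for the fixed columns of each record type (assumed from
# the legacy PDB format): kind, first residue name, chain, first residue
# number, last residue name, last residue number.
_COLUMNS = {
    "HELIX": ("helix", slice(15, 18), slice(19, 20), slice(21, 25),
              slice(27, 30), slice(33, 37)),
    "SHEET": ("strand", slice(17, 20), slice(21, 22), slice(22, 26),
              slice(28, 31), slice(33, 37)),
}


def parse_secondary_structure(pdb_text: str) -> List[Element]:
    """Collect helices and strands from HELIX/SHEET records."""
    elements = []
    for line in pdb_text.splitlines():
        spec = _COLUMNS.get(line[:6].strip())
        if spec is None:
            continue  # any other record type is skipped
        kind, res1, chain, num1, res2, num2 = spec
        elements.append(Element(
            kind,
            line[chain],
            line[res1].strip(),
            int(line[num1]),
            line[res2].strip(),
            int(line[num2]),
        ))
    return elements


# Invented sample records, laid out in the fixed PDB columns.
SAMPLE = "\n".join([
    "HELIX    1   1 ASP A   25  GLY A   31  1",
    "SHEET    1   A 5 THR A  43  VAL A  48  0",
    "REMARK   other record types are ignored",
])

elements = parse_secondary_structure(SAMPLE)
for e in elements:
    print(f"{e.kind}: chain {e.chain}, "
          f"{e.start_res}{e.start_num} to {e.end_res}{e.end_num}")
```

Note that these records describe secondary structure only. They say nothing about where one domain ends and the next begins, which is why a separate domain assignment is needed.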

The PDB coordinate file does not provide information on a protein’s domain content and classification. However, PDB sites (RCSB PDB, PDBsum, and others) usually link to databases where this information is explicitly available. CATH (C-Class, A-Architecture, T-Topology, H-Homologous superfamily) is one of the primary databases on domains and domain classification. CATH gives detailed information on each protein’s domain content and describes each domain’s structure, function, and evolutionary origin. The assignment procedure includes:

  • Assignment of a Class to each domain. Class essentially reflects the secondary structure content: mainly alpha, mainly beta, or mixed alpha/beta (CATH also keeps a small class for domains with few secondary structures).
  • Assignment of Architecture (the arrangement of secondary structure elements in space, irrespective of connectivity). The amino acid sequences within a particular architecture class are not necessarily homologous – common evolutionary origin is not required.
  • Assignment of Topology. Topology is what we have been referring to as the fold. Here, the connectivity between secondary structure elements is also taken into account. Proteins with a similar fold do not need to have a common evolutionary origin.
  • Assignment of Homologous superfamily. A homologous superfamily defines a group of proteins that appear to be homologous (have common evolutionary origin), even without significant sequence similarity.

We will examine some examples of CATH classification to explain these definitions.
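CATH writes the four levels as a dot-separated code, one number per level, so each prefix of the code names one level of the hierarchy. The sketch below splits such a code and reports the deepest level at which two domains are classified together. The helper names (parse_cath_code, deepest_shared_level) are our own. The two codes used, 3.40.50.720 for the NAD(P)-binding Rossmann-like domain and 3.40.50.300 for the P-loop NTPases, are given to the best of our knowledge and should be checked against the current CATH release.

```python
from typing import NamedTuple, Optional

# Top-level CATH classes (class numbers as used by CATH, to our knowledge).
CLASS_NAMES = {
    "1": "Mainly Alpha",
    "2": "Mainly Beta",
    "3": "Alpha Beta",
    "4": "Few Secondary Structures",
}

LEVELS = ("Class", "Architecture", "Topology", "Homologous superfamily")


class CathCode(NamedTuple):
    class_id: str      # e.g. "3"
    architecture: str  # e.g. "3.40"
    topology: str      # e.g. "3.40.50"
    superfamily: str   # e.g. "3.40.50.720"


def parse_cath_code(code: str) -> CathCode:
    """Split a C.A.T.H code into the identifier of each level."""
    parts = code.strip().split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    # Each level is identified by the prefix of the code up to that level.
    return CathCode(*(".".join(parts[:i]) for i in range(1, 5)))


def deepest_shared_level(code_a: str, code_b: str) -> Optional[str]:
    """Name the deepest level two codes share, or None if even Class differs."""
    shared = None
    pairs = zip(LEVELS, parse_cath_code(code_a), parse_cath_code(code_b))
    for level, a, b in pairs:
        if a != b:
            break
        shared = level
    return shared


rossmann = parse_cath_code("3.40.50.720")
print(CLASS_NAMES[rossmann.class_id], rossmann.architecture, rossmann.topology)
print(deepest_shared_level("3.40.50.720", "3.40.50.300"))
```

Two domains that share the first three numbers but differ in the fourth have the same topology (fold) without being assigned to the same homologous superfamily, which is the distinction drawn in the last two items of the list above.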

Using CATH to Analyze Pyruvate Kinase

For the analysis, we will use PDB ID 1T5A, the tetrameric human pyruvate kinase mentioned in the domain section. A search in CATH will return the following results for the three domains: