Structural Characterization of Proteins
More than 540,000 protein sequences have been deposited in the nonredundant database maintained by the National Center for Biotechnology Information (NCBI). By contrast, the number of unique protein structures is still fewer than 2000. Structural proteomics aims to achieve a structural characterization of protein complexes. The term structural genomics is sometimes used to describe the goal of delineating the 3D structures of individual proteins.
Despite many years of development of molecular simulation methods, attempts to refine models produced by both de novo and comparative modeling have met with relatively little success (Baker, “Protein Structure Prediction and Structural Genomics” Science, 294(5), 2001).
Difficulty of Predicting function/structure
Difficulty in predictions based on sequence similarity
The sequencing of entire genomes is a major achievement, but the meaning of the mass of accumulated data is only just beginning to be unraveled. At first sight, the task appears straightforward: locate the genes and translate the coding regions to establish their protein products; perform similarity searches to establish relationships with previously characterized sequences and assign function by evolutionary inference; and rationalize the function in structural terms using known or model-derived structures. The reality, of course, is not so simple. Attempts to decipher the clues latent in genomic data are hampered because current methods to predict genes in uncharacterized DNA are unreliable (and it is not always clear what is meant by “gene”). It is presumptuous to make functional assignments merely on the basis of some degree of similarity between sequences; very few structures are known compared to the number of sequences, and structural prediction methods are unreliable (knowing structure does not inherently reveal function). (Attwood, “The babel of bioinformatics” Science 290(5491), 2000, pp. 471-473)
Difficulty in predictions based on sequence:
Decades of research have failed to produce an algorithm for predicting the structure of a given protein from its amino acid sequence alone (Thomas Ngo, “Computational complexity, protein structure prediction, and the Levinthal Paradox” in The Protein Folding Problem and Tertiary Structure Prediction, K. Merz and Le Grand, 1994).
Single Amino Acid changes can have significant consequences for function:
Burgess (J Cell Biology, 111, November 1990, 2129-2138) showed that replacement of a single lysine residue with glutamic acid reduces the apparent affinity of HBGF-1 for immobilized heparin.
Lazar studied the relationship between the primary structure of TGF-alpha and some of its functional properties, such as competition with EGF for binding to the EGF receptor. Lazar introduced single amino acid mutations into the sequence of the fully processed 50 amino acid human TGF-alpha and found that mutations of two amino acids that are conserved in the family of EGF-like peptides and are located in the carboxy-terminal part of TGF-alpha resulted in different biological effects. When aspartic acid 47 was mutated to alanine or asparagine, biological activity was retained; in contrast, substitution of this residue with serine or glutamic acid generated mutants with reduced binding and colony-forming capacities. When leucine 48 was mutated to alanine, a complete loss of binding and colony-forming abilities resulted. The data suggested that these two adjacent conserved amino acids at positions 47 and 48 play different roles in defining the structure and/or biological activity of TGF-alpha. (Lazar, Molecular and Cellular Biology, Mar 1988, p. 1247-1252).
Engineering Binding Proteins based on Non-Antibody Scaffolds
The term “scaffold” in the context of protein engineering describes a polypeptide framework whose fold has a high tolerance for modifications such as multiple insertions, deletions or substitutions. This intrinsic conformational stability enables directed randomization and drastic changes within a defined region of the protein. The protein thus acquires certain novel properties, whereas its overall structural integrity and original physicochemical behavior remain conserved. This de novo adopted property mostly, but not exclusively, includes binding specificity for a pre-defined target molecule. (Hey, “Artificial, non-antibody binding proteins for pharmaceutical and industrial applications” TRENDS Biotechnol. 23(10): 514-522 (2005)).
There are now multiple 3D structures of binder/target complexes available from four scaffold systems: fibronectin type III (FN3) domains of human fibronectin; affibodies (derived from the immunoglobulin binding protein A); DARPins (based on ankyrin repeat modules); and anticalins (derived from the lipocalins bilin-binding protein and human lipocalin). The increasing number of structures available from these systems, nearly 30 now in the Protein Data Bank (PDB), may allow extracting meaningful information that goes beyond isolated “anecdotes”. Relationships between structure and affinity have been extensively examined in the context of natural interfaces. However, simple structural parameters such as interface size, packing, and hydrophobicity have been shown to be poor predictors of affinity. Although all these parameters surely influence affinity, their relative contributions appear to be highly context dependent, minimizing their individual predictive power. (Gilbreth, “Structural insights for engineering binding proteins based on non-antibody scaffolds” Current Opinion in Structural Biology, 2012, 22: 413-420).
Antibody-related scaffolds: Included among these alternative scaffold proteins are derivatives of antibodies that are either naturally smaller in size or have been engineered to be so. Companies such as Ablynx are using the variable domains of human light or heavy chains or single-domain antibodies from camelids.
Affibodies: are based on the Z-domain, a designed variant of the IgG binding Protein A from Staphylococcus aureus, a three-helix bundle consisting of 58 amino acids. Affibody libraries are generated by randomization of up to 13 solvent-exposed amino acids in the alpha-helical structures without harming the overall structure of the parental scaffold. This strategy has been applied by Affibody.
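As a rough illustration of the scale such a randomization implies, the sketch below counts the theoretical diversity when a number of positions is varied independently. It assumes all 20 standard amino acids are allowed at each position and, for the DNA-level count, an NNK codon scheme (32 codons per position); neither assumption comes from the cited work, and `library_diversity` is a hypothetical helper name.

```python
def library_diversity(n_positions: int, options_per_position: int = 20) -> int:
    """Count distinct variants when each position is randomized independently."""
    if n_positions < 0 or options_per_position < 1:
        raise ValueError("need n_positions >= 0 and options_per_position >= 1")
    return options_per_position ** n_positions


# 13 randomized positions, as in the upper bound quoted above.
protein_variants = library_diversity(13)       # assumed: 20 amino acids per position
nnk_dna_variants = library_diversity(13, 32)   # assumed: 32 NNK codons per position

print(f"protein-level diversity: {protein_variants:.3e}")
print(f"NNK DNA-level diversity: {nnk_dna_variants:.3e}")
```

Numbers of this size are far beyond what a physical library can contain, which is why such libraries sample the theoretical sequence space rather than cover it.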
Three-helix bundle: occurs ubiquitously in nature as a robust scaffold for molecular recognition. First observed in the helical IgG binding domains of Staphylococcus aureus, this family has grown to include DNA binding proteins, enzymes and structural proteins. (Walsh, “Solution structure and dynamics of a de novo designed three-helix bundle protein” Proc. Natl. Acad. Sci., 95, 5480-5491 (199).
–Alpha3D three-helix bundle polypeptide:
Cangelosi (Angew Chem Int Ed Engl. 2014, 53(3): 7900-7903) discloses the design of a metalloenzyme starting from alpha3D, a de novo single-stranded antiparallel three-helix bundle. This 73 amino acid protein folds with native protein-like stability, is tolerant of mutations within the hydrophobic core, and has been structurally characterized by NMR spectroscopy. The new metalloenzyme, alpha3DH3, contains a His3 site that, upon binding Zn(II), catalyzes the hydration of CO2. Alpha3DH3 differs from alpha3D in that three leucine residues were replaced with histidine residues (L18H, L28H, L67H), a histidine residue was replaced with valine (H72V) to ensure no competition for Zn(II) binding, and four extra residues were added to the end of the chain (GSGA), which improved expression yields.
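The substitution labels above follow the usual wild-type residue, position, replacement notation (1-based positions). A minimal sketch of applying such labels to a sequence is given below. The 73-residue parent string is a placeholder built only so that the labelled positions carry the expected residues; it is NOT the real alpha3D sequence, and `apply_mutations` is a hypothetical helper.

```python
import re

MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_mutations(sequence, mutations, c_terminal_extension=""):
    """Apply labels such as 'L18H' (1-based) and append an optional C-terminal tail."""
    residues = list(sequence)
    for label in mutations:
        match = MUTATION.match(label)
        if match is None:
            raise ValueError(f"unrecognized mutation label: {label!r}")
        wild_type, position, replacement = match.group(1), int(match.group(2)), match.group(3)
        if not 1 <= position <= len(residues):
            raise ValueError(f"{label}: position outside sequence")
        if residues[position - 1] != wild_type:
            raise ValueError(f"{label}: found {residues[position - 1]} at position {position}")
        residues[position - 1] = replacement
    return "".join(residues) + c_terminal_extension


# Placeholder parent (NOT the real alpha3D sequence): alanine everywhere,
# with Leu at 18, 28, 67 and His at 72 so that the labels can be checked.
placeholder = ["A"] * 73
for pos in (18, 28, 67):
    placeholder[pos - 1] = "L"
placeholder[72 - 1] = "H"
parent = "".join(placeholder)

variant = apply_mutations(parent, ["L18H", "L28H", "L67H", "H72V"], "GSGA")
print(len(parent), len(variant))
```

Checking the wild-type residue before substituting catches off-by-one numbering errors, which are common when a construct carries tags or a leader sequence.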
Lafleur (US 15/564,325, published as US 10,662,248; US 15/564,319, published as US 10,647,775; US 16/817,755, published as US 2020/0223934; see also US 16/824,809, published as US 2021/0002381) discloses de novo binding domain containing polypeptides (DBDpp) that bind to targets of interest. The sequences of the DBDpp are preferably not antigenic with respect to a subject and thus do not contain a human HLA-DR binding motif or cleavage sites for proteasomes and immune-proteasomes. In particular embodiments, the DBDpp sequence does not contain an MHC (class I or class II) binding site sequence as predicted by an algorithm such as ProPred. In silico analysis of the amino acid sequence of alpha3D revealed a 9 amino acid sequence that shared characteristics with those of high-affinity and promiscuous T cell epitopes. Thus, with the aim of reducing the potential for immunogenicity, a Q19E substitution was introduced into the alpha3D sequence. This conservative and surface-exposed substitution appeared unlikely to significantly disrupt the hydrophobic core.
Park (Protein Engineering, Design & Selection, 19(5), 211-217, 2006) discloses the creation of a library of alpha3D mutants to explore the possible correlation of protein stability and fold with expression level. Five efficiently expressed mutants were then purified and further studied. Despite their differences in stability, most mutants expressed at levels comparable with that of wild-type alpha3D. Two other related sequences (alpha3A and alpha3B) that formed collapsed, stable molten globules but lacked a uniquely folded structure were similarly expressed at high levels by yeast display.
Experimental Methods used in Structural Proteomics:
X-ray crystallography: is the most prolific technique for the structural analysis of proteins and protein complexes. Crystallography requires that milligram quantities of a pure protein can be prepared, and that the protein can be induced to form 3D crystals. When suitable crystals and high-resolution crystallographic data are obtained, there is little need for other methods of structure characterization.
NMR spectroscopy: also contributes a significant number of structures to the database of biomolecular structures. Although NMR analysis is generally less applicable than X-ray crystallography to proteins with more than 300 amino acid residues, it is better suited than X-ray crystallography to studying their dynamics and interactions in solution.
Two-dimensional electron microscopy (2D EM): produces images that represent only 2D projections of the specimen. Nevertheless, the full 3D structure of the object can be reconstructed if one starts with many such projections, each showing the object from a different angle. Because it is possible to use non-crystalline particles, the purity does not need to meet the standard required for crystallization.
Electron tomography: is based upon multiple tilted views of the same object. Tomograms of cells are essentially 3D images of the cell’s entire proteome. They reveal information about the spatial relationships of macromolecules in the cytoplasm.
Immuno-electron microscopy: uses a construct of the protein of interest that binds to a gold-labelled antibody. The relative position of the gold particles is then identified by EM.
Chemical crosslinking with mass spectrometry: relies on crosslinking reagents that covalently link proteins interacting with each other. Proteolytic digestion and subsequent mass spectrometry of the crosslinked species reveal their composition.
Affinity purification with mass spectrometry: combines purification of protein complexes with identification of their individual components by mass spectrometry.
Site-directed mutagenesis: can reveal which subunits in a complex interact with each other.
Alpha-Helices:
The alpha-helix is a ubiquitous secondary structural element that is almost exclusively observed in proteins when stabilized by tertiary or quaternary interactions. Beginning with the unexpected observation of alpha-helix formation in the isolated C-peptide of ribonuclease A, there is growing evidence that a significant percentage of all proteins contain isolated, stable single alpha-helical (SAH) domains. These SAH domains provide unique structural features essential for normal protein function. The SAH domain is found in numerous proteins, where it appears to operate as a semi-rigid structural element that tethers globular domains. (Swanson, “Harnessing the unique structural properties of isolated alpha-helices” J Biological Chemistry, 289(37), 2014).
EAAAK motif: has been incorporated into various synthetically engineered polypeptides. It has, for example, been used as a spacing linker between a pair of green fluorescent protein variants. (Swanson, “Harnessing the unique structural properties of isolated alpha-helices” J Biological Chemistry, 289(37), 2014).
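To give a feel for how such a spacing linker sets the separation between two domains, the sketch below builds a linker from repeated EAAAK units and estimates its length. It assumes the linker forms an ideal alpha-helix with a rise of 1.5 Å per residue (a textbook value, not a figure from the cited review); real linkers may bend or fray, and both function names are hypothetical.

```python
RISE_PER_RESIDUE_ANGSTROM = 1.5  # ideal alpha-helix rise per residue (assumption)


def eaaak_linker(n_repeats: int) -> str:
    """Return a linker made of n_repeats copies of the EAAAK unit."""
    if n_repeats < 1:
        raise ValueError("n_repeats must be at least 1")
    return "EAAAK" * n_repeats


def ideal_helix_length(sequence: str) -> float:
    """Approximate end-to-end length in angstroms, assuming a straight ideal helix."""
    return len(sequence) * RISE_PER_RESIDUE_ANGSTROM


for n in (1, 3, 5):
    linker = eaaak_linker(n)
    print(n, linker, ideal_helix_length(linker))
```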
ER/K motif: ER/K alpha-helices are the most extensively characterized SAH domains. They are found in myosin X and VI. The ER/K motif has readily found applications in protein engineering. Three separate studies have used chimeric approaches to investigate the influence of ER/K alpha-helix mechanical properties on myosin function. (Swanson, “Harnessing the unique structural properties of isolated alpha-helices” J Biological Chemistry, 289(37), 2014).
Non Experimental Methods (In silico):
Using the 3D structures of antibody-antigen complexes, it is possible to enhance antibody-antigen binding affinities by in silico mutation of antibody residues. In the best case, when the antibody-antigen complex structure is available, it is relatively straightforward to perform affinity maturation in silico. First, the protein backbone is treated as rigid, and the conformation of each side chain is determined by a discrete side-chain rotamer search. Second, the lowest-energy structures are re-evaluated using more accurate but computationally more expensive models. (Zhao, “In silico methods in antibody design” Antibodies, 2018)
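The two-stage procedure described above can be sketched as a toy search: enumerate discrete rotamer combinations under a cheap score, keep a shortlist, then rescore the shortlist with a costlier function. Everything in the block is illustrative. The position names, rotamer labels, and energy values (arbitrary integer units) are made up, and the "accurate" model is simply the cheap score plus a few pairwise terms, not any published force field.

```python
from itertools import product

# Hypothetical rotamer library: position -> {rotamer label: cheap self-energy}.
ROTAMERS = {
    "H:S52": {"t": -1, "g+": -5},
    "H:Y33": {"t": -10, "g+": -7, "g-": 4},
    "L:W91": {"t": -8, "g-": -9},
}

# Hypothetical pairwise terms seen only by the more expensive model.
PAIR_TERMS = {
    (("H:Y33", "t"), ("L:W91", "g-")): 15,   # made-up steric clash
    (("H:Y33", "g+"), ("L:W91", "g-")): -7,  # made-up favorable contact
}


def cheap_energy(assignment):
    """Stage 1 score: sum of independent per-rotamer energies."""
    return sum(ROTAMERS[pos][rot] for pos, rot in assignment)


def accurate_energy(assignment):
    """Stage 2 score: cheap energy plus pairwise corrections."""
    chosen = set(assignment)
    pairwise = sum(v for (a, b), v in PAIR_TERMS.items() if a in chosen and b in chosen)
    return cheap_energy(assignment) + pairwise


def two_stage_search(keep=3):
    """Enumerate all rotamer combinations, shortlist by cheap score, rescore."""
    positions = sorted(ROTAMERS)
    candidates = [
        tuple(zip(positions, combo))
        for combo in product(*(ROTAMERS[p] for p in positions))
    ]
    shortlist = sorted(candidates, key=cheap_energy)[:keep]
    return min(shortlist, key=accurate_energy)


best = two_stage_search()
print(dict(best), accurate_energy(best))
```

In this toy data the combination ranked first by the cheap score is penalized on rescoring, so a lower-ranked shortlist member wins. That is the reason for keeping several candidates rather than only the stage 1 optimum.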
de novo or ab initio methods: predict the structure from sequence alone, without relying on similarity at the fold level between the modeled sequence and any known structures. These methods assume that the native structure corresponds to the global free-energy minimum and attempt to find this minimum by exploring many conceivable protein conformations.
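The idea of searching for a global energy minimum by sampling many conformations can be illustrated with a Metropolis Monte Carlo walk on a one-dimensional toy landscape. This is a sketch under loud assumptions: the energy function is invented, a single coordinate stands in for a conformation, and Metropolis sampling is one common search strategy chosen here for illustration rather than the method of any particular prediction program.

```python
import math
import random


def toy_energy(x: float) -> float:
    """Made-up rugged landscape with several local minima."""
    return 0.1 * x * x + math.sin(3.0 * x)


def metropolis_search(start, steps=5000, temperature=0.5, step_size=0.5, seed=0):
    """Random walk that accepts uphill moves with Boltzmann probability."""
    rng = random.Random(seed)
    current = best = start
    for _ in range(steps):
        trial = current + rng.uniform(-step_size, step_size)
        delta = toy_energy(trial) - toy_energy(current)
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = trial
            if toy_energy(current) < toy_energy(best):
                best = current  # remember the lowest-energy point visited
    return best


best_x = metropolis_search(start=4.0)
print(best_x, toy_energy(best_x))
```

Accepting some uphill moves lets the walk escape local minima. Even so, there is no guarantee of reaching the global minimum in a finite number of steps, which mirrors the difficulty the paragraph above describes.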
homology modeling: relies on detectable similarity between most of the modeled sequence and at least one known structure. It involves finding known structures related to the sequence to be modeled, aligning the sequence with the related structures, building a model, and assessing the model.
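One simple way to quantify the similarity between a target sequence and a candidate template is percent identity over an existing alignment. The sketch below assumes the two sequences are already aligned with `-` as the gap character and counts identity over columns where neither sequence has a gap, which is only one of several conventions in use. The example sequences are invented and `percent_identity` is a hypothetical helper.

```python
def percent_identity(aligned_query: str, aligned_template: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for q, t in zip(aligned_query, aligned_template):
        if q == "-" or t == "-":
            continue  # skip gap columns
        compared += 1
        if q == t:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Invented, pre-aligned example sequences.
query = "MKT-AYIAKQR"
template = "MKSLAYI-KQR"
print(round(percent_identity(query, template), 1))
```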
computational docking: is based on maximizing the shape and chemical complementarities between a given pair of interacting proteins.
bioinformatic analysis of genomic sequences: Multiple sequence alignments and protein structure
–ConPLex: is an in silico screening tool that makes predictions of binding based on the distance between learned representations, enabling predictions at the scale of massive compound libraries and the human proteome. (Singh, Biophysics and Computational Biology, “Contrastive learning in protein language space predicts interactions between drugs and protein targets”, 120(24), 2023).
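The general idea of scoring binding by the distance between learned representations can be sketched as follows. This is not the ConPLex implementation or its API. The three-dimensional vectors are invented stand-ins for learned embeddings, cosine similarity is assumed as the closeness measure, and the function and compound names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have equal length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_u * norm_v)


def rank_compounds(target_embedding, compound_embeddings):
    """Sort compounds from most to least similar to the target embedding."""
    scored = [
        (name, cosine_similarity(target_embedding, vec))
        for name, vec in compound_embeddings.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Invented embeddings; a real model would produce high-dimensional vectors.
target = [1.0, 0.0, 1.0]
compounds = {
    "compound_a": [0.9, 0.1, 1.1],
    "compound_b": [-1.0, 0.5, 0.0],
    "compound_c": [0.0, 1.0, 0.0],
}
ranking = rank_compounds(target, compounds)
for name, score in ranking:
    print(name, round(score, 3))
```

Because each score is a single vector comparison, target and compound embeddings can be computed once and reused. This is what makes screening at the scale of large libraries feasible.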