General Search links:  Information hyperlinked over proteins (iHOP) (this is a useful database to find information regarding proteins as well as synonyms for the proteins)

Database entry portals: 

SRS SRS DBGET NCBI Cross-Ref DB (one of the most comprehensive cross-reference sites by the US govt to other databases) PubMed ExPASy (protein analysis) The Protein Data Bank (structural data; the PDB contains more than 126,000 experimentally determined atomic-level 3D structures of biological macromolecules (proteins, DNA, RNA). Knowledge of the 3D structure of a gene product can help in understanding its function and role in disease.) Enzyme Data Base UniProt (another good database for sequence alignment and other database entries).

Sequence Analysis/manipulation:

 EMBL NCBI Sequence Manipulation Suite (this site will do a lot of functions all in one with a given sequence)

Immunoglobulins:  IgBlast IMGT

Search Engines: 

MIA (Molecular Information Agent will actually take your sequence, etc and search for all information pertaining to that query)

Structural Visualization Software

PyMOL  (visualization system on an open-source foundation). 

Links to Bioinformatic Sites: 

Genamics Keuleven Genome Web (very extensive and updated list of genome web sites) Bioreference (a very useful tool for finding highly used links; also provides a short description of each link) DBCAT IndianaU EMBL (highly useful site that cross-references you to sites that let you do the things you want to do!)

Utility Tools: 

File conversion Seqtools

Several goals for cancer genomics and proteomics are (1) a better understanding of cancer development, (2) providing a molecular basis for histological diagnoses, and (3) subdividing tumor types according to prognosis and response to therapy (i.e., certain proteomes may be associated with greater success with certain therapies or may represent long-term survivors). Some interesting areas with respect to studies in cancer genomics/proteomics are the following:

Profiling T-cell reactive tumor antigens

Human tumor antigens recognized by CD4+ or CD8+ T cells are being classified into groups on the basis of their expression pattern. The expression pattern of the antigens is the critical factor determining their potential usefulness for cancer immunotherapy. 

Cancer Immunity – Peptide Database

MHC peptides can be recovered by detergent or acid treatments of cells. They can also be obtained by transfecting tumor cells with soluble versions of MHC molecules, isolating the secreted MHC molecule and recovering the peptide.  Recovered MHC peptides can then be identified by mass spec approaches. These peptides can then be used to see whether T cells can be stimulated. 

Methylation Profiling

Cancers have historically been linked to genetic changes caused by chromosomal mutations within the DNA. Mutations, hereditary or acquired, can lead to the loss of expression of genes critical for maintaining a healthy state. Evidence now supports that a relatively large number of cancers originate, not from mutations, but from inappropriate DNA methylation. In many cases, hyper-methylation of DNA incorrectly switches off critical genes, such as tumor suppressor genes or DNA repair genes, allowing cancers to develop and progress. This non-mutational process for controlling gene expression is described as epigenetics.

Aberrations in DNA methylation patterns are now recognized as a hallmark of the cancer cell. In fact, methylation profiling, based on gene silencing of tumor suppressor genes, can also distinguish different cancers (see US20090215709).

Source Materials

Cancer genomics and proteomics studies are dependent on different source materials for RNA and protein which can limit the studies. Source material includes the following:

immortalized tumor cell lines: Cancer cell lines are a widely used experimental tool, and their use in cancer research has several advantages: (1) a wide spectrum of tumor types is commercially available, (2) they are easy to culture in vitro, yielding significant amounts of high-quality RNA and DNA, (3) although primary tumors are heterogeneous and invariably contaminated with normal cells, tumor cell lines are usually free of normal contamination. However, there are some drawbacks: (1) One question is how closely related they are to the primary tumors from which they originated and, even more importantly, how accurately a particular cell type represents the general population of primary tumors. In other words, they may not faithfully represent patient cancers.

patient tumor explants: Some possible disadvantages are (1) they are obtained as tissue masses containing many different normal cell types. However, techniques such as fluorescence-activated cell sorting (FACS) and laser capture microdissection (LCM) can be used to isolate the cells of interest from patient samples. (2) Patient samples represent a small and finite source of protein. However, with respect to mRNA, single patient cells can be isolated and RT-PCR can be used to amplify cDNA sequences as needed for gene chip or cDNA microarray hybridizations.

Phosphorylation is one of the most frequently occurring posttranslational modifications (almost 30% of all proteins are thought to be phosphorylated). In eukaryotes, regulation of cellular processes is achieved through reversible phosphorylation of receptors, adaptor proteins and protein kinases on serine, threonine and tyrosine residues. It plays an important role in signal transduction and regulates diverse cellular processes such as growth, metabolism, proliferation, motility and differentiation.

Analysis of the entire complement of phosphorylated proteins in cells or the “phosphoproteome” has become a realistic goal due to new enrichment protocols for phosphoproteins and phosphopeptides and improvement of methods to selectively visualize phosphorylated residues by mass spectrometry. 

Procedures to identify phosphorylated proteins

1) Traditional methods include 

  • radioactive labeling with 32P-labeled ATP followed by  , 

  • Edman sequencing and 

  • phosphospecific antibodies. 

Each of these techniques has shortcomings for quantitative proteome analysis. For example, labeling with 32P requires a viable cell source and therefore is not applicable to the proteomic analysis of human tissue specimens. Most traditional methods are also inadequate because it is impossible to obtain large amounts of protein: most signaling molecules are not abundantly expressed, and their stoichiometry of phosphorylation is quite low (the phosphorylated form is often a small fraction of the individual protein concentration).

2) Mass spectrometry has become the technique of choice for phosphorylation analysis. There are, however, the following challenges:

  • Ion signals corresponding to phosphorylated peptides are significantly suppressed in the presence of non-phosphorylated peptides. Thus purification steps to enrich phosphorylated proteins (discussed below) from non-phosphorylated proteins, or phosphopeptides from non-phosphorylated peptides, are usually necessary.

  • Phosphopeptides are negatively charged whereas electrospray MS is run in the positive mode

  • Phosphopeptides are hydrophilic and do not bind well to usual prep columns (like C18 discussed below)

  • Phosphoserine and phosphothreonine are labile (β-elimination), whereas phosphotyrosine is relatively stable

Enrichment of phosphoproteins

(1) By Chromatography

  • Miniaturized reverse-phase C18 columns: The problem here is that phosphopeptides are hydrophilic and bind poorly, which results in losses.

  • Polymer-based reverse-phase (oligo R3) and porous graphite carbon (PGC): have a much higher capacity for phosphopeptides and hydrophilic peptides than C18 columns

  • Immobilized metal affinity chromatography (IMAC): is based on the affinity of negatively charged phosphate groups for positively charged metal ions (e.g. Fe3+) immobilized on a chromatographic support. The major limitation of this method is non-specific binding of non-phosphorylated peptides containing residues with acidic side chains, such as aspartic acid and glutamic acid. One strategy to overcome this problem is to derivatize all peptides by a methyl esterification reaction, which reduces non-specific binding by carboxylate groups.

(2) By Phosphospecific Antibodies: This is the simplest method to enrich phosphoproteins. Although there are several commercially available antibodies for phosphorylated (not good for phosphopeptides, however), there are currently no suitable antibodies for phosphorylated .

(3) By Chemical Modification:

  • Alkaline β-elimination: under strongly alkaline conditions, phosphoserine and phosphothreonine residues undergo an elimination reaction whereby phosphoric acid is lost and an alpha, beta unsaturated bond is formed. The end products can be detected by tandem mass spectrometry (MS/MS).

  • Chemical modification based on β-elimination: Here, samples containing phosphoproteins are first treated with a strong base, leading to a β-elimination reaction in the case of phosphoserine and phosphothreonine residues. A reactive species containing an alpha, beta unsaturated bond is formed. The biotinylated reagent reacts with sulfhydryl (SH) groups. The biotinylated phosphoprotein is now tagged for enrichment on avidin columns in later steps.

One advantage here is that the avidin-biotin extraction results in phosphopeptide enrichment.

One possible disadvantage of this method is that glycosyl groups on Ser and Thr will also be eliminated, giving false-positive results. Thus deglycosylation will be required.

  • carbodiimide condensation reaction: A disadvantage here is that the multistep chemical procedure requires 13 hours and has low yield. There are also side reactions that may interfere.
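The mass changes that MS/MS detects after phosphorylation and β-elimination can be worked out from elemental composition. The following is a minimal sketch, assuming standard monoisotopic atomic masses; the function names are illustrative and not taken from any particular software package.

```python
# Monoisotopic atomic masses in Da (standard values, rounded to 6 decimals).
ATOM_MASS = {"H": 1.007825, "O": 15.994915, "P": 30.973762}


def formula_mass(counts):
    """Sum the monoisotopic mass of a formula given as {element: count}."""
    return sum(ATOM_MASS[element] * n for element, n in counts.items())


HPO3 = formula_mass({"H": 1, "P": 1, "O": 3})   # gained on phosphorylation
H3PO4 = formula_mass({"H": 3, "P": 1, "O": 4})  # lost on beta-elimination
H2O = formula_mass({"H": 2, "O": 1})


def phosphorylation_shift(n_sites=1):
    """Mass added to a peptide carrying n_sites phosphate groups."""
    return n_sites * HPO3


def beta_elimination_shift(n_sites=1):
    """Mass change relative to the phosphopeptide when pSer/pThr lose H3PO4."""
    return -n_sites * H3PO4


def net_shift_vs_unmodified(n_sites=1):
    """Net change relative to the never-phosphorylated peptide."""
    return phosphorylation_shift(n_sites) + beta_elimination_shift(n_sites)


if __name__ == "__main__":
    print(f"phosphorylation:   {phosphorylation_shift():+.4f} Da")
    print(f"beta-elimination:  {beta_elimination_shift():+.4f} Da")
    print(f"net vs unmodified: {net_shift_vs_unmodified():+.4f} Da")
```

The net change equals the loss of one water per site, which is why the eliminated residue can be told apart from both the phosphorylated and the unmodified form. Any mass added by a subsequent tagging reagent is not included here.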

Quantitative Phosphoproteomics

There are several reasons why quantitation of phosphorylation is important. For example, the ratio of phosphorylation of a protein on multiple residues might be crucial for its function. MS-based quantitation techniques are emerging.

In one scheme of quantitation of phosphopeptides using , 2 states of phosphopeptides are labeled with isotopically distinct biotinylated mass tags. After labeling, the samples are mixed and purified over an avidin column. The unbound peptides are removed by washing followed by elution of tagged peptides and analysis by MS. The mass spectrum shows pairs of peaks for formerly phosphorylated peptides owing to the mass difference introduced by biotin tags of 2 different masses. The relative amounts of phosphopeptides in the 2 states can be derived from the relative intensities of the two peaks.

In another scheme, two protein pools differing in their extent of phosphorylation are digested with trypsin either in H₂¹⁶O or in H₂¹⁸O to obtain differential mass labeling. Equal amounts of the two pools are mixed, and phosphopeptides are selected with IMAC beads charged with Fe3+. The two peptide pools can be distinguished by a shift of 4 Da in the isotope cluster, and the difference in the extent of phosphorylation is reflected by the peak areas of the two monoisotopic peaks.
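The arithmetic behind the 4 Da shift and the peak-area comparison can be sketched as follows. This assumes that tryptic digestion in heavy water incorporates two 18O atoms per peptide (an assumption consistent with the 4 Da shift quoted above) and uses standard isotope masses; the function names and the example peak areas are hypothetical.

```python
# Monoisotopic masses of the two oxygen isotopes in Da (standard values).
O16_MASS = 15.994915
O18_MASS = 17.999160


def o18_label_shift(n_oxygens=2):
    """Mass difference between heavy and light peptide for n_oxygens exchanged."""
    return n_oxygens * (O18_MASS - O16_MASS)


def mz_spacing(charge, n_oxygens=2):
    """Spacing of the light/heavy peak pair on the m/z axis for a given charge."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return o18_label_shift(n_oxygens) / charge


def heavy_to_light_ratio(area_light, area_heavy):
    """Relative amount of the phosphopeptide in the 18O pool versus the 16O pool."""
    if area_light <= 0:
        raise ValueError("light peak area must be positive")
    return area_heavy / area_light


if __name__ == "__main__":
    print(f"label shift: {o18_label_shift():.4f} Da")
    print(f"spacing for a 2+ ion: {mz_spacing(2):.4f} m/z")
    # Hypothetical monoisotopic peak areas, for illustration only.
    print(f"heavy/light ratio: {heavy_to_light_ratio(1.0e6, 2.5e6):.2f}")
```

Because equal amounts of the two pools are mixed, a ratio different from 1 for a given phosphopeptide points to a difference in the extent of phosphorylation at that site.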

X-Refs: National Human Genome Research Institute (comparison of genomes of other species with our species to determine protein function)

See also Protein-Protein Interaction assays under biotechnology.

The Daunting Task of predicting Protein Function based on Structure

Proteomics is the study of how proteins interact with each other and other molecules in metabolic pathways. The three billion nucleic acid base pairs identified in the human genome are believed to make up about 40k genes, which after full cotranslational and posttranslational modifications are believed to total in the millions of proteins. Unlike DNA and RNA, protein activity is based on molecular structure. While there have been countless attempts to predict protein activity with supercomputers, these efforts have produced few useful results due to the complex structure of proteins. (Haaft, “Separations in Proteomics: Use of Camelid Antibody Fragments in the Depletion and Enrichment of Human Plasma Proteins for Proteomics Applications”, http://www.captureselect.com/downloads/sperations-in-proteomics.pdf, pp. 29-40, 2005).

Although the sequencing of complete genomes provides a list that includes the proteins responsible for cellular regulation, this does not reveal what these proteins do, nor how they are assembled into the molecular machines and functional networks that control cellular behavior. With the genomes of so many organisms completely sequenced, science and its new biomedical discipline of functional genomics are faced with understanding the function of these newly discovered genes. Attwood states that “it is presumptuous to make functional assignments merely on the basis of some degree of similarity between sequences” because “very few structures are known compared with the number of sequences, and structure prediction methods are unreliable (and knowing structure does not inherently tell us function)” (Science, 290, 2000). Bowie also states that although it should be possible to predict structure from sequence and subsequently to infer detailed aspects of function from the structure, both problems are extremely complex and it seems unlikely that either will be solved in an exact manner in the near future (“Deciphering the Message in protein sequences: tolerance to amino acid substitutions” Science, 247, 1990, pp. 1306-1310).

The regulation of many different cellular processes requires the use of protein interaction domains to direct the association of polypeptides with one another and with phospholipids, small molecules, or nucleic acids. Interaction domains can target proteins to a specific subcellular location, provide a means for recognition of protein posttranslational modifications or chemical second messengers, nucleate the formation of multiprotein signaling complexes, and control the conformation, activity, and substrate specificity of enzymes.

As an example, enzymes like kinases often generate modified amino acids on their substrates that are then recognized by interaction modules in signal transduction. For example, phosphotyrosine (pTyr) sites formed by the actions of tyrosine kinases bind effectors with pTyr recognition domains (i.e., , whereas phosphoinositides produced by phosphoinositide kinases recruit pleckstrin homology domains. 

Mutant cellular proteins that cause inherited disorders can exert their effects through the loss of protein-protein interactions, or conversely, by the creation of aberrant protein complexes. This suggests that rewiring of protein-protein interactions could be used experimentally to alter cellular function. Understanding the network of cellular protein interactions should expand the scope for creating novel biological responses through engineered proteins or small molecules. 

It is becoming increasingly clear that an important level of organization is provided by multiprotein complexes because, instead of proteins and substrates colliding in a diffusion-dependent manner, proteins generally interact with each other and form larger assemblages in a time- and space-dependent manner.

The importance of studying complexes is that it allows proteins with unknown roles to be placed into a functional context provided by their associated partners, some of which may have a known function.

Analysis of protein complexes has some special challenges in that more than 10k different genes might be expressed at the same time in a single cell or tissue, and diversity on the protein level is much higher. In addition to diversity at the level of primary protein sequence and the presence of modifications, complexity is further increased when considering the dynamic range of expression levels of individual proteins. While some proteins are present at thousands of copies per cell, others are represented by just a few molecules.

Prlic (“Impact of genetic variation on three dimensional structure and function of proteins” PLOS One, March 15, 2017) discloses a wide range of structural and functional changes caused by single amino acid differences, including changes in enzyme activity, aggregation propensity, structural stability, binding and dissociation. For example, delta-aminolevulinic acid dehydratase (ALAD) catalyzes an early step in tetrapyrrole biosynthesis. The Phe-Leu mutation (F12L) causes ALAD porphyria, a rare autosomal recessive disease. Despite being located far from active site residues 199 and 252, this variant changes the preferred protein assembly from octamer to hexamer. In addition, the optimal pH for enzyme activity is shifted from pH 7 to pH 9 in the mutant. The mutant enzyme is barely active under physiological conditions.

Antibodies:

Kabat et al. compared the sequences of the hypervariable regions then known and found that, at 13 sites in the light chains and at seven positions in the heavy chains, the residues are conserved. They argued that the residues at these sites are involved in the structure, rather than the specificity, of the hypervariable regions. They suggested that these residues have a fixed position in antibodies and that this could be used in the model building of combining sites to limit the conformations and positions of the sites whose residues varied. (Lesk, “Canonical structures for the hypervariable regions of immunoglobulins” J. Mol. Biol. (1987) 196, 901-917)

Antibody complementarity determining region (CDR) H3 loops are critical for adaptive immunological functions. Although the other five CDR loops adopt predictable canonical structures, H3 conformations have proven unclassifiable, other than an unusual C-terminal “kink” present in most antibodies. High structural conservation among antibodies makes it possible to model the framework and the five CDR loops that adopt canonical conformations, but the exceptionally diverse CDR H3 loop evades current methods, thus making structure prediction of the antigen-binding region difficult. (Weitzner, “The origin of CDR H3 structural diversity” Structure 23, 302-311, 2015).

Effects of Single Nucleotide Variations (SNVs):

on Activity:

Prlic (“Impact of genetic variation on three dimensional structure and function of proteins” PLOS One, March 15, 2017) discloses that 52 of 374 SNV related changes in their dataset either increase or decrease protein activity. In some cases, SNVs lead to complete loss of function. For example, human glycyl-tRNA synthetase loses detectable enzymatic activity due to a G526R mutation, which is causative of Charcot-Marie-Tooth disease. The Ile-Val mutation in von Willebrand factor (vWF) causes the blood clotting disorder von Willebrand disease. The mutation has a “gain of function” effect, producing a constitutively active form of vWF that binds platelets in the absence of shear forces.

on Aggregation:

Prlic (“Impact of genetic variation on three dimensional structure and function of proteins” PLOS One, March 15, 2017) discloses that 28 of 374 SNVs in their dataset gave rise to protein aggregation, which is a hallmark of some neurodegenerative diseases such as Alzheimer's.

on Stability:

Prlic (“Impact of genetic variation on three dimensional structure and function of proteins” PLOS One, March 15, 2017) discloses that 58 of 374 SNV related changes in their dataset led to reduced protein stability.

on Binding:

Prlic (“Impact of genetic variation on three dimensional structure and function of proteins” PLOS One, March 15, 2017) discloses that 44 of 374 SNV related changes in their dataset affect ligand or macromolecule binding properties of the protein. An SNV can change the affinity of binding to partners such as activators, repressors, or substrates. Such changes can also affect the kinetics of interactions with partners or alter binding specificity.
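To put the four counts quoted from Prlic on a common scale, the short sketch below converts each into a fraction of the 374-change dataset. The counts are taken from the text above; the category labels and function name are illustrative. The sketch makes no assumption about whether the categories overlap, so the fractions are not summed.

```python
# Counts quoted in the text (Prlic, PLOS One, 2017); labels are illustrative.
TOTAL_SNV_CHANGES = 374
EFFECT_COUNTS = {
    "activity": 52,
    "aggregation": 28,
    "stability": 58,
    "binding": 44,
}


def fraction_of_dataset(count, total=TOTAL_SNV_CHANGES):
    """Return count as a fraction of the whole dataset."""
    if total <= 0:
        raise ValueError("total must be positive")
    return count / total


if __name__ == "__main__":
    for effect, count in EFFECT_COUNTS.items():
        share = fraction_of_dataset(count)
        print(f"{effect:<12}{count:>4} of {TOTAL_SNV_CHANGES}  ({share:.1%})")
```

Each category covers roughly 7% to 16% of the dataset, so no single kind of effect dominates.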

Conservative Amino Acid Substitution

There is tremendous variability in the importance of individual amino acids in protein sequences. On the one hand, nonconservative residue substitutions can be tolerated with no loss of activity at many residue positions, especially those exposed on the protein surface. On the other hand, destabilizing mutations can occur at a large number of different sites in a protein, and for many proteins such mutations account for more than half of the randomly isolated missense mutations that confer a defective phenotype. At sites that are key determinants of stability or activity, even residue substitutions that are generally considered to be conservative (e.g., Glu-Asp, Asn-Asp, Ile-Leu, Lys-Arg and Ala-Gly) can have severe phenotypic effects. Unfortunately, this means that there is no simple way to infer the likely effect of an amino acid substitution on the basis of sequence information alone. A nonconservative Gly-Arg substitution could be phenotypically silent at one position while a conservative Asn-Asp change could lead to complete loss of activity at another position. (Pakula, “Genetic analysis of protein stability and function” Annu. Rev. Genet. 1989, 23: 289-310).
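The distinction between labeling a substitution and predicting its effect can be made concrete. The sketch below classifies a substitution as conservative only if it is one of the five example pairs listed above; that pair list is deliberately limited to the text's examples and is not a complete substitution matrix, and the function name is illustrative.

```python
# The five example pairs from the text, in one-letter code:
# Glu-Asp, Asn-Asp, Ile-Leu, Lys-Arg, Ala-Gly.
CONSERVATIVE_PAIRS = {
    frozenset(pair)
    for pair in [("E", "D"), ("N", "D"), ("I", "L"), ("K", "R"), ("A", "G")]
}


def is_conservative(residue_a, residue_b):
    """True if the pair is in the example list; order does not matter."""
    pair = frozenset((residue_a.upper(), residue_b.upper()))
    return pair in CONSERVATIVE_PAIRS


if __name__ == "__main__":
    for a, b in [("N", "D"), ("G", "R")]:
        label = "conservative" if is_conservative(a, b) else "nonconservative"
        # The label describes chemical similarity only; it carries no
        # information about the phenotype at any given position.
        print(f"{a}->{b}: {label}")
```

The function takes no position argument, so it cannot tell a silent Gly-Arg change from a lethal Asn-Asp change, which is the limitation Pakula describes.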

Peptides can be designed de novo, but most peptides of biological interest are derived from N-terminal, C-terminal, or internal sequences of native proteins. Unfortunately, there are valid reasons why certain native sequences sometimes need to be altered. Even for relatively short sequences, there are essential and non-essential amino acid residues, although the relative importance of the individual amino acid residues is not always easy to determine. The “not-so-straightforward” rule of thumb is to make the changes in the non-essential residues. These changes may include amino acid substitution (e.g., for solubility, stability, etc.), chemical modification (e.g., for stability, structure-function studies), attachment of ligands and conjugation. (SIGMA “Designing custom peptides” 2004, pp. 1-2).

More than 540,000 protein sequences have been deposited in the nonredundant database maintained by the National Center for Biotechnology Information (NCBI). By contrast, the number of unique protein structures is still less than 2000. Structural proteomics aims to achieve a structural characterization of protein complexes. Structural genomics is sometimes used to describe the goal of delineating the 3D structures for individual proteins.

Despite many years of development of molecular simulation methods, attempts to refine models produced by both de novo and comparative modeling have met with relatively little success. (Baker, “Protein Structure Prediction and Structural Genomics” Science, 294(5), 2001).

Difficulty of Predicting function/structure

Difficulty in predictions based on sequence similarity

The sequencing of entire genomes is a major achievement, but the meaning of the mass of accumulated data is only just beginning to be unraveled. At first sight, the task appears straightforward: locate the genes and translate the coding regions to establish their protein products; perform similarity searches to establish relationships with previously characterized sequences and assign function by evolutionary inference; and rationalize the function in structural terms using known or model-derived structures. The reality, of course, is not so simple. Attempts to decipher the clues latent in genomic data are hampered because current methods to predict genes in uncharacterized DNA are unreliable (and it is not always clear what is meant by “gene”). It is presumptuous to make functional assignments merely on the basis of some degree of similarity between sequences; very few structures are known compared to the number of sequences, and structure prediction methods are unreliable (knowing structure does not inherently tell function). (Attwood, “The babel of bioinformatics”, Science 290(5491), 2000, pp. 471-473)

Difficulty in predictions based on sequence:

Decades of research have failed to produce an algorithm for predicting the structure of a given protein from its amino acid sequence alone. (Thomas Ngo, “Computational complexity, protein structure prediction, and the Levinthal Paradox”, in The Protein Folding and Tertiary Structure Prediction, K. Mertz and Le Grand, 1994).

Single Amino Acid changes can have significant consequences for function:

Burgess (J Cell Biology, 111, November 1990, 2129-2138) showed that replacement of a single amino acid, lysine, with glutamic acid reduces the apparent affinity of HBGF-1 for immobilized heparin.

Lazar studied the relationship between the primary structure of TGF-alpha and some of its functional properties, such as competition with EGF for binding to the EGF receptor. Lazar introduced single amino acid mutations into the sequence for the fully processed 50 amino acid human TGF-alpha and found that mutations of two amino acids that are conserved in the family of EGF-like peptides and are located in the carboxy-terminal part of TGF-alpha resulted in different biological effects. When aspartic acid 47 was mutated to alanine or asparagine, biological activity was retained; in contrast, substitutions of this residue with serine or glutamic acid generated mutants with reduced binding and colony-forming capacities. When leucine 48 was mutated to alanine, a complete loss of binding and colony-forming abilities resulted. The data suggested that these two adjacent conserved amino acids in positions 47 and 48 play different roles in defining the structure and/or biological activity of TGF-alpha. (Lazar, Molecular and Cellular Biology, Mar 1988, p. 1247-1252).

Engineering Binding Proteins based on Non-Antibody Scaffolds

The term “scaffold” in the context of protein engineering describes a polypeptide framework whose fold has a high tolerance for modifications such as multiple insertions, deletions or substitutions. This intrinsic conformational stability enables directed randomization and drastic changes within a defined region of the protein. Thus, the scaffold acquires certain novel properties, whereas its overall structural integrity and original physicochemical behavior remain conserved. This de novo adopted property mostly, but not exclusively, includes the binding specificity for a pre-defined target molecule. (Hey, “Artificial, non-antibody binding proteins for pharmaceutical and industrial applications” TRENDS Biotechnol. 23(10): 514-522 (2005)).

There are now multiple 3D structures of binder/target complexes available from four scaffold systems: fibronectin type III (FN3) domains of human fibronectin; affibodies (derived from the immunoglobulin-binding protein A); DARPins (based on ankyrin repeat modules); and anticalins (derived from the lipocalins bilin-binding protein and human lipocalin). The increasing number of structures available from these systems, nearly 30 now in the Protein Data Bank (PDB), may allow meaningful information to be extracted that goes beyond isolated “anecdotes”. Relationships between structure and affinity have been extensively examined in the context of natural interfaces. However, simple structural parameters such as interface size, packing, and hydrophobicity have been shown to be poor predictors of affinity. Although all these parameters surely influence affinity, their relative contributions appear to be highly context dependent, minimizing their individual predictive power. (Gilbreth “Structural insights for engineering binding proteins based on non-antibody scaffolds” Current Opinion in Structural Biology, 2012, 22: 413-420).

Antibody-related scaffolds: Included among these alternative scaffold proteins are derivatives of antibodies that are either naturally smaller in size or have been engineered to be so. Companies such as Ablynx are using the variable domains of human light or heavy chains or single-domain antibodies from camelids.

Affibodies: are based on the Z-domain, a design variant of the IgG binding Protein A from Staphylococcus aureus, a three-helix bundle consisting of 58 amino acids. Affibody libraries are generated by randomization of up to 13 solvent-exposed amino acids in the alpha-helical structures without harming the overall structure of the parental scaffold. This strategy has been applied by Affibody.
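The scale of such a library follows from simple counting. The sketch below assumes that each randomized position may take any of the 20 standard amino acids; actual libraries may restrict the alphabet, and the function name is illustrative.

```python
def library_size(n_positions, alphabet_size=20):
    """Number of distinct sequences when n_positions are fully randomized."""
    if n_positions < 0 or alphabet_size < 1:
        raise ValueError("n_positions must be >= 0 and alphabet_size >= 1")
    return alphabet_size ** n_positions


if __name__ == "__main__":
    # Up to 13 randomized positions are quoted in the text above.
    for n in (1, 5, 13):
        print(f"{n:>2} positions: {library_size(n):.3e} theoretical variants")
```

With all 13 positions randomized the theoretical diversity is 20^13, about 8.2 x 10^16 sequences, so any physical library can contain only a sample of the possible variants.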

Three-helix bundle: occurs ubiquitously in nature as a robust scaffold for molecular recognition. First observed in the helical IgG binding domains of Staphylococcus aureus, this family has grown to include DNA binding proteins, enzymes and structural proteins. (Wash, “solution structure and dynamics of a de novo designed three-helix bundle protein” Proc. Natl. Acad. Sci. 95, 5480-5491, 199).

–Alpha3D three-helix bundle polypeptide: 

Cangelosi (Angew Chem Int Ed Engl. 2014, 53(3): 7900-7903) discloses the design of a metalloenzyme starting from alpha3D, a de novo single-stranded antiparallel three-helix bundle. This 73 amino acid protein folds with native protein-like stability, is tolerant of mutations within the hydrophobic core, and has been structurally characterized by NMR spectroscopy. The new metalloenzyme, alpha3DH3, contains a His3 site that, upon binding Zn(II), catalyzes the hydration of CO2. Alpha3DH3 differs from alpha3D in that three leucine residues were replaced with histidine residues (L18H, L28H, L67H), a histidine residue was replaced with valine (H72V) to ensure no competition for Zn(II) binding, and four extra residues were added to the end of the chain (GSGA), which improved expression yields.
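The substitution notation used above (wild-type residue, 1-based position, new residue) can be applied mechanically to a sequence. The sketch below is illustrative only: the parent sequence is a made-up 73-residue placeholder with leucine at positions 18, 28 and 67 and histidine at position 72, not the real alpha3D sequence, and the function name is hypothetical.

```python
import re


def apply_point_mutations(sequence, mutations):
    """Apply substitutions written like 'L18H' (1-based) to a protein sequence.

    Raises ValueError if a mutation is malformed, out of range, or if the
    stated wild-type residue does not match the sequence.
    """
    residues = list(sequence)
    for mutation in mutations:
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
        if match is None:
            raise ValueError(f"cannot parse mutation {mutation!r}")
        wild_type, position, replacement = (
            match.group(1), int(match.group(2)), match.group(3))
        if not 1 <= position <= len(residues):
            raise ValueError(f"position {position} is outside the sequence")
        if residues[position - 1] != wild_type:
            raise ValueError(
                f"expected {wild_type} at {position}, "
                f"found {residues[position - 1]}")
        residues[position - 1] = replacement
    return "".join(residues)


# Placeholder parent: 73 alanines with L at 18, 28, 67 and H at 72.
# This is NOT the alpha3D sequence; it only mirrors the positions in the text.
_placeholder = ["A"] * 73
for _pos in (18, 28, 67):
    _placeholder[_pos - 1] = "L"
_placeholder[72 - 1] = "H"
PARENT = "".join(_placeholder)

# Substitutions and C-terminal extension as listed in the text.
VARIANT = apply_point_mutations(
    PARENT, ["L18H", "L28H", "L67H", "H72V"]) + "GSGA"

if __name__ == "__main__":
    print(len(PARENT), len(VARIANT))
```

Checking the stated wild-type residue before substituting catches off-by-one numbering errors, which are easy to make when a construct carries extra N-terminal residues.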

Lafleur (US 15/564,325, published as US 10,662,248; US 15/564,319, published as US 10,647,775; US 16/817,755, published as US 2020/0223934; see also US 16/824,809, published as US 2021/0002381) discloses de novo binding domain containing polypeptides (DBDpp) that bind to targets of interest. The sequences of the DBDpp are preferably not antigenic with respect to a subject and thus do not contain a human HLA-DR binding motif or cleavage sites for proteasomes and immune-proteasomes. In particular embodiments, the DBDpp sequence does not contain an MHC (class I or class II) binding site sequence as predicted by an algorithm such as ProPred. In silico analysis of the amino acid sequence of alpha3D revealed a 9 amino acid sequence that shared characteristics with that of high-affinity and promiscuous T cell epitopes. Thus, with the aim of reducing the potential for immunogenicity, a Q19E substitution was introduced into the alpha3D sequence. This conserved and surface-exposed substitution appeared unlikely to significantly disrupt the hydrophobic core.

Park (Protein Engineering, Design & Selection, 19(5), 211-217, 2006) discloses creation of a library of alpha3D mutants to explore the possible correlation of protein stability and fold with expression level. Five efficiently expressed mutants were then purified and studied further. Despite their differences in stability, most mutants expressed at levels comparable with that of wild-type alpha3D. Two other related sequences (alpha3A and alpha3B) that formed collapsed, stable molten globules but lacked a uniquely folded structure were similarly expressed at high levels by yeast display.

Experimental Methods used in Structural Proteomics:

X-ray crystallography: is the most prolific technique for the structural analysis of proteins and protein complexes. Crystallography requires that milligram quantities of a pure protein can be prepared, and that the protein can be induced to form 3D crystals. When suitable crystals and high-resolution crystallographic data are obtained, there is little need for other methods of structure characterization.

NMR spectroscopy: also contributes a significant number of structures to the database of biomolecular structures. Although NMR analysis is generally not as applicable as X-ray crystallography to protein structures with more than 300 amino acid residues, it is more suitable than X-ray crystallography for studying their dynamics and interactions in solution.

X-ray crystallography or two-dimensional electron microscopy (2D EM): produces images that represent only 2D projections of the specimen. Nevertheless, the full 3D structure of the object can be reconstructed again if one is able to start with many such projections, each showing the object from a different angle. Because it is possible to use non-crystalline particles, the purity does not need to be at that standard required for crystallization.

Electron tomography: is based upon multiple tilted views of the same object. Tomograms of cells are essentially 3D images of the cell’s entire proteome. They reveal information about the spatial relationships of macromolecules in the cytoplasm.

Immuno-electron microscopy: uses a construct of the protein of interest that binds to a gold-labelled antibody. The relative position of the gold particles is then identified by EM.

Chemical crosslinking with mass spectrometry: relies on crosslinking reagents that covalently link proteins interacting with each other. Proteolytic digestion and subsequent mass spectrometric analysis of the crosslinked species reveal their composition.

Affinity purification with mass spectrometry: combines purification of protein complexes with identification of their individual components by mass spectrometry.

site-directed mutagenesis: can reveal which subunits in a complex interact with each other.

Alpha-Helices:

The alpha-helix is a ubiquitous secondary structural element that is almost exclusively observed in proteins when stabilized by tertiary or quaternary interactions. Beginning with the unexpected observation of alpha-helix formation in the isolated C-peptide of ribonuclease A, there is growing evidence that a significant percentage of all proteins contain isolated, stable single alpha-helical (SAH) domains. These SAH domains provide unique structural features essential for normal protein function. The SAH domain is found in numerous proteins, where it appears to operate as a semi-rigid structural element that tethers globular domains. (Swanson, “Harnessing the unique structural properties of isolated alpha-helices” J Biological Chemistry, 289(37), 2014).

EAAAK motif: has been incorporated into various synthetically engineered polypeptides. It has, for example, been used as a spacing linker between a pair of green fluorescent protein variants. (Swanson, “Harnessing the unique structural properties of isolated alpha-helices” J Biological Chemistry, 289(37), 2014).

ER/K motif: ER/K alpha-helices are the most extensively characterized SAH domains. They are found in myosin X and VI. The ER/K motif has readily found applications in protein engineering. Three separate studies have used chimeric approaches to investigate the effect of ER/K alpha-helix mechanical properties on myosin function. (Swanson, “Harnessing the unique structural properties of isolated alpha-helices” J Biological Chemistry, 289(37), 2014).

Non Experimental Methods (In silico):

Using the 3D structure of antibody-antigen complexes, it is possible to enhance antibody-antigen binding affinity by in silico mutation of antibody residues. In the best situation, when the antibody-antigen complex structure is available, it is relatively straightforward to perform affinity maturation in silico. First, the protein backbone is treated as rigid, and the conformations of the side chains are determined by a discrete side-chain rotamer search. Second, the lowest-energy structures are re-evaluated using more accurate, but computationally more expensive, models. (Zhao, “In silico methods in antibody design” Antibodies, 2018)
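The two-stage idea above (a cheap discrete rotamer search, then re-scoring of the best candidates with a costlier model) can be sketched as follows. This is only a toy illustration: the residue labels, rotamer angles, and both energy functions are hypothetical placeholders, not taken from Zhao or from any real force field.

```python
from itertools import product

def two_stage_search(rotamers, fast_energy, accurate_energy, keep=3):
    """Pick a side-chain rotamer combination on a fixed (rigid) backbone.

    rotamers: dict mapping a residue label to its list of discrete rotamer states.
    fast_energy / accurate_energy: callables scoring one combination (lower is better).
    """
    # Stage 1: enumerate every rotamer combination and rank with the cheap model.
    combos = product(*rotamers.values())
    shortlist = sorted(combos, key=fast_energy)[:keep]
    # Stage 2: re-evaluate only the shortlist with the more expensive model.
    return min(shortlist, key=accurate_energy)

# Hypothetical rotamer library: chi angles (degrees) for two antibody residues.
toy_rotamers = {"H:Y33": [-60, 60, 180], "H:S52": [-60, 60, 180]}

def toy_fast_energy(combo):
    # Placeholder cheap score: deviation from an arbitrary preferred angle.
    return sum(abs(angle - 60) for angle in combo)

def toy_accurate_energy(combo):
    # Placeholder costly score: cheap score plus a clash penalty the cheap model misses.
    clash = 500 if combo[0] == combo[1] else 0
    return toy_fast_energy(combo) + clash

best = two_stage_search(toy_rotamers, toy_fast_energy, toy_accurate_energy)
print(best)
```

Exhaustive enumeration is only feasible for a handful of residues; it is used here purely to keep the sketch short.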

de novo or ab initio methods: predict the structure from sequence alone, without relying on similarity at the fold level between the modeled sequence and any known structures. These methods assume that the native structure corresponds to the global free-energy minimum and attempt to find this minimum by an exploration of many conceivable protein conformations.

homology modeling: relies on detectable similarity spanning most of the modeled sequence and at least one known structure. It relies on finding known structures related to the sequence to be modeled, aligning the sequence with the related structures, building a model, and assessing the model.
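As a rough illustration of the "detectable similarity" that homology modeling depends on, the sketch below computes percent identity over the aligned (non-gap) columns of a pairwise alignment. The function name and the toy alignment are hypothetical; a real workflow would obtain the alignment from a dedicated alignment tool and would usually weigh coverage and substitution scores as well.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns in which both sequences carry a residue.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)

# Toy alignment (hypothetical): query versus a template of known structure.
identity = percent_identity("ACDE-F", "ACDQGF")
print(identity)
```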

computational docking: is based on maximizing the shape and chemical complementarities between a given pair of interacting proteins.

bioinformatic analysis of genomic sequences: Multiple sequence alignments and protein structure

–ConPLex: is an in silico screening tool that makes predictions of binding based on the distance between learned representations, enabling predictions at the scale of massive compound libraries and the human proteome. (Singh, Biophysics and Computational Biology, “Contrastive learning in protein language space predicts interactions between drugs and protein targets”, 120(24), 2023).
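The general idea of scoring binding by the distance between learned representations can be sketched as below. This does not reproduce ConPLex's model, training, or API: the embedding vectors are made-up toy values, and cosine similarity is an assumed stand-in for whatever distance the tool actually uses.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_compounds(target_vec, compound_vecs):
    """Sort compounds from most to least similar to the target embedding."""
    scored = [(name, cosine_similarity(target_vec, vec))
              for name, vec in compound_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical embeddings in a shared space (real ones would come from trained encoders).
target = [1.0, 0.0, 1.0]
compounds = {
    "compound_A": [0.9, 0.1, 1.1],
    "compound_B": [-1.0, 0.5, 0.0],
    "compound_C": [0.0, 1.0, 0.0],
}
ranking = rank_compounds(target, compounds)
print(ranking)
```

Because each compound and each protein is embedded once and then compared with a cheap vector operation, this style of scoring scales to very large libraries.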

Useful Links:  Human Proteome Map Portal contains translation of protein products derived from over 17k human genes.   Proteomic Data Base   Worldwide Protein Data Bank (contains more than 126,000 experimentally determined atomic level 3D structures of biological macromolecules). 

Proteomics has been developed for the large scale study of protein patterns in organisms. Typical goals for proteomic analysis are identification and quantification of proteins present in a specific tissue under specific circumstances. Proteomic technologies, in combination with bioinformatics, are powerful tools for protein identification and characterisation. Commonly, two dimensional (2D) electrophoresis is used for protein separation, and mass spectrometry followed by databank searching is used for protein identification. Up to 10,000 proteins can be studied simultaneously. Strategies for characterizing changes in complex mixtures have been developed using both two-dimensional gels and mass spectrometry.

Differential Proteomics

The single largest proteomics market is in the field of differential proteomics, where samples of serum from diseased and nondiseased populations are compared and contrasted to search for differences in protein levels. The proteins that differ become diagnostic markers of disease, targets for clinical therapeutic intervention, or therapeutics themselves. (Haaft, “Separations in Proteomics: Use of Camelid Antibody Fragments in the Depletion and Enrichment of Human Plasma Proteins for Proteomics Applications”, http://www.captureselect.com/downloads/sperations-in-proteomics.pdf, pp. 29-40, 2005).

Techniques Used for Proteomics

Affinity Chromatography:

–VHH fragments as ligands: Affinity chromatography that uses naturally occurring camelid single chain antibody (VHH) fragments as ligands can solve problems in proteomics by providing high affinity, high specificity binders. Unlike conventional antibody reagents, VHH fragments can be easily manufactured and are very stable. (Haaft, “Separations in Proteomics: Use of Camelid Antibody Fragments in the Depletion and Enrichment of Human Plasma Proteins for Proteomics Applications”, http://www.captureselect.com/downloads/sperations-in-proteomics.pdf, pp. 29-40, 2005).

For two-dimensional gels, samples may be run on separate gels, stained, and protein abundances compared with the use of imaging software. However, in practice, protein pattern comparisons can be difficult to achieve due to poor reproducibility of protein separations on two-dimensional gels.

Mass spectrometry (MS) is not a quantitative technique per se as ion yields are highly dependent on the chemical and physical nature of the sample. However, isotopic labeling combined with MS has been extensively used for many years to produce accurate quantitation of small molecules and, more recently this has been extended to peptides and proteins.

 Isotope-coded affinity tag (ICAT): is a gel-free method for quantitative proteomics that relies on chemical labeling reagents referred to as ICATs. These chemical probes consist of 3 general elements: a reactive group capable of labeling a defined amino acid side chain (e.g., iodoacetamide to modify cysteine residues), an isotopically coded linker, and a tag (e.g., biotin) for the affinity isolation of labeled proteins/peptides. For the quantitative comparison of two proteomes, one sample is labeled with the isotopically light (d0) probe and the other with the isotopically heavy (d8) version. To minimize error, both samples are then combined, digested with a protease (e.g., trypsin), and subjected to avidin affinity chromatography to isolate peptides labeled with isotope-coded tagging reagents. These peptides are then analyzed by mass spectrometry. The ratios of signal intensities of differentially mass-tagged peptide pairs are quantified to determine the relative levels of proteins in the two samples.

The development of isotope coded affinity tag (ICAT) reagents allows for quantitation through isotopic labeling. These reagents consist of 3 functional parts:

  • an iodoacetamide group that reacts with the free sulfhydryl group of a reduced cysteine side chain

  • a biotin moiety to aid isolation of modified peptides by avidin affinity chromatography

  • a linker group that contains either heavy or light isotopic variants

In a typical experiment, one sample is labeled with light reagent and the other with heavy reagent. After attachment of the ICAT labels, samples are combined, and the cysteine-containing components are purified by means of the biotin tag. After MS data acquisition, the resulting mass spectra are searched for pairs of isotope envelopes differing in mass by 8 Da, and relative quantities of the proteins are determined by comparison of the corresponding isotope profiles. Collision induced fragmentation (CID) of peptides of interest by tandem mass spectrometry gives rise to sequence specific fragmentation patterns, from which the identity of the parent protein can be derived.
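The pair search and relative quantitation described above can be sketched in a few lines. This is a simplified illustration, not a production peak-matching algorithm: it assumes a deconvoluted list of singly labeled peptide masses with one intensity each, a per-label spacing of about 8.05 Da (half of the 16.100 Da quoted for two labels), and an arbitrary tolerance. The peak values are invented.

```python
ICAT_DELTA_DA = 8.05  # assumed heavy-minus-light spacing per labeled cysteine

def find_icat_pairs(peaks, n_cys=1, tol=0.02):
    """Return (light_mass, heavy_mass, heavy/light intensity ratio) for matched pairs.

    peaks: list of (mass_da, intensity) tuples.
    n_cys: number of labeled cysteines assumed per peptide.
    """
    ordered = sorted(peaks)
    pairs = []
    for i, (light_mass, light_int) in enumerate(ordered):
        for heavy_mass, heavy_int in ordered[i + 1:]:
            spacing = heavy_mass - light_mass
            # Accept the pair if the spacing matches the expected isotope shift.
            if abs(spacing - n_cys * ICAT_DELTA_DA) <= tol:
                pairs.append((light_mass, heavy_mass, heavy_int / light_int))
    return pairs

# Toy peak list (hypothetical masses and intensities).
toy_peaks = [(1000.50, 200.0), (1008.55, 400.0), (1200.60, 50.0), (1500.75, 90.0)]
icat_pairs = find_icat_pairs(toy_peaks)
print(icat_pairs)
```

In this toy input only the 1000.50/1008.55 peaks pair up, and the heavy-labeled sample shows twice the signal of the light-labeled one.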

A major innovation of the ICAT approach was that the affinity tag (biotin) was used to purify cysteine-containing peptides, reducing the complexity of a peptide mixture by about a factor of 10. As a result, several proteins that usually can’t be observed in an approach like 2DE could be identified and quantified.

Several problems with first generation ICAT reagents included the fact that 1) the biotin tag was bulky, and fragmentation of modified peptides produced many fragments in the CID spectrum related to the tag rather than the peptide, 2) the substantial mass addition resulting from the tag could also shift the masses of larger peptides outside the optimum range for detection by standard MS instruments, 3) the choice of 8 Da mass difference for the heavy ICAT reagent produced potential ambiguity between peptides containing 2 ICAT labeled cysteine residues (delta M +16.100 Da) and common oxidation of methionine residues (delta M +15.995 Da), and 4) the d0 and d8 modified peptides did not coelute by reverse-phase chromatography, making quantitation less accurate.
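To see why point 3 is a practical problem, the short calculation below converts the gap between the two mass shifts quoted above into the mass accuracy needed to tell them apart. The function name and the example peptide mass of 2000 Da are illustrative assumptions.

```python
TWO_ICAT_LABELS_DA = 16.100   # shift for two heavy-labeled cysteines (from the text)
MET_OXIDATION_DA = 15.995     # shift for methionine oxidation (from the text)

def ppm_needed(peptide_mass_da):
    """Mass accuracy (ppm) at which the two shifts differ by exactly one error width."""
    gap_da = TWO_ICAT_LABELS_DA - MET_OXIDATION_DA  # about 0.105 Da
    return gap_da / peptide_mass_da * 1e6

ppm_at_2000 = ppm_needed(2000.0)
print(round(ppm_at_2000, 1))  # about 52.5 ppm for a 2000 Da peptide
```

An instrument whose mass error is comparable to or larger than this figure cannot reliably separate the two interpretations, and the requirement tightens as peptide mass grows.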

These problems have been solved through the use of second generation ICAT reagents such as those which contain a cleavable linker group connecting the biotin moiety with the sulfhydryl reactive isotope tag. Also, rather than using deuterium as the heavy isotope, reagents employ nine 13C atoms as the isotopic label for the heavy reagent. Therefore, the heavy and light modified peptides coelute by reverse-phase chromatography, making quantitation simpler to achieve and the results more reliable.

ICAT reagents: Applied Biosystems

Stable isotope labeling by amino acids in cell culture (SILAC)

In this quantification procedure, labeled essential amino acids (usually deuterated leucine, Leu-d3) are added to amino acid deficient cell culture media and are thus incorporated into all proteins as they are synthesized. No chemical labeling or affinity purification steps are performed.

In a typical experiment, an experimental cell population is treated in a specific way, such as cytokine stimulation. Protein populations from both this experimental sample and the control are then harvested, and because the label is encoded directly into the amino acid sequence of every protein, the extracts can be mixed directly. Purified proteins or peptides will preserve the exact ratio of the labeled to unlabeled protein, as no more synthesis is taking place. Quantitation takes place at the level of the peptide mass spectrum or peptide fragment mass spectrum, exactly the same as in any other stable isotope method (such as ICAT).

Advantages of SILAC over ICAT include the fact that almost 70% of unique tryptic peptides in the human proteome contain at least one leucine, while only ~25% contain cysteine, the common target for chemical tagging. A disadvantage of SILAC is that it is limited to cells that can be grown in culture.
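The coverage argument above can be checked on any sequence with a small in silico digest. The sketch below uses the common simplified trypsin rule (cleave after K or R unless the next residue is P) and ignores missed cleavages and peptide length filters; the sequence is a made-up toy, so its fractions are not the proteome-wide figures quoted above.

```python
import re

def tryptic_digest(sequence):
    """Split a protein sequence after K or R, except when followed by P."""
    # Zero-width split points: preceded by K/R and not followed by P (Python 3.7+).
    return [pep for pep in re.split(r"(?<=[KR])(?!P)", sequence) if pep]

def fraction_containing(peptides, residue):
    """Fraction of peptides that contain at least one copy of the residue."""
    if not peptides:
        return 0.0
    return sum(residue in pep for pep in peptides) / len(peptides)

# Hypothetical toy sequence, for illustration only.
toy_peptides = tryptic_digest("MKLCRPAALKGGSKLLR")
leu_fraction = fraction_containing(toy_peptides, "L")
cys_fraction = fraction_containing(toy_peptides, "C")
print(toy_peptides, leu_fraction, cys_fraction)
```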

Proteolysis in the presence of 16O and 18O 

In this procedure, Mirgorodskaya et al. carried out isotopic labeling by performing proteolysis in the presence of 18O water or regular (16O) water. The sample digested in 18O water incorporates 18O, generating an isotopic label that is used for relative quantitation. However, the quantitation can be complicated by the possible loss or incomplete incorporation of the label.

Metabolomics is the study of the entire complement of small molecular weight metabolites inside an organism of interest. Some metabolites which are routinely assayed in the blood include urea for liver and kidney function, cholesterol for risk of coronary artery disease, and glucose. This can be a difficult task, however, because metabolites differ widely in their chemical characteristics (solubility, charge, MW, etc.), and concentrations can differ by many orders of magnitude. 

The interaction of an organism with its environment is essential to its survival. The basis of this interaction is predominantly small molecules. On a molecular level, small molecules both promote, as in nutrients, and challenge, as in toxins, cell viability. Those gene products that interact with small molecules underlie the organism’s ability to adapt to environmental changes and include those that bind, transport, and metabolize small molecules. 

In the future, databases of metabolomic information will be used to establish the relations among metabolite profiles and health status. One will be able to compare the metabolic status of an individual by comparing the results of diagnostics to the reference knowledge base. This knowledge will allow interventions such as drugs to improve health and prevent disease. 

One can already see such an approach being used. For example, metabolic profiling of amino acids and acylcarnitines from blood spots by automated tandem mass spectrometry is a diagnostic tool for inborn errors of metabolism in newborns. 

Lipidomics is a branch of metabolomics in which non-water soluble metabolites are studied in relation to the function of genes and their proteins. Classical methods of lipid analysis utilize the techniques of thin layer chromatography (TLC), HPLC, gas liquid chromatography (GLC), and MS. The recently developed technique of  has allowed almost complete analysis of the lipidome from organic extracts of cells/tissues. 

The complete analysis of lipid metabolites has been applied to investigate the effects of thiazolidinediones (TZDs), which are therapeutic agents against type II diabetes. TZDs decrease serum lipid concentrations, but these actions are accompanied by lipid accumulation in tissues. In a study by Watkins, an assessment of the lipid metabolome (the concentration of each lipid class and each of its constituent fatty acids) was applied to evaluate the effect of feeding a low dose of a TZD on the lipid metabolism of diabetic mice. Analysis of the results revealed key targets of the actions of the TZD.

See also Lectin Affinity chromatography under Affinity Purification 

Definitions

Glycan: refers to the carbohydrate portion of a glycoconjugate, such as a glycopeptide, glycoprotein, glycolipid or proteoglycan. Regnier (US8,568,993) 

Glycomics is the systematic study of protein-glycan interaction and function. Glycans are information rich molecules composed of complex carbohydrates (sugars or polysaccharides) that are often attached to proteins and lipids. Studying glycomics is not an easy task because linkage forms (e.g., alpha 1-3, beta 1-4) and branching events increase the structural complexity of glycans. Furthermore, glycans cannot be directly sequenced and synthesized like DNA and proteins.

Diagnosis of Disease based on Glycosylation Profile

Regnier (US8,568,993) teaches a method for detecting the presence or absence of diseases such as cancer by obtaining two or more sample glycopeptides from a subject and two or more reference glycopeptides and then comparing the sample glycopeptides or glycoproteins with the reference glycopeptides and detecting a difference in glycosylation state between them. 

Glycosylation and Cancer:

Lubman (WO2007/112082) discloses methods for the identification of glycosylated proteins and protein glycosylation patterns. In some embodiments, the method comprises treating a protein sample with a lectin affinity chromatography apparatus under conditions such that it enriches the protein sample for glycosylated proteins to generate a glycosylated protein enriched sample, and then separating the glycosylated protein enriched sample with a liquid chromatography apparatus. Lectins are carbohydrate-binding proteins that bind to glycosylated proteins, and the use of lectin affinity chromatography allows a protein sample to be enriched in glycosylated proteins. In some embodiments, the method further comprises performing polyacrylamide gel electrophoresis or mass spectrometry (e.g., MALDI-TOF mass spectrometry) on the separated glycosylated enriched protein sample. For example, pancreatic cancer serum can be analyzed using sialic acid specific lectin affinity chromatography followed by fractionation using RP-HPLC and further separation by SDS-PAGE to identify serum marker proteins of pancreatic cancer. The expression of sialic acid glycoproteins with different sub-structures can be compared between normal and cancer serum based on UV absorption detection. Altered glycoproteins can be digested and identified by LC-MS/MS. The structures of released carbohydrates from purified serum proteins can be studied using MALDI mass spectrometry. The method can be used to detect changes in the isoforms and extent of glycosylation of target glycoproteins in cancer serum. 

In the case of glycoproteins, glycan structure often changes in association with disease. The fact that these aberrant glycans can also be immunogenic has been exploited by pathologists in detecting cancer. Through the use of fluorescently labeled glycan-directed antibodies, staining procedures have been developed that allow differentiation between normal and malignant cells in tissues on the basis of targeting aberrant glycosylation. In addition, surface glycoproteins are well documented to play a prominent role in the loss of cellular adhesion, metastasis, the binding of tumor cells at remote sites, and secondary tumor colonization. Regnier (US8,568,993)

See also CRISPR gene editing under Enzymes and endonucleases

Gene editing refers to the introduction of desired changes at specific genomic loci. The original strategy for gene editing was to use programmable sequence specific nucleases to generate DNA double strand breaks at predetermined genomic sites, the desired changes being produced by subsequent non-homologous end joining, microhomology mediated end joining or homology directed repair. The emergence of CRISPR-Cas systems accelerated the development of gene editing technologies. (Liu, “The CRISPR-Cas toolbox and gene editing technologies” Molecular Cell 82, 2022). 

CRISPR

CRISPR based methods create edits by using a guide RNA to effect targeted breakage and a plasmid-borne donor DNA to repair these breaks with the desired sequence, with phenotype tracking permitted by amplicon sequencing of these components. 

Retron Library Recombineering (RLR)

Retrons are prokaryotic retroelements that undergo targeted reverse transcription, producing single stranded multicopy satellite DNA (msDNA). They participate in anti-phage defense mechanisms. These components can produce functioning recombineering donors within cells, creating specified genomic edits. Pooled, barcoded mutant libraries can be prepared in this way and used for characterization of natural and synthetic allelic variants, a process referred to as RLR. RLR is an alternative to the CRISPR based methods above and has the advantage that RLR’s “donor only” method, versus CRISPR’s “guide + donor” method, eliminates the requirement for ablating CRISPR targeting, allowing single base pair changes to be characterized without requiring additional protospacer adjacent motif (PAM) targeting. This allows ablating mutations to be incorporated. RLR overcomes the requirement for targeting a suitable PAM altogether, whereas CRISPR’s “guide + donor” methods decrease in performance as the distance to a PAM increases. RLR’s lack of a guide simplifies RLR elements. In contrast to the two unique elements required for the “guide + donor” strategies, RLR’s sole requirement is a unique short donor sequence within the retron. This relaxed design constraint enables RLR using nondesigned variation. (Church, PNAS 2021, 118 (18))
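The PAM constraint that RLR avoids can be made concrete with a small scan. The sketch below assumes the NGG PAM of the commonly used SpCas9 nuclease (an assumption, since the text does not name a nuclease), treats CCN on the given strand as an NGG on the opposite strand, and reports how far a desired edit position lies from the nearest PAM start. The sequence and positions are invented for illustration.

```python
import re

def pam_positions(seq):
    """0-based start indices of NGG (forward strand) and CCN (reverse-strand NGG)."""
    forward = [m.start() for m in re.finditer(r"(?=[ACGT]GG)", seq)]
    reverse = [m.start() for m in re.finditer(r"(?=CC[ACGT])", seq)]
    return sorted(set(forward + reverse))

def distance_to_nearest_pam(seq, edit_pos):
    """Distance in bases from edit_pos to the closest PAM start, or None if no PAM."""
    sites = pam_positions(seq)
    if not sites:
        return None
    return min(abs(site - edit_pos) for site in sites)

# Toy sequence (hypothetical) with a single AGG PAM starting at index 10.
toy_seq = "ATATATATATAGGATATATAT"
toy_sites = pam_positions(toy_seq)
toy_distance = distance_to_nearest_pam(toy_seq, edit_pos=2)
print(toy_sites, toy_distance)
```

A "guide + donor" design would have to work with whatever distance this scan reports, whereas a donor-only design does not depend on it.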

Eukaryotic retroelement Proteins for transgene insertion:

(Zhang, “Harnessing eukaryotic retroelement proteins for transgene insertion into human safe-harbor loci”, 2024) discloses precise RNA mediated insertion of transgenes (PRINT), an approach for site specifically primed reverse transcription that directs transgene synthesis directly into the genome at a multicopy safe-harbor locus. PRINT uses delivery of two in vitro transcribed RNAs: messenger RNA encoding avian R2 retroelement protein and template RNA encoding a transgene of length validated up to 4 kb. The R2 protein coordinately recognizes the target site, nicks one strand at a precise location and primes complementary DNA synthesis for stable transgene insertion. 

 

Once conserved non-coding sequence (CNS) regions have been identified, the next step is to evaluate their function in biological experiments. There are various techniques to assess regulatory function.

DNase I Hypersensitivity: Hypersensitive sites are regions at which local nucleosome organization has been altered from that of surrounding areas. They are often occupied by or show increased accessibility to transcription factors and other DNA-binding proteins. Ideally, bioinformatic and hypersensitive site analyses should extend for 50-100 kb in either direction from the gene, and intronic regions of neighboring genes should be included in the analysis, as they may contain important regulatory elements.

Once hypersensitive site mapping has been completed for a cell type of interest, it is worthwhile to extend it to other cell types and stimulation conditions. CNS regions that show increased hypersensitivity in stimulated cells may correspond to inducible enhancers whose function can be tested in standard reporter assays.

Developmental and cell lineage analyses are also particularly informative. For example, precursor naive T cells show DNase I hypersensitivity only at hypersensitive sites 3 and IV, located 5′ and 3′ of IL4. Th1 cells, which derive from the same naive precursors but have silenced IL4, continue to display hypersensitive sites 3 and IV. However, neither of these sites is apparent in fibroblasts, an unrelated cell type that has also silenced IL4. A testable hypothesis following from these data is that hypersensitive sites 3 and IV have distinct, cell type-specific functions: in Th1 cells, the sites may participate in silencing of the cytokine gene, whereas in naive T cells, the sites may be responsible for the poised state of the locus, for which the exact stimulation conditions determine whether gene activation or silencing prevails.

Targeted Disruption: One of the most reliable means of assessing in vivo function is targeted disruption of putative regulatory regions. Deletions and mutations of regulatory regions can be done either in the native chromosomal context or in large bacterial or yeast artificial chromosome transgenes. Deletion of positive regulatory elements such as enhancers will result in decreased gene expression. Deletion of negative regulatory elements such as silencers will lead to increased gene expression in cells that either normally express or normally silence the gene. Loss of an insulator element may lead to inappropriate expression of either the gene in question or a neighboring gene in an irrelevant cell type.

It is important to note that for important loci (e.g., IL4) for which the ability to survive attack by pathogens is crucial for reproductive fitness and survival, deletion of individual hypersensitive sites may have only a partial effect, most likely because evolutionary pressures have imposed functional redundancy such that more than one regulatory region participates in gene activation or silencing. In such cases, multiple mutations may be needed to produce strong effects.

Reporter Assays: A variety of reporter assays using cell lines or transgenic animals have been used to assess whether putative enhancer, silencer and insulator elements influence gene expression from target promoters.

Chromatin immunoprecipitation: As DNA-binding proteins associated with a regulatory region are likely to recruit DNA and histone modifying enzymes, a CNS involved in regulating gene expression is likely to be a focus for differential histone modifications or differential DNA methylation in cells that express or do not express the gene. This can be evaluated by chromatin immunoprecipitation with antibodies to modified histone tails and by restriction digestion with methylation-sensitive enzymes.
