Workflow

Goal and current status

The overall aim of the Human Protein Atlas project is to map the expression of all human proteins in normal tissues, cancers and cell lines through a systems biology approach that integrates various omics technologies, including genomics, transcriptomics and antibody-based proteomics. The strategy to meet this aim involves large-scale, high-throughput generation and validation of antibodies against at least one isoform of each of the roughly 20,000 protein-coding genes in the human genome, and the use of these antibodies in a variety of applications, e.g. immunohistochemical staining of normal tissues and cancers, immunofluorescent staining of cell lines and Western blot analysis. The effort to map the human proteome can be considered a natural progression of the Human Genome Project, and the Human Protein Atlas project is to a large extent based on the information derived from the human genome.

The Human Protein Atlas project was initiated in 2003 and is funded by the Knut and Alice Wallenberg Foundation. The first online version of the Atlas was made public in 2005 and contained data from a little over 700 antibodies. In the latest version of the Human Protein Atlas (v.21, released November 2021), the expression profiles of more than 17,100 human proteins have been analyzed using over 26,900 antibodies, which corresponds to 87% of the human protein-coding genes.

Antigens and antibodies

As part of the Human Protein Atlas effort, more than 50,000 polyclonal antibodies against various recombinant human Protein Epitope Signature Tag (PrEST) protein fragments (100-150 amino acids) have been generated. Each antibody has been affinity purified using the recombinant protein fragment as the capture ligand, to ensure that all antibodies in the pool bind to epitopes present on the target protein. In addition to the in-house antibodies, the Human Protein Atlas has access to >12,000 antibodies from >100 commercial vendors, kindly provided as part of collaboration agreements.

Generation of data for Tissue, Pathology, Single Cell Type and the other sections

All antibodies that have been approved for protein expression profiling are used to stain a series of 8 different tissue microarrays (TMAs) that altogether contain samples from 44 normal human tissues (in triplicate) and 20 different forms of cancer (typically 12 different patients per cancer form, in duplicate). This amounts to a total of 576 stained samples for each antibody, which are then scanned as high-resolution digital images. The tissue images are annotated manually by specially trained personnel in order to determine the staining pattern of the evaluated antibodies.

After the annotation step, the quality and correlation of all available data regarding the gene of interest are taken into consideration to manually set a reliability score for the generated protein expression profile. The following data is evaluated:
 

  • Annotated staining patterns of all antibodies targeting the gene of interest.

  • Normalized mRNA expression levels (nTPM) of each tissue/organ.

  • Available gene-related scientific literature.

Finally, the protein expression profile, along with all images, antibody annotation data and antibody information, is published in the upcoming version of the Human Protein Atlas database (www.proteinatlas.org).

Figure 1: Illustration of the workflow in the Uppsala group of the Human Protein Atlas, divided into three main steps: 1) Data is initially generated through the preparation and scanning of antibody-stained tissue microarrays. 2) The image data is subsequently analyzed and validated for each gene by comparing the staining patterns with available mRNA expression data and gene-related scientific literature. 3) The analyzed data is presented as protein expression profiles together with all images and is released in the upcoming version of the Human Protein Atlas.

In the last couple of years, large volumes of transcriptomics data have been generated and imported into the Human Protein Atlas database. Expression levels of mRNA for all protein-coding genes in most human organs and tissues, originating from three projects (HPA, GTEx, FANTOM5), are now available in the Tissue section. The mRNA data from HPA and GTEx have also been merged to form normalized TPM (nTPM) levels. All genes have been categorized based on their nTPM-based mRNA expression, according to the degree of specificity and the distribution of the expression in the different organs, tissues and cell types of the human body. In addition, the normalized mRNA expression of each gene has been assigned to expression clusters, in which genes with similar expression patterns among the various organs and tissues of the body are grouped together; these clusters are visualized in UMAP plots. Expression clusters are also available for the mRNA data in four other sections: Single Cell Type, Brain, Immune Cell and Cell Line.
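To make the idea of a specificity category concrete, here is a minimal sketch of how a gene could be binned from its nTPM values across tissues. The text above does not state the actual rules, so the detection cutoff, the fold-change requirement, the category names and the function name `specificity_category` are all assumptions chosen for illustration.

```python
from typing import Mapping

# Both cutoffs are assumptions for this sketch; the text above does not give them.
DETECTION_CUTOFF = 1.0  # hypothetical nTPM level below which a gene counts as not detected
FOLD_CUTOFF = 4.0       # hypothetical fold difference required for "enriched"/"enhanced"


def specificity_category(ntpm_by_tissue: Mapping[str, float]) -> str:
    """Assign a simplified, hypothetical specificity category to one gene."""
    if not ntpm_by_tissue:
        raise ValueError("at least one tissue value is required")

    values = sorted(ntpm_by_tissue.values(), reverse=True)
    top = values[0]
    if top < DETECTION_CUTOFF:
        return "not detected"

    # "enriched": the highest tissue clearly exceeds every other tissue.
    second = values[1] if len(values) > 1 else 0.0
    if top >= FOLD_CUTOFF * second:
        return "enriched"

    # "enhanced": the highest tissue clearly exceeds the average over all tissues.
    mean = sum(values) / len(values)
    if top >= FOLD_CUTOFF * mean:
        return "enhanced"

    return "low specificity"
```

The sketch looks only at the single highest tissue; a fuller scheme would presumably also handle small groups of tissues that share elevated expression.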

In addition to the antibody-based analysis of protein expression in 20 different forms of cancer, the Pathology section contains mRNA expression data for each gene in 17 different main forms of cancer, imported from The Cancer Genome Atlas (TCGA). The cancer-related mRNA expression of each gene is categorized according to the distribution of the expression across cancer types, as well as its association with cancer patient survival, which is visualized in Kaplan-Meier plots. Each gene is further categorized as favourable or unfavourable for each cancer type if the survival association is statistically significant (p<0.001). Other genes are labeled non-prognostic.
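The prognostic labelling described above can be sketched as a small decision rule. The p<0.001 cutoff comes from the text; how the direction of the association is determined is not stated, so the `high_expression_survives_longer` flag and the names `SurvivalResult` and `prognostic_category` are assumptions made for illustration.

```python
from dataclasses import dataclass

P_VALUE_CUTOFF = 0.001  # significance threshold stated in the text (p<0.001)


@dataclass
class SurvivalResult:
    """Outcome of a survival analysis for one gene in one cancer type."""
    p_value: float
    # Assumed direction flag (hypothetical): True when patients with high
    # expression of the gene survive longer than patients with low expression.
    high_expression_survives_longer: bool


def prognostic_category(result: SurvivalResult) -> str:
    """Return 'favourable', 'unfavourable' or 'non-prognostic'."""
    # The comparison is strict, so p == 0.001 is treated as not significant.
    if result.p_value >= P_VALUE_CUTOFF:
        return "non-prognostic"
    if result.high_expression_survives_longer:
        return "favourable"
    return "unfavourable"
```

For example, `prognostic_category(SurvivalResult(0.0002, False))` yields "unfavourable", while any result with a p-value of 0.001 or above yields "non-prognostic" regardless of direction.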

The Single Cell Type section shows single cell RNA sequencing (scRNAseq) data from 25 different human tissues, together with immunohistochemically stained tissue sections visualizing the corresponding spatial protein expression patterns. The scRNAseq analysis is based on publicly available genome-wide expression data from various studies and comprises all protein-coding genes in 444 individual single cell clusters. These clusters have been annotated as 15 cell type groups and a total of 78 individual cell types using >500 well-known cell type-specific markers. The genes expressed in each of the cell types can be explored in interactive UMAP plots and bar charts, with links to corresponding immunohistochemical stainings in human tissues. In addition to UMAP plots of single cell clusters, there are also UMAP plots that visualize expression clusters grouping genes of similar expression among the annotated cell types.

The other seven sections are mainly handled by other groups within the Human Protein Atlas, and four of them are new as of version 21.

The Brain section combines mRNA expression data and antibody-based protein localization to map gene expression in the different regions of the mammalian brain. RNA-seq transcriptomics data for each gene in more than 200 different regions of the human brain, and in more than 10 brain regions in pig and mouse, is combined with high-resolution spatial antibody-based protein expression data from mouse brain (serially aligned sections of whole mouse brain) and human brain (tissue images/data from the Tissue Atlas).

The Tissue Cell Type section is based on a correlation analysis of mRNA expression from bulk RNA data, performed in 14 different organs. The analysis complements the Single Cell Type data by identifying genes that are mainly expressed by a few cell types within an organ. A prediction of cell type specificity within an organ has been calculated for all protein-coding genes by correlating the mRNA expression of gene marker panels, consisting of three cell type-specific marker genes (virtual reference transcripts) per cell type, with the expression of all other genes. The specificity prediction results in a score of at most one, where one corresponds to perfect correlation. To simplify exploration of the data, the correlation scores are categorized into three degrees of cell type specificity enrichment (moderate, high, very high), which indicate how strongly the expression of a gene is enriched in a cell type compared to the other cell types within an organ.
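The correlation scoring described above can be illustrated with a minimal sketch. It assumes that the score for a gene is the mean Pearson correlation between that gene and the three marker genes of a cell type across bulk samples from one organ. The text does not say how the panel is combined or where the category boundaries lie, so the averaging step, the threshold values and all function names here are hypothetical.

```python
from math import sqrt
from typing import Optional, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2 samples)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


def panel_score(gene: Sequence[float], markers: Sequence[Sequence[float]]) -> float:
    """Mean correlation of one gene with each marker gene in the panel (assumed rule)."""
    if not markers:
        raise ValueError("the marker panel must not be empty")
    return sum(pearson(gene, marker) for marker in markers) / len(markers)


def enrichment_category(
    score: float,
    thresholds: Tuple[float, float, float] = (0.5, 0.6, 0.7),  # hypothetical cutoffs
) -> Optional[str]:
    """Map a correlation score to 'moderate', 'high' or 'very high', else None."""
    moderate, high, very_high = thresholds
    if score >= very_high:
        return "very high"
    if score >= high:
        return "high"
    if score >= moderate:
        return "moderate"
    return None
```

A gene whose profile rises and falls together with all three markers across the samples scores close to one and falls in the highest category; a gene that varies independently of the panel scores near zero and receives no enrichment label.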

The previous Blood Atlas section has been divided into two separate sections: Immune Cell and Blood Protein. The Immune Cell section explores gene expression in the various cell types of human blood. Transcriptomics data on mRNA expression in different types of blood cells, from three different projects (HPA, Monaco, Schmiedel), have been generated through a combination of cell sorting and RNA-seq. In the Blood Protein section, the Immune Cell data is supplemented with proteomics data in the form of plasma or blood protein concentrations generated through mass spectrometry and/or antibody-based immunoassays. In addition to the analysis of plasma protein levels, the distribution of secreted proteins in the human body, the Human Secretome, has been mapped and published in the Blood Protein section. The final location of secreted proteins has been annotated for 2793 candidate proteins using available scientific literature.

In parallel with the immunohistochemical analysis of tissues, immunofluorescence analysis of each protein's subcellular expression pattern is performed on a selection of 36 cell lines using confocal microscopy. For more than 13,000 proteins, the subcellular location(s) have been manually annotated to one or more of 35 organelles or subcellular structures using three representative cell lines. The data and images are published in the Subcellular section.

A Cell Line section has also been created for the exploration of the mRNA expression of all protein-coding genes in 69 established human cell lines. As in other sections, the mRNA expression data is categorized according to degree of specificity and distribution, as well as clustered based on body-wide expression patterns.

The Metabolic section is an expansion of the Human Protein Atlas with data imported from metabolicatlas.org to enable exploration of gene expression and protein function in the context of the human metabolic network. Manually curated maps are available for over 120 different metabolic pathways or subsystems, each depicting the association of proteins with the involved biochemical reactions. For proteins involved in metabolism, a metabolic summary is provided that describes the metabolic subsystems/pathways, subcellular compartments, and number of reactions associated with the protein. Each metabolic pathway map can additionally be explored in its entirety together with a heatmap showing the mRNA expression (nTPM) of all proteins associated with the pathway across 256 different tissue types.

The website

All data is freely and publicly available on the Human Protein Atlas website (www.proteinatlas.org), a database with >150,000 unique visitors per month that is updated with new data and functionalities on a yearly basis. Data regarding a protein of interest can be accessed through an initial search. Possible search queries include the name of the protein of interest (simple search), but also conditional advanced queries that include or exclude tissue types, cell types, cell lines, protein classes, subcellular location, etc. A query using the search fields (Fig 2) leads to a search result summarizing the main findings for each gene matching the query. Clicking on a gene leads to a gene-centered summary page that provides a general, large-scale overview of the expression pattern at both the mRNA and the protein level. From the summary page, one can then browse deeper into the specific expression patterns from the perspectives of the 10 different sections.

In addition to search queries, it is possible to navigate the Human Protein Atlas data by browsing knowledge pages showing comprehensive summaries of e.g. cell type-specific or organ-specific proteomes (Fig 2), with clickable charts and examples. The database also provides around 30 downloadable datasets for large-scale bioinformatic analyses, as well as programmatic access to all Human Protein Atlas data.
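As a rough illustration of working with a downloaded dataset, the sketch below filters a tab-separated table for one gene. The file layout is an assumption: the text does not specify the format of the downloads, and the column names ("Gene name", "Tissue", "Level"), the placeholder rows and the helper `rows_for_gene` are invented for this example only.

```python
import csv
import io
from typing import Dict, List, TextIO


def rows_for_gene(
    handle: TextIO,
    gene_name: str,
    gene_column: str = "Gene name",  # hypothetical column name
) -> List[Dict[str, str]]:
    """Return all rows of a tab-separated table whose gene column matches gene_name."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if row.get(gene_column) == gene_name]


# Placeholder table with made-up rows, standing in for a downloaded file.
SAMPLE = (
    "Gene name\tTissue\tLevel\n"
    "GENE_A\tliver\tHigh\n"
    "GENE_B\tliver\tLow\n"
    "GENE_A\tkidney\tMedium\n"
)

matches = rows_for_gene(io.StringIO(SAMPLE), "GENE_A")
for row in matches:
    print(row["Tissue"], row["Level"])
```

With a real download, the `io.StringIO` object would be replaced by an open file handle, and the column names adjusted to whatever header the chosen dataset actually uses.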

Figure 2: Illustration of the global structure of the Human Protein Atlas web portal. Exploration of the Human Protein Atlas data is guided through a systemic and a gene-centric approach, both subdivided into ten interconnected sections. The systemic approach entails exploration of various section proteomes, each specific to a group of genes associated with a certain location, phenotype or activity within the body, such as an organ or cell type, or proteins secreted to blood or other parts of the body. The section pages are accessed by clicking on any of the section images on the start page. The other option is to search for specific genes using the start page search field, which can be combined with various filtering options. The search leads the visitor to a search result page, where the gene of interest can be selected to access gene-specific data, both summarized in the gene summary page and presented in depth in the different gene-specific sections. In addition, the database contains a page with data concerning SARS-CoV-2 relevant proteins, as well as downloadable datasets, educational material such as dictionaries, and more, all found in the menu.

Last modified: 2022-03-09