Dataset information

This report has been verified by Polly as per framework v1.0.

Dataset information Value
Dataset ID GSE68086_GPL16791_raw
Title RNA-Seq of Tumor-Educated Platelets Enables Blood-Based Pan-Cancer, Multiclass, and Molecular Pathway Cancer Diagnostics
Summary We report RNA-sequencing data of 283 blood platelet samples, including 228 tumor-educated platelet (TEP) samples collected from patients with six different malignant tumors (non-small cell lung cancer, colorectal cancer, pancreatic cancer, glioblastoma, breast cancer and hepatobiliary carcinomas). In addition, we report RNA-sequencing data of blood platelets isolated from 55 healthy individuals. This dataset highlights the ability of TEP RNA-based 'liquid biopsies' in patients with several types of cancer, including the ability for pan-cancer, multiclass cancer and companion diagnostics.
Overall Design Blood platelets were isolated from whole blood in purple-cap BD Vacutainers containing EDTA anti-coagulant by standard centrifugation. Total RNA was extracted from the platelet pellet, subjected to cDNA synthesis and SMARTer amplification, fragmented by Covaris shearing, and prepared for sequencing using the TruSeq Nano DNA Sample Preparation Kit. Subsequently, pooled sample libraries were sequenced on the Illumina HiSeq 2500 platform. All steps were quality-controlled using Bioanalyzer 2100 with RNA 6000 Picochip, DNA 7500 and DNA High Sensitivity chip measurements. For further downstream analyses, reads were quality-controlled using Trimmomatic, mapped to the human reference genome using STAR, and intron-spanning reads were summarized using HTSeq. The processed data includes 285 samples (columns) and 57736 Ensembl gene IDs (rows). The supplementary data file (TEP_data_matrix.txt) contains the intron-spanning read counts, after data summarization by HTSeq.
Number of samples 285
Publication Link Link
Abstract Tumor-educated blood platelets (TEPs) are implicated as central players in the systemic and local responses to tumor growth, thereby altering their RNA profile. We determined the diagnostic potential of TEPs by mRNA sequencing of 283 platelet samples. We distinguished 228 patients with localized and metastasized tumors from 55 healthy individuals with 96% accuracy. Across six different tumor types, the location of the primary tumor was correctly identified with 71% accuracy. Also, MET or HER2-positive, and mutant KRAS, EGFR, or PIK3CA tumors were accurately distinguished using surrogate TEP mRNA profiles. Our results indicate that blood platelets provide a valuable platform for pan-cancer, multiclass cancer, and companion diagnostics, possibly enabling clinical advances in blood-based "liquid biopsies".
Disease Breast Neoplasms, Triple Negative Breast Neoplasms, Digestive System Neoplasms, Colorectal Neoplasms, Glioblastoma, Normal, Carcinoma, Non-Small-Cell Lung, Pancreatic Neoplasms
Tissue Blood
Drug None
Cell Lines None
Cell Type Platelet
Organism Homo sapiens
Custom Curation N/A

Processing information

This section provides processing details for the data obtained from the source.

Data Processing SRA files are converted to FASTQ files using fasterq-dump, then quality-checked using FastQC with a short-read threshold of 20. A MinION adapter search with an adapter threshold of 2 is performed on the FASTQ file(s), followed by skewer quality trimming with a minimum read length of 18 and a Phred quality threshold of 10. Kallisto quantification with a fragment length of 100 and a standard deviation of 20 is used to obtain read counts. These parameters are chosen to support robust analysis and reliable interpretation of bulk RNA-seq data.
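As a rough illustration of the quantification step, the sketch below assembles a kallisto command line using the fragment length (100) and standard deviation (20) quoted above. The function name and the index, FASTQ and output paths are hypothetical placeholders, and single-end mode is an assumption (kallisto requires explicit fragment length statistics only for single-end reads). The report does not state the exact invocation that was used.

```python
def kallisto_quant_command(index_path, fastq_path, out_dir,
                           fragment_length=100, fragment_sd=20):
    """Build (but do not run) a kallisto quant command for single-end reads."""
    return [
        "kallisto", "quant",
        "-i", index_path,             # transcriptome index (hypothetical path)
        "-o", out_dir,                # output directory (hypothetical path)
        "--single",                   # assumption: single-end reads
        "-l", str(fragment_length),   # estimated mean fragment length
        "-s", str(fragment_sd),       # estimated fragment length standard deviation
        fastq_path,                   # trimmed reads (hypothetical path)
    ]


cmd = kallisto_quant_command("index.idx", "sample_trimmed.fastq", "quant_out")
print(" ".join(cmd))
```

The command is returned as a list so that it can be passed to subprocess.run without shell quoting issues.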
1. Metadata information
Metadata information Value
Polly curated metadata fields are present at dataset level Pass
Polly curated metadata fields are present at sample level Pass
Polly curated metadata fields are present in gct file Pass
Publication Link is provided Pass
Publication Link is valid Fail
Dataset-Level vs. Sample-Level Metadata: concordance check Pass
Custom fields are present and valid N/A

2. Feature identifier
Feature Identifier Check Value
Ensembl Gene IDs present Pass
Ensembl Gene IDs are valid Pass
Gene Symbols present Fail
Gene Symbols are valid Fail
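The report does not say how identifier validity is assessed. As a minimal sketch, assuming that "valid" means at least matching the usual human Ensembl gene ID format (ENSG followed by 11 digits, with an optional version suffix), a format-only check could look like this. The function name and the example IDs are illustrative, and a format check does not confirm that an ID exists in any Ensembl release.

```python
import re

# Assumed pattern: "ENSG" + 11 digits, optionally followed by ".<version>".
ENSEMBL_GENE_ID = re.compile(r"ENSG\d{11}(\.\d+)?")


def invalid_ensembl_ids(feature_ids):
    """Return the identifiers that do not match the assumed Ensembl gene ID format."""
    return [fid for fid in feature_ids if not ENSEMBL_GENE_ID.fullmatch(fid)]


example_ids = ["ENSG00000141510", "ENSG00000012048.23", "TP53", "ENSG123"]
bad = invalid_ensembl_ids(example_ids)
print(bad)
```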

3. Data Matrix
Data Matrix Value
Data Matrix Values Valid Pass
Data Matrix Range 0.00 to 5168550.00


4. Histogram for expression distribution

Figure 1: Histogram showing frequency and distribution of TPM normalised expression values across all samples.

The histogram displays the distribution of values in the counts matrix. The raw count values are TPM-normalized and log2(x+1)-transformed for clarity.
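The transformation described above can be sketched as follows for a single sample, using the standard TPM definition (counts divided by feature length, then rescaled to sum to one million). The report does not give its exact formula or the source of the feature lengths, so the function name, the length values in kilobases and the example counts are assumptions for illustration only.

```python
import math


def tpm_and_log2(counts, lengths_kb):
    """Return (TPM values, log2(TPM + 1) values) for one sample.

    counts: raw read counts per gene.
    lengths_kb: gene lengths in kilobases (assumed input, not from the report).
    """
    # Reads per kilobase: normalize each count by its feature length.
    rpk = [c / length for c, length in zip(counts, lengths_kb)]
    scale = sum(rpk) / 1e6
    if scale == 0:
        # A sample with no reads has TPM 0 everywhere.
        tpm = [0.0] * len(counts)
    else:
        tpm = [r / scale for r in rpk]
    return tpm, [math.log2(t + 1) for t in tpm]


tpm, log_tpm = tpm_and_log2(counts=[10, 20, 0], lengths_kb=[1.0, 2.0, 0.5])
print(tpm)
print(log_tpm)
```

The +1 pseudocount keeps genes with zero expression at exactly 0 after the log transform instead of producing negative infinity.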


5. Sample-wise distribution of expression values using a boxplot

Figure 2: Boxplot showing TPM expression values across all samples.

The boxplot displays the sample-wise distribution of values in the counts matrix. The raw count values are TPM-normalized and log2(x+1)-transformed for clarity.


6. Sample-wise distribution of the number of genes with zero expression using a barplot

Figure 3: Barplot showing the number of genes with an expression value equal to 0 per sample.

This barplot helps identify samples with a markedly high number of lowly expressed genes, which may indicate poor mapping of reads to the genome.
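The per-sample quantity plotted here can be computed with a short tally over the expression matrix. The sketch below assumes a genes-by-samples layout (rows are genes, columns are samples), matching the orientation described in the dataset information. The function name, sample IDs and values are made up for illustration.

```python
def zero_genes_per_sample(matrix, sample_ids):
    """Count, for each sample, how many genes have an expression value of 0.

    matrix: list of rows, one per gene; each row holds one value per sample.
    """
    zero_counts = {sample: 0 for sample in sample_ids}
    for row in matrix:
        for sample, value in zip(sample_ids, row):
            if value == 0:
                zero_counts[sample] += 1
    return zero_counts


samples = ["S1", "S2", "S3"]      # hypothetical sample IDs
matrix = [
    [0, 5, 0],                     # gene 1
    [3, 0, 0],                     # gene 2
    [0, 2, 7],                     # gene 3
    [1, 4, 0],                     # gene 4
]
zeros = zero_genes_per_sample(matrix, samples)
print(zeros)
```

A sample whose count stands far above the others, such as S3 in this toy matrix, is the kind of outlier the barplot is meant to surface.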


1. Polly's curated metadata field distribution

Figure 1: The UMAP plot(s) represent different samples in a reduced dimensional space, with colors indicating the Polly standard and custom curated fields.

The plot(s) aid in understanding the biological differences between samples as described by the metadata fields. Note: a UMAP plot of the raw counts will not reflect the correct distribution, as the data requires normalisation.

Figure 2: The sunburst plot(s) represent counts of different samples, with colors representing values from the Polly standard and custom curated fields.

The plot(s) aid in understanding the distribution of samples across the categorical metadata variables of the Polly standard curated fields.


2. Source metadata field distribution

Figure 3: The UMAP plot(s) represent different samples in a reduced dimensional space, with colors indicating the source metadata fields.

The plot(s) aid in understanding the biological differences between samples as described by the metadata fields. Note: a UMAP plot of the raw counts will not reflect the correct distribution, as the data requires normalisation.


Figure 4: The sunburst plot(s) represent counts of different samples, with colors representing values from the source metadata fields.

The plot(s) aid in understanding the distribution of samples across the categorical metadata variables of the source fields.