当前位置:网站首页>Lasso regression with one code

Lasso regression with one code

2020-12-07 13:01:19 Shengxin skill tree

The essence of data mining is to reduce the number of genes , For example, the expression matrix is usually 2 More than 10000 protein coding genes , Whether it's an expression chip or RNA-seq Sequenced , What degree of difference analysis is used , Finally, there are hundreds of target genes . If it's a clinical cohort , It usually intersects with survival analysis , Or the intersection of different results from multiple datasets , such as : Multiple datasets integrated artifact -RobustRankAggreg package , Such a gene set is 100 Within the number of , But there's still room to shrink , such as lasso And so on , In the end 10 About genes signature It can be published smoothly .

however lasso For most of the data analysts who are half baked , It is still a little difficult to understand the principle and operation . Here I recommend one R package glmSparseNet, Can finish Multiple data types Of lasso Return to , Include :"gaussian", "poisson", "binomial", "multinomial", "cox", and "mgaussian".

The full name of this bag is :Network Centrality Metrics for Elastic-Net Regularized Models, Website is :https://www.bioconductor.org/packages/release/bioc/html/glmSparseNet.html Several instance documents are given , It's worth reading :

HTML R Script Breast survival dataset using network from STRING DB
HTML R Script Example for Classification -- Breast Invasive Carcinoma
HTML R Script Example for Survival Data -- Breast Invasive Carcinoma
HTML R Script Example for Survival Data -- Prostate Adenocarcinoma
HTML R Script Example for Survival Data -- Skin Melanoma

Install and load R package glmSparseNet I believe there is no need for me to say more :

if (!require("BiocManager"))
  install.packages("BiocManager")
BiocManager::install("glmSparseNet")

3 An example of cancer

Use here curatedTCGAData To get TCGA Database data , Reference tutorial : Use curatedTCGAData download TCGA Is database information easy to use , Let's take a look at the data first .

First, breast cancer

If it's classification modeling , Distinguish between tumor samples and normal tissue samples , The code to get the data is as follows :

library(curatedTCGAData)
library(TCGAutils)
params <- list(seed = 29221)

brca <- curatedTCGAData(diseaseCode = "BRCA", assays = "RNASeq2GeneNorm", FALSE)
## ----data.show, warning=FALSE, error=FALSE------------------------------------
brca <- TCGAutils::splitAssays(brca, c('01','11'))
xdata.raw <- t(cbind(assay(brca[[1]]), assay(brca[[2]])))

# Get matches between survival and assay data
class.v        <- TCGAbiospec(rownames(xdata.raw))$sample_definition %>% factor
names(class.v) <- rownames(xdata.raw)
# keep features with standard deviation > 0
xdata.raw <- xdata.raw %>%
  { (apply(., 2, sd) != 0) } %>%
  { xdata.raw[, .] } %>%
  scale

set.seed(params$seed)
small.subset <- c('CD5', 'CSF2RB', 'HSF1', 'IRGC', 'LRRC37A6P', 'NEUROG2',
                  'NLRC4', 'PDE11A', 'PIK3CB', 'QARS', 'RPGRIP1L', 'SDC1',
                  'TMEM31', 'YME1L1', 'ZBTB11',
                  sample(colnames(xdata.raw), 100))

xdata <- xdata.raw[, small.subset[small.subset %in% colnames(xdata.raw)]]
ydata <- class.v
save(xdata,ydata,file = 'brca_for_lasso.rdata')

#  You can see   The tumor sample is probably normal 10 Multiple quantity .
> table(ydata)
ydata
Primary Solid Tumor Solid Tissue Normal
               1093                 112

If it's survival analysis modeling , The data needed are pure tumor sample expression matrix plus corresponding survival information of tumor patients , The full download code is as follows :

params <- list(seed = 29221)  

brca <- curatedTCGAData(diseaseCode = "BRCA", assays = "RNASeq2GeneNorm", FALSE)
brca.primary.solid.tumor <- TCGAutils::splitAssays(brca, '01')
xdata.raw <- t(assay(brca.primary.solid.tumor[[1]]))

# Get survival information
ydata.raw <- colData(brca.primary.solid.tumor) %>% as.data.frame %>%
  # Keep only data relative to survival or samples
  select(patientID, vital_status,
         Days.to.date.of.Death, Days.to.Date.of.Last.Contact,
         days_to_death,         days_to_last_followup,
         Vital.Status) %>%
  # Convert days to integer
  mutate(Days.to.date.of.Death = as.integer(Days.to.date.of.Death)) %>%
  mutate(Days.to.Last.Contact  = as.integer(Days.to.Date.of.Last.Contact)) %>%
  # Find max time between all days (ignoring missings)
  rowwise %>%
  mutate(time = max(days_to_last_followup,        Days.to.date.of.Death,
                    Days.to.Last.Contact, days_to_death, na.rm = TRUE)) %>%
  # Keep only survival variables and codes
  select(patientID, status = vital_status, time) %>%
  # Discard individuals with survival time less or equal to 0
  filter(!is.na(time) & time > 0) %>% as.data.frame

# Set index as the patientID
rownames(ydata.raw) <- ydata.raw$patientID

# Get matches between survival and assay data
xdata.raw <- xdata.raw[TCGAbarcode(rownames(xdata.raw)) %in%
                         rownames(ydata.raw),]
xdata.raw <- xdata.raw %>%
  { (apply(., 2, sd) != 0) } %>%
  { xdata.raw[, .] } %>%
  scale

# Order ydata the same as assay
ydata.raw    <- ydata.raw[TCGAbarcode(rownames(xdata.raw)), ]

# Using only a subset of genes previously selected to keep this short example.
set.seed(params$seed)
small.subset <- c('CD5', 'CSF2RB', 'IRGC', 'NEUROG2', 'NLRC4', 'PDE11A',  
                  'PTEN', 'TP53', 'BRAF',
                  'PIK3CB', 'QARS', 'RFC3', 'RPGRIP1L', 'SDC1', 'TMEM31',
                  'YME1L1', 'ZBTB11', sample(colnames(xdata.raw), 100)) %>%
  unique

xdata <- xdata.raw[, small.subset[small.subset %in% colnames(xdata.raw)]]
ydata <- ydata.raw %>% select(time, status)
save(xdata,ydata,file = 'brca_for_lasso_sur.rdata')

> head(ydata)
             time status
TCGA-3C-AAAU 4047      0
TCGA-3C-AALI 4005      0
TCGA-3C-AALJ 1474      0
TCGA-3C-AALK 1448      0
TCGA-4H-AAAK  348      0
TCGA-5L-AAT0 1477      0

You can see , Although variable names are ydata , Oh, the data in it is different .

And then there's prostate cancer

The code is as follows :

library(curatedTCGAData)
library(TCGAutils)
params <- list(seed = 29221)

prad <- curatedTCGAData(diseaseCode = "PRAD", assays = "RNASeq2GeneNorm", FALSE)
prad.primary.solid.tumor <- TCGAutils::splitAssays(prad, '01')
xdata.raw <- t(assay(prad.primary.solid.tumor[[1]]))

# Get survival information
ydata.raw <- colData(prad.primary.solid.tumor) %>% as.data.frame %>%
  # Find max time between all days (ignoring missings)
  rowwise %>%
  mutate(time = max(days_to_last_followup, days_to_death, na.rm = TRUE)) %>%
  # Keep only survival variables and codes
  select(patientID, status = vital_status, time) %>%
  # Discard individuals with survival time less or equal to 0
  filter(!is.na(time) & time > 0) %>% as.data.frame

# Set index as the patientID
rownames(ydata.raw) <- ydata.raw$patientID

# keep only features that have standard deviation > 0
xdata.raw  <- xdata.raw[TCGAbarcode(rownames(xdata.raw)) %in%
                          rownames(ydata.raw),]
xdata.raw  <- xdata.raw %>%
  { (apply(., 2, sd) != 0) } %>%
  { xdata.raw[, .] } %>%
  scale

# Order ydata the same as assay
ydata.raw    <- ydata.raw[TCGAbarcode(rownames(xdata.raw)), ]

set.seed(params$seed)
small.subset <- c(geneNames(c('ENSG00000103091', 'ENSG00000064787',
                              'ENSG00000119915', 'ENSG00000120158',
                              'ENSG00000114491', 'ENSG00000204176',
                              'ENSG00000138399'))$external_gene_name,
                  sample(colnames(xdata.raw), 100)) %>% unique %>% sort

xdata <- xdata.raw[, small.subset[small.subset %in% colnames(xdata.raw)]]
ydata <- ydata.raw %>% select(time, status)
save(xdata,ydata,file = 'prad_for_lasso.rdata')

Finally, skin cancer

The code is as follows :

library(curatedTCGAData)
library(TCGAutils)
params <- list(seed = 29221)


skcm <- curatedTCGAData(diseaseCode = 'SKCM', assays = 'RNASeq2GeneNorm', FALSE, cache = tempdir())
skcm.metastatic <- TCGAutils::splitAssays(skcm, '06')
xdata.raw <- t(assay(skcm.metastatic[[1]]))
# Get survival information
ydata.raw <- colData(skcm.metastatic) %>% as.data.frame %>%
  # Find max time between all days (ignoring missings)
  rowwise %>%
  mutate(time = max(days_to_last_followup, days_to_death, na.rm = TRUE)) %>%
  # Keep only survival variables and codes
  select(patientID, status = vital_status, time) %>%
  # Discard individuals with survival time less or equal to 0
  filter(!is.na(time) & time > 0) %>% as.data.frame

# Get survival information
ydata.raw <- colData(skcm) %>% as.data.frame %>%
  # Find max time between all days (ignoring missings)
  rowwise %>%
  mutate(time = max(days_to_last_followup, days_to_death, na.rm = TRUE)) %>%
  # Keep only survival variables and codes
  select(patientID, status = vital_status, time) %>%
  # Discard individuals with survival time less or equal to 0
  filter(!is.na(time) & time > 0) %>% as.data.frame

# Set index as the patientID
rownames(ydata.raw) <- ydata.raw$patientID

# keep only features that have standard deviation > 0
xdata.raw      <- xdata.raw[TCGAbarcode(rownames(xdata.raw)) %in%
                              rownames(ydata.raw),]
xdata.raw      <- xdata.raw %>%
  { (apply(., 2, sd) != 0) } %>%
  { xdata.raw[, .] } %>%
  scale

# Order ydata the same as assay
ydata.raw    <- ydata.raw[TCGAbarcode(rownames(xdata.raw)), ]

set.seed(params$seed)
small.subset <- c('FOXL2', 'KLHL5', 'PCYT2', 'SLC6A10P', 'STRAP', 'TMEM33',
                  'WT1-AS', sample(colnames(xdata.raw), 100))

xdata <- xdata.raw[, small.subset[small.subset %in% colnames(xdata.raw)]]
ydata <- ydata.raw %>% select(time, status)

save(xdata,ydata,file = 'skcm_for_lasso.rdata')

The later modeling depends on these data files :

The previous data download , It's just preparation , Complete with a single line of code lasso It doesn't matter to return . If you don't understand , Go through the course by yourself : Use curatedTCGAData download TCGA Is database information easy to use , Anyway, the code I gave out can be run by copying and pasting .

Using functions cv.glmHub To deal with it xdata and ydata Just go ahead lasso Return to

If ydata There are properties like tumors or normal samples , We use cv.glmHub To do it Classification model building , If ydata It's tumor survival information , So we use cv.glmHub To build a model for survival analysis .

library(dplyr)
library(ggplot2)
library(survival)
library(loose.rock)
library(futile.logger)
library(curatedTCGAData)
library(TCGAutils)
library(glmSparseNet)

The first is the construction of classification model

ydata There are properties like tumors or normal samples , The second is classification , Using functions cv.glmHub The code can complete the analysis in one sentence :

load('brca_for_lasso.rdata')  
table(ydata)
fitted <- cv.glmHub(xdata, ydata,
                    family  = 'binomial',
                    network = 'correlation',
                    nlambda = 1000,
                    network.options = networkOptions(cutoff = .6,
                                                     min.degree = .2))

plot(fitted)
resp <- predict(fitted, s = 'lambda.min', newx = xdata, type = 'class')
table(resp,ydata)

The key parameter is family = 'binomial', And then you can see , This classifier works well :

> table(resp,ydata)
                     ydata
resp                  Primary Solid Tumor Solid Tissue Normal
  Primary Solid Tumor                1088                  11
  Solid Tissue Normal                   5                 101

Basically no wrong samples .

Yes, of course , We can also extract the main genes of the constructed classifier , And drawing ROC Curves and so on , It's something else , This is just sharing code :

coefs.v <- coef(fitted, s = 'lambda.min')[,1] %>% { .[. != 0]}
coefs.v %>% {
  data.frame(ensembl.id  = names(.),
             gene.name   = geneNames(names(.))$external_gene_name,
             coefficient = .,
             stringsAsFactors = FALSE)
} %>%
  arrange(gene.name) %>%
  knitr::kable()


flog.info('Misclassified (%d)', sum(ydata != resp))
flog.info('  * False primary solid tumour: %d',
          sum(resp != ydata & resp == 'Primary Solid Tumor'))
flog.info('  * False normal              : %d',
          sum(resp != ydata & resp == 'Solid Tissue Normal'))

## ----predict, echo=FALSE, warning=FALSE---------------------------------------
response <- predict(fitted, s = 'lambda.min', newx = xdata, type = 'response')
qplot(response, bins = 100)

## ----roc, echo=FALSE----------------------------------------------------------
roc_obj <- pROC::roc(ydata, as.vector(response))

data.frame(TPR = roc_obj$sensitivities, FPR = 1 - roc_obj$specificities) %>%
  ggplot() +geom_line(aes(FPR,TPR), color = 2, size = 1, alpha = 0.7)+
  labs(title= sprintf("ROC curve (AUC = %f)", pROC::auc(roc_obj)),
       x = "False Positive Rate (1-Specificity)",
       y = "True Positive Rate (Sensitivity)")

The above code can be run by copying and pasting .

Then there is the model construction of survival analysis

alike , Using functions cv.glmHub The code can complete the analysis in one sentence :

fitted <- cv.glmHub(xdata, Surv(ydata$time, ydata$status),
                    family  = 'cox',
                    lambda = buildLambda(1),
                    network = 'correlation',
                    network.options = networkOptions(cutoff = .6,
                                                     min.degree = .2))
plot(fitted)

Be careful , The parameters at this time family = 'cox', But the same , A classifier that can extract genes , Look at the corresponding gene set , It doesn't matter ;

coefs.v <- coef(fitted, s = 'lambda.min')[,1] %>% { .[. != 0]}
coefs.v %>% {
  data.frame(gene.name   = names(.),
             coefficient = .,
             stringsAsFactors = FALSE)
} %>%
  arrange(gene.name) %>%
  knitr::kable()

names(coefs.v) %>% { hallmarks(.)$heatmap }

lasso The gene identified by regression is 3 individual , as follows :

|      |gene.name | coefficient|
|:-----|:---------|-----------:|
|CD5   |CD5       |  -0.1115553|
|RPS15 |RPS15     |  -0.1266871|
|ZNF80 |ZNF80     |  -0.0645328|

$km
Call: survfit(formula = survival::Surv(time, status) ~ group, data = prognostic.index.df)

            n events median 0.95LCL 0.95UCL
Low risk  540     64   4456    3669      NA
High risk 540     88   3738    3126      NA

> coefs.v
        CD5       ZNF80       RPS15
-0.11155531 -0.06453279 -0.12668706

The most important thing is the effect of our survival prediction model :

coefs.v
separate2GroupsCox(as.vector(coefs.v),
                   xdata[, names(coefs.v)],
                   ydata,
                   plot.title = 'Full dataset', legend.outside = FALSE)

The drawing is as follows , It's quite remarkable :

The well constructed model is related well .

The other two datasets , Let's verify it by yourself !

Real case

The previous expression matrix and phenotypic information , We all used the tutorial directly : Use curatedTCGAData download TCGA Is database information easy to use , Randomly selected genes , So we set up the random seed ,params <- list(seed = 29221) . But in reality , Our genes should have been selected once , Generally speaking, it's a difference analysis , perhaps wgcna analysis , Get a list of the differences or genes of a module . And the data set , Usually 1000 within , And then go and go lasso regression analysis , To locate fewer genes . The essence of data mining, which I started with, is to reduce the number of genes .

Apprenticeship work

Ahead TCGA Database BRCA In the dataset , You can pick it out TNBC Expression matrix of subtype , about 100 About one sample , Some patients have chemotherapy information , Divide them into remission group and ineffective group , In this way, you can first do a difference analysis to get an appointment 1000 There are about two different genes . so what , In view of this Agreement 1000 There are about five different genes 100 About a sample expression matrix to follow our tutorial , One sentence of code completes lasso Return to , Look at the ability of a good model to predict the effects of chemotherapy !

This article is from WeChat official account. - Shengxin skill tree (biotrainee) , author : Shengxin skill tree

The source and reprint of the original text are detailed in the text , If there is any infringement , Please contact the yunjia_community@tencent.com Delete .

Original publication time : 2020-12-03

Participation of this paper Tencent cloud media sharing plan , You are welcome to join us , share .

版权声明
本文为[Shengxin skill tree]所创,转载请带上原文链接,感谢
https://chowdera.com/2020/12/202012071258304149.html