admin管理员组

文章数量:1444896

4个NC杂志的空间转录组数据分析(GSE190811)

我们生信马拉松授课群里有个学员问空间转录组数据如何读取,给出的数据编号为 GSE190811,由于不是10x spaceranger 的标准格式,对于初学者来说有一些难度,下面一起来看看这个数据集!

数据背景

GSE190811 这个数据来自2022年11月10号发表在NC杂志上,标题为《Single cell profiling of primary and paired metastatic lymph node tumors in breast cancer patients》,文章仅用这个数据绘制了一幅图:上皮细胞的特征基因在空间切片上的打分分布。

这个数据的GEO链接为:.cgi?acc=GSE190811,总共是4个空间转录组的样本:

代码语言:javascript代码运行次数:0运行复制
GSM5732357 PT_3 LNM ST
GSM5732358 PT_6 LNM ST
GSM5732359 PT_7 LNM ST
GSM5732360 PT_8 LNM ST

提供的附件 GSE190811_RAW.tar 包含的文件如下:

代码语言:javascript代码运行次数:0运行复制
GSM5732357_A_1938345_11.jpg.gz 52.6 Mb
GSM5732357_A_raw_feature_bc_matrix.h5 29.6 Mb
GSM5732357_A_scalefactors_json.json.gz 161 b
GSM5732357_A_tissue_hires_image.png.gz 14.5 Mb
GSM5732357_A_tissue_positions_list.csv.gz 66.5 Kb
GSM5732358_B_1938529_9.jpg.gz 68.6 Mb
GSM5732358_B_raw_feature_bc_matrix.h5 34.7 Mb
GSM5732358_B_scalefactors_json.json.gz 164 b
GSM5732358_B_tissue_hires_image.png.gz 16.9 Mb
GSM5732358_B_tissue_positions_list.csv.gz 64.4 Kb
GSM5732359_C_2000752_23.jpg.gz 75.1 Mb
GSM5732359_C_raw_feature_bc_matrix.h5 40.5 Mb
GSM5732359_C_scalefactors_json.json.gz 164 b
GSM5732359_C_tissue_hires_image.png.gz 17.3 Mb
GSM5732359_C_tissue_positions_list.csv.gz 62.6 Kb
GSM5732360_D_2000910_33.jpg.gz 65.7 Mb
GSM5732360_D_raw_feature_bc_matrix.h5 38.5 Mb
GSM5732360_D_scalefactors_json.json.gz 160 b
GSM5732360_D_tissue_hires_image.png.gz 16.1 Mb
GSM5732360_D_tissue_positions_list.csv.gz 59.7 Kb

目录中的文件说明

  • raw_feature_bc_matrix.h5:基因表达矩阵的h5格式
  • tissue_positions.csv:记录了对应的空间切片里面的每个barcode所在的切片的空间坐标,理论上有了这个csv文件其实就可以相当于有了空间切片的图片的细胞定位作用
  • tissue_hires_image.png:分辨率很高,所以图片文件就很大
  • tissue_lowres_image.png:文件很小,因为图片分辨率低,官网说它 is always 600 pixels. 【这个数据没有提供这个文件】
  • scalefactors_json.json:它关联了tissue_hires_image.png and tissue_lowres_image.png,文件内容如下:
代码语言:javascript代码运行次数:0运行复制
{"tissue_hires_scalef": 0.06628223, "tissue_lowres_scalef": 0.019884668, "fiducial_diameter_fullres": 383.95734, "spot_diameter_fullres": 237.68787000000003}

下载下来后整理成如下格式:

代码语言:javascript代码运行次数:0运行复制
├── GSM5732357_A
│   ├── 1938345_11.jpg
│   ├── raw_feature_bc_matrix.h5
│   ├── scalefactors_json.json
│   ├── tissue_hires_image.png
│   └── tissue_positions_list.csv
├── GSM5732358_B
│   ├── 1938529_9.jpg
│   ├── raw_feature_bc_matrix.h5
│   ├── scalefactors_json.json
│   ├── tissue_hires_image.png
│   └── tissue_positions_list.csv
├── GSM5732359_C
│   ├── 2000752_23.jpg
│   ├── raw_feature_bc_matrix.h5
│   ├── scalefactors_json.json
│   ├── tissue_hires_image.png
│   └── tissue_positions_list.csv
├── GSM5732360_D
│   ├── 2000910_33.jpg
│   ├── raw_feature_bc_matrix.h5
│   ├── scalefactors_json.json
│   ├── tissue_hires_image.png
│   └── tissue_positions_list.csv

使用Seurat读取

这里用到的文件为 raw_feature_bc_matrix.h5,以及 图片tissue_hires_image.png:

代码语言:javascript代码运行次数:0运行复制
###
### Create: Jianming Zeng
### Date:  2023-12-31  
### Email: jmzeng1314@163
### Blog: /
### Forum:  .html
### CAFS/SUSTC/Eli Lilly/University of Macau
### Update Log: 2023-12-31   First version 
### Update Log: 2024-12-09   by juan zhang (492482942@qq)
### 

rm(list=ls())
options(stringsAsFactors = F)
library(ggsci)
library(dplyr) 
library(future)
library(Seurat)
library(clustree)
library(cowplot)
library(data.table)
library(ggplot2)
library(patchwork)
library(stringr)
library(qs)
library(Matrix)
getwd()

# 创建目录
getwd()
gse <- "GSE190811"
dir.create(gse)

################## 读取数据,有h5与tissue_hires_image.png
samples <- list.dirs("GSE190811/", recursive = F, full.names = F)
samples
scRNAlist <- lapply(samples, function(pro){
# pro <- samples[1]
print(pro)

# 先读取 h5
  data <- Read10X_h5(filename = paste0("GSE190811/",pro,"/raw_feature_bc_matrix.h5"))
  dim(data)
  data[1:5,1:5]
  object <- CreateSeuratObject(counts = data, assay = "Spatial", min.cells = 3, project = pro)
  object

# 再读取
  image <- Read10X_Image(image.dir = paste0("GSE190811/",pro), 
                         image.name = "tissue_hires_image.png",
                         filter.matrix = TRUE)
  image
  dim(image)
  image <- image[Cells(x = object)]
  DefaultAssay(object = image) <- "Spatial"

# 添加图片到object中
  object[[pro]] <- image
  object@images[[1]]@scale.factors$lowres <- object@images[[1]]@scale.factors$hires
return(object)
})
names(scRNAlist) <-  samples
scRNAlist
# merge
sce.all <- merge(scRNAlist[[1]], y=scRNAlist[-1], add.cell.ids=samples)
sce.all <- JoinLayers(sce.all) # seurat v5
sce.all

## 简单探索一下数据结构
as.data.frame(sce.all[["Spatial"]]$counts[1:4,1:4])
as.data.frame(LayerData(sce.all, assay = "Spatial", layer = "counts")[1:5,1:5])
head(sce.all@meta.data)
table(sce.all$orig.ident)

Layers(sce.all)
Assays(sce.all)

library(qs)
qsave(sce.all, file="GSE190811/sce.all.qs")

简单查看一下数据的分布

每个样本中 nCount_SpatialnFeature_Spatial的小提琴分布:

代码语言:javascript代码运行次数:0运行复制
# Data preprocessing
plot1 <- VlnPlot(sce.all, features = c("nCount_Spatial","nFeature_Spatial"), pt.size = 0.1) + NoLegend()
plot1
ggsave(filename = "1-QC/VlnPlot.pdf", width = 12,height = 6, plot = plot1)

空间分布特征模式:nCount_Spatial

代码语言:javascript代码运行次数:0运行复制
plot2 <- SpatialFeaturePlot(sce.all, features = "nCount_Spatial",pt.size.factor = 2) 
ggsave(filename = "1-QC/FeaturePlot_nCount.pdf", width = 12,height = 6, plot = plot2)

nFeature_Spatial 分布:

代码语言:javascript代码运行次数:0运行复制
plot3 <- SpatialFeaturePlot(sce.all, features = "nFeature_Spatial",pt.size.factor = 2) 
ggsave(filename = "1-QC/FeaturePlot_nFeature.pdf", width = 12,height = 6, plot = plot3)

然后就是后面的空转标准分析

降维聚类分群:

代码语言:javascript代码运行次数:0运行复制
## 计算线粒体/核糖体/特定基因集的比例
mito_genes <- grep("^MT-", rownames(sce.all),ignore.case = T, value = T) 
# 可能是13个线粒体基因
print(mito_genes)
sce.all <- PercentageFeatureSet(sce.all, features = mito_genes, col.name = "percent_mito")
fivenum(sce.all@meta.data$percent_mito)
# 另外一种方法
# sce.all$percent_mito <- (Matrix::colSums(sce.all[["Spatial"]]$counts[mt.genes, ])/Matrix::colSums(sce.all[["Spatial"]]$counts))*100
# fivenum(sce.all$percent_mito)


## 过滤spot
sce.all <- subset(sce.all, subset = nFeature_Spatial > 200 )

## 标准化
options(future.globals.maxSize= 5*1024*1024^2)
sce.all <- SCTransform(sce.all, assay = "Spatial", verbose = T)
sce.all <- RunPCA(sce.all, assay = "SCT", verbose = FALSE)
sce.all <- FindNeighbors(sce.all, reduction = "pca", dims = 1:30)
sce.all <- FindClusters(sce.all, verbose = FALSE,resolution = 0.8)
sce.all <- RunUMAP(sce.all, reduction = "pca", dims = 1:30)

p1 <- DimPlot(sce.all, reduction = "umap", label = TRUE,label.size = 7)
p2 <- SpatialDimPlot(sce.all, label = TRUE, label.size = 3,pt.size.factor = 2,image.alpha = 0.6)
p <- p1 + p2
p
ggsave(filename = "1-QC/DimPlot.pdf", width = 12,height = 6, plot = p)

降维聚类结果如下:

上面的代码运行过程中踩了两个坑,对应的报错以及github上面给出的解决方案如下:

踩坑1:矩阵在log后含NA值,,我这里 设置了 CreateSeuratObject(counts = data, assay = "Spatial", min.cells = 3, project = pro)的时候min.cells过滤掉在所有细胞中表达都为0的基因。

代码语言:javascript代码运行次数:0运行复制
Running SCTransform on assay: Spatial
Running SCTransform on layer: counts
vst.flavor='v2' set. Using model with fixed slope and excluding poisson genes.
Error in make_cell_attr(umi, cell_attr, latent_var, batch_var, latent_var_nonreg,  : 
  cell attribute "log_umi" contains NA, NaN, or infinite value

踩坑2:运行内存不足

代码语言:javascript代码运行次数:0运行复制
Error in getGlobalsAndPackages(expr, envir = envir, globals = globals) : 
  The total size of the 19 globals exported for future expression (‘FUN()’) is 555.16 MiB.. This exceeds the maximum allowed size of 500.00 MiB (option 'future.globals.maxSize'). The three largest globals are ‘FUN’ (538.70 MiB of class ‘function’), ‘umi_bin’ (16.12 MiB of class ‘numeric’) and ‘data_step1’ (270.82 KiB of class ‘list’)

到这里简单的基础分析就做完了~

本文参与 腾讯云自媒体同步曝光计划,分享自微信公众号。原始发表:2025-03-18,如有侵权请联系 cloudcommunity@tencent 删除数据分析imagespatial数据json

本文标签: 4个NC杂志的空间转录组数据分析(GSE190811)