R/data.table: separate columns and count occurrences

I have a large data.table (only five rows are shown here).

taxpath N
Bacteroidetes; Flavobacteriia; Flavobacteriales; Flavobacteriaceae; Formosa; Formosa sp. Hel3_A1_48; 57
Bacteroidetes; Flavobacteriia; Flavobacteriales; Cryomorphaceae; NA; Cryomorphaceae bacterium BACL29 MAG-121220-bin8; 54
Proteobacteria; Alphaproteobacteria; Pelagibacterales; Pelagibacteraceae; Candidatus Pelagibacter; NA; 53
Proteobacteria; Alphaproteobacteria; Pelagibacterales; NA; NA; NA; 41
Planctomycetes; NA; NA; NA; NA; Planctomycetes bacterium TMED84; 41

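The first column is taxpath (phylum, class, order, family, genus, species, from left to right); the second column is N, the frequency with which each taxpath occurs.

What I want to do is split each taxpath on the semicolons and use the first entry.

I want to count how often each phylum (the first rank, i.e. Bacteroidetes, Proteobacteria, or Planctomycetes) occurs. However, each row should count N times, so the result is the sum of column N per phylum.

So what I expect is more or less this: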

phylum Nnew
Bacteroidetes 111
Proteobacteria 94
Planctomycetes 41

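Can you help me with how to split within the column and then, I think, group by phylum while weighting by column N?

(PS: Later I would like to do the same for the other elements of the taxpath column, but I think it is easier to put those into separate tables.)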

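Comments

The second part of the question is not clear. Can you show the expected output?
For example, Proteobacteria appears in two rows (row 3 and row 8). Row 3 has the value 53 and row 8 has the value 41. The output I expect has the entry Proteobacteria in column phylum and the value 94 (53 + 41) in column Nnew. Is it clear what I mean?
Can you check the updated code?
Based on the example, I get 326 for Bacteroidetes.
Thanks, I have shortened the input data to 5 rows instead of 10.
No problem, my output is based on the 10 rows you showed earlier.
Great, thank you very much.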

This is tagged data.table, so here is a simple data.table solution.

library(data.table)
# Sum N per phylum; sub() strips everything from the first ";" onward
DT[, .(Nnew = sum(N)), by = sub(";.*", "", taxpath)]
#               sub Nnew
# 1:  Bacteroidetes  111
# 2: Proteobacteria   94
# 3: Planctomycetes   41

We basically sum N while extracting the first part of taxpath on the fly in the by statement.
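To make the grouping step easier to follow, here is a minimal sketch that shows what the sub() call returns and names the grouping column phylum instead of the default sub. The small table below is an assumption for illustration: it keeps only the first two ranks of each taxpath from the question, which is enough because only the first field matters here.

```r
library(data.table)

# Illustrative data: the five rows from the question, truncated to two ranks
DT <- data.table(
  taxpath = c("Bacteroidetes; Flavobacteriia",
              "Bacteroidetes; Flavobacteriia",
              "Proteobacteria; Alphaproteobacteria",
              "Proteobacteria; Alphaproteobacteria",
              "Planctomycetes; NA"),
  N = c(57L, 54L, 53L, 41L, 41L)
)

# sub() removes everything from the first semicolon onward,
# leaving only the phylum
first_rank <- sub(";.*", "", DT$taxpath)
print(first_rank)

# Same aggregation as above, with the grouping column named explicitly
res <- DT[, .(Nnew = sum(N)), by = .(phylum = sub(";.*", "", taxpath))]
print(res)
```

Groups come back in order of first appearance, so the totals are 111 for Bacteroidetes, 94 for Proteobacteria, and 41 for Planctomycetes.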

Data

# sep is set explicitly so that the semicolons inside taxpath are not
# mistaken for the column separator
DT <- fread("taxpath\t N
Bacteroidetes; Flavobacteriia; Flavobacteriales; Flavobacteriaceae; Formosa; Formosa sp. Hel3_A1_48;\t 57
Bacteroidetes; Flavobacteriia; Flavobacteriales; Cryomorphaceae; NA; Cryomorphaceae bacterium BACL29 MAG-121220-bin8;\t 54
Proteobacteria; Alphaproteobacteria; Pelagibacterales; Pelagibacteraceae; Candidatus Pelagibacter; NA;\t 53
Proteobacteria; Alphaproteobacteria; Pelagibacterales; NA; NA; NA;\t 41
Planctomycetes; NA; NA; NA; NA; Planctomycetes bacterium TMED84;\t 41", sep = "\t")
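For the follow-up in the question (doing the same for the other ranks), one possible approach is to split taxpath into one column per rank with data.table's tstrsplit() and then group by whichever rank is needed. This is a sketch rather than part of the original answer; the column names in ranks are assumptions taken from the rank order given in the question, and the data is retyped so the block runs on its own.

```r
library(data.table)

# Assumed column names, following the rank order described in the question
ranks <- c("phylum", "class", "order", "family", "genus", "species")

DT <- data.table(
  taxpath = c(
    "Bacteroidetes; Flavobacteriia; Flavobacteriales; Flavobacteriaceae; Formosa; Formosa sp. Hel3_A1_48;",
    "Bacteroidetes; Flavobacteriia; Flavobacteriales; Cryomorphaceae; NA; Cryomorphaceae bacterium BACL29 MAG-121220-bin8;",
    "Proteobacteria; Alphaproteobacteria; Pelagibacterales; Pelagibacteraceae; Candidatus Pelagibacter; NA;",
    "Proteobacteria; Alphaproteobacteria; Pelagibacterales; NA; NA; NA;",
    "Planctomycetes; NA; NA; NA; NA; Planctomycetes bacterium TMED84;"
  ),
  N = c(57L, 54L, 53L, 41L, 41L)
)

# Split on ";" followed by optional whitespace. strsplit() drops the empty
# piece after the trailing ";", so each row yields exactly six fields.
# Note that "NA" stays a literal string here, not a missing value.
DT[, (ranks) := tstrsplit(taxpath, ";\\s*")]

# Weighted count for any rank: sum N within each value of that rank
class_counts  <- DT[, .(Nnew = sum(N)), by = "class"]
family_counts <- DT[, .(Nnew = sum(N)), by = "family"]
print(class_counts)
print(family_counts)
```

Passing the rank as a string to by avoids any confusion with base R functions such as class() and order() that share a name with the new columns.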

We can use separate to split 'taxpath' on the delimiter ; into the specified columns, then group by 'phylum' and take the sum of 'N'.

library(tidyverse)
newcols <- c("phylum", "class", "order", "family", "genus", "species")
# df1: the input table with columns taxpath and N
df1 %>%
  mutate(taxpath = sub(";$", "", taxpath)) %>%   # drop the trailing ";"
  separate(taxpath, into = newcols, sep = ";\\s*") %>%
  group_by(phylum) %>%
  summarise(Nnew = sum(N))
# Output (computed on the asker's original 10-row input, hence 326 rather than 111):
# A tibble: 3 x 2
#   phylum          Nnew
#   <chr>          <int>
# 1 Bacteroidetes    326
# 2 Planctomycetes    41
# 3 Proteobacteria    94

Source: https://www.codenong.com/50333127/
