R: How to standardize a data frame which contains both numeric and factor variables

My data frame my.data contains both numeric and factor variables. I want to standardize only the numeric variables in this data frame.

> mydata2=data.frame(scale(my.data, center=T, scale=T))
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric

Can I standardize it this way? I want to standardize columns 8, 9, 10, 11 and 12, but I think my code is wrong.

mydata=data.frame(scale(flowdis3[,c(8,9,10,11,12)], center=T, scale=T,))

Thanks in advance.

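Related discussion

Could you explain a bit more?
library(dplyr) provides the function mutate_if, which applies an operation only to the columns that meet a condition. Here we scale a variable if and only if it is numeric.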
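In dplyr 1.0.0 and later, mutate_if() is marked as superseded in favour of across(). A minimal sketch of the same idea, assuming dplyr >= 1.0.0 is installed; the data frame toy is a made-up example and not part of the original post:

```r
library(dplyr)

# Hypothetical toy data (not from the original post)
toy <- data.frame(
  Age    = c(21, 19, 25, 34),
  Gender = factor(c("Female", "Female", "Male", "Female"))
)

# Scale only the numeric columns; as.vector() drops the one-column
# matrix structure and the attributes that scale() returns
toy_scaled <- toy %>%
  mutate(across(where(is.numeric), ~ as.vector(scale(.x))))

str(toy_scaled)
```

The factor column is left untouched because where(is.numeric) does not select it.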

Here are some options to consider, although this answer is late:

# Working environment and memory management
rm(list = ls(all.names = TRUE))
gc()
# memory.limit() is Windows-only and only a stub from R 4.2.0 onwards
# memory.limit(size = 64935)

# Set working directory ("path" is a placeholder)
# setwd("path")

# Example data frame
# stringsAsFactors = TRUE reproduces the factor columns shown below on R >= 4.0.0
df <- data.frame("Age" = c(21, 19, 25, 34, 45, 63, 39, 28, 50, 39),
                 "Name" = c("Christine", "Kim", "Kevin", "Aishwarya", "Rafel", "Bettina", "Joshua", "Afreen", "Wang", "Kerubo"),
                 "Salary in $" = c(2137.52, 1515.79, 2212.81, 2500.28, 2660, 4567.45, 2733, 3314, 5757.11, 4435.99),
                 "Gender" = c("Female", "Female", "Male", "Female", "Male", "Female", "Male", "Female", "Male", "Male"),
                 "Height in cm" = c(172, 166, 191, 169, 179, 177, 181, 155, 154, 183),
                 "Weight in kg" = c(60, 70, 88, 48, 71, 51, 65, 44, 53, 91),
                 stringsAsFactors = TRUE)

Let's check the structure of df:

str(df)
'data.frame': 10 obs. of 6 variables:
 $ Age         : num 21 19 25 34 45 63 39 28 50 39
 $ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
 $ Salary.in.. : num 2138 1516 2213 2500 2660 ...
 $ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
 $ Height.in.cm: num 172 166 191 169 179 177 181 155 154 183
 $ Weight.in.kg: num 60 70 88 48 71 51 65 44 53 91

We see that Age, Salary, Height and Weight are numeric, while Name and Gender are categorical (factor) variables.
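The same check can be done programmatically. The sketch below uses a small made-up data frame (demo is a hypothetical name) and sets stringsAsFactors = TRUE explicitly, since from R 4.0.0 onwards data.frame() no longer converts strings to factors by default:

```r
# Hypothetical miniature of the example data
demo <- data.frame(
  Age    = c(21, 19, 25),
  Gender = c("Female", "Female", "Male"),
  stringsAsFactors = TRUE
)

# Logical vector marking which columns are numeric
num_cols <- sapply(demo, is.numeric)
print(num_cols)

# Names of the columns that would be scaled
print(names(demo)[num_cols])
```

is.numeric() returns FALSE for factors even though they are stored internally as integer codes, so factor columns are skipped by every option in this answer.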

Let's scale the numeric variables using base R only:

1) Option: (a slight modification of what akrun suggested here)

start_time1 <- Sys.time()
df1 <- as.data.frame(lapply(df, function(x) if (is.numeric(x)) {
  (x - mean(x)) / sd(x)
} else x))
end_time1 <- Sys.time()
end_time1 - start_time1

Time difference of 0.02717805 secs
str(df1)
'data.frame': 10 obs. of 6 variables:
 $ Age         : num -1.105 -1.249 -0.816 -0.166 0.628 ...
 $ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
 $ Salary.in.. : num -0.787 -1.255 -0.731 -0.514 -0.394 ...
 $ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
 $ Height.in.cm: num -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
 $ Weight.in.kg: num -0.254 0.365 1.478 -0.996 0.427 ...
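The manual formula in this option is the usual z-score, and it should agree with what scale() computes with its default arguments. A quick check on the Age values from the example (the variable names manual and via_scale are only for illustration):

```r
# Age values from the example data frame
x <- c(21, 19, 25, 34, 45, 63, 39, 28, 50, 39)

# z-score by hand: subtract the mean, divide by the sample standard deviation
manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.vector() makes it comparable
via_scale <- as.vector(scale(x))

print(all.equal(manual, via_scale))

# After standardizing, the mean is (numerically) 0 and the sd is 1
print(round(mean(manual), 10))
print(sd(manual))
```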

2) Option: (akrun's approach)

start_time2 <- Sys.time()
df2 <- as.data.frame(lapply(df, function(x) if (is.numeric(x)) {
  scale(x, center = TRUE, scale = TRUE)
} else x))
end_time2 <- Sys.time()
end_time2 - start_time2

Time difference of 0.02599907 secs
str(df2)
'data.frame': 10 obs. of 6 variables:
 $ Age         : num -1.105 -1.249 -0.816 -0.166 0.628 ...
 $ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
 $ Salary.in.. : num -0.787 -1.255 -0.731 -0.514 -0.394 ...
 $ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
 $ Height.in.cm: num -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
 $ Weight.in.kg: num -0.254 0.365 1.478 -0.996 0.427 ...

3) Option:

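start_time3 <- Sys.time()
indices <- sapply(df, is.numeric)
df3 <- df
df3[indices] <- lapply(df3[indices], scale)
end_time3 <- Sys.time()
end_time3 - start_time3

# The original post subtracted start_time3 from end_time2 here and printed
# "Time difference of -59.6766 secs", which is not a valid timing for this option.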
str(df3)
'data.frame': 10 obs. of 6 variables:
 $ Age         : num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
  ..- attr(*, "scaled:center")= num 36.3
  ..- attr(*, "scaled:scale")= num 13.8
 $ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
 $ Salary.in.. : num [1:10, 1] -0.787 -1.255 -0.731 -0.514 -0.394 ...
  ..- attr(*, "scaled:center")= num 3183
  ..- attr(*, "scaled:scale")= num 1329
 $ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
 $ Height.in.cm: num [1:10, 1] -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
  ..- attr(*, "scaled:center")= num 173
  ..- attr(*, "scaled:scale")= num 12
 $ Weight.in.kg: num [1:10, 1] -0.254 0.365 1.478 -0.996 0.427 ...
  ..- attr(*, "scaled:center")= num 64.1
  ..- attr(*, "scaled:scale")= num 16.2
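A single Sys.time() difference on a 10-row data frame mostly measures noise, and it is easy to mix up the timestamp variables. A rough alternative sketch wraps repeated runs in base R's system.time(); the data frame bench_df, the helper scale_numeric and the repetition count of 1000 are arbitrary choices for illustration:

```r
# Hypothetical small data frame for timing
bench_df <- data.frame(
  Age    = c(21, 19, 25, 34, 45, 63, 39, 28, 50, 39),
  Gender = factor(rep(c("Female", "Male"), 5))
)

# Same technique as option 3: scale the numeric columns in place
scale_numeric <- function(d) {
  idx <- sapply(d, is.numeric)
  d[idx] <- lapply(d[idx], scale)
  d
}

# Time 1000 repetitions instead of a single call
timing <- system.time(
  for (i in 1:1000) res <- scale_numeric(bench_df)
)
print(timing[["elapsed"]])
```

Dividing the elapsed time by the number of repetitions gives a per-call estimate that is less sensitive to one-off delays.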

4) Option: (using tidyverse and calling dplyr)

library(tidyverse)
start_time4 <- Sys.time()
df4 <- df %>% dplyr::mutate_if(is.numeric, scale)
end_time4 <- Sys.time()
end_time4 - start_time4

Time difference of 0.012043 secs
str(df4)
'data.frame': 10 obs. of 6 variables:
 $ Age         : num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
  ..- attr(*, "scaled:center")= num 36.3
  ..- attr(*, "scaled:scale")= num 13.8
 $ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
 $ Salary.in.. : num [1:10, 1] -0.787 -1.255 -0.731 -0.514 -0.394 ...
  ..- attr(*, "scaled:center")= num 3183
  ..- attr(*, "scaled:scale")= num 1329
 $ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
 $ Height.in.cm: num [1:10, 1] -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
  ..- attr(*, "scaled:center")= num 173
  ..- attr(*, "scaled:scale")= num 12
 $ Weight.in.kg: num [1:10, 1] -0.254 0.365 1.478 -0.996 0.427 ...
  ..- attr(*, "scaled:center")= num 64.1
  ..- attr(*, "scaled:scale")= num 16.2

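Which option to choose depends on the output structure you need and on speed. If your data are imbalanced and you want to balance them, say because you plan to run a classification after scaling the numeric variables, then the one-column matrix structure of the scaled numeric variables (Age, Salary, Height and Weight) will cause problems. I mean,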

str(df4$Age)
 num [1:10, 1] -1.105 -1.249 -0.816 -0.166 0.628 ...
 - attr(*, "scaled:center")= num 36.3
 - attr(*, "scaled:scale")= num 13.8

For example, the ROSE package (for balancing data) throws an error, because it does not accept data structures other than int, factor and num.
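One way to spot such columns before passing the data on is to test each column with is.matrix(). A sketch with a made-up two-column data frame (m_df is a hypothetical name):

```r
# Hypothetical data frame with one numeric and one factor column
m_df <- data.frame(
  Age    = c(21, 19, 25, 34),
  Gender = factor(c("Female", "Female", "Male", "Female"))
)

# Scale the numeric columns as in option 3; scale() returns matrices
idx <- sapply(m_df, is.numeric)
m_df[idx] <- lapply(m_df[idx], scale)

# Flag the columns that are stored as matrices rather than plain vectors
is_matrix_col <- vapply(m_df, is.matrix, logical(1))
print(is_matrix_col)
print(dim(m_df$Age))
```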

To avoid this problem, the scaled numeric variables can be stored as vectors instead of column matrices, as follows:

library(tidyverse)
start_time4 <- Sys.time()
df4 <- df %>% dplyr::mutate_if(is.numeric, ~ as.vector(scale(.)))
end_time4 <- Sys.time()
end_time4 - start_time4

with the output:

Time difference of 0.01400399 secs
str(df4)
'data.frame': 10 obs. of 6 variables:
 $ Age         : num -1.105 -1.249 -0.816 -0.166 0.628 ...
 $ Name        : Factor w/ 10 levels "Afreen","Aishwarya",..: 4 8 7 2 9 3 5 1 10 6
 $ Salary.in.. : num -0.787 -1.255 -0.731 -0.514 -0.394 ...
 $ Gender      : Factor w/ 2 levels "Female","Male": 1 1 2 1 2 1 2 1 2 2
 $ Height.in.cm: num -0.0585 -0.5596 1.5285 -0.309 0.5262 ...
 $ Weight.in.kg: num -0.254 0.365 1.478 -0.996 0.427 ...
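For the original question (standardizing a fixed set of columns by position), a base R sketch along the same lines. The data frame fd is a made-up stand-in for flowdis3, and the positions 2:3 stand in for columns 8 to 12:

```r
# Hypothetical stand-in for flowdis3
fd <- data.frame(
  id    = factor(c("a", "b", "c", "d")),
  flow  = c(10, 12, 9, 15),
  depth = c(1.2, 0.8, 1.5, 1.1)
)

# Column positions to standardize (8:12 in the original question)
cols <- 2:3

# Scale in place and store each result as a plain numeric vector
fd[cols] <- lapply(fd[cols], function(x) as.vector(scale(x)))

str(fd)
```

Assigning back into fd[cols] keeps the remaining columns in place, whereas data.frame(scale(flowdis3[, cols])) returns only the scaled columns.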

Source: https://www.codenong.com/36697424/
