If you attempt to perform k-means clustering on data that contains NA, NaN, or Inf values, R raises the following error: Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1).
The kmeans() function in R cannot handle NA, NaN, or Inf values. Once such values are present, means and squared distances are no longer well defined, so the algorithm cannot determine which cluster center is closest to a point. The "foreign function call" in the message refers to the compiled code that kmeans() passes the data to, and arg 1 is the data itself.
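To see why these values are a problem, it helps to look at how they propagate through ordinary arithmetic in base R. The short sketch below uses made-up vectors (not the tutorial's data) purely for illustration:

```r
# Illustrative values only; these vectors are not part of the tutorial's data
x_na  <- c(2, NA, 4)
x_inf <- c(2, Inf, 4)

mean(x_na)    # NA: one missing value makes the whole mean missing
mean(x_inf)   # Inf: one infinite value makes the mean infinite
Inf - Inf     # NaN: the difference between two infinite values is undefined

# Squared Euclidean distance from a point to a candidate cluster center
point  <- c(1, NaN)
center <- c(0, 0)
sum((point - center)^2)   # NaN: this distance cannot be ranked against others
```

A cluster center is a mean and cluster assignment compares distances, so a single such value is enough to make both steps meaningless.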
You can fix this error by replacing the Inf values with NA and then removing the rows that contain missing values with na.omit(). Alternatively, you can impute the missing values.
This tutorial will go over the error in detail and show you how to fix it using code examples.
Example-1
Consider the following data frame, which contains a number of NaN, NA, and Inf values.
df <- data.frame(var1=c(2, NaN, 4, 6, 7, Inf, 8, 6, 10, 12),
                 var2=c(NaN, 14, 14, 7, 7, 15, 10, 9, 9, Inf),
                 var3=c(22, NA, 19, 23, 25, 21, 19, 16, 12, 15))
df

   var1 var2 var3
1     2  NaN   22
2   NaN   14   NA
3     4   14   19
4     6    7   23
5     7    7   25
6   Inf   15   21
7     8   10   19
8     6    9   16
9    10    9   12
10   12  Inf   15
Let’s try k-means clustering on the data frame with the kmeans() function:
km <- kmeans(df, centers=3)

Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
Because the data frame contains NA, NaN, and Inf values, the error occurs.
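On a larger data set the offending values are not visible at a glance. One way to locate them, sketched here with base R only, is to count the non-finite entries per column and list the affected rows. The names bad_per_column and bad_rows are made up for this example:

```r
df <- data.frame(var1 = c(2, NaN, 4, 6, 7, Inf, 8, 6, 10, 12),
                 var2 = c(NaN, 14, 14, 7, 7, 15, 10, 9, 9, Inf),
                 var3 = c(22, NA, 19, 23, 25, 21, 19, 16, 12, 15))

# is.finite() is FALSE for NA, NaN, Inf and -Inf alike
bad_per_column <- sapply(df, function(x) sum(!is.finite(x)))
bad_per_column
# var1 var2 var3
#    2    2    1

# Rows that contain at least one non-finite value
bad_rows <- which(rowSums(!sapply(df, is.finite)) > 0)
bad_rows
# [1]  1  2  6 10
```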
Solution To This Error
1. Remove Rows
We must remove from the data frame the values that kmeans() cannot handle. Using do.call() together with lapply(), we first replace the Inf values with NA.
df_noinf <- do.call(data.frame, lapply(df, function(x) replace(x, is.infinite(x), NA)))
df_noinf
lapply() applies replace() to every column, and do.call() reassembles the resulting list of columns into a data frame. Let’s take a look at the new data frame.
   var1 var2 var3
1     2  NaN   22
2   NaN   14   NA
3     4   14   19
4     6    7   23
5     7    7   25
6    NA   15   21
7     8   10   19
8     6    9   16
9    10    9   12
10   12   NA   15
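do.call() is not the only way to do this replacement. A common base R idiom, shown here as an alternative sketch rather than as this tutorial's method, assigns the result of lapply() into df_alt[] so that the object stays a data frame. The name df_alt is made up for this example:

```r
df <- data.frame(var1 = c(2, NaN, 4, 6, 7, Inf, 8, 6, 10, 12),
                 var2 = c(NaN, 14, 14, 7, 7, 15, 10, 9, 9, Inf),
                 var3 = c(22, NA, 19, 23, 25, 21, 19, 16, 12, 15))

# Work on a copy so the original data frame is left untouched
df_alt <- df

# The empty brackets keep the data frame structure and overwrite the columns
df_alt[] <- lapply(df_alt, function(x) replace(x, is.infinite(x), NA))

df_alt$var1[6]    # NA (was Inf)
df_alt$var2[10]   # NA (was Inf)
```

is.infinite() is also TRUE for -Inf, so negative infinities would be replaced as well.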
Then we’ll use the na.omit() function to remove the rows with NA and NaN values.
df_clean <- na.omit(df_noinf)
df_clean
The clean data frame looks like this:
  var1 var2 var3
3    4   14   19
4    6    7   23
5    7    7   25
7    8   10   19
8    6    9   16
9   10    9   12
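The replace-then-omit steps above can also be collapsed into a single filter, because is.finite() rejects NA, NaN, and Inf in one test. This is a minimal sketch of that alternative. The names keep and df_finite are made up for this example:

```r
df <- data.frame(var1 = c(2, NaN, 4, 6, 7, Inf, 8, 6, 10, 12),
                 var2 = c(NaN, 14, 14, 7, 7, 15, 10, 9, 9, Inf),
                 var3 = c(22, NA, 19, 23, 25, 21, 19, 16, 12, 15))

# TRUE only for rows in which every column holds a finite number
keep <- rowSums(!sapply(df, is.finite)) == 0

df_finite <- df[keep, ]
rownames(df_finite)
# [1] "3" "4" "5" "7" "8" "9"
```

The surviving rows are the same six rows that na.omit() keeps in the two-step version.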
With a clean data frame, we can now run the k-means clustering algorithm to obtain cluster information.
km <- kmeans(df_clean, centers=3)
km
Let’s run the code to see what happens:
K-means clustering with 3 clusters of sizes 3, 1, 2

Cluster means:
  var1 var2 var3
1  6.0   11   18
2 10.0    9   12
3  6.5    7   24

Clustering vector:
3 4 5 7 8 9
1 3 3 1 1 2

Within cluster sum of squares by cluster:
[1] 28.0  0.0  2.5
 (between_SS / total_SS =  81.4 %)
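Because kmeans() chooses its starting centers at random when centers is a number, the cluster labels and even the cluster sizes can differ from run to run, so your output may not match the one above. The sketch below shows one way to make a run repeatable with set.seed() and the nstart argument. The seed value 42 is arbitrary, and the data frame is retyped by hand here, so its row names are 1 to 6 rather than 3, 4, 5, 7, 8, 9:

```r
# Same values as df_clean in the tutorial, retyped so the block runs on its own
df_clean <- data.frame(var1 = c(4, 6, 7, 8, 6, 10),
                       var2 = c(14, 7, 7, 10, 9, 9),
                       var3 = c(19, 23, 25, 19, 16, 12))

set.seed(42)   # arbitrary seed, chosen only for this example
km1 <- kmeans(df_clean, centers = 3, nstart = 25)   # best of 25 random starts

set.seed(42)
km2 <- kmeans(df_clean, centers = 3, nstart = 25)

identical(km1$cluster, km2$cluster)   # TRUE: same seed, same assignment
```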
2. Impute Values
If we want to keep the number of rows the same, we can substitute values for the NA and NaN values.
df_noinf$var1[is.na(df_noinf$var1)] <- mean(df_noinf$var1, na.rm=TRUE)
df_noinf$var2[is.na(df_noinf$var2)] <- mean(df_noinf$var2, na.rm=TRUE)
df_noinf$var3[is.na(df_noinf$var3)] <- mean(df_noinf$var3, na.rm=TRUE)
df_noinf
The preceding code uses logical subscripting to replace the missing values in each column with that column's mean, computed from the non-missing entries. is.na() returns TRUE for both NA and NaN, so both are imputed. Let’s take a look at the new data frame.
     var1   var2     var3
1   2.000 10.625 22.00000
2   6.875 14.000 19.11111
3   4.000 14.000 19.00000
4   6.000  7.000 23.00000
5   7.000  7.000 25.00000
6   6.875 15.000 21.00000
7   8.000 10.000 19.00000
8   6.000  9.000 16.00000
9  10.000  9.000 12.00000
10 12.000 10.625 15.00000
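Writing one line per column becomes tedious as the number of columns grows. The same mean imputation can be sketched as a single lapply() over all columns, starting from the original data frame. The name df_imp is made up for this example, and the approach assumes every column is numeric:

```r
df <- data.frame(var1 = c(2, NaN, 4, 6, 7, Inf, 8, 6, 10, 12),
                 var2 = c(NaN, 14, 14, 7, 7, 15, 10, 9, 9, Inf),
                 var3 = c(22, NA, 19, 23, 25, 21, 19, 16, 12, 15))

df_imp <- df
df_imp[] <- lapply(df_imp, function(x) {
  x[!is.finite(x)] <- NA                 # treat Inf like a missing value
  x[is.na(x)] <- mean(x, na.rm = TRUE)   # fill the gaps with the column mean
  x
})

df_imp$var1[2]      # 6.875
sum(is.na(df_imp))  # 0
```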
With every missing value imputed, we can run the k-means clustering algorithm on the full data frame.
km <- kmeans(df_noinf, centers=3)
km
Let’s run the code to see what happens:
K-means clustering with 3 clusters of sizes 3, 4, 3

Cluster means:
      var1      var2     var3
1 9.333333  9.541667 14.33333
2 6.437500 13.250000 19.52778
3 5.000000  8.208333 23.33333

Clustering vector:
 [1] 3 2 2 3 3 2 2 1 1 1

Within cluster sum of squares by cluster:
[1] 29.09375 26.41377 27.42708
 (between_SS / total_SS =  70.8 %)
Conclusion
And that brings us to the end of this article. If you’re reading this, then that means you actually read this article from start to finish. First of all, congratulations on that.
This is the third article in which we’ve covered an error related to the R language. We intend to cover more in the future. If you have a particular error you’d like us to cover, please let us know in the comments below.
And, of course, if you have any questions about this guide, let us know in the comments as well.