June 20, 2025

Green Health Revolution

Natural Health, Harmonious Life

A recursive embedding and clustering technique for unraveling asymptomatic kidney disease using laboratory data and machine learning

Table 3 shows the 40 most correlated features in the collected dataset. The Age_Years feature shows a typical age distribution in the sample population. The platelet counts suggest normal platelet levels. Total protein in serum shows a shift in the distribution that might indicate protein loss. Serum creatinine is a direct indicator of kidney function, and a peak in the higher range could indicate the presence of CKD. The distribution of absolute eosinophil count shows a narrow peak, suggesting few fluctuations in eosinophil levels among the subjects. Higher absolute basophil counts can be observed in chronic inflammation. A low peak in the absolute lymphocyte count might indicate immune suppression. High BMI values are associated with increased CKD risk. A shift to higher values in the BUN distribution might indicate kidney function impairment. A high total leucocyte count suggests infection or inflammation. The eGFR is a key marker for CKD, with lower values indicating reduced kidney function; a peak at a low eGFR could signify the presence of advanced CKD. Low sodium levels could indicate the fluid imbalances often observed in CKD patients. Elevated potassium levels in CKD can lead to complications such as hyperkalemia, which is associated with severe health risks. Low calcium levels are common in CKD-related bone disease. High values of the BUN/creatinine ratio may indicate reduced kidney function. Low albumin is associated with malnutrition and advanced CKD. No feature shows distinct clusters, and the data points overlap considerably.

Table 3 Correlations between features.

Principal component analysis (PCA)

We focused on the following features to understand the results of PCA [53]: collect_year, age_years, BMI, sex, creatinine in serum, eGFR, BUN, total protein in serum, albumin in serum, sodium (Na) in serum, potassium (K) in serum, calcium in serum (total), the BUN/creatinine ratio, platelet count, absolute eosinophil count, and absolute lymphocyte count. The explained variance (95%) is the proportion of the dataset’s total variance that is explained by the principal components.

These selected components together explain 95% of the variance, suggesting that a relatively small number of components can effectively summarize the data.

We obtained a cumulative variance of 0.954, which means that the first few principal components together explain 95.4% of the total variance in the dataset. This is generally a good amount of explained variance, suggesting that the selected components capture most of the information in the data. The component variance (0.29) indicates that an individual component explains 29% of the variance, a substantial share on its own.
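The link between per-component variance and the cumulative figure can be illustrated with a minimal sketch. The eigenvalues below are hypothetical and are not taken from the study, and the helper name `components_for_variance` is only illustrative.

```python
from itertools import accumulate


def components_for_variance(eigenvalues, target=0.95):
    """Return (k, ratios, cumulative), where k is the smallest number of
    leading components whose cumulative explained-variance ratio reaches
    the target."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    ratios = [value / total for value in ordered]
    cumulative = list(accumulate(ratios))
    for k, reached in enumerate(cumulative, start=1):
        if reached >= target:
            return k, ratios, cumulative
    return len(ordered), ratios, cumulative


# Hypothetical eigenvalues of a covariance matrix (assumption, not study data).
eigenvalues = [4.0, 3.0, 2.0, 0.6, 0.3, 0.1]
k, ratios, cumulative = components_for_variance(eigenvalues, target=0.95)
print(k)                      # number of components kept
print(round(ratios[0], 3))    # share of variance of the first component
print(round(cumulative[k - 1], 3))
```

In this toy case the first component alone carries 40% of the variance, and four components are needed before the cumulative share passes the 95% target.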

As the first step, we adjusted the global structure of the features and obtained the first embedding through t-SNE. In total, we acquired six embeddings via t-SNE, each followed by k-means clustering; every one of the six clustering rounds yielded 3 clusters with the cosine and Manhattan distance metrics. Clustering with these distance metrics is essential for grouping similar data together once outliers have been excluded from the dataset. Grouping similar records reduced the overall number of records until, after the sixth round of clustering, the silhouette score stabilized for 3 clusters at 0.410 to 0.412. Because k-means clustering is sensitive to outliers, we applied a distance-based method to remove points that were very far from the centroid; such outliers can introduce noise and lead to overfitting.
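A distance-based filter of this kind could be sketched as follows. The text does not state the cutoff rule, so the rule used here (drop points farther from the centroid than `factor` times the median distance) and the function names are assumptions made purely for illustration.

```python
import math
import statistics


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity); undefined for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def centroid(points):
    """Component-wise mean of a list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def drop_far_points(points, distance, factor=3.0):
    """Keep points within `factor` times the median distance to the centroid.
    The cutoff rule is an assumption; the source does not specify one."""
    center = centroid(points)
    distances = [distance(p, center) for p in points]
    cutoff = factor * statistics.median(distances)
    return [p for p, d in zip(points, distances) if d <= cutoff]


# Toy data (hypothetical): four nearby points and one extreme value.
points = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [2.0, 2.0], [50.0, 50.0]]
kept = drop_far_points(points, manhattan)
print(len(points) - len(kept), "point(s) removed")
```

The same filter accepts `cosine_distance` in place of `manhattan`, which mirrors the use of two metrics across rounds.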

The optimization was initialized with k-means++ and used 10 reruns, each limited to 100 steps, on the 1545 instances that remained after removing outliers. It focused on the top features: collect_year, sex_name, age_years, BMI, platelet count, blood urea nitrogen (BUN), total protein in serum, estimated glomerular filtration rate (eGFR), BUN/creatinine ratio, calcium in serum (total), potassium (K) in serum, albumin in serum, creatinine in serum, and sodium (Na) in serum.

Table 4 shows that the dataset decreased to 1581 records after the first clustering round, which used the cosine distance. Metric-specific requirements likely resulted in the removal of 19 records; records with zero vectors or near-zero variance are excluded, as they have no significance for angular similarity. Subsequent rounds using the Manhattan distance decreased the dataset to 1566, 1561, and eventually 1551 records through outlier and noise reduction. At this stage, the silhouette score increased slightly from 0.410 to 0.411, suggesting that the eliminated records may have added noise or ambiguity to the clusters. The 1566 records formed a more stable dataset for subsequent rounds and most likely set the basis for improved clustering quality in the following phases. The Manhattan distance, which is sensitive to magnitude variations, likely identified extreme values or undefined records as outliers during repeated clustering. We used the cosine distance metric in the final stage. This produced a stable dataset of 1545 records that did not decrease any further in the fifth or sixth rounds of clustering. The number of clusters remained at three throughout the whole process.

Table 4 Similar grouped data for six times of clustering.

The silhouette score increased marginally, from 0.410 in the first step to 0.412 at the end of the process, showing some improvement in cluster quality. The score’s stability and the constant number of clusters strongly indicate that the clustering method refined the dataset efficiently without changing the clusters’ basic structure.
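For readers unfamiliar with the silhouette score, a minimal pure-Python sketch of the standard definition follows. The toy points and labels are hypothetical, and the Manhattan metric is used only as an example; this is not the study's implementation.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def silhouette_score(points, labels, distance):
    """Mean silhouette over all points: s = (b - a) / max(a, b), where a is
    the mean distance to the point's own cluster and b is the lowest mean
    distance to any other cluster."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    total = 0.0
    for i, point in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        a = sum(distance(point, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(distance(point, points[j]) for j in members) / len(members)
            for label, members in clusters.items()
            if label != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Toy one-dimensional data (hypothetical): two tight, well-separated groups.
points = [[0.0], [1.0], [10.0], [11.0]]
labels = [0, 0, 1, 1]
score = silhouette_score(points, labels, manhattan)
print(round(score, 3))
```

Scores near 1 indicate compact, well-separated clusters, while scores near 0 indicate overlapping ones, which gives context for values around 0.41.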

Figure 3a shows the scatter plots and box plots of the data for each of the three clusters. Cluster 1 (C1) had the fewest records and cluster 2 (C2) had the most, followed by cluster 3 (C3), which was similar in size to C2 but contained fewer records.

The clusters are clearly separated from the second round of clustering onward. Cluster C1 contains 187 records, cluster C3 contains 614 records, and cluster C2 contains 765 records (1566 records in total). With a total of 1545 records at the sixth round, a chi-square statistic of 3132, 4 degrees of freedom, and a reported p value of 0, we have strong evidence to reject the null hypothesis and conclude that there are significant differences among the groups in the dataset.

Figure 3b shows the scatter plot for the validation dataset after prediction. The ANOVA value of 2857 suggests the presence of significant differences among the means of the three clusters.

The clustering revealed several patient groups with a range of important demographic and laboratory features (Fig. 3c). These groups provide clinically significant new perspectives on the early diagnosis of asymptomatic CKD. We found that C1 represents the generally healthy population, or individuals with normal lab values undergoing routine tests. C2 consists of individuals who show slight renal abnormalities in a few features, although they are asymptomatic. C3 comprises individuals who likely have early kidney disease symptoms but do not exhibit signs of ESRD or AKI.

Cluster analysis

We employed sieve diagrams in Fig. 4 to understand the cluster characteristics. Figure 4 shows the distribution of data points across a 3 × 3 grid of clusters, labeled C1, C2, and C3.

The color coding indicates the density or concentration of data points within each cluster, with red representing higher density and blue representing lower density. The legend on the left provides the cluster number and the total number of data points (N = 1545). The chi-square statistic (χ² = 3090.00) and the associated p value (p = 0.000) suggest that the distribution of data points across the clusters is statistically significant. The size of each tile represents the number of data points (or density) in that particular range. Larger areas indicate more data points for that category, whereas smaller areas indicate fewer data points. This image shows that the dataset can be divided into three distinct groups on the basis of patterns in the sodium and potassium data.

The clusters provided can be used to interpret CKD on the basis of the variation in kidney-related biochemical markers (e.g., creatinine, blood urea nitrogen, and eGFR) and other associated parameters, such as electrolytes, proteins, and immune counts. Below is a cluster-based interpretation for CKD.

According to these values, indicators for the early stages of CKD in asymptomatic individuals on the basis of laboratory data and demographic trends include the following:

Collect_year ranges from 2017 to 2022. Ages of 40 to 52 years represent middle-aged individuals at potential risk for CKD, who may show slight reductions in kidney function without obvious symptoms. This group may have a mild eGFR decline or a slight creatinine elevation, indicating a risk of early-stage disease.

Sex: an equal distribution may indicate no sex-based risk.

BMI: 24.20 to 31.25 kg/m² covers overweight and mild obesity, which are linked to an increased risk of developing CKD. This is often a reversible risk factor if lifestyle changes can be implemented before symptoms appear.

Absolute eosinophil count: 0.180 to 0.250 × 10⁹ cells/L indicates normal immune function; slight increases may be observed in response to allergies or minor infections, which may indirectly affect kidney function.

Absolute lymphocyte count: 1.92 to 2.44 × 10⁹ cells/L indicates a balanced immune response, which is crucial for avoiding infections that worsen kidney health.

Absolute basophil count: mid-range counts of 0.02 to 0.03 × 10⁹ cells/L reflect balanced immune health with no evident inflammation.

Total leucocyte count: 5.09 to 6.50 × 10⁹ cells/L is within the normal range and indicates no active infection; minor elevations within this range might suggest early immune responses.

Creatinine in serum: 0.80 to 0.88 mg/dL is within the normal range but requires additional monitoring, as it is close to the upper limit of 1.2 mg/dL.

eGFR: 60 to 65 mL/min/1.73 m² indicates early CKD with mild functional decline, generally asymptomatic but requiring monitoring to prevent further progression.

Blood urea nitrogen (BUN): 1 to 4 mg/dL indicates the normal range, suggesting efficient kidney function without signs of advanced dysfunction.

Total protein in serum: 7.2 to 7.3 g/dL is within the normal range with no evident protein loss, reflecting good nutritional status.

Albumin in serum: 4.3 to 4.4 g/dL indicates normal levels, consistent with adequate nutritional status and minor to no proteinuria.

Sodium (Na) in serum: 139 to 140 mEq/L indicates normal sodium levels and proper fluid balance.

Potassium (K) in serum: 4.4 to 4.5 mEq/L indicates a normal, healthy balance.

Calcium in serum: 7 to 9 mg/dL indicates normal calcium levels without signs of bone metabolism issues.

BUN/creatinine ratio: 16.2 to 18.9 indicates balanced filtration and normal kidney function.

Platelet count: 233 to 315 × 10⁹ cells/L indicates adequate platelet levels without bleeding risk and good bone marrow health.

While the other features are normal, eGFR and creatinine in serum serve as the primary indicators of early impairment in this cluster, as they fall at the threshold of normal kidney function. These values suggest a risk of early-stage CKD in asymptomatic patients, so regular monitoring of these two markers is crucial for these individuals.

Machine learning for automatic cluster classification

We fed the 1545 records, labeled with the three clusters, to several ML algorithms. The training set contained 80% of the data and the testing set contained 20%, with 10-fold cross-validation. Table 5 compares the performance of these algorithms, including random forest (RF), support vector machine (SVM), logistic regression (LR), neural network (NN), naive Bayes, and gradient boosting, in terms of the AUC, accuracy, F1 score, precision, recall, and Matthews correlation coefficient (MCC).

Table 5 Comparison of several ML models.

Figure 5a shows the ROC curves and Fig. 5b shows the lift curves of the gradient boosting model in classifying the three clusters. The ROC curves for all three clusters indicate that the model performs perfectly in all cases. Furthermore, the rapid increase in lift at low P rates suggests that the model identifies a large proportion of positive cases in the top segment.

The confusion matrix for the testing set shows that 127 records were classified in cluster 1, 150 in cluster 2, and 32 in cluster 3; no records were misclassified. This suggests that the clustering method separated the dataset well.
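The test-set counts sum to 309 records, which is 20% of 1545. As a minimal sketch, accuracy and the multiclass Matthews correlation coefficient can be recomputed from a confusion matrix built from these counts. The matrix layout (rows as true clusters, columns as predicted clusters) and the function names are assumptions for illustration.

```python
import math


def accuracy(cm):
    """Share of records on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def multiclass_mcc(cm):
    """Multiclass Matthews correlation coefficient computed from a square
    confusion matrix (rows: true class, columns: predicted class)."""
    size = len(cm)
    samples = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(size))
    true_counts = [sum(cm[i]) for i in range(size)]
    pred_counts = [sum(cm[i][j] for i in range(size)) for j in range(size)]
    numerator = correct * samples - sum(
        p * t for p, t in zip(pred_counts, true_counts)
    )
    denominator = math.sqrt(
        (samples ** 2 - sum(p * p for p in pred_counts))
        * (samples ** 2 - sum(t * t for t in true_counts))
    )
    return numerator / denominator if denominator else 0.0


# Diagonal matrix built from the reported test-set counts (layout assumed).
cm = [
    [127, 0, 0],
    [0, 150, 0],
    [0, 0, 32],
]
print(sum(sum(row) for row in cm))  # test-set size
print(accuracy(cm), multiclass_mcc(cm))
```

With no off-diagonal entries, both metrics reach their maximum of 1, which matches the description of a test set with no misclassified records.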

We found that the gradient boosting model performed the best. Therefore, we selected it to classify external validation data consisting of 400 records lacking target labels.

After applying the gradient boosting model for prediction, we obtained 151 records in cluster C1, 202 records in cluster C2, and 47 records in cluster C3. With 400 total records, a chi-square statistic of 800, 4 degrees of freedom, and a p value of 0, we have strong evidence to reject the null hypothesis and conclude that there are significant differences among the groups in the validation dataset.
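The text does not say which contingency table the chi-square test was run on. One reading that reproduces the reported figures, offered here only as an assumption, is a 3 × 3 table of cluster against cluster in which every record falls on the diagonal. For such a table with k groups and N records, the Pearson statistic equals N(k − 1), which gives 800 for N = 400 and k = 3, with (3 − 1)(3 − 1) = 4 degrees of freedom. A minimal sketch:

```python
def pearson_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency
    table given as a list of rows. Assumes no empty row or column."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof


# Assumed table: predicted cluster against itself, using the reported counts.
table = [
    [151, 0, 0],
    [0, 202, 0],
    [0, 0, 47],
]
chi2, dof = pearson_chi_square(table)
print(round(chi2, 6), dof)
```

Under this reading the statistic depends only on the number of records and clusters, so it measures how cleanly the records partition rather than how far apart the clusters lie in feature space.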

Across the feature importance metrics in Fig. 5 (information gain, gain ratio, and chi-square), eGFR is the most important feature for kidney function analysis, with the highest score on every metric. This supports eGFR as a significant feature for diagnosing CKD. Other features, including albumin in serum, sodium (Na) in serum, and sex, are also critical for evaluating kidney health. Cluster C2 captures individuals who may have slight problems with their kidneys but do not show symptoms of kidney disease.

The feature importance rankings confirmed the importance of traditional indicators such as blood creatinine. They also showed how important other features were, such as calcium, total protein, and electrolyte levels. These results show how vital it is to use non-traditional markers (like calcium and albumin) in regular screenings to find CKD that does not have any symptoms. This study shows that it is critical to use routine laboratory data along with ML to detect patients who do not have symptoms.

BMI ranked 14th out of 18 features, with consistently low scores across metrics such as information gain (0.010), gain ratio (0.006), chi-square (2.178), and ReliefF (0.003). These results suggest that BMI has minimal predictive impact on CKD results in our dataset. However, BMI remains an important clinical indicator of an individual’s health, and its inclusion in clinical models may enhance interpretability for practitioners. The proposed ML model did not rank BMI as an important feature, but its Spearman correlation with eGFR (r = −0.404) suggests a possible indirect link with kidney function. This study does not fully examine how BMI might affect the progression of CKD because it does not consider the inflammation that is linked to obesity, high blood pressure, or diabetes.
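To make the information gain and gain ratio scores concrete, the sketch below computes both for a discretized feature against cluster labels. The toy bins and labels are hypothetical and are not the study's data; how the study discretized continuous features is not stated, so binning is assumed to have been done beforehand.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def information_gain(feature_bins, labels):
    """Reduction in label entropy after splitting on a discretized feature."""
    n = len(labels)
    groups = {}
    for bin_value, label in zip(feature_bins, labels):
        groups.setdefault(bin_value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def gain_ratio(feature_bins, labels):
    """Information gain normalized by the entropy of the split itself."""
    split_info = entropy(feature_bins)
    if split_info == 0:
        return 0.0
    return information_gain(feature_bins, labels) / split_info


# Hypothetical toy data: one feature separates the clusters, one does not.
labels = ["C1", "C1", "C2", "C2"]
informative = ["low", "low", "high", "high"]
uninformative = ["a", "b", "a", "b"]

print(information_gain(informative, labels), gain_ratio(informative, labels))
print(information_gain(uninformative, labels), gain_ratio(uninformative, labels))
```

A feature whose bins align with the clusters scores close to the label entropy, while a feature whose bins are unrelated to the clusters scores close to zero, which is the pattern reported for BMI.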

Because of the presence of missing values, we could not verify that the training and validation datasets are statistically identical; this was an intentional choice to test the model’s robustness and applicability in situations where incomplete data is a challenge. We collected these datasets with consistent inclusion and exclusion criteria, reflecting the target population. However, the validation dataset includes missing values to mimic real-world scenarios in which incomplete data is common. This approach allows us to assess the model’s performance under realistic conditions and to check that it can handle problems like missing or noisy data. The missing values add a controlled difference, so the evaluation focuses on the model’s robustness and ability to generalize to real-world settings rather than on strict statistical similarity between the datasets. Collecting data from a single lab center keeps measurement procedures consistent, removing differences that could be caused by factors such as device calibration or assay methods, which increases the reliability of the data. The recursive embedding and clustering technique effectively integrates hierarchical relationships, decreases dimensionality, and maintains significant features. This approach is especially useful for identifying subtle patterns, such as those present in early-stage asymptomatic CKD, which are usually challenging to detect.

Despite the model’s performance, we must acknowledge some limitations. First, the dataset’s 2,000 records originated from a single source, potentially limiting its generalizability to more diverse or larger populations. Therefore, the lack of external validation is still a limitation, and future research should focus on testing the model on larger and more varied datasets to make sure it can be used in a variety of clinical settings. Second, finding an appropriate strategy for dealing with the high computing costs of embedding and clustering large datasets is also important.
