P1016 Identification of transcriptional-reprogramming signals in high throughput gene expression data using K-Means clustering algorithm optimized by random walks

Huang Tinghua , Iowa State University, Ames, IA
Chen Nanjiang , Beihang University
Tian Bangsheng , Chinese Academy of Sciences
Yao Dan , Hubei University of Science and Technology
Li Biao , Changjiang Waterway Institute of Planning, Design & Research
Yao Min , Huazhong Agricultural University
Bin Fan , Huazhong Agricultural University, Wuhan, China
Shu-Hong Zhao , Huazhong Agricultural University, Wuhan, China
DNA microarrays have been extensively used to monitor gene expression in great depth and on a broad scale in biology studies. The clustering analysis that systematically compares the similarity between genes can reveal common and specific gene expression patterns that might provide further insights into the transcriptome control. For a gene cluster, the interests arise from two components 1) the homogeneity, reflecting the degree of similarity between each gene within the gene cluster; and 2) the response extent, namely, the magnitude of change of the expression of the genes between different conditions. In this study, we improved a K-means schema algorithm by 1) introducing random walks to the K-Means centers and 2) applying force filed (weighting) to the objects in the multidimensional clustering space. The simulated annealing (SA), a technique involving heating and controlled cooling of the K-Means centers were used to help the clustering find the global optimal configuration. The cost function and temperature schedule of SA were established. After applying force filed, different objects in the clustering space had different weights, based on their response extent (or the distance from the Zero-point gene which has no changes at all), and thus contributed differently to the clustering. We tested the proposed K-Means clustering schema using synthetic data. Compared with Lloyd's algorithm, the random walks can help the K-means clustering walk out of local optimal and the weighting improved the ability of the K-Means clustering picking up gene clusters with higher responses. The algorithm is computational intensive, takes much more time than other K-means implementations, but it can always produce clustering solution better than other algorithms. The algorithm is suitable for applications that high quality clustering results are required. A dataset consist of 4,000 chips collected from the public databases were created and analyzed using the proposed clustering algorithm. Gene clusters with distinct expression patterns were identified. Those clusters defined common host-transcriptional-reprogramming signals and provided useful insights for further investigation of the gene regulatory control.