P0975 Parallel-BayesCpC on OSG: Grid-enabled High-throughput Computing for Genomic Selection in Practice

Xiao-Lin Wu, Department of Dairy Science, University of Wisconsin - Madison, WI
Okut Hayrettin, Department of Animal Sciences, University of Wisconsin - Madison, WI
Huihui Duan, Department of Animal Sciences, University of Wisconsin - Madison, WI
Timothy Beissinger, Department of Animal Sciences, University of Wisconsin - Madison, WI
Stewart Bauck, Igenity Livestock Business Unit, Merial Ltd., Duluth, GA
Brent Woodward, Igenity Livestock Business Unit, Merial Ltd., Duluth, GA
Guilherme J. M. Rosa, Department of Animal Sciences, University of Wisconsin - Madison, WI
Kent A Weigel, Department of Dairy Science, University of Wisconsin - Madison, WI
Natalia de Leon Gatti, Department of Agronomy, University of Wisconsin - Madison, WI
Jeremy Taylor, Division of Animal Sciences, University of Missouri, Columbia, MO
Daniel Gianola, Department of Biostatistics and Medical Informatics, University of Wisconsin - Madison, WI
Grid computing combines computers from multiple administrative domains, typically clusters of varying sizes, to solve a common computational task. Middleware apportions pieces of a complex job among the available computers, from tens to several thousand, which greatly increases computing throughput. Parallel-BayesCpC-OSG is a high-throughput computing (HTC) package and a member of the WGSE family of distributed HTC pipelines that we have developed to automate all steps of computing and decision making for genomic selection. The package is so named because it uses BayesCπ for feature selection (FS) and BayesC (π = 0) for post-FS statistical inference and cross-validation. Parallel Markov chain Monte Carlo (MCMC) is enabled for both feature selection and cross-validation. Distributed jobs are submitted to run on a local Condor cluster and on the Open Science Grid (OSG). The package was built in multiple programming languages, including R, Perl, C/C++ and Fortran. In the first application, we used grid-based distributed parallel computing (G-DPC) to select the optimal value of the π parameter from a grid of candidate values in the BayesCπ model, using simulated data. This was contrasted with the MCMC approach, which treated π as an unknown parameter and involved Metropolis-Hastings sampling; the latter approach is time-consuming when tens of thousands of markers are fitted. In the second application, we used G-DPC to select optimal panel sizes for genomic prediction in an Angus cattle population, a task that would take several weeks or more if executed sequentially.
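
The grid search over π described above is embarrassingly parallel: each candidate value can be evaluated as an independent job. The following is a minimal Python sketch of that pattern only, not the package's implementation. Everything in it is an assumption made for illustration: the function names (`simulate`, `score_pi`, `grid_search`), the toy data sizes, the use of a local thread pool in place of Condor/OSG job submission, and a simple marginal-regression marker filter standing in for the BayesCπ MCMC fit. Here π is read as the fraction of markers excluded from the model, and each job is scored by the correlation between predicted and observed phenotypes in a held-out set.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def simulate(n=120, m=60, n_causal=6, seed=1):
    """Toy data (assumed sizes): genotypes coded 0/1/2, a few causal markers."""
    rng = random.Random(seed)
    X = [[rng.choice((0, 1, 2)) for _ in range(m)] for _ in range(n)]
    effects = [0.0] * m
    for j in rng.sample(range(m), n_causal):
        effects[j] = rng.gauss(0, 1)
    y = [sum(x[j] * effects[j] for j in range(m)) + rng.gauss(0, 1) for x in X]
    return X, y


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either vector has no variance."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5 if va > 0 and vb > 0 else 0.0


def score_pi(pi, X, y, n_train):
    """One grid job. Stand-in for a BayesCpi fit (NOT the real model):
    keep the (1 - pi) fraction of markers with the largest marginal
    regression slopes in the training set, then score on held-out data."""
    Xt, yt = X[:n_train], y[:n_train]
    Xv, yv = X[n_train:], y[n_train:]
    m = len(X[0])
    my = sum(yt) / n_train
    slopes = []
    for j in range(m):
        col = [x[j] for x in Xt]
        mx = sum(col) / n_train
        sxx = sum((c - mx) ** 2 for c in col)
        sxy = sum((c - mx) * (v - my) for c, v in zip(col, yt))
        slopes.append(sxy / sxx if sxx > 0 else 0.0)
    keep = max(1, round((1 - pi) * m))
    top = sorted(range(m), key=lambda j: abs(slopes[j]), reverse=True)[:keep]
    pred = [sum(x[j] * slopes[j] for j in top) for x in Xv]
    return pearson(pred, yv)


def grid_search(pi_grid, X, y, n_train, workers=4):
    """Evaluate every candidate pi independently and return the best one.
    The thread pool is a local stand-in for distributed job submission."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda p: score_pi(p, X, y, n_train), pi_grid))
    best = max(range(len(pi_grid)), key=lambda i: scores[i])
    return pi_grid[best], dict(zip(pi_grid, scores))


X, y = simulate()
pi_grid = [0.5, 0.7, 0.9, 0.95, 0.99]
best_pi, scores = grid_search(pi_grid, X, y, n_train=90)
print(best_pi, scores)
```

Because the jobs share no state, the same structure would carry over to a scheduler that runs one job per grid point; only the body of `score_pi` and the dispatch mechanism would change.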