Thursday, August 28, 2008

LRT in bioinformatics papers

Take a look at the LRT and empirical P-value calculations in the following examples:

Familial combined hyperlipidemia is associated with upstream transcription factor 1 (USF1) - Nature Genetics: "Thus, gene-dropping is done under the null hypothesis of linkage equilibrium and no linkage. To calculate an empirical P value, gene-dropping is carried out multiple times. Here, at least 50,000 simulations were carried out for each analysis. The likelihood ratio test statistic (LRT) from each gene-dropping iteration is compared to the LRT for the observed data. The empirical P value is the proportion of iterations in which the gene-dropping LRT equaled or exceeded the observed LRT. In general, the obtained empirical P values of gene-dropping are more conservative than asymptotic P values for small sample sizes."
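The empirical P value described in that quote can be sketched in a few lines of Python. This is only an illustration: the function name `empirical_p_value` is made up here, and the simulated LRTs are drawn from a chi-square distribution with 1 degree of freedom as a stand-in for real gene-dropping output, which the paper obtains by simulating the data under the null and recomputing the LRT each time.

```python
import random


def empirical_p_value(observed_lrt, simulated_lrts):
    """Proportion of null simulations whose LRT equals or exceeds the observed LRT."""
    if not simulated_lrts:
        raise ValueError("need at least one simulated LRT")
    hits = sum(1 for s in simulated_lrts if s >= observed_lrt)
    return hits / len(simulated_lrts)


# Assumption for illustration only: null LRTs are the squares of standard
# normal draws (chi-square, 1 df), not the result of actual gene-dropping.
rng = random.Random(0)
null_lrts = [rng.gauss(0.0, 1.0) ** 2 for _ in range(50_000)]

# Hypothetical observed LRT, chosen only to have something to compare against.
p = empirical_p_value(6.63, null_lrts)
print(f"empirical P value: {p:.5f}")
```

With 50,000 simulations the smallest nonzero P value this proportion can take is 1/50,000, which is one reason the number of iterations has to be large when the observed LRT is extreme.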

http://www.smd.qmul.ac.uk/statgen/dcurtis/lc/gctests.html
The asymptotic p value is 0.00027, as reported in the output from scanassoc. The empirical p value is calculated as (r+1)/(N+1), where N is the number of permutations performed and r is the number of permuted datasets which by chance produce a higher value for the LRT statistic than does the real dataset. In order to test whether the p value was really as low as 0.00027 one would want to do 9999 or more permutations. In fact, rungc incorporates a feature called "sequential Monte Carlo testing". This means that a target can be set for the number of permuted datasets to reach the value produced by the real dataset and if this target is reached then the simulation procedure can be terminated early.
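Here is a rough Python sketch of the (r+1)/(N+1) formula and of the early-stopping idea. It is not how rungc is implemented: the function names, the `draw_null_statistic` callback and the stand-in null distribution are all assumptions made for illustration, and ties are counted as reaching the observed value.

```python
import random


def permutation_p_value(r, n):
    """Empirical p value (r+1)/(N+1) from r exceedances in n permutations."""
    return (r + 1) / (n + 1)


def sequential_permutation_count(observed, draw_null_statistic,
                                 max_permutations, target_hits):
    """Run permutations until target_hits of them reach the observed statistic,
    or until max_permutations have been done. Returns (r, n_done, stopped_early)."""
    r = 0
    for n_done in range(1, max_permutations + 1):
        if draw_null_statistic() >= observed:
            r += 1
            if r >= target_hits:
                return r, n_done, True
    return r, max_permutations, False


# With N = 9999 the smallest attainable p value is 1/10000, which is below
# 0.00027, so that many permutations can resolve a p value of that size.
smallest_p = permutation_p_value(0, 9999)

# Stand-in null statistic (square of a standard normal draw); a real run would
# permute the dataset and recompute the LRT on each call.
rng = random.Random(1)
r, n_done, stopped_early = sequential_permutation_count(
    observed=2.0,
    draw_null_statistic=lambda: rng.gauss(0.0, 1.0) ** 2,
    max_permutations=9999,
    target_hits=20,
)
print(r, n_done, stopped_early)
```

The sketch returns the raw counts rather than a p value because the quoted text does not say how the p value is reported when the run stops early.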
