D14-1188 fourth column is the p-value for statistical significance testing against the baseline . The first
D12-1091 demonstrated some limitations of statistical significance testing for NLP . In particular , while
M92-1001 . Second , a method of doing statistical significance testing was incorporated into the test
N06-1058 based on WordNet . The results of statistical significance testing are summarized in Table 5 . All
D14-1102 F1 to emphasize precision . For statistical significance testing , we use the sign test with bootstrap
D12-1052 system edits are computed . For statistical significance testing , we use sign-test with bootstrap
D12-1091 considered a good practice to include statistical significance testing results with empirical evaluations
D10-1003 tion . We conduct χ2 tests for statistical significance testing . We analyze the Penn Treebank
J08-1003 too small to support reliable statistical significance testing of the performance ranking of
M98-1024 metrics , scoring algorithms , and statistical significance testing . The first column in the report
E09-1048 caused by chance , we applied statistical significance testing . As we did not want to make
M92-1003 systems influence the outcome of the statistical significance testing more than the actual test statistics
M92-1043 recall . We intend to conduct statistical significance testing at least for the version of the
J08-1003 Below we give those results and statistical significance testing for the PARC 700 and CBS 500
J93-3001 the MUC-3 data . 4.1 Review of Statistical Significance Testing A statistical significance test
J12-2005 samples that form the basis of the statistical significance testing is less straightforward for the
M95-1002 defined for this task , and no statistical significance testing was performed on the scores .
J93-3001 we review the main concepts in statistical significance testing and describe our approach to
D15-1278 bootstrap test is adopted for statistical significance testing ( Efron and Tibshirani , 1994
M98-1001 are given and included in the statistical significance testing because the systems can achieve