AES Paper: A Meta-Analysis of High Resolution Audio Perceptual Evaluation
voltronic:
Thanks, I understood a good part of that. ;)
joshr:
I'm the author of this paper.
I’ve avoided engaging in forum discussions, but Aaronji’s comments caught my eye, and I couldn’t help but respond.
- Full data, analysis methods, source code etc are available at https://code.soundsoftware.ac.uk/projects/hi-res-meta-analysis . I encourage anyone who is interested to perform their own analysis, and I will happily answer any questions. I expect that others may be more rigorous, or may uncover other interesting information that I overlooked. I’ll also try to answer any comments posted on the paper’s forum at https://secure.aes.org/forum/pubs/journal/?ID=591
- I consulted with statisticians and meta-analysis experts at various stages throughout the preparation of the paper. I would have liked a co-author with expertise in those areas, but the people I asked were unavailable.
- The Appendix was not included in the original submission, but was requested by one of the reviewers. I believe this request was correct, since the readers of the AES journal, including those who frequently apply statistical techniques to their data, are generally not familiar with meta-analysis and the techniques applied in that field.
- I’m aware of the importance of homogeneity, and the heterogeneity issues here are more serious than those that would typically be found in medical research, and a world apart from formal clinical trials. However, meta-analysis has been successfully applied to social and behavioural science research with far more heterogeneity problems than those seen here. Anyway, this is a judgement call. So the approach I took was to use all possible studies (for which I could do inverse variance analysis), and then do sensitivity or subgroup analysis on more homogeneous subsets of the data.
- Regarding bias: this made me laugh at first, since in relation to this paper I’ve been accused of bias from all sides. Before beginning the study, I did not have a strong opinion either way as to whether differences could be perceived. But I could easily be fooling myself. So I committed to publishing all results, regardless of outcome. And again, I included all possible studies, even if I thought they were problematic, then did further analysis looking at alternative choices. I also decided that any choices regarding analysis or transformation of data would be made a priori, regardless of the result of that choice. However, I wrote the paper once all the analysis had been done, and so my writing style may reflect my knowledge of the conclusions.
- I agree that the work would have been improved by using an approach specific to binomial distributions. However, for much of the analysis, the normal approximation is justified. As for independence in the binomial test, under the null hypothesis every randomised trial would be uncorrelated, regardless of whether they involved the same participant or same study (think guessing a truly random coin toss). I also agree that the aggregate binomial test is not appropriate for meta-analysis. It was included only for completeness along with the binomial values for the individual studies in Section 2, and not used as part of the meta-analysis in Section 3.
- For King 2012 (the ‘closer to live’ study), there were three options: exclude it completely, treat a higher preference rating as discrimination (which is fraught with issues), or treat ‘closer to live’ as successful discrimination. Since the live feed was provided as a reference stimulus, similar to many other multistimulus evaluation studies, and the intention of the 192 kHz feed was to be ‘closer to live’ even if not perceived, the last option seemed logical. Again, this decision was made a priori, in an attempt to minimize any of my own biases influencing the outcome.
- The studies were mainly from the audio engineering discipline and had a strong tendency toward expressing and considering results (effect sizes) as means rather than proportions, and toward expressing probabilities as percentages. This is reflected in the paper, though better editing on my part would have resulted in more consistent notation for p values. I could also have performed sensitivity analysis where results were considered as odds ratios. But at some point, one has to stop looking at every variation and just submit the paper.
- The structure of the paper is in line with the structure of most engineering papers (including IEEE). As such, it looks very different from the structure of papers in medical journals and other places where a lot of meta-analysis is published.
- The standard explanation of the publication bias problem is mentioned several times, first at the beginning of Section 3.6. Figure 3 shows that the apparent evidence of publication bias in the funnel plot mostly disappears when subgrouping is applied. However, the paper then goes on to state that "publication bias may still be a factor", and the Conclusion notes that there is "still a potential for reporting bias. That is, smaller studies that did not show an ability to discriminate high resolution content may not have been published."
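To make the normal-approximation point above concrete, here is a quick Python sketch; note that the 60-out-of-100 figures are invented for illustration, not taken from any of the studies in the meta-analysis:

```python
# Comparing an exact one-sided binomial test with its normal approximation.
# Hypothetical numbers: 60 correct responses in 100 forced-choice trials.
from scipy.stats import binomtest, norm

k, n, p0 = 60, 100, 0.5  # successes, trials, chance level under H0

# Exact binomial test: P(X >= k) under the null of pure guessing
p_exact = binomtest(k, n, p0, alternative="greater").pvalue

# Normal approximation with continuity correction
mu = n * p0
sigma = (n * p0 * (1 - p0)) ** 0.5
p_approx = norm.sf((k - 0.5 - mu) / sigma)

print(p_exact, p_approx)  # the two values agree closely at this sample size
```

At sample sizes like those pooled here, the two p values differ only in the third decimal place, which is why the approximation was considered acceptable.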
aaronji:
^ First of all, welcome to taperssection and thanks for coming in and discussing this with us. To be honest, I was initially very surprised to see you post in this little backwater of the web, catering to the practitioners of a pretty uncommon hobby, but on further reflection I am fairly certain I know how you arrived here. At any rate, I would like to respond to a couple of your comments on my more major criticisms (the presence or absence of the appendix, for example, is immaterial in the end).
--- Quote from: joshr on July 20, 2016, 04:50:07 PM ---- I’m aware of the importance of homogeneity, and the heterogeneity issues here are more serious than those that would typically be found in medical research, and a world apart from formal clinical trials. However, meta-analysis has been successfully applied to social and behavioural science research with far more heterogeneity problems than those seen here. Anyway, this is a judgement call. So the approach I took was to use all possible studies (for which I could do inverse variance analysis), and then do sensitivity or subgroup analysis on more homogeneous subsets of the data.
--- End quote ---
With respect to the bolded part, what does "successfully" mean? Obtained a P-value? Published a paper? Generated a useful result that led to downstream hypotheses that were also tested successfully? Settled an open debate? Whatever that definition, though, do you think your work should fall into the category of "squishy" science (like a lot of social and behavioural science)? I always thought of engineering as "hard" science, with experiments conducted rigorously and in the most methodologically proper way possible. I am sorry, but "others did it worse!" is not a valid rebuttal of this criticism, which, in my mind, completely undermines the entire paper. You are right that, in the end, it is a series of judgement calls, but others can freely interpret the merit of the work based on their assessment of the quality of those judgements.
--- Quote from: joshr on July 20, 2016, 04:50:07 PM ---- I agree that the work would have been improved by using an approach specific to binomial distributions. However, for much of the analysis, the normal approximation is justified. As for independence in the binomial test, under the null hypothesis every randomised trial would be uncorrelated, regardless of whether they involved the same participant or same study (think guessing a truly random coin toss). I also agree that the aggregate binomial test is not appropriate for meta-analysis. It was included only for completeness along with the binomial values for the individual studies in Section 2, and not used as part of the meta-analysis in Section 3.
--- End quote ---
The normal approximation may be justified, particularly for large numbers, but I think you need to show that. It is kind of beside the point, though. Why make those additional, potentially spurious, assumptions when it is easy to implement the correct analysis, modelled on the correct distribution, in freely available software? With respect to the aggregate binomial analysis being included for "completeness", wouldn't it have been more complete to actually put the correct estimate in there? The intra-individual trials are not like coin flips, in my opinion; there is a discrete set of perceptual apparatus that is unique to each individual that causes correlation between that individual's observations. If such correlations did not exist, nobody would ever score higher (or lower) than 50% in a sufficiently large number of trials.
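To make that concrete, here is a quick simulation sketch (all parameters entirely made up): when each listener has a personal hit rate, scores across listeners spread out far more than pure coin-flipping would allow, even when the average stays at 50%.

```python
# Simulation: independent coin flips vs. per-listener hit rates.
# All parameters are hypothetical, chosen only to illustrate overdispersion.
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_trials = 500, 40

# Null world: every trial is an independent fair coin flip for everyone
coin_scores = rng.binomial(n_trials, 0.5, size=n_subjects)

# Correlated world: each listener has their own true hit rate around 0.5,
# so that listener's trials are correlated with one another
abilities = rng.beta(2, 2, size=n_subjects)   # mean 0.5, substantial spread
subj_scores = rng.binomial(n_trials, abilities)

# The between-listener variance is much larger in the second case
print(coin_scores.var(), subj_scores.var())
```

In the coin-flip world the score variance is about n/4; with heterogeneous abilities it is inflated by an extra term proportional to the variance of the individual hit rates, which is exactly why high (or low) individual scores are possible.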
With respect to publication bias, I never said you didn't consider it, only that you never mention, specifically, the implication about the type of study that is not reported based on that funnel plot. In any event, that is a lesser concern for me than the above. I certainly appreciate your comments here, and I hope you understand I am not trying to be a dick in any way (this, after all, is the nature of scientific discourse), but your rebuttal doesn't much impact my previous assessment...
While you are here, on a somewhat related topic, can you comment on the Journal's review policy? The website says there is a "review board". Who comprises that board? How large is it? Do all reviewers come from this board or are outside experts brought in?
--- Quote from: voltronic on July 20, 2016, 07:19:06 AM ---Thanks, I understood a good part of that. ;)
--- End quote ---
Sorry about that! I'll try to make it a little more obtuse next time; maybe toss in some formulas... :D
joshr:
Apologies in advance if I don't continue the discussion much. I've just got a long 'to do' list to catch up on.
“What does ‘successfully’ mean?” – I meant something loosely along the lines of ‘generated a useful result that led to downstream hypotheses that were also tested successfully.’
How about https://www2.ed.gov/rschstat/eval/tech/evidence-based-practices/finalreport.pdf? This was a massive, well-cited study that has led to a better understanding of the potential benefits and drawbacks of online learning. And it tested hypotheses generated from previous meta-studies in the field. But the data had serious heterogeneity issues.
Note that I didn't follow the approach from that paper though. I kept mainly to guidelines in the Cochrane Handbook. I'm just using it as an example.
I fully agree about best effort and rigour in research, and did not mean to imply an ‘others did it worse’ justification. But nor do I think the heterogeneity issues are insurmountable here. The studies were all looking at discrimination between high resolution and standard resolution audio. Almost all looked at it directly, and a couple of others (King 2012 and Repp 2006) had data that could be transformed into that form. All had multiple participants, each performing multiple dichotomous trials. And all yielded (single outcome measure) results where, if differences could always be perceived, one expects 100% discrimination, and if differences could never be perceived, one expects 50% correct discrimination. Almost all tests were forced choice, either same/different or an ABX variant (these two approaches were also subjected to subgroup analysis). I’ll also note that a random effects model was used, and it can easily be seen from the main forest plot and associated statistics that heterogeneity is not readily apparent within the training subgroup.
Anyway, this is going back to the ‘apples and oranges’ analogy. Meta-analysis is comparing apples and oranges (two studies using different dependent and independent variables), but that is ok if you are trying to learn about the nature of fruit (both studies looking at the same research question).
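For anyone who wants to experiment before diving into the repository, here is a bare-bones sketch of the DerSimonian-Laird random effects pooling that this kind of analysis uses. The effect sizes and variances below are invented purely for illustration; the real data are at the soundsoftware.ac.uk link above.

```python
# DerSimonian-Laird random-effects meta-analysis, minimal sketch.
# y and v are hypothetical study effect sizes and sampling variances.
import numpy as np

y = np.array([0.30, 0.02, 0.15, -0.05, 0.08])   # made-up study effects
v = np.array([0.004, 0.010, 0.002, 0.008, 0.005])  # made-up variances

w = 1.0 / v                                  # fixed-effect (inverse-variance) weights
y_fixed = np.sum(w * y) / np.sum(w)
Q = np.sum(w * (y - y_fixed) ** 2)           # Cochran's Q heterogeneity statistic
k = len(y)

# Between-study variance estimate (DerSimonian-Laird), floored at zero
tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))

w_star = 1.0 / (v + tau2)                    # random-effects weights
y_random = np.sum(w_star * y) / np.sum(w_star)
i2 = max(0.0, (Q - (k - 1)) / Q) * 100       # I^2: % variation from heterogeneity

print(round(y_random, 3), round(tau2, 4), round(i2, 1))
```

The R ‘meta’ package (or ‘metafor’) does all of this and more, but the arithmetic itself is simple enough that anyone can re-run it on the published data.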
Regarding the normal approximation, binomial analysis, etc.: first, I wasn’t aware of the full functionality of the ‘meta’ package in R, and so didn’t use it. But I don’t think that use of the normal approximation invalidates any results. Also, the null hypothesis in this case results in exactly what you said: ‘nobody would ever score higher (or lower) than 50% in a sufficiently large number of trials.’ To clarify, suppose the correct answer is randomly A half the time and randomly B the other half, but there is no way anyone can distinguish between them. Then it doesn’t matter how someone answers; their score still converges on 50% correct. And given that, we can give a probability for at least 6736 ‘correct’ results out of 12645 trials.
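That tail probability is easy to check directly, e.g. in Python:

```python
# Under the null of pure guessing (p = 0.5), the probability of at least
# 6736 correct answers in 12645 trials. (As noted, this aggregate test is
# included only for completeness, not as part of the meta-analysis.)
from scipy.stats import binom

n, k = 12645, 6736
p_tail = binom.sf(k - 1, n, 0.5)   # P(X >= k) under H0
print(p_tail)                      # an extremely small probability
```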
But these are minor details. I agree that the binomial distribution is preferred, that the aggregate binomial analysis is not the right approach, and that if there is any perceptual difference at all then an individual’s scores are highly correlated (I make note of that in the paper when discussing Meyer 2007). The disagreement is only over the severity and importance of these things. I don’t think the analysis or conclusions are in any sense invalidated, and I still strongly encourage others to revisit the data.
joshr:
--- Quote from: aaronji on July 21, 2016, 07:44:20 AM ---can you comment on the Journal's review policy? The website says there is a "review board". Who comprises that board? How large is it? Do all reviewers come from this board or are outside experts brought in?
--- End quote ---
The editorial staff of the journal are listed at http://www.aes.org/journal/masthead.cfm . They have a much larger pool of reviewers that they pick from, and also use outside experts. I think they aim for a minimum of three reviews per paper. That said, it's always a struggle (as is the case for many journals) to maintain a talented and diverse pool of reviewers, and it's hard to find just the right outside experts. I'm sure that they would welcome more potential reviewers.