In a blog post a few weeks ago, I reviewed a study that highlighted the discrepancies between counting and hormonal methods in classifying women as either high or low conception risk in evolutionary psychology research. I concluded that evidence of such discrepancies may “challenge the reliability of some prior findings of cycle phase effects.”
What I meant to suggest with that sentence is not that previous findings of cycle phase effects are false positives, but rather that some null findings in unpublished studies may actually be false negatives, and/or that cycle phase effects may be stronger than the literature currently suggests.
But why would using a messy, proxy measurement of conception risk (in this case, the counting method) result in false negative findings, or underestimates of a true effect size? Let’s use a simple thought experiment to make this a little clearer:
Say we have a population of 13 males and 13 females, and we are interested in whether there is a significant difference in height between the two sexes. We measure each individual’s height in inches, arrange them in order, and come up with these data. The pink cells represent values for females, and blue for males.
The two distributions overlap a bit, but overall, it looks like males are taller than females on average. We run an independent samples t-test on our small sample, and voilà! At p<0.01, the test is statistically significant, and we can conclude that the average heights of males and females differ.
Let’s say that for some reason, rather than asking individuals what their biological sex is, we’ll use a proxy measurement to determine biological sex: hair length. We decide that individuals with long hair will be classified as females, and individuals with short hair will be classified as males.
Unfortunately for us, that is a horrible way to differentiate between the biological sexes in this day and age. Plenty of females have short hair, while plenty of males have long hair (especially these days, with the popularity of man-buns reaching an all-time high).
Our data may end up looking a little more like this. Our two columns, rather than being ‘female’ and ‘male,’ are now ‘long hair’ and ‘short hair,’ because of how we decided to classify sex. Pink cells still reflect values for (truly) biological females, and blue for (truly) biological males.
The mean heights of the two groups still aren’t the same, but when we run the same independent samples t-test as before, the result (p=0.12) is no longer statistically significant. This would lead us to conclude that there is no height difference between females and males; since we know that a difference exists, that conclusion is a false negative.
In the thought experiment above, about 31% of the total sample was misclassified by sex, and this degree of misclassification was enough to produce a false negative. In cycle phase research specifically, the counting method may classify days as low or high conception risk incorrectly up to 36% of the time when compared to more accurate hormonal methods. Most of the women classified as high conception risk by counting methods are classified correctly and thus display a specific phenotype in behavior or preferences. Mistakenly including low conception risk women (who display a different phenotype) in that group, however, dilutes the observed difference and interferes with our ability to understand the full extent and magnitude of cycle phase effects.
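For readers who like to see the numbers move, here is a minimal sketch of the hair-length thought experiment in Python. The heights are made up for illustration (they are not the data from the tables in this post), and the choice of which four males have long hair and which four females have short hair is also an assumption. It compares the mean difference and Welch's t statistic under the true grouping and under the proxy grouping.

```python
from statistics import mean, variance

# Hypothetical heights in inches (illustrative only, not the post's data).
females = [60, 61, 62, 62, 63, 63, 64, 64, 65, 65, 66, 67, 68]
males = [66, 67, 68, 68, 69, 69, 70, 70, 71, 71, 72, 73, 74]

# Assumption: 4 males have long hair and 4 females have short hair,
# so 8 of 26 people (about 31%) are misclassified by the proxy.
long_haired_males = [68, 69, 70, 71]
short_haired_females = [62, 63, 64, 65]


def welch_t(a, b):
    """Welch's t statistic for the difference in means (a minus b)."""
    standard_error = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / standard_error


def without(values, removed):
    """Return a copy of values with one occurrence of each removed item dropped."""
    remaining = list(values)
    for value in removed:
        remaining.remove(value)
    return remaining


# Groups as the proxy measurement sees them.
long_hair = without(females, short_haired_females) + long_haired_males
short_hair = without(males, long_haired_males) + short_haired_females

misclassified = (len(long_haired_males) + len(short_haired_females)) / (
    len(females) + len(males)
)
diff_true = mean(males) - mean(females)
diff_proxy = mean(short_hair) - mean(long_hair)
t_true = welch_t(males, females)
t_proxy = welch_t(short_hair, long_hair)

print(f"misclassified: {misclassified:.0%}")
print(f"true grouping:  mean difference = {diff_true:.2f}, t = {t_true:.2f}")
print(f"proxy grouping: mean difference = {diff_proxy:.2f}, t = {t_proxy:.2f}")
```

With these made-up numbers, the mean difference shrinks from 6 inches to about 2.3 inches, and the t statistic shrinks with it. That matches a simple rule of thumb: when the groups are the same size and a fraction m of each group is swapped into the other, the expected observed difference is (1 - 2m) times the true difference.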
Now, it has been suggested that some previously reported findings are false positives (rather than false negatives) due to something called ‘researcher degrees of freedom.’ Because the days of the menstrual cycle considered high or low conception risk days are not agreed upon, the classification schema used by a team of researchers to distinguish between phases of the menstrual cycle is in part arbitrary (see this article for a great chart showing the variability among studies in the way phases of the cycle are defined). If statistically significant cycle phase effects are not observed when using one classification schema, it could be that researchers change the days they consider to be high and low risk, and do so until the desired effect is significant.
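To see why this concern is taken seriously in principle, here is a rough simulation sketch. Everything in it is an assumption made for illustration: the five "high risk" windows are hypothetical and not taken from any study, there is no true cycle effect in the simulated data, and a large-sample z-test stands in for whatever test a real study would use. It compares how often a study reaches p<0.05 when it commits to one window in advance versus when it may pick whichever window gives the smallest p value.

```python
import random
from statistics import NormalDist, mean, variance

# Hypothetical "high conception risk" windows (cycle days, inclusive).
WINDOWS = [(10, 15), (9, 14), (7, 14), (6, 14), (12, 16)]


def p_value(high, low):
    """Two-sided p value from a large-sample z-test on the mean difference."""
    standard_error = (variance(high) / len(high) + variance(low) / len(low)) ** 0.5
    z = (mean(high) - mean(low)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(n_studies=500, n_women=100, alpha=0.05, seed=1):
    """Return (fixed-window, flexible-window) false positive rates under the null."""
    rng = random.Random(seed)
    fixed_hits = 0
    flexible_hits = 0
    for _ in range(n_studies):
        # Scores are drawn independently of cycle day: no true effect exists.
        days = [rng.randint(1, 28) for _ in range(n_women)]
        scores = [rng.gauss(0, 1) for _ in range(n_women)]
        p_values = []
        for start, end in WINDOWS:
            high = [s for d, s in zip(days, scores) if start <= d <= end]
            low = [s for d, s in zip(days, scores) if not start <= d <= end]
            p_values.append(p_value(high, low))
        fixed_hits += p_values[0] < alpha      # window chosen in advance
        flexible_hits += min(p_values) < alpha  # best window chosen afterwards
    return fixed_hits / n_studies, flexible_hits / n_studies


if __name__ == "__main__":
    fixed_rate, flexible_rate = simulate()
    print(f"fixed window:    {fixed_rate:.3f}")
    print(f"flexible window: {flexible_rate:.3f}")
```

Because the pre-chosen window is one of the candidates, the flexible rate can never be lower than the fixed rate in this sketch. How much higher it gets depends on how many windows are tried and how much they overlap.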
Though this is possible, meta-analyses and examination of p-curves suggest that this is not the case, and that further inquiry into the extent and breadth of changes in behavior and cognition over the menstrual cycle is warranted.
Note: For those reading who are as interested in counting and hormonal methods of conception risk classification as I am, check out this cool recent article in Evolution and Human Behavior.
Special thanks to Adar Eisenbruch, a current evolutionary psychology graduate student, for his guidance on topics discussed in this post.