From Big Data to Small Samples – How to Avoid Drawing Your Own Conclusions
July 30, 2019
Confidence Intervals for Small Samples
We are often concerned about the accuracy of the data we manage and share. We live in a world of big data, where much of the focus is on interpreting complex data with artificial intelligence. In many real-world situations, however, information is limited and conclusions are drawn from very few data points. For instance, examining patterns of clinical behaviour may require an initial analysis of patient notes, and to save time and resources the review may be based on a small sample.
It may seem easy to draw conclusions from a small number of observations, but such conclusions are often anecdotal, biased, or both. How confident can users be of results in these circumstances? Typically, we use statistical tools to quantify that uncertainty. All clinicians are familiar with p-values from statistical tests, but when sample sizes are small (< 10), the results may be misleading and sometimes counterintuitive.
One solution is the Wilson score method, a statistical tool for calculating a confidence interval and margin of error for a proportion that performs well with small sample sizes. Details of how this method can be implemented in R are set out below.
Wilson Score Method
This method is available through the built-in function binconf() in the Hmisc package, which uses the Wilson method by default. For more information on the function, type ?binconf after loading the package.
Our example data has x = 5 successes out of n = 10 attempts. A practical example is measuring the weight of 10 children and observing that 5 of them are overweight.
We normally set alpha = 0.05, which gives a 95% confidence interval and corresponds to a z-score of 1.96, unless there is a justified reason to change it.
The point estimate serves as the approximate value of the parameter, in this case the proportion of children that are overweight in our sample.
The lower and upper confidence limits give the range of plausible values for the true proportion, given the sample.
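For readers who want to see where x, n and the z-score enter the calculation, the standard textbook form of the Wilson score interval is sketched below. This is the general formula rather than a transcription of the binconf() source, and the symbols \hat{p} (the observed proportion x/n) and z (the critical value for the chosen alpha) are notation introduced here for illustration.

```latex
\hat{p} = \frac{x}{n}, \qquad
\text{limits} =
\frac{\hat{p} + \dfrac{z^{2}}{2n}}{1 + \dfrac{z^{2}}{n}}
\;\pm\;
\frac{z}{1 + \dfrac{z^{2}}{n}}
\sqrt{\frac{\hat{p}\,(1-\hat{p})}{n} + \frac{z^{2}}{4n^{2}}}

% Worked example with x = 5, n = 10, z = 1.96:
% centre     = (0.5 + 3.8416/20) / (1 + 3.8416/10) = 0.5
% half-width = (1.96 / 1.38416) * sqrt(0.025 + 0.009604) \approx 0.263
% interval   \approx (0.237,\ 0.763)
```

Note that the interval is centred on an adjusted value rather than on \hat{p} itself; the two coincide here only because \hat{p} = 0.5.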
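As a rough illustration of the point estimate and limits described above, the sketch below computes the Wilson interval by hand in base R and, if Hmisc happens to be installed, prints the binconf() result alongside it. The helper wilson_ci() is a hypothetical name introduced for this example and is not part of Hmisc; the two should agree for this example.

```r
# Hypothetical helper (not part of Hmisc): Wilson score interval in base R.
wilson_ci <- function(x, n, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)            # about 1.96 when alpha = 0.05
  p_hat <- x / n                       # point estimate
  denom <- 1 + z^2 / n
  centre <- (p_hat + z^2 / (2 * n)) / denom
  half_width <- (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z^2 / (4 * n^2))
  c(PointEst = p_hat, Lower = centre - half_width, Upper = centre + half_width)
}

x <- 5    # children observed to be overweight
n <- 10   # children measured
ci <- wilson_ci(x, n, alpha = 0.05)
print(round(ci, 3))

# Optional cross-check against Hmisc, only run when the package is available.
if (requireNamespace("Hmisc", quietly = TRUE)) {
  print(Hmisc::binconf(x = x, n = n, alpha = 0.05, method = "wilson"))
}
```

For 5 out of 10, the point estimate is 0.5 and the limits come out at roughly 0.237 and 0.763, which shows how wide the plausible range remains with only ten observations.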
The margin of error expresses the amount of uncertainty in the results. The larger the margin of error, the less confident we can be in our results. To find the margin of error, you can either subtract the lower limit from the point estimate or subtract the point estimate from the upper limit. Another method is to subtract the lower limit from the upper limit and then divide by 2. Please note that for either of these methods, the output of the function should be stored in an object so that the relevant values can be extracted. In this example both methods produce identical results because the point estimate is 0.5; for other proportions the Wilson interval is not symmetric about the point estimate, so the distances above and below it differ and half the interval width is the more consistent summary. Both methods are demonstrated in the output below.
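The margin-of-error calculations described above can be sketched as follows. To keep the example self-contained, it uses a hypothetical base R helper, wilson_ci(), that returns a one-row matrix with the same column names as binconf() (PointEst, Lower, Upper); with Hmisc loaded, the stored object could instead come from something like ci <- binconf(x, n). The margins() function is also an illustrative name, not an existing API.

```r
# Hypothetical helper (not part of Hmisc): Wilson interval as a one-row matrix
# with the column names that binconf() uses.
wilson_ci <- function(x, n, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  p_hat <- x / n
  denom <- 1 + z^2 / n
  centre <- (p_hat + z^2 / (2 * n)) / denom
  half_width <- (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z^2 / (4 * n^2))
  matrix(c(p_hat, centre - half_width, centre + half_width), nrow = 1,
         dimnames = list(NULL, c("PointEst", "Lower", "Upper")))
}

# Illustrative function: the three ways of expressing the margin of error.
margins <- function(ci) {
  point <- unname(ci[1, "PointEst"])
  lower <- unname(ci[1, "Lower"])
  upper <- unname(ci[1, "Upper"])
  c(below = point - lower,            # point estimate minus lower limit
    above = upper - point,            # upper limit minus point estimate
    half_width = (upper - lower) / 2) # half the width of the interval
}

m_half <- margins(wilson_ci(5, 10))   # point estimate 0.5: all three agree
m_skew <- margins(wilson_ci(2, 10))   # point estimate 0.2: below and above differ
print(round(m_half, 3))
print(round(m_skew, 3))
```

With 5 out of 10 the three values coincide, whereas with 2 out of 10 the distance above the point estimate is larger than the distance below it, so the half-width is the figure to report if a single margin of error is needed.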