- Getting started
- Sample vs population
- Mean Median Mode
- Standard Deviation and Standard Error
- Plotting Data
- Significance Analysis
- Correlation Analysis
The Course is Designed for Biologist willing to get a better understanding of Bio-statistics and methods to analyze Biological data using Excel
Continue Reading or Enroll for the Structured Free Course
For this tutorial we will use diabetes data-set from National Institute of Diabetes and Digestive and Kidney Diseases. Please download the data
The data-set contains 9 Columns as Follows
|Pregnancies: The number of Pregnancies|
|Glucose: Blood Glucose Level|
|BloodPressure: Blood Pressure|
|SkinThickness: Thickness of the Skin|
|Insulin: Insulin Level|
|BMI: Body Mass Index|
|Diabetes Pedigree Function: Function that determines the risk of type 2 diabetes based on family history|
|Age: Age of the person|
|Outcome: Whether the person was diagnosed as Diabetic or Non-Diabetic|
There is total of 768 Records representing 768 individuals some diagnosed as Diabetes, others as non-diabetic. We will use this data-set to perform statistical analysis.
Sample vs population
Whenever, we perform a study, we perform on a rigid set of criteria and the representative selection of samples. The right selection of criteria is required to ensure that robust statistical analysis and interpretation can be performed.
Therefore, for the diabetes dataset it was performed on Females aged 21 from Prima Indian Heritage. The idea behind was to predict the onset of diabetes based on biological parameters.
The result of the findings should be applicable to the entire Females age 21 and above from Prima Indian Heritage. But unfortunately, we cannot include all individual in that study. Therefore we perform the study on the sample. A sample is a carefully selected random subset of the Females aged 21 and above from Prima Indian Heritage.
Population: The entire population which qualifies under the study criteria
Sample: The sub-set of population on which we will perform our study.
Statistics helps in understanding the population based on the sample.
Mean Median Mode
Mean and Median are commonly used as representative values for samples and reflective of population.
Mean (also known as average) is the sum of parameter/number of samples. Therefore the mean of 5,6,4,8 would be (5+6+4+8)/4 = 5.75
Median is the middle value of the data-set e.g. Median of 6, 3, 8, 5, 8 will be 6. What we did was arranged them in ascending order 3, 5, 6, 8, 8 and pick the middle value. If 2 numbers fall as middle value we take the average
Mode is the most frequently observed number e.g. Mode of 3,4,3,6,8,2 will be 3 as 3 occurred twice
Now lets do the same calculation on the diabetes data-set.
When you open the excel sheet this is how you will observe the data.
Every Box in the Excel sheet is identified by its row character and a column number. Its like the address of that box. Therefore the address of 6 (corner left) would be A2. Similarly C8 will be 50. This address is used for all the calculations. An array (more than one values) if in a straight line is represented as Address1:Address2. Therefore, A2:A8 will represent all the values from A2 to A8 (which are 6, 1, 8, 1, 0, 5, 3). This can be achieved by pointing the mouse to A2, clicking it and dragging to A8. Often it is needed in calculations. e.g in an empty box we can type =average(A2:A8) will give the average of values from A2 to A8. This can also be done by typing =average(point to A2 and with mouse drag to A8) and then close the bracket.
Here for example lets calculate the average Glucose level of all 768 participants of the study.
Step 1: Locate the column name. For Glucose its the B Column
Step 2: Locate the Data Start Row and the Data End Row (depicted in the left). In this example the data starts from 2 row and ends at row 769
Therefore our data-set under study (Glucose) is from B2 to B769
To calculate Mean or Average just move to an empty square and type
and type enter.
And Hurray you got the average glucose level of all participants in the study which is 120.89. You can also try the drag method as explained in quote.
Now lets calculate the median
And the median Glucose level will be displayed as 117
Similarly lets Calculate the Mode
and the result will be 100, which means Glucose level of 100 was most observed value among the participants.
When to Use Mean Median Mode
Mean is the default parameter to be used for any analysis. Median is used when there are outliers in the data-set (as the extreme values are eliminated from the estimation of median). Mode is used in cases where Frequency of an event is of particular importance.
Standard Deviation and Standard Error
Standard Deviation is the deviation from the mean, It is used to
- Understand data distribution
- Indicator of noise in the data-set.
- Represent as error bars in bar chart (to indicate the noise)
Standard deviation is calculated as sum of difference with mean squared divided by the number of data points and square rooted. Don’t worry about the equation, its dead simple doing it in Excel. To know more on theory around Standard Deviation, read this article.
Calculating Standard Deviation in Excel
We open up the same Diabetes Data-set and type the below formula in any empty box. You can drag the Column B2 to B769 using mouse after typing =STDEV( and then close the bracket.
This will calculate the Standard Deviation of Glucose Column.
To Remind B2 is the starting value and and B769 is the last value of B column which represents Glucose.
Similarly you can calculate Standard Deviation of any Parameters in the excel sheet.
Calculating Group Mean and Standard Deviation
Have a close look at the dataset, the last column depicted in Grey is the Clinical Outcome column, which indicates whether the participant is diagnosed as Diabetic (0) or Non-Diabetic (1)
One of the questions that is often asked, is to study the difference of Parameters in the two group e.g. How much difference is the Glucose Level of participants who developed Diabetes vs those who did not. The results are often depicted as Mean and Standard Deviation.
The First Step would be to separate the 2 groups. This is done by using the sorting feature of Excel
Sorting and Segregating the two groups in Excel
Follow the below steps to sort based on the Clinical Outcomes (Last Column) shown in Grey.
Select the Column ==> Select Data from menu ==> Select Sort (marked as Red circle in Below Figure)
This screen will appear
Always select Expand the Selection. This will perform sorting on all column. If you select Second option, only the selected Column will be sorted.
Select the Outcome and all Columns will be sorted with respect to Outcome. Remember that the My data has header is checked in if you have headers as is with this diabetes data-set.
By doing this all the Participants with no diabetes comes to the top (0) and all the Participants which developed diabetes comes after that (1).
Now rows 501 and above are the Participants with no diabetes. From row 502 – 769 are participants who developed diabetes.
Again Going by the Glucose Parameter (2nd Column), Lets segregate the Dataset in 2 Groups from Excel Perspective
Participants with no-diabetes ==> B2:B501 (B is the Column with the Glucose Information, and it starts from 2nd row, as 1st row is Header)
Similarly Participants with diabetes ==> B502-B769
Calculating Mean/Average of both groups
Lets Calculate the Mean of Participants with No Diabetes
Type in an empty box
will give an Mean of Glucose for Non-Diabetic group = 109.98
will give the Mean of Glucose for the Diabetic group = 141.25
Calculating Standard Deviation for both groups
Type in an empty box
=stdev(B2:B501) which comes as 26.14
=stdev(B502:B769) which comes as 31.93
Now Lets Plot the Information for quick visualization
Creating the summary of the above information. Place the mean/average and standard deviation on an empty area in the excel sheet as shown below
Now Select the 1 st two column of the above and click on Insert ==> Insert Column or Bar Graph as shown below
You will see lot of options, you can play around and see what graph suits you. For the purpose of this tutorial select the 1st one.
And your chart is ready. You can change colours headings and other aspects by clicking on the plus sign shown as green + Chart Elements. We leave you to explore those functions.
Adding the Standard Deviation on to the Bars
Click on + icon. Then Check Mark Error Bars and then select More Options as shown below.
This Will open the below window
Select Custom and you will see the below block
In the box marked as Blue and Green, for each, Select the SD Column and select and drag the two values marked in Red. On clicking Ok, Bar Graph will be generated along with Standard Deviation as error bars.
The more common approach is to display error bars as Standard Error.
Standard Error is standard deviation divided by square root of number of samples. The idea behind is the larger the samples, less is the error. Therefore, when there are more replicates, less is the error.
In Excel, you can calculate the Standard Error using the above formula.
Example (Glucose level of non-diabetic)
=stdev(B2:B501) = 26.14 we calculate the standard deviation (explained earlier)
Number of samples is 501-2 = 499
Standard Error =26.14/sqrt(499) =1.17
Similarly Calculate the Standard Error of Glucose Level of diabetic patients and plot a graph as was demonstrated earlier for Standard Deviation.
We all love graphs. It is often said that “A image Speaks a 1000 words”. Graphs also helps in getting deeper insight and better inference. In this chapter we will use the same Diabetes Dataset to plot few graphs.
Histogram is widely used for visualizing data distribution. Histograms bins the datasets in subsets and plots it as bar graphs. Using the same Diabetes dataset lets plot the Histogram for Glucose.
Select the Glucose Column. Then Click on Insert, followed by selecting the histogram as shown below. And here appears the histogram.
Histogram can tell if the data is normally distributed or not. Most of the statistical analysis depends on whether the data is normally distributed or not. If the Histogram appears as a bell shape, the data is normally distributed. The Histogram here does not seems to be in a bell shape, but that is because of many missing values (which is set as 0). In advance course we will learn more on missing value imputation and how to force the dataset to follow normal distribution or apply a test specific to the distribution.
Box Plots are also known as Whiskers Plot is useful to depict data distribution in the noisy datasets. We do similarly as we did in Histogram, but lets select 2 column, Glucose and Blood Pressure and Select Box and Whisker. This will generate a Box plot as shown below. Any number of Columns (Parameters) can be selected, but if the data of 2 columns are in different range, its best to normalize the data (will be discussed in advance course)
To understand better, I have clicked in + icon and selected chart title and legends.
To interpret Box Plots, let just concentrate on the Blood Pressure, which is shown in orange.
x with value of 69.1 is the mean Blood Pressure
The line in the middle of orange box denoted as 72 is the median value. The lower line depicted as 62 is the Q1, upper denoted as 80 is the Q2.
The lower end point with value of 38 is Q1-1.5 IQR (Inter quartile range)
The upper end point with value of 106 is Q3 + 1.5 IQR (Inter quartile range)
The values below 38 and above 106 are marked as dots and are potential outliers.
What is Quartile
Quartile divides the data-set in 4 parts known as Q1, Median, Q3.
Q1 is the Lower Quartile Value
Q3 is the Upper Quartile Value
The range between Q3 and Q1 is termed as Inter Quartile Range.
Most of you will never need to calculate the quartile, but if you wish to calculate, the formula in Excel is
B2 to B769 is the range of data and 1 represents the Q1 quartile. Replace 1 with 2 and 3 to get the 2nd and 3rd Quartile. Once you have calculated the quartiles you can calculate the Inter Quartile Range.
This is the most common of all analysis performed by Biologist. Here we compare 2 groups to estimate if there is significant differences in the two group. We are skipping the technical part as it is bit complex for beginners.
In this chapter we will study performing t-test among two groups of samples. We are assuming that the data is Normally Distributed, which means when plotting using Histogram it follows a bell curve as discussed in previous chapter. If the Data is not normally distributed either we can force data-set to follow normal distribution or we can use a non-parametric test (discussed in advanced course)
Again we start with the same Diabetes data-set
To calculate if there is significant difference in Glucose level between Non-Diabetic and Diabetic we use the below formula
will give a p value as extremely low = 5.06513E-10
B2:B501 represents the non-diabetic group
B502:B769 represents the diabetic group
One Tail vs Two Tails
The 2 after that represents “Two Tailed Distribution” which is common in most Biological problems. In some cases when we are only interested in one side of problem. e.g. water purity where our concern is only if the microbe goes beyond a level as it is of no interest on the other side (going lower), we can use One tailed Distribution. Plotting the data as histogram can give an idea on if one tail is needed or two tail is needed.
Paired vs Equal Variance vs Unequal Variance
The Last 2 represents the nature of data. 2 will suit most of the biological problems. Below is the summary of what to use
Paired test: When the results are paired e.g. Glucose level before and after insulin injection from the same patient, then we use Paired t-test
Two samples with equal variance: Mostly datasets will fall in this group. But one can check the variance of both groups. Variance is square of Standard Deviation.
Two sample with unequal variance: In some scenarios, the variance can be widely different and 3 can be used.
p-value cut off
Lets do the same for Blood Pressure which is column C
will give a p-value of 0.07
Values below 0.05 is normally considered as significant difference. In this data-set except for Blood Pressure all the other parameters are significantly different. Depending on the nature of the experiment p-value cut-off can be reduced.