Analysis of Single Continuous Variable
In our earlier blog, we learned to analyze a Single Categorical Variable in R. In this blog, we will analyze a Single Continuous Variable in R. The table below summarizes the commonly used Descriptive Statistics for analyzing a Single Continuous Variable.
Tabular Methods | Percentile Distribution |
Graphical Methods | Histogram, Density Plot, Box Plot |
Numerical Methods | Measures of Central Tendency and Measures of Dispersion |
Example
We will continue with the same MBA Students Data used in our previous blog.
Let’s analyze the continuous variable ‘MBA Grades’ in MBA Students Data through Numerical, Graphical, and Tabular Methods.
Analysis of MBA Grades
Variable Name | avg_grades_of_mba_3_semesters |
Description | This variable captures the Average of the Grades secured by students in their First Three Semesters |
Variable Type | Continuous Variable. |
Ok!!! Great. Let us run some R code to analyze our data.
Importing MBA Students Data in R
#Set directory as per your folder file path
setwd("D:/k2analytics/datafile")
getwd()

#Read the File
mba_df = read.csv("MBA_Students_Data.csv", header = TRUE)
Numerical Methods | Summary Statistics
The R code to get the Summary Statistics of MBA Grades is given below.
#Summary statistics
#sum(is.na(...)) counts the missing values
missing_count = sum(is.na(mba_df$avg_grades_of_mba_3_semesters))
grades_mean = round(mean(mba_df$avg_grades_of_mba_3_semesters),2)
grades_median = round(median(mba_df$avg_grades_of_mba_3_semesters),2)
grades_min = round(min(mba_df$avg_grades_of_mba_3_semesters),2)
grades_max = round(max(mba_df$avg_grades_of_mba_3_semesters),2)
grades_std = round(sd(mba_df$avg_grades_of_mba_3_semesters),2)

#Print the Values ("\n" ends each line so the messages do not run together)
cat("The Number of Missing Observations is", missing_count, "\n")
cat("The Mean Grade of the Students is", grades_mean, "\n")
cat("The Median Grade of the Students is", grades_median, "\n")
cat("The Minimum Grade of the Students is", grades_min, "\n")
cat("The Maximum Grade of the Students is", grades_max, "\n")
cat("The Standard Deviation of the Grade of the Students is", grades_std, "\n")
#Output
The Number of Missing Observations is 0
The Mean Grade of the Students is 7.43
The Median Grade of the Students is 7.5
The Minimum Grade of the Students is 6.3
The Maximum Grade of the Students is 9.2
The Standard Deviation of the Grade of the Students is 0.6
#Note: For Continuous Variables, the Mean is the most commonly used Measure of Central Tendency. The Median is a more robust choice when the data is skewed or contains outliers.
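The MBA Grades column has no missing values, but mean(), median() and sd() return NA as soon as a single value is missing. Here is a minimal sketch of how to guard against that. It uses a small made-up vector called grades (a hypothetical stand-in, not the MBA Students Data).

```r
# Hypothetical grades vector (made up for illustration, not the MBA data)
grades <- c(6.5, 7.0, 7.5, NA, 8.0, 9.0)

# is.na() returns TRUE/FALSE per value; sum() counts the TRUEs
missing_count <- sum(is.na(grades))

# Without na.rm = TRUE, a single NA makes the result NA
mean_with_na <- mean(grades)

# na.rm = TRUE drops the missing values before computing the statistic
grades_mean   <- mean(grades, na.rm = TRUE)
grades_median <- median(grades, na.rm = TRUE)
grades_sd     <- sd(grades, na.rm = TRUE)

cat("Missing:", missing_count, "\n")
cat("Mean:", grades_mean, " Median:", grades_median, "\n")

# summary() reports min, quartiles, mean, max and the NA count in one call
print(summary(grades))
```

On a real data frame the same calls can be applied to a column, e.g. mba_df$avg_grades_of_mba_3_semesters.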
Tabular Methods | Percentile Distribution
A Percentile is the value below which a given percentage of the observations in the data fall. For example:
- In the percentile distribution below, the value 6.79 is at the 10th percentile, i.e., 10% of the values in the data are less than 6.79
- The value 7.5 is at the 50th percentile, i.e., 50% of the values in the data are less than 7.5
#Percentile Distribution
transform(quantile(mba_df$avg_grades_of_mba_3_semesters,
                   c(0,0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.95,0.99,1)))
Percentile | Value |
0% | 6.300 |
1% | 6.500 |
5% | 6.500 |
10% | 6.790 |
25% | 6.900 |
50% | 7.500 |
75% | 7.800 |
90% | 8.200 |
95% | 8.600 |
99% | 9.001 |
100% | 9.200 |
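To see the link between a percentile and the share of observations below it, here is a small sketch on a made-up vector called grades (hypothetical values, not the MBA Students Data). It takes the value at the 50th percentile, then checks what share of the data is at or below it.

```r
# Hypothetical grades (made up for illustration, not the MBA data)
grades <- c(6.3, 6.5, 6.8, 6.9, 7.2, 7.5, 7.5, 7.8, 8.2, 9.2)

# Value at the 50th percentile (the median)
p50 <- quantile(grades, probs = 0.5, names = FALSE)

# Share of observations at or below that value
share_at_or_below <- mean(grades <= p50)

# ecdf() builds the empirical cumulative distribution function,
# which maps a grade back to its cumulative share
grade_cdf <- ecdf(grades)
share_up_to_8 <- grade_cdf(8)

cat("50th percentile:", p50, "\n")
cat("Share at or below it:", share_at_or_below, "\n")
cat("Share with grade <= 8:", share_up_to_8, "\n")
```

With small samples or tied values the match is only approximate, because quantile() interpolates between neighbouring observations by default.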
Graphical Methods | Histogram
- A Histogram is a commonly used method to visually show the distribution of a continuous variable
- A Histogram is created by converting the range of a continuous variable into categories by Binning/Bucketing, i.e., converting the range of values into Intervals, called Class Intervals.
- The X-axis of the Histogram represents the Class Intervals, and the Y-axis of the Histogram represents the Frequency of Class Intervals.
Default Histogram generated by R
#Default Histogram generated by R Programming
hist(mba_df$avg_grades_of_mba_3_semesters,
     breaks = 10,
     col = "royalblue",
     main = "Histogram of MBA Students grades",
     ylab = "Count Students",
     xlab = "MBA Grades (Last 3 Sem.Avg.)")
Note: In the above R code we passed the parameter breaks = 10 to request 10 bins. However, when breaks is a single number, R treats it only as a suggestion, and the histogram has only 7 bins. R rounds the breakpoints to pretty values; as you can see, the breakpoints are at an interval of 0.5.
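You can inspect the breakpoints R picks without drawing the chart. The sketch below uses a made-up vector called grades that only shares the 6.3 to 9.2 range with the MBA data (the individual values are hypothetical).

```r
# Hypothetical grades spanning the same range as the MBA data (6.3 to 9.2);
# the individual values are made up for illustration
grades <- c(6.3, 6.5, 6.8, 6.9, 7.2, 7.5, 7.5, 7.8, 8.2, 9.2)

# plot = FALSE returns the histogram object without drawing it
h <- hist(grades, breaks = 10, plot = FALSE)

# The breakpoints R actually used, and the resulting number of bins
print(h$breaks)
cat("Number of bins:", length(h$counts), "\n")

# For a single-number breaks argument, hist() gets its breakpoints from pretty()
print(pretty(range(grades), n = 10))
```

For a range of 6.3 to 9.2, pretty() settles on a step of 0.5, which is why only 7 bins appear even though 10 were requested.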
Customized bin size in the histogram
- Total Number of Bins: The total number of class-intervals in the histogram. Let’s create 10 bins.
- Range: The range of average grades of MBA Students is Range = 9.2 – 6.3 = 2.9
- “Bin Width” is obtained by dividing the range by the total number of bins. Bin_width = 2.9 / 10 = 0.29
#Total Number of Bins
total_bins = 10
cat("Total Number of bins is", total_bins, "\n")

#Range
grades_range = grades_max - grades_min
cat("The Range of the MBA Students grades is", grades_range, "\n")

#Bin Width (R is case-sensitive, so the variable must be printed as bin_width)
bin_width = grades_range/total_bins
cat("The Bin Width is", bin_width, "\n")

#Breaks
bin_breaks = seq(grades_min, grades_max, bin_width)
cat("The Breaks are", bin_breaks, "\n")
#Output
Total Number of bins is 10
The Range of the MBA Students grades is 2.9
The Bin Width is 0.29
The Breaks are 6.3 6.59 6.88 7.17 7.46 7.75 8.04 8.33 8.62 8.91 9.2
Let’s plot a Histogram using these breakpoints
#Histogram with customized bin size
hist(mba_df$avg_grades_of_mba_3_semesters,
     breaks = bin_breaks,
     col = "royalblue",
     main = "Histogram of MBA Students grades",
     ylab = "Count Students",
     xlab = "MBA Grades (Last 3 Sem.Avg.)")
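The binning behind the histogram can also be reproduced by hand. Here is a rough sketch on a made-up vector called grades (hypothetical values, not the MBA Students Data). It builds the breakpoints with length.out instead of a step size, which is one way to make sure the last breakpoint lands on the maximum, and then counts the observations in each Class Interval with cut() and table().

```r
# Hypothetical grades (made up for illustration, not the MBA data)
grades <- c(6.3, 6.5, 6.8, 6.9, 7.2, 7.5, 7.5, 7.8, 8.2, 9.2)

total_bins <- 10

# total_bins intervals need total_bins + 1 breakpoints;
# length.out fixes that count, so the last break equals max(grades)
bin_breaks <- seq(min(grades), max(grades), length.out = total_bins + 1)

# cut() assigns each grade to a class interval;
# include.lowest = TRUE keeps the minimum value in the first interval
bin_labels <- cut(grades, breaks = bin_breaks, include.lowest = TRUE)

# Frequency of each class interval, i.e. the bar heights of the histogram
bin_counts <- table(bin_labels)
print(bin_counts)
```

Every observation should fall into exactly one interval, so the counts add up to the number of students.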
Graphical Methods | Density Plots
A density plot is a smoothed graphical representation of the distribution of a Continuous Variable. The ‘Density curve’ is drawn by estimating the probability density function of the Continuous Variable using a Kernel Density Estimate.
#Density plot for average grades of MBA Students
density_grades = density(mba_df$avg_grades_of_mba_3_semesters)
#The Y-axis of a density plot shows the estimated density, not a count
plot(density_grades,
     frame = TRUE,
     col = "royalblue",
     main = "Density Plot of MBA Students grades",
     ylab = "Density",
     xlab = "MBA Grades (Last 3 Sem.Avg.)")
polygon(density_grades, col="#1F78B4")
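Two properties of the Kernel Density Estimate are easy to check in code: the area under the curve is approximately 1, and the smoothness depends on the bandwidth. Below is a minimal sketch on a made-up vector called grades (hypothetical values, not the MBA Students Data).

```r
# Hypothetical grades (made up for illustration, not the MBA data)
grades <- c(6.3, 6.5, 6.8, 6.9, 7.2, 7.5, 7.5, 7.8, 8.2, 9.2)

# density() uses a Gaussian kernel by default; bw is the smoothing bandwidth
density_grades <- density(grades)
cat("Bandwidth chosen by default:", density_grades$bw, "\n")

# Approximate the area under the curve with the trapezoidal rule
x <- density_grades$x
y <- density_grades$y
area <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
cat("Area under the density curve:", round(area, 3), "\n")

# adjust scales the bandwidth: values below 1 give a wigglier curve,
# values above 1 give a smoother one
density_narrow <- density(grades, adjust = 0.5)
cat("Bandwidth with adjust = 0.5:", density_narrow$bw, "\n")
```

Because the area is about 1, the height of the curve is a density and cannot be read as a number of students.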
Graphical Methods | Boxplot
- A Boxplot is constructed from the five-number summary, viz., Minimum, First Quartile (Q1), Median (Q2), Third Quartile (Q3), and Maximum
- The rectangular box in the middle represents the Interquartile Range. IQR = Q3 – Q1.
- The limits for non-outlier values are called the Lower Control Limit (LCL) and the Upper Control Limit (UCL).
- LCL = Q1 – IQR * 1.5
- UCL = Q3 + IQR * 1.5
- Any value outside the range of LCL and UCL is an outlier value. R draws the whiskers up to the most extreme data points that lie within this range.
boxplot_grades = boxplot(mba_df$avg_grades_of_mba_3_semesters,
                         main = "Box Plot for Avg. Grades of MBA Students",
                         xlab = "MBA Grades",
                         col = "royalblue",
                         border = "black",
                         horizontal = TRUE)

#Five summary statistics and Outliers
boxplot_5_stats = boxplot_grades$stats
rownames(boxplot_5_stats) = c("LCL","Q1","Median","Q3","UCL")
colnames(boxplot_5_stats) = "Five Summary Statistics"
outliers = boxplot_grades$out
print(boxplot_5_stats)
cat("The Outliers are",outliers)
#Output
#Five Summary Statistics
       Five Summary Statistics
LCL                        6.3
Q1                         6.9
Median                     7.5
Q3                         7.8
UCL                        9.1
#Outliers
The Outliers are 9.2
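- The Boxplot is a commonly used method to identify outliers. In the above output, the value 9.2 is an outlier since it lies above Q3 + IQR * 1.5 = 7.8 + 0.9 * 1.5 = 9.15. Note that the LCL and UCL rows printed by R are the whisker ends, i.e., the most extreme observed grades within the limits (6.3 and 9.1), not the computed limits themselves.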
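The outlier rule can be checked by hand. The sketch below uses a made-up vector called grades (hypothetical values, not the MBA Students Data) and boxplot.stats(), which returns the same numbers that boxplot() draws. It computes the LCL and UCL from the formulas above and compares the flagged values with what R reports.

```r
# Hypothetical grades (made up for illustration, not the MBA data)
grades <- c(6.3, 6.5, 6.8, 6.9, 7.2, 7.5, 7.5, 7.8, 8.2, 9.9)

# boxplot.stats() returns the numbers behind a boxplot without plotting
bs <- boxplot.stats(grades)

# stats = lower whisker, lower hinge (Q1), median, upper hinge (Q3), upper whisker
q1  <- bs$stats[2]
q3  <- bs$stats[4]
iqr <- q3 - q1

# Limits beyond which a point is flagged as an outlier
lcl <- q1 - 1.5 * iqr
ucl <- q3 + 1.5 * iqr

# Outliers found by hand should match the ones boxplot.stats() reports
manual_outliers <- grades[grades < lcl | grades > ucl]

cat("LCL:", lcl, " UCL:", ucl, "\n")
cat("Whisker ends:", bs$stats[1], "and", bs$stats[5], "\n")
cat("Outliers:", manual_outliers, "\n")
```

The whisker ends are actual observations, so they usually sit inside the computed LCL and UCL rather than exactly on them.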
Inferences / Take away
- The grades of the students lie between 6.3 and 9.2.
- The Mean and the Standard Deviation of the students’ grades are 7.43 and 0.6, respectively.
- There is not much dispersion in the students’ grades: the Standard Deviation is small relative to the Mean.
- The middle 50% of the students have grades between 6.9 and 7.8
- The IQR of the students’ grades is 0.9
- The top 10% of the students have secured grades greater than 8.2
- One student has performed exceptionally well, with a grade of 9.2
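One way to put a number on "not much dispersion" is the coefficient of variation, i.e., the Standard Deviation as a percentage of the Mean. This measure is not used elsewhere in this blog, so treat the sketch below as an optional extra. It plugs in the values reported above.

```r
# Values reported in the summary statistics and boxplot output above
grades_mean <- 7.43
grades_std  <- 0.6
q1 <- 6.9
q3 <- 7.8

# Coefficient of variation: standard deviation as a percentage of the mean
cv_pct <- round(grades_std / grades_mean * 100, 1)

# Interquartile range: spread of the middle 50% of the students
iqr <- q3 - q1

cat("Coefficient of variation (%):", cv_pct, "\n")
cat("IQR:", iqr, "\n")
```

A Standard Deviation of roughly 8% of the Mean is consistent with the grades being fairly tightly clustered.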
Practice Exercise
- Write R Code to create a Histogram with a Density Plot in the same chart
- Analyze the 12th Standard percentage marks of the MBA Students. (variable name is “ten_plus_2_pct” in the dataset).
Next Blog
In the next blog, let’s learn “Analysis of two variables”:
- One Categorical and the other a Continuous variable
- Both Categorical
- Both Continuous