Tutorial 2-Exploratory Data Analysis in R

. Exploratory Data Analysis in R
From this section onwards, we’ll dive deep into various stages of predictive modeling. Hence, make sure you understand every aspect of this section. In case you find anything difficult to understand, ask me in the comments section below.
Data Exploration is a crucial stage of predictive model. You can’t build great and practical models unless you learn to explore the data from begin to end. This stage forms a concrete foundation for data manipulation (the very next stage). Let’s understand it in R.
In this tutorial, I’ve taken the data set from Big Data Mart Sales Prediction.  Before we start, you must get familiar with these terms:
Response Variable (a.k.a Dependent Variable): In a data set, the response variable (y) is one on which we make predictions. In this case, we’ll predict ‘Item_Outlet_Sales’. (Refer to image shown below)
Predictor Variable (a.k.a Independent Variable): In a data set, predictor variables (Xi) are those using which the prediction is made on response variable. (Image below).

Train Data: The predictive model is always built on train data set. An intuitive way to identify the train data is, that it always has the ‘response variable’ included.
DOWNLOAD TRAIN DATA
Test Data: Once the model is built, it’s accuracy is ‘tested’ on test data. This data always contains less number of observations than train data set. Also, it does not include ‘response variable’.
DOWNLOAD TEST DATA
Right now, you should download the data set. Take a good look at train and test data. Cross check the information shared above and then proceed.
Let’s now begin with importing and exploring data.
#working directory
path <- ".../Data/BigMartSales"

#set working directory
setwd(path)
As a beginner, I’ll advise you to keep the train and test files in your working directly to avoid unnecessary directory troubles. Once the directory is set, we can easily import the .csv files using commands below.
#Load Datasets
train <- read.csv("Train_UWu5bXk.csv")
test <- read.csv("Test_u94Q5KV.csv")
In fact, even prior to loading data in R, it’s a good practice to look at the data in Excel. This helps in strategizing  the complete prediction modeling process. To check if the data set has been loaded successfully, look at R environment. The data can be seen there. Let’s explore the data quickly.
#check dimesions ( number of row & columns) in data set
> dim(train)
[1] 8523 12

> dim(test)
[1] 5681 11
We have 8523 rows and 12 columns in train data set and 5681 rows and 11 columns in data set. This makes sense. Test data should always have one column less (mentioned above right?). Let’s get deeper in train data set now.
#check the variables and their types in train
> str(train)
'data.frame': 8523 obs. of 12 variables:
$ Item_Identifier : Factor w/ 1559 levels "DRA12","DRA24",..: 157 9 663 1122 1298 759 697 739 441 991 ...
$ Item_Weight : num 9.3 5.92 17.5 19.2 8.93 ...
$ Item_Fat_Content : Factor w/ 5 levels "LF","low fat",..: 3 5 3 5 3 5 5 3 5 5 ...
$ Item_Visibility : num 0.016 0.0193 0.0168 0 0 ...
$ Item_Type : Factor w/ 16 levels "Baking Goods",..: 5 15 11 7 10 1 14 14 6 6 ...
$ Item_MRP : num 249.8 48.3 141.6 182.1 53.9 ...
$ Outlet_Identifier : Factor w/ 10 levels "OUT010","OUT013",..: 10 4 10 1 2 4 2 6 8 3 ...
$ Outlet_Establishment_Year: int 1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ...
$ Outlet_Size : Factor w/ 4 levels "","High","Medium",..: 3 3 3 1 2 3 2 3 1 1 ...
$ Outlet_Location_Type : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 3 1 3 3 3 3 3 2 2 ...
$ Outlet_Type : Factor w/ 4 levels "Grocery Store",..: 2 3 2 1 2 3 2 4 2 2 ...
$ Item_Outlet_Sales : num 3735 443 2097 732 995 ...
Let’s do some quick data exploration.
To begin with, I’ll first check if this data has missing values. This can be done by using:
> table(is.na(train))
FALSE TRUE 
100813 1463
In train data set, we have 1463 missing values. Let’s check the variables in which these values are missing. It’s important to find and locate these missing values. Many data scientists have repeatedly advised beginners to pay close attention to missing value in data exploration stages.
> colSums(is.na(train))
Item_Identifier Item_Weight 
0                1463 
Item_Fat_Content Item_Visibility 
0                 0 
Item_Type         Item_MRP 
0                 0 
Outlet_Identifier Outlet_Establishment_Year 
0                 0 
Outlet_Size       Outlet_Location_Type 
0                 0 
Outlet_Type       Item_Outlet_Sales 
0                 0
Hence, we see that column Item_Weight has 1463 missing values. Let’s get more inferences from this data.
> summary(train)
Here are some quick inferences drawn from variables in train data set:
  1. Item_Fat_Content has mis-matched factor levels.
  2. Minimum value of item_visibility is 0. Practically, this is not possible. If an item occupies shelf space in a grocery store, it ought to have some visibility. We’ll treat all 0’s as missing values.
  3. Item_Weight has 1463 missing values (already explained above).
  4. Outlet_Size has a unmatched factor levels.
These inference will help us in treating these variable more accurately.
 Graphical Representation of Variables
I’m sure you would understand these variables better when explained visually. Using graphs, we can analyze the data in 2 ways: Univariate Analysis and Bivariate Analysis.
Univariate analysis is done with one variable. Bivariate analysis is done with two variables. Univariate analysis is a lot easy to do. Hence, I’ll skip that part here. I’d recommend you to try it at your end. Let’s now experiment doing bivariate analysis and carve out hidden insights.
For visualization, I’ll use ggplot2 package. These graphs would help us understand the distribution and frequency of variables in the data set.
> ggplot(train, aes(x= Item_Visibility, y = Item_Outlet_Sales)) + geom_point(size = 2.5, color="navy") + xlab("Item Visibility") + ylab("Item Outlet Sales") + ggtitle("Item Visibility vs Item Outlet Sales")

We can see that majority of sales has been obtained from products having visibility less than 0.2. This suggests that item_visibility < 2 must be an important factor in determining sales. Let’s plot few more interesting graphs and explore such hidden stories.
> ggplot(train, aes(Outlet_Identifier, Item_Outlet_Sales)) + geom_bar(stat = "identity", color = "purple") +theme(axis.text.x = element_text(angle = 70, vjust = 0.5, color = "black"))  + ggtitle("Outlets vs Total Sales") + theme_bw()


Here, we infer that OUT027 has contributed to majority of sales followed by OUT35. OUT10 and OUT19 have probably the least footfall, thereby contributing to the least outlet sales.
> ggplot(train, aes(Item_Type, Item_Outlet_Sales)) + geom_bar( stat = "identity") +theme(axis.text.x = element_text(angle = 70, vjust = 0.5, color = "navy")) + xlab("Item Type") + ylab("Item Outlet Sales")+ggtitle("Item Type vs Sales")

From this graph, we can infer that Fruits and Vegetables contribute to the highest amount of outlet sales followed by snack foods and household products. This information can also be represented using a box plot chart. The benefit of using a box plot is, you get to see the outlier and mean deviation of corresponding levels of a variable (shown below).
> ggplot(train, aes(Item_Type, Item_MRP)) +geom_boxplot() +ggtitle("Box Plot") + theme(axis.text.x = element_text(angle = 70, vjust = 0.5, color = "red")) + xlab("Item Type") + ylab("Item MRP") + ggtitle("Item Type vs Item MRP")


The black point you see, is an outlier. The mid line you see in the box, is the mean value of each item type.
Now, we have an idea of the variables and their importance on response variable. Let’s now move back to where we started. Missing values. Now we’ll impute the missing values.
We saw variable Item_Weight has missing values. Item_Weight is an continuous variable. Hence, in this case we can impute missing values with mean / median of item_weight. These are the most commonly used methods of imputing missing value.

In next post,will lsee how to combine the data sets and analyze. This will save our time as we don’t need to write separate codes for train and test data sets. To combine the two data frames, we must make sure that they have equal columns, which is not the case.

 I deeply appreciate your kindness in reading my blog posts.and will meet in next post.

Thanks&Regards
NAGARJUNA MAHANTI

Comments

  1. That is very interesting; you are a very skilled blogger. I have shared your website in my social networks! A very nice guide. I will definitely follow these tips. Thank you for sharing such detailed article.

    R Programming Online Training|
    Data Science Online Training|
    Hadoop Online Training

    ReplyDelete
  2. Thank you so much for your kind words, valuable feedback and gracious support — all of which are very much appreciated.

    ReplyDelete
  3. The rise of theIOT data analyticsover the last decade has brought with it a huge wealth of data that we can collect and analyze in order to better manage our businesses, as well as our lives. Data analytics is now a major selling point for many companies looking to build a competitive edge, and it's no surprise that the IoT is driving much of that growth.

    ReplyDelete

Post a Comment