Published by manoj morais, 2017-01-18 17:39:24

BASIC DATA PRE WITH PYTHON

FREE DATA PREPARATION WITH PYTHON; THE BASICS
GUIDE

Aspire Analytic Solutions
19 West, 34th Street
New York, NY, 10001
Web: www.aspireanalyticsolutions.com

Authors | Manoj Morais MA, MBA & Sreekumar Radhakrishna Pillai PhD

PYTHON BASICS
AN INTRODUCTION

Data mining and analytics have taken centre stage in
the contemporary world. A wide variety of research
studies now rely on data analytics.

Data analysis is used for everything from testing
hypotheses to building complex models that predict
future outcomes.

It has grown from analysing numerical data to analysing
text data, for example in media analytics, and much more.
Processing and analysing huge amounts of data has become
simpler thanks to software tools such as SAS and SPSS.

At the same time, open source tools such as R and Python
are available free of cost. This is certainly a blessing
to society and to those who have no access to
expensive packages.

Python is open source software. It is as effective as
paid software and is often called a next generation
tool.

Python comes with high-level statistical capabilities and
visualization libraries (Matplotlib and seaborn), along
with a variety of modules and libraries that make our
task easier.

Python is also one of the most sought-after languages in
the job market, especially in the fields of data analytics,
predictive modelling and text analysis.

The following features make Python distinctive and user
friendly.

• Easy to learn and adapt
• Data cleaning, preparation and advanced analytics can be done quickly, which saves time
• Easy to create metadata and new data sets out of an existing one
• 3D visualization techniques, along with traditional techniques
• Libraries and modules with advanced statistical capabilities
• Easy to work with external files such as MS Excel, JSON, CSV and much more
• Easy to export results and charts from Python to MS PowerPoint to make presentations
• Python modules, libraries and code are updated on a regular basis, and new modules are being developed

ENTHOUGHT CANOPY

We shall work with the Enthought Canopy Python data
analysis environment, which is well suited to
beginners.

You can download it by following the link below.
Once you are on the download page, you can choose
the Windows, Linux or Mac version, in 64-bit or
32-bit.

https://store.enthought.com/downloads/#default

WORKING WITH ENTHOUGHT CANOPY
Once you have downloaded Enthought Canopy to your
system, click on the Canopy icon and then click on
the code editor, sometimes called the Editor.
You will get a screen as shown here.

Here, In [1]: marks the input prompt. You can type
the code and then press Enter to get an output.
In this document we make use of the Pandas and
Matplotlib libraries.

Pandas is an open source library that provides rich,
easy-to-use data structures and related tools
for data analysis in Python.

Matplotlib, on the other hand, is a plotting
library that helps make 2D visualizations.

Please note that it is important to import the
necessary modules and libraries for the respective
operations. For example, we import matplotlib to
make data visualizations.
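
The imports described above can be sketched as follows. The aliases pd and plt are common conventions rather than something this guide prescribes, so treat them as an assumption.

```python
# Import the two libraries used throughout this guide.
import pandas as pd               # data structures and analysis tools
import matplotlib.pyplot as plt   # 2D plotting interface

# Printing the pandas version is a quick check that the import worked.
print(pd.__version__)
```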

Working with Excel data – practise1.xlsx

You can download the data from our blog by visiting
this link:
https://drive.google.com/file/d/0By41tTbcd_5tWXBRWGpyajJKRVk/view

Opening excel data in python
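
A minimal sketch of opening the workbook with pandas, assuming practise1.xlsx sits in the current working directory and that an Excel reader package such as openpyxl is installed. If the file is not found, the sketch falls back to a few stand-in rows that are invented for illustration and are not the real file contents.

```python
import os
import pandas as pd

path = "practise1.xlsx"  # assumed location: the current working directory
nan = float("nan")

if os.path.exists(path):
    # read_excel loads the first worksheet into a DataFrame by default.
    data = pd.read_excel(path)
else:
    # Stand-in rows invented for illustration (not the real file contents).
    data = pd.DataFrame({
        "Year": [2000, nan, 2002, 203, 2004, 2005, nan, 2007],
        "City": ["Toronto", "Montreal", "Tronto", "Vancouver",
                 None, "Montreal", "Calgary", "Ottawa"],
        "Sales": [40000, 60000, 55000, 60000, 70000, 60000, 80000, 90000],
        "Revenue": [120000, 150000, 140000, 155000,
                    170000, 160000, 190000, 210000],
        "Mkt_exp": [10000, 12000, 13000, 15000, 17000, 18000, 21000, 24000],
    })

print(data.shape)  # (number of rows, number of columns)
```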

Now let us view a part of the data by using
the following code
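
One way to view part of the data is head(). The DataFrame below is a stand-in with invented values, used only so that the sketch runs on its own; with the real file, data would come from the Excel workbook.

```python
import pandas as pd

nan = float("nan")
# Stand-in rows invented for illustration (not the real file contents).
data = pd.DataFrame({
    "Year": [2000, nan, 2002, 203, 2004, 2005, nan, 2007],
    "City": ["Toronto", "Montreal", "Tronto", "Vancouver",
             None, "Montreal", "Calgary", "Ottawa"],
    "Sales": [40000, 60000, 55000, 60000, 70000, 60000, 80000, 90000],
})

first_rows = data.head()  # head() returns the first five rows by default
print(first_rows)
```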

After typing the code, press enter and you will get
the following output

Note: In Python, row and column positions start
at 0, whereas in Excel they start at 1.

If you want to view the entire dataset, just type
the following code and then press enter

Now let us extract the variable names and their
types with the following code.
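
A short sketch of listing variable names and types, using stand-in rows invented for illustration. The column names follow the ones mentioned in this guide; the values are assumptions.

```python
import pandas as pd

nan = float("nan")
# Stand-in rows invented for illustration (not the real file contents).
data = pd.DataFrame({
    "Year": [2000, nan, 2002],
    "City": ["Toronto", "Montreal", None],
    "Sales": [40000, 60000, 55000],
    "Revenue": [120000, 150000, 140000],
    "Mkt_exp": [10000, nan, 13000],
})

print(list(data.columns))  # variable (column) names
print(data.dtypes)         # one data type per column
```

Columns that contain NaN are stored as floating point numbers, which is why Year can show a float type even though it holds whole years.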

In the data, Mkt_exp stands for marketing expenses,
that is, the money put towards all marketing
activities. This is a technical term, distinct from
general and other expenses. The intuition is that
as you spend more on marketing, sales increase,
and revenue increases with them.
The dataset also contains NaN values, which mark
missing values.
How to view a particular variable (column)
For example, suppose we want to view the variable Sales

OR
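
The two alternatives can be sketched as bracket access and attribute access. Which two forms the original screenshots showed is an assumption; the values below are invented stand-ins.

```python
import pandas as pd

# Stand-in values invented for illustration.
data = pd.DataFrame({"Sales": [40000, 60000, 55000, 60000]})

sales_a = data["Sales"]  # bracket access works for any column name
sales_b = data.Sales     # attribute access works when the name is a valid identifier

print(sales_a)
```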

How to view a particular row
Suppose we want to view the 5th row. To do so,
do the following
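
A minimal sketch of selecting a row by position with iloc. Because positions start at 0, the 5th row is position 4; whether the original used position 4 or 5 is not recoverable, so this is an assumption. The values are invented stand-ins.

```python
import pandas as pd

# Stand-in values invented for illustration.
data = pd.DataFrame({
    "City": ["Toronto", "Montreal", "Ottawa", "Vancouver", "Calgary", "Montreal"],
    "Sales": [40000, 60000, 55000, 60000, 70000, 60000],
})

fifth_row = data.iloc[4]  # position 4 is the 5th row because counting starts at 0
print(fifth_row)
```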

SIMPLE DATA PREPARATION PROCESS

Let us do some simple data cleaning to replace the
NaN values.
First, let us consider the field ‘Year’ and set the
value in row 1 to 2001

The following code will accomplish the same
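
One way to set a single cell is label-based assignment with loc. This is a sketch under the assumption that "row 1" means the row labelled 1 in Python's zero-based index; the original code may have used a different form. The values are invented stand-ins.

```python
import pandas as pd

nan = float("nan")
# Stand-in values invented for illustration.
data = pd.DataFrame({"Year": [2000, nan, 2002, 203, 2004, 2005, nan, 2007]})

data.loc[1, "Year"] = 2001  # row label 1, column Year

print(data["Year"])
```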

Similarly we also have a missing value in the
column Year on row 6

Note: data.head() shows records from the top of
the dataset and data.tail() shows records from
the bottom.

Here we are making use of fillna.
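
A minimal sketch of fillna on the Year column. The guide does not state which year belongs in row 6, so the fill value 2006 is a placeholder assumption, as are the stand-in values.

```python
import pandas as pd

nan = float("nan")
# Stand-in values invented for illustration; only row 6 is still missing.
data = pd.DataFrame({"Year": [2000, 2001, 2002, 203, 2004, 2005, nan, 2007]})

# fillna replaces every remaining NaN in the column with the given value.
# 2006 is a placeholder; the source does not say which year was used.
data["Year"] = data["Year"].fillna(2006)

print(data.tail())  # tail() shows the last five rows, including row 6
```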
Now let us view the data
Now let us replace 203 with 2003
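
Replacing a mistyped value can be sketched with Series.replace. The stand-in values are invented; only the 203 to 2003 correction comes from this guide.

```python
import pandas as pd

# Stand-in values invented for illustration.
data = pd.DataFrame({"Year": [2000, 2001, 2002, 203, 2004, 2005, 2006, 2007]})

# replace swaps every occurrence of 203 in the column for 2003.
data["Year"] = data["Year"].replace(203, 2003)

print(data["Year"])
```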

Similarly, let us replace Tronto with Toronto in the
variable ‘City’
Now let us fill the missing values in the field ‘City’
with Toronto.
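
Both City corrections can be sketched with replace followed by fillna. The stand-in rows are invented for illustration; the Tronto spelling and the Toronto fill value come from this guide.

```python
import pandas as pd

# Stand-in values invented for illustration.
data = pd.DataFrame({
    "City": ["Toronto", "Montreal", "Tronto", "Vancouver",
             None, "Montreal", "Calgary", "Ottawa"],
})

data["City"] = data["City"].replace("Tronto", "Toronto")  # fix the misspelling
data["City"] = data["City"].fillna("Toronto")             # fill missing cities

print(data["City"])
```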

Now let us check the data again

The data is now error free and ready for further analysis

MEASURES OF CENTRAL TENDENCY

MEAN

Let us find out the Mean sales.
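
A minimal sketch of the mean using the pandas mean() method. The Sales values are invented stand-ins, so the result will differ from the one obtained with the real file.

```python
import pandas as pd

# Stand-in values invented for illustration.
data = pd.DataFrame({
    "Sales": [40000, 60000, 55000, 60000, 70000, 60000, 80000, 90000],
})

mean_sales = data["Sales"].mean()  # arithmetic mean of the column
print(mean_sales)
```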

MEDIAN

Similarly let us find the median sales
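
The median can be sketched the same way with median(). The Sales values are invented stand-ins.

```python
import pandas as pd

# Stand-in values invented for illustration.
data = pd.DataFrame({
    "Sales": [40000, 60000, 55000, 60000, 70000, 60000, 80000, 90000],
})

median_sales = data["Sales"].median()  # middle value of the sorted column
print(median_sales)
```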

MODE: Numeric data

Finding the mode for the variable Sales
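
The original appears to use mode from scipy.stats. As a sketch that needs only pandas, the swapped-in calls below are Series.mode() for the value and value_counts() for how often it occurs. The Sales values are invented stand-ins, chosen so that 60000 occurs 3 times.

```python
import pandas as pd

# Stand-in values invented for illustration.
data = pd.DataFrame({
    "Sales": [40000, 60000, 55000, 60000, 70000, 60000, 80000, 90000],
})

mode_values = data["Sales"].mode()                # most frequent value(s)
mode_count = data["Sales"].value_counts().iloc[0] # counts are sorted, largest first

print(mode_values.tolist(), mode_count)
```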

The results show that the mode is 60000 and
that it occurs 3 times.
MODE: Text data

Now let us find the mode for the variable City, which holds text data.
You have to import mode from scipy.stats if you haven’t done so
already, and then proceed with the following.
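
Recent SciPy releases, as far as we are aware, accept only numeric input in scipy.stats.mode, so this sketch swaps in pandas, which handles text columns directly. The City values are invented stand-ins, chosen so that Toronto occurs 3 times.

```python
import pandas as pd

# Stand-in values invented for illustration.
data = pd.DataFrame({
    "City": ["Toronto", "Montreal", "Toronto", "Vancouver",
             "Toronto", "Montreal", "Calgary", "Ottawa"],
})

city_mode = data["City"].mode().iloc[0]          # most frequent city
city_count = data["City"].value_counts().iloc[0] # how often it occurs

print(city_mode, city_count)
```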

Here, the results show that the mode is Toronto
and that it occurs 3 times.

STANDARD DEVIATION

Find the standard deviation for the variable
‘Revenue’
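
A minimal sketch using the pandas std() method, which computes the sample standard deviation (dividing by n - 1) by default. Whether the original used the sample or the population form is not recoverable, and the Revenue values are invented stand-ins.

```python
import pandas as pd

# Stand-in values invented for illustration.
revenue = pd.Series(
    [120000, 150000, 140000, 155000, 170000, 160000, 190000, 210000],
    name="Revenue",
)

revenue_std = revenue.std()             # sample standard deviation (ddof=1)
revenue_std_pop = revenue.std(ddof=0)   # population standard deviation

print(revenue_std, revenue_std_pop)
```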

SOME DATA VISUALIZATION TECHNIQUES

SCATTER PLOT
Let us make a simple scatter plot on Sales and
Revenue.
Now let us plot the graph

You could view the chart at this point.

Let us name the axes and chart

Viewing the chart
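
The scatter plot steps can be sketched with pyplot as follows. The data values and the title wording are invented stand-ins, and the Agg backend line is an assumption that lets the sketch run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; drop this line to open a window
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in values invented for illustration.
data = pd.DataFrame({
    "Sales": [40000, 60000, 55000, 60000, 70000, 60000, 80000, 90000],
    "Revenue": [120000, 150000, 140000, 155000, 170000, 160000, 190000, 210000],
})

plt.figure()
plt.scatter(data["Sales"], data["Revenue"])  # one point per row
plt.xlabel("Sales")                          # name the axes
plt.ylabel("Revenue")
plt.title("Sales and Revenue")               # name the chart
# In an interactive session (without the Agg line), plt.show() displays the chart.
```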

LINE PLOT
Now let us make a line graph of the variables Sales and
Year

Now let us name the axes and give a chart title

Let us view the output.
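
A sketch of the line plot steps, assuming Year goes on the x axis and Sales on the y axis. The data values and the title wording are invented stand-ins, and the Agg backend line lets the sketch run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; drop this line to open a window
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in values invented for illustration.
data = pd.DataFrame({
    "Year": [2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007],
    "Sales": [40000, 60000, 55000, 60000, 70000, 60000, 80000, 90000],
})

plt.figure()
plt.plot(data["Year"], data["Sales"])  # line through the points in row order
plt.xlabel("Year")
plt.ylabel("Sales")
plt.title("Sales by Year")
# In an interactive session (without the Agg line), plt.show() displays the chart.
```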

FRAME AND SUB AREAS

Now let us have 4 different charts of the same
data within a single frame.

We shall have a steps-post chart, a steps-post chart
combined with a line, a scatter chart and a simple line chart.

First, we shall begin with creating a frame.
At this moment you can view the empty figure
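
Creating the empty frame can be sketched with plt.figure(). The variable name fig is an assumption, and the Agg backend line lets the sketch run without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; drop this line to open a window
import matplotlib.pyplot as plt

fig = plt.figure()  # an empty figure with no sub areas yet

print(len(fig.axes))  # number of sub areas currently in the figure
```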

You can maximise the window for a better view on
your computer.
Now let us start making sub areas. We shall have four
areas, namely area1, area2, area3 and area4.

SUB AREA 1

You could always maximise for a better view.
Now let us plot a steps-post chart for Year and Sales
Now let us name the axes and title.
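
A sketch of sub area 1, assuming a 2 by 2 grid so that all four charts fit in one frame; the grid shape is not stated in this guide. The steps-post look comes from the drawstyle argument of plot. The data values and title wording are invented stand-ins.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; drop this line to open a window
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in values invented for illustration.
data = pd.DataFrame({
    "Year": [2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007],
    "Sales": [40000, 60000, 55000, 60000, 70000, 60000, 80000, 90000],
})

fig = plt.figure()
area1 = fig.add_subplot(2, 2, 1)  # first cell of an assumed 2 x 2 grid

# steps-post holds each value flat until the next x position.
area1.plot(data["Year"], data["Sales"], drawstyle="steps-post")
area1.set_xlabel("Year")
area1.set_ylabel("Sales")
area1.set_title("Steps-post: Sales by Year")
```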

Now we shall work on the sub area2.

SUB AREA 2

Now let us make a combined chart, steps-post with a
line, for Year and Sales.
Making area2

Now let us construct a steps-post chart.

At this point, you can view the chart.
Now it is time to combine the line chart with the steps-post chart.

Now you have a line with steps-post chart.

Similarly, there are steps-pre and steps-mid
charts, which we are not showing here.
Now let us add the chart title and axes titles.
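
The combined chart can be sketched by plotting twice on the same sub area, once with drawstyle steps-post and once as a plain line. The 2 by 2 grid, the data values and the title wording are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; drop this line to open a window
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in values invented for illustration.
data = pd.DataFrame({
    "Year": [2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007],
    "Sales": [40000, 60000, 55000, 60000, 70000, 60000, 80000, 90000],
})

fig = plt.figure()
area2 = fig.add_subplot(2, 2, 2)  # second cell of an assumed 2 x 2 grid

area2.plot(data["Year"], data["Sales"], drawstyle="steps-post")  # steps-post chart
area2.plot(data["Year"], data["Sales"])                          # plain line on top
area2.set_xlabel("Year")
area2.set_ylabel("Sales")
area2.set_title("Steps-post with line")
```

Passing drawstyle="steps-pre" or drawstyle="steps-mid" instead gives the other two step variants.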

SUB AREA 3
We shall make a simple line plot for Mkt_exp and
Revenue. Before that we shall construct a sub
plot/area within the frame.

Now we will make a line graph on Marketing
expenses (Mkt_exp) and Revenue.
Giving chart title and axes titles

Viewing the chart
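
A sketch of sub area 3 as a plain line plot of Mkt_exp against Revenue. The 2 by 2 grid, the data values and the title wording are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; drop this line to open a window
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in values invented for illustration.
data = pd.DataFrame({
    "Mkt_exp": [10000, 12000, 13000, 15000, 17000, 18000, 21000, 24000],
    "Revenue": [120000, 150000, 140000, 155000, 170000, 160000, 190000, 210000],
})

fig = plt.figure()
area3 = fig.add_subplot(2, 2, 3)  # third cell of an assumed 2 x 2 grid

area3.plot(data["Mkt_exp"], data["Revenue"])  # simple line plot
area3.set_xlabel("Mkt_exp")
area3.set_ylabel("Revenue")
area3.set_title("Marketing expenses and Revenue")
```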

SUB AREA 4

Now we shall make a scatter plot where x is Mkt_exp
and y is Revenue.

Titles and labels
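
A sketch of sub area 4 as a scatter plot with its title and labels. The 2 by 2 grid, the data values and the title wording are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; drop this line to open a window
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in values invented for illustration.
data = pd.DataFrame({
    "Mkt_exp": [10000, 12000, 13000, 15000, 17000, 18000, 21000, 24000],
    "Revenue": [120000, 150000, 140000, 155000, 170000, 160000, 190000, 210000],
})

fig = plt.figure()
area4 = fig.add_subplot(2, 2, 4)  # fourth cell of an assumed 2 x 2 grid

area4.scatter(data["Mkt_exp"], data["Revenue"])  # x is Mkt_exp, y is Revenue
area4.set_xlabel("Mkt_exp")
area4.set_ylabel("Revenue")
area4.set_title("Scatter: Mkt_exp and Revenue")
```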

On the next page we shall view the entire figure.

Showing all 4 charts within a single frame

Conclusion

As you have seen, Python is a very promising tool for
data analysts and researchers. One of the key attractions
of this tool is that you can learn it easily even if you
have no programming knowledge. Once you learn it, you can
easily extend your code and tailor it to your needs and
convenience. Every day, new code is released by open source
programmers to address emerging data processing demands.
Those who work in the field of data mining take advantage
of this tool to find solutions to complex situations. This
is especially true in the case of text mining.

If you have any questions or want to learn more about this,
please feel free to contact us. You can reach us at

www.aspireanalyticsolutions.com.

We are happy to hear from you.

