Introduction to Python pandas


pandas is a general-purpose, high-performance, open-source Python library used for analysing many different types of data. It relies on powerful underlying data structures to process large amounts of data.

pandas provides a variety of functionality:

1. Loading data from different data sources
2. Indexing and slicing of data
3. Transformation of data
4. Inserting new data or deleting existing data
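
As a rough illustration of the four capabilities listed above, here is a minimal sketch. The CSV text, the column names and the values are invented for this example, and the data is read from an in-memory string so that no external file is needed.

```python
import io

import pandas as pd

# Hypothetical CSV content, kept in memory so the sketch needs no external file.
csv_text = "name,score\nasha,70\nben,85\ncara,90\n"

# 1. Load data from a source (here a file-like object).
df = pd.read_csv(io.StringIO(csv_text))

# 2. Index and slice: the first two rows, then a single column.
first_two = df.iloc[:2]
scores = df["score"]

# 3. Transform: derive a new column from an existing one.
df["score_pct"] = df["score"] / 100

# 4. Insert a new row, then delete an existing one.
new_row = pd.DataFrame([{"name": "dev", "score": 60, "score_pct": 0.6}])
df = pd.concat([df, new_row], ignore_index=True)
df = df.drop(index=0).reset_index(drop=True)

print(df)
```

Each of these steps is covered in more detail in the rest of this article.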

Before going further, it is recommended that you have some knowledge of NumPy. You can read about it in our NumPy tutorial.

Data Structures in pandas

pandas has mainly provided three types of data structure. They are built on NumPy, which is a large part of why they are fast.
1. Series
2. DataFrame
3. Panel (deprecated in pandas 0.20 and removed in later releases)

Series

A Series is a one-dimensional, homogeneous, labeled array. Series data looks like a one-dimensional array, as below:
Numeric data: 1,5,8,6,7,9,11

String data: data1,data2,data3,data4,data5,data6,data7,data8

DataFrame

A DataFrame is a two-dimensional, tabular data structure that can store heterogeneous data, with each column holding its own data type.

An example of a DataFrame:

ID   Name   Age  Gender
1    Katie  35   Female
101  James  28   Male
306  Steve  21   Male
406  Lia    44   Female
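
The table above can be built as a DataFrame from a dictionary of columns. This is a minimal sketch; the variable name people is our own choice.

```python
import pandas as pd

# Each dictionary key becomes a column; each list holds that column's values.
people = pd.DataFrame(
    {
        "ID": [1, 101, 306, 406],
        "Name": ["Katie", "James", "Steve", "Lia"],
        "Age": [35, 28, 21, 44],
        "Gender": ["Female", "Male", "Male", "Female"],
    }
)

print(people)

# Every column carries its own data type (numeric for ID and Age, text for the rest).
print(people.dtypes)
```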

Panels

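A Panel is a three-dimensional labeled array. Because panels are three-dimensional, it is difficult to show an example here; we can think of a panel as a container of DataFrames. Note that Panel was deprecated in pandas 0.20 and has been removed from later releases, so new code should use a DataFrame with a MultiIndex instead.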

How to install pandas/NumPy?
1) pip install pandas
2) pip install numpy

[root]# python
Python 2.7.5 (default, Oct 30 2018, 23:45:53) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> import pandas
>>> 

 

Example 1: Creating a simple Series from a NumPy array.

#!/usr/bin/python
#import the pandas and numpy library
import pandas as pd
import numpy as np
# Creating a numpy array with some numeric data
numericdata = np.array([100,200,300,400])

#Assigning value to series
series = pd.Series(numericdata)

print(series)


===========Output===========
0    100
1    200
2    300
3    400
dtype: int64

 

In the above example, the index is assigned automatically by pandas, but pandas also gives us the flexibility to assign our own index labels.

#!/usr/bin/python
#import the pandas and numpy library
import pandas as pd
import numpy as np
# Creating a numpy array with some numeric data
numericdata = np.array([100,200,300,400])

#Assigning value and index to series
series = pd.Series(numericdata, index=[7,8,9,22])

print(series)

========output==========
7     100
8     200
9     300
22    400
dtype: int64

In the above example, the index labels are assigned manually.
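
Once a Series has custom labels, values can be looked up either by label or by position. The sketch below reuses the same data and index as the example above; the variable names by_label, by_position and subset are ours.

```python
import numpy as np
import pandas as pd

series = pd.Series(np.array([100, 200, 300, 400]), index=[7, 8, 9, 22])

# .loc looks values up by index label.
by_label = series.loc[22]

# .iloc looks values up by integer position (0 is the first element).
by_position = series.iloc[0]

# A list of labels returns a smaller Series.
subset = series.loc[[7, 9]]

print(by_label)
print(by_position)
print(subset)
```

Using .loc and .iloc explicitly avoids ambiguity when the labels are themselves integers, as they are here.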

Attributes of Series

Series provides the following attributes and methods:
1. axes
2. size
3. values
4. head()
5. tail()
6. empty

Let’s see an example of the above attributes:

#!/usr/bin/python
#import the pandas and numpy library
import pandas as pd
import numpy as np
# Creating a numpy array with some numeric data
numericdata = np.array([100,200,304,400])

#Assigning value to series
series = pd.Series(numericdata)

print("axis of series {0}".format(series.axes))

print("Is series empty {0}".format(series.empty))

print("Size of series {0}".format(series.size))

print("values in series {0}".format(series.values))

# head provides first n elements of series
print("Head of series {0}".format(series.head(2)))

# Tail provides last n elements of series
print("Tail of series {0}".format(series.tail(2)))


=======output=======
axis of series [RangeIndex(start=0, stop=4, step=1)]
Is series empty False
Size of series 4
values in series [100 200 304 400]
Head of series 0    100
1    200
dtype: int64
Tail of series 2    304
3    400
dtype: int64

Attributes of DataFrames

DataFrames provide the following attributes and methods:
1. T (transpose)
2. empty
3. shape
4. size
5. values
6. head()
7. tail()

Let’s take an example of all the above:

#!/usr/bin/python
#import the pandas and numpy library
import pandas as pd
import numpy as np
# Creating a dictionary that maps column names to column values
data = {"ID":[4,3,2,1],
"Name":["Name1","name2","name3","Name4"],
"Age":[10,20,30,40]
}

# Create a dataframe
df =pd.DataFrame(data)
print("Actual data")
print(df)

#Transpose of data
print("Transpose of data")
print(df.T)

# Axes are
print("Axes of data")
print(df.axes)

# empty tells us whether the data frame has no data
print("Is empty ?")
print(df.empty)

# shape provides rows and columns contained in the data frame
print("Shape data")
print(df.shape)

# Head and tail get data from the top or bottom of the data frame
print("Top 2 records from data frame")
print(df.head(2))

print("last 2 records from dataframe")
print(df.tail(2))


=======output======
Actual data
   Age  ID   Name
0   10   4  Name1
1   20   3  name2
2   30   2  name3
3   40   1  Name4
Transpose of data
          0      1      2      3
Age      10     20     30     40
ID        4      3      2      1
Name  Name1  name2  name3  Name4
Axes of data
[RangeIndex(start=0, stop=4, step=1), Index([u'Age', u'ID', u'Name'], dtype='object')]
Is empty ?
False
Shape data
(4, 3)
Top 2 records from data frame
   Age  ID   Name
0   10   4  Name1
1   20   3  name2
last 2 records from dataframe
   Age  ID   Name
2   30   2  name3
3   40   1  Name4
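
The example above does not print size and values, so here is a short sketch of those two attributes using the same data.

```python
import pandas as pd

data = {
    "ID": [4, 3, 2, 1],
    "Name": ["Name1", "name2", "name3", "Name4"],
    "Age": [10, 20, 30, 40],
}
df = pd.DataFrame(data)

# size is the total number of cells: rows multiplied by columns.
print(df.size)

# values returns the underlying data as a two-dimensional NumPy array.
print(df.values)
```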

pandas provides several math functions; some of them are listed below:

1. sum()
2. count()
3. mean()
4. min()
5. max()
6. prod()
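
A minimal sketch of these functions on a small Series follows. The numbers are arbitrary and chosen only for illustration.

```python
import pandas as pd

values = pd.Series([2, 4, 6, 8])

print(values.sum())    # total of all values
print(values.count())  # number of non-missing values
print(values.mean())   # arithmetic mean
print(values.min())    # smallest value
print(values.max())    # largest value
print(values.prod())   # product of all values
```

The same functions can be called on a DataFrame, where they are applied to each column by default.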

Working with text data

pandas provides very good functionality for dealing with text data. When we process text we often need to split, sort, count and apply many similar functions, and having them built in makes the task easier.

These functions are available through the .str accessor of a Series; some of them are listed below:
1. lower()
2. upper()
3. len()
4. split()
5. replace()
6. count()
7. find()
8. islower()
9. isupper()
10. isnumeric()

Example of the above functions:

#!/usr/bin/python
# import the pandas library
import pandas as pd

s = pd.Series(['Raj kumar', 'Amit singh', 'John', 'katie', 20, '897456','Steve','smith'])

print("All text is in lower case")
print(s.str.lower())

print("All text is in upper case")
print(s.str.upper())

print("Length of each text in string")
print(s.str.len())

print("split string by space")
print(s.str.split(' '))

print("check string contains a specific data or not it return true and false")
print(s.str.contains(' '))

print("count a specific word and character in string ")
print(s.str.count('R'))

=======output==========
All text is in lower case
0     raj kumar
1    amit singh
2          john
3         katie
4           NaN
5        897456
6         steve
7         smith
dtype: object
All text is in upper case
0     RAJ KUMAR
1    AMIT SINGH
2          JOHN
3         KATIE
4           NaN
5        897456
6         STEVE
7         SMITH
dtype: object
Length of each text in string
0     9.0
1    10.0
2     4.0
3     5.0
4     NaN
5     6.0
6     5.0
7     5.0
dtype: float64
split string by space
0     [Raj, kumar]
1    [Amit, singh]
2           [John]
3          [katie]
4              NaN
5         [897456]
6          [Steve]
7          [smith]
dtype: object
check string contains a specific data or not it return true and false
0     True
1     True
2    False
3    False
4      NaN
5    False
6    False
7    False
dtype: object
count a specific word and character in string 
0    1.0
1    0.0
2    0.0
3    0.0
4    NaN
5    0.0
6    0.0
7    0.0
dtype: float64
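
The example above does not cover replace(), find(), islower(), isupper() or isnumeric(). Here is a minimal sketch of those, using a smaller Series of our own that contains only strings.

```python
import pandas as pd

names = pd.Series(["Raj kumar", "JOHN", "katie", "897456"])

# Replace every literal "a" with "@" (regex=False treats the pattern as plain text).
replaced = names.str.replace("a", "@", regex=False)

# Position of the first "a" in each string, or -1 when it is absent.
position = names.str.find("a")

# Case and numeric checks return one boolean per element.
lower_flags = names.str.islower()
upper_flags = names.str.isupper()
numeric_flags = names.str.isnumeric()

print(replaced)
print(position)
print(lower_flags)
print(upper_flags)
print(numeric_flags)
```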

SQL operations on pandas

Many SQL operations have an equivalent in pandas.

The example below focuses on the SELECT, WHERE and LIMIT statements, and on how the same queries can be expressed against a data frame.

CSV File: (http://insight.dev.schoolwires.com/HelpAssets/C2Assets/C2Files/C2ImportGroupsSample.csv)

GroupName,Groupcode ,GroupOwner,GroupCategoryID 
System Administrators,sysadmin,13456,100
Independence High Teachers,HS Teachers,,101
John Glenn Middle Teachers,MS Teachers,13458,102
Liberty Elementary Teachers,Elem Teachers,13559,103
1st Grade Teachers,1stgrade,,104
2nd Grade Teachers,2nsgrade,13561,105
3rd Grade Teachers,3rdgrade,13562,106
Guidance Department,guidance,,107
Independence Math Teachers,HS Math,13660,108
Independence English Teachers,HS English,13661,109
John Glenn 8th Grade Teachers,8thgrade,,110
John Glenn 7th Grade Teachers,7thgrade,13452,111
Elementary Parents,Elem Parents,,112
Middle School Parents,MS Parents,18001,113
High School Parents,HS Parents,18002,114

 

#!/usr/bin/python
# import the pandas library
import pandas as pd

# Read CSV
url = 'http://insight.dev.schoolwires.com/HelpAssets/C2Assets/C2Files/C2ImportGroupsSample.csv'

csvdata=pd.read_csv(url)
print(csvdata.head())

# Select some specific columns, like SQL SELECT col1, col2
print("Select some specific columns")
print(csvdata[["GroupName","GroupOwner"]].head())

# Filter rows with a condition, like SQL WHERE
print("Filer record")
print(csvdata[csvdata["GroupName"]=="System Administrators"])

# Select the top n rows, like SQL LIMIT
print("Select only 2 rows like as SQL")
print(csvdata.head(2))

========output==========

                    GroupName     Groupcode   GroupOwner  GroupCategoryID 
0        System Administrators       sysadmin     13456.0               100
1   Independence High Teachers    HS Teachers         NaN               101
2   John Glenn Middle Teachers    MS Teachers     13458.0               102
3  Liberty Elementary Teachers  Elem Teachers     13559.0               103
4           1st Grade Teachers       1stgrade         NaN               104
Select some specific columns
                     GroupName  GroupOwner
0        System Administrators     13456.0
1   Independence High Teachers         NaN
2   John Glenn Middle Teachers     13458.0
3  Liberty Elementary Teachers     13559.0
4           1st Grade Teachers         NaN
Filer record
               GroupName Groupcode   GroupOwner  GroupCategoryID 
0  System Administrators   sysadmin     13456.0               100
Select only 2 rows like as SQL
                    GroupName   Groupcode   GroupOwner  GroupCategoryID 
0       System Administrators     sysadmin     13456.0               100
1  Independence High Teachers  HS Teachers         NaN               101
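
The same ideas can be tried without network access. The sketch below is our own variation: it reads the first four rows of the sample CSV from an in-memory string, with the stray spaces in the header removed, and adds an ORDER BY equivalent that the example above does not cover. The SQL shown in the comments is only an analogy.

```python
import io

import pandas as pd

# First rows of the sample file, with header whitespace cleaned up.
csv_text = """GroupName,Groupcode,GroupOwner,GroupCategoryID
System Administrators,sysadmin,13456,100
Independence High Teachers,HS Teachers,,101
John Glenn Middle Teachers,MS Teachers,13458,102
Liberty Elementary Teachers,Elem Teachers,13559,103
"""
groups = pd.read_csv(io.StringIO(csv_text))

# SELECT GroupName, GroupCategoryID FROM groups
selected = groups[["GroupName", "GroupCategoryID"]]

# SELECT * FROM groups WHERE GroupOwner IS NOT NULL AND GroupCategoryID > 100
has_owner = groups["GroupOwner"].notna()
filtered = groups[has_owner & (groups["GroupCategoryID"] > 100)]

# SELECT * FROM groups ORDER BY GroupCategoryID DESC LIMIT 2
top_two = groups.sort_values("GroupCategoryID", ascending=False).head(2)

print(selected)
print(filtered)
print(top_two)
```

Note that each condition in a combined filter needs its own parentheses, because & binds more tightly than comparison operators in Python.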

 

pandas provides a variety of other functionality, such as merging, grouping, handling missing data, date functionality, time deltas and many more.
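
To give a first taste of these features, here is a very small sketch. The two tables, their column names and their values are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical tables, invented for illustration.
orders = pd.DataFrame(
    {"customer_id": [1, 2, 1, 3], "amount": [10.0, None, 5.0, 8.0]}
)
customers = pd.DataFrame(
    {"customer_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]}
)

# Missing data: replace the missing amount with 0.
orders["amount"] = orders["amount"].fillna(0)

# Merging: attach the customer name to each order, like a SQL JOIN.
merged = orders.merge(customers, on="customer_id", how="left")

# Grouping: total amount per customer.
totals = merged.groupby("name")["amount"].sum()

# Dates and time deltas: add seven days to a timestamp.
deadline = pd.Timestamp("2024-01-01") + pd.Timedelta(days=7)

print(totals)
print(deadline)
```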

This article focuses mainly on the basics of pandas and on how to start working with it.

For detailed information, you can read the official pandas documentation (https://pandas.pydata.org/pandas-docs/stable/)