Introduction to Python pandas
Pandas is a general-purpose, high-performance, open-source Python library used for analyzing data of different types. It relies on powerful underlying data structures to process large datasets.
Pandas provides a variety of functionality:
1. Loading data from different data sources
2. Indexing and slicing of data
3. Transformation of data
4. Inserting new data or deleting existing data
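The four capabilities listed above can be sketched in a few lines. This is a minimal illustration and not part of the original examples: the CSV text, the column names (city, temp_c, temp_f) and all values are made up, and io.StringIO stands in for a real file.

```python
import io
import pandas as pd

# Hypothetical CSV content; a real program would pass a file path or URL.
csv_text = "city,temp_c\nOslo,4\nCairo,30\nLima,19\n"

# 1. Load data from a source (here an in-memory text buffer).
df = pd.read_csv(io.StringIO(csv_text))

# 2. Index and slice: the first two rows, then a single column.
first_two = df.iloc[:2]
temps = df["temp_c"]

# 3. Transform: derive a Fahrenheit column from the Celsius one.
df["temp_f"] = df["temp_c"] * 9 / 5 + 32

# 4. Insert a new row, then delete an existing one by its index label.
new_row = pd.DataFrame({"city": ["Pune"], "temp_c": [33], "temp_f": [91.4]})
df = pd.concat([df, new_row], ignore_index=True)
df = df.drop(index=0)

print(df)
```

Each step returns a new object or assigns a column explicitly, which is the usual pandas style.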
Before going further, you should have a working knowledge of numpy. You can learn it from our numpy tutorial.
Data Structures in pandas
Pandas provides three main data structures. They are built on numpy arrays, which is a large part of why they are fast.
1. Series
2. Dataframe
3. Panel
Series
A series is a one-dimensional homogeneous array, meaning all of its elements share one data type. Series data looks like a one-dimensional array, for example:
Numeric data: 1,5,8,6,7,9,11
String data: data1,data2,data3,data4,data5,data6,data7,data8
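To make "homogeneous" concrete, the two sample sequences above can be turned into Series objects and their dtype inspected. This is a minimal sketch; the variable names are illustrative, and the exact dtype names printed depend on the platform and pandas version.

```python
import pandas as pd

# The numeric sample becomes a Series with a single integer dtype.
numbers = pd.Series([1, 5, 8, 6, 7, 9, 11])

# The string sample (shortened here) is stored under one dtype as well.
labels = pd.Series(["data1", "data2", "data3", "data4"])

print(numbers.dtype)
print(labels.dtype)

# Because every element shares the dtype, vectorised arithmetic works.
doubled = numbers * 2
print(doubled.tolist())
```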
Dataframe
A data frame is a two-dimensional tabular data structure whose columns can each hold a different data type.
An example of a data frame:
ID | Name | Age | Gender |
1 | Katie | 35 | Female |
101 | James | 28 | Male |
306 | Steve | 21 | Male |
406 | Lia | 44 | Female |
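The table above can be built as a DataFrame from a dictionary of columns. This is a minimal sketch; the variable name people is an assumption made for illustration.

```python
import pandas as pd

# Column-oriented dictionary mirroring the table above.
people = pd.DataFrame(
    {
        "ID": [1, 101, 306, 406],
        "Name": ["Katie", "James", "Steve", "Lia"],
        "Age": [35, 28, 21, 44],
        "Gender": ["Female", "Male", "Male", "Female"],
    }
)

# Each column keeps its own dtype, which is what "heterogeneous" means here.
print(people.dtypes)

# Rows and columns are both addressable.
second_name = people.loc[1, "Name"]
oldest = people["Age"].max()
print(second_name, oldest)
```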
Panels
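A panel is a three-dimensional labeled array. Because panels have three dimensions, it is difficult to show an example here; a panel can be thought of as a container of data frames. Note that Panel was deprecated in pandas 0.20 and removed in pandas 0.25, so it is only available in older releases.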
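On pandas versions that do not ship Panel, the "container of data frames" idea can still be approximated. One common option, shown here as an assumption rather than something this tutorial prescribes, is to stack several data frames under an outer index level with pd.concat. The frame contents and the keys 2023 and 2024 are made up.

```python
import pandas as pd

# Two small frames with the same shape; all values are hypothetical.
sales_2023 = pd.DataFrame({"units": [10, 20], "price": [1.5, 2.0]}, index=["a", "b"])
sales_2024 = pd.DataFrame({"units": [12, 18], "price": [1.6, 2.1]}, index=["a", "b"])

# Stack them under an outer index level: rows are (year, item) pairs.
stacked = pd.concat({2023: sales_2023, 2024: sales_2024}, names=["year", "item"])
print(stacked)

# Select one "slice" of the 3-D structure: the whole frame for 2024.
frame_2024 = stacked.loc[2024]
print(frame_2024)

# Select a single cell by (year, item) and column.
units_2023_b = stacked.loc[(2023, "b"), "units"]
print(units_2023_b)
```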
How to install pandas/numpy?
1) pip install pandas
2) pip install numpy
To verify the installation, import both libraries in the Python interpreter:
[root]# python
Python 2.7.5 (default, Oct 30 2018, 23:45:53)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> import pandas
>>>
Example1: Creating a simple series using numpy array.
#!/usr/bin/python
# Import the pandas and numpy libraries
import pandas as pd
import numpy as np

# Create a numpy array with some numeric data
numericdata = np.array([100, 200, 300, 400])

# Build a series from the array
series = pd.Series(numericdata)
print(series)

===========Output===========
0    100
1    200
2    300
3    400
dtype: int64
In the above example, the index is assigned automatically by pandas. Pandas also lets us assign our own index labels:
#!/usr/bin/python
# Import the pandas and numpy libraries
import pandas as pd
import numpy as np

# Create a numpy array with some numeric data
numericdata = np.array([100, 200, 300, 400])

# Build a series from the array with an explicit index
series = pd.Series(numericdata, index=[7, 8, 9, 22])
print(series)

========output==========
7     100
8     200
9     300
22    400
dtype: int64
In the above example, the index labels are assigned manually.
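Once a series has custom labels, it helps to distinguish lookup by label from lookup by position. The sketch below reuses the series from the example above; the second series with string labels is a made-up addition for illustration.

```python
import numpy as np
import pandas as pd

series = pd.Series(np.array([100, 200, 300, 400]), index=[7, 8, 9, 22])

# .loc looks up by index label.
by_label = series.loc[22]

# .iloc looks up by integer position, regardless of the labels.
by_position = series.iloc[0]

# Labels do not have to be integers; strings work the same way.
named = pd.Series([100, 200, 300, 400], index=["a", "b", "c", "d"])
subset = named.loc["b":"c"]  # label slices include both endpoints

print(by_label, by_position)
print(subset)
```

Using .loc and .iloc explicitly avoids ambiguity when the labels are themselves integers, as they are in the first series.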
Attributes of Series
Series provides the following attributes and methods:
1. axes
2. size
3. values
4. head()
5. tail()
6. empty
Let’s see an example of the above attributes:
#!/usr/bin/python
# Import the pandas and numpy libraries
import pandas as pd
import numpy as np

# Create a numpy array with some numeric data
numericdata = np.array([100, 200, 304, 400])

# Build a series from the array
series = pd.Series(numericdata)

print("axis of series {0}".format(series.axes))
print("Is series empty {0}".format(series.empty))
print("Size of series {0}".format(series.size))
print("values in series {0}".format(series.values))

# head returns the first n elements of the series
print("Head of series {0}".format(series.head(2)))

# tail returns the last n elements of the series
print("Tail of series {0}".format(series.tail(2)))

=======output=======
axis of series [RangeIndex(start=0, stop=4, step=1)]
Is series empty False
Size of series 4
values in series [100 200 304 400]
Head of series 0    100
1    200
dtype: int64
Tail of series 2    304
3    400
dtype: int64
Attributes of Data frames
Data frames provide the following attributes and methods:
1. T (transpose)
2. empty
3. shape
4. size
5. values
6. head()
7. tail()
Let’s take an example of all the above:
#!/usr/bin/python
# Import the pandas and numpy libraries
import pandas as pd
import numpy as np

# Create a dictionary of columns
data = {"ID": [4, 3, 2, 1],
        "Name": ["Name1", "name2", "name3", "Name4"],
        "Age": [10, 20, 30, 40]}

# Create a data frame
df = pd.DataFrame(data)
print("Actual data")
print(df)

# Transpose of the data
print("Transpose of data")
print(df.T)

# Axes: the row index and the column index
print("Axes of data")
print(df.axes)

# empty tells us whether the data frame has data or not
print("Is empty ?")
print(df.empty)

# shape gives the number of rows and columns in the data frame
print("Shape data")
print(df.shape)

# head and tail return rows from the top or bottom of the data frame
print("Top 2 records from data frame")
print(df.head(2))
print("last 2 records from dataframe")
print(df.tail(2))

=======output======
Actual data
   Age  ID   Name
0   10   4  Name1
1   20   3  name2
2   30   2  name3
3   40   1  Name4
Transpose of data
          0      1      2      3
Age      10     20     30     40
ID        4      3      2      1
Name  Name1  name2  name3  Name4
Axes of data
[RangeIndex(start=0, stop=4, step=1), Index([u'Age', u'ID', u'Name'], dtype='object')]
Is empty ?
False
Shape data
(4, 3)
Top 2 records from data frame
   Age  ID   Name
0   10   4  Name1
1   20   3  name2
last 2 records from dataframe
   Age  ID   Name
2   30   2  name3
3   40   1  Name4
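The example above does not exercise size and values, so here is a minimal sketch of those two attributes on the same data. It assumes a Python version where dictionaries keep insertion order (3.7 or later), so the columns appear as ID, Name, Age.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ID": [4, 3, 2, 1],
        "Name": ["Name1", "name2", "name3", "Name4"],
        "Age": [10, 20, 30, 40],
    }
)

# size is the total number of cells: rows multiplied by columns.
print(df.size)

# values returns the data as a two-dimensional numpy array.
arr = df.values
print(arr.shape)
print(arr[0])
```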
Pandas also provides math functions. Some of them are listed below:
1. sum()
2. count()
3. mean()
4. min()
5. max()
6. prod()
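The functions listed above can be called on a whole data frame or on a single column. This is a minimal sketch; the scores frame and its column names are made up for illustration.

```python
import pandas as pd

scores = pd.DataFrame({"maths": [80, 90, 70], "art": [60, 75, 90]})

# On a DataFrame, each function works column by column by default.
print(scores.sum())
print(scores.mean())

# On a single column (a Series), each function returns one scalar.
col = scores["maths"]
total = col.sum()
n = col.count()
average = col.mean()
lowest = col.min()
highest = col.max()
product = col.prod()

print(total, n, average, lowest, highest, product)
```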
Working with text data
Pandas provides good support for text data. Processing text usually involves operations such as splitting, sorting, and counting, and having them built in makes the task easier.
These functions are available through the str accessor of a Series. Some of them are listed below:
1. lower()
2. upper()
3. len()
4. split()
5. replace()
6. count()
7. find()
8. islower()
9. isupper()
10. isnumeric()
Example of above functions:
#!/usr/bin/python
# Import the pandas library
import pandas as pd

s = pd.Series(['Raj kumar', 'Amit singh', 'John', 'katie', 20, '897456', 'Steve', 'smith'])

print("All text is in lower case")
print(s.str.lower())
print("All text is in upper case")
print(s.str.upper())
print("Length of each text in string")
print(s.str.len())
print("split string by space")
print(s.str.split(' '))
print("check string contains a specific data or not it return true and false")
print(s.str.contains(' '))
print("count a specific word and character in string ")
print(s.str.count('R'))

=======output==========
All text is in lower case
0     raj kumar
1    amit singh
2          john
3         katie
4           NaN
5        897456
6         steve
7         smith
dtype: object
All text is in upper case
0     RAJ KUMAR
1    AMIT SINGH
2          JOHN
3         KATIE
4           NaN
5        897456
6         STEVE
7         SMITH
dtype: object
Length of each text in string
0     9.0
1    10.0
2     4.0
3     5.0
4     NaN
5     6.0
6     5.0
7     5.0
dtype: float64
split string by space
0     [Raj, kumar]
1    [Amit, singh]
2           [John]
3          [katie]
4              NaN
5         [897456]
6          [Steve]
7          [smith]
dtype: object
check string contains a specific data or not it return true and false
0     True
1     True
2    False
3    False
4      NaN
5    False
6    False
7    False
dtype: object
count a specific word and character in string
0    1.0
1    0.0
2    0.0
3    0.0
4    NaN
5    0.0
6    0.0
7    0.0
dtype: float64
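The example above leaves out replace(), find(), islower(), isupper() and isnumeric(). The following minimal sketch covers them on a shorter, all-string series; the sample values are chosen for illustration.

```python
import pandas as pd

s = pd.Series(["Raj kumar", "katie", "897456", "STEVE"])

# replace: substitute a literal substring (regex=False disables patterns).
replaced = s.str.replace("a", "@", regex=False)

# find: position of the first match, or -1 when there is none.
found = s.str.find("e")

# islower / isupper: True only if every cased character matches.
is_lower = s.str.islower()
is_upper = s.str.isupper()

# isnumeric: True if every character is numeric.
is_numeric = s.str.isnumeric()

print(replaced.tolist())
print(found.tolist())
print(is_lower.tolist(), is_upper.tolist(), is_numeric.tolist())
```

Note that find() is case-sensitive, so "STEVE" yields -1 when searching for a lowercase "e".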
SQL operations on pandas
Many SQL operations have an equivalent in pandas.
The example here focuses on the SELECT, WHERE, and LIMIT statements and shows how the same queries can be expressed on a data frame.
The sample data and the example are below:
CSV File: (http://insight.dev.schoolwires.com/HelpAssets/C2Assets/C2Files/C2ImportGroupsSample.csv)
GroupName,Groupcode ,GroupOwner,GroupCategoryID
System Administrators,sysadmin,13456,100
Independence High Teachers,HS Teachers,,101
John Glenn Middle Teachers,MS Teachers,13458,102
Liberty Elementary Teachers,Elem Teachers,13559,103
1st Grade Teachers,1stgrade,,104
2nd Grade Teachers,2nsgrade,13561,105
3rd Grade Teachers,3rdgrade,13562,106
Guidance Department,guidance,,107
Independence Math Teachers,HS Math,13660,108
Independence English Teachers,HS English,13661,109
John Glenn 8th Grade Teachers,8thgrade,,110
John Glenn 7th Grade Teachers,7thgrade,13452,111
Elementary Parents,Elem Parents,,112
Middle School Parents,MS Parents,18001,113
High School Parents,HS Parents,18002,114
#!/usr/bin/python
# Import the pandas library
import pandas as pd

# Read the CSV file
url = 'http://insight.dev.schoolwires.com/HelpAssets/C2Assets/C2Files/C2ImportGroupsSample.csv'
csvdata = pd.read_csv(url)
print(csvdata.head())

# Select specific columns, like SQL SELECT
print("Select some specific columns")
print(csvdata[["GroupName", "GroupOwner"]].head())

# Filter rows, like SQL WHERE
print("Filter record")
print(csvdata[csvdata["GroupName"] == "System Administrators"])

# Select the top n rows, like SQL LIMIT
print("Select only 2 rows like as SQL")
print(csvdata.head(2))

========output==========
                     GroupName      Groupcode  GroupOwner  GroupCategoryID
0        System Administrators       sysadmin     13456.0              100
1   Independence High Teachers    HS Teachers         NaN              101
2   John Glenn Middle Teachers    MS Teachers     13458.0              102
3  Liberty Elementary Teachers  Elem Teachers     13559.0              103
4           1st Grade Teachers       1stgrade         NaN              104
Select some specific columns
                     GroupName  GroupOwner
0        System Administrators     13456.0
1   Independence High Teachers         NaN
2   John Glenn Middle Teachers     13458.0
3  Liberty Elementary Teachers     13559.0
4           1st Grade Teachers         NaN
Filter record
               GroupName Groupcode  GroupOwner  GroupCategoryID
0  System Administrators  sysadmin     13456.0              100
Select only 2 rows like as SQL
                    GroupName    Groupcode  GroupOwner  GroupCategoryID
0       System Administrators     sysadmin     13456.0              100
1  Independence High Teachers  HS Teachers         NaN              101
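The same idea extends to combined conditions, ordering and null checks. The sketch below is an illustration under stated assumptions: it holds the first five rows of the sample CSV in memory, with the header written without stray spaces, so it does not depend on the URL being reachable. The SQL shown in the comments is the query each step is meant to mimic.

```python
import io
import pandas as pd

# First five rows of the sample CSV, kept in memory for this sketch.
csv_text = """GroupName,Groupcode,GroupOwner,GroupCategoryID
System Administrators,sysadmin,13456,100
Independence High Teachers,HS Teachers,,101
John Glenn Middle Teachers,MS Teachers,13458,102
Liberty Elementary Teachers,Elem Teachers,13559,103
1st Grade Teachers,1stgrade,,104
"""
groups = pd.read_csv(io.StringIO(csv_text))

# SELECT GroupName, GroupCategoryID FROM groups
# WHERE GroupOwner IS NOT NULL AND GroupCategoryID > 100
# ORDER BY GroupCategoryID DESC LIMIT 1
mask = groups["GroupOwner"].notna() & (groups["GroupCategoryID"] > 100)
result = (
    groups.loc[mask, ["GroupName", "GroupCategoryID"]]
    .sort_values("GroupCategoryID", ascending=False)
    .head(1)
)
print(result)

# SELECT COUNT(*) FROM groups WHERE GroupOwner IS NULL
missing_owner_count = int(groups["GroupOwner"].isna().sum())
print(missing_owner_count)
```

Each condition must be wrapped in parentheses when combined with & or |, because those operators bind more tightly than comparisons in Python.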
Pandas provides a variety of other functionality,
such as merging, grouping, handling missing data, date functionality, time deltas and many more.
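As a brief taste of the features just mentioned, the sketch below touches each one once. The orders and customers frames, their column names and all values are hypothetical; this is an outline of what is available, not a full treatment.

```python
import pandas as pd

orders = pd.DataFrame(
    {
        "customer_id": [1, 2, 1, 3],
        "amount": [50.0, None, 25.0, 40.0],
        "ordered_on": ["2024-01-05", "2024-01-07", "2024-02-01", "2024-02-03"],
    }
)
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})

# Missing data: replace the missing amount with 0.
orders["amount"] = orders["amount"].fillna(0.0)

# Date functionality: parse the strings into timestamps.
orders["ordered_on"] = pd.to_datetime(orders["ordered_on"])

# Time delta: add a fixed seven-day offset to every order date.
orders["due_on"] = orders["ordered_on"] + pd.Timedelta(days=7)

# Merging: attach customer names, similar to a SQL JOIN.
merged = orders.merge(customers, on="customer_id", how="left")

# Grouping: total amount per customer.
totals = merged.groupby("name")["amount"].sum()
print(totals)
```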
This article focuses on the basics of pandas and how to start working with it.
For detailed information, see the official pandas documentation (https://pandas.pydata.org/pandas-docs/stable/)