Pandas for complete beginners

Introduction
This tutorial explains the basics and commonly used functions of Pandas, one of the most widely used open-source Python libraries for data analysis. It builds on much of the functionality of NumPy. Pandas provides three data structures: Series, DataFrame, and Panel (the last of which is now deprecated). A Series is a 1D labeled homogeneous array and is size-immutable. A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. A Pandas DataFrame consists of three principal components: the data, the rows, and the columns.


A DataFrame is made up of Series: each Series is an array-like structure representing a single column. When a DataFrame is built from a dictionary, the dictionary keys become the column labels.
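As a minimal sketch of the Series building block (the labels and values below are illustrative):

```python
import pandas as pd

# A Series is a 1-D labeled array; if no index is given it defaults to 0..n-1
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b'])      # label-based access -> 20
print(s.values)    # the underlying NumPy array
```

Accessing a single column of a DataFrame, e.g. `df['Name']`, returns exactly such a Series.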

In [1]:
import pandas as pd
import numpy as np
In [2]:

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
   'Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])}

#Create a DataFrame
df = pd.DataFrame(d)
print (df)
      Name  Age  Rating
0      Tom   25    4.23
1    James   26    3.24
2    Ricky   25    3.98
3      Vin   23    2.56
4    Steve   30    3.20
5    Smith   29    4.60
6     Jack   23    3.80
7      Lee   34    3.78
8    David   40    2.98
9   Gasper   30    4.80
10  Betina   51    4.10
11  Andres   46    3.65
In [3]:
print (df.sum()) # column-wise sums; string columns are concatenated
Name      TomJamesRickyVinSteveSmithJackLeeDavidGasperBe...
Age                                                     382
Rating                                                44.92
dtype: object
In [4]:
print (df.sum(1)) # axis=1: row-wise sums over the numeric columns
0     29.23
1     29.24
2     28.98
3     25.56
4     33.20
5     33.60
6     26.80
7     37.78
8     42.98
9     34.80
10    55.10
11    49.65
dtype: float64
In [5]:
print (df.mean())
Age       31.833333
Rating     3.743333
dtype: float64
In [6]:
print (df.std())
Age       9.232682
Rating    0.661628
dtype: float64
In [7]:
print (df.describe())
             Age     Rating
count  12.000000  12.000000
mean   31.833333   3.743333
std     9.232682   0.661628
min    23.000000   2.560000
25%    25.000000   3.230000
50%    29.500000   3.790000
75%    35.500000   4.132500
max    51.000000   4.800000
In [8]:
print (df.describe(include=['object']))
print (df.describe(include=['number'])) # numeric columns only
         Name
count      12
unique     12
top     James
freq        1
             Age     Rating
count  12.000000  12.000000
mean   31.833333   3.743333
std     9.232682   0.661628
min    23.000000   2.560000
25%    25.000000   3.230000
50%    29.500000   3.790000
75%    35.500000   4.132500
max    51.000000   4.800000
In [9]:
print (df.describe(include='all'))
         Name        Age     Rating
count      12  12.000000  12.000000
unique     12        NaN        NaN
top     James        NaN        NaN
freq        1        NaN        NaN
mean      NaN  31.833333   3.743333
std       NaN   9.232682   0.661628
min       NaN  23.000000   2.560000
25%       NaN  25.000000   3.230000
50%       NaN  29.500000   3.790000
75%       NaN  35.500000   4.132500
max       NaN  51.000000   4.800000

For table-wise operations, use the pipe() method.

For row-wise and column-wise operations, use the apply() method: axis=0 applies the function to each column, axis=1 to each row.

For element-wise operations, use applymap() on a DataFrame and map() on a Series.

In [10]:
##Table-wise operation
def adder(ele1,ele2):
   return ele1+ele2

df_tablewise = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print("Original table : \n {} " .format(df_tablewise))
print("\n After adding 2 by tablewise operation")

print (df_tablewise.pipe(adder,2))
Original table : 
        col1      col2      col3
0  0.662554 -0.034132  1.189290
1 -0.370736 -0.659157  1.808778
2  0.605755 -0.110194 -0.077588
3  0.873752 -0.446084  0.508982
4 -1.167333 -0.969286 -0.201992 

 After adding 2 by tablewise operation
       col1      col2      col3
0  2.662554  1.965868  3.189290
1  1.629264  1.340843  3.808778
2  2.605755  1.889806  1.922412
3  2.873752  1.553916  2.508982
4  0.832667  1.030714  1.798008
In [11]:
## Row or Column Wise Function Application##

df_columnwise = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print (df_columnwise)
print (df_columnwise.apply(np.mean)) # the aggregated columns must be numeric
#print (df.apply(np.mean))
       col1      col2      col3
0  0.751117 -1.026489  0.247884
1  0.379869 -0.155872  1.444529
2 -0.385003  0.315316  2.965140
3  0.338429 -0.879719 -0.695811
4 -0.454858  0.255418 -0.680425
col1    0.125911
col2   -0.298269
col3    0.656263
dtype: float64
In [12]:
df_rowwise = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print (df_rowwise)
print (df_rowwise.apply(np.mean, axis=1)) # axis=0 for column-wise and axis=1 for row-wise
#print (df.apply(np.mean))
       col1      col2      col3
0  1.220620  1.527704  0.655957
1  1.598960 -1.996201  1.643199
2  2.032741 -0.928201 -0.351705
3  0.749571  0.894567  0.075032
4 -0.148097  0.410690 -1.278748
0    1.134760
1    0.415319
2    0.250945
3    0.573057
4   -0.338718
dtype: float64
In [13]:
df_elementwise = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
print (df_elementwise)
# My custom function
op=df_elementwise.applymap(lambda x:x*100)
print (op)
print (op.apply(np.mean))
df_elementwise['col1'].map(lambda x:x*100)
       col1      col2      col3
0 -1.295957  0.693389 -0.916367
1 -0.507170  0.508616 -0.379203
2  1.351135  0.713741 -0.699786
3 -0.575137  1.616111  0.846915
4  0.817004 -0.030320  0.276768
         col1        col2       col3
0 -129.595667   69.338913 -91.636657
1  -50.716995   50.861562 -37.920265
2  135.113477   71.374141 -69.978616
3  -57.513684  161.611106  84.691463
4   81.700381   -3.032017  27.676756
col1    -4.202498
col2    70.030741
col3   -17.433464
dtype: float64
Out[13]:
0   -129.595667
1    -50.716995
2    135.113477
3    -57.513684
4     81.700381
Name: col1, dtype: float64
In [14]:
import pandas as pd
import numpy as np

N=20

df_reindex = pd.DataFrame({
   'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
   'x': np.linspace(0,stop=N-1,num=N),
   'y': np.random.rand(N),
   'C': np.random.choice(['Low','Medium','High'],N).tolist(),
   'D': np.random.normal(100, 10, size=(N)).tolist()
})
print(df_reindex)

#reindex the DataFrame; column 'B' does not exist in the original, so it is filled with NaN
df_reindexed =df_reindex.reindex(index=[0,2,5], columns=['A', 'C', 'B'])

print (df_reindexed)
            A     x         y       C           D
0  2016-01-01   0.0  0.246914    High   96.738071
1  2016-01-02   1.0  0.626971     Low   99.729046
2  2016-01-03   2.0  0.061072     Low   91.339152
3  2016-01-04   3.0  0.033459    High  106.301486
4  2016-01-05   4.0  0.242078    High  104.684955
5  2016-01-06   5.0  0.281413     Low   85.163331
6  2016-01-07   6.0  0.539611    High   95.554018
7  2016-01-08   7.0  0.404055  Medium   97.644716
8  2016-01-09   8.0  0.852051    High   85.559831
9  2016-01-10   9.0  0.331988    High  101.424936
10 2016-01-11  10.0  0.061462     Low   96.645614
11 2016-01-12  11.0  0.695607     Low   89.040005
12 2016-01-13  12.0  0.034903     Low   81.561846
13 2016-01-14  13.0  0.920957  Medium  110.604128
14 2016-01-15  14.0  0.384684     Low   90.329175
15 2016-01-16  15.0  0.612612  Medium  100.569179
16 2016-01-17  16.0  0.235509    High   96.251810
17 2016-01-18  17.0  0.964536  Medium  116.237577
18 2016-01-19  18.0  0.397522     Low   99.383378
19 2016-01-20  19.0  0.400588    High   98.606609
           A     C   B
0 2016-01-01  High NaN
2 2016-01-03   Low NaN
5 2016-01-06   Low NaN
In [15]:
df1 = pd.DataFrame(np.random.randn(10,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(7,3),columns=['col1','col2','col3'])
print(df1)
print(df2)
df1 = df1.reindex_like(df2) # reindex df1 to match df2's index and columns
print (df1)
       col1      col2      col3
0 -1.831564 -0.459239 -0.760486
1 -0.157637 -1.543497  0.448835
2  0.746714 -0.880673 -1.134125
3  0.130233  1.032702 -0.685558
4  0.051306 -0.788662  0.034547
5 -2.368614 -0.076649 -0.743274
6 -0.197915 -0.468060  0.197387
7 -1.025164  1.713484 -0.107493
8  0.588698  0.598082  1.642736
9  0.282388  0.918162  1.215639
       col1      col2      col3
0  0.479234 -1.399477 -0.475404
1  0.844047 -1.407968  0.819594
2 -0.041638 -0.573865 -0.165217
3  1.899362 -0.161106 -0.866806
4  0.251061  0.258331 -0.427087
5  0.051707  0.704990  0.192740
6 -0.632569  0.030260  1.803254
       col1      col2      col3
0 -1.831564 -0.459239 -0.760486
1 -0.157637 -1.543497  0.448835
2  0.746714 -0.880673 -1.134125
3  0.130233  1.032702 -0.685558
4  0.051306 -0.788662  0.034547
5 -2.368614 -0.076649 -0.743274
6 -0.197915 -0.468060  0.197387
In [16]:
df1 = pd.DataFrame(np.random.randn(10,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(7,3),columns=['col1','col2','col3'])

# Reindexing pads the missing rows with NaN
print (df2.reindex_like(df1))

# Now fill the NaNs from the next valid value (backward fill)
print ("Data Frame with Backward Fill:")
print (df2.reindex_like(df1,method='bfill', limit=1)) # limit=1: fill at most 1 consecutive NaN; here the NaNs are at the end, so nothing follows to fill from
       col1      col2      col3
0 -0.120850  0.782853 -0.367969
1 -2.186293  1.611200  0.321400
2 -1.274295 -0.138021 -0.693892
3  0.441868  0.135784  1.680443
4 -1.430777  0.471035 -1.790769
5 -1.593930  0.627738 -0.759367
6 -0.507239 -1.927682  1.158852
7       NaN       NaN       NaN
8       NaN       NaN       NaN
9       NaN       NaN       NaN
Data Frame with Backward Fill:
       col1      col2      col3
0 -0.120850  0.782853 -0.367969
1 -2.186293  1.611200  0.321400
2 -1.274295 -0.138021 -0.693892
3  0.441868  0.135784  1.680443
4 -1.430777  0.471035 -1.790769
5 -1.593930  0.627738 -0.759367
6 -0.507239 -1.927682  1.158852
7       NaN       NaN       NaN
8       NaN       NaN       NaN
9       NaN       NaN       NaN
In [17]:
df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
print (df1)

print ("After renaming the rows and columns:")
print (df1.rename(columns={'col1' : 'c1', 'col2' : 'c2'},
index = {0 : 'apple', 1 : 'banana', 2 : 'durian'}))
       col1      col2      col3
0  1.095013 -0.814865  0.597993
1  1.163142  0.846129  1.663861
2  0.341342 -0.679917  1.012594
3 -0.137381  0.179107 -0.480849
4 -1.769515 -0.453522 -0.511565
5  2.265916 -0.231434 -1.159136
After renaming the rows and columns:
              c1        c2      col3
apple   1.095013 -0.814865  0.597993
banana  1.163142  0.846129  1.663861
durian  0.341342 -0.679917  1.012594
3      -0.137381  0.179107 -0.480849
4      -1.769515 -0.453522 -0.511565
5       2.265916 -0.231434 -1.159136

Iterating a DataFrame

In [18]:
## Three iteration methods: iteritems() (column-wise), iterrows() (row-wise), itertuples() (row-wise, as namedtuples)
 
N=20

df = pd.DataFrame({
    'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
    'x': np.linspace(0,stop=N-1,num=N),
    'y': np.random.rand(N),
    'C': np.random.choice(['Low','Medium','High'],N).tolist(),
    'D': np.random.normal(100, 10, size=(N)).tolist()
    })
print(df)
for col in df:
   print (col)
            A     x         y       C           D
0  2016-01-01   0.0  0.132175    High  110.191050
1  2016-01-02   1.0  0.596989     Low   83.742718
2  2016-01-03   2.0  0.805215    High  111.604316
3  2016-01-04   3.0  0.549825  Medium  110.312950
4  2016-01-05   4.0  0.998690  Medium   87.304092
5  2016-01-06   5.0  0.766348  Medium  113.616397
6  2016-01-07   6.0  0.133945    High   94.766168
7  2016-01-08   7.0  0.835936     Low   90.815598
8  2016-01-09   8.0  0.234572     Low  103.270011
9  2016-01-10   9.0  0.586804     Low  109.653296
10 2016-01-11  10.0  0.509883  Medium  119.667708
11 2016-01-12  11.0  0.681792    High   94.392546
12 2016-01-13  12.0  0.451434    High  121.559999
13 2016-01-14  13.0  0.353345    High  109.647851
14 2016-01-15  14.0  0.443566    High   90.409164
15 2016-01-16  15.0  0.004881  Medium   94.407092
16 2016-01-17  16.0  0.149589  Medium  100.790720
17 2016-01-18  17.0  0.109709    High  100.347045
18 2016-01-19  18.0  0.488315    High  113.581431
19 2016-01-20  19.0  0.576906    High  115.540724
A
x
y
C
D
In [19]:
### Column-wise iteration: iteritems() yields (column label, Series) pairs
df = pd.DataFrame(np.random.randn(4,3),columns=['col1','col2','col3'])
print(df)
for key,value in df.iteritems():
   print (key,"\n",value)
   #print (type(value)) ## returns pandas series
       col1      col2      col3
0  0.482949 -0.333236  0.608632
1  0.286628 -0.924229  0.868388
2  1.341135 -1.195910 -0.492072
3  0.156296 -1.330476  0.199868
col1 
 0    0.482949
1    0.286628
2    1.341135
3    0.156296
Name: col1, dtype: float64
col2 
 0   -0.333236
1   -0.924229
2   -1.195910
3   -1.330476
Name: col2, dtype: float64
col3 
 0    0.608632
1    0.868388
2   -0.492072
3    0.199868
Name: col3, dtype: float64
In [20]:
df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])
for row_index,row in df.iterrows():
   print (row_index,row)
   #print (type(row)) # pandas series
0 col1   -0.717594
col2   -0.961852
col3    0.319508
Name: 0, dtype: float64
1 col1    0.665165
col2   -0.941683
col3   -0.877763
Name: 1, dtype: float64
2 col1   -0.936091
col2    0.376677
col3   -0.625468
Name: 2, dtype: float64
3 col1   -0.867062
col2   -0.051948
col3   -1.607202
Name: 3, dtype: float64
In [21]:
df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])
for row in df.itertuples():
    print (row)
#print(type(row)) # each row is a namedtuple named Pandas
Pandas(Index=0, col1=0.7325600239896838, col2=-0.6091177739565541, col3=0.2647577126945533)
Pandas(Index=1, col1=1.3010576431306753, col2=-0.7886668337170729, col3=-0.882557291694557)
Pandas(Index=2, col1=-0.8205276703805691, col2=-1.0094248441697704, col3=0.3926227157226406)
Pandas(Index=3, col1=-0.31519346936752574, col2=0.7727981982208723, col3=0.9410792350904242)
In [22]:
##Python Pandas - Sorting
## Two approaches: 1. by index (sort_index)  2. by actual values (sort_values)
unsorted_df=pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns=['col2','col1'])
print (unsorted_df)
       col2      col1
1 -0.524067 -0.026807
4  1.133088 -1.969396
6 -0.404062  1.583782
2  0.133941  0.830913
3  0.951446 -0.036375
5 -0.553048  0.364095
9 -0.633351 -2.110858
8  1.830660  0.126638
0  0.405804  1.096510
7  0.042751  0.479346
In [23]:
sorted_df=unsorted_df.sort_index()
print (sorted_df)
       col2      col1
0  0.405804  1.096510
1 -0.524067 -0.026807
2  0.133941  0.830913
3  0.951446 -0.036375
4  1.133088 -1.969396
5 -0.553048  0.364095
6 -0.404062  1.583782
7  0.042751  0.479346
8  1.830660  0.126638
9 -0.633351 -2.110858
In [24]:
##order of sorting
sorted_df = unsorted_df.sort_index(ascending=False)
print (sorted_df)
       col2      col1
9 -0.633351 -2.110858
8  1.830660  0.126638
7  0.042751  0.479346
6 -0.404062  1.583782
5 -0.553048  0.364095
4  1.133088 -1.969396
3  0.951446 -0.036375
2  0.133941  0.830913
1 -0.524067 -0.026807
0  0.405804  1.096510
In [25]:
print (unsorted_df)
sorted_df=unsorted_df.sort_index(axis=1) # axis=1 sorts the column labels; the default axis=0 sorts the row index

print (sorted_df)
       col2      col1
1 -0.524067 -0.026807
4  1.133088 -1.969396
6 -0.404062  1.583782
2  0.133941  0.830913
3  0.951446 -0.036375
5 -0.553048  0.364095
9 -0.633351 -2.110858
8  1.830660  0.126638
0  0.405804  1.096510
7  0.042751  0.479346
       col1      col2
1 -0.026807 -0.524067
4 -1.969396  1.133088
6  1.583782 -0.404062
2  0.830913  0.133941
3 -0.036375  0.951446
5  0.364095 -0.553048
9 -2.110858 -0.633351
8  0.126638  1.830660
0  1.096510  0.405804
7  0.479346  0.042751
In [26]:
## sorting by value
unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by='col1')
print (unsorted_df)
print (sorted_df)

sorted_df = unsorted_df.sort_values(by='col1', kind='mergesort') # mergesort is the only stable sorting algorithm offered

print (sorted_df)
   col1  col2
0     2     1
1     1     3
2     1     2
3     1     4
   col1  col2
1     1     3
2     1     2
3     1     4
0     2     1
   col1  col2
1     1     3
2     1     2
3     1     4
0     2     1
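sort_values also accepts a list of columns, with a per-column sort direction; a small sketch on the same shape of data:

```python
import pandas as pd

df = pd.DataFrame({'col1': [2, 1, 1, 1], 'col2': [1, 3, 2, 4]})

# Sort by col1 ascending, breaking ties by col2 descending
result = df.sort_values(by=['col1', 'col2'], ascending=[True, False])
print(result)
```

Rows sharing col1 == 1 come out ordered by col2 from largest to smallest, and the col1 == 2 row moves to the end.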
In [27]:
### Pandas exposes an options API for customizing its display and behavior.
## Relevant functions: get_option(), set_option(), reset_option(), describe_option(), option_context()
## Default maximum numbers of rows and columns to display
print (pd.get_option("display.max_rows"))
print (pd.get_option("display.max_columns"))
print("\n")
pd.set_option("display.max_rows",90)
pd.set_option("display.max_columns",45)
print (pd.get_option("display.max_rows"))
print (pd.get_option("display.max_columns"))
print("\n")
pd.reset_option("display.max_rows") ## reset to default values
pd.reset_option("display.max_columns") ## reset to default values
print (pd.get_option("display.max_rows"))
print (pd.get_option("display.max_columns"))

print("\n")
with pd.option_context("display.max_rows",10): ## temporarily set within the with block
   print(pd.get_option("display.max_rows"))   
   print(pd.get_option("display.max_columns"))
60
20


90
45


60
20


10
20

Python Pandas - Indexing and Selecting Data

.loc[] --> label based  .iloc[] --> integer based  .ix[] --> both label and integer based (deprecated)
In [28]:
## Label based
df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])
print(df)
#select all rows for a specific column
print (df.loc[:,'A'])

# Select all rows for multiple columns, say list[]
print (df.loc[:,['A','C']])

# Select few rows for multiple columns, say list[]
print (df.loc[['a','b','f','h'],['A','C']])

# Select range of rows for all columns
print (df.loc['a':'f'])

# for getting values with a boolean array
print (df.loc['a']>0)
          A         B         C         D
a  0.838961 -0.643973  0.277190 -1.198172
b -0.169509  1.005068  0.166208  0.943208
c -0.557566 -0.329840 -0.317426 -0.679609
d  0.996009 -0.836951  1.015292  1.123211
e -0.277678  2.117664 -1.518030 -0.913785
f -0.366445 -2.340873 -0.487580  0.188724
g -0.077793  0.998534 -0.724225  0.535968
h  0.399931 -1.310118 -0.099159  0.372527
a    0.838961
b   -0.169509
c   -0.557566
d    0.996009
e   -0.277678
f   -0.366445
g   -0.077793
h    0.399931
Name: A, dtype: float64
          A         C
a  0.838961  0.277190
b -0.169509  0.166208
c -0.557566 -0.317426
d  0.996009  1.015292
e -0.277678 -1.518030
f -0.366445 -0.487580
g -0.077793 -0.724225
h  0.399931 -0.099159
          A         C
a  0.838961  0.277190
b -0.169509  0.166208
f -0.366445 -0.487580
h  0.399931 -0.099159
          A         B         C         D
a  0.838961 -0.643973  0.277190 -1.198172
b -0.169509  1.005068  0.166208  0.943208
c -0.557566 -0.329840 -0.317426 -0.679609
d  0.996009 -0.836951  1.015292  1.123211
e -0.277678  2.117664 -1.518030 -0.913785
f -0.366445 -2.340873 -0.487580  0.188724
A     True
B    False
C     True
D    False
Name: a, dtype: bool
In [29]:
#Integer based. Like Python and NumPy, iloc uses 0-based indexing.
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
print(df)
# select all rows for a specific column
print (df.iloc[:4])
print (df.iloc[1:5, 2:4])

# Slicing through list of values
print (df.iloc[[1, 3, 5], [1, 3]])
print (df.iloc[1:3, :])
print (df.iloc[:,1:3])
print("\n hybrid approach")
## hybrid approach ## raises a deprecation warning; use loc and iloc instead
# Integer slicing
print (df.ix[:3]) ## with an integer index, ix slices like loc: the endpoint 3 is included
# Index slicing
print (df.ix[:,'A']) 
          A         B         C         D
0  0.012963 -0.057518  1.472855 -0.253070
1  0.807859  1.632489 -0.323745 -0.154050
2 -0.806308 -1.718625  1.120670 -0.868967
3  1.066694  1.875995 -0.489651  1.113774
4  0.005293 -1.284196  0.080734  0.297806
5  0.308887  2.404678  1.297191 -0.841452
6  0.328302  0.287114  0.449698 -0.759473
7 -1.780179 -1.135797 -0.829774  0.926161
          A         B         C         D
0  0.012963 -0.057518  1.472855 -0.253070
1  0.807859  1.632489 -0.323745 -0.154050
2 -0.806308 -1.718625  1.120670 -0.868967
3  1.066694  1.875995 -0.489651  1.113774
          C         D
1 -0.323745 -0.154050
2  1.120670 -0.868967
3 -0.489651  1.113774
4  0.080734  0.297806
          B         D
1  1.632489 -0.154050
3  1.875995  1.113774
5  2.404678 -0.841452
          A         B         C         D
1  0.807859  1.632489 -0.323745 -0.154050
2 -0.806308 -1.718625  1.120670 -0.868967
          B         C
0 -0.057518  1.472855
1  1.632489 -0.323745
2 -1.718625  1.120670
3  1.875995 -0.489651
4 -1.284196  0.080734
5  2.404678  1.297191
6  0.287114  0.449698
7 -1.135797 -0.829774

 hybrid approach
          A         B         C         D
0  0.012963 -0.057518  1.472855 -0.253070
1  0.807859  1.632489 -0.323745 -0.154050
2 -0.806308 -1.718625  1.120670 -0.868967
3  1.066694  1.875995 -0.489651  1.113774
0    0.012963
1    0.807859
2   -0.806308
3    1.066694
4    0.005293
5    0.308887
6    0.328302
7   -1.780179
Name: A, dtype: float64
/home/tan/.conda/envs/tensorflow_gpu/lib/python3.6/site-packages/ipykernel_launcher.py:15: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  from ipykernel import kernelapp as app
/home/tan/.conda/envs/tensorflow_gpu/lib/python3.6/site-packages/ipykernel_launcher.py:17: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
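Since .ix is deprecated, the same selections are written with .loc and .iloc; a minimal sketch on a default integer index (illustrative data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['A', 'B', 'C'])

# df.ix[:3] on an integer index -> label-based slice, endpoint included
print(df.loc[:3])        # rows with labels 0..3 (4 rows)

# df.ix[:, 'A'] -> label-based column selection
print(df.loc[:, 'A'])

# purely positional slicing excludes the stop position
print(df.iloc[:3])       # first 3 rows only
```

The key difference to remember: .loc slices are inclusive of the end label, .iloc slices are exclusive of the end position.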
In [30]:
## column selection with [] notation
print (df[['A','B']])
## attribute access
print (df.A)
          A         B
0  0.012963 -0.057518
1  0.807859  1.632489
2 -0.806308 -1.718625
3  1.066694  1.875995
4  0.005293 -1.284196
5  0.308887  2.404678
6  0.328302  0.287114
7 -1.780179 -1.135797
0    0.012963
1    0.807859
2   -0.806308
3    1.066694
4    0.005293
5    0.308887
6    0.328302
7   -1.780179
Name: A, dtype: float64
In [31]:
###Python Pandas - Statistical Functions

s = pd.Series([12,3,4,5,4])
print (s.pct_change()) ## fractional change vs. the previous element: (current - previous) / previous

df = pd.DataFrame(np.random.randn(5, 2))
print (df)
print (df.pct_change(axis=0)) # default axis=0 computes column-wise; use axis=1 for row-wise
0         NaN
1   -0.750000
2    0.333333
3    0.250000
4   -0.200000
dtype: float64
          0         1
0  1.751222  0.406665
1  0.549046  0.712827
2  1.031081  0.272820
3  0.041943  0.185428
4 -0.965936 -0.322301
           0         1
0        NaN       NaN
1  -0.686478  0.752862
2   0.877950 -0.617271
3  -0.959322 -0.320327
4 -24.029865 -2.738147
In [32]:
### Covariance
s1 = pd.Series(np.random.randn(10))
s2 = pd.Series(np.random.randn(10))
print (s1.cov(s2))
-0.5410401047280948
In [33]:
frame = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
print (frame['a'].cov(frame['b'])) #covariance between a and b series
print (frame.cov()) #complete covariance
-0.13937824217082254
          a         b         c         d         e
a  1.683393 -0.139378 -0.164416 -0.198086  0.315020
b -0.139378  0.817134 -0.106425  0.021937 -0.288277
c -0.164416 -0.106425  1.121431 -0.065114  0.031851
d -0.198086  0.021937 -0.065114  1.165282 -0.199347
e  0.315020 -0.288277  0.031851 -0.199347  0.603868
In [34]:
## Correlation
print (frame['a'].corr(frame['b']))

print (frame.corr())
-0.11883805760513368
          a         b         c         d         e
a  1.000000 -0.118838 -0.119665 -0.141431  0.312446
b -0.118838  1.000000 -0.111176  0.022481 -0.410385
c -0.119665 -0.111176  1.000000 -0.056960  0.038704
d -0.141431  0.022481 -0.056960  1.000000 -0.237642
e  0.312446 -0.410385  0.038704 -0.237642  1.000000
In [35]:
## Pandas window functions
## Window functions are mainly used for finding trends within the data
df = pd.DataFrame(np.random.randn(10, 4),
index = pd.date_range('1/1/2000', periods=10),
columns = ['A', 'B', 'C', 'D'])
print(df)
print("\n after rolling window with size 3")
print (df.rolling(window=3).mean())

print("\n after expanding min period  3")
print (df.expanding(min_periods=3).mean())

print("\n after assigning weight exponentially")
print (df.ewm(com=0.5).mean())
                   A         B         C         D
2000-01-01 -0.839572 -0.306912 -0.094920  0.113456
2000-01-02  0.956527  1.301957 -0.615316  0.679741
2000-01-03 -0.122285 -0.354126 -0.331749 -0.784345
2000-01-04  0.531347  1.431400 -0.223712 -1.994049
2000-01-05 -1.288552  0.178789  0.949681  1.786910
2000-01-06  0.844034  0.840933  0.047929 -0.182554
2000-01-07  0.489367 -0.251181 -0.559091 -1.390617
2000-01-08  0.183250  0.432800  0.693299 -0.877747
2000-01-09  0.469011  0.799077 -0.486739  0.516182
2000-01-10  2.176712  1.124894  1.413359  0.434223

 after rolling window with size 3
                   A         B         C         D
2000-01-01       NaN       NaN       NaN       NaN
2000-01-02       NaN       NaN       NaN       NaN
2000-01-03 -0.001777  0.213640 -0.347328  0.002951
2000-01-04  0.455196  0.793077 -0.390259 -0.699551
2000-01-05 -0.293163  0.418688  0.131407 -0.330495
2000-01-06  0.028943  0.817041  0.257966 -0.129898
2000-01-07  0.014950  0.256180  0.146173  0.071246
2000-01-08  0.505550  0.340851  0.060712 -0.816973
2000-01-09  0.380542  0.326899 -0.117510 -0.584061
2000-01-10  0.942991  0.785590  0.539973  0.024219

 after expanding min period  3
                   A         B         C         D
2000-01-01       NaN       NaN       NaN       NaN
2000-01-02       NaN       NaN       NaN       NaN
2000-01-03 -0.001777  0.213640 -0.347328  0.002951
2000-01-04  0.131504  0.518080 -0.316424 -0.496299
2000-01-05 -0.152507  0.450222 -0.063203 -0.039658
2000-01-06  0.013583  0.515340 -0.044681 -0.063474
2000-01-07  0.081552  0.405837 -0.118168 -0.253066
2000-01-08  0.094264  0.409208 -0.016735 -0.331151
2000-01-09  0.135903  0.452526 -0.068957 -0.237003
2000-01-10  0.339984  0.519763  0.079274 -0.169880

 after assigning weight exponentially
                   A         B         C         D
2000-01-01 -0.839572 -0.306912 -0.094920  0.113456
2000-01-02  0.507502  0.899740 -0.485217  0.538170
2000-01-03  0.071495  0.031679 -0.378970 -0.377418
2000-01-04  0.381895  0.976491 -0.274171 -1.468644
2000-01-05 -0.736338  0.442492  0.545102  0.710694
2000-01-06  0.318691  0.708484  0.213198  0.114377
2000-01-07  0.432527  0.068415 -0.301897 -0.889411
2000-01-08  0.266317  0.311375  0.361668 -0.881634
2000-01-09  0.401453  0.636526 -0.203965  0.050291
2000-01-10  1.584979  0.962110  0.874269  0.306250
In [36]:
### Missing values NaN

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print (df)
## check for missing values
## two methods: notnull(), isnull()
print (df['one'].isnull())

print (df['one'].notnull())

print("\n sum: ") 
print (df['one'].sum()) # NaN values are skipped (treated as zero)
# If the data are all NA, the result will be NA

### Cleaning / Filling Missing Data
print (df.fillna(0)) # fill nan with zeros

## pad/ffill --> forward fill (propagate the previous value)
## bfill/backfill --> backward fill (propagate the next value)

print (df.fillna(method='bfill'))
print (df.dropna()) ## drop rows containing missing values
print (df.dropna(axis=1)) ## drop columns containing missing values (here every column has a NaN)
        one       two     three
a  0.587194  0.194928 -0.973519
b       NaN       NaN       NaN
c  0.325436  0.124324  0.314423
d       NaN       NaN       NaN
e  0.380413  0.952595  3.023580
f -0.721115  0.329907 -0.927609
g       NaN       NaN       NaN
h -0.185429 -0.562280  0.512828
a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool
a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: one, dtype: bool

 sum: 
0.3864990721549342
        one       two     three
a  0.587194  0.194928 -0.973519
b  0.000000  0.000000  0.000000
c  0.325436  0.124324  0.314423
d  0.000000  0.000000  0.000000
e  0.380413  0.952595  3.023580
f -0.721115  0.329907 -0.927609
g  0.000000  0.000000  0.000000
h -0.185429 -0.562280  0.512828
        one       two     three
a  0.587194  0.194928 -0.973519
b  0.325436  0.124324  0.314423
c  0.325436  0.124324  0.314423
d  0.380413  0.952595  3.023580
e  0.380413  0.952595  3.023580
f -0.721115  0.329907 -0.927609
g -0.185429 -0.562280  0.512828
h -0.185429 -0.562280  0.512828
        one       two     three
a  0.587194  0.194928 -0.973519
c  0.325436  0.124324  0.314423
e  0.380413  0.952595  3.023580
f -0.721115  0.329907 -0.927609
h -0.185429 -0.562280  0.512828
Empty DataFrame
Columns: []
Index: [a, b, c, d, e, f, g, h]
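Besides constant fills, fillna also accepts a Series, so NaNs can be replaced with a per-column statistic such as the mean (a common, if simplistic, imputation; the data below is illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'one': [1.0, np.nan, 3.0],
                   'two': [np.nan, 2.0, 4.0]})

# Each column's NaNs are filled with that column's own mean
filled = df.fillna(df.mean())
print(filled)
```

Column 'one' has mean 2.0 and column 'two' has mean 3.0, so those are the values filled in.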
In [37]:
## replacing missing or generic values
df = pd.DataFrame({'one':[10,20,30,40,50,2000],
'two':[1000,0,30,40,50,60]})
print (df.replace({1000:10,2000:60}))
   one  two
0   10   10
1   20    0
2   30   30
3   40   40
4   50   50
5   60   60
In [38]:
### Split Data into Groups with groupby()
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)

print(df)
print ('\n result')
print (df.groupby('Team'))

print (df.groupby('Team').groups)
      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
2   Devils     2  2014     863
3   Devils     3  2015     673
4    Kings     3  2014     741
5    kings     4  2015     812
6    Kings     1  2016     756
7    Kings     1  2017     788
8   Riders     2  2016     694
9   Royals     4  2014     701
10  Royals     1  2015     804
11  Riders     2  2017     690

 result
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f7d61605898>
{'Devils': Int64Index([2, 3], dtype='int64'), 'Kings': Int64Index([4, 6, 7], dtype='int64'), 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'), 'Royals': Int64Index([9, 10], dtype='int64'), 'kings': Int64Index([5], dtype='int64')}
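The GroupBy object above becomes useful once you aggregate over the groups or pull a single group out; a short sketch on a subset of the same ipl_data:

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings'],
                   'Points': [876, 789, 863, 673, 741]})

grouped = df.groupby('Team')
print(grouped['Points'].sum())       # total points per team
print(grouped.get_group('Riders'))   # just the rows of one group
```

Other aggregations (mean, size, agg with multiple functions) work the same way on the grouped object.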
In [39]:
##Python Pandas - Merging/Joining

left = pd.DataFrame({
         'id':[1,2,3,4,5],
         'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
         'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame(
         {'id':[1,2,3,4,5],
         'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
         'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print (left)
print (right)

print (pd.merge(left,right,on='id'))
   id    Name subject_id
0   1    Alex       sub1
1   2     Amy       sub2
2   3   Allen       sub4
3   4   Alice       sub6
4   5  Ayoung       sub5
   id   Name subject_id
0   1  Billy       sub2
1   2  Brian       sub4
2   3   Bran       sub3
3   4  Bryce       sub6
4   5  Betty       sub5
   id  Name_x subject_id_x Name_y subject_id_y
0   1    Alex         sub1  Billy         sub2
1   2     Amy         sub2  Brian         sub4
2   3   Allen         sub4   Bran         sub3
3   4   Alice         sub6  Bryce         sub6
4   5  Ayoung         sub5  Betty         sub5
In [40]:
print (pd.merge(left,right,on=['id','subject_id'])) ## only rows where both id and subject_id match are merged
   id  Name_x subject_id Name_y
0   4   Alice       sub6  Bryce
1   5  Ayoung       sub5  Betty
In [41]:
# Merge methods: how='left', 'right', 'outer', 'inner'
print (pd.merge(left, right, on=['id','subject_id'], how='left'))
   id  Name_x subject_id Name_y
0   1    Alex       sub1    NaN
1   2     Amy       sub2    NaN
2   3   Allen       sub4    NaN
3   4   Alice       sub6  Bryce
4   5  Ayoung       sub5  Betty
Since rows 0, 1 and 2 of left have no match in right, NaNs are filled in for the right-hand columns.
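The remaining how modes behave analogously: inner keeps only the matching rows, while outer keeps unmatched rows from both sides. A sketch with smaller illustrative frames:

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'subject_id': ['sub1', 'sub2', 'sub4']})
right = pd.DataFrame({'id': [2, 3, 4], 'subject_id': ['sub2', 'sub4', 'sub6']})

# inner: intersection of keys; outer: union of keys (unmatched cells become NaN)
inner = pd.merge(left, right, on=['id', 'subject_id'], how='inner')
outer = pd.merge(left, right, on=['id', 'subject_id'], how='outer')
print(len(inner))   # 2 matching rows
print(len(outer))   # 4 rows: matches plus the unmatched row from each side
```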
In [42]:
## Concat two dataframes

one = pd.DataFrame({
         'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
         'subject_id':['sub1','sub2','sub4','sub6','sub5'],
         'Marks_scored':[98,90,87,69,78]},
         index=[1,2,3,4,5])
two = pd.DataFrame({
         'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
         'subject_id':['sub2','sub4','sub3','sub6','sub5'],
         'Marks_scored':[89,80,79,97,88]},
         index=[1,2,3,4,5])
print (pd.concat([one,two]))
print("\n concat with keys")
print (pd.concat([one,two],keys=['x','y']))
     Name subject_id  Marks_scored
1    Alex       sub1            98
2     Amy       sub2            90
3   Allen       sub4            87
4   Alice       sub6            69
5  Ayoung       sub5            78
1   Billy       sub2            89
2   Brian       sub4            80
3    Bran       sub3            79
4   Bryce       sub6            97
5   Betty       sub5            88

 concat with keys
       Name subject_id  Marks_scored
x 1    Alex       sub1            98
  2     Amy       sub2            90
  3   Allen       sub4            87
  4   Alice       sub6            69
  5  Ayoung       sub5            78
y 1   Billy       sub2            89
  2   Brian       sub4            80
  3    Bran       sub3            79
  4   Bryce       sub6            97
  5   Betty       sub5            88
In [43]:
# If the resultant object should follow its own continuous indexing, set ignore_index to True.
print (pd.concat([one,two],keys=['x','y'],ignore_index=True))

print (pd.concat([one,two],axis=1)) ## axis=1 concatenates along columns, aligning rows on the index
print (one.append(two)) ## append concatenates along axis=0 (deprecated in newer pandas; use pd.concat)
     Name subject_id  Marks_scored
0    Alex       sub1            98
1     Amy       sub2            90
2   Allen       sub4            87
3   Alice       sub6            69
4  Ayoung       sub5            78
5   Billy       sub2            89
6   Brian       sub4            80
7    Bran       sub3            79
8   Bryce       sub6            97
9   Betty       sub5            88
     Name subject_id  Marks_scored   Name subject_id  Marks_scored
1    Alex       sub1            98  Billy       sub2            89
2     Amy       sub2            90  Brian       sub4            80
3   Allen       sub4            87   Bran       sub3            79
4   Alice       sub6            69  Bryce       sub6            97
5  Ayoung       sub5            78  Betty       sub5            88
     Name subject_id  Marks_scored
1    Alex       sub1            98
2     Amy       sub2            90
3   Allen       sub4            87
4   Alice       sub6            69
5  Ayoung       sub5            78
1   Billy       sub2            89
2   Brian       sub4            80
3    Bran       sub3            79
4   Bryce       sub6            97
5   Betty       sub5            88
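When concatenating along columns, `concat` also accepts a `join` argument; `join='inner'` keeps only the index labels shared by all frames instead of the default outer union. A sketch, assuming frames with partially overlapping indexes:

```python
import pandas as pd

one = pd.DataFrame({'Marks': [98, 90, 87]}, index=[1, 2, 3])
two = pd.DataFrame({'Grade': ['A', 'B']}, index=[2, 3])

# axis=1 aligns on the index; join='inner' keeps only the shared labels 2 and 3
result = pd.concat([one, two], axis=1, join='inner')
print(result)
```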
In [44]:
print (pd.datetime.now()) ## current timestamp (pd.datetime is removed in newer pandas)
2019-11-27 20:20:20.143156
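`pd.datetime` was removed in pandas 1.0; the current API uses `pd.Timestamp` and `pd.date_range`. A sketch of the replacements:

```python
import pandas as pd

# replacement for pd.datetime.now()
now = pd.Timestamp.now()
print(now)

# a fixed-frequency DatetimeIndex: 5 daily timestamps starting 2019-11-27
idx = pd.date_range('2019-11-27', periods=5, freq='D')
print(idx[-1])  # last entry is 2019-12-01
```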
In [45]:
### Pandas categoricals
s = pd.Series(["a","b","c","a"], dtype="category")
print (s)
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]
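A categorical Series stores each value as an integer code into its list of categories, which is what makes the dtype memory-efficient for repeated labels. The `.cat` accessor exposes both:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# the distinct categories, and the integer code assigned to each element
print(s.cat.categories.tolist())  # ['a', 'b', 'c']
print(s.cat.codes.tolist())       # [0, 1, 2, 0]
```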
In [46]:
### Visualization with pandas
df = pd.DataFrame(np.random.randn(10,4),index=pd.date_range('1/1/2000',
   periods=10), columns=list('ABCD'))
%matplotlib inline
df.plot()
df.plot.bar()
df.plot.bar(stacked=True)
df.plot.hist(bins=20)
df.plot.box()
Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7d34c53828>
In [47]:
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
df.plot.area()

df.plot.scatter(x='a', y='b')

#df.diff().hist(bins=20) ## diff is a method, so it needs parentheses: df.diff().hist(...)
df = pd.DataFrame(3 * np.random.rand(4), index=['a', 'b', 'c', 'd'], columns=['x'])
df.plot.pie(subplots=True)
Out[47]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x7f7d349c06d8>],
      dtype=object)

Pandas I/O operations

In [48]:
df=pd.read_csv("temp.csv")
#df=pd.read_csv("temp.csv",index_col=['S.No'])
print (df.columns)
Index(['GPS_Latitude', 'GPS_Longitude', 'RawLatitude', 'RawLongitude',
       'CorrectedLatitude', 'CorrectedLongitude', 'DistanceCorrected2GPS',
       'DiffCorrectLatRawLat', 'DiffCorrectLongRawLong', 'DistanceRaw2GPS',
       'DiffRawLatGpsLat', 'DiffRawLongGpsLong', 'DistanceRaw2Corrected',
       'ImageType'],
      dtype='object')
In [49]:
df = pd.read_csv("temp.csv", dtype={'DistanceRaw2GPS': np.float64}) ## force a column dtype while parsing
print (df.dtypes)
GPS_Latitude              float64
GPS_Longitude             float64
RawLatitude               float64
RawLongitude              float64
CorrectedLatitude         float64
CorrectedLongitude        float64
DistanceCorrected2GPS     float64
DiffCorrectLatRawLat      float64
DiffCorrectLongRawLong    float64
DistanceRaw2GPS           float64
DiffRawLatGpsLat          float64
DiffRawLongGpsLong        float64
DistanceRaw2Corrected     float64
ImageType                  object
dtype: object
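Besides `dtype`, `read_csv` accepts a `converters` argument that applies a function to each value of a column while parsing. Since `temp.csv` is not included with the tutorial, this sketch uses an inline CSV with two hypothetical columns of the same names:

```python
import io
import pandas as pd

# inline stand-in for temp.csv (the real file is not available here)
csv_data = io.StringIO("DistanceRaw2GPS,ImageType\n1.5,RGB\n2.5,IR\n")

# the converter receives each raw string value; here it parses and scales it
df = pd.read_csv(csv_data,
                 converters={'DistanceRaw2GPS': lambda v: float(v) * 1000})
print(df['DistanceRaw2GPS'].tolist())  # [1500.0, 2500.0]
```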
In [50]:
df=pd.read_csv("temp.csv", names=['GPS_Latitude', 'DistanceRaw2GPS']) # supply custom column names
df=pd.read_csv("temp.csv", skiprows=2) # skip the first 2 rows of the file
#df
In [51]:
## sparsify data
ts = pd.Series(np.random.randn(10))
ts[2:-2] = np.nan  # set everything from index 2 up to the last two values to NaN
sts = ts.to_sparse()  # deprecated in newer pandas; see the SparseDtype example below
print (sts)
print (sts.to_dense())

### Python Pandas - Caveats & Gotchas are not covered in this tutorial.
0    0.158292
1   -0.325862
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8    1.547898
9   -0.300161
dtype: Sparse[float64, nan]
BlockIndex
Block locations: array([0, 8], dtype=int32)
Block lengths: array([2, 2], dtype=int32)
0    0.158292
1   -0.325862
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8    1.547898
9   -0.300161
dtype: float64
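`Series.to_sparse()` and `to_dense()` were removed in pandas 1.0; the current way is to cast to a `SparseDtype` and use the `.sparse` accessor. A sketch with a small fixed Series:

```python
import numpy as np
import pandas as pd

ts = pd.Series([1.0, np.nan, np.nan, 2.0])

# cast to a sparse dtype with NaN as the fill value (replaces to_sparse())
sts = ts.astype(pd.SparseDtype("float64", np.nan))
print(sts.sparse.density)  # fraction of stored (non-fill) values: 0.5

# convert back to a regular dense Series (replaces to_dense())
dense = sts.sparse.to_dense()
print(dense.equals(ts))  # True
```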
