Data processing

Most of the time in data analysis and modeling is spent on data preparation and processing, i.e., loading, cleaning, and rearranging the data. Among the Python libraries, Pandas gives us a high-performance, flexible, and high-level environment for processing data. It offers various functionalities for doing this effectively.

Hierarchical indexing

To enhance data processing, we need an index that lets us sort and select data based on labels. This is where hierarchical indexing comes into the picture: it is an essential feature of pandas that lets us use multiple index levels on a single axis.

Creating multiple index

In hierarchical indexing, we create multiple index levels for the data. This example creates a Series with a two-level index.

Example:

  import pandas as pd

  # Series with a two-level index: outer level (x, y), inner level (obj1..obj4)
  info = pd.Series([11, 14, 17, 24, 19, 32, 34, 27],
                   index=[['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],
                          ['obj1', 'obj2', 'obj3', 'obj4', 'obj1', 'obj2', 'obj3', 'obj4']])
  info

Output:

x  obj1    11
   obj2    14
   obj3    17
   obj4    24
y  obj1    19
   obj2    32
   obj3    34
   obj4    27
dtype: int64

Here we have used two index levels, i.e. (x, y) and (obj1, …, obj4). The index itself can be viewed through the 'index' attribute, i.e. info.index.

Output:

MultiIndex(levels=[['x', 'y'], ['obj1', 'obj2', 'obj3', 'obj4']],  labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 3, 0, 1, 2, 3]])
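The levels of such an index can also be inspected one at a time. The following is a minimal sketch using the standard pandas attributes `nlevels` and `get_level_values`; the variable names `n_levels`, `outer`, and `inner` are chosen here for illustration. The printed form of a MultiIndex differs between pandas versions (newer releases report `codes` rather than `labels`), so the sketch looks at the values instead of the printed representation.

```python
import pandas as pd

# Same two-level Series as in the example above
info = pd.Series(
    [11, 14, 17, 24, 19, 32, 34, 27],
    index=[['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],
           ['obj1', 'obj2', 'obj3', 'obj4', 'obj1', 'obj2', 'obj3', 'obj4']],
)

# Number of levels in the hierarchical index
n_levels = info.index.nlevels

# Distinct labels of the outer (0) and inner (1) level, in order of appearance
outer = list(info.index.get_level_values(0).unique())
inner = list(info.index.get_level_values(1).unique())

print(n_levels, outer, inner)
```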

Partial indexing

Partial indexing is a way to select a particular subset of the data from a hierarchical index by giving only some of its levels.

The code below extracts 'y' from the data:

  import pandas as pd

  info = pd.Series([11, 14, 17, 24, 19, 32, 34, 27],
                   index=[['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],
                          ['obj1', 'obj2', 'obj3', 'obj4', 'obj1', 'obj2', 'obj3', 'obj4']])

  # select every row whose outer label is 'y'
  info['y']

Output:

obj1    19
obj2    32
obj3    34
obj4    27
dtype: int64

Further, the data can also be extracted based on the inner level, i.e. 'obj', for example with info.loc[:, 'obj2']. The result below shows the two values available for 'obj2' in the Series.

Output:

x    14
y    32
dtype: int64
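Both kinds of partial selection can be put side by side. This is a small sketch, not the only way to do it: it uses plain label lookup for the outer level, `.loc` with a slice for the inner level, and `xs` as an alternative for the inner level. The variable names `outer_y`, `inner_obj2`, and `inner_xs` are illustrative.

```python
import pandas as pd

info = pd.Series(
    [11, 14, 17, 24, 19, 32, 34, 27],
    index=[['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],
           ['obj1', 'obj2', 'obj3', 'obj4', 'obj1', 'obj2', 'obj3', 'obj4']],
)

# Outer level: all rows under 'y', indexed by obj1..obj4
outer_y = info['y']

# Inner level: the 'obj2' row of every outer label, indexed by x and y
inner_obj2 = info.loc[:, 'obj2']

# The same inner-level selection expressed as a cross-section
inner_xs = info.xs('obj2', level=1)

print(outer_y)
print(inner_obj2)
```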

Unstack the data

Unstack means to change a row header into a column header. One level of the row index becomes the column index, so the Series becomes a DataFrame. Below is an example of unstacking the data.

Example:

  import pandas as pd

  info = pd.Series([11, 14, 17, 24, 19, 32, 34, 27],
                   index=[['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],
                          ['obj1', 'obj2', 'obj3', 'obj4', 'obj1', 'obj2', 'obj3', 'obj4']])

  # unstack on the first level, i.e. x, y
  # note that the outer row labels of info are x and y
  info.unstack(0)

Output:

       x   y
obj1  11  19
obj2  14  32
obj3  17  34
obj4  24  27

  # unstack based on the second level, i.e. 'obj'
  info.unstack(1)

Output:

   obj1  obj2  obj3  obj4
x    11    14    17    24
y    19    32    34    27

The 'stack()' operation converts a column index into a row index. Continuing the example above, where 'obj' became the column index, we can convert it back into a row index using 'stack'.

  import pandas as pd

  info = pd.Series([11, 14, 17, 24, 19, 32, 34, 27],
                   index=[['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],
                          ['obj1', 'obj2', 'obj3', 'obj4', 'obj1', 'obj2', 'obj3', 'obj4']])

  # unstack on the second level, so that 'obj' becomes the column index
  d = info.unstack(1)

  # stack moves the 'obj' column labels back into the row index
  d.stack()

Output:

x  obj1    11
   obj2    14
   obj3    17
   obj4    24
y  obj1    19
   obj2    32
   obj3    34
   obj4    27
dtype: int64
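The relationship between the two operations can be checked with a round trip. The sketch below unstacks the inner level and stacks it back, then compares the result with the original Series; `wide` and `restored` are illustrative names. Some pandas versions may emit a warning about a newer `stack` implementation, which does not affect the values in this case.

```python
import pandas as pd

info = pd.Series(
    [11, 14, 17, 24, 19, 32, 34, 27],
    index=[['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],
           ['obj1', 'obj2', 'obj3', 'obj4', 'obj1', 'obj2', 'obj3', 'obj4']],
)

# Series -> DataFrame: the inner level (obj1..obj4) becomes the columns
wide = info.unstack(1)

# DataFrame -> Series: the columns go back into the row index
restored = wide.stack()

print(wide.shape)
print(restored.tolist() == info.tolist())
```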

Column indexing

Remember that column indexing requires two-dimensional data, so it is possible only for a DataFrame (not for a Series). Let's create a new DataFrame to demonstrate columns with a multiple index:

  import numpy as np
  import pandas as pd

  # 4x3 DataFrame with two-level row labels and two-level column labels
  info = pd.DataFrame(np.arange(12).reshape(4, 3),
                      index=[['a', 'a', 'b', 'b'], ['one', 'two', 'three', 'four']],
                      columns=[['num1', 'num2', 'num3'], ['x', 'y', 'x']])
  info

Output:

        num1 num2 num3
           x    y    x
a one      0    1    2
  two      3    4    5
b three    6    7    8
  four     9   10   11

The row index of this DataFrame can be viewed with info.index.

Output:

MultiIndex(levels=[['a', 'b'], ['four', 'one', 'three', 'two']], labels=[[0, 0, 1, 1], [1, 3, 2, 0]])

The column index can be viewed with info.columns.

Output:

MultiIndex(levels=[['num1', 'num2', 'num3'], ['x', 'y']], labels=[[0, 1, 2], [0, 1, 0]])
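With a multiple index on the columns, partial indexing works on columns as well. The following sketch selects by the outer column level with plain bracket lookup and by the inner column level with `xs`; the names `num1` and `x_cols` are illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd

info = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=[['a', 'a', 'b', 'b'], ['one', 'two', 'three', 'four']],
    columns=[['num1', 'num2', 'num3'], ['x', 'y', 'x']],
)

# Outer column level: everything under 'num1' (a DataFrame with the single column 'x')
num1 = info['num1']

# Inner column level: every column whose second-level label is 'x'
x_cols = info.xs('x', axis=1, level=1)

print(num1)
print(x_cols)
```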

Swap and sort level

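We can easily swap index levels by using the 'swaplevel' method, which takes two levels as input, given either as level numbers or as level names. The example below first names the levels so that they can be referred to as 'key1' and 'key2'.

  import numpy as np
  import pandas as pd

  info = pd.DataFrame(np.arange(12).reshape(4, 3),
                      index=[['a', 'a', 'b', 'b'], ['one', 'two', 'three', 'four']],
                      columns=[['num1', 'num2', 'num3'], ['x', 'y', 'x']])

  # name the row levels (key1, key2) and the column levels (n, p)
  info.index.names = ['key1', 'key2']
  info.columns.names = ['n', 'p']

  info.swaplevel('key1', 'key2')

Output:

n          num1 num2 num3
p             x    y    x
key2  key1
one   a       0    1    2
two   a       3    4    5
three b       6    7    8
four  b       9   10   11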

We can sort by the index labels using the 'sort_index' method. Here the data is sorted by 'key2', i.e. the rows are arranged alphabetically by the key2 labels.

  # uses the level names key1 and key2 set on info.index
  info.sort_index(level='key2')

Output:

n          num1 num2 num3
p             x    y    x
key1 key2
b    four     9   10   11
a    one      0    1    2
b    three    6    7    8
a    two      3    4    5
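The two operations can be verified on the same DataFrame. This is a minimal sketch that assumes the row levels are named 'key1' and 'key2', as in the output above; `swapped` and `sorted_by_key2` are illustrative names. Both methods return a new object and leave `info` unchanged.

```python
import numpy as np
import pandas as pd

info = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=[['a', 'a', 'b', 'b'], ['one', 'two', 'three', 'four']],
    columns=[['num1', 'num2', 'num3'], ['x', 'y', 'x']],
)
# Assumed level names, matching the printed output in the text
info.index.names = ['key1', 'key2']

# Exchange the two row levels; row order and data stay the same
swapped = info.swaplevel('key1', 'key2')

# Sort the rows alphabetically by the key2 labels
sorted_by_key2 = info.sort_index(level='key2')

print(list(swapped.index.names))
print(list(sorted_by_key2.index))
```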
