Link to Colab Notebook : https://colab.research.google.com/drive/1OZawdTfZDJeWsRS5RhoXFG4T95MEd7zt?usp=sharing
The Data Set
In this tutorial, we will use a Excel
file we created from Kaggle data. You can download the file by using above Link .
!pip install xlrd==1.2.0
We will use this
data set to find the number of calories for the Food Display Table visualize
movies with highest Fats and Calories and calculate statistical information
about the Food Table. We will be analyzing and exploring this data using Python
and pandas, thus demonstrating pandas capabilities for working with Excel data
in Python.
Read
data from the Excel file
We need to
first import the data from the Excel file into pandas. To do that, we start by
importing the pandas module.
import pandas as pd
We then use the
pandas’ read_excel method to read in data from the Excel file. The easiest way
to call this method is to pass the file name. If no sheet name is specified
then it will read the first sheet in the index (as shown below).
excel_file = 'Food_Display_Table.xlsx'
Food_Display_Table = pd.read_excel(excel_file)
Here,
the read_excel
method read the data from the Excel file into a
pandas Data Frame object. Pandas defaults to storing data in DataFrames. We then
stored this Data Frame into a variable called Food_Display_Table.
Pandas has a built-in DataFrame.head()
method that we can use to easily display the
first few rows of our Data Frame. If no argument is passed, it will display
first five rows. If a number is passed, it will display the equal number of
rows from the top.
Food_Display_Table. Head()
Food_Display_Table_subset_columns = pd.read_excel(excel_file, usecols=[1,2,3,4,5,6])
Food_Display_Table_subset_columns.head()
Food_Display_Data_sheet1 = pd.read_excel(excel_file, sheet_name=1, index_col=0)
Food_Display_Data_sheet1.head()
We can check if
this concatenation by checking the number of rows in the combined DataFrame by
calling the method shape
on it that will give us the number of rows
and columns.
Food_Display_Table.shape
(2014, 26)
We can also use
the Excel File class to work with multiple sheets from the same Excel file. We
first wrap the Excel file using ExcelFile
and then pass it to read_excel
method.
If you have more
than one Sheet in your excel file then you can concat the sheet by using following
codes.
xlsx = pd.ExcelFile(excel_file)
Food_Display_Table_sheets = []
for sheet in xlsx.sheet_names:
Food_Display_Table_sheets.append(xlsx.parse(sheet))
Food_Display_Table = pd.concat(Food_Display_Table_sheets
sorted_by_gross = Food_Display_Table.sort_values(['Calories'], ascending=False)
Since we have the
data sorted by values in a column, we can do a few interesting things with it.
For example, we can display the Food Table by Calories
In Excel, you’re
able to sort a sheet based on the values in one or more columns. In pandas, you
can do the same thing with the sort_values
method. For example, let’s sort our Food
Display Table DataFrame based on the Gross
sorted_by_gross = Food_Display_Table.sort_values(['Calories'], ascending=False)
sorted_by_gross["Calories"].head(10)
723 1667.40 1283 1283.52 60 1069.20 1501 1024.92 1458 976.10 1103 818.37 59 801.90 1062 798.64 1285 792.54 1051 785.40 Name = Calories dtype: float 64
We can also create a plot for the top 10 Food Items by calories. Pandas makes it easy to visualize your data with plots and charts through matplotlib, a popular data visualization library.
And the Result will look like this.
import matplotlib.pyplot as plt
We will draw a bar
plot where each bar will represent one of the top 10 Food. We can do this by
calling the plot method and setting the argument kind
to barh
. This tells matplotlib
to draw a horizontal bar plot.
sorted_by_gross['Calories'].head(10).plot(kind="barh")
plt.show()
Reading a subset of columns
Although read_excel defaults to reading
and importing all columns, you can choose to import only certain columns. By
passing parse_cols=[1,2,3,4,5,6] we are telling the read_excel
method to read only the first columns till
index six or first seven columns (the first column being indexed zero).
Food_Display_Table_subset_columns = pd.read_excel(excel_file, usecols=[1,2,3,4,5,6])
Food_Display_Table_subset_columns.head()
We need to first
identify the column or columns that will serve as the index, and the column(s)
on which the summarizing formula will be applied. Let’s start small, by
choosing Food_Code as the index column and Calories as the summarization column
and creating a separate DataFrame from this data.
Food_Display_Table["Solid_Fats"] = Food_Display_Table["Calories"] - Food_Display_Table["Saturated_Fats"]
This excel
sheet consist a total of 5 Rows and 26 Columns.
Food_Display_Table.head()
Display_Name |
Portion_Default |
Portion_Amount |
Portion_Display_Name |
Factor |
Increment |
|
0 |
Sour cream dip |
1 |
0.25 |
cup |
0.25 |
0.25 |
1 |
Ice cream, regular |
1 |
1.00 |
cup |
1.00 |
0.25 |
2 |
Ice cream, rich |
1 |
1.00 |
cup |
1.00 |
0.25 |
3 |
Soft serve ice cream |
1 |
1.00 |
cup |
1.00 |
0.25 |
4 |
Ice cream bar, chocolate covered |
1 |
1.00 |
bar (bar = 3 fl oz) |
1.00 |
0.50 |
5 rows × 26 columns
Let’s use the sort_values
method to sort the data by the new column we
created and visualize the top 10 Food by Added_Sugars.
sorted_Food_Display_Table = Food_Display_Table[['Added_Sugars']].sort_values(['Added_Sugars'], ascending=[False]) sorted_Food_Display_Table. Head(10)['Added_Sugars'].plot.barh() plt.show()
We need to first
identify the column or columns that will serve as the index, and the column(s)
on which the summarizing formula will be applied. Let’s start small, by choosing
Food_Column as the index column and Calories as the summarization column and
creating a separate DataFrame from this data.
Food_Display_Table_subset=Food_Display_Table[['Food_Code', 'Calories']] Food_Display_Table_subset.head()
calories_by_Food_Code = Food_Display_Table_subset.pivot_table(index=['Food_Code']) calories_by_Food_Code.head()
Calories |
|
Food_Code |
|
7258 |
28.560 |
7259 |
23.520 |
16126 |
50.400 |
16129 |
101.625 |
23559 |
109.060 |
We can use this
pivot table to create some data visualizations. We can call the plot
method on the DataFrame to create a line plot
and call the show
method to display the plot in the notebook.
calories_by_Food_Code.plot() plt.show()
Food_Display_Table_subset = Food_Display_Table[['Food_Code','Calories']] Food_Display_Table_subset.head()
Food_Code |
Calories |
|
0 |
12350000 |
133.65 |
1 |
13110100 |
267.33 |
2 |
13110120 |
368.52 |
3 |
13110200 |
347.73 |
4 |
13120100 |
169.12 |
We saw how to
pivot with a single column as the index. Things will get more interesting if we
can use multiple columns. Let’s create another DataFrame subset but this time
we will choose the columns, Grains, Vegetables and Calories.
Food_Display_Table_subset=Food_Display_Table[['Grains', 'Vegetables', 'Calories']] Food_Display_Table_subset.head()
Grains |
Vegetables |
Calories |
|
|
|
|
|
0 |
0.04799 |
0.0407 |
133.65 |
1 |
0.00000 |
0.0000 |
267.33 |
2 |
0.00000 |
0.0000 |
368.52 |
3 |
0.00000 |
0.0000 |
347.73 |
4 |
0.00000 |
0.0000 |
169.12 |
Exporting the results to Excel
If you’re going to be working with
colleagues who use Excel, saving Excel files out of pandas is important. You
can export or write a pandas DataFrame to an Excel file using pandas to excel
method. Pandas uses the xlwt
Python
module internally for writing to Excel files. The to excel
method is called on the DataFrame we want to
export. We also need to pass a filename to which this DataFrame will be
written.
Food_Display_Table.to_excel('output.xlsx')
!pip install xlsxwriter
We
need to be able to make our output files look nice before we can send it out to
our co-workers. We can use pandas ExcelWriter
class
along with the XlsxWriter
Python module to apply the formatting.
We can do use these advanced output
options by creating a ExcelWriter
object
and use this object to write to the Excel file.
writer = pd.ExcelWriter('output2.xlsx', engine='xlsxwriter')Food_Display_Table.to_excel(writer, index=False, sheet_name='report') workbook = writer.bookworksheet = writer. Sheets['report']
save
on the writer object.
No comments:
Post a Comment