DataFrames 101

(aka why do we care?)

The daily workflow of a data scientist involves data cleaning, data analysis, data modeling, and organizing the results in a presentable manner. Many languages did not offer everything that data scientists were looking for in these areas until pandas came on the scene. Pandas is one of the most common and well-known Python libraries for data analysis around the world. Pandas allows you to carry out the entire data science workflow without having to switch to a more domain-specific language (like R).

One of the most powerful tools offered in pandas is the DataFrame. A DataFrame is a tabular data structure with rows and columns. Labelled rows and columns improve the clarity and intuition of a data scientist’s daily tasks. If you understand the structure of the data you are working with, it makes it easier to clean it, slice it, dice it, and analyze it.

The columns of a DataFrame are a data structure called a Series. If you extract a single column from a DataFrame, you will receive a Series. A Series shares many of the same methods and attributes as a DataFrame.

DataFrames can be populated in many different ways, depending on the task at hand. We can create DataFrames from a dictionary of lists with keys as column labels.
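As a quick sketch with made-up values, the dictionary's keys become the column labels and its lists become the column values:

```python
import pandas as pd

# Hypothetical example data: each key is a column label,
# each list holds that column's values.
data = {
    "country": ["Brazil", "India", "Kenya"],
    "population_millions": [214, 1408, 54],
}
df = pd.DataFrame(data)
print(df)
```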

We can also create DataFrames from a list of lists that contain our data, plus a separate list that contains the column labels. The built-in zip() function pairs up the elements at matching indexes of multiple containers so they can be used as a single entity. We wrap this in a list() call because zip() returns an iterator rather than a list.

Once we have a zipped list, we can use dict() to convert this to a dictionary type. Finally, we can convert the dictionary to a DataFrame. We get the same DataFrame as the one we created above.
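A short sketch of this zip-and-dict construction, using the same kind of hypothetical country data:

```python
import pandas as pd

labels = ["country", "population_millions"]
columns = [["Brazil", "India", "Kenya"], [214, 1408, 54]]

# zip() pairs each label with its column of values; list() materializes it.
zipped = list(zip(labels, columns))
# dict() turns the (label, values) pairs into a dictionary,
# which pandas can consume directly.
data = dict(zipped)
df = pd.DataFrame(data)
print(df)
```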

It is extremely common as a data scientist to receive data sources in a variety of file types: comma-separated files, tab-delimited files, Excel files, etc. The pandas library gives you the ability to read that data into a DataFrame seamlessly.
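As an illustration, a small CSV held in an in-memory buffer stands in for a file on disk; the comments note the analogous readers for other formats:

```python
import io
import pandas as pd

# A tiny made-up CSV; in practice you would pass a file path instead.
csv_text = "month,visitors\nJan,120\nFeb,98\n"
df = pd.read_csv(io.StringIO(csv_text))

# For tab-delimited files: pd.read_csv(path, sep="\t")
# For Excel files: pd.read_excel(path)  (requires an engine such as openpyxl)
print(df)
```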

Once the data that needs analyzing is loaded into a DataFrame, it does not mean that it is time for modeling or analysis. In fact, unless you know exactly what the data is like when you receive it, it probably isn't even time to begin the data cleaning process yet. The best next course of action is to inspect the DataFrame.

Pandas provides numerous attributes and methods that can help you learn more about the structure and contents of a given DataFrame. Going through this inspection process will help you to have a more organized plan when approaching the later steps of the analysis or modeling process.

The .head() and .tail() methods will give you a snapshot of the beginning or the end of a DataFrame, respectively. There is an option to pass in a parameter for the number of rows to return.
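For example, with a small hypothetical DataFrame of ten rows:

```python
import pandas as pd

df = pd.DataFrame({"value": range(10)})
print(df.head())    # first 5 rows by default
print(df.tail(3))   # pass a number to control how many rows are returned
```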

The .info() method and .shape attribute will help to understand the structure and dimensions of the DataFrame you are working with. Note how .info() displays the type associated with each column. Also, it shows the counts of non-null values in each column. If this count is less than the number of entries, then we know that column contains missing values.
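A minimal sketch, using made-up data with one deliberately missing score so the non-null counts differ:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cal"],
    "score": [9.5, np.nan, 7.0],   # one missing value
})
df.info()         # prints dtypes and non-null counts per column
print(df.shape)   # (rows, columns)
```

Here .info() reports only 2 non-null values in the score column against 3 entries, which flags the missing value.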

If you want to know the summary statistics for each column in the DataFrame, you can take advantage of the .describe() method. This method will compile some handy information (mean, median, max, etc.) about the DataFrame into a nice, organized output for you.

Note that by default, .describe() will include only numeric columns. If you would also like to see statistics for other types of columns, you may specify so with the include parameter.
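With a small made-up DataFrame mixing a text column and a numeric one, the difference looks like this:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Oslo"],
    "temp_c": [4.0, 19.5, 6.2],
})
print(df.describe())                 # numeric columns only
print(df.describe(include="all"))    # numeric and non-numeric columns
```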

Often, a data scientist will want to work with subsets of a DataFrame object. There are numerous ways to accomplish this. If you want to select a certain column from a DataFrame, you can use square brackets [ ] with the name of the column, or a period . with the name of the column. If you want to select multiple columns at once, you may put a list of column names into the square brackets.
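A brief sketch of the three selection styles, on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Ben"],
    "age": [34, 29],
})
ages = df["age"]              # bracket notation -> Series
also_ages = df.age            # dot notation -> the same Series
subset = df[["name", "age"]]  # list of names -> DataFrame
```

Dot notation is convenient, but only works when the column name is a valid Python identifier and doesn't clash with a DataFrame attribute; bracket notation always works.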

When you want to select a subset of one or more rows or columns, you can take advantage of the .iloc property. .iloc is used for index-based selection, so you can pass it integers, lists of integers, slice objects, or even boolean arrays for both the rows and columns.

Note that if you select one row this way, .iloc returns a Series instead of a DataFrame. If you need the row to be selected as a DataFrame, pass the integer as a single-valued list (like this: .iloc[[1]]).
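The distinction is easy to see on a small hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30], "b": [1, 2, 3]})

row = df.iloc[1]           # single integer -> Series
rows = df.iloc[[1]]        # single-valued list -> one-row DataFrame
block = df.iloc[0:2, 0:1]  # slices for both rows and columns
```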

If you have found that your data has missing values, and records with missing data are of no use to you, you may drop any records with missing values from the DataFrame with .dropna() .

If you have found that your data has missing values, but you can’t drop missing data, you can take advantage of the .fillna() method which will accept a value as a parameter and assign that value to any missing ones.
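Both approaches on the same made-up data with one missing score:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [9.5, np.nan, 7.0]})

dropped = df.dropna()   # discard any row containing a missing value
filled = df.fillna(0)   # or replace missing values with a chosen constant
```

Note that both methods return a new DataFrame by default; the original is left untouched unless you reassign it.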

Sometimes, the data in your DataFrame may be structured differently than you need it to be. Perhaps it is delivered to you in a more reporting-friendly (wide) layout, with one column per measurement, and you want to transform it into a more analysis-friendly (long) layout, with one row per measurement.

Suppose our data for month, products collected, dollars collected, and visitors is organized in that reporting-friendly layout, with one column per measurement. In this case, you can use .melt() to turn the columns into rows.
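A sketch with hypothetical monthly figures, melting the three measurement columns into rows:

```python
import pandas as pd

# Reporting-friendly (wide): one column per measurement.
wide = pd.DataFrame({
    "month": ["Jan", "Feb"],
    "products": [10, 12],
    "dollars": [250, 310],
    "visitors": [80, 95],
})

# melt keeps month as an identifier and stacks the other
# columns into (measure, amount) rows.
long = pd.melt(wide, id_vars="month", var_name="measure", value_name="amount")
print(long)
```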

It is also possible that the opposite will be the case. When your data is delivered to you in a more analysis friendly version, and you need it to be more report friendly, you can use .pivot() to turn unique values into separate columns. To illustrate this, let’s pivot the DataFrame that we just melted.
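Starting from long-format data of the same shape as the melted example, .pivot() reverses the transformation:

```python
import pandas as pd

# Analysis-friendly (long): one row per (month, measure) pair.
long = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb"],
    "measure": ["products", "visitors", "products", "visitors"],
    "amount": [10, 80, 12, 95],
})

# Each unique value in "measure" becomes its own column again.
wide = long.pivot(index="month", columns="measure", values="amount")
print(wide)
```

Note that .pivot() raises an error if any (index, columns) pair is duplicated; .pivot_table() handles duplicates by aggregating them.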

Remember, your data will not always come in one huge file with everything you need. Sometimes, you will need to combine data from different sources to develop the complete data source for analysis. Luckily, DataFrames make it easy to combine data in these cases.

Given that you have data that is of the same structure but segmented into different subsets (for example, five years of sales data split into multiple DataFrames), you can use the pd.concat() function to combine the five DataFrames into one complete DataFrame for analysis. Think of pd.concat() as a way to stitch together pieces of data that were once a single data set.
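A minimal sketch with two made-up slices of sales data:

```python
import pandas as pd

sales_2022 = pd.DataFrame({"month": ["Nov", "Dec"], "units": [40, 55]})
sales_2023 = pd.DataFrame({"month": ["Jan", "Feb"], "units": [38, 47]})

# Stack DataFrames with the same columns back into one;
# ignore_index=True rebuilds a clean 0..n-1 row index.
all_sales = pd.concat([sales_2022, sales_2023], ignore_index=True)
print(all_sales)
```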

Finally, there may be times where you have two DataFrames that are connected in some way with a common identifier, but have different attributes. An example of this could be a DataFrame schools with information about each school, and a DataFrame teachers with information about each teacher (including which school they work at). You may need to combine each teacher’s information and the information about their school into one DataFrame for analysis purposes.

This can be accomplished by calling the .merge() method. Think of this method as the pandas equivalent of joining two tables in SQL.
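A sketch using the schools/teachers example from above, with hypothetical names and a shared school_id key:

```python
import pandas as pd

schools = pd.DataFrame({
    "school_id": [1, 2],
    "school_name": ["North High", "South High"],
})
teachers = pd.DataFrame({
    "teacher": ["Ann", "Ben", "Cal"],
    "school_id": [1, 1, 2],
})

# Like a SQL join: rows are matched on the shared school_id column.
combined = teachers.merge(schools, on="school_id", how="left")
print(combined)
```

The how parameter mirrors SQL join types ("left", "right", "inner", "outer").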

Time series data can be easily plotted with the pandas method .plot(). Given that your DataFrame uses dates and times as its index (a DatetimeIndex), you can access the values in a column using square brackets []. Once you have isolated the Series of values, you can take advantage of .plot() to visualize the time series.

Calling .plot.bar() produces a multiple bar plot; if you set the stacked parameter to True, you will receive a stacked bar plot instead. Boxplots are produced by calling the .plot.box() method of a DataFrame, which is useful for visualizing the distribution of values within each column.
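A minimal sketch of these plotting calls, using made-up daily sales figures; matplotlib's headless Agg backend is selected so the script runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no display required
import pandas as pd

# Hypothetical daily sales indexed by a DatetimeIndex.
idx = pd.date_range("2024-01-01", periods=6, freq="D")
df = pd.DataFrame({"sales": [3, 5, 4, 6, 7, 5]}, index=idx)

ax_line = df["sales"].plot()        # time series line plot
ax_bar = df.plot.bar(stacked=True)  # (stacked) bar plot
ax_box = df.plot.box()              # boxplot of each column
```

Each call returns a matplotlib Axes object, so you can set titles and labels on it before saving or showing the figure.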

That’s it! That’s DataFrames 101. There is plenty more to know about DataFrames, but this is intended to give you a good jumping off point to go and learn more and perfect your skills.
