Pandas Library – The Powerful Data Custodian

What is pandas?

The pandas library is one of the foundational and most significant Python libraries for working with data. It is designed to import, clean, manipulate, prepare, and export data, and it plays a central role in the machine learning workflow: it is almost always what data analysts and scientists use to prepare data for reporting or for training algorithms. Its ongoing development began in 2008 and continues to this day.

pandas library logo

One of the biggest innovations of the pandas library is the implementation of the DataFrame and the tools associated with it. DataFrames are a step above arrays and allow users to operate on data in ways that are much easier than the simpler structure of arrays or lists. DataFrames include indexes that can be defined using almost any data type and allow for columns to be referenced by column header names.

This structure is in many ways similar to a spreadsheet in a workbook, as in Google Sheets, Excel, or similar software. pandas sets up, manages, and ultimately lets you use these data structures to accomplish tasks that are sometimes fairly complicated. DataFrames may also feel familiar to people with SQL experience, as many pandas functions and operations can produce outputs similar to what you could achieve with SQL.

There is a lot that Python coders can do with pandas, so let's get into learning some of the most important features. Follow along with me in this notebook: pandas exploration notebook in Google Colab

Things to know for this article

Variable(s) – If you've ever taken algebra, this word should be familiar. A variable is simply a symbol, character, word, or phrase that represents a value that is likely to change based on context. For more info on variables in Python, check out this video.

In Python, the standard convention is to give variables descriptive names that indicate what will be stored in them, with underscores replacing any spaces. A variable that holds baseball scores might simply be named baseball_scores, and a wide variety of data or values can be stored in it. This naming style allows anyone to better understand the role of the variable inside the written code.
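Here's a minimal illustration (the variable name and values are just an example):

baseball_scores = [4, 7, 2, 9]  # a descriptive name, with an underscore replacing the space
print(baseball_scores)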

Function(s) – In coding, this refers to a process or set of commands that operates on any provided input.

Parameter(s) – Python functions accept more than just variables. Parameters are essentially named inputs, often with specific set choices, that give a function more information about how to operate. In the read_csv section below, an example of this is the sep="\t" argument we use on the tab-separated values import.
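To see how functions and parameters fit together, here's a tiny hypothetical function; the names greet and punctuation are made up for illustration, but the default-value pattern is the same one read_csv uses for sep:

def greet(name, punctuation="!"):
    # punctuation has a default value, just like sep="," is the default in read_csv
    print("Hello, " + name + punctuation)

greet("pandas")                    # uses the default: Hello, pandas!
greet("pandas", punctuation="?")   # overrides it: Hello, pandas?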

Now, let's get into pandas.

Importing Data and Setting up DataFrames with pandas

pandas.read_csv (documentation here) is a quick and easy way to import text from almost any structured text file. That is to say, if you have a file in any text-based format with a consistent structure indicating rows and columns, read_csv can likely parse it into a Python variable. People who work with data often need to export data from a variety of systems and import it together into databases to update reports. The file extension is irrelevant when using read_csv as long as you know the separator. Let's explore this in the code.

First, we need to import pandas. Best practice, and what you will most commonly see, is for coders to import a library under a short alias. Mostly, this just shortens the code needed to call any function in the library. Here's how we import the pandas library:

import pandas as pd

pd.read_csv

We can then call any of the functions in the pandas library using pd.[function_name]. The first function we will use is read_csv. This example reads the iris dataset, which contains measurements of various features of iris flowers, importing it from an online source:

iris_csv = pd.read_csv("https://forge.scilab.org/index.php/p/rdataset/source/file/master/csv/datasets/iris.csv")

We have imported this dataset and can now display the first few rows on screen (note the parentheses on .head() – without them, Python prints the method object rather than calling it):

Input:
print("Iris Dataset from CSV format")
display(iris_csv.head())
print("---")

Output:
Iris Dataset from CSV format
   Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
0           1           5.1          3.5           1.4          0.2  setosa
1           2           4.9          3.0           1.4          0.2  setosa
2           3           4.7          3.2           1.3          0.2  setosa
3           4           4.6          3.1           1.5          0.2  setosa
4           5           5.0          3.6           1.4          0.2  setosa
---

We can also import a tab-separated file simply by specifying the separator that indicates columns. In this case, I created a file from the above CSV to show how this works. The code below loads it from a local source, in this case my Google Drive, as I code Python in Google Colaboratory. If you're using a desktop environment like Anaconda or other coding software, you'd simply put the file location inside double quotes, right after the opening parenthesis of the read_csv call. Read the read_csv documentation to better understand the parameters it accepts. Here's how we would import from a tab-separated file:

iris_tsv = pd.read_csv("/content/drive/MyDrive/Data Business/Blog/Notebooks for Articles/Datasets/iris.tsv", sep="\t")

We can then print similar to the above:

Input:
print("Iris Dataset from TSV format")
display(iris_tsv.head())
print("---")

Output:
Iris Dataset from TSV format
   Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
0           1           5.1          3.5           1.4          0.2  setosa
1           2           4.9          3.0           1.4          0.2  setosa
2           3           4.7          3.2           1.3          0.2  setosa
3           4           4.6          3.1           1.5          0.2  setosa
4           5           5.0          3.6           1.4          0.2  setosa
---

It doesn't matter if the file extension is strange, either, as long as you know the internal structure of the file. Here's an example of read_csv importing data from a .text file, a format some data systems use for exports:

Input:
iris_text = pd.read_csv("/content/drive/MyDrive/Data Business/Blog/Notebooks for Articles/Datasets/iris.text")

print("Iris Dataset from .txt file extension")
display(iris_text.head)
print("---")

Output:
Iris Dataset from .text file extension
   Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
0           1           5.1          3.5           1.4          0.2  setosa
1           2           4.9          3.0           1.4          0.2  setosa
2           3           4.7          3.2           1.3          0.2  setosa
3           4           4.6          3.1           1.5          0.2  setosa
4           5           5.0          3.6           1.4          0.2  setosa
---

I could write an entire article on the various parameters of read_csv, but this is enough to get you started importing data. Read the documentation here for more info.
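As a taste of what's available, here's a hedged sketch using a few parameters read_csv supports (usecols, nrows, and na_values); the exact values here are arbitrary:

iris_sample = pd.read_csv(
    "https://forge.scilab.org/index.php/p/rdataset/source/file/master/csv/datasets/iris.csv",
    usecols=["Sepal.Length", "Species"],  # only import these columns
    nrows=10,                             # only read the first 10 rows
    na_values=["NA", "?"],                # treat these strings as missing values
)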

pd.DataFrame

Once we have the data imported, we can create a DataFrame based on it. Note that read_csv already returns a DataFrame, so wrapping it in pd.DataFrame is technically redundant, but it makes the intent explicit and shows the constructor at work. Here's how:

iris_final = pd.DataFrame(iris_csv)

We can even combine both functions to create the DataFrame from the import in one embedded command:

iris_final = pd.DataFrame(pd.read_csv("https://forge.scilab.org/index.php/p/rdataset/source/file/master/csv/datasets/iris.csv"))

Once the data is in a DataFrame, it is much simpler to clean, modify, and edit. See more about DataFrames in the pandas.DataFrame documentation here. There is a lot you can do at the creation step, including indicating whether you want to use a specific column as the index. Read the documentation for more details.
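DataFrames don't have to come from files, either. Here's a minimal sketch building one directly from a Python dictionary with a custom index (the data is made up for illustration):

team_scores = pd.DataFrame(
    {"team": ["Cubs", "Mets", "Reds"], "runs": [5, 3, 8]},
    index=["game_1", "game_2", "game_3"],  # set the index at creation time
)
display(team_scores)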

Cleaning and Manipulating Data

Let's take a look at what our dataset looks like once we've imported it using read_csv and created a DataFrame:

Input:

display(iris_final)

Output:
     Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
0             1           5.1          3.5           1.4          0.2     setosa
1             2           4.9          3.0           1.4          0.2     setosa
2             3           4.7          3.2           1.3          0.2     setosa
3             4           4.6          3.1           1.5          0.2     setosa
4             5           5.0          3.6           1.4          0.2     setosa
..          ...           ...          ...           ...          ...        ...
145         146           6.7          3.0           5.2          2.3  virginica
146         147           6.3          2.5           5.0          1.9  virginica
147         148           6.5          3.0           5.2          2.0  virginica
148         149           6.2          3.4           5.4          2.3  virginica
149         150           5.9          3.0           5.1          1.8  virginica

150 rows × 6 columns

The first unlabeled column in the dataset is the index. We can also see that the source data had an index column of its own, which we will need to clean up. When we run into this, there are actually two ways we can correct it, the best of which is to alter the code we used to import the file. pandas.read_csv accepts an index_col parameter, which takes column numbers or names; read_csv then sets the indicated column as the index instead of creating a new one. Let's fix and re-run this code and see what it looks like:

Input:
iris_final = pd.DataFrame(pd.read_csv("https://forge.scilab.org/index.php/p/rdataset/source/file/master/csv/datasets/iris.csv", index_col=0))

display(iris_final)

Output:
     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
1             5.1          3.5           1.4          0.2     setosa
2             4.9          3.0           1.4          0.2     setosa
3             4.7          3.2           1.3          0.2     setosa
4             4.6          3.1           1.5          0.2     setosa
5             5.0          3.6           1.4          0.2     setosa
..            ...          ...           ...          ...        ...
146           6.7          3.0           5.2          2.3  virginica
147           6.3          2.5           5.0          1.9  virginica
148           6.5          3.0           5.2          2.0  virginica
149           6.2          3.4           5.4          2.3  virginica
150           5.9          3.0           5.1          1.8  virginica

150 rows × 5 columns

This looks much better. Keep in mind, when finding bugs in your code, it is best practice to go back to the origin of the problem and solve it there. We could have simply deleted a column in the DataFrame above and set the index to the original index column, but that requires far more code than simply adding the index_col parameter to read_csv. Next, let's explore some of the features of the dataset.

DataFrame.info() in pandas

When evaluating a dataset, there are often a few key functions to run to get a better idea of the data you are working with. Here is a list of the ones I like to run when first analyzing a dataset:

pandas.DataFrame.info() – Documentation – this function gives you a decent snapshot of the DataFrame structure, including the number of rows, the column names and positions, the number of non-null (non-NaN) values in each column, and each column's data type.

I also like to print pandas.DataFrame.columns to get the column names in list form. This is helpful when you want to rename the column headers that come with the source data.
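For example, here's a small sketch of printing the columns and then renaming one of them with DataFrame.rename (the new name is arbitrary):

print(iris_final.columns.tolist())  # column names as a plain Python list

# rename columns using an {old_name: new_name} dictionary
iris_renamed = iris_final.rename(columns={"Sepal.Length": "sepal_length"})
print(iris_renamed.columns.tolist())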

There are numerous other functions you could run, but these give you the most info in the shortest amount of time, and in most cases enough to determine where to go next. The iris dataset we've been using is already clean, so to look at how to clean data, we'll need a dataset that contains errors. For this, I've decided to use the SteamSpy data, which can be accessed from this page on Kaggle – Steam app and steamspy data from Kaggle

These datasets are raw data scraped from Valve's Steam platform and contain information on computer games: both data and descriptions about each game, plus data from reviews. Let's import them using what we did above with the iris dataset:

steam_spy = pd.DataFrame(pd.read_csv("/content/drive/MyDrive/Data Business/Blog/Notebooks for Articles/Datasets/steamspy_data.csv"))

display(steam_spy.head())

steam_app = pd.DataFrame(pd.read_csv("/content/drive/MyDrive/Data Business/Blog/Notebooks for Articles/Datasets/steam_app_data.csv"))
display(steam_app.head())

# see notebook for outputs

There are numerous columns in both of these datasets that have null values. There are many reasons this can occur. With Steam data, it could be that data features (columns) were added to the table after some of the older games. It could be human error. It could simply be a small bug in the program that pulled the data. There are many reasons why errors occur, and the most important thing is to assess what to do with them, unless you are the engineer in charge of building the systems that capture this data. Let's take a look at the .info() for both of these DataFrames.

Input:
print(steam_app.info())
print()
print(steam_spy.info())

Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66803 entries, 0 to 66802
Data columns (total 39 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   type                     66655 non-null  object 
 1   name                     66803 non-null  object 
 2   steam_appid              66803 non-null  int64  
 3   required_age             66655 non-null  object 
 4   is_free                  66655 non-null  object 
 5   controller_support       14422 non-null  object 
 6   dlc                      9342 non-null   object 
 7   detailed_description     66587 non-null  object 
 8   about_the_game           66585 non-null  object 
 9   short_description        66593 non-null  object 
 10  fullgame                 0 non-null      float64
 11  supported_languages      66599 non-null  object 
 12  header_image             66655 non-null  object 
 13  website                  36119 non-null  object 
 14  pc_requirements          66655 non-null  object 
 15  mac_requirements         66655 non-null  object 
 16  linux_requirements       66655 non-null  object 
 17  legal_notice             19502 non-null  object 
 18  drm_notice               278 non-null    object 
 19  ext_user_account_notice  1124 non-null   object 
 20  developers               66495 non-null  object 
 21  publishers               66655 non-null  object 
 22  demos                    6009 non-null   object 
 23  price_overview           47350 non-null  object 
 24  packages                 47963 non-null  object 
 25  package_groups           66655 non-null  object 
 26  platforms                66655 non-null  object 
 27  metacritic               3726 non-null   object 
 28  reviews                  8872 non-null   object 
 29  categories               65699 non-null  object 
 30  genres                   66526 non-null  object 
 31  screenshots              66533 non-null  object 
 32  movies                   61673 non-null  object 
 33  recommendations          11674 non-null  object 
 34  achievements             30294 non-null  object 
 35  release_date             66655 non-null  object 
 36  support_info             66655 non-null  object 
 37  background               66547 non-null  object 
 38  content_descriptors      66655 non-null  object 
dtypes: float64(1), int64(1), object(37)
memory usage: 19.9+ MB
None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63968 entries, 0 to 63967
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   appid            63968 non-null  int64  
 1   name             63865 non-null  object 
 2   developer        56705 non-null  object 
 3   publisher        56740 non-null  object 
 4   score_rank       48 non-null     float64
 5   positive         63968 non-null  int64  
 6   negative         63968 non-null  int64  
 7   userscore        63968 non-null  int64  
 8   owners           63968 non-null  object 
 9   average_forever  63968 non-null  int64  
 10  average_2weeks   63968 non-null  int64  
 11  median_forever   63968 non-null  int64  
 12  median_2weeks    63968 non-null  int64  
 13  price            56854 non-null  float64
 14  initialprice     56856 non-null  float64
 15  discount         56856 non-null  float64
 16  languages        56814 non-null  object 
 17  genre            56653 non-null  object 
 18  ccu              63968 non-null  int64  
 19  tags             63968 non-null  object 
dtypes: float64(4), int64(9), object(7)
memory usage: 9.8+ MB
None

For the steam_app DataFrame, we can see that it has 66803 rows, but numerous columns have varying amounts of non-null objects. In the steam_spy DataFrame, we are working with 63968 rows, with many of the columns having an equal number of non-null fields, or at the very least much higher counts of them. For the purposes of this article, let's pick a few fields to work with and correct. Different data types can be corrected in different ways, so we will select features (columns) that have different data types and different solutions for correction. Here are the fields we will be working with:

developers from the steam_app DataFrame (datatype: object)

steam_appid from the steam_app DataFrame (datatype: int64)

controller_support from steam_app DataFrame (datatype: object)

All of these are missing some values (have null / NaN values). Some are also not stored as the best data type and will need to be converted. Let's take a look at each one separately.

Data cleaning and preparation using pandas

.isna() & DataFrame.loc in pandas

The developers field has 66495 non-null objects, so it has exactly 308 records that need correction. Let's take a closer look at these rows. We can do this by grabbing the slice of data where this column is null or NaN. DataFrame.loc is an indexer that allows us to easily grab slices of data using labels or boolean masks; passing a boolean mask directly to the DataFrame in square brackets, as we do below, produces the same result. See the documentation on df.loc here. We will also be using .isna(), which returns True for all null rows, giving us our mask. .isna() documentation can be found here.

Here's how we generate a data slice of all rows where developers is null or NaN:

Input:

null_developers = steam_app[steam_app['developers'].isna()]
display(null_developers.head())

# See the companion notebook for the output.

At a quick glance, some of these rows appear to have null or NaN values in other columns as well. This usually points to a problem with the source information for those rows, or they simply don't have data for these fields. In our case, this could be due to games being relatively obscure, dying in development, or awaiting future updates from a developer. The object datatype is fine to remain as is. Objects can be many things, but in short they are often text or string values that describe some aspect of the specific data row. In this case, developers should tell us the name of the group developing the game.
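If we want to quantify that impression rather than eyeball it, a quick sketch is to count the nulls per row in this slice:

null_counts = null_developers.isna().sum(axis=1)  # number of null fields in each row
print(null_counts.describe())                     # summary stats of those per-row counts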

DataFrame.fillna() in pandas

In a case like this, where rows have a significant number of null values, we may want to consider more detailed approaches to cleaning the entire dataset, as these rows can throw off or skew results for the whole dataset, especially once we start calculating descriptive statistics; I'll leave that for a future article. We know that a NaN or null value in the developers field likely means the developer is unknown, so it is better to simply update all null / NaN values to 'unknown'. Here is how we can easily do that in Python, using .fillna():

Input:
steam_app['developers'] = steam_app['developers'].fillna('Unknown')
print(steam_app.info())

Output:

RangeIndex: 66803 entries, 0 to 66802
Data columns (total 39 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   type                     66655 non-null  object 
 1   name                     66803 non-null  object 
 2   steam_appid              66803 non-null  int64  
 3   required_age             66655 non-null  object 
 4   is_free                  66655 non-null  object 
 5   controller_support       14422 non-null  object 
 6   dlc                      9342 non-null   object 
 7   detailed_description     66587 non-null  object 
 8   about_the_game           66585 non-null  object 
 9   short_description        66593 non-null  object 
 10  fullgame                 0 non-null      float64
 11  supported_languages      66599 non-null  object 
 12  header_image             66655 non-null  object 
 13  website                  36119 non-null  object 
 14  pc_requirements          66655 non-null  object 
 15  mac_requirements         66655 non-null  object 
 16  linux_requirements       66655 non-null  object 
 17  legal_notice             19502 non-null  object 
 18  drm_notice               278 non-null    object 
 19  ext_user_account_notice  1124 non-null   object 
 20  developers               66803 non-null  object 
 21  publishers               66655 non-null  object 
 22  demos                    6009 non-null   object 
 23  price_overview           47350 non-null  object 
 24  packages                 47963 non-null  object 
 25  package_groups           66655 non-null  object 
 26  platforms                66655 non-null  object 
 27  metacritic               3726 non-null   object 
 28  reviews                  8872 non-null   object 
 29  categories               65699 non-null  object 
 30  genres                   66526 non-null  object 
 31  screenshots              66533 non-null  object 
 32  movies                   61673 non-null  object 
 33  recommendations          11674 non-null  object 
 34  achievements             30294 non-null  object 
 35  release_date             66655 non-null  object 
 36  support_info             66655 non-null  object 
 37  background               66547 non-null  object 
 38  content_descriptors      66655 non-null  object 
dtypes: float64(1), int64(1), object(37)
memory usage: 19.9+ MB
None

We can now see that developers has 66803 non-null objects, so we have successfully completed this task.

DataFrame.astype() in pandas

Our next field to look at is steam_appid. Steam uses a unique numerical code for each game on its platform. We also know that this field has 66803 non-null values. The problem with this field is not that it has null values, but rather that those values are stored in this DataFrame as an int64 data type.

This is a common quirk of read_csv(): when it scans a data import, it interprets any column containing only numerical data as a numerical data type. These codes serve as unique identifiers for Steam games; we don't need to do any calculations with them, and even if we did, the results wouldn't provide any helpful information.

We should convert them to an object type to ensure the column is not mistakenly used for anything other than identifying specific games and tying data from other tables (like the steam_spy DataFrame) to each game. Here is how you can change a column's datatype in pandas using .astype():

Input:

steam_app['steam_appid'] = steam_app['steam_appid'].astype(str)
print(steam_app.info())

Output:

RangeIndex: 66803 entries, 0 to 66802
Data columns (total 39 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   type                     66655 non-null  object 
 1   name                     66803 non-null  object 
 2   steam_appid              66803 non-null  object 
 3   required_age             66655 non-null  object 
 4   is_free                  66655 non-null  object 
 5   controller_support       14422 non-null  object 
 6   dlc                      9342 non-null   object 
 7   detailed_description     66587 non-null  object 
 8   about_the_game           66585 non-null  object 
 9   short_description        66593 non-null  object 
 10  fullgame                 0 non-null      float64
 11  supported_languages      66599 non-null  object 
 12  header_image             66655 non-null  object 
 13  website                  36119 non-null  object 
 14  pc_requirements          66655 non-null  object 
 15  mac_requirements         66655 non-null  object 
 16  linux_requirements       66655 non-null  object 
 17  legal_notice             19502 non-null  object 
 18  drm_notice               278 non-null    object 
 19  ext_user_account_notice  1124 non-null   object 
 20  developers               66803 non-null  object 
 21  publishers               66655 non-null  object 
 22  demos                    6009 non-null   object 
 23  price_overview           47350 non-null  object 
 24  packages                 47963 non-null  object 
 25  package_groups           66655 non-null  object 
 26  platforms                66655 non-null  object 
 27  metacritic               3726 non-null   object 
 28  reviews                  8872 non-null   object 
 29  categories               65699 non-null  object 
 30  genres                   66526 non-null  object 
 31  screenshots              66533 non-null  object 
 32  movies                   61673 non-null  object 
 33  recommendations          11674 non-null  object 
 34  achievements             30294 non-null  object 
 35  release_date             66655 non-null  object 
 36  support_info             66655 non-null  object 
 37  background               66547 non-null  object 
 38  content_descriptors      66655 non-null  object 
dtypes: float64(1), object(38)
memory usage: 19.9+ MB
None

.astype(str) converts the values of the preceding data (in our case steam_app['steam_appid']) to a string or text type, called 'object' in pandas. steam_app.info() shows us that we have successfully converted the datatype. On to controller_support.

Let's take a closer look at the slice of data where controller_support is not null. We can grab this slice of data by entering this code:

display(steam_app[~steam_app['controller_support'].isnull()])

See the companion notebook for the output.

In cases where we just want to look at a data slice, we can simplify the process from the developers section above by simply printing the data instead of storing it and then printing it. We also use the tilde, or '~' symbol, in this line of code to indicate "not", as in: return the rows from steam_app where steam_app['controller_support'].isnull() is NOT true. Take a look at this Stack Overflow post for a bit more information.
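For reference, the same slice can be written without the tilde by using .notna(), or with an explicit .loc; all three lines below should return identical rows:

display(steam_app[~steam_app['controller_support'].isnull()])    # tilde + isnull
display(steam_app[steam_app['controller_support'].notna()])      # notna, no tilde needed
display(steam_app.loc[steam_app['controller_support'].notna()])  # the explicit .loc form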

We can see from this data slice that rows without null / NaN values appear to have 'full' in their record. We can reasonably assume one of two things: either NaN means there is no controller support, OR whether a game supports a controller is unknown. Cases like this can be complicated to resolve. If we store 'none' in the field for all null values, we might be incorrect on some records. Keep in mind that NaN values can arise simply from not having data – but not having data does not mean that the correct data doesn't exist somewhere else.

This is a situation where it is best to consult with other people to better understand the source of the error, the impacts of the various replacement choices, and any other factors before proceeding. It is quite possible, for instance, that controllers weren't supported on some early Steam games, or that there are other reasons or conditions that would allow us to be more specific about our corrections.

One of the most important steps in data cleaning is making sure you have enough information to make the decision on what to do – and who to ask if you need more information. How would you correct the NaN values in controller_support? Would you fill it with ‘none’ or ‘unknown’? What best reflects what we know? What questions could we ask that might help us get more clarity?
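For what it's worth, if after asking those questions we decided that 'unknown' best reflects what we actually know, the fix would mirror the developers example above. This is one possible choice, not the definitive answer:

steam_app['controller_support'] = steam_app['controller_support'].fillna('unknown')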

Data Exports using pandas

pandas can be used to export data in various ways. I use Google Colaboratory, so I can push exports directly into a Drive folder, but if you are running locally via Anaconda or some other platform, all you need to do is enter the file path you want to export the file to. I'll cover more integrations with Google in the future, as I have a number of scripts that integrate with libraries that connect to the Google API.

DataFrame.to_csv() in pandas

Exporting data using pandas is relatively simple with pandas.DataFrame.to_csv(). This function exports a DataFrame as a Comma-Separated Values (CSV) file, a fairly standard format that most data systems, tools, and applications can easily import. It also has numerous parameters to customize your export: you can define the separator, change the column names, and even define what to store in fields that have missing data. Here's an example from our companion Colab notebook:

iris_final.to_csv('/content/drive/MyDrive/Data Business/Blog/Notebooks for Articles/Exports/iris_final.csv')

This creates a file of the same name as the DataFrame (because I set it in the file path) inside my Google Drive. This is what the file looks like when opened:

CSV of data describing iris flowers

We can also do something more complicated. Here's what it looks like when we drop the index and store the output as a .zip file to compress the file size for sending:

compression_options = dict(method='zip', archive_name='steam_app.csv')  # custom options to pass to the to_csv compression parameter


steam_app.to_csv('/content/drive/MyDrive/Data Business/Blog/Notebooks for Articles/Exports/steam_app.zip', index=False, compression=compression_options)

This process may take a few moments to run depending on the size of your DataFrame. I ran this on the larger steam_app dataset that we created earlier. Let's say we didn't want to leave all the NaN values as blanks. By default, to_csv converts all NaN values in a DataFrame to "", or just blank. In some cases, we might want to populate this with something more useful. For instance, if we wanted all NaN values to instead show "No Data", we can set this value via the na_rep parameter of .to_csv():

steam_app.to_csv('/content/drive/MyDrive/Data Business/Blog/Notebooks for Articles/Exports/steam_app.zip', index=False, na_rep="No Data", compression=compression_options)

We can also select the specific columns we want to export using the columns parameter of the to_csv() function. This is useful when we are only looking for data in specific columns and need to cut down the file size of the dataset.

steam_app.to_csv('/content/drive/MyDrive/Data Business/Blog/Notebooks for Articles/Exports/steam_app_columns.zip', columns=['type', 'name', 'steam_appid', 'required_age', 'is_free', 'controller_support', 'dlc'], index=False, na_rep="No Data", compression=compression_options)

When handling data, it is important to understand the implications of your dataset's size. A larger file or dataset is harder to share and can lead to longer run times when executing commands. That said, sometimes we need large sets. The trick is to be mindful about what is in the dataset; when it contains data we don't need, we should simply not include it. If it doesn't have value to our audience, it just muddies whatever we are trying to communicate and/or slows down our process of reaching conclusions. Here's the file size comparison of the steam_app exports stored as .zip files for the three code snippets above:

file export comparisons with the columns .zip file being significantly smaller
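Relatedly, if you want to gauge a DataFrame's size before deciding what to export, a quick sketch is to check its in-memory footprint with .memory_usage():

memory_mb = steam_app.memory_usage(deep=True).sum() / 1024**2  # deep=True counts the contents of object (string) columns
print(round(memory_mb, 1), "MB")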

Use pandas to clean and control your data

pandas is an incredibly powerful library for importing, manipulating, and exporting data. DataFrames are a powerful tool, and later on I will show you how we can integrate DataFrames with numerous other libraries and even machine learning algorithms. pandas is a core library for all modern data work. If you want to learn more, check out the pandas documentation. You can also learn a lot by signing up for DataCamp, or follow in my footsteps and take a course with Practicum by Yandex or Pathstream. Yandex offers more robust courses, but both are great ways to learn more about Python. Check out the links below:

Sign up for DataCamp today – https://www.datacamp.com/promo/build-the-future-2022 – Get 53% off for your first year

Practicum by Yandex – https://practicum.yandex.com/

Pathstream – https://www.pathstream.com

Sign up for our Mailing List for updates and new posts sent to your inbox

Check out other posts on Python Unbound

Click HERE to learn more about Python Unbound and my story returning to code.

Connect with Python Unbound on the following Social Media Platforms:

Connect with me personally on LinkedIn, Upwork, and Twitter
