Pandas Library – The Powerful Data Custodian
What is pandas?
The pandas library is one of the foundational and most significant Python libraries for working with data. It is designed for importing, cleaning, manipulating, preparing, and exporting data, and it plays a central role in the machine learning process: it is almost always what data analysts and scientists use to prepare data for reporting or for training algorithms. Its ongoing development began in 2008 and continues to this day.
One of the biggest innovations of the pandas library is the DataFrame and the tools built around it. DataFrames are a step above arrays: they include indexes that can be defined using almost any data type, and they allow columns to be referenced by their header names, making operations far easier than with the simpler structure of arrays or lists.
This structure is in many ways similar to a spreadsheet in a workbook, as in Google Sheets, Excel, or similar software. pandas sets up, manages, and ultimately operates on these data structures to accomplish tasks that can be fairly complicated. DataFrames may also feel familiar to people with SQL experience, as many pandas functions and operations can produce outputs similar to what one could achieve with SQL.
There is a lot that Python coders can do with pandas, so let's get into learning some of the most important features. Follow along with me in this notebook: pandas exploration notebook in Google Colab
Things to know for this article
Variable(s) – If you’ve ever taken algebra, this word should be familiar. A variable is simply a symbol, character, word, or phrase that represents a value that is likely to change based on context. For more info on variables in Python, check out this video.
In Python, the standard convention is to create variable names that are descriptive, indicating what will be stored in the variable, with underscores replacing any spaces. A variable that holds baseball scores might simply be named baseball_scores, and any variety of data or values can be stored in it. This helps anyone better understand the variable’s role inside the written code.
Function(s) – In coding, this refers to a process or set of commands that operates on any provided input.
Parameter(s) – Python functions accept more than just variables. Parameters are essentially named inputs, often with a specific set of accepted values, that tell a function more about how to operate. In the read_csv section below, the sep="\t" argument we pass on the tab-separated import is an example of this. A small sketch tying these terms together follows this list.
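Here is that minimal sketch; the function name and values are invented purely for illustration:

# 'baseball_scores' is a variable: a descriptive name holding values
baseball_scores = [4, 2, 7]

# 'summarize' is a function; 'scores' and 'label' are its parameters
def summarize(scores, label="Scores"):
    print(label, "average:", sum(scores) / len(scores))

# passing the parameter label="Runs" tells the function how to operate
summarize(baseball_scores, label="Runs")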
Now, let's get into pandas.
Importing Data and Setting up DataFrames with pandas
pandas.read_csv (documentation here) is a quick and easy way to import text from almost any structured text file. That is to say, if you have a file in a text-based format with a consistent structure indicating rows and columns, read_csv can likely parse it into a Python variable. People who work with data often need to export it from a variety of systems and bring it together into databases to update reports. The file extension is irrelevant when using read_csv as long as you know the separator. Let's explore this in the code.
First, we need to import pandas. The best practice, and what you will see most commonly, is for coders to import a library under a short alias. Mostly, this just shortens the code needed to call any of the library's functions. Here's how we import the pandas library:
import pandas as pd
pd.read_csv
We can then call any function in the pandas library using pd.function_name. The first function we will use is read_csv. This example reads the iris dataset, which contains measurements of the features of iris flowers, imported from an online source:
iris_csv = pd.read_csv("https://forge.scilab.org/index.php/p/rdataset/source/file/master/csv/datasets/iris.csv")
We have imported this dataset and can now print it on screen:
Input:
print("Iris Dataset from CSV format")
display(iris_csv.head())
print("---")
Output:
Iris Dataset from CSV format
   Unnamed: 0  Sepal.Length  ...  Petal.Width Species
0           1           5.1  ...          0.2  setosa
1           2           4.9  ...          0.2  setosa
2           3           4.7  ...          0.2  setosa
3           4           4.6  ...          0.2  setosa
4           5           5.0  ...          0.2  setosa

[5 rows x 6 columns]
---
We can also import a tab-separated file simply by indicating the separator used for columns. In this case, I created a file from the above CSV to show how this works. The code below loads it from a local source, in this case my Google Drive, since I code Python in Google Colaboratory. If you are using a desktop environment like Anaconda or other coding software, you'd simply use the file's local path inside the double quotes, right after the opening parenthesis of the read_csv call. Read the read_csv documentation to better understand the parameters it accepts. Here's how we would import from a tab-separated file:
iris_tsv = pd.read_csv("/content/drive/MyDrive/Data Business/Blog/Notebooks for Articles/Datasets/iris.tsv", sep="\t")
We can then print it, similar to the above:
Input:
print("Iris Dataset from TSV format")
display(iris_tsv.head())
print("---")
Output:
Iris Dataset from TSV format
   Unnamed: 0  Sepal.Length  ...  Petal.Width Species
0           1           5.1  ...          0.2  setosa
1           2           4.9  ...          0.2  setosa
2           3           4.7  ...          0.2  setosa
3           4           4.6  ...          0.2  setosa
4           5           5.0  ...          0.2  setosa

[5 rows x 6 columns]
---
It doesn't matter if the file extension is unusual, either, as long as you know the internal structure of the file. Here's an example of read_csv importing data from a .text file, a format some data systems use for exports:
Input:
iris_text = pd.read_csv("/content/drive/MyDrive/Data Business/Blog/Notebooks for Articles/Datasets/iris.text")
print("Iris Dataset from .txt file extension")
display(iris_text.head)
print("---")
Output:
Iris Dataset from .text file extension
   Unnamed: 0  Sepal.Length  ...  Petal.Width Species
0           1           5.1  ...          0.2  setosa
1           2           4.9  ...          0.2  setosa
2           3           4.7  ...          0.2  setosa
3           4           4.6  ...          0.2  setosa
4           5           5.0  ...          0.2  setosa

[5 rows x 6 columns]
---
I could write an entire article on the various parameters of read_csv, but this is enough to get you started importing data. Read the documentation here for more info.
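As a sketch, here are a few of the parameters I reach for most often; the file path and values below are placeholders, not a real dataset:

example_df = pd.read_csv(
    "data/example.csv",            # placeholder path
    sep=",",                        # column separator ("\t" for tab-separated)
    index_col=0,                    # use the first column as the index
    nrows=1000,                     # read only the first 1,000 rows
    na_values=["N/A", "missing"],   # treat these strings as NaN
)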
pd.DataFrame
Once we have the data imported, we can create a DataFrame based on it. Here's how:
iris_final = pd.DataFrame(iris_csv)
We can even combine both functions to create the DataFrame from the import in one embedded command:
iris_final = pd.DataFrame(pd.read_csv("https://forge.scilab.org/index.php/p/rdataset/source/file/master/csv/datasets/iris.csv"))
Once the data is in a DataFrame, it is much simpler to clean, modify, and edit. See more about DataFrames in the pandas.DataFrame documentation here. There is a lot you can do at the creation step, including indicating whether you want to use a specific column as the index. Read the documentation for more details.
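One thing worth noting: read_csv already returns a DataFrame, so the pd.DataFrame() wrapper above is technically optional. Where pd.DataFrame really earns its keep is building a frame from other Python structures. A quick sketch with made-up data:

# building a DataFrame directly from a dictionary of columns
scores = pd.DataFrame({
    "player": ["Ana", "Ben", "Cal"],
    "runs": [4, 2, 7],
})
display(scores)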
Cleaning and Manipulating Data
Let's take a look at what our dataset looks like once we've imported it using read_csv and created a DataFrame:
Input:
display(iris_final)
Output:
Unnamed: 0 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 1 5.1 3.5 1.4 0.2 setosa
1 2 4.9 3.0 1.4 0.2 setosa
2 3 4.7 3.2 1.3 0.2 setosa
3 4 4.6 3.1 1.5 0.2 setosa
4 5 5.0 3.6 1.4 0.2 setosa
... ... ... ... ... ... ...
145 146 6.7 3.0 5.2 2.3 virginica
146 147 6.3 2.5 5.0 1.9 virginica
147 148 6.5 3.0 5.2 2.0 virginica
148 149 6.2 3.4 5.4 2.3 virginica
149 150 5.9 3.0 5.1 1.8 virginica
150 rows × 6 columns
The first unlabeled column in the dataset is the index. We can also see that the source data had an index column of its own, which we will need to clean up. There are actually two ways we can correct this, the better of which is to alter the code we used to import the file. pandas.read_csv accepts an index_col parameter, which takes column numbers or names and sets the indicated column as the index instead of creating a new one. Let's fix and re-run this code and see what it looks like:
Input:
iris_final = pd.DataFrame(pd.read_csv("https://forge.scilab.org/index.php/p/rdataset/source/file/master/csv/datasets/iris.csv", index_col=0))
display(iris_final)
Output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
... ... ... ... ... ...
146 6.7 3.0 5.2 2.3 virginica
147 6.3 2.5 5.0 1.9 virginica
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
150 rows × 5 columns
This looks much better. Keep in mind that when fixing bugs in your code, it is best practice to go back to the origin of the problem and solve it there. We could have instead deleted a column in the DataFrame above and set the index to the original index column, but that requires more code than simply adding the index_col parameter to read_csv.
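For comparison, here is roughly what that after-the-fact fix would look like (a sketch; it touches the DataFrame twice instead of fixing the import):

# the long way: import as before, then promote the stray column to the index
iris_final = pd.DataFrame(pd.read_csv("https://forge.scilab.org/index.php/p/rdataset/source/file/master/csv/datasets/iris.csv"))
iris_final = iris_final.set_index("Unnamed: 0")
iris_final.index.name = None  # clear the leftover header from the index

Next, let's explore some of the features of the dataset.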
DataFrame.info() in pandas
When evaluating a dataset, there are often a few key functions to run to get a better idea of the data you are working with. Here is a list of the ones I like to run when first analyzing a dataset:
pandas.DataFrame.info() – Documentation – this function gives you a decent snapshot of the DataFrame's structure, including the number of rows, the column names and positions, the count of non-null values in each column, and each column's data type.
I also like to print pandas.DataFrame.columns to get the column names in list form. This is helpful if and when you want to rename whatever column headers come with the source data, as sketched below.
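Here is a quick sketch of that first-look routine on our iris DataFrame, including a rename in case the source headers are awkward (the new name is just an example):

iris_final.info()                   # rows, columns, non-null counts, dtypes
print(iris_final.columns.tolist())  # column names as a plain Python list

# renaming a source header to something friendlier
iris_final = iris_final.rename(columns={"Sepal.Length": "sepal_length"})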
There are numerous other functions you could run, but these give you the most info in the shortest amount of time, and in most cases enough to determine where to go next. The iris dataset we've been using is already clean, so to look at how to clean data, we'll need a dataset that contains errors. For this, I've decided to use the SteamSpy data, which can be accessed from this page on Kaggle – Steam app and steamspy data from Kaggle
These datasets are raw data scraped from Valve's Steam platform and contain information on computer games: data and descriptions for each game, plus data from reviews. Let's import them using what we did above with the iris dataset:
steam_spy = pd.DataFrame(pd.read_csv("/content/drive/MyDrive/Data Business/Blog/Notebooks for Articles/Datasets/steamspy_data.csv"))
display(steam_spy.head())
steam_app = pd.DataFrame(pd.read_csv("/content/drive/MyDrive/Data Business/Blog/Notebooks for Articles/Datasets/steam_app_data.csv"))
display(steam_app.head())
# see notebook for outputs
There are numerous columns across both of these datasets that have null values, and many reasons this can occur. With the Steam data, it could be that data features (columns) were added to the table after some of the older games. It could be human error, or simply a small bug in the program that pulled the data. There are many reasons errors occur, and the most important thing is to assess what to do with them, unless you are the engineer in charge of building the systems that capture the data. Let's take a look at .info() for both of these DataFrames.
Input:
print(steam_app.info())
print()
print(steam_spy.info())
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66803 entries, 0 to 66802
Data columns (total 39 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 type 66655 non-null object
1 name 66803 non-null object
2 steam_appid 66803 non-null int64
3 required_age 66655 non-null object
4 is_free 66655 non-null object
5 controller_support 14422 non-null object
6 dlc 9342 non-null object
7 detailed_description 66587 non-null object
8 about_the_game 66585 non-null object
9 short_description 66593 non-null object
10 fullgame 0 non-null float64
11 supported_languages 66599 non-null object
12 header_image 66655 non-null object
13 website 36119 non-null object
14 pc_requirements 66655 non-null object
15 mac_requirements 66655 non-null object
16 linux_requirements 66655 non-null object
17 legal_notice 19502 non-null object
18 drm_notice 278 non-null object
19 ext_user_account_notice 1124 non-null object
20 developers 66495 non-null object
21 publishers 66655 non-null object
22 demos 6009 non-null object
23 price_overview 47350 non-null object
24 packages 47963 non-null object
25 package_groups 66655 non-null object
26 platforms 66655 non-null object
27 metacritic 3726 non-null object
28 reviews 8872 non-null object
29 categories 65699 non-null object
30 genres 66526 non-null object
31 screenshots 66533 non-null object
32 movies 61673 non-null object
33 recommendations 11674 non-null object
34 achievements 30294 non-null object
35 release_date 66655 non-null object
36 support_info 66655 non-null object
37 background 66547 non-null object
38 content_descriptors 66655 non-null object
dtypes: float64(1), int64(1), object(37)
memory usage: 19.9+ MB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63968 entries, 0 to 63967
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 appid 63968 non-null int64
1 name 63865 non-null object
2 developer 56705 non-null object
3 publisher 56740 non-null object
4 score_rank 48 non-null float64
5 positive 63968 non-null int64
6 negative 63968 non-null int64
7 userscore 63968 non-null int64
8 owners 63968 non-null object
9 average_forever 63968 non-null int64
10 average_2weeks 63968 non-null int64
11 median_forever 63968 non-null int64
12 median_2weeks 63968 non-null int64
13 price 56854 non-null float64
14 initialprice 56856 non-null float64
15 discount 56856 non-null float64
16 languages 56814 non-null object
17 genre 56653 non-null object
18 ccu 63968 non-null int64
19 tags 63968 non-null object
dtypes: float64(4), int64(9), object(7)
memory usage: 9.8+ MB
None
For the steam_app DataFrame, we can see that it has 66803 rows, but numerous columns have varying counts of non-null values. In the steam_spy DataFrame, we are working with 63968 rows, with many columns having a full count of non-null fields, or at least much higher counts. For the purposes of this article, let's pick a few fields from either table to work with and correct. Different data types call for different corrections, so we will select features (columns) with different data types and different solutions. Here are the fields we will be working with:
developers from the steam_app DataFrame (datatype: object)
steam_appid from the steam_app DataFrame (datatype: int64)
controller_support from steam_app DataFrame (datatype: object)
Most of these are missing values (null / NaN), and some are not stored as the best type of data and will need to be converted. Let's take a look at each one separately.
Data cleaning and preparation using pandas
.isna() & DataFrame.loc() in pandas
The developers field has 66495 non-null values, so exactly 308 records need correction. Let's take a closer look at those rows by grabbing the slice of data where this column is null (NaN). df.loc is an indexer that lets us easily grab slices of data, and boolean indexing works even without calling it explicitly. See the documentation on df.loc here. We will also use .isna(), which returns a boolean Series that is True for every null row. The .isna() documentation can be found here.
Here's how we generate a data slice of all rows where developers is null (NaN):
Input:
null_developers = steam_app[steam_app['developers'].isna()]
display(null_developers.head())
# see the companion notebook for the output
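For reference, the same slice written with an explicit .loc call, which also lets us pull only the columns we care about (the column pair here is just an example):

# equivalent .loc form: rows where developers is NaN, two columns only
null_developers = steam_app.loc[steam_app['developers'].isna(), ['name', 'developers']]
display(null_developers.head())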
At a quick glance, some of these rows appear to have null (NaN) values in other columns as well. This usually points to a problem with the source information for those rows, or they simply don't have data for those fields. In our case, this could be due to games being relatively obscure, dying in development, or awaiting future updates from a developer. The object datatype is fine to leave as is. Objects can be many things, but they are most often text or string values describing some aspect of the row; in this case, developers should tell us the name of the group developing the game.
DataFrame.fillna() in pandas
We may want to consider more detailed approaches to cleaning an entire dataset in a case like this, where some rows have a significant number of null values: such rows can throw off or skew results for the whole dataset, especially once we start calculating descriptive statistics, but I'll leave that for a future article. A NaN or null value in the developers field most likely means the developer is unknown, so it is reasonable to simply update all null values to 'Unknown'. Here is how we can easily do that in Python, using .fillna():
Input:
steam_app['developers'] = steam_app['developers'].fillna('Unknown')
print(steam_app.info())
Output:
RangeIndex: 66803 entries, 0 to 66802
Data columns (total 39 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 type 66655 non-null object
1 name 66803 non-null object
2 steam_appid 66803 non-null int64
3 required_age 66655 non-null object
4 is_free 66655 non-null object
5 controller_support 14422 non-null object
6 dlc 9342 non-null object
7 detailed_description 66587 non-null object
8 about_the_game 66585 non-null object
9 short_description 66593 non-null object
10 fullgame 0 non-null float64
11 supported_languages 66599 non-null object
12 header_image 66655 non-null object
13 website 36119 non-null object
14 pc_requirements 66655 non-null object
15 mac_requirements 66655 non-null object
16 linux_requirements 66655 non-null object
17 legal_notice 19502 non-null object
18 drm_notice 278 non-null object
19 ext_user_account_notice 1124 non-null object
20 developers 66803 non-null object
21 publishers 66655 non-null object
22 demos 6009 non-null object
23 price_overview 47350 non-null object
24 packages 47963 non-null object
25 package_groups 66655 non-null object
26 platforms 66655 non-null object
27 metacritic 3726 non-null object
28 reviews 8872 non-null object
29 categories 65699 non-null object
30 genres 66526 non-null object
31 screenshots 66533 non-null object
32 movies 61673 non-null object
33 recommendations 11674 non-null object
34 achievements 30294 non-null object
35 release_date 66655 non-null object
36 support_info 66655 non-null object
37 background 66547 non-null object
38 content_descriptors 66655 non-null object
dtypes: float64(1), int64(1), object(37)
memory usage: 19.9+ MB
None
We can now see that developers has 66803 non-null values, so we have successfully completed this task.
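Rather than scanning the whole .info() printout, you can also verify a single column directly; a quick sanity check:

# count the remaining NaN values in developers (should now print 0)
print(steam_app['developers'].isna().sum())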
DataFrame.astype() in pandas
Our next field to look at is steam_appid. Steam uses a unique numerical code for each game on its platform. We also know that this field has 66803 non-null values. The problem with this field is not null values, but rather that they are stored in this DataFrame as the int64 data type.
This is a common issue when read_csv() scans a data import: it interprets any column containing only numerical data as a numerical data type. These codes serve as unique identifiers for Steam games; we don't need to do any calculations with them, and even if we did, the results wouldn't provide any helpful information.
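Following the fix-it-at-the-origin advice from earlier, one option is to declare the type at import time using read_csv's dtype parameter; this sketch re-reads the same local file from above:

# prevent the misread at import time by declaring the column's type
steam_app = pd.DataFrame(pd.read_csv(
    "/content/drive/MyDrive/Data Business/Blog/Notebooks for Articles/Datasets/steam_app_data.csv",
    dtype={"steam_appid": str},
))

Since our data is already loaded, though, we can convert the column in place instead.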
Converting to an object type ensures the column is not mistakenly used for anything other than identifying specific games and tying data from other tables (like the steam_spy DataFrame) to a game. Here is how you can change a column's datatype in pandas using .astype():
Input:
steam_app['steam_appid'] = steam_app['steam_appid'].astype(str)
print(steam_app.info())
Output:
RangeIndex: 66803 entries, 0 to 66802
Data columns (total 39 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 type 66655 non-null object
1 name 66803 non-null object
2 steam_appid 66803 non-null object
3 required_age 66655 non-null object
4 is_free 66655 non-null object
5 controller_support 14422 non-null object
6 dlc 9342 non-null object
7 detailed_description 66587 non-null object
8 about_the_game 66585 non-null object
9 short_description 66593 non-null object
10 fullgame 0 non-null float64
11 supported_languages 66599 non-null object
12 header_image 66655 non-null object
13 website 36119 non-null object
14 pc_requirements 66655 non-null object
15 mac_requirements 66655 non-null object
16 linux_requirements 66655 non-null object
17 legal_notice 19502 non-null object
18 drm_notice 278 non-null object
19 ext_user_account_notice 1124 non-null object
20 developers 66803 non-null object
21 publishers 66655 non-null object
22 demos 6009 non-null object
23 price_overview 47350 non-null object
24 packages 47963 non-null object
25 package_groups 66655 non-null object
26 platforms 66655 non-null object
27 metacritic 3726 non-null object
28 reviews 8872 non-null object
29 categories 65699 non-null object
30 genres 66526 non-null object
31 screenshots 66533 non-null object
32 movies 61673 non-null object
33 recommendations 11674 non-null object
34 achievements 30294 non-null object
35 release_date 66655 non-null object
36 support_info 66655 non-null object
37 background 66547 non-null object
38 content_descriptors 66655 non-null object
dtypes: float64(1), object(38)
memory usage: 19.9+ MB
None
.astype(str) converts the values of the preceding data (in our case steam_app['steam_appid']) to a string or text type, called 'object' in pandas. steam_app.info() shows that we have successfully converted the datatype. On to controller_support.
Let's take a closer look at the slice of data where controller_support is not null. We can grab this slice with the following code:
display(steam_app[~steam_app['controller_support'].isnull()])
see the output in the companion notebook
In cases where we just want to look at a data slice, we can simplify the process from the developers section above by printing the data directly instead of storing it first. We also use the tilde ('~') symbol in this line of code to indicate "not", as in: return the rows from steam_app where steam_app['controller_support'].isnull() is NOT true (an equivalent would be .notna()). Take a look at this Stack Overflow post for a bit more information.
We can see from this data slice that rows without null (NaN) values appear to have 'full' in their record. We can reasonably assume one of two things: either NaN means there is no controller support, OR it is unknown whether a game supports a controller. Cases like this can be complicated to resolve. If we store 'none' in the field for all null values, we might be incorrect on some records. Keep in mind that NaN values can arise simply from not having data, and not having data does not mean the correct data doesn't exist somewhere else.
This is a situation where it is best to consult with other people to better understand the source of the error, the impact of the various replacement choices, and any other factors before proceeding. It is quite possible, for instance, that controllers weren't supported on some early Steam games, or that there are other conditions that would let us be more specific in our corrections.
One of the most important steps in data cleaning is making sure you have enough information to make the decision on what to do – and who to ask if you need more information. How would you correct the NaN values in controller_support? Would you fill it with ‘none’ or ‘unknown’? What best reflects what we know? What questions could we ask that might help us get more clarity?
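Whichever answer you settle on, the mechanics are the same .fillna() call we used for developers. For example, if 'unknown' turned out to be the honest choice:

# one possible resolution: mark missing controller_support as unknown
steam_app['controller_support'] = steam_app['controller_support'].fillna('unknown')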
Data Exports using pandas
pandas can be used to export data in various ways. I use Google Colaboratory, so I can push exports directly into a Drive folder, but if you are running locally via Anaconda or another platform, all you need to do is enter the file path you want to export the file to. I'll cover more Google integrations in the future, as I have a number of scripts that use libraries connecting to the Google API.
DataFrame.to_csv() in pandas
Exporting data with pandas is relatively simple using pandas.DataFrame.to_csv. This function exports a DataFrame as comma-separated values (a CSV file), a standard format that most data systems, tools, and applications can easily import. It also has numerous parameters to customize your export: you can define the separator, change the column names, and even define what to store in fields with missing data. Here's an example from our companion Colab notebook:
iris_final.to_csv('/content/drive/MyDrive/Data Business/Blog/Notebooks for Articles/Exports/iris_final.csv')
This creates a file of the same name as the DataFrame (because I set it in the file path) inside my Google Drive.
We can also do something more complicated. Here's what it looks like when we drop the index from the export and store the output as a .zip file to compress the file size for sending:
compression_options = dict(method='zip', archive_name='steam_app.csv') # defines some custom parameters to pass to the to_csv compression parameter
steam_app.to_csv('/content/drive/MyDrive/Data Business/Blog/Notebooks for Articles/Exports/steam_app.zip', index=False, compression=compression_options)
This process may take a few moments to run depending on the size of your DataFrame; I ran it on the larger steam_app dataset we created earlier. Let's say we didn't want to leave all the NaN values as blanks. By default, to_csv converts every NaN value in a DataFrame to "" (blank). In some cases, we might want to populate these with something more useful. For instance, if we wanted all NaN values to instead show "No Data", we can set that value on the na_rep parameter of .to_csv():
steam_app.to_csv('/content/drive/MyDrive/Data Business/Blog/Notebooks for Articles/Exports/steam_app.zip', index=False, na_rep="No Data", compression=compression_options)
We can also select the specific columns we want to export using the columns parameter of the to_csv() function. This is useful when we are only looking for data in specific columns and need to cut down the file size of the dataset.
steam_app.to_csv('/content/drive/MyDrive/Data Business/Blog/Notebooks for Articles/Exports/steam_app_columns.zip',columns=['type', 'name', 'steam_appid', 'required_age', 'is_free','controller_support', 'dlc'], index=False, na_rep="No Data", compression=compression_options)
It is important when handling data to understand the implications of your dataset's size. A larger file or dataset is harder to share and can lead to longer run times when executing commands. That said, sometimes we need large sets. The trick is to be mindful about what is in the dataset: when it contains data we don't need, we should simply not include it. If it doesn't have value to our audience, it muddies whatever we are trying to communicate and slows down our process of reaching conclusions. The three steam_app exports above produce noticeably different .zip file sizes, with the column-filtered export the smallest.
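You can check that trade-off yourself by comparing the exported file sizes in code; a sketch using Python's standard library and the export paths from above:

import os

exports = [
    '/content/drive/MyDrive/Data Business/Blog/Notebooks for Articles/Exports/steam_app.zip',
    '/content/drive/MyDrive/Data Business/Blog/Notebooks for Articles/Exports/steam_app_columns.zip',
]
for path in exports:
    # report each export's size in megabytes
    print(path, round(os.path.getsize(path) / 1_000_000, 2), "MB")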
Use pandas to clean and control your data
pandas is an incredibly powerful library for importing, manipulating, and exporting data. DataFrames are a powerful tool, and later on I will show you how to integrate DataFrames with numerous other libraries and even machine learning algorithms. pandas is a core library for all modern data work. If you want to learn more, check out the pandas documentation. You can also learn a lot by signing up for DataCamp, or follow in my footsteps and take a course with Practicum by Yandex or Pathstream. Yandex offers more robust courses, but both are great ways to learn more about Python. Check out the links below:
Sign up for DataCamp today – https://www.datacamp.com/promo/build-the-future-2022 – Get 53% off for your first year
Practicum by Yandex – https://practicum.yandex.com/
Pathstream – https://www.pathstream.com
Sign up for our Mailing List for updates and new posts sent to your inbox
Check out other posts on Python Unbound