Become a Powerful and Promising Coder with Python

Origins

Python was released in 1991 after being developed by Guido van Rossum since the late 1980s. It has gone through multiple iterations since then, but has always been built around the core design principles of accessibility and readability. Python minimizes the need for extra characters in code and maximizes readability by using common words to execute commands. That high level of accessibility makes it one of the most powerful languages today.

Python is an incredibly user-friendly language. Commands in Python are written in relatively easy-to-understand language, with minimal punctuation, and can be run without needing to be compiled into a program first. You can write Python code on almost any computer, from your phone to the most powerful cloud servers. It also has built-in systems to help coders identify and correct errors, making it all around one of the easier languages to learn. Python is second only to Java in terms of community size, with over 9 million active developers. [https://content.techgig.com/10-programming-languages-with-the-biggest-community-support/articleshow/78908491.cms]

Python is expansive, accessible, and becoming more powerful every day. It provides the backbone of many modern systems and applications in the fields of Big Data and machine learning, as well as applications at Google, NASA, and many other high-tech firms. But what can we really do with Python? Why has it been one of the fastest growing coding languages of the past decade? And how can you get started learning Python? Let's take a look at answers to some of these questions.

What Can Python Do?

The better question is what can't Python do, but let's look at three specific use cases that make Python a really powerful platform for modern applications and needs – Big Data, Computer Vision, and Machine Learning.

Big Data

Data is a ubiquitous part of our daily lives, from which sites we visit, to our behaviors on social media, to what we buy at the store. All of this is being leveraged to tailor advertisements and enticements to get us to buy more products. There are ways to maintain your anonymity online and when you make purchases, but as computers get more powerful, more companies are building massive data systems (called data warehouses or data lakes) to further predict and target user or customer behavior. There are numerous reasons for this – to develop or improve an advertising campaign, evaluate the effectiveness of projects, or even identify where there might be problems in a system.

Data collected in large amounts tends to tell or suggest a story or pattern about human behavior in relation to whatever the data describes. Facebook feeds you ads based on things you click on or react to. Developers of various apps can perform sentiment analysis on comments and reviews, giving them an idea of how users feel about the application and where they may need to make improvements. Even governments are using 'Big Data' systems to introduce efficiencies and improve methods of data collection, moving further and further away from paper-based systems.

Python is an incredibly powerful coding language for data collection and analysis. One of the things that makes Python so powerful is that it is relatively easy to bring in new libraries depending upon your task. The pandas library is built specifically for handling data. pandas introduces the concept of DataFrames, which make it easy to arrange large sets of data in order to clean, analyze, and report on it. A DataFrame is structured much like a spreadsheet, but Python can handle thousands upon thousands of rows, while Excel and Google Sheets have much smaller limits. Here's a screenshot of a DataFrame from one of my recent Python projects:
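To give a feel for the spreadsheet-like workflow, here is a minimal sketch of building and querying a pandas DataFrame – the columns and values are invented purely for illustration:

```python
import pandas as pd

# Hypothetical purchase records, arranged like a small spreadsheet
data = {
    "user_id": [101, 102, 103, 104],
    "item": ["book", "pen", "book", "laptop"],
    "price": [12.99, 1.50, 8.75, 899.00],
}
df = pd.DataFrame(data)

# Filter rows and aggregate, much like you would in Excel or Sheets
books = df[df["item"] == "book"]                   # only book purchases
total_by_item = df.groupby("item")["price"].sum()  # total spend per item

print(books)
print(total_by_item)
```

The same two lines of filtering and grouping work just as well on a DataFrame with millions of rows.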

There are numerous other Python libraries – numpy, seaborn, matplotlib, and more – that can manipulate data in a DataFrame, correct errors, create graphs and charts, and so on. If you have a medium-to-large set of data to work with, Python is what you should be working in to get the job done.
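As a small sketch of that error-correcting side of the toolkit – the readings and the 30-degree cutoff here are made up for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with a gap and an obvious outlier
readings = pd.Series([20.1, 19.8, np.nan, 21.0, 500.0, 20.4])

# Fill the missing value with the median, then cap the outlier
cleaned = readings.fillna(readings.median())
cleaned = cleaned.clip(upper=30.0)

print(cleaned.describe())  # quick statistical summary, ready for charting
```

From here, a single call to seaborn or matplotlib would turn the cleaned series into a plot.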

Computer Vision – Identifying objects in an image

One of the most interesting and innovative aspects of Python is its ability to translate images into manipulatable data. This data can be used to train machine learning models to find and identify objects in an image. There are specific tools built for Python that help coders build these models, and some that are pretrained to identify specific common objects.

I recently worked on a project to help improve a machine learning model designed to identify book spines in images of bookshelves. The goal the team had in mind was to pull book names and authors – as well as any other information – from these book spines and combine that with geolocation data to build a database of book collections in public places. This process identifies segments of an image that may contain useful information, based on the inputs the coder provides.

This is called Computer Vision – quite literally, a computer analyzing layers of an image to "see" what is in it. This doesn't work at all like the human eye: the computer splits the whole image into layers using color masks (red, green, blue, etc.) and shadows/highlights, converts those into numbers, and analyzes the resulting numbers. This analysis produces predictions from a model, which are interpreted back into data that can overlay the image, as shown below, to identify areas of interest and objects. Computer Vision is what sits behind facial recognition and other similar applications. In the example project I mentioned above, we used it to identify book spines in an image, from which we could run optical character recognition to extract book titles, but there are many other applications it can be used for.

There are Python libraries built specifically for this work, including scikit-image, PyTorch, OpenCV, and more. This image is a great example of what these models can do: identifying multiple different objects in the same image, returning bounding boxes for those objects, and producing a color mask for the area each object most likely occupies.
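To make the "layers into numbers" idea concrete, here is a tiny sketch using only NumPy – the 2x2 "image" and the red-dominance threshold are invented, and real libraries like OpenCV do this far more robustly:

```python
import numpy as np

# A tiny synthetic 2x2 RGB "image": two bright-red pixels, two dark ones
image = np.array(
    [[[255, 10, 10], [20, 20, 20]],
     [[30, 30, 30], [240, 5, 5]]],
    dtype=np.uint8,
)

# Split the image into its color layers, as a vision pipeline would
red, green, blue = image[..., 0], image[..., 1], image[..., 2]

# A crude "mask": True wherever red strongly dominates green
mask = (red.astype(int) - green.astype(int)) > 100
print(mask)
```

A real model analyzes thousands of these numeric layers at once, but the principle – color channels turned into numbers, numbers turned into a mask – is the same.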

Another Python package/library that I've used and liked, Detectron2, is demonstrated below with these images identifying balloons:

Here are a couple of images from my book spine project:

For more information and examples on Computer Vision, check out these articles from Towards Data Science and Kaggle.

https://towardsdatascience.com/image-segmentation-with-six-lines-0f-code-acb870a462e8

https://www.kaggle.com/vbookshelf/basics-of-detectron2-balloon-detection

Machine Learning

Now, I have mentioned machine learning and models before, but what is it really? Basically, with larger and larger sets of data, it becomes easier and easier to create and train algorithms that become increasingly accurate. Not only that, but these models can also be tweaked with various parameters to tune them however we'd like, producing better results based on whatever statistical metric we choose. In many situations, we are given features, or characteristics, of a record along with the specific result we want to develop a model for. Let me explain this in slightly less technical terms.

Let's say you have a massive database of summary data for all cell phone plan users with a specific provider, for each month going back a year. Some users may use more minutes on voice calls than others, some may primarily browse the web and access social media on their phones, and some users might be texting fiends. You also likely have data on users who cancel their service or transfer to another provider. Now, it is near impossible to look through an entire database of months of data for thousands of users and decipher patterns manually. What we can do, however, is take something like the length of contract for a user, i.e., how long they have been (or were) with the company, and use numerous tools in Python to identify usage patterns that correlate with canceling service. We can tune these models to find the settings that work best for our data set. Given all of this, we can then provide the model with the current month's data – and it can be a fairly strong predictor of which clients might be considering moving service! This would let a company extend deals to these customers, or offer alterations to their packages, that might keep them with the company instead of seeing them leave – an important task in keeping a business running.
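The workflow above can be sketched in a few lines – everything here is invented for illustration: the features, the toy churn rule used to label the fake data, and the choice of logistic regression as the model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Hypothetical monthly usage features: [voice_minutes, data_gb, texts]
n = 200
X = rng.normal(loc=[300, 5, 400], scale=[80, 2, 120], size=(n, 3))

# Toy labeling rule just to generate data: light users tend to churn
churned = (X[:, 0] + 40 * X[:, 1] < 420).astype(int)

# Train a simple model on the historical records
model = LogisticRegression(max_iter=1000).fit(X, churned)

# Score this month's data for two customers: one light user, one heavy
this_month = np.array([[120, 1.0, 50], [450, 8.0, 600]])
probs = model.predict_proba(this_month)[:, 1]
print(probs)  # estimated probability that each customer churns
```

The light user comes back with a much higher churn probability, which is exactly the signal a retention team would act on.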

Here are some examples from a project I recently ran on data very similar to what I described above. Here's some of the sample data:

In this project, we measured success by looking at how accurate the predictions were and at the area under the probability curve. ROC stands for Receiver Operating Characteristic: a plot of the true positive (accurate prediction) rate against the false positive (incorrect positive) rate. AUC stands for Area Under Curve and simply quantifies the area under this curve.
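Both metrics are one call each in scikit-learn. A minimal sketch, with labels and scores invented for the example (1 = the customer churned):

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical true outcomes and the model's predicted churn scores
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9]

auc = roc_auc_score(y_true, y_score)      # area under the ROC curve

# Plain accuracy requires picking a cutoff; 0.5 is the usual default
y_pred = [1 if s >= 0.5 else 0 for s in y_score]
acc = accuracy_score(y_true, y_pred)

print(auc, acc)
```

Note that AUC looks at the scores themselves, while accuracy depends on where you set the cutoff – which is why the two metrics can rank models differently.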

In this case, the CatBoost model not only had the highest AUC-ROC score, but also the highest raw accuracy. You can also see that the better-quality model took slightly longer to run, which is often a factor when an organization decides which model to use – running a model to predict on new data can use significant computational resources.

Want to learn more about machine learning? Check out the various posts on this blog for a whole variety of resources and info: https://www.datascienceblog.net/categories/machine-learning/

Why is Python growing so fast?

This is possibly the simplest question to answer. Python is growing for one simple core reason – it is easy to read, understand, and learn. Van Rossum's focus on building the language this way has made it accessible to more people, resulting in a larger community and more time that people can spend building complex libraries and algorithms, rather than getting caught up in complicated code – or syntax. Here's a quick sample of some Python code:

# Match the removal list's index to the main table's row identifiers
img_cat_to_remove = img_cat_to_remove.set_index('image_id')
img_cat_to_remove.index.names = ['id']
display(img_cat_to_remove)

print(images.shape)
# Left join flags the records slated for removal
images = images.join(img_cat_to_remove, how='left')
display(images)
# print(images.shape)

In this snippet, I took a DataFrame I had created containing a set of image records that needed to be removed from the main DataFrame, set its index (think row identifiers) to match the image_id for each specific image, then joined it to the main table, allowing me to easily select and delete, or exclude, those records from future use of the data. We can name variables pretty much anything we want, and simple commands like 'print', 'display', or 'shape' let us see the data and/or info about it. Here's what that snippet created as output when run in that notebook:

If you want to learn more about Python's growth, check out this post over on plainenglish.io, which outlines much more than I'll lay out here: https://python.plainenglish.io/how-python-is-proving-to-be-a-turning-point-language-in-2021-463c9a8fa26e

How can I start learning Python today?

The wonderful thing about Python is that there are plenty of useful free tutorials and guides to get you started. If you are just looking to dip your toes in, check out the numerous videos over on YouTube, or try out this resource I found: https://www.freecodecamp.org/learn

If you are looking to go further, I'd definitely suggest creating accounts on GitHub – https://github.com/ – and Stack Overflow – https://stackoverflow.com/ – and seeing if DataCamp has a sale. In fact, if you head through this link, you'll get $20 off your first month, so you can try out their platform for only $9!: https://www.datacamp.com/join-me/ODExNDU5Nw== DataCamp has solid video tutorials and even offers some certificates that can be added to your LinkedIn profile.

Lastly, there are countless programs out there that you can enroll in that are more akin to traditional classes and instruction. Personally, I participated in the Data Science by Practicum beta program that started in the spring of 2020 – and I would encourage everyone to check this program out. The community was solid, the tutors were very helpful, and the project-based orientation of the course allowed the right amount of flexibility. It also reflected what work in the field would be like out in the real world. I've taken other courses run by programs in the US and simply didn't get the same level of support and challenge. You don't have to go through Yandex to get this, but there are many programs out there that simply aren't as good. Here's the link for Yandex: https://practicum.yandex.com/

And make sure to head over to this page to sign up for our mailing list. You'll get these articles sent to your email 🙂
https://pythonunbound.com/newsletter
