As data scientists or gig workers, when we market our services we often think only about uses of Python that involve traditional data. However, there is a lot of opportunity in working with text, a field known as Natural Language Processing (NLP).
Also, when you tell clients that you will use Artificial Intelligence methods on their text or data, the wow factor generally excites them, and then you have them hooked. It also helps you charge more for the job.
A while back I had the good fortune of being contacted by a medium-sized company. They had a list of about 100,000 questions that they wanted to classify into 25 different pre-defined categories. They had already classified about 50,000 previous questions, and they expected the volume to keep growing. The backlog existed because of competing priorities and because the work was dull, which made it difficult to get people to dedicate time to it. Nonetheless, it was important to their business because they wanted to understand their customers' questions in order to market to them effectively.
This request lit a light bulb in my mind. First, I asked them how long it had taken to classify the 50,000 questions they had already finished. They said that they could do about 100 to 200 per hour depending on the nature of the questions. I thought this might be an exaggeration given the number of categories, but I decided to give them the benefit of the doubt and go with the high number of 200.
Given this estimate, the time required to process the 100,000 questions would be about 500 hours. Next, I asked them the average pay for the people classifying the questions. They told me about $25 per hour, since a classifier had to have a little knowledge of their line of work to classify the questions correctly. So, the total cost to manually classify these questions was $25 x 500 = $12,500.
So, I told them that using AI, I could build them a tool that would classify their questions, and that the solution would mean they would not have to pay the manual classifiers or me again. I gave them a fair offer of $10,000, which they accepted. Surprisingly, they did not make a counteroffer. I asked them for a 20% down payment to get started and the remainder on delivery.
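The estimate above is simple enough to check in a few lines. This is only a sketch of the arithmetic using the figures quoted in the text; the function name is made up for illustration. It also shows what the bill would have been at the slower rate of 100 questions per hour.

```python
# Figures quoted in the text above.
QUESTIONS = 100_000
HOURLY_PAY = 25  # dollars per hour


def manual_cost(questions, rate_per_hour, hourly_pay):
    """Return (hours, dollars) needed to classify `questions` by hand."""
    hours = questions / rate_per_hour
    return hours, hours * hourly_pay


# The client quoted 100 to 200 questions per hour; price both ends of the range.
for rate in (100, 200):
    hours, dollars = manual_cost(QUESTIONS, rate, HOURLY_PAY)
    print(f"{rate} questions/hour: {hours:,.0f} hours, ${dollars:,.0f}")
```

Going with the high rate of 200 per hour gives the lowest manual cost, so it is the conservative number to quote against.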
Now, let’s talk about how I accomplished this task in a few days to quickly earn $10,000 from this particular customer.
Fortunately, they had stored all the questions in Excel files. I told them that I needed the 50,000 questions that they had already classified and the new set of 100,000. I then created 2 folders on my computer, one for the new questions and one for the old questions.
The main packages that I used to deliver the final product to this customer were as follows: pandas, NumPy, glob, NLTK, and scikit-learn.
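With the Excel files sorted into two folders, loading them could look roughly like the sketch below. The folder names, the column names, and the `load_questions` helper are assumptions for illustration, not the actual layout of the customer's files. Reading `.xlsx` files with pandas also assumes the `openpyxl` package is installed.

```python
import glob
import os

import pandas as pd


def load_questions(folder, columns=("question", "category")):
    """Read every .xlsx file in `folder` into a single DataFrame."""
    paths = sorted(glob.glob(os.path.join(folder, "*.xlsx")))
    if not paths:
        # pd.concat raises on an empty list, so return an empty frame instead.
        return pd.DataFrame(columns=list(columns))
    frames = [pd.read_excel(path) for path in paths]
    return pd.concat(frames, ignore_index=True)


# Hypothetical folder names; point these at wherever the files actually live.
# classified = load_questions("old_questions")
# unclassified = load_questions("new_questions")
```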
I used the 50,000 already-classified questions to build the model. First, I loaded the 50,000 questions along with their categories into a pandas DataFrame. Next, I had to prepare the data for processing. The cleaning process involved changing all questions to lower case, stripping punctuation, and stripping extra spaces in the questions, along with a few other nuances.
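The three cleaning steps named above (lower case, punctuation, extra spaces) can be sketched as follows. The `clean_question` helper, the column names, and the two sample questions are made up for illustration, and the "other nuances" are not covered here.

```python
import re
import string

import pandas as pd

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_question(text):
    """Lower-case a question, strip punctuation, and collapse extra spaces."""
    text = str(text).lower()
    text = text.translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample rows standing in for the real questions.
df = pd.DataFrame(
    {"question": ["  How do I RESET my password?? ", "Where's my   order?"]}
)
df["clean"] = df["question"].map(clean_question)
print(df["clean"].tolist())
# ['how do i reset my password', 'wheres my order']
```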
I also removed stopwords such as a, to, because, and, but, etc. Additionally, I used text lemmatization. Lemmatization is used to reduce text redundancy by converting words having the same meaning but different inflected forms to their base form.
Please see an example of lemmatization below:
After all this, I split the data into training and test datasets.
Accuracy on the test and training sets was 80.6% and 81% respectively, which is not bad. The recall was around 91%.
I then tested the performance of the model against 10% of the data or 500 questions, which I had held out to test the true performance of the model. I achieved 92.96% for this test, a truly amazing result.
My final step was to tweak the Machine Learning model so that they could simply input a list of questions and the model would output that list with a classification for each question. This took about an hour, as I wanted it to be simple and user friendly.
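The text does not say which model was used, so the sketch below is only one plausible setup with scikit-learn: TF-IDF features feeding a logistic regression classifier, both of which are my assumption. The twelve questions and three categories are made up so the example runs on its own. On data this small the scores mean nothing; the point is the shape of the workflow.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up toy data: 12 cleaned questions in 3 hypothetical categories.
questions = [
    "how do i get a refund", "why was my card charged twice",
    "where is my invoice", "can i change my payment method",
    "how do i reset my password", "i cannot log in to my account",
    "how do i change my email address", "my account is locked",
    "where is my order", "how long does delivery take",
    "can i change my shipping address", "my package arrived damaged",
]
labels = ["billing"] * 4 + ["account"] * 4 + ["shipping"] * 4

# Hold out 25% for testing, keeping the category mix the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    questions, labels, test_size=0.25, stratify=labels, random_state=0
)

# Assumed model: TF-IDF features into logistic regression.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_pred = model.predict(X_test)
test_acc = accuracy_score(y_test, test_pred)
test_recall = recall_score(y_test, test_pred, average="macro")

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
print(f"test recall:    {test_recall:.3f}")
```

Comparing training and test accuracy side by side, as above, is a quick check for overfitting: a large gap between the two would be a warning sign.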
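A wrapper of this kind could look something like the sketch below. The `classify_questions` function is hypothetical, and the tiny stand-in model (TF-IDF plus logistic regression, trained on made-up questions) is there only so the example runs by itself. Any fitted scikit-learn text pipeline with a `predict` method would slot in the same way.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def classify_questions(model, questions):
    """Return a DataFrame pairing each question with its predicted category."""
    questions = list(questions)
    return pd.DataFrame(
        {"question": questions, "category": model.predict(questions)}
    )


# Stand-in model trained on made-up data, for illustration only.
train_questions = [
    "how do i get a refund", "why was my card charged twice",
    "how do i reset my password", "i cannot log in to my account",
]
train_labels = ["billing", "billing", "account", "account"]
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_questions, train_labels)

result = classify_questions(
    model, ["can i get a refund", "my password does not work"]
)
print(result)
```

Returning a DataFrame keeps the output easy to save with `result.to_csv(...)` or `result.to_excel(...)`, whichever the customer prefers.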
Finally, I documented my work and shared the script with the customer along with instructions on how to use it.
They tested it within the next week and then wired the $8,000 balance into my account. I have a happy customer who I continue to work with now and then.
In conclusion, we need to think beyond numbers when we think about the power of Python!
Thank you for reading.