Build your own AI pipeline
Work through these easy, no-coding-experience-required articles and I’ll have you creating your first data science notebook within 15 minutes. Change AI from an intangible thing you’re kinda comfortable talking about into something tangible that you actually understand.
Everyone’s talking about AI. It is going to happen, and if you work for a SaaS, you will be implementing it in your application soon if you haven’t already! So I’ve put together a concise exercise that will take you from zero to hero in a few 30-minute sessions. No prior data science knowledge or coding experience is required.
Get Your Hands Dirty
You learn by doing. There’s just no alternative to this – you could go to hundreds of meetings about AI and rattle off all the right catchphrases, but still not really know much about it. So if you want to understand it – and what VP of Product, Product Designer, or Product Manager can do their job properly these days without understanding it? – you need to actually do some of it, and then things will click into place.
You’re Going to Build A Simple End-To-End AI Solution
We’re going to build our own mini AI pipeline, which covers all the value propositions Junction AI delivers with our JAI-a-a-S product (only ours is a lot more sophisticated):
In this article, we’ll cover INGEST. In follow-up articles, we’ll address TRANSFORM, TRAIN, PREDICT & LEARN.
What Problem Are We Solving?
We have a problem-solving approach to AI, and that’s why we always start with the problem. We are going to solve a pretty simple problem here so you can get a taste of every step in the process: sourcing the data, designing the model, training the model, deploying the model, surfacing the insights to users in a web application, and then collecting feedback so the model can learn.
For this article, I’ve put together a proof of concept that uses a simple AI implementation as a good teaching tool. In our process at Junction, we start every project with a little proof like this as it helps our customers – who are generally pretty new to AI – quickly get their heads around all the new concepts.
The problem we are solving is “How can we use data to help artists on Instagram choose photos that will appeal to their audience?”. The end result of the pipeline is something like this:
This is how the user (me) would see the results and feedback so the model can learn.
To keep things simple for this article, we’re going to look at a pretty small scope so this doesn’t take you too long to work through it. We’ll focus on color and its impact on post success (measured by likes). Our data science solution will help an Instagram user discover which colors in their photos appeal the most to their followers. Knowing this, they can then use those colors more in images and get more likes.
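To make “which colors appeal most” concrete, here is a minimal sketch of the kind of color metric we’ll build later in the series: given an image as an RGB pixel array, count the fraction of pixels that fall in a rough “pink” range. The thresholds below are illustrative guesses I’ve made up for this sketch, not the ones used later.

```python
import numpy as np

def pink_fraction(rgb):
    """Return the fraction of pixels that look roughly 'pink': high red,
    fairly high blue, and green well below red. The thresholds are
    illustrative guesses, not the ones used later in this series."""
    rgb = np.asarray(rgb, dtype=int)  # avoid uint8 wrap-around in the math
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    pink = (r > 180) & (b > 120) & (g < r - 40)
    return pink.mean()

# A tiny 2x2 "image": two pink pixels, one white, one black.
img = np.array([
    [[255, 105, 180], [255, 192, 203]],  # hot pink, light pink
    [[255, 255, 255], [0, 0, 0]],        # white, black
], dtype=np.uint8)
print(pink_fraction(img))  # 0.5
```

Half the pixels in the toy image are pink, so the score is 0.5 – a single number per image that we can later correlate with likes.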
We do lots of this type of modelling for the insights on JAI-a-a-S. Color is just one of the factors you can analyse with data science models – we also analyse concepts, sentiment, contrast, texture, and thousands of other factors for which open-source models are available. I’ll talk about this more in the modelling steps. What saves time for teams who work with us is that our solution includes the use case implementation, so you can get those results to the user in context.
Let’s get started with Step 1 INGEST
The first thing we need to do is get some data and save it in our data lake so we can access it. You can’t do anything without data, and you can source it from lots of different places: open datasets, scraped sources, APIs, or your own internal systems. For this exercise, we’re going to scrape data from an Instagram handle, because that’s pretty easy to do with a free Scrapehero account, and the data is something most people can relate to.
A) Get Data
Follow these steps to create the first building block in your AI Pipeline: a data source that will periodically scrape data.
1. Create a free account on scrapehero and verify your email address.
2. Log in, go to the Marketplace, and search for the Instagram Scraper.
3. Click on ADD THIS CRAWLER TO MY ACCOUNT
4. Paste in the URL for the Instagram handle you want to scrape, increase the posts to 25 (the limit for a free account), and click GATHER DATA. I am scraping my own account, but you can scrape any public account.
(If you are a data science beginner, I recommend you use my handle, because the example in this exercise is especially relevant for my account. I’m an artist, and my oil paintings contain lots of negative space in a single color – usually pink, but often orange. A lot of people tell me they love my pink paintings, while others insist the orange ones are their favorite – which is how I came up with the concept for this example: is pink better or not?)
5. When the scraper finishes, you can download the data in CSV format by clicking DOWNLOAD DATA and choosing CSV:
Just be aware that Scrapehero sometimes shows that it is still collecting data even after it has scraped 24 records… as soon as the DOWNLOAD DATA button appears, you are ready to go.
So now you have scraped some data!
B) Move it into your “Data Warehouse”
“Data warehouse” (or “data lake”, for raw files like ours) is just a fancy term for cheap file storage. We are going to use Dropbox as ours.
With a paid Scrapehero account, you can schedule your crawler so that it runs periodically and integrate it with Dropbox so that the CSV file will automatically be saved to a folder on your “data warehouse” – ie Dropbox.
You can’t do this with a free account, but if you do want to set something like this up for a proof of concept, you need to use the Advanced Mode on the crawler UI to add a schedule and integrate with Dropbox.
If/when you want to do this, log into your Dropbox (or create a free account) and create a folder called RAW-DATA. Then in Scrapehero go back to your crawler and click on ADVANCED MODE (top right of screen) and use the Schedules and Integrations tabs to set these up:
For our pipeline at JAI-a-a-S, where the customer uses scraped data, we scrape it via an API and move the files to our S3 bucket, which then triggers the next step in the pipeline. That’s basically what you have now – files are being scraped and dropped onto your pseudo-S3.
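The “a file lands, the next step fires” pattern can be sketched in a few lines of Python: poll a folder and hand any new CSV file to the next pipeline stage. This is a hypothetical stand-in for the storage-event triggers we’d use in a real pipeline – the function and parameter names here are made up for illustration.

```python
import os
import time

def watch_raw_data(folder, process, poll_seconds=60, max_polls=1):
    """Poll `folder` and call `process(path)` once for each new CSV file.
    A toy stand-in for an S3 event trigger; all names are illustrative."""
    seen = set()
    for _ in range(max_polls):
        for name in sorted(os.listdir(folder)):
            path = os.path.join(folder, name)
            if name.endswith('.csv') and path not in seen:
                seen.add(path)
                process(path)  # hand off to the next step (e.g. TRANSFORM)
        time.sleep(poll_seconds)
    return seen
```

In a production setup you would replace the polling loop with the storage service’s own notifications (such as S3 event triggers), but the shape of the hand-off is the same.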
Now that we have some data, we need to read it into our data science environment so we can do something with it. We’ll use a Jupyter Notebook as that environment. Jupyter is a UI that lets you run lots of different data science packages that can do stuff with data – like getting an image and converting it to a number that represents how much of a particular color it contains (which is what we need to do for our use case here).
You can install Jupyter locally with Anaconda, but to make this super easy, I’m using the online version of Jupyter notebooks provided by Colaboratory (awesome!), which you can access from here:
Sign up or sign in so you can save your notebook, then create a new notebook by clicking File > New Notebook.
You’ll then see a new blank notebook like this:
Your blank notebook contains a code CELL which allows you to do stuff with data – it’s a bit like Excel on steroids. Click the + Code button in the menu to add a new cell so you have 2 code cells available:
Before you do anything else, you need to get your CSV file from Scrapehero saved somewhere that Colab can access. The easiest way to do this is to give Colab permission to access your Google Drive account by selecting the Folder icon (in the left-hand nav) and then the Google Drive icon:
In Colab select the folder icon and then, when your options load, choose the Google Drive folder icon to allow Colab to access your files on Google Drive.
Now you need to save the posts.csv file in Google Drive so we can access it from inside Colab. Upload the file to the Colab Notebooks folder on your Google Drive (Google Drive created this when you gave it permission to access your drive).
Once you have the file saved to Google Drive, you’re ready to start coding! As you can see already, a lot of “data science” time is consumed by getting data to where you need it. If you have to move data for many models, it quickly becomes a time sink – another place where we save teams a lot of time.
Now let’s start coding by using the first cell in our notebook to read in our data. Select the first cell and paste in this code, which imports the Pandas package (a cool Python data analysis library, preinstalled in Colab) and uses its read_csv function to load your data into the notebook:
```python
# Get CSV from Scrapehero (Instagram feed for handle @meli_axford)
import pandas  # Python Data Analysis Library

# Change 'Colab Notebooks/Posts.csv' below to your file's location on Drive
df = pandas.read_csv('/content/drive/MyDrive/Colab Notebooks/Posts.csv')
df.head(25)
```
When you press run (the play button on the cell), you will see the data from your file in your notebook:
When your code cell runs without error, you’ll see the file contents printed for you in the workspace.
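Once the cell runs, it’s worth poking at the DataFrame in a second cell to see what you’ve got. The exact column names depend on the Scrapehero export, so the ones below (url, likes) and the sample rows are assumptions standing in for your real Posts.csv – swap in whatever df.columns shows you.

```python
import io
import pandas

# Stand-in for your scraped Posts.csv; real column names may differ.
csv_text = """url,likes
https://www.instagram.com/p/abc123/,120
https://www.instagram.com/p/def456/,87
"""
df = pandas.read_csv(io.StringIO(csv_text))

print(df.shape)            # (2, 2): rows x columns
print(list(df.columns))    # ['url', 'likes']
print(df['likes'].mean())  # 103.5
```

A quick look at shape, columns, and a simple aggregate like the mean of likes tells you the data loaded the way you expected before you build anything on top of it.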
I’ve tested this article with everyone on our team – both developers and our marketing team – and these are some of the most common mistakes that got people stuck at this step:
1. Not connected to Google Drive account
If you haven’t mounted your Google Drive account, the pandas.read_csv command won’t be able to find the file, because it’s looking in the Colab Notebooks folder on the mounted Google Drive. Just mount the drive (via the folder icon, or by running from google.colab import drive; drive.mount('/content/drive') in a cell) to fix this issue.
2. File wasn’t called Posts.csv (case sensitive)
To demonstrate this error, I edited my code to use a lower-case file name (posts.csv), so it fails – the file name must match exactly. If your file is called Posts2.csv, or any other name, just edit the command to the correct file name and run it again to fix this error.
3. Cell won’t run / stuck on red play error
Sometimes this happens – just choose Restart and run all from the Runtime menu in Colab to fix it. This stops the runtime process and starts executing the cells again from the top.
So far, we have just read in the raw data, which means we have completed the INGEST step of our pipeline. In the next article, we’ll carry out the TRANSFORM process. Before you go, rename your notebook so you can find it again for the next article:
In the next articles I’ll cover:
PART 2 TRANSFORM: We’ll get all the images from their URLs (which is the raw data we have for them) and convert them into a number which represents how much “pink” is in each one. We’ll also cover how to automate the transform process.
PART 3 TRAIN: We’ll design a simple model that can predict how many likes a post will get based on the amount of “pink” in the post image. I’ll also cover how to automate the train process.
PART 4 PREDICT: How do you get the valuable insights derived from the model in front of the people who need them when they need them? That’s something we specialize in at Junction AI!
PART 5 LEARN: Everything needs to be iterated on in order to improve and remain relevant and in part 5 we’ll look at how that would work with this simple example.
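As a tiny preview of the TRAIN step in Part 3, the simplest possible “model” is a straight-line fit of likes against the pink score. The numbers below are made up purely to show the shape of the code, not real results from my account.

```python
import numpy as np

# Made-up training data: pink fraction of each image vs. likes it got.
pink = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
likes = np.array([40, 60, 80, 100, 120])

# Fit likes = slope * pink + intercept with a least-squares line.
slope, intercept = np.polyfit(pink, likes, 1)

# Predict likes for a new image that is 60% pink.
predicted = slope * 0.6 + intercept
print(round(predicted))  # 90
```

The real articles will go deeper, but even this two-line fit captures the core idea: learn a relationship from past posts, then use it to score a new image before you post it.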
Meli Axford is CPO at Junction AI
Connect with me on Linkedin: https://www.linkedin.com/in/meliaxford
Follow me on Instagram: https://www.instagram.com/meli_axford
or Medium: https://medium.com/@Meliaxford