Heroku is a powerful tool to supplement your value as a developer. In this article, I’m going to discuss how to combine Heroku and Salesforce to collect valuable information on the Internet for free. I’ve been dabbling in Python and web scraping for a few months now and felt comfortable meshing multiple systems together into an automation symphony. We’re going to pull a list of countries from a site to keep this simple. However, if you want to scrape more complex sites, there are plenty of Selenium WebDriver Python tutorials out there.
Before I dive too deep, I want to clarify a few things. Python is a high-level programming language often used to teach new computer scientists, write artificial intelligence, or perform tasks like web scraping. It’s one of the simpler languages to pick up, yet it can do about as much as any other language. I usually write a third as many lines in Python as I do in other languages, including Apex.
Web scraping (also known as “spidering”) is a method of programmatically retrieving information from websites to use for your own purposes. Instead of manually going to a website like a stock market page to find information, wouldn’t it be nice if your computer just downloaded that information into an Excel spreadsheet for you every day? What if that downloaded info could be converted into datapoints that allow you to make quicker decisions about your business? Those are the kinds of problems web scraping attempts to solve.
A word of caution: not every site likes to be scraped, and some may even take legal action if you violate their rules. Every site has different rules, so it’s a good idea to review its robots.txt file to make sure you’re behaving in an acceptable way. The robots.txt file tells you which pages can be scraped and, in some cases, how often. If the website has an API, use that instead. There are also debates on the legality of scraping. Developers sometimes refer to it as “jaywalking on the Internet.” But I always come back to the fact that products like Google’s search engine rely entirely on their ability to effectively scrape websites. You should be fine if you’re respectful about how often you scrape and what time of day you scrape, and if you only scrape whitelisted pages. For instance, querying a page only once every 10 seconds during non-peak hours (e.g. midnight) is often an accepted method.
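To make the robots.txt check concrete, here is a minimal sketch using Python’s standard-library robotparser. The robots.txt content, the “my-scraper” user agent, and the example.com URLs are all made up for illustration; against a real site you would point the parser at that site’s own robots.txt instead of a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline so the sketch runs
# without any network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable parser object."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "my-scraper" is an assumed user-agent name for this example.
allowed = parser.can_fetch("my-scraper", "https://example.com/countries")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/data")

# Seconds to wait between requests, or None if the file does not say.
delay = parser.crawl_delay("my-scraper")

print(allowed, blocked, delay)
```

If the delay comes back as a number, sleeping for that many seconds between page loads (for example with time.sleep) is a simple way to stay within the site’s stated limits.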
As with any article I’ve written, I’ll discuss why you would want to do something like this in your role as a Salesforce developer. Salesforce is a complex and powerful system, but it can’t do everything; nor should it. There are things other languages do beautifully that just can’t be done easily in Salesforce. Since Salesforce acquired Heroku around 2010-2011, developers can now incorporate more advanced techniques and logic that are generally available in open-source languages. For instance, you could pull in project opportunities from RFP websites automatically, parse contact information from websites as a Lead generation tool, and so much more. This project is a nice way to add value to your work without incurring any (or hardly any) costs.
Before you can code any of this, you’ll need the right tools. Below are the items I recommend for a straightforward experience.
- IDE: PyCharm (choose the Community option – it’s free for learning purposes)
- Language Interpreter: Python
- Command Line interface: Node.js
- ChromeDriver (you’ll use this to programmatically control Chrome for local testing – make sure it matches your current Chrome version)
- Heroku – Python Getting Started (do all of this to ensure it’s all working. You won’t use the web app part but it’s good to understand)
We’ll use a sandbox to store this information. All that needs to be done is to create an Object called “Countries.” You’ll need to go to Setup -> Object Manager -> Create.
The first thing is to configure the Heroku application. We’ll have to do the following.
- Create a Heroku App
- Add Python buildpack
- Add Headless Google Chrome buildpack
- Add ChromeDriver buildpack
- Add environment variables
- Install Postgres Add-on
- Install Heroku Connect Add-on
Create a Heroku App
First, we’ll need to make an app. When logged into the Heroku website, click on “New” -> “Create new app.”
You can choose any name as long as it’s unique across all of Heroku.
Add Python Buildpack
Click on the Settings tab.
Scroll down to “Buildpacks” and click the “Add buildpack” button.
Click on the “Python” option and “Save Changes.”
Repeat the process to add buildpacks for the following URLs
Scroll up to “Config Vars” and click “Reveal Config Vars.”
Add these options to match the next image.
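For context on how the code might consume these config vars, here is a rough sketch. The variable names GOOGLE_CHROME_BIN and CHROMEDRIVER_PATH are assumptions based on what is commonly used with the Chrome buildpacks, not necessarily the names this project expects, so substitute whatever keys you entered. The chrome_settings helper is hypothetical as well.

```python
import os
from typing import Mapping, Optional


def chrome_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the Chrome-related settings from environment variables.

    The variable names below are assumptions for illustration; use the
    keys you actually defined under "Config Vars".
    """
    env = os.environ if env is None else env
    return {
        # Path to the Chrome binary installed by the buildpack (None locally).
        "binary": env.get("GOOGLE_CHROME_BIN"),
        # Path to ChromeDriver; fall back to whatever is on the PATH.
        "driver": env.get("CHROMEDRIVER_PATH", "chromedriver"),
        # Flags typically needed to run Chrome without a display in a container.
        "arguments": ["--headless", "--no-sandbox", "--disable-dev-shm-usage"],
    }


settings = chrome_settings()
print(settings)
```

With Selenium, these values would typically be applied through ChromeOptions (its binary_location attribute and add_argument method). Reading them from the environment keeps the same code working both locally and on Heroku.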
Now you’ll have to add both the Postgres and Heroku Connect Add-ons. Simply go to your app, and click on Add-ons in the overview. The hyperlink for “Configure Add-ons” is where we’ll add them. Type in and provision both the Heroku Postgres and Heroku Connect. The free versions will work for this.
Click on the Heroku Connect add-on so you can begin authentication.
Click “Setup Connection” to start the login process with the org you wish to tie to this.
Now we must define the mappings and which objects are permitted. Click on the “Mappings” tab and then “Create Mapping.”
Find the Country object you made and click on it. Check all the fields you want to come over. At minimum, we need the Name field. Also, check “Accelerated Polling.” Click Save.
Now that you have a source of truth, let’s dive into the local computer environment used to build this application. First, you’ll need to download this project into a folder on your computer. Open the folder with PyCharm. Then you’ll need to open the console and enter “pip install -r requirements.txt” like below. This magically configures everything to the same build I had when I made this.
If you see the error message below, the simplest fix is to download Postgres to your machine here and restart PyCharm to try again. It just needs a PATH variable that points to the Postgres files to work.
Next, update the PostgresHandler.py file in this section with the URI of your Postgres database where it says “<YOUR DATABASE URI>”. This makes it easier to test database records in a local program, but the URI contains your credentials, so you don’t want to commit it to GitHub.
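One way to keep the URI out of version control is to read it from an environment variable. The sketch below assumes the DATABASE_URL config var that Heroku Postgres sets on the app; the helper names and the example URI are made up, and this is not how PostgresHandler.py is written in the project.

```python
import os
from typing import Optional
from urllib.parse import urlparse


def get_database_uri(default: Optional[str] = None) -> Optional[str]:
    """Return the database URI from the environment, or a fallback."""
    # DATABASE_URL is the config var Heroku Postgres attaches to the app.
    # Locally you would export it yourself before running the program.
    return os.environ.get("DATABASE_URL", default)


def database_settings(uri: str) -> dict:
    """Split a postgres:// URI into the pieces a database driver expects."""
    parts = urlparse(uri)
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,  # 5432 is the default Postgres port
        "user": parts.username,
        "password": parts.password,
        "dbname": parts.path.lstrip("/"),
    }


# Entirely fictional credentials, used only to show the shape of the result.
example = database_settings("postgres://user:secret@host.example.com:5432/mydb")
print(example["host"], example["dbname"])
```

With this approach the placeholder in the file never has to be replaced with a real secret.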
To find your Postgres URI, go into your Heroku app, click on the Heroku Postgres add-on, open the Settings tab, and then click the “View Credentials” button.
Next, you need to fix the table name referenced in your file. To do that, go to your Heroku app and then to the Heroku Postgres add-on. Create a Dataclip by clicking on the Dataclips tab. Click “Create Dataclip” and you’ll see the database definitions on the right-hand side. Here you can test queries without doing it in code.
Change the “FROM” statements in the PostgresHandler.py file to match your table name.
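As a hint for what the table name will probably look like: Heroku Connect by default places synced tables in a schema named “salesforce” and lowercases the object’s API name. The sketch below builds a query on that assumption. Country__c is a hypothetical API name, and both helpers are illustrative, so confirm the real schema and table in the Dataclip view before relying on this.

```python
import re


def connect_table(api_name: str, schema: str = "salesforce") -> str:
    """Return the schema-qualified table name for a Salesforce object.

    Assumes the default Heroku Connect schema and lowercased table names.
    """
    # Only allow plain identifiers, since the name is placed into SQL text.
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", api_name):
        raise ValueError(f"Unexpected object name: {api_name!r}")
    return f"{schema}.{api_name.lower()}"


def select_names_query(api_name: str) -> str:
    """Build the query that lists the Name field of every synced record."""
    return f"SELECT name FROM {connect_table(api_name)}"


# "Country__c" is an assumed API name for the custom object.
query = select_names_query("Country__c")
print(query)
```

Pasting the printed query into a Dataclip is a quick way to check that the name is right before touching the code.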
Lastly, you need to set the path of your ChromeDriver download. Make sure it’s unzipped, then paste its directory path into the WebCrawler.py file.
Note: you may run into issues with the ChromeDriver “not being found” when running the program. When that happens, I temporarily place the ChromeDriver file in the same directory as the project, update the googleChromeExecutePath variable, and it starts working.
Now that you’ve downloaded the code, you can begin to understand the logic behind this scraping process. We’re going to scrape this example site for the list of countries to get the hang of the code. The flow is below and everything in the code is commented.
- We’ll load the webpage
- Identify the table we want
- Pull the text
- Click Next and repeat the steps above until Next is no longer available
- Get a list of existing countries in the database
- Check the list and bulk insert anything new
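The flow above can be sketched without a browser. The project itself drives Chrome through Selenium; the stand-in below uses Python’s built-in HTML parser on hard-coded pages instead, so the HTML, the class, and the function names are all invented for illustration. The part worth noticing is the last step: compare what was scraped against what the database already holds and keep only the new names.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collects the text of every <td> cell in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            text = "".join(self._buffer).strip()
            if text:
                self.cells.append(text)
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._buffer.append(data)


def scrape_pages(pages):
    """Pull the cell text from each page in order (stands in for clicking Next)."""
    names = []
    for html in pages:
        parser = CellTextParser()
        parser.feed(html)
        names.extend(parser.cells)
    return names


def find_new(scraped, existing):
    """Return scraped names not already stored, ignoring case and duplicates."""
    known = {name.casefold() for name in existing}
    new = []
    for name in scraped:
        key = name.casefold()
        if key not in known:
            known.add(key)
            new.append(name)
    return new


# Two made-up "pages" of a paginated country table.
PAGES = [
    "<table><tr><td>Canada</td></tr><tr><td>France</td></tr></table>",
    "<table><tr><td>Japan</td></tr><tr><td>canada</td></tr></table>",
]

scraped = scrape_pages(PAGES)
to_insert = find_new(scraped, existing=["France"])
print(to_insert)
```

The resulting list is what would go into a single bulk insert, for example through a database driver’s executemany call, rather than one insert per country.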
If everything was set up correctly, you should see this when you run the code and then run a dataclip in Heroku.
Heroku has a great tutorial on how to handle uploading code directly to Heroku or using a service like GitHub. I will not cover that part here, but you can upload the code to Heroku once it’s working.
Taking it a step further
I left a lot of print statements mainly because that’s a simple way to see the order of execution and read logs. I also left them in there because now you’re able to schedule your Heroku application to run like a Batch job in Salesforce. All you need to do is verify your account with a credit card and install the add-on “Heroku Scheduler.” You will set a frequency and then put in the command “python main.py.” That will do the trick. It will run on your desired schedule.
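If you later outgrow print statements, a scheduled entry point could look roughly like the sketch below. It is not the project’s main.py: the run_job helper and the placeholder scrape function are assumptions. It writes timestamped log lines to standard output, where Heroku’s log stream picks them up, and it turns a failure into a non-zero exit code.

```python
import logging
import sys

logger = logging.getLogger("scraper")


def configure_logging() -> None:
    """Send timestamped log lines to stdout."""
    logging.basicConfig(
        stream=sys.stdout,
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )


def run_job(scrape) -> int:
    """Run one scrape and return an exit code (0 = success, 1 = failure)."""
    try:
        inserted = scrape()
    except Exception:
        # logger.exception also records the traceback.
        logger.exception("Scrape failed")
        return 1
    logger.info("Scrape finished: %d new records", inserted)
    return 0


configure_logging()

# Placeholder scrape that "inserts" nothing; swap in the real scraping call.
exit_code = run_job(lambda: 0)

# In a real entry point you would finish with sys.exit(exit_code) so a
# failed run is reported as a failed process.
```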