Google Sheets: Data Scraping

In this 9th article of the series, you will learn how to capture leads, assign them to team members, and qualify them with a set of information. We will explain how to set up your sheet to automatically highlight new entries, so you can start scraping data about them as soon as they land in your CRM. You will also learn how to give your teammates a simple way to record whether they have made contact.

If you are interested in CRM (Customer Relationship Management), then feel free to read our previous articles about building a CRM spreadsheet and setting up a complete CRM system.

Qualify Contacts with Web Scraping

Let’s see what it takes to select elements from web pages and import their content. You can set up automatic data scraping (pulling another website’s data straight into one of your sheets) quite easily, using nothing more than a few built-in functions. Google Sheets can grab data from most web addresses with the help of the following IMPORT functions:

  • IMPORTXML() – Imports data from a page’s XML or HTML content using an XPath query
  • IMPORTDATA() – Imports data from a URL that serves a file in .csv or .tsv format
  • IMPORTFEED() – Imports an RSS or ATOM feed
  • IMPORTHTML() – Imports a table or list from within an HTML page
  • IMPORTRANGE() – Imports a range of cells from another Google Sheet
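As a quick sketch, here is how each of these functions might be called. The URLs, feed path, and ranges below are placeholders for illustration, not real endpoints:

```
=IMPORTXML("https://example.com", "//title")
=IMPORTDATA("https://example.com/contacts.csv")
=IMPORTFEED("https://example.com/feed.xml", "items title", TRUE, 5)
=IMPORTHTML("https://example.com/pricing", "table", 1)
=IMPORTRANGE("https://docs.google.com/spreadsheets/d/KEY", "Sheet1!A1:C10")
```

Each function takes the source URL first; the remaining arguments (XPath query, feed query, table index, cell range) depend on the function.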

Now let’s see how to use these functions to qualify prospects automatically within a CRM.

  1. Choose the desired data for scraping

As you may know, each website uses a slightly different coding system. Although there are shared standards and best practices, the structure of one site rarely matches another’s, and the elements inside the <body> tags differ the most.

For example, Zapier’s search bar in its help section is coded and tagged differently than Google’s search bar on its homepage: the styling, the HTML elements, and the scripts that handle user interactions all differ, and they use many different tags.

These differences also make each element’s XML (Extensible Markup Language) path different; let’s just call them XPaths. With an XPath, you can find elements on a page that share the same tagging and HTML structure. Google Sheets uses this structural information to grab content from a webpage. Suppose you have the XPath of the Twitter bio element and the Twitter account of each of your contacts. That’s all Google Sheets needs to import every person’s bio automatically. The XPath is the key.

You can reveal the XPath of any element on a page with Chrome’s Developer Tools (press CMD+OPT+I on a Mac to open it), or by right-clicking the element and choosing Inspect. Either way, Dev Tools opens with the element highlighted under the "Elements" tab.

Right-click the highlighted element (usually shaded gray or blue), choose Copy > Copy XPath, and the element’s XPath is copied to your clipboard. Pasting the result into a cell gives you something like //*[@id="lst-ib"]. This is a code specific to that one element; it is not generalized and only works for that particular element on that particular page.
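To see the difference between a copied, element-specific XPath and a generalized one, compare these queries (the id and name values here are illustrative examples, not taken from any particular site):

```
//*[@id="lst-ib"]      matches only the single element with that exact id
//input[@name="q"]     matches any input named "q" on any page that has one
//title                matches the page title on virtually any website
```

The more general the query, the more sites it will work on, which matters when you scrape many different leads.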

There are certain XPaths that are the same across a huge number of sites. If you plan to scrape elements from multiple sites, you should watch out for those. Let’s call them standard elements; one example is a site’s meta tags. These are SEO (Search Engine Optimization) elements, such as each page’s Title, Author, and Description, and they contain useful information. If you need to gather data from a bunch of websites, they are your best bet.

Meta tags, for example, are a very useful source when you want to gather information about a lead’s service or website, which helps the sales representatives assigned to that lead in your CRM.

  2. Scrape Data with Google Sheets

It’s time to make use of the =IMPORTXML() command to grab the same category of data from different pages without knowing anything about how each one is structured and tagged. There is also the =IFERROR() function, which works like a standard IF(), except that instead of typing your own condition statement (if a specified thing happens, then execute a command), the condition is fixed: does the wrapped function return an error while running?
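A minimal way to see IFERROR() in action is to wrap a formula that is guaranteed to fail:

```
=1/0                        returns the #DIV/0! error
=IFERROR(1/0, "problem")    returns "problem" instead of the error
```

The second argument is the fallback value that appears whenever the wrapped formula errors out.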

#NAME?, #REF!, #ERROR!, and #N/A are all possible outcomes. They may appear because a particular formula doesn’t work properly and therefore throws errors at you; other times the formula simply isn’t grabbing any data. Google Sheets can reveal the source of the issue.

When you use IMPORTXML(), you may run into websites that block scraping, in which case you receive the #ERROR! text. Highlight the cell with the error message and Sheets will tell you what’s wrong. It displays the same error if the site URL is dead and returns a 404.

Of course, these errors can also be hidden: just wrap your formula inside IFERROR()’s brackets, and a different result will appear when a problem occurs. Be careful, though, since this also silences errors that would otherwise give you helpful information. First make sure the function works as expected, and then wrap it in IFERROR() if you want.

Take a look at this function that scrapes the "Title" from a page: =IFERROR(IMPORTXML(website,"//title"),"none")

  • IFERROR( – wraps your IMPORTXML() function for an error check.
  • 'website' – represents the range of URLs.
  • '//title' – the XPath query that refers to the webpage’s "title" element in the code.
  • 'none' – whenever IMPORTXML() can’t return a "title", IFERROR() returns this default result instead.
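Applied in an actual sheet, and assuming the contact’s URL sits in column B (adjust to your own layout), the formula for row 2 might look like this:

```
=IFERROR(IMPORTXML($B2, "//title"), "none")
```

Drag it down to scrape the title for each subsequent contact’s row.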

The drawback is that IMPORTXML() is limited to 50 instances per spreadsheet, so you need to use this function wisely. It’s therefore essential to use XPath’s "OR" operator, written as | (a "pipe"), so you can combine queries into one complete formula for the Title, Meta Description, and Keywords. The XPath queries for the description and keywords look like this:

//meta[@name='description']/@content and //meta[@name='keywords']/@content. You need to refer to the correct elements, which requires a little specification.

Now let’s check out the full formula: =IFERROR(IMPORTXML(site,"//title|//meta[@name='description']/@content|//meta[@name='keywords']/@content"),"none")

The results should appear on the same row as the particular contact in your CRM. That’s simple: put the previous formula inside a TRANSPOSE() command. By default, the results spill vertically into the next contact’s row, but TRANSPOSE() flips the data horizontally so each value lands in its own column on the contact’s row.
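Assuming again that the contact’s URL sits in column B, the transposed version of the combined formula would look something like this:

```
=TRANSPOSE(IFERROR(IMPORTXML($B2, "//title|//meta[@name='description']/@content|//meta[@name='keywords']/@content"), "none"))
```

The title, description, and keywords now fill the cells to the right instead of the rows below.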

  3. Make Your Formulas Work for All Possible Contacts

When we explained Forms and CRM, we used the ARRAYFORMULA() function to apply formulas to every new entry imported into the CRM. Because of IMPORTXML()’s 50-instance limit, that approach simply doesn’t work here.

ARRAYFORMULA() doesn’t actually add new formulas to your subsequent rows (it just extends one formula across them), so every time you need IMPORTXML() on a new row, you have to copy it down yourself.

A straightforward way to do this is to select the cell with the scraping formula, grab the small box in its corner, and drag it down through the new contacts you want scraped next. Still, once the 50th contact arrives, you can’t scrape any further unless you remove some of your previous IMPORTXML() functions. You can remove the formulas while keeping the data already gathered from the websites: select the cells, copy them, right-click the first cell, and choose Paste Special > Paste values only.

The text will look unchanged, but if you double-click the cells, there is no formula anymore, just plain text. Keep in mind that maintaining one IMPORTXML() function in the very bottom row will benefit you: you can drag that function down again when new contacts appear, with no need to rebuild it from scratch.

  4. Set Up Conditional Formatting Rules for Highlighting

All kinds of data can be highlighted with this little trick. It’s quite a usability improvement for your CRM, and Conditional Formatting is useful in other ways as well. You can quickly identify your VIP prospects by setting up conditions such as keywords, creation dates, and numeric comparisons. With this formatting, you can style any text with colors, bold, decoration, and similar visual changes.

Go to your sheet and find the gray square in the top-left corner, above row 1 and to the left of column A. Click it to select every cell in the sheet, then right-click and choose Conditional Formatting, which brings up the Conditional Formatting panel on the right side of the sheet. Here you can define the format with numerous custom rules and text-based statements. Choose Custom Formula and insert this one: =AND($C:$C<>"",$K:$K=""). The AND() returns a logical value: TRUE, or FALSE if the statement fails. $C:$C<>"" checks for CRM entries in the column you labeled Name, and $K:$K="" looks for an empty Meta Information column for that contact.

Once you’re done, select any color for highlighting! Whenever the formula finds a contact with no meta data yet, the entire row will be highlighted the way you formatted it.

Finally, you should add another column to track whether your agent has contacted the lead. Make a column and name it "Contacted", for example. Then it’s up to you how to "check off" a customer once your agent has been in touch. You can keep it simple with a Yes / No code.
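If you want the highlighting trick to work with this new column too, you can adapt the earlier custom formula. Assuming the Contacted column is column L (adjust to your own layout), a rule like this would highlight every lead nobody has reached out to yet:

```
=AND($C:$C<>"", $L:$L<>"Yes")
```

It matches rows that have a Name but where Contacted is blank or "No", so untouched leads stand out at a glance.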

Stay tuned for our next article where we talk about setting up outreach and social management via Twitter.


David Cross

David is the chief editor at WebHostingMedia right from the beginning. He has a great passion for building and managing websites and creating helpful content. He is also interested in programming; he is currently learning Python.