Working With Large Excel Datasets in a Bot

  • 20 August 2020
  • 1 reply
  • 2885 views

Userlevel 6
Badge +9

We’ve all been there: Someone in a business role asks, “Hey can you create one of your RPA bot things to automate validating data from one spreadsheet and writing data to another file?”. Without hesitation, most Automation Anywhere developers reply “Sure, that's easy”…only to subsequently receive a spreadsheet with 40,000+ records paired with a sudden sense of self-doubt questioning how easy this might actually be.

Fortunately, automating with large Excel datasets doesn’t have to be something to make you question your bot building aptitude. In this article, we’ll take a look at 3 approaches for reading data from a large dataset and discuss the pro’s and con’s for each.

Scenario

As a part of our automation, we need to validate the zip code from an address and retrieve the corresponding city and state values to make sure they match up. Per USPS, there are nearly 42,000 Zip codes in the United States. (for those outside the United States - a zip code is a postal code that allows mail delivery services to break down a city/state into smaller geographical regions). For the purposes of this article, we will be using ZipSample.xlsx which has 43,522 zip codes (and is available on the GitHub for this article if you want to follow along). For each example solution below, we will assume that the bot is looking up the zip code 95164.

Excel Advanced Package

With so many rows, this sounds like this could get tricky, I better reach for the Excel Advanced Package” - a fair thought to start. The Excel Advanced package has by far the most available actions for any of the currently available Excel packages.

For the laid-out scenario however, most of those advanced actions won't actually be needed. We’d need to use the Excel advanced: Open action to select and open our spreadsheet, followed by a Loop that will allow for the iteration of each record in the sheet. Within the loop, we can utilize an If statement with at string match condition to check if the current record’s zip column is equal to our target zip, 95164.

Finally, we’ll use a Message Box to display the result of our search, with a 3 second self-close time - which will be held constant in our other approaches as well.

These 4 actions will be surrounded by a “Set Start Time”, “Set End Time” and “Calculate Difference” steps which will allow us to establish the total processing time for the bot’s run.

In testing, the Excel Advanced approach took 2:35 on average with its best time coming in at 2:30.

Excel Basic Package

It's served me well for everything else - sticking to the tried and true Excel Basic Package” - a reasonable approach for sure. Unlike the Excel Advanced Package, the Excel Basic package does not actually require that Excel be installed for its use - but offers fewer actions for use.

 

Almost identical to the use of the Excel Advanced package, the excel basic approach will need a total of 4 actions.

Excel Basic Open will allow us to select a spreadsheet and establish a session with said sheet. Again, because Excel doesn’t need to be installed for Excel Basic to work, you shouldn't expect to see any windows opening or anything during this Open action.

The Loop command will allow us to loop through each record in the established session through the end of the file.

We’ll use an if statement with a string match condition to look for our target zip of 95164. Once that match is found, the matching city and state will be displayed in a message box which will self-close after 3 seconds (in line with the other tests).

In testing, this Excel Basic approach had an average time of 9.33 seconds with a fastest time of 9 seconds. Much faster than the Excel Advanced approach.

Database Package

That dataset is huge. Its probably best to connect to the spreadsheet like a database and use a Select statement to return the matching data.” - creative, outside the box thinking no doubt. Connecting to an Excel spreadsheet as a database is possible by using a connection string leveraging the Microsoft Access Database Engine. While this may sound like a brand-new concept, this approach was actually available and widely used in Automation Anywhere v11 as well.

Like the previous approaches, using the Database package to connect to a spreadsheet starts with establishing a connection. However, instead of choosing an xlsx to read from like we did with the Excel packages, for the Database package we’ll need to select the “Default” connection mode and provide a connection string.

 

The connection string used for this demonstration was tested on Windows 10 and Server 2019.

Provider=Microsoft.ACE.OLEDB.12.0;Data Source="C:full path to your xlsx file in quotes.xlsx";Extended Properties="Excel 12.0 Xml;HDR=YES"

This connection string leverages the Microsoft Access Database Engine 2010 Redistributable - (install the 32 bit version even if you’re using a 64 bit OS). Before installing this package, you may consider testing to see if the connection can be made without installing anything. As of last testing, it was working on our machines without additional installs, but it very clearly could also be due to how your machine baseline has been set up. Also of note: This approach does NOT require Office/Excel to be installed.

Once the connection to the xslx file has been established, the method for finding our data changes a bit too. Instead of looping through every row like we did using the excel packages, we can use a Select statement to find the matching records for our zip code. This is done by using the Database: Export to data table action from the Database package.

Provide a session name (established in the connect action) and enter a select statement. For this example, we used:

Select * from [Sheet1$$] WHERE Zip = 95164

Note: There are two $ signs at the end of the Sheet name here…this is a result of writing SQL Queries against excel files - which requires a $ after the sheet name. In Automation Anywhere, $ is used to prefix variables, so in this case, we have to use two $ to specify where one should actually exist (an escape character).

Once our query has executed, we should have a data table variable type returned. Using the loop action, we can loop through each row in the data table and display the corresponding results in a message box. Like the examples before, the results will be displayed in a message box set to self close after 3 seconds.

In testing, this approach had an average run time of 4 seconds with the fastest time being 4 seconds.

Of note: when this same test was run with the target spreadsheet already opened on the desktop, testing times jumped to 18 seconds. Just something to keep in mind should you want to replicate any of these tests yourself.

Conclusion

There are many ways to work with large Excel datasets from within your bot…even some approaches beyond what were covered in this article. Each method has specific pros and cons to its approach, even extended to the developer’s comfort level in working with different package types.

Excel Advanced: The slowest of the methods tested, provides the most actions out of box which may be useful in making manipulations to the spreadsheet once the matching data has been found - and as such, shouldn't be completely ruled out as a viable option.

Excel Basic: Great balance of speed and flexibility - in that with multiple conditions being used together in a single if statement - advanced data validations could be performed. Allows for spreadsheet manipulations using the Excel Basic actions and can work with Control Room files as well as files stored locally.

Database: Clearly the fastest of the methods tested - assuming the target spreadsheet is not already opened on your desktop. The speed comes at the cost of needing to know how to properly use SQL queries however and be able to loop through results which may be returned. Would only work on files stored locally or on network drives, not for use for CR stored files.

There is no ONE BEST WAY to solve these kinds of problems. The best solution is an RPA developer who knows each method, understands the pros and cons of each approach, and can understand the business needs well enough to solve the task at hand.

 

Final Note

If you want to replicate any of the tests that were performed on your own, or want to explore any of the code that was used - check out the supporting files on the Automation Anywhere GitHub page.


1 reply

Badge

what will be the connection string for .xls file 
I used this as connection string=(Provider=Microsoft.ACE.OLEDB.12.0;Data Source="FileNameWithPath.xls";Extended Properties="Excel 8.0;HDR=YES")
Im getting this error while using the above as connection string
“Could not Connect to Database. External table is not in the expected format.”

Reply