Community Article: Taking String Manipulation To the Next Level


Userlevel 6
Badge +9

This tutorial is a guest post from Ashwin A.K from Speridian Technologies  - a global business and technology solutions provider. Speridian's business and technology experts help clients modernize their businesses through Digital Transformation.

 


 

I bet everyone enjoyed participating in this year's Bot Games event.

 

Many of you probably weren’t even aware of the Web Automation Package, until Bot Games. Please try it out, it's packed with incredibly versatile tools as you can see below.

 

 

The exercises this year were really great this time around, so I decided to add an exercise of my own – which may intrigue some of you. The solution is also presented in this article, so I will let you know when to stop reading.

 

Why would I want you to stop reading?

 

  • Learning doesn’t happen through osmosis – it happens when you get your hands dirty with it.
  • You didn’t learn to drive a car by simply watching someone else drive now, did you?
  • I want you (yes YOU, don’t look away) to try this exercise on your own before studying the solution.
  • Then again, if you’ve never worked with Regular Expressions, the solution might traumatize you.
  • Who knows, you might end up traumatizing me with a solution of your own.

 

Exercise Time!

 

This is what we have:

 

 

Use this sample text: 

 

I don't belong here

1,1 1,2 1,3 1,4 2,1 2,2 2,3 2,4 3,1 3,2 3,3 4,4 4,1 4,2 4,3 4,4

Neither do I

 

A Text File containing some Text.

Pretty darn obvious, but stick with me here.

You have to retrieve the line items and write them in Excel as shown below.

 

Not so pretty darn obvious.

 

I used Regular Expressions to achieve my solution because I’m comfortable working with them. Maybe a little too comfortable.

 

You can use whatever means necessary, whether through split, substring, or DLL, surprise me.

 

My Solution (STOP Reading here and Start trying yourself and then continue to compare your solution with mine!)

.

.

.

.

.

Did you try the exercise?

 

Did you really?

 

How far do you think you will get without putting in the effort? Your ability to design solutions doesn’t stem from the solutions you view online, but from exercising whatever you know and THEN studying other solutions – to fill in the gaps you didn’t know you had.

 

Well, it’s not like I can stop you from proceeding. These are just a couple of sentences designed to guilt-trip you back into trying the exercise.

 

First, let's study the problem

 

We can divide it into three parts:

  • Text which is not required
  • Text which is required
  • Convert Text to Table

 

With Regular Expressions, you may either extract whatever the pattern matches, or replace the matched pattern with another string. All of that and more is possible with the Regex Tools Package in Bot Store, but I won’t be using it. That would be too easy, so I decided to torture you with something a bit more complex, which is using the String: Replace Action toggled to is a regular expression.

 

What are we trying to achieve here?

 

We can’t simply generate a table from a block of string, however, it will transform into a table once it’s written into a comma-separated values file. But the process won’t be as straightforward, since the text itself contains commas which means if we were to log this to CSV, we’d end up with this monstrosity and It’s all wrong!

 

 

Not to worry, CSV can actually compartmentalize comma containing values into a single cell, provided we surround the value with double quotes, which is what we will do.

 

Remove Unwanted Text

 

Since the unwanted text is composed of letters and spaces, we have to craft a pattern which detect both letters (regardless of it's casing) and spaces.

 

 

Oh, what do we have here? The pattern has matched items we don’t want to remove, even newlines. Well, the last bit is no surprise, as s doesn’t just detect spaces, but line terminators as well. This, however, can be easily resolved. Think about it, the unwanted text always starts with letters, right? So let’s use that to our advantage.

 

 

The period detects any character except line terminators, while the asterisks detects one or more instances of the previous item. Put two and two together, and you have a pattern which detects items that start with a letter, and anything(except newlines/carriage returns) after that until it hits the end. The reason our pattern doesn’t expand into the lines beneath is that the period won’t match line terminators, making this a pretty robust pattern.

 

I know I’m repeating myself, and that is done on purpose. It’s through repetition that we internalize.

 

Hey, look, something’s leftover.

 

This might not be as straightforward, so you might want to put that phone aside and pay attention for a while. We have to detect the start and end of a given value too:

 

 

Believe it or not, you can detect boundaries with Regular Expressions. The only problem is the String: Replace Action doesn’t allow multiline detection, so we have to lengthen the pattern a bit to detect the boundaries we are interested in.

 

 

Remember, s detects spaces and line terminators. When we nest s within positive Zero-Length assertions we are looking before and after the spaceline terminator. Now just what the heck is a Zero Length Assertion? In layman’s terms, it’s a technique used to match patterns before/after a pattern. I hope that explanation didn’t force you to lay down. Mind-boggling stuff but stick with me, we are almost at the finish line!

 

Comma separating the values

 

Now that we are done with the refining, the last step is to add commas in place of space and log it to CSV. We have to detect only spaces, but s detects line terminators as well which means we can’t use it here right?

 

Wrong. You can negate patterns and create a new pattern from that as well. This is where you have to PAY attention.

 

\n is a newline, \r is a carriage return and \s detects spaces and line terminators.

\S is the inverse of \s i.e., it detects anything that is not a line terminator or a space.

If we combine \S\r\n, it will detect line terminators and anything other than spaces.

But if we were to negate \S\r\n, it will only detect spaces.

 

You might want to read that again…and maybe lay down for a while.

 

 

The caret symbol is used to specify the start of a pattern but changes its operation when placed in between square brackets. It took me a while to wrap my head around it as well, so I can understand your frustration. The last step is to log it to CSV and we are done.

 

What about writing it to excel?

 

That is for you to figure out. Fire some of those neurons!

 

This was kind of pointless when you think about it.

 

The point of this pointless exercise was to encourage you to get into the habit of recognizing patterns. Regular Expressions are incredibly useful as it lets you filter data with ease. That might not make you go “OMG thass so keeewl!”, or at least not yet. You can achieve a lot with Regular Expressions, and this is just one unnecessarily complex example.

 

But I Can't Recognize Patterns!

 

If you can't recognize the patterns, that means you aren't paying attention. Yes, it’s hard, and that’s because paying attention is hard. I remember how frustrated I was when I first started studying regular expressions. But the more I explored it, the easier it became.

 

 

 

Conclusion

 

Regular Expressions are applicable across the board and are an invaluable skill. Its application goes beyond checking if a given text is an email or not, and if you know what you are doing, you can actually use it in areas you never thought possible. I even use it in data scraping, and it works like a charm. Don’t forget to share it with your friends so that they too can experience some of that trauma.  

 

 


0 replies

Be the first to reply!

Reply