Regular expressions – metacharacters (part 1)


Today’s topic stands apart from any single programming language, yet most of them include it.
Regular expressions are a very efficient meta-language when it comes to performing validations, replacements or searches where accuracy of intent is crucial. In this part, in particular, we will analyze the syntax of regular expression metacharacters.
Many hate them because they are quite “hermetic”; others (myself included) find them efficient precisely because they are free of frills and useless parts. With regular expressions, we get straight to the point. There is no padding, no optional spacing, no punctuation for its own sake. One wrong character, or one character in the wrong place, and your regular expression could do something different from what you expected.
If hermeticism is poetry, so too are regular expressions, in their elegance and compactness.

Regular expressions metacharacters

We’ll cover:

  • some of the aspects to be taken into consideration when using regular expressions.
  • a minimal set of tools for using them well, even in complicated cases.
  • the metacharacters in regular expressions.

In the second part of this article I will include a small “cookbook” of regular expressions that you are likely to need sooner or later.

Many different kinds of software use regexes, so their syntax can vary slightly from one application to another.
Many languages and tools, such as Perl, PHP, Java, Python, JavaScript and Apache (in its configuration files), implement regular expressions, with some differences in the metacharacters they support. The Perl-style syntax, available outside Perl through the PCRE library (Perl Compatible Regular Expressions), is one of the richest and most popular, so we will use it in this article. I advise you to always test the regular expressions you write with the engine that will actually run them, so as not to run into problems due to these differences.

Regular expressions – metacharacters ^ and $

Some of the simpler metacharacters are ^ and $, which indicate a position.
^ represents the beginning of an input text, while $ represents the end of it.
Suppose we have a line of text like this:

text is in context

If I use the regex ^text I find only the “text” at the beginning of the line.
If I use the regex text$ I find only the “text” at the end of the line (here, the last four letters of “context”).
If instead I use the regex text without metacharacters, I find all the occurrences of “text”.

A special case is ^text$: it means looking for a line that begins and ends with text, that is a line whose only content is “text” and nothing else, neither spaces nor punctuation.
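
The four cases above can be checked with a short sketch. It uses Python’s re module rather than PCRE itself, which is an assumption of this example; for these simple anchors the two engines behave the same way. The variable names are only illustrative.

```python
import re

line = "text is in context"

# ^text: only the "text" at the very start of the line.
start = [m.span() for m in re.finditer(r"^text", line)]

# text$: only the "text" that closes the line (the tail of "context").
end = [m.span() for m in re.finditer(r"text$", line)]

# text: every occurrence, wherever it sits.
everywhere = [m.span() for m in re.finditer(r"text", line)]

# ^text$: the line must contain "text" and nothing else.
whole_line = re.search(r"^text$", line) is not None
only_text = re.search(r"^text$", "text") is not None

print(start)       # [(0, 4)]
print(end)         # [(14, 18)]
print(everywhere)  # [(0, 4), (14, 18)]
print(whole_line, only_text)  # False True
```

Each pair is the (start, end) position of a match, so it is easy to see which “text” was found.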

Character class: []

The square brackets define a character class, a list of the possible characters we want to include in the search, taken one at a time.
Within a class, the order in which the characters appear does not matter.

Let’s clarify with an example: we are examining the HTML code of a web page and we want to find all the heading tags, that is: <h1>, <h2>, <h…>, <h6>. A regex to accomplish this is <h[123456]>, which translated means: “I want to find every <h1>, every <h2>, every <h3>, …, every <h6> present in the code”.
But that’s not the only possible regex. In fact, to indicate a range of values we can use the hyphen (or minus sign) metacharacter - inside [].
The regex of the previous example then becomes <h[1-6]>, which is shorter to write.

Other very common examples with the hyphen metacharacter are:

  • [0-9] to indicate digits 0 to 9.
  • [a-z] to indicate lowercase letters.

You can then use multiple ranges:
[3-6M-P] is equivalent to [3456MNOP].
Ranges and literal characters can be combined together:
[A-Z0-9!$] stands for any capital letter or any digit or a “!” or a “$”. Note that a metacharacter (the “$” in this example), when inside a character class, resumes its literal meaning.

Note that we can also include the hyphen among the literal characters to search for, as long as the ‘literal hyphen’ cannot be confused with the ‘metacharacter hyphen’ that indicates a range.
To avoid any possible misunderstanding, it is sufficient to put the hyphen immediately after the opening square bracket or immediately before the closing one, because in these positions it can no longer indicate a range:
[-aeiou] and [aeiou-] are two equivalent regexes that match any occurrence of a hyphen or a vowel.
The hyphen is treated as a metacharacter only when it is used to indicate a range.
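
Here is a small sketch of the character classes seen so far, again assuming Python’s re module as the engine. The sample strings (the HTML fragment, “up-to”, “Pay $5 now!”) are made up for the illustration.

```python
import re

html = "<h1>Title</h1> <p>Body</p> <h3>Sub</h3> <h7>Bogus</h7>"

# The explicit list and the range describe the same class.
listed = re.findall(r"<h[123456]>", html)
ranged = re.findall(r"<h[1-6]>", html)

# Several ranges in one class: [3-6M-P] behaves like [3456MNOP].
multi = re.findall(r"[3-6M-P]", "2 3 7 L M P Q")

# A hyphen right after [ or right before ] is a literal hyphen.
leading = re.findall(r"[-aeiou]", "up-to")
trailing = re.findall(r"[aeiou-]", "up-to")

# Inside a class, ! and $ are ordinary characters.
literal = re.findall(r"[A-Z0-9!$]", "Pay $5 now!")

print(listed, ranged)     # ['<h1>', '<h3>'] ['<h1>', '<h3>']
print(multi)              # ['3', 'M', 'P']
print(leading, trailing)  # ['u', '-', 'o'] ['u', '-', 'o']
print(literal)            # ['P', '$', '5', '!']
```

Note that <h7> is ignored, because 7 is outside the range 1-6.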

^ inside []

We have already seen the metacharacter ^. It takes on a completely different meaning if we use it inside a character class, immediately after [. For example, [^a-e] means that we are looking for any character that is not a, b, c, d or e.

^ immediately after [ thus acquires the function of negating everything that follows it. It is often much more convenient to do this than to list everything you want to find.
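For example the regex photo[^13579] will look for “photo” followed by any character other than an odd digit. Keep in mind that some character must still be there: “photo” at the very end of the text is not found.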

If ^ is inside the character class [], but not immediately after [, then it goes back to having no particular meaning other than that of a common literal character.
In general, within a character class, most metacharacters revert to being literal characters, unless their position indicates otherwise.
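
A minimal sketch of the negated class, assuming Python’s re module. The photo pattern is written without a space before the class, which is my reading of the intended regex, and the file-like names are invented for the example.

```python
import re

# [^a-e]: any single character except a, b, c, d or e.
not_a_to_e = re.findall(r"[^a-e]", "abcdefg")

# "photo" followed by one character that is not an odd digit.
names = ["photo1", "photo2", "photoX", "photo"]
matched = [n for n in names if re.search(r"photo[^13579]", n)]

# A ^ that is not the first character of the class is a literal caret.
carets = re.findall(r"[a^]", "a^b")

print(not_a_to_e)  # ['f', 'g']
print(matched)     # ['photo2', 'photoX']
print(carets)      # ['a', '^']
```

The bare “photo” is rejected because the negated class still requires one character after it.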

Regular expressions metacharacters – the . (dot)

Sometimes you need to search for a sequence of characters without caring about one of them: in that position, any character at all is acceptable.
The metacharacter . serves exactly as this kind of “placeholder”.

For example, the AB.C regex will find “AB3C”, “ABUC”, “AB:C”, …, and even “AB C”. Space is also a character.
The metacharacter . can be repeated: AB..C expects exactly two characters, whatever they are, between “AB” and “C”.
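
To see the placeholder at work, here is a short sketch assuming Python’s re module. re.fullmatch is used so that each candidate string must match from start to finish; the list of candidates is only an illustration.

```python
import re

candidates = ["AB3C", "ABUC", "AB:C", "AB C", "ABC", "AB12C"]

# One dot stands for exactly one character, whatever it is.
one_dot = [s for s in candidates if re.fullmatch(r"AB.C", s)]

# Two dots ask for exactly two characters.
two_dots = [s for s in candidates if re.fullmatch(r"AB..C", s)]

# By default, the dot does not match a newline in Python's re.
newline_rejected = re.fullmatch(r"AB.C", "AB\nC") is None

print(one_dot)           # ['AB3C', 'ABUC', 'AB:C', 'AB C']
print(two_dots)          # ['AB12C']
print(newline_rejected)  # True
```

“ABC” is not found by AB.C: the dot needs a character to stand in for, it cannot stand for nothing.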

The metacharacter | (pipe)

The logical “OR” operator is rendered in the regexes with the | metacharacter.
For example, if we need to find the occurrences of house or home, we can use the regex house|home.
Of course, since this is still a somewhat “rough” regex, it will also give results of this kind: farmhouse, homes, homeless, etc.

Another example. We do not remember well whether a certain square is “Heroes of 1945” or “Heroes of 1915” in our address list.
Let’s go step by step: as far as we’ve seen so far the regex may be 1945|1915, so we’re sure one or the other or both pop up.
However, if we notice that the two possible names differ only in that 4 or that 1, the regex can be abbreviated as follows: 19(4|1)5.
The round brackets () are necessary to delimit the alternatives within the regex. If we had written 194|15 we would have found 194 or 15.
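
The following sketch, assuming Python’s re module, tries the alternation examples on a few made-up strings (the third square, “Heroes of 1925”, is there only to show a name that must not be found).

```python
import re

# The "rough" regex also fires inside longer words.
words = "house home farmhouse homes homeless hotel"
rough = re.findall(r"house|home", words)

squares = ["Heroes of 1945", "Heroes of 1915", "Heroes of 1925"]

# The long form and the abbreviated form select the same names.
long_form = [s for s in squares if re.search(r"1945|1915", s)]
short_form = [s for s in squares if re.search(r"19(4|1)5", s)]

# Without the brackets the two alternatives are "194" and "15".
ungrouped = re.findall(r"194|15", "1945 1915 150")

print(rough)       # ['house', 'home', 'house', 'home', 'home']
print(long_form)   # ['Heroes of 1945', 'Heroes of 1915']
print(short_form)  # ['Heroes of 1945', 'Heroes of 1915']
print(ungrouped)   # ['194', '15', '15']
```

The last line shows why the brackets matter: 194|15 even picks up the “15” of “150”.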

The quantifier metacharacter ?

Sometimes you are not sure whether a character is present within a string or not, or you want to find words both with and without that character.
For this purpose there is the metacharacter ? which, placed after a character, makes it optional.
In practice o? means “zero or one occurrence of the character o”.

If from our directory we want to get both “Potter” and “Potters”, we observe that the s is optional and therefore the regex to use is Potters?. Remember that ? always refers only to the immediately preceding character.
If there are several consecutive optional characters, we group them with round brackets: (inter)?national finds both “national” and “international”.
(Mr )?Bond finds both “Mr Bond” and “Bond”.
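
A quick sketch of the optional character and the optional group, assuming Python’s re module. The extra strings “internal” and “Mrs Bond” are invented counterexamples.

```python
import re

# The ? applies only to the "s" that precedes it.
plural = re.findall(r"Potters?", "Potter and the Potters")

# Round brackets make the whole group optional.
words = ["national", "international", "internal"]
national = [w for w in words if re.fullmatch(r"(inter)?national", w)]

people = ["Mr Bond", "Bond", "Mrs Bond"]
bond = [p for p in people if re.fullmatch(r"(Mr )?Bond", p)]

print(plural)    # ['Potter', 'Potters']
print(national)  # ['national', 'international']
print(bond)      # ['Mr Bond', 'Bond']
```

re.fullmatch is used for the grouped patterns so that the whole string has to match; with a plain search, “Mrs Bond” would still be found because it contains “Bond”.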

The quantifier metacharacters + and *

+ and * extend, in a way, the range of ?. Indeed, + finds strings that contain at least one occurrence of the preceding character.
For example ah+ can find the exclamations “ah”, “ahh”, “ahhh”, … but it doesn’t find “a”. The metacharacter * instead finds strings that contain as many occurrences of the preceding character as you want, even zero. So in this case:
ah* can find “ah”, “ahh”, “ahhh”, … and also “a”.
The metacharacter + by itself would not be indispensable, in fact:
“(something)+” is equivalent to “(something)(something)*”;
but as you can see it is a useful shortcut.

Let’s look at another example.
We need to make sure that a user can only enter numbers, but not a number that starts with zero. The regex ^[1-9][0-9]*$ serves the purpose: the metacharacters ^ and $ at the beginning and end of the line mean that the user input can only contain what is indicated inside, ^[1-9] indicates that at the beginning the allowed characters are those ranging from 1 to 9, therefore zero is excluded; [0-9]* finally indicates that as many repetitions as desired (even none) of the characters ranging from 0 to 9 are allowed.
The combination .* forms a simple regex to find an entire line of text.
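
The examples of this section can be sketched as follows, assuming Python’s re module. The function is_valid_number is a hypothetical helper written for this illustration, not something provided by a library.

```python
import re

shouts = ["a", "ah", "ahh", "ahhh"]

# + wants at least one "h"; * accepts zero as well.
plus = [s for s in shouts if re.fullmatch(r"ah+", s)]
star = [s for s in shouts if re.fullmatch(r"ah*", s)]


def is_valid_number(user_input: str) -> bool:
    """Hypothetical helper: digits only, and no leading zero."""
    # Caveat: in Python, $ also matches just before a final newline.
    return re.search(r"^[1-9][0-9]*$", user_input) is not None


checks = {s: is_valid_number(s) for s in ["7", "120", "012", "12a", ""]}

# .* takes everything up to the end of the line (the dot stops at "\n").
first_line = re.match(r".*", "first line\nsecond line").group()

print(plus)        # ['ah', 'ahh', 'ahhh']
print(star)        # ['a', 'ah', 'ahh', 'ahhh']
print(checks)      # {'7': True, '120': True, '012': False, '12a': False, '': False}
print(first_line)  # first line
```

If a trailing newline in the input is a concern, re.fullmatch with the pattern [1-9][0-9]* is a stricter alternative in Python.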

The quantifier metacharacter {}

If we know for certain we want to find “ahhhhhhhh” with eight “h”, we can more briefly write the regex like this: ah{8}.
Wanting instead to find the exclamations that have from 2 to 6 “h” we write: ah{2,6}.

{min,max} (written without spaces) is called an interval quantifier and allows you to refine the function of the metacharacter *, indicating the minimum and maximum number of occurrences.
As we have seen, if n = min = max we can write {n}. If min is known and max is indeterminate, we can write {min,}.

From what we said it should be clear that:

  • {0,1} is equivalent to ?
  • {0,} is equivalent to *
  • {1,} is equivalent to +

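The first number of the interval should always be specified.
So {min,} is fine, while {,max} is not: classic PCRE treats it as literal text, although a few engines (Python’s re module, for example) read it as {0,max}.
Finally, if we do not use them in these intended forms, the braces return to their literal meaning.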
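
A last sketch for the interval quantifier, assuming Python’s re module. The helper same_matches is hypothetical and only compares two patterns on a handful of samples, so it illustrates the equivalences rather than proving them.

```python
import re

# "a" followed by 1, 2, 6, 7 and 8 "h".
shouts = ["a" + "h" * n for n in (1, 2, 6, 7, 8)]


def h_counts(pattern: str) -> list:
    """Number of "h" in each shout that the pattern matches entirely."""
    return [len(s) - 1 for s in shouts if re.fullmatch(pattern, s)]


exactly_eight = h_counts(r"ah{8}")
two_to_six = h_counts(r"ah{2,6}")
at_least_two = h_counts(r"ah{2,}")

samples = ["a", "ah", "ahh", "ahhh"]


def same_matches(first: str, second: str) -> bool:
    """Hypothetical check: do both patterns agree on every sample?"""
    return all(
        bool(re.fullmatch(first, s)) == bool(re.fullmatch(second, s))
        for s in samples
    )


equivalent = [
    same_matches(r"ah{0,1}", r"ah?"),
    same_matches(r"ah{0,}", r"ah*"),
    same_matches(r"ah{1,}", r"ah+"),
]

# Braces that do not form a valid quantifier are taken literally.
literal_braces = re.fullmatch(r"a{x}", "a{x}") is not None

print(exactly_eight)   # [8]
print(two_to_six)      # [2, 6]
print(at_least_two)    # [2, 6, 7, 8]
print(equivalent)      # [True, True, True]
print(literal_braces)  # True
```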

This was the first part on regular expressions metacharacters, the core of this meta-language. You can find the second part of this article at the following link:

That’s all for regular expressions metacharacters, for today.
Try it at home!
