Regular expressions – insights (part 2)


This article continues the first part on regular expressions, which I recommend reading before this one:

https://blog.sandbay.it/news/php/regular-expressions/

Now let’s look at further constructs and details from the world of regular expressions.

The special metacharacter escape \

What if I want to search for a character that the regex treats as a metacharacter? For example, if I search for the web address “google.it” using the regex google.it, I could also get results like “googledit”.
In a regex, the dot is a metacharacter that stands for any single character, including a space (but, by default in most engines, not a newline).

So how can we do it? We must tell the regex that we want a character taken in its literal meaning, stripped of any metacharacter ‘power’. This is done with the escape metacharacter, represented by the backslash: \. When a metacharacter is preceded by \, it loses its special meaning and returns to its literal one; it is then said to be escaped.
So, as far as the dot is concerned, the previous example should be written google\.it.
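
As a quick check of the difference, here is a minimal sketch in Python (the language and the sample sentence are my own choice, not part of the original example):

```python
import re

text = "Visit googledit or google.it today"

# Unescaped dot: "." matches any character, so "googledit" is found too.
loose = re.findall(r"google.it", text)

# Escaped dot: "\." matches only a literal dot.
strict = re.findall(r"google\.it", text)

print(loose)   # ['googledit', 'google.it']
print(strict)  # ['google.it']
```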

Of course, the \ metacharacter can also be escaped.
If we want to find files like “\my.txt”, we will write: \\my\.txt.

In \* the asterisk is escaped; in \\* it is not: only the backslash is escaped, so the asterisk remains a quantifier (zero or more backslashes).
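
The two cases can be tried out with a small Python sketch. Raw strings (r"...") are used so that Python itself does not interpret the backslashes; the test strings are only illustrative.

```python
import re

# \* escapes the asterisk: the pattern matches one literal "*".
literal_star = re.fullmatch(r"\*", "*") is not None

# \\* escapes only the backslash; "*" is still a quantifier,
# so the pattern means "zero or more backslashes".
many_backslashes = re.fullmatch(r"\\*", r"\\\\") is not None  # four backslashes
empty_string = re.fullmatch(r"\\*", "") is not None
star_rejected = re.fullmatch(r"\\*", "*") is None

# The file name from the text: one escaped backslash, one escaped dot.
file_match = re.fullmatch(r"\\my\.txt", r"\my.txt") is not None

print(literal_star, many_backslashes, empty_string, star_rejected, file_match)
```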

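If you put “\” in front of a non-alphanumeric character that is not a metacharacter, the “\” is simply ignored. So \@, while not making much sense, is equivalent to “@”.
In front of a letter or digit the behaviour depends on the engine: the combination may be a special character (see special characters further on; \v, for instance, is a vertical tab or vertical whitespace in many engines), may be ignored, or may be rejected as an error.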

Let’s look at a somewhat complicated example.
I have to limit a user’s input to numbers only, possibly decimal (using the dot as the decimal separator). That is, the regex must accept only values such as: 0.24, 642, 87.06, 0, 9.5, 7.006, etc.
Since the whole input must be constrained, the regex contains the start-of-line and end-of-line metacharacters: ^ and $.
Then we consider the number as divided into two groups: an integer part and an (optional) decimal part. For the integer part, a first attempt is one or more occurrences of the numeric characters: ([0-9]+).
It seems fine, but it also accepts numbers starting with one or more zeros, such as “000”, “008”, or “04”. To avoid this, we use an alternation: a zero by itself, or a number of one or more digits that does not start with zero: (0|[1-9][0-9]*).
Now let’s consider the decimal part: it is a group of characters that begins with the decimal point and continues with one or more digits. The dot must be escaped, otherwise it would match any character (and “1a5” would be accepted); the group is optional, so we follow it with the question mark: (\.[0-9]+)? .
In conclusion, the regex is:

^(0|[1-9][0-9]*)(\.[0-9]+)?$
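
A possible way to wrap this validation in code is sketched below in Python. The decimal point is written as \. so that it is taken literally; the function name is_valid_number is hypothetical, chosen only for this example.

```python
import re

# Integer part without leading zeros, optional decimal part with a literal dot.
DECIMAL = re.compile(r"^(0|[1-9][0-9]*)(\.[0-9]+)?$")

def is_valid_number(value: str) -> bool:
    """Return True if value looks like 0, 642, 87.06, ... (hypothetical helper)."""
    return DECIMAL.match(value) is not None

accepted = ["0.24", "642", "87.06", "0", "9.5", "7.006"]
rejected = ["000", "008", "04", "1.", ".5", "1a5"]

print([is_valid_number(v) for v in accepted])  # all True
print([is_valid_number(v) for v in rejected])  # all False
```

Note that “1a5” is rejected only because the dot is escaped.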

Special characters in regular expressions

We’re not done with the possibilities of the \ metacharacter yet.
We saw earlier that if we put “\” in front of a metacharacter, it ceases to be a metacharacter and returns to its literal meaning.
Can we expect the reverse? If I put an ordinary character after \, does that character take on the powers of a metacharacter?
Sometimes yes, and we call the result a special character. Let’s see the cases:

  • \a an alarm character (the ASCII character “BEL”, with hexadecimal value 07).
  • \b a word boundary (a zero-width position, not a character); or backspace (the ASCII character “BS”, with hexadecimal value 08) if \b is inside a character class.
  • \B a position that is not a word boundary.
  • \d any character that is a digit; is equivalent to [0-9].
  • \D everything that is not \d; is equivalent to [^0-9] or [^\d].
  • \e an escape character (the ASCII “ESC” character, with hexadecimal value 1b).
  • \f a form feed character (the ASCII “FF” character, with hexadecimal value 0c).
  • \n a newline character (the ASCII “NL” character, with hexadecimal value 0a).
  • \r a carriage return character (the ASCII character “CR”, with hexadecimal value 0d) (enter).
  • \s any whitespace character (space, tab, new line, …); is equivalent to [ \f\n\r\t] (some engines also include the vertical tab).
  • \S everything that is not \s.
  • \t a tab character (the ASCII character “HT”, with hexadecimal value 09).
  • \w any alphanumeric character, including the underscore; is equivalent to [a-zA-Z0-9_].
  • \W anything that is not \w; is equivalent to [^a-zA-Z0-9_] or to [^\w].
  • \xnn the ASCII character whose hexadecimal code is “nn”.
  • \x{nnnn} the Unicode character whose hexadecimal code is “nnnn”.
  • \A text start indicator.
  • \Z end of text indicator.

Note that the special characters that stand for a character or a set of characters (\d, \s, \w, \t, \n, …) can also be safely used within character classes.
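
To see a few of these special characters at work, here is a short Python sketch with a made-up sample string. One assumption to keep in mind: in Python 3, \d, \w and \s are Unicode-aware by default for text patterns, so they match more than the ASCII equivalents listed above unless the re.ASCII flag is passed.

```python
import re

sample = "Room 42,\tfloor_3\n"

digits = re.findall(r"\d+", sample, re.ASCII)     # runs of digits
words = re.findall(r"\w+", sample, re.ASCII)      # runs of [a-zA-Z0-9_]
spaces = re.findall(r"\s", sample, re.ASCII)      # single whitespace characters
non_words = re.findall(r"\W+", sample, re.ASCII)  # runs of everything else

# A special character used inside a character class: digits or dots.
versions = re.findall(r"[\d.]+", "v1.20 and 3", re.ASCII)

print(digits)     # ['42', '3']
print(words)      # ['Room', '42', 'floor_3']
print(spaces)     # [' ', '\t', '\n']
print(non_words)  # [' ', ',\t', '\n']
print(versions)   # ['1.20', '3']
```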

\b indicates the starting point or the ending point of a word, indifferently.
Think of an imaginary, dimensionless point that has a \w on one side and a \W on the other.
With \b you can easily search for a whole word, where word means a sequence of alphanumeric characters; for example, in the following text \bA1\b finds only the first “A1”:

The A1 and A14 motorways today are very …

Notice that we do not find the second occurrence of “A1”, because it is not a word of its own but part of the word “A14”.
This meaning of \b would no longer make any sense within a character class [], so in this context [\b] takes on the meaning of a backspace character.
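
The motorway example can be reproduced with a few lines of Python (a sketch; the sentence is shortened to its first part):

```python
import re

text = "The A1 and A14 motorways"

# Without boundaries, "A1" is also found inside "A14".
plain_starts = [m.start() for m in re.finditer(r"A1", text)]

# With \b on both sides, only the stand-alone word "A1" is found.
word_starts = [m.start() for m in re.finditer(r"\bA1\b", text)]

# Inside a character class, \b means the backspace character (0x08).
has_backspace = re.search(r"[\b]", "a\x08c") is not None

print(plain_starts)   # [4, 11]
print(word_starts)    # [4]
print(has_backspace)  # True
```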

To end a line of text Windows uses \r\n, Unix only uses \n.

While \t represents only a tab character, \s represents a set of characters, all of which create whitespace.
A user input could contain an unspecified number of spaces or tabs: so instead of inserting an expression like [ \t]* into the regex, we can more simply write \s* (keeping in mind that \s also matches line breaks).
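
As an illustration, the following Python sketch uses \s* to tolerate any mix of spaces and tabs around an “=” sign; the key/value input is invented for the example. It also shows the one practical difference from [ \t]*, namely that \s accepts a line break.

```python
import re

line = "name \t =\t  value"

# \s* absorbs any run of spaces and tabs around "=".
m = re.fullmatch(r"(\w+)\s*=\s*(\w+)", line)
print(m.groups())  # ('name', 'value')

# [ \t]* does not cross a line break, \s* does.
tabs_only = re.fullmatch(r"a[ \t]*b", "a\nb")
any_space = re.fullmatch(r"a\s*b", "a\nb")
print(tabs_only is None, any_space is not None)  # True True
```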

\w is very useful in the form \w+, to indicate any word.

We can also insert characters not present on the keyboard in the regex: if for example we want to insert the copyright character ©, we use \xA9, where “A9” is the hexadecimal code of ©.
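
A tiny Python check of the hexadecimal form (the sample text is made up):

```python
import re

text = "Site content © the authors"

# \xA9 in the pattern stands for the character with hexadecimal code A9.
match = re.search(r"\xA9", text)

print(match.group() == "©")     # True
print(hex(ord(match.group())))  # 0xa9
```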

Quantifying metacharacters are greedy

The quantifiers in a regex are called greedy because they match as much as they can. Let’s clarify with an example:
“We are all greedy quantifiers”
To find strings starting with “ar” and ending with “e” we use the regex ar.*e. Here is the result: “are all greedy quantifie”.
The regex matched everything it could: it could have stopped at “are”, but instead it went ahead, found another occurrence of “e”, and included everything up to it in its result.
If this is not what we want, we need to be aware of it and act accordingly. In this example a solution could be to not accept any other “e” in between: ar[^e]*e, but below we will see in more depth how the quantifiers can be made non-greedy.
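
The behaviour is easy to verify with a short Python sketch on the same sentence:

```python
import re

text = "We are all greedy quantifiers"

# Greedy: .* runs to the end and backtracks to the LAST "e".
greedy = re.search(r"ar.*e", text).group()

# Excluding "e" from the repeated part stops at the FIRST "e".
restricted = re.search(r"ar[^e]*e", text).group()

print(greedy)      # are all greedy quantifie
print(restricted)  # are
```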

How to make quantifiers non-greedy

We saw how the quantifiers ?, +, * and {} are greedy, because they find everything that is possible based on what is specified in the regex.
But sometimes you may want them to find as little as possible, making them non-greedy or, as is sometimes written, lazy.
This is done by simply adding a ? immediately after the quantifier used.

Let’s consider the regex (hi)+ and the text:

He began to giggle "hihihihihihi …" horribly

Since the quantifier + is greedy, it will find the entire string “hihihihihihi”.
Instead with the regex (hi)+? you will only get the first “hi”.
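
In Python the comparison could be sketched like this:

```python
import re

text = 'He began to giggle "hihihihihihi …" horribly'

greedy = re.search(r"(hi)+", text).group()   # as many "hi" as possible
lazy = re.search(r"(hi)+?", text).group()    # as few "hi" as possible

print(greedy)  # hihihihihihi
print(lazy)    # hi
```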

Here are the quantifiers in their non-greedy form:

  • +?
  • *?
  • ??
  • {min,max}?

Let’s see with an example how they can be useful.
Suppose we want to locate the pieces of code enclosed by the <i> and </i> tags in an HTML file.
If we used <i>.*</i>, due to the greed of .* we would sometimes get unwanted results. In a line of code like this:

Remember, <i>Javascript</i> is not <i>Java</i>.

our regex would get:

<i>Javascript</i> is not <i>Java</i>

which is too much.
If the regex is non-greedy instead, <i>.*?</i> will stop at the first closing </i> tag it finds, giving <i>Javascript</i>, which is just what we wanted.
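
The same line can be run through both versions of the regex in Python (a sketch). Searching for all matches also shows that the non-greedy form finds each pair of tags separately.

```python
import re

line = "Remember, <i>Javascript</i> is not <i>Java</i>."

greedy = re.findall(r"<i>.*</i>", line)
lazy = re.findall(r"<i>.*?</i>", line)

print(greedy)  # ['<i>Javascript</i> is not <i>Java</i>']
print(lazy)    # ['<i>Javascript</i>', '<i>Java</i>']
```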

Negative lookahead (?!abc)

Sometimes we want to accept a whole family of character combinations except for one specific combination.
In these cases the negative lookahead comes in handy, with this syntax: (?!abc)

This syntax must be paired with a pattern for what is allowed, such as a specific combination of characters or simply everything else.

^(?!007)[0-9]{3}$

With this regular expression, for example, we establish that any combination of 3 digits is accepted except 007. More than one forbidden combination can also be specified.

^(?!(007|123))[0-9]{3}$

In this case we are saying that neither the combination 007 nor the combination 123 is allowed. Of course, the same negation works for any other combination of digits or letters; the following accepts any word made of letters that does not begin with “Potter”:

^(?!Potter)[a-zA-Z]+$
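
The three lookahead patterns above can be exercised with a small Python sketch; the test values are invented. Because the lookahead sits right after ^, it only inspects the beginning of the string, so “HarryPotter” is still accepted by the last pattern.

```python
import re

not_007 = re.compile(r"^(?!007)[0-9]{3}$")
not_007_or_123 = re.compile(r"^(?!(007|123))[0-9]{3}$")
not_potter = re.compile(r"^(?!Potter)[a-zA-Z]+$")

def ok(pattern: re.Pattern, value: str) -> bool:
    # Hypothetical helper: True when the whole value is accepted.
    return pattern.match(value) is not None

print(ok(not_007, "123"), ok(not_007, "070"), ok(not_007, "007"))
# True True False
print(ok(not_007_or_123, "124"), ok(not_007_or_123, "123"))
# True False
print(ok(not_potter, "Granger"), ok(not_potter, "HarryPotter"),
      ok(not_potter, "Potter"), ok(not_potter, "Potterhead"))
# True True False False
```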

Sub-expressions with back references – regular expressions

We use the so-called backreferences in those cases in which it is necessary to recover an occurrence already found previously.
Having grouped parts of regex into sub-expressions with round brackets, we can refer to their respective results, in order, from left to right, with \1, \2, \3, etc.
It is important to note that when we write \1 we are referring to the result of the first subexpression, not the subexpression itself.
Suppose we want to find, in an HTML file, all the tags that have an opening and a closing, together with their content. Taking the HTML code below as an example, we want to write a regex that matches the <a>…</a> element and the <span>…</span> element:

We have a link: <a href="https://www.google.com">Google</a> and we have a &lt;span&gt;:
<span class="aspan">a span</span>.

The first symbol to find is “<”, followed by a lowercase letter and any alphanumeric characters that make up the name of that particular tag. It is useful to capture the tag name in a sub-expression, because we will need it again at the end of the regex to write the closing tag. So the start of our regex will be: <([a-z][a-z0-9]*). Now we have to allow for other characters, which define any attributes of the tag, and then the “>” character; remember that quantifiers are greedy: we must ensure that the regex cannot run past a “>” character, but stops at the first one it finds. This part of the regex will then be: [^>]*>.
Between the opening and closing tags there can be any text: we indicate this possibility with .*? .
We have arrived at the closing tag; it starts with “<” followed by “/”, then the same tag name and “>”.
The part of the regex that deals with the closing tag is therefore </\1>.

In conclusion, putting all the pieces together, the regex sought is:

<([a-z][a-z0-9]*)[^>]*>.*?</\1>
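
Here is a sketch of this regex applied to the sample HTML, in Python. finditer is used instead of findall because, when a pattern contains a group, findall returns only the group contents rather than the whole match.

```python
import re

html = (
    'We have a link: <a href="https://www.google.com">Google</a> '
    'and we have a &lt;span&gt;:\n'
    '<span class="aspan">a span</span>.'
)

# \1 must repeat exactly the tag name captured by the first group.
tag_pair = re.compile(r"<([a-z][a-z0-9]*)[^>]*>.*?</\1>")

matches = [m.group(0) for m in tag_pair.finditer(html)]
names = [m.group(1) for m in tag_pair.finditer(html)]

print(matches)
# ['<a href="https://www.google.com">Google</a>', '<span class="aspan">a span</span>']
print(names)
# ['a', 'span']
```

The escaped &lt;span&gt; in the text is not matched, because it contains no real “<” character.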


To check regular expressions, you can use sites that simulate a search: https://regex101.com/

This is a small table of regular expressions that you might find useful:

Request: Generic numeric string (with optional “+” or “-” sign).
Regexp solution: ^[-+]?[0-9]+$

Request: Only valid numbers, even decimal (using the dot as a decimal separator). Ex.: 0.24, 642, 87.06, 0, 9.5, 7.006, etc.
Regexp solution: ^(0|[1-9][0-9]*)(\.[0-9]+)?$

Request: Mobile phone (with nation prefix).
Regexp solution: ^[+]?[\s0-9]{5,20}$

Request: CAP (Italian postal code) number.
Regexp solution: ^[0-9]{5}$

Request: Birthday (with slash “/” format): mm/dd/yyyy
Regexp solution: ^(0[1-9]|1[0-2])\/(0[1-9]|1[0-9]|2[0-9]|3[01])\/(19|20)[0-9]{2}$

Request: Url with domain.
Regexp solution: ^(https?:\/\/)?(www\.)?[-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_+.~#?&/=]*)?

Request: Simple mail (there are more complex alternatives).
Regexp solution: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

Request: The string must be between 2 and 20 characters in length. No spaces or special characters are allowed except for: period “.”, hyphen “-” or underscore “_”. The string cannot end with any special characters.
Regexp solution: ^[\w.-]{1,19}[a-zA-Z0-9]$

Request: Every HTML tag opening and closing.
Regexp solution: <([a-z][a-z0-9]*)[^>]*>.*?</\1>

Request: An Italian POD id. A string of 14 or 15 characters that starts with “IT”, has digits as its 3rd to 5th characters, has “E” as its 6th character, and has digits for the remaining 8 or 9 characters, except for PODs that start with “IT002”, whose last character must be “A”; all case insensitive (Javascript version).
Regexp solution: /^IT((?!002)[0-9]{3}E[0-9]{8,9}|002E[0-9]{7,8}A)$/i
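
When a pattern combines anchors with alternation, as the POD pattern does, it is worth checking that ^ and $ apply to both branches. The sketch below, written in Python rather than Javascript, compares a form where the alternation is wrapped in one group with a form where it is not; the sample IDs are made up for illustration and are not real PODs.

```python
import re

# Alternation inside one group: "^" and "$" apply to both branches.
GROUPED = re.compile(
    r"^IT((?!002)[0-9]{3}E[0-9]{8,9}|002E[0-9]{7,8}A)$", re.IGNORECASE
)

# Alternation at top level: "^" binds only to the first branch,
# "$" only to the second one.
UNGROUPED = re.compile(
    r"^IT(?!002)[0-9]{3}E[0-9]{8,9}|IT002E[0-9]{7,8}A$", re.IGNORECASE
)

def matches(pattern: re.Pattern, value: str) -> bool:
    # Hypothetical helper for this example only.
    return pattern.search(value) is not None

for pod in ["IT001E12345678", "it002e1234567a", "IT002E12345678",
            "XXIT002E1234567A", "IT001E12345678ZZZ"]:
    print(pod, matches(GROUPED, pod), matches(UNGROUPED, pod))
```

The last two values are accepted only by the ungrouped form, which shows why the group matters.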

That’s all for today.
Try it at home!
