
Finding multiline matches with PHP regular expressions

Multiline in PHP regular expressions

By default, a PHP regular expression looks for each match within a single line. In this mode the “.” (dot) metacharacter, usually described as “any character”, actually means “any character except a newline”.

This default can be changed with a pattern modifier when you need matches that span several lines, i.e. multi-line matches. This article explains how to do that, and why the “^” (beginning of line) and “$” (end of line) anchors do not act as line anchors by default.
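As a quick illustration of the dot's default behavior, here is a minimal sketch. The two-line string is a made-up sample, not the demo file used in the rest of the article.

```php
<?php
// Hypothetical two-line sample text (not the article's demo file).
$text = "first line\nsecond line";

// No modifier: the dot stops at the newline.
preg_match('/first.*/', $text, $default);

// "s" modifier: the dot also matches the newline.
preg_match('/first.*/s', $text, $dotAll);

var_dump($default[0] === "first line");             // bool(true)
var_dump($dotAll[0] === "first line\nsecond line"); // bool(true)
```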

HTML example for parsing and testing regular expressions

The examples below use the following HTML code as the search text.

<!DOCTYPE html>
<html>

<head>
	<title>HTML example for ZaLinux.ru and Suay.Site</title>

	<link rel="stylesheet" href="highlightjs/vs.min.css">
	<script src="highlightjs/highlight.min.js"></script>
	<script>hljs.highlightAll();</script>
</head>

<body>
<h2>An Unordered HTML List</h2>
<ul>
	<li>Coffee
	<li>Tea
	<li>Milk
</ul>

<h2>An Ordered HTML List</h2>
<ol>
	<li>Coffee
	<li>Tea
	<li>Milk
</ol>

<h2>HTML styles</h2>
<p>I am normal
<p style="color:red;">I am red
<p style="color:blue;">I am blue
<p style="font-size:50px;">I am big

<h2>Source code of this page:</h2>
<pre><code>
&lt;!DOCTYPE html&gt;
&lt;html&gt;

&lt;head&gt;
	&lt;title&gt;HTML example for ZaLinux.ru and Suay.Site&lt;/title&gt;

	&lt;link rel=&quot;stylesheet&quot; href=&quot;highlightjs/vs.min.css&quot;&gt;
	&lt;script src=&quot;highlightjs/highlight.min.js&quot;&gt;&lt;/script&gt;
	&lt;script&gt;hljs.highlightAll();&lt;/script&gt;

&lt;/head&gt;



&lt;body&gt;

&lt;h2&gt;An Unordered HTML List&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Coffee&lt;/li&gt;
  &lt;li&gt;Tea&lt;/li&gt;
  &lt;li&gt;Milk&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;An Ordered HTML List&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;Coffee&lt;/li&gt;
  &lt;li&gt;Tea&lt;/li&gt;
  &lt;li&gt;Milk&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;HTML styles&lt;/h2&gt;

&lt;p&gt;I am normal&lt;/p&gt;
&lt;p style=&quot;color:red;&quot;&gt;I am red&lt;/p&gt;
&lt;p style=&quot;color:blue;&quot;&gt;I am blue&lt;/p&gt;
&lt;p style=&quot;font-size:50px;&quot;&gt;I am big&lt;/p&gt;

&lt;h2&gt;Something else&lt;/h2&gt;

&lt;div class=&quot;maincontent&quot;&gt;
	&lt;p&gt;The most useful content is here.
&lt;/div&gt;

&lt;div class=&quot;maincontent&quot;&gt;&lt;p&gt;One-liner DIV.&lt;/div&gt;

&lt;div class=&quot;maincontent&quot;&gt;
	&lt;p&gt;Multi line
	&lt;p&gt;DIV content
	&lt;p&gt;here.
&lt;/div&gt;

&lt;/body&gt;
</code></pre>

<h2>Something else</h2>
<div class="maincontent">
	<p style="color:red;">The most useful content is here.
</div>

<div class="maincontent"><p><i>One-liner DIV</i></div>

<div class="maincontent">
	<p><b>Multi line DIV</b></p>
	<p><b> content here.</b></p>
	<p><b>I have a P ending tag for some reason.</b></p>
</div>

</body>
</html>

This code doesn't make much sense, but it contains various multi-line elements that suit the purposes of this article.

Finding matches that include a newline

Let's try to find the HTML DIV tags with the following example:

<?php

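// Load the sample HTML shown above.
$html = file_get_contents('demo3.htm');
// Lazy match from each opening DIV tag to the nearest closing tag.
preg_match_all('#<div class="maincontent">.*?</div>#', $html, $result);
print_r($result);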

It finds only one DIV, the single-line one.

This regular expression uses “.*” (dot and asterisk): the dot means “any character” and the asterisk means “any number of times, including zero”. Despite this construct, the multi-line DIV tags were not found. As already mentioned, this is because by default the dot does not match a newline character.

The question mark after “.*” makes the quantifier lazy: it matches as few characters as possible, so each match ends at the first “</div>” that follows.
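The difference between the greedy and the lazy form can be sketched on a single line. The sample string below is hypothetical and only serves the illustration.

```php
<?php
// Hypothetical single-line sample with two DIV elements.
$line = '<div>one</div><div>two</div>';

// Greedy: ".*" grabs as much as possible, up to the last closing tag.
preg_match('#<div>.*</div>#', $line, $greedy);

// Lazy: ".*?" grabs as little as possible, up to the first closing tag.
preg_match('#<div>.*?</div>#', $line, $lazy);

echo $greedy[0], "\n"; // <div>one</div><div>two</div>
echo $lazy[0], "\n";   // <div>one</div>
```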

How to match a newline in a PHP regular expression

A newline (hex 0A) is denoted as:

\n

There is also a carriage return (hex 0D) notation:

\r

Let's add newlines to our regular expression:

$html = file_get_contents('demo3.htm');
preg_match_all ('#<div class="maincontent">\n.*?\n</div>#', $html, $result);
print_r ($result);

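As you can see, a different DIV is found now: the three-line one.

But the single-line DIV and the DIV with several lines of content are not found, because the pattern allows exactly one line between the opening and closing tags.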
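The limitation of hard-coded “\n” characters can be reproduced without the demo file. The sketch below assumes a small made-up HTML fragment with three DIVs; only the one with exactly one content line fits the pattern.

```php
<?php
// Hypothetical sample: three DIVs with one, zero and two content lines.
$html = implode("\n", [
    '<div class="maincontent">',
    "\t<p>One content line.",
    '</div>',
    '<div class="maincontent"><p>One-liner.</div>',
    '<div class="maincontent">',
    "\t<p>First content line.",
    "\t<p>Second content line.",
    '</div>',
]);

// Exactly two newlines are allowed: one after the opening tag and one
// before the closing tag. The dot in between still cannot cross a line.
$count = preg_match_all('#<div class="maincontent">\n.*?\n</div>#', $html, $result);

echo $count, "\n";        // 1
echo $result[0][0], "\n"; // the three-line DIV only
```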

PHP regex to search all text without line splitting

For a multi-line regular expression search, add the “s” pattern modifier. If you use this modifier, then the metacharacter “.” really starts to mean “any character”, including line breaks.

preg_match_all ('#<div class="maincontent">.*?</div>#s', $html, $result);

All three DIV tags are now found, regardless of the number of lines in them:

    [0] => <div class="maincontent">
	<p style="color:red;">The most useful content is here.
</div>
    [1] => <div class="maincontent"><p><i>One-liner DIV</i></div>
    [2] => <div class="maincontent">
	<p><b>Multi line DIV</b></p>
	<p><b> content here.</b></p>
	<p><b>I have a P ending tag for some reason.</b></p>
</div>
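The effect of the “s” modifier can also be checked on a small self-contained sample. This is only a sketch: the HTML fragment is made up, and the same pattern is run once without and once with the modifier.

```php
<?php
// Hypothetical sample: three DIVs with one, zero and two content lines.
$html = implode("\n", [
    '<div class="maincontent">',
    "\t<p>One content line.",
    '</div>',
    '<div class="maincontent"><p>One-liner.</div>',
    '<div class="maincontent">',
    "\t<p>First content line.",
    "\t<p>Second content line.",
    '</div>',
]);

$pattern = '#<div class="maincontent">.*?</div>#';

// Without "s" the dot cannot cross a newline: only the one-liner matches.
$withoutS = preg_match_all($pattern, $html, $singleLine);

// With "s" the dot matches newlines too: all three DIVs match.
$withS = preg_match_all($pattern . 's', $html, $dotAll);

echo $withoutS, "\n"; // 1
echo $withS, "\n";    // 3
```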

How to mark the start and end of a line in a PHP regular expression

Consider the following example, in which we try to find all HTML P tags. The HTML code has opening P tags but no closing P tags. This is not entirely correct, but a web browser still displays such markup properly. It means we cannot search for lines like

<p>…….</p>

In this case, you need to look for lines like

<p>…….$

Where the symbol “$” means “end of line”.

So, the regular expression for searching for P tags is:

<?php

$html = file_get_contents('demo3.htm');
preg_match_all ('#(<p>.*?$)|(<p .*?$)#', $html, $result);
print_r ($result[0]);

Contrary to expectations, nothing is found.

The reason is that by default the “^” and “$” metacharacters match the beginning and the end of the whole processed text, even if the text is divided into several lines. To change this default behavior, use the “m” pattern modifier. With this modifier, “^” and “$” match at the beginning and the end of every line. If the text being processed does not contain newlines, or if the pattern does not contain “^” or “$”, this modifier has no effect.

Let's try:

$html = file_get_contents('demo3.htm');
preg_match_all ('#(<p>.*?$)|(<p .*?$)#m', $html, $result);
print_r ($result[0]);

Now all P tags are found.
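The same behavior can be sketched without the demo file. The three-line string below is a hypothetical sample; the pattern is the one used above.

```php
<?php
// Hypothetical sample: two unclosed P tags followed by another line.
$text = "<p>I am normal\n<p style=\"color:red;\">I am red\n</body>";

$pattern = '#(<p>.*?$)|(<p .*?$)#';

// Without "m", "$" only matches at the end of the whole text,
// and the dot cannot get there across the newlines.
$withoutM = preg_match_all($pattern, $text, $none);

// With "m", "$" matches at the end of every line.
$withM = preg_match_all($pattern . 'm', $text, $lines);

echo $withoutM, "\n"; // 0
echo $withM, "\n";    // 2
print_r($lines[0]);
```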

How to use “s” and “m” pattern modifiers together in PHP regular expressions

A question arises: if the “s” modifier effectively turns all the text into one line for the dot, what happens when the “m” modifier is added as well?

In fact, the two modifiers do not interfere with each other. If both are used in the same regular expression, the following effect is achieved:

  • metacharacter “.” (dot) starts to mean “any character, including newline”
  • metacharacters “^” and “$” start to mean “beginning of line” and “end of line”, respectively.

Let's try to find all OL, UL tags (including multi-line ones), as well as all P tags:

preg_match_all ('#(<p>.*?$)|(<p .*?$)|(<ol.*?</ol>)|(<ul.*?</ul>)#ms', $html, $result);

The result is what we expected.
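A minimal sketch of both modifiers working together, assuming a made-up fragment with one multi-line list and two unclosed P tags:

```php
<?php
// Hypothetical sample: a multi-line UL followed by two unclosed P tags.
$html = "<ul>\n\t<li>Coffee\n\t<li>Tea\n</ul>\n"
      . "<p>I am normal\n<p style=\"color:red;\">I am red\n";

// "s" lets the dot cross newlines inside <ul>...</ul>;
// "m" lets "$" stop each P match at the end of its own line.
$count = preg_match_all('#(<p>.*?$)|(<p .*?$)|(<ul.*?</ul>)#ms', $html, $result);

echo $count, "\n"; // 3
print_r($result[0]);
```

The lazy quantifier matters here: with “s” enabled, a greedy “.*” before “$” would run on to the last line end in the text instead of stopping at the first one.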

What is the difference between the newline character “\n” and the metacharacters “^” and “$” in PHP regular expressions

The differences are:

1. The “\n” character stands for “newline” even without the “m” pattern modifier. Here is the previous example, slightly modified: the “m” pattern modifier is removed and the “$” characters are replaced with “\n”:

preg_match_all ('#(<p>.*?\n)|(<p .*?\n)|(<ol.*?</ol>)|(<ul.*?</ul>)#s', $html, $result);

On the one hand, the result is the same: the same 11 matches are found. On the other hand, the output looks different, as if the list had more blank space in it. Point 2 explains why.

2. With “\n”, the newline character becomes part of each match, so every found line ends with a newline. The “$” metacharacter is a zero-width assertion: it only checks the position and adds nothing to the match.

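3. Neither approach automatically handles text from every operating system. By default, PCRE usually recognizes only “\n” as a newline, which is the convention on Linux and modern macOS. Windows text uses “\r\n”, and classic Mac OS used “\r”. In Windows text, “$” with the “m” modifier still matches before each “\n”, but a preceding “.” can match the “\r”, which then ends up in the match. In text that uses only “\r”, neither “$” nor “\n” finds the line ends. To cover all three conventions, match the line break with “\R”, or start the pattern with “(*ANYCRLF)” so that “^”, “$” and “.” treat “\r”, “\n” and “\r\n” as newlines.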
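Line endings other than “\n” deserve a quick check. The sketch below assumes a hypothetical Windows-style sample string and uses the PCRE newline verbs “(*LF)” and “(*ANYCRLF)” at the start of the pattern to show how the newline convention changes what “$” and “.” match.

```php
<?php
// Hypothetical sample with Windows line endings (\r\n).
$windows = "<p>one\r\n<p>two\r\n";

// (*LF): only "\n" counts as a newline, so the dot matches "\r"
// and the carriage return ends up inside each match.
preg_match_all('#(*LF)<p>.*?$#m', $windows, $lf);

// (*ANYCRLF): "\r", "\n" and "\r\n" all count as newlines,
// so "$" matches before "\r\n" and the dot does not match "\r".
preg_match_all('#(*ANYCRLF)<p>.*?$#m', $windows, $anyCrLf);

var_dump($lf[0] === ["<p>one\r", "<p>two\r"]); // bool(true)
var_dump($anyCrLf[0] === ["<p>one", "<p>two"]); // bool(true)
```

If changing the newline convention for the whole pattern is not desirable, trimming the matches with rtrim() is a simpler alternative.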

Conclusion

So, if you want “.” to match across line breaks, so that a single match can span several lines of the text, use the “s” pattern modifier.

To make the “^” and “$” metacharacters mean the beginning and the end of each line, add the “m” pattern modifier to the regular expression.

To match a line break at a specific location in a pattern, use the “\n” escape sequence, which becomes part of the match. To anchor a pattern at a line boundary without including the line break in the match, use the “^” and “$” metacharacters together with the “m” pattern modifier.

