PCRE Pattern - Extracting URLs
To complete my BBCode-parsing PHP class file, I had to figure out a way to collect 'raw' URLs from user input and turn them into links. Deciding which regular expression function and syntax to use was not the problem: I settled on the PCRE function preg_match() instead of the POSIX function ereg().
Why choose preg_match() and not ereg()?
Well, that's actually a 'no-brainer'. If you look up the function ereg() in the PHP Manual, this is what they say right at the top:
preg_match(), which uses a Perl-compatible regular expression syntax, is often a faster alternative to ereg().
Frankly, that was the EASY part... I still had to work out a decent pattern! It took me some time and many tests with all the different URLs I could think of, but I finally came up with this.
The pattern syntax
To help me extract URLs easily, this little PCRE pattern does the job well enough. It's probably not perfect, but it handles collecting 'raw' URLs better than some popular bulletin boards and forums do. Anyway, let's quickly look at the PCRE pattern syntax:
<?php
$urlpattern = '/((http|https|ftp):\/\/|www)' // line 1
.'[a-z0-9\-\._]+\/?[a-z0-9_\.\-\?\+\/~=&#;,]*' // line 2
.'[a-z0-9\/]{1}/si'; // line 3
?>
I've broken that one long pattern into 3 lines just so that it fits on this web page and so that I can explain one line at a time. If you decide to test this pattern out, you can keep it all on one line; it works the same either way.
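To try the pattern out, here is a minimal sketch with everything on one line. The helper name extractUrls() and the sample text are my own inventions for illustration, and it uses preg_match_all() rather than preg_match() so that every URL in the text is collected, not just the first one.

```php
<?php
// Hypothetical helper (not part of the original class): returns every raw URL found in $text.
function extractUrls(string $text): array
{
    // The same pattern as above, written on a single line.
    $urlpattern = '/((http|https|ftp):\/\/|www)[a-z0-9\-\._]+\/?[a-z0-9_\.\-\?\+\/~=&#;,]*[a-z0-9\/]{1}/si';

    // $matches[0] holds the full text of each match.
    preg_match_all($urlpattern, $text, $matches);

    return $matches[0];
}

// Made-up sample input: note the comma and the full stop right after the URLs.
print_r(extractUrls('Visit http://example.com/page?id=1, or www.example.org.'));
// The two matches are "http://example.com/page?id=1" and "www.example.org".
```

The comma and the final full stop are left out of the matches, which is exactly what you want when the URL sits in the middle of a sentence.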
OK, explain this PCRE pattern to me
The pattern in Line 1 matches the start of a URL, which must be either:
- the word www, OR
- one of the following:
  - the word http, or
  - the word https, or
  - the word ftp
  followed in each case by a colon ( : ) and 2 slashes ( // )
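The Line 1 rules can be checked on their own with a small sketch like this one. The helper name hasUrlPrefix() and the test strings are hypothetical, and the leading ^ is added only for this demo so that the check applies to the start of each string; the full pattern is not anchored.

```php
<?php
// Hypothetical helper: tests only the Line 1 part of the pattern.
// The leading ^ is added for this demo; the article's pattern has no anchor.
function hasUrlPrefix(string $s): bool
{
    return preg_match('/^((http|https|ftp):\/\/|www)/i', $s) === 1;
}

$samples = ['http://a.com', 'https://a.com', 'ftp://a.com', 'www.a.com', 'http:/a.com', 'mailto:me@a.com'];

foreach ($samples as $s) {
    echo $s, ' => ', hasUrlPrefix($s) ? 'match' : 'no match', "\n";
}
```

The first four samples match; the last two do not, because one is missing a slash and the other uses a scheme that is not in the list.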
The pattern in Line 2 first matches 1 or more characters that are each alphanumeric, a hyphen, a dot or an underscore, followed by an optional slash. Next, it matches 0 or more characters from the same set of 'allowed characters' as before, only this time a few more special characters are allowed as well: ?, +, /, ~, =, &, #, ; and even the comma.
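To see which part of a URL each half of Line 2 picks up, here is a rough sketch that wraps the two halves in capturing groups. The helper name splitAfterPrefix(), the added ^ and $ anchors, the extra parentheses and the sample input are all my own additions for the demo; they are not in the pattern above.

```php
<?php
// Hypothetical helper: applies only the Line 2 part to the text that follows the prefix.
// Group 1 = first character set plus the optional slash, group 2 = the extended set.
function splitAfterPrefix(string $rest): ?array
{
    $line2 = '/^([a-z0-9\-\._]+\/?)([a-z0-9_\.\-\?\+\/~=&#;,]*)$/i';

    if (preg_match($line2, $rest, $m) !== 1) {
        return null; // a character outside both sets was found
    }

    return [$m[1], $m[2]];
}

print_r(splitAfterPrefix('example.com/path/page.php?id=5&x=1'));
// Group 1 is "example.com/" and group 2 is "path/page.php?id=5&x=1".
```

So the first set covers roughly the host name, and the second, wider set covers the path and the query string.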
Finally, Line 3 ensures that our URL match ends at either an alphanumeric character or a slash. That's really helpful if you don't want a punctuation mark slipping into our URL or link! The {1} bit says that exactly one character, the last one of the match, is checked against this set. (Strictly speaking it is redundant, because a character class matches exactly one character anyway.)
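Putting it all together, here is one possible sketch of turning the matches into links, with the trailing punctuation staying outside the link thanks to Line 3. The helper name linkifyUrls() is hypothetical, and prepending http:// to matches that begin with www is my own assumption; this is not the code from my BBCode class.

```php
<?php
// Hypothetical helper: wraps each raw URL in $text in an <a> tag.
function linkifyUrls(string $text): string
{
    $urlpattern = '/((http|https|ftp):\/\/|www)[a-z0-9\-\._]+\/?[a-z0-9_\.\-\?\+\/~=&#;,]*[a-z0-9\/]{1}/si';

    return preg_replace_callback($urlpattern, function (array $m): string {
        $url = $m[0];
        // Assumption: a match starting with "www" gets an http:// scheme in the href.
        $href = (stripos($url, 'www') === 0) ? 'http://' . $url : $url;

        return '<a href="' . htmlspecialchars($href) . '">' . htmlspecialchars($url) . '</a>';
    }, $text);
}

echo linkifyUrls('See www.example.com/docs/, then reply.');
// The comma after "docs/" ends up outside the closing </a> tag.
```

Only the matched URLs are touched here; the rest of the text passes through unchanged, so any escaping of the surrounding user input still has to happen elsewhere.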
Hey! What about that '/si' bit?
Oh! That's coming soon in another article...
