python-regex


Python provides the re module for working with regular expressions. It is useful for string searching, text manipulation, and web scraping. The regex syntax and features of re are very similar to Perl's.

The re module provides several useful functions, including search, match, findall, split, and sub.

Example (search):

#!/usr/bin/env python3
import re
x = "this is bitarray.io website, remember io!"
# two groups: everything before " web" and everything after it
z = re.search(r'(.*) web(.*)', x)
print(z.group(1))
print(z.group(2))

===output====
this is bitarray.io
site, remember io!

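In the above example, search returns a match object, which we store in “z”. Each pair of parentheses in the pattern creates a group, and the text captured by each group is available through z.group(1), z.group(2), and so on. Groups are very useful for parsing lines and extracting specific parts of the text.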
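As a further illustration of groups, here is a minimal sketch that pulls three fields out of a line. The log line and its layout are made up for this example; the pattern assumes the first two fields contain no spaces.

```python
import re

# Hypothetical log line, invented for illustration.
line = "2024-01-15 ERROR disk full"

# Three groups: first word, second word, and the rest of the line.
m = re.search(r"(\S+) (\S+) (.*)", line)

if m is not None:
    date, level, message = m.groups()  # all captured groups as a tuple
    print(date)     # 2024-01-15
    print(level)    # ERROR
    print(message)  # disk full
```

groups() returns every captured group at once, which is often more convenient than calling group(n) several times.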

 


search vs match

search function: finds a match anywhere in the string.

match function: finds a match only at the beginning of the string.

Example (match):

#!/usr/bin/env python3
import re
x = "this is bitarray.io website, remember io!"
# the string does not start with "web", so match returns None
z = re.match(r'web(.*)', x)
print(type(z))
print(z.group(1))

======output==========
<class 'NoneType'>
Traceback (most recent call last):
  File "./p1", line 7, in <module>
    print(z.group(1))
AttributeError: 'NoneType' object has no attribute 'group'

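It fails because the string does not start with "web": match returns None, and calling group on None raises AttributeError.

The example below works because (.*) can consume everything before "web", so the pattern matches from the beginning of the string. This is the key difference between match and search, and a common interview question.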

#!/usr/bin/env python3
import re
x = "this is bitarray.io website, remember io!"
# (.*) consumes the text before "web", so the match starts at position 0
z = re.match(r'(.*)web(.*)', x)
print(type(z))
print(z.group(1))

===output===
<class 're.Match'>
this is bitarray.io 
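Since both search and match return None when nothing matches, it is usually safer to check the result before calling group. A minimal sketch, where first_group is a hypothetical helper written for this example and not part of the re module:

```python
import re

x = "this is bitarray.io website, remember io!"

def first_group(result):
    """Return group(1) of a match result, or None if nothing matched."""
    return result.group(1) if result is not None else None

# match only looks at the start of the string, so this finds nothing
print(first_group(re.match(r"web(.*)", x)))    # None

# search scans the whole string, so the same pattern succeeds
print(first_group(re.search(r"web(.*)", x)))   # site, remember io!
```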

 


special characters

Special character    How it works
^                    matches start of the string
$                    matches end of the string
*                    matches repetitions of previous pattern 0 or more times
+                    matches repetitions of previous pattern 1 or more times
?                    matches repetitions of previous pattern 0 or 1 time
[]                   used to list a set of characters
(...)                used to represent a group
\d                   matches digits; \d+ will match continuous digits
\D                   matches any non-digit character
\s                   matches any whitespace character; \s+ for multiple whitespaces
\S                   matches any non-whitespace character
.                    matches any character (NOT newline)
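A few of these special characters in action. This is a small sketch using findall (which returns every non-overlapping match as a list) and search; the sample strings are made up for illustration.

```python
import re

text = "order 66 shipped, order 7 pending"

print(re.findall(r"\d+", text))                   # ['66', '7']
print(re.findall(r"[aeiou]", "regex"))            # ['e', 'e']
print(re.findall(r"colou?r", "color colour"))     # ['color', 'colour']
print(re.findall(r"\S+", "a  b \t c"))            # ['a', 'b', 'c']

# ^ and $ anchor the pattern to the start and end of the string
print(re.search(r"^order", text) is not None)     # True
print(re.search(r"pending$", text) is not None)   # True
print(re.search(r"^shipped", text) is not None)   # False
```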

 


split example

>>> import re
>>> x="this is a 56th test"
>>> d=re.split(r"5\d",x)
>>> d[0]
'this is a '
>>> d[1]
'th test'

Note: string objects have their own methods such as find, replace, and split. These methods do NOT accept regular expressions; they treat their argument as literal text. If you need a regex, import the re module.
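The difference is easy to see by passing the same pattern to both. A minimal sketch:

```python
import re

x = "this is a 56th test"

# str.split looks for the literal text "5\d", which is not in the string,
# so the string comes back unsplit
print(x.split(r"5\d"))        # ['this is a 56th test']

# re.split treats "5\d" as a regex: a 5 followed by any digit
print(re.split(r"5\d", x))    # ['this is a ', 'th test']

# for literal text, the plain string methods are enough
print(x.replace("56", "57"))  # this is a 57th test
```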

 


sub example

What if you want to replace AA with BB in a string?

>>> import re
>>> x="this is AA string"
>>> x=re.sub("AA","BB",x)
>>> print (x)
this is BB string

What if the string has multiple occurrences of AA? By default, sub replaces all of them.

>>> x="this is AA string AA"
>>> x=re.sub("AA","BB",x)
>>> print (x)
this is BB string BB
>>> 

To replace only the first occurrence, pass count=1:

>>> import re
>>> x="this is AA string AA"
>>> x=re.sub("AA","BB",x,count=1)
>>> print (x)
this is BB string AA
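The examples above replace literal text, but the first argument to sub is a regex, so it can match a pattern as well. A short sketch with a made-up sample string:

```python
import re

x = "room 12, floor 3, desk 456"

# replace every run of digits
print(re.sub(r"\d+", "N", x))            # room N, floor N, desk N

# replace only the first run of digits
print(re.sub(r"\d+", "N", x, count=1))   # room N, floor 3, desk 456
```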

 


Regex Objects

regexobject=re.compile(pattern)

This creates a compiled regex object. It can be reused throughout the code when the same pattern is needed for several match or search calls, which saves looking the pattern up again on every call. Its behaviour can be changed with flags passed to re.compile, such as re.IGNORECASE, re.LOCALE, and re.MULTILINE.

>>> import re
>>> x="This is a 1st test,#1 test for regex"
>>> rexobj=re.compile("test")
>>> rexobj.search(x)
<re.Match object; span=(14, 18), match='test'>
>>> if rexobj.search(x): print("Found")
... 
Found
>>>
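A compiled object pays off most when the same pattern is applied to many strings. A minimal sketch that also passes a flag to re.compile; the list of lines is made up for illustration.

```python
import re

# compile once, with case-insensitive matching
rexobj = re.compile(r"test", re.IGNORECASE)

# hypothetical input lines
lines = ["Test one", "a second TEST", "nothing here"]

# reuse the same compiled object for every line
hits = [line for line in lines if rexobj.search(line)]
print(hits)   # ['Test one', 'a second TEST']
```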