Python Regex Notes

Before you can use regular expressions (regex) in Python, you must import the re module:

import re

The first useful function is re.sub(), which replaces the parts of a string that match a pattern. For example:

text = 'dog boot. shoot root o hoodie'
text = re.sub('o', 'e', text)
print(text)

This changes text to ‘deg beet. sheet reet e heedie’: every ‘o’ is replaced by ‘e’. Now change the pattern to:

re.sub('o|s', 'e', text)

The result is ‘deg beet. eheet reet e heedie’, so both ‘o’ and ‘s’ are replaced by ‘e’. Now try a new pattern:

re.sub('o|s|.', 'e', text)

It returns ‘eeeeeeeeeeeeeeeeeeeeeeeeeeeee’. What went wrong? ‘.’ is a special character in a pattern: it matches any character except a newline. To match a literal dot, it must be escaped with a backslash:

re.sub(r'o|s|\.', 'e', text)
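The difference between the unescaped and the escaped dot is easy to check side by side. The following is only a small sketch using the same sample text; the character class in the last call is an alternative spelling that is not part of the original notes.

```python
import re

text = 'dog boot. shoot root o hoodie'

# Unescaped: '.' matches any character except a newline,
# so every character in the string is replaced.
all_replaced = re.sub(r'o|s|.', 'e', text)
print(all_replaced)

# Escaped: r'\.' matches only a literal dot.
escaped = re.sub(r'o|s|\.', 'e', text)
print(escaped)

# Inside [...] the dot is already literal, so no escape is needed.
char_class = re.sub(r'[os.]', 'e', text)
print(char_class)
```

Writing the pattern as a raw string (r'...') keeps Python from interpreting the backslash before the regex engine sees it.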

With the dot escaped, the result is ‘deg beete eheet reet e heedie’. There are a few more special characters, such as ‘\’ and ‘|’. For example, to replace ‘\’ with ‘/’:

re.sub('\\\\', '/', text)

To insert ‘\’ before ‘\’:

re.sub('\\\\', '\\\\\\\\', text)

To insert ‘\’ before ‘.’:

re.sub(r'\.', r'\\.', text)

To insert ‘\’ before ‘|’:

re.sub(r'\|', r'\\|', text)
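Inserting these backslashes by hand gets error-prone quickly. The standard library's re.escape() performs the same kind of insertion for a whole string at once. The sketch below assumes a made-up sample string (the name literal is only for illustration) that contains all three characters discussed above.

```python
import re

# Hypothetical sample containing a backslash, a dot and a pipe.
literal = r'a\b.c|d'

# re.escape() puts a backslash in front of each regex special character.
pattern = re.escape(literal)
print(pattern)

# The escaped pattern matches the original text character for character.
match = re.search(pattern, 'xx' + literal + 'yy')
print(match.group() == literal)
```

This is handy when the text to match comes from somewhere else, such as a file path, and you do not want any of its characters treated as regex syntax.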

Putting an escape before these special characters makes regex treat them as ordinary characters rather than characters with special meaning. This is useful in re.search(). For example, to find the substring right after r'\\dump.jianzhang.fr\Public':

text = '\\\\dump.jianzhang.fr\\Public//abc/foo.txt'
m = re.search('(?<=\\\\\\\\dump\\.jianzhang\\.fr\\\\Public).+', text)
print(m.group())

(?<=something) is a lookbehind: it matches only if the current position is preceded by something. Here something matches the text '\\\\dump.jianzhang.fr\\Public', which written as a raw string is r'\\dump.jianzhang.fr\Public'.
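To see the lookbehind in isolation, here is a minimal sketch on a much shorter string. The sample line and the prefix 'key=' are made up for illustration and stand in for the long path prefix.

```python
import re

# Hypothetical sample; 'key=' plays the role of the prefix.
line = 'key=value\nother line'

# (?<=key=) requires 'key=' just before the match but does not consume it.
with_lookbehind = re.search(r'(?<=key=).+', line).group()
print(with_lookbehind)

# Without the lookbehind, the prefix becomes part of the match.
without_lookbehind = re.search(r'key=.+', line).group()
print(without_lookbehind)
```

Note that Python's re module only accepts lookbehind patterns of a fixed width, which is why a literal prefix works well here.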

.+ matches one or more characters from that position up to the end of the line, because ‘.’ does not match a newline, ‘\n’. The search returns ‘//abc/foo.txt’. Now try a more detailed pattern, this time written with raw strings:

text = r'\\dump.jianzhang.fr\Public//abc/702/foo.txt'
m = re.search(r'(?<=\\\\dump\.jianzhang\.fr\\Public)./{1,2}.[a-z]./', text)
print(m.group())

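It returns ‘//abc/’. In ‘./{1,2}.[a-z]./’, each ‘.’ matches any single character, ‘/{1,2}’ matches one or two ‘/’, and ‘[a-z]’ matches one lowercase letter. Here the first ‘.’ and ‘/{1,2}’ consume the two slashes, ‘.[a-z].’ consumes the three characters ‘abc’, and the final ‘/’ matches the slash after them. To add one more condition: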

re.search(r'(?<=\\\\dump\.jianzhang\.fr\\Public)./{1,2}.[a-z]./\d{1,3}/', text)

The match is ‘//abc/702/’: after that ‘/’ there must be one, two, or three digits, followed by another ‘/’. We can extract those digits with a named group:

m = re.search(r'(?<=\\\\dump\.jianzhang\.fr\\Public)./{1,2}.[a-z]./(?P<foo_loc>\d{1,3})/', text)
print(m.group('foo_loc'))

It prints 702. The group is returned as the string '702', which int() can convert to an integer.
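Putting the pieces together, the following sketch builds the same search with re.escape() for the prefix and handles the case where nothing matches. The tighter middle part, /{1,2}[a-z]+/, is an assumption about the intent (one or two slashes, then a run of lowercase letters, then a slash) and is not the pattern used above.

```python
import re

text = r'\\dump.jianzhang.fr\Public//abc/702/foo.txt'

# Escape the literal prefix instead of writing the backslashes by hand.
prefix = re.escape(r'\\dump.jianzhang.fr\Public')

# Assumed intent: one or two '/', lowercase letters, '/', 1-3 digits, '/'.
pattern = '(?<=' + prefix + r')/{1,2}[a-z]+/(?P<foo_loc>\d{1,3})/'

m = re.search(pattern, text)
if m is None:
    foo_loc = None  # re.search() returns None when nothing matches
else:
    foo_loc = int(m.group('foo_loc'))  # the group itself is the string '702'

print(foo_loc)
```

Checking for None before calling group() avoids an AttributeError on input that does not contain the expected path.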

Regex is quite a complex topic. There is still a lot I have not covered, so maybe I can clear up a few more points next time.
