Thursday, July 23, 2015

Regular Expressions 6

Specifying HOW MANY Repetitions

We examine ways of specifying HOW MANY repetitions of a given pattern we want to match.
Other aspects of pattern matching will be kept constant throughout the tutorial, namely:
"WHAT" will be a single character?
"WHERE" will be from the first character in the target onward
"WHICH" occurrence will be the first occurrence?
Quantifiers, Greedy, and Non-greedy Matching
To specify how many repetitions of a given character should be matched, Oracle supports quantifiers like '*'. For example, while the 'b' in the regular expression 'abc' matches one and only one repetition of the letter 'b', an expression like 'ab*c' matches zero more repetitions of 'b', as in the targets 'ac', 'abc', or 'abbc'.
In certain situations, a pattern that uses quantifiers could conceivably match two or more substrings within a target. For example, given a target string like 'abcbcbc' the pattern 'a.*c' could match either 'abc', 'abcbc', or 'abcbcbc'. A match to the largest possible result string, 'abcbcbc', is called a "greedy" match. A match to the smallest possible result string, 'abc', is called a "non-greedy" match. POSIX regular expressions only support greedy matching. Fortunately, Oracle supports some PERL influenced extensions to the POSIX specification that support non-greedy matching. To specify a non-greedy match a '?' character is appended to the POSIX quantifier.
The following table summarizes the available quantifiers and indicates whether they are greedy or not.
Quantifier
Meaning
Greediness
?
zero or one occurrence
greedy
??
zero or one occurrence
non-greedy
*
zero or more occurrences
greedy
*?
zero or more occurrences
non-greedy
+
one or more occurrences
greedy
+?
one or more occurrences
non-greedy
{m}
exactly m occurrences
greedy
{m}?
exactly m occurrences
non-greedy
{m,n}
at least m, but not more than n occurrences
greedy
{m,n}?
at least m, but not more than n occurrences
non-greedy
{m,}
m or more occurrences
greedy
{m,}?
m or more occurrences
non-greedy
"There Can Be Only One"
To match one and only one instance of a character, simply use the character itself in the regular expression pattern.
execute set_pattern( 'b' )
execute set_target( 'aaa'  )
execute add_target( 'abc'  )
execute add_target( 'abbc' )
execute add_target( 'acc'  )
execute add_target( 'bbb'  )
execute add_target( 'bbbb' )
select * from test_results ; 
PATTERN
TARGET
MATCH
MATCHED_VALUE
POSITION
b
aaa
N

0
b
abbc
Y
b
2
b
abc
Y
b
2
b
acc
N

0
b
bbb
Y
b
1
b
bbbb
Y
b
1
Zero or One
To match zero or one repetitions of a character use the '?' quantifier
For example, to match zero or one instances of the letter 'b' use 'b?'.
execute set_pattern( 'b?' )
select * from test_results ; 

PATTERN
TARGET
MATCH
MATCHED_VALUE
POSITION
b?
aaa
Y

1
b?
abbc
Y

1
b?
abc
Y

1
b?
acc
Y

1
b?
bbb
Y
b
1
b?
bbbb
Y
b
1
These results may seem strange at first.
For example, one might expect the MATCHED_VALUE for target 'abc' to be 'b', not null. To understand why this is not the case we first have to understand a few underlying principles.
"zero" instances of a character is equivalent to a single empty string
all target strings are considered to have empty strings within them; e.g. 'abc' is treated like this
(empty string) a (empty string) b (empty string) c (empty string) 
Oracle displays empty strings as nulls.
Now let's consider the results. The pattern 'b?' is supposed to match zero or one occurrence of 'b', which is equivalent to saying it matches a single empty string or the character 'b'. When Oracle examines the target 'abc' for a match, it scans the equivalent string
(empty string) a (empty string) b (empty string) c (empty string) 
and finds that the first empty string is a match for the pattern. It therefore reports a successful match at location 1 and returns the empty string as the result (displayed as '(null)').
The same thing happens with targets 'aaa', 'abbc', and 'acc'. Oracle never finds the 'b' at position 2 of 'abbc' because it stops looking when it finds the empty string at the beginning of the target.
For the target 'bbb', Oracle could either match the first empty string in the target or the 'b' at position 1. Since 'b?' is a greedy pattern and since 'b' is a larger result than an empty string, Oracle returns the letter 'b' as the match result.
For another perspective on empty strings, consider the following example where all empty strings in the target 'abc' are replaced with an 'X' to disclose their locations.
select  'abc' as target ,  regexp_replace( 'abc', 'd?', 'X' ) as replaced_target from  dual; 
TARGET        REPLACED_TARGET
---------- -----------------------------------
abc                   XaXbXcX
In the WHERE to Look tutorial in this section we will see how to use the '^' and '$' characters to match the empty strings at the beginning and end, respectively, of a target.
While trivial patterns such as 'b?' are of little use, non-trivial patterns such as 'a.?c' can be very useful and produce clearer results than 'b?'.
execute set_pattern( 'a.?c' )
select * from test_results ; 

PATTERN
TARGET
MATCH
MATCHED_VALUE
POSITION
a.?c
aaa
N

0
a.?c
abbc
N

0
a.?c
abc
Y
abc
1
a.?c
acc
Y
acc
1
a.?c
bbb
N

0
a.?c
bbbb
N

0

Non-greedy matches of zero or more occurrences are specified with the '??' quantifier.
execute set_pattern( 'a.??c' )

PATTERN
TARGET
MATCH
MATCHED_VALUE
POSITION
a.??c
aaa
N

0
a.??c
abbc
N

0
a.??c
abc
Y
abc
1
a.??c
acc
Y
ac
1
a.??c
bbb
N

0
a.??c
bbbb
N

0
Zero or More
A zero or more greedy match is specified with the '*' quantifier, its non-greedy counterpart is '*?'.
execute set_pattern( 'a.*c' ) ;
select * from test_results ;

PATTERN
TARGET
MATCH
MATCHED_VALUE
POSITION
a.*c
Aaa
N

0
a.*c
Abbc
Y
abbc
1
a.*c
Abc
Y
abc
1
a.*c
Acc
Y
acc
1
a.*c
Bbb
N

0
a.*c
Bbbb
N

0

execute set_pattern( 'a.*?c' ) ;
select * from test_results ;

PATTERN
TARGET
MATCH
MATCHED_VALUE
POSITION
a.*?c
aaa
N

0
a.*?c
abbc
Y
abbc
1
a.*?c
abc
Y
abc
1
a.*?c
acc
Y
ac
1
a.*?c
bbb
N

0
a.*?c
bbbb
N

0
One or More
A one or more greedy match is specified with the '+' quantifier, its non-greedy counterpart is '+?'.
execute set_pattern( 'b+' ) ;
select * from test_results ; 

PATTERN
TARGET
MATCH
MATCHED_VALUE
POSITION
b+
aaa
N

0
b+
abbc
Y
bb
2
b+
abc
Y
b
2
b+
acc
N

0
b+
bbb
Y
bbb
1
b+
bbbb
Y
bbbb
1

execute set_pattern( 'b+?' ) ;
select * from test_results ; 

PATTERN
TARGET
MATCH
MATCHED_VALUE
POSITION
b+?
Aaa
N

0
b+?
Abbc
Y
b
2
b+?
Abc
Y
b
2
b+?
Acc
N

0
b+?
Bbb
Y
b
1
b+?
Bbbb
Y
b
1
Enumerated Occurrences
To specify an exact number or a numeric range of occurrences use the '{}' quantifiers.
execute set_pattern( 'b{2}'    )
execute add_pattern( 'b{2}?'   )
execute add_pattern( 'b{2,3}'  )
execute add_pattern( 'b{2,3}?' )
execute add_pattern( 'b{2,}'   )
execute add_pattern( 'b{2,}?'  )

set pagesize 39
select * from test_results ;

PATTERN
TARGET
MATCH
MATCHED_VALUE
POSITION
b{2}
aaa
N

0
b{2}
abbc
Y
bb
2
b{2}
abc
N

0
b{2}
acc
N

0
b{2}
bbb
Y
bb
1
b{2}
bbbb
Y
bb
1
b{2}?
aaa
N

0
b{2}?
abbc
Y
bb
2
b{2}?
abc
N

0
b{2}?
acc
N

0
b{2}?
bbb
Y
bb
1
b{2}?
bbbb
Y
bb
1
b{2,3}
aaa
N

0
b{2,3}
abbc
Y
bb
2
b{2,3}
abc
N

0
b{2,3}
acc
N

0
b{2,3}
bbb
Y
bbb
1
b{2,3}
bbbb
Y
bbb
1
b{2,3}?
aaa
N

0
b{2,3}?
abbc
Y
bb
2
b{2,3}?
abc
N

0
b{2,3}?
acc
N

0
b{2,3}?
bbb
Y
bb
1
b{2,3}?
bbbb
Y
bb
1
b{2,}
aaa
N

0
b{2,}
abbc
Y
bb
2
b{2,}
abc
N

0
b{2,}
acc
N

0
b{2,}
bbb
Y
bbb
1
b{2,}
bbbb
Y
bbbb
1
b{2,}?
aaa
N

0
b{2,}?
abbc
Y
bb
2
b{2,}?
abc
N

0
b{2,}?
acc
N

0
b{2,}?
bbb
Y
bb
1
b{2,}?
bbbb
Y
bb
1

Note that the non-greedy patterns 'b{2}?', 'b{2,3}?', and 'b{2,}?' are of little practical use since they all yield the same result set as 'b{2}'.
Repetitions of Strings with Two or More Characters
In the previous examples we saw how to apply quantifiers to a single character. To apply them to a string of characters simply enclose the string with '(' and ')' and apply the quantifier after the ')'.
execute set_pattern( '(ab)*'   )
execute add_pattern( '(ab)+'   )
execute add_pattern( '(ab){2}' )
execute set_target( 'a'      )
execute add_target( 'ab'     )
execute add_target( 'abab'   )
execute add_target( 'ababab' )
select * from test_results ;
PATTERN
TARGET
MATCH
MATCHED_VALUE
POSITION
(ab)*
a
Y

1
(ab)*
ab
Y
ab
1
(ab)*
abab
Y
abab
1
(ab)*
ababab
Y
ababab
1
(ab)+
a
N

0
(ab)+
ab
Y
ab
1
(ab)+
abab
Y
abab
1
(ab)+
ababab
Y
ababab
1
(ab){2}
a
N

0
(ab){2}
ab
N

0
(ab){2}
abab
Y
abab
1
(ab){2}
ababab
Y
abab
1
 

Specifying WHERE to Look

In this tutorial we examine ways of specifying WHERE in the source string we should look for a match.
Other aspects of pattern matching will be kept constant throughout the tutorial, namely:
"WHAT" will be a single character
"HOW MANY" will be one repetition of the text
"WHICH" occurrence will be the first occurrence.
Regular Expressions versus LIKE
Before regular expressions, pattern matching was typically done with the LIKE operator. When using this operator a pattern like 'abc' would only match 'abc', not '---abc' or '---abc---'. To match a string that contains 'abc' anywhere inside it, the pattern '%abc%' would be used.
execute set_target( 'abc'  )
execute add_target( '---abc' )
execute add_target( '---abc---' )

select 'abc' as pattern, target from   targets where  target like 'abc';

PATTERN TARGET
--------------- ----------
abc             abc
select '%abc%' as pattern, target from   targets where  target like '%abc%';
PATTERN         TARGET
--------------- ----------
%abc%           abc
                ---abc
                ---abc---
Regular expressions take the opposite approach. The pattern 'abc' matches any target containing 'abc' and a special pattern like '^abc$' must be used to match only the target 'abc'.
select 'abc' as pattern, target from   targets where  regexp_like( target, 'abc' ); 
PATTERN         TARGET
--------------- ----------
abc             abc
                ---abc
                ---abc---
select '^abc$' as pattern, target from   targets where  regexp_like( target, '^abc$' ); 
PATTERN         TARGET
--------------- ----------
^abc$           abc 
Matching the Beginning or End of a Line
As we saw in the tutorial HOW MANY, a target string like 'abc' is treated as a sequence of empty strings and characters.
(empty string) a (empty string) b (empty string) c (empty string) 
A multi-line string with a Linefeed (LF) in it, like 'abc(LF)def' is treated as follows.
(empty string) a (empty string) b (empty string) c (empty string) (LF)
(empty string) d (empty string) e (empty string) f (empty string) 
This treatment enables us to specify two special locations within the target, the beginning of lines and the end of lines.
By default the character '^' matches the first empty string in a target and the character '$' matches the last empty string in a target. In multi-line mode (initiated using the 'm' parameter in REGEXP conditions and functions) '^' matches the first empty string in the target *and* the first empty string after any Linefeed characters. The character '$' performs similarly for the last empty string in a target or line. The following examples illustrate this. The symbol (^) indicates an empty string matched by '^' and the symbol ($) indicates an empty string matched by '$'.
Default Mode:
(^)   a (empty string) b (empty string) c (empty string) (LF)
(empty string) d (empty string) e (empty string) f ($) 
Multi-line Mode:
(^) a (empty string) b (empty string) c ($) (LF)
(^) d (empty string) e (empty string) f ($) 
The following examples show how '^' and '$' operate in default mode. In the target data, 'unistr('\000A')' represents a Linefeed character and the target value 'abc(LF)def' appears as 'abc' on one line and 'def' on the next.
execute set_pattern( '^'  )
execute add_pattern( '$'  )
execute add_pattern( '^.' )
execute add_pattern( '.$' )
execute add_pattern( 'a' )
execute add_pattern( '^a' )
execute add_pattern( '^d' )
execute add_pattern( 'c$' )
execute add_pattern( 'f' )
execute add_pattern( 'f$' )
execute set_target( 'abcdef'  )
execute add_target( 'abc' || unistr('\000A') || 'def' )

set pagesize 60
set recsep   each
break on pattern
select   pattern ,  target, regexp_substr( target, pattern ) as matched_value ,
 regexp_instr( target, pattern ) as position
from  patterns ,  targets
order by  sort_key ,  target desc; 

PATTERN
TARGET
MATCHED_VALUE
POSITION
^
abcdef

1
^
abc
def


1
$
abcdef

7
$
abc
def


8
^.
abcdef
a
1
^.
abc
def

a
1
.$
abcdef
f
6
.$
abc
def

f
7
a
abcdef
a
1
a
abc
def

a
1
^a
abcdef
a
1
^a
abc
def

a
1
^d
abcdef

0
^d
abc
def


0
c$
abcdef

0
c$
abc
def


0
f
abcdef
f
6
f
abc
def

f
7
f$
abcdef
f
6
f$
abc

def

f
7
Using the same patterns in multi-line mode gives us these results.
Select   pattern ,  target ,  regexp_substr( target, pattern, 1, 1, 'm' ) as matched_value ,
  regexp_instr( target, pattern, 1, 1, 0, 'm' ) as position
from  patterns ,  targets
order by  sort_key ,  target desc;
PATTERN
TARGET
MATCHED_VALUE
POSITION
^
abcdef

1
^
abc
def


1
$
abcdef

7
$
abc
def


4
^.
abcdef
a
1
^.
abc
def

a
1
.$
abcdef
f
6
.$
abc
def

c
3
a
abcdef
a
1
a
abc
def

a
1
^a
abcdef
a
1
^a
abc
def

a
1
^d
abcdef

0
^d
abc
def

d
5
c$
abcdef

0
c$
abc
def

c
3
f
abcdef
f
6
f
abc
def

f
7
f$
abcdef
f
6
f$
abc
def

f
7
set recsep wrapped
break on pattern skip 1
For the six character target 'abcdef' note that REGEXP_INSTR returns '1' for both '^' and 'a'. Contrast this with the results for 'f' and '$' which are '6' and '7' respectively. One might expect REGEXP_INSTR to return "0" for '^', to be consistent with the behaviour of "$", but it cannot because '0' is used to indicate that no match was found. Be aware of this difference when using REGEXP_INSTR with '^' and '$'.
Starting the Search From Somewhere Other Than Position One
By default Oracle searches the target string starting from the first character in the target string. To start from a different position, REGEXP_INSTR, REGEXP_REPLACE, and REGEXP_SUBSTR accept a "position" parameter to set the starting point.
Starting from position 1 (no position parameter is used):
execute set_target( 'abc1--abc2'  )

select   target ,  'abc.' as pattern ,  regexp_substr( target, 'abc.' ) as matched_value ,
  regexp_instr( target, 'abc.' ) as position
from  targets; 

TARGET     PATTERN         MATCHED_VALUE   POSITION
---------- --------------- --------------- --------
abc1--abc2 abc.            abc1                   1
Starting from position 5 (position parameter is set to "5"):
Select   target ,  'abc.' as pattern ,  regexp_substr( target, 'abc.', 5 ) as matched_value ,
  regexp_instr( target, 'abc.', 5 ) as position
from  targets; 

TARGET     PATTERN         MATCHED_VALUE   POSITION
---------- --------------- --------------- --------
abc1--abc2 abc.            abc2                   7
Note that, regardless of which location you start the search at, REGEXP_INSTR always returns a position relative to the start of the entire target.
Unlike the three REGEXP functions, the REGEXP_LIKE condition does not accept a position parameter. The following examples show how to work around this. They all search for targets that have the string 'abc' anywhere at or after the fifth character in a string.
execute set_target( 'abc'  )
execute add_target( 'abc---def'  )
execute add_target( 'def---abc'  )

select target from   targets where  regexp_like( target, '^.{4}.*abc' ) ;

TARGET
----------
def---abc

select target
from   targets
where  regexp_like( substr( target, 5 ), 'abc' ) ;

TARGET
----------
def---abc 

Specifying WHICH Occurrence

In this tutorial we examine ways of specifying WHICH occurrence of a pattern we wish to match.
Other aspects of pattern matching will be kept constant throughout the tutorial, namely:
"WHAT" will be a single character
"HOW MANY" will be one repetition of the text
"WHERE" will be from the first character in the target onward
REGEXP_INSTR, REGEXP_REPLACE, and REGEXP_SUBSTR
By default a regular expression always matches the first occurrence of a pattern. To match the second, third, etc. occurrence of a pattern the REGEXP_INSTR, REGEXP_REPLACE, and REGEXP_SUBSTR functions accept an "occurrence" parameter.
This example shows the default behavior (matching the first occurrence) when no occurrence parameter is specified.
execute set_pattern( 'ab.' )
execute set_target( 'ab1-ab2'         )
execute add_target( 'ab1-ab2-ab3-ab4' )
execute add_target( 'def-def-def'     )
execute add_target( 'def-def-def-def' )

column target format a15

select  'ab.' as pattern ,  target ,  regexp_substr( target, 'ab.' ) as matched_value ,
  regexp_instr( target, 'ab.' ) as position
from  targets; 

PATTERN         TARGET          MATCHED_VALUE   POSITION
--------------- --------------- --------------- --------
ab.             ab1-ab2         ab1                    1
                ab1-ab2-ab3-ab4 ab1                    1
                def-def-def     (null)                 0
                def-def-def-def (null)                 0
This example shows how to match the third occurrence of a pattern. The same technique can be applied to any number of occurrences.
select  'ab.' as pattern ,  target ,  regexp_substr( target, 'ab.', 1, 3 ) as matched_value ,
  regexp_instr( target, 'ab.', 1, 3 ) as position
from  targets;

PATTERN         TARGET          MATCHED_VALUE   POSITION
--------------- --------------- --------------- --------
ab.             ab1-ab2         (null)                 0
                ab1-ab2-ab3-ab4 ab3                    9
                def-def-def     (null)                 0
                def-def-def-def (null)                 0

REGEXP_LIKE
Unlike the three REGEXP functions we just examined, REGEXP_LIKE does not accept an occurrence parameter. The following examples show ways of working around this.
This example searches for targets with *exactly* three occurrences of 'def'
select target from   targets where  regexp_like( target, '(def.*){3}' ) and    not regexp_like( target, '(def.*){4}') ;

TARGET
---------------
def-def-def
This example searches for targets with *at least* three occurrences of 'def'
select target from   targets where  regexp_like( target, '(def.*){3}' ) ; 
TARGET
---------------
def-def-def
def-def-def-def
This example searches for targets with at least three occurrences of 'ab.'
select target from   targets where  regexp_like( target, '(ab..*){3}' ) ; 
TARGET
---------------
ab1-ab2-ab3-ab4
Of course, some of the requirements above could also be met without using REGEXP_LIKE by simply using a REGEXP function, as in the following examples.
Targets with at least three occurrences of 'def':
select target from   targets where  regexp_instr( target, 'def', 1, 3 ) != 0 ;  
TARGET
---------------
def-def-def
def-def-def-def

Targets with at least three occurrences of 'ab.'
select target from   targets where  regexp_instr( target, 'ab.', 1, 3 ) != 0 ;

TARGET
---------------
ab1-ab2-ab3-ab4
 

Cleanup

Run the code on this page to drop the sample tables, procedures, etc. created in earlier parts of this section. To clear session state changes (e.g. those made by SET, COLUMN, and VARIABLE commands) exit your SQL*Plus session after running these cleanup commands.

drop procedure add_target   ;
drop procedure set_target   ;
drop procedure add_pattern  ;
drop procedure set_pattern  ;
drop view      test_results ;
drop table     targets      ;
drop table     patterns     ;

drop table character_class_map ;

exit

No comments :

Post a Comment