Specifying HOW MANY Repetitions
We examine
ways of specifying HOW MANY repetitions of a
given pattern we want to match.
Other
aspects of pattern matching will be kept constant throughout the tutorial,
namely:
"WHAT" will be a single character?
"WHERE" will be from the first character in the target onward
"WHICH" occurrence will be the first occurrence?
Quantifiers,
Greedy, and Non-greedy Matching
To specify how many repetitions of a given
character should be matched, Oracle supports quantifiers like '*'. For example,
while the 'b' in the regular expression 'abc' matches one and only one
repetition of the letter 'b', an expression like 'ab*c' matches zero more
repetitions of 'b', as in the targets 'ac', 'abc', or 'abbc'. In certain situations, a pattern that uses quantifiers could conceivably match two or more substrings within a target. For example, given a target string like 'abcbcbc' the pattern 'a.*c' could match either 'abc', 'abcbc', or 'abcbcbc'. A match to the largest possible result string, 'abcbcbc', is called a "greedy" match. A match to the smallest possible result string, 'abc', is called a "non-greedy" match. POSIX regular expressions only support greedy matching. Fortunately, Oracle supports some PERL influenced extensions to the POSIX specification that support non-greedy matching. To specify a non-greedy match a '?' character is appended to the POSIX quantifier.
The following table summarizes the available quantifiers and indicates whether they are greedy or not.
Quantifier |
Meaning |
Greediness |
? |
zero or one occurrence |
greedy |
?? |
zero or one occurrence |
non-greedy |
* |
zero or more occurrences |
greedy |
*? |
zero or more occurrences |
non-greedy |
+ |
one or more occurrences |
greedy |
+? |
one or more occurrences |
non-greedy |
{m} |
exactly m occurrences |
greedy |
{m}? |
exactly m occurrences |
non-greedy |
{m,n} |
at least m, but not more than n occurrences |
greedy |
{m,n}? |
at least m, but not more than n occurrences |
non-greedy |
{m,} |
m or more occurrences |
greedy |
{m,}? |
m or more occurrences |
non-greedy |
To match one and only one instance of a character, simply use the character itself in the regular expression pattern.
execute
set_pattern( 'b' )
execute
set_target( 'aaa' )
execute
add_target( 'abc' )
execute
add_target( 'abbc' )
execute
add_target( 'acc' )
execute
add_target( 'bbb' )
execute
add_target( 'bbbb' )
select * from test_results ;
PATTERN |
TARGET |
MATCH |
MATCHED_VALUE |
POSITION |
b |
aaa |
N |
|
0 |
b |
abbc |
Y |
b |
2 |
b |
abc |
Y |
b |
2 |
b |
acc |
N |
|
0 |
b |
bbb |
Y |
b |
1 |
b |
bbbb |
Y |
b |
1 |
To match zero or one repetitions of a character use the '?' quantifier
For example, to match zero or one instances of the letter 'b' use 'b?'.
execute
set_pattern( 'b?' )
select *
from test_results ;
PATTERN |
TARGET |
MATCH |
MATCHED_VALUE |
POSITION |
b? |
aaa |
Y |
|
1 |
b? |
abbc |
Y |
|
1 |
b? |
abc |
Y |
|
1 |
b? |
acc |
Y |
|
1 |
b? |
bbb |
Y |
b |
1 |
b? |
bbbb |
Y |
b |
1 |
For example, one might expect the MATCHED_VALUE for target 'abc' to be 'b', not null. To understand why this is not the case we first have to understand a few underlying principles.
"zero" instances of a character is equivalent to a single empty string
all target strings are considered to have empty strings within them; e.g. 'abc' is treated like this
(empty string) a (empty string) b (empty string) c (empty string)
Oracle displays empty strings as nulls.
Now let's consider the results. The pattern 'b?' is supposed to match zero or one occurrence of 'b', which is equivalent to saying it matches a single empty string or the character 'b'. When Oracle examines the target 'abc' for a match, it scans the equivalent string
(empty string) a (empty string) b (empty string) c (empty string)
and finds that the first empty string is a match for the pattern. It therefore reports a successful match at location 1 and returns the empty string as the result (displayed as '(null)').
The same thing happens with targets 'aaa', 'abbc', and 'acc'. Oracle never finds the 'b' at position 2 of 'abbc' because it stops looking when it finds the empty string at the beginning of the target.
For the target 'bbb', Oracle could either match the first empty string in the target or the 'b' at position 1. Since 'b?' is a greedy pattern and since 'b' is a larger result than an empty string, Oracle returns the letter 'b' as the match result.
For another perspective on empty strings, consider the following example where all empty strings in the target 'abc' are replaced with an 'X' to disclose their locations.
select 'abc' as target , regexp_replace( 'abc', 'd?', 'X' ) as replaced_target from dual;
TARGET REPLACED_TARGET
----------
-----------------------------------
abc XaXbXcX
In the WHERE to Look tutorial in this section we will see how to use the '^' and
'$' characters to match the empty strings at the beginning and end,
respectively, of a target. While trivial patterns such as 'b?' are of little use, non-trivial patterns such as 'a.?c' can be very useful and produce clearer results than 'b?'.
execute
set_pattern( 'a.?c' )
select *
from test_results ;
PATTERN |
TARGET |
MATCH |
MATCHED_VALUE |
POSITION |
a.?c |
aaa |
N |
|
0 |
a.?c |
abbc |
N |
|
0 |
a.?c |
abc |
Y |
abc |
1 |
a.?c |
acc |
Y |
acc |
1 |
a.?c |
bbb |
N |
|
0 |
a.?c |
bbbb |
N |
|
0 |
Non-greedy
matches of zero or more occurrences are specified with the '??' quantifier.
execute
set_pattern( 'a.??c' )
PATTERN |
TARGET |
MATCH |
MATCHED_VALUE |
POSITION |
a.??c |
aaa |
N |
|
0 |
a.??c |
abbc |
N |
|
0 |
a.??c |
abc |
Y |
abc |
1 |
a.??c |
acc |
Y |
ac |
1 |
a.??c |
bbb |
N |
|
0 |
a.??c |
bbbb |
N |
|
0 |
A zero or
more greedy match is specified with the '*' quantifier, its non-greedy
counterpart is '*?'.
execute
set_pattern( 'a.*c' ) ;
select *
from test_results ;
PATTERN |
TARGET |
MATCH |
MATCHED_VALUE |
POSITION |
a.*c |
Aaa |
N |
|
0 |
a.*c |
Abbc |
Y |
abbc |
1 |
a.*c |
Abc |
Y |
abc |
1 |
a.*c |
Acc |
Y |
acc |
1 |
a.*c |
Bbb |
N |
|
0 |
a.*c |
Bbbb |
N |
|
0 |
execute
set_pattern( 'a.*?c' ) ;
select *
from test_results ;
PATTERN |
TARGET |
MATCH |
MATCHED_VALUE |
POSITION |
a.*?c |
aaa |
N |
|
0 |
a.*?c |
abbc |
Y |
abbc |
1 |
a.*?c |
abc |
Y |
abc |
1 |
a.*?c |
acc |
Y |
ac |
1 |
a.*?c |
bbb |
N |
|
0 |
a.*?c |
bbbb |
N |
|
0 |
A one or more greedy match is specified with the '+' quantifier, its non-greedy counterpart is '+?'.
execute
set_pattern( 'b+' ) ;
select *
from test_results ;
PATTERN |
TARGET |
MATCH |
MATCHED_VALUE |
POSITION |
b+ |
aaa |
N |
|
0 |
b+ |
abbc |
Y |
bb |
2 |
b+ |
abc |
Y |
b |
2 |
b+ |
acc |
N |
|
0 |
b+ |
bbb |
Y |
bbb |
1 |
b+ |
bbbb |
Y |
bbbb |
1 |
execute
set_pattern( 'b+?' ) ;
select *
from test_results ;
PATTERN |
TARGET |
MATCH |
MATCHED_VALUE |
POSITION |
b+? |
Aaa |
N |
|
0 |
b+? |
Abbc |
Y |
b |
2 |
b+? |
Abc |
Y |
b |
2 |
b+? |
Acc |
N |
|
0 |
b+? |
Bbb |
Y |
b |
1 |
b+? |
Bbbb |
Y |
b |
1 |
To specify an exact number or a numeric range of occurrences use the '{}' quantifiers.
execute
set_pattern( 'b{2}' )
execute
add_pattern( 'b{2}?' )
execute
add_pattern( 'b{2,3}' )
execute
add_pattern( 'b{2,3}?' )
execute
add_pattern( 'b{2,}' )
execute
add_pattern( 'b{2,}?' )
set
pagesize 39
select *
from test_results ;
PATTERN |
TARGET |
MATCH |
MATCHED_VALUE |
POSITION |
b{2} |
aaa |
N |
|
0 |
b{2} |
abbc |
Y |
bb |
2 |
b{2} |
abc |
N |
|
0 |
b{2} |
acc |
N |
|
0 |
b{2} |
bbb |
Y |
bb |
1 |
b{2} |
bbbb |
Y |
bb |
1 |
b{2}? |
aaa |
N |
|
0 |
b{2}? |
abbc |
Y |
bb |
2 |
b{2}? |
abc |
N |
|
0 |
b{2}? |
acc |
N |
|
0 |
b{2}? |
bbb |
Y |
bb |
1 |
b{2}? |
bbbb |
Y |
bb |
1 |
b{2,3} |
aaa |
N |
|
0 |
b{2,3} |
abbc |
Y |
bb |
2 |
b{2,3} |
abc |
N |
|
0 |
b{2,3} |
acc |
N |
|
0 |
b{2,3} |
bbb |
Y |
bbb |
1 |
b{2,3} |
bbbb |
Y |
bbb |
1 |
b{2,3}? |
aaa |
N |
|
0 |
b{2,3}? |
abbc |
Y |
bb |
2 |
b{2,3}? |
abc |
N |
|
0 |
b{2,3}? |
acc |
N |
|
0 |
b{2,3}? |
bbb |
Y |
bb |
1 |
b{2,3}? |
bbbb |
Y |
bb |
1 |
b{2,} |
aaa |
N |
|
0 |
b{2,} |
abbc |
Y |
bb |
2 |
b{2,} |
abc |
N |
|
0 |
b{2,} |
acc |
N |
|
0 |
b{2,} |
bbb |
Y |
bbb |
1 |
b{2,} |
bbbb |
Y |
bbbb |
1 |
b{2,}? |
aaa |
N |
|
0 |
b{2,}? |
abbc |
Y |
bb |
2 |
b{2,}? |
abc |
N |
|
0 |
b{2,}? |
acc |
N |
|
0 |
b{2,}? |
bbb |
Y |
bb |
1 |
b{2,}? |
bbbb |
Y |
bb |
1 |
Note that the non-greedy patterns 'b{2}?', 'b{2,3}?', and 'b{2,}?' are of little practical use since they all yield the same result set as 'b{2}'.
Repetitions of Strings with Two or More Characters
In the previous examples we saw how to apply quantifiers to a single character. To apply them to a string of characters simply enclose the string with '(' and ')' and apply the quantifier after the ')'.
execute
set_pattern( '(ab)*' )
execute
add_pattern( '(ab)+' )
execute
add_pattern( '(ab){2}' )
execute
set_target( 'a' )
execute
add_target( 'ab' )
execute
add_target( 'abab' )
execute
add_target( 'ababab' )
select * from test_results ;
PATTERN |
TARGET |
MATCH |
MATCHED_VALUE |
POSITION |
(ab)* |
a |
Y |
|
1 |
(ab)* |
ab |
Y |
ab |
1 |
(ab)* |
abab |
Y |
abab |
1 |
(ab)* |
ababab |
Y |
ababab |
1 |
(ab)+ |
a |
N |
|
0 |
(ab)+ |
ab |
Y |
ab |
1 |
(ab)+ |
abab |
Y |
abab |
1 |
(ab)+ |
ababab |
Y |
ababab |
1 |
(ab){2} |
a |
N |
|
0 |
(ab){2} |
ab |
N |
|
0 |
(ab){2} |
abab |
Y |
abab |
1 |
(ab){2} |
ababab |
Y |
abab |
1 |
Specifying WHERE to Look
In this
tutorial we examine ways of specifying WHERE in
the source string we should look for a match.
Other
aspects of pattern matching will be kept constant throughout the tutorial,
namely:
"WHAT" will be a single character
"HOW MANY" will be one repetition of the text
"WHICH" occurrence will be the first occurrence.
Regular
Expressions versus LIKE Before regular expressions, pattern matching was typically done with the LIKE operator. When using this operator a pattern like 'abc' would only match 'abc', not '---abc' or '---abc---'. To match a string that contains 'abc' anywhere inside it, the pattern '%abc%' would be used.
execute
set_target( 'abc' )
execute
add_target( '---abc' )
execute
add_target( '---abc---' )
select
'abc' as pattern, target from targets where target like 'abc';
PATTERN TARGET
---------------
----------
abc abc
select '%abc%' as pattern, target from targets where target like '%abc%';
PATTERN TARGET
---------------
----------
%abc% abc
---abc
---abc---
Regular expressions take the opposite approach. The
pattern 'abc' matches any target containing 'abc' and a special pattern like
'^abc$' must be used to match only the target 'abc'. select 'abc' as pattern, target from targets where regexp_like( target, 'abc' );
PATTERN TARGET
---------------
----------
abc abc
---abc
---abc---
select '^abc$' as pattern, target from targets where regexp_like( target, '^abc$' );
PATTERN TARGET
---------------
----------
^abc$ abc
Matching the Beginning or End of a Line As we saw in the tutorial HOW MANY, a target string like 'abc' is treated as a sequence of empty strings and characters.
(empty string) a (empty string) b (empty string) c (empty string)
A multi-line string with a Linefeed (LF) in it, like 'abc(LF)def' is treated as follows.
(empty string) a (empty string) b (empty string) c (empty string) (LF)
(empty string) d (empty string) e (empty string) f (empty string)
This treatment enables us to specify two special locations within the target, the beginning of lines and the end of lines.
By default the character '^' matches the first empty string in a target and the character '$' matches the last empty string in a target. In multi-line mode (initiated using the 'm' parameter in REGEXP conditions and functions) '^' matches the first empty string in the target *and* the first empty string after any Linefeed characters. The character '$' performs similarly for the last empty string in a target or line. The following examples illustrate this. The symbol (^) indicates an empty string matched by '^' and the symbol ($) indicates an empty string matched by '$'.
Default Mode:
(^) a (empty string) b (empty string) c (empty string) (LF)
(empty string) d (empty string) e (empty string) f ($)
Multi-line Mode:
(^) a
(empty string) b (empty string) c ($) (LF)
(^) d
(empty string) e (empty string) f ($)
The following examples show how '^' and '$' operate
in default mode. In the target data, 'unistr('\000A')' represents a Linefeed
character and the target value 'abc(LF)def' appears as 'abc' on one line and
'def' on the next.
execute
set_pattern( '^' )
execute
add_pattern( '$' )
execute
add_pattern( '^.' )
execute
add_pattern( '.$' )
execute
add_pattern( 'a' )
execute
add_pattern( '^a' )
execute
add_pattern( '^d' )
execute
add_pattern( 'c$' )
execute
add_pattern( 'f' )
execute
add_pattern( 'f$' )
execute
set_target( 'abcdef' )
execute
add_target( 'abc' || unistr('\000A') || 'def' )
set
pagesize 60
set
recsep each
break on
pattern
select pattern ,
target, regexp_substr( target, pattern ) as matched_value ,
regexp_instr( target, pattern ) as position
from patterns ,
targets
order
by sort_key , target desc;
PATTERN |
TARGET |
MATCHED_VALUE |
POSITION |
^ |
abcdef |
|
1 |
^ |
abc def |
|
1 |
$ |
abcdef |
|
7 |
$ |
abc def |
|
8 |
^. |
abcdef |
a |
1 |
^. |
abc def |
a |
1 |
.$ |
abcdef |
f |
6 |
.$ |
abc def |
f |
7 |
a |
abcdef |
a |
1 |
a |
abc def |
a |
1 |
^a |
abcdef |
a |
1 |
^a |
abc def |
a |
1 |
^d |
abcdef |
|
0 |
^d |
abc def |
|
0 |
c$ |
abcdef |
|
0 |
c$ |
abc def |
|
0 |
f |
abcdef |
f |
6 |
f |
abc def |
f |
7 |
f$ |
abcdef |
f |
6 |
f$ |
abc def |
f |
7 |
Select pattern ,
target , regexp_substr( target,
pattern, 1, 1, 'm' ) as matched_value ,
regexp_instr( target, pattern, 1, 1, 0, 'm' )
as position
from patterns ,
targets
order
by sort_key , target desc;
PATTERN |
TARGET |
MATCHED_VALUE
|
POSITION |
^ |
abcdef |
|
1 |
^ |
abc def |
|
1 |
$ |
abcdef |
|
7 |
$ |
abc def |
|
4 |
^. |
abcdef |
a
|
1 |
^. |
abc def |
a
|
1 |
.$ |
abcdef |
f
|
6 |
.$ |
abc def |
c
|
3 |
a |
abcdef |
a
|
1 |
a |
abc def |
a
|
1 |
^a |
abcdef |
a
|
1 |
^a |
abc def |
a
|
1 |
^d |
abcdef |
|
0 |
^d |
abc def |
d
|
5 |
c$ |
abcdef |
|
0 |
c$ |
abc def |
c
|
3 |
f |
abcdef |
f
|
6 |
f |
abc def |
f
|
7 |
f$ |
abcdef |
f
|
6 |
f$ |
abc def |
f
|
7 |
break on pattern skip 1
For the six character target 'abcdef' note that REGEXP_INSTR returns '1' for both '^' and 'a'. Contrast this with the results for 'f' and '$' which are '6' and '7' respectively. One might expect REGEXP_INSTR to return "0" for '^', to be consistent with the behaviour of "$", but it cannot because '0' is used to indicate that no match was found. Be aware of this difference when using REGEXP_INSTR with '^' and '$'.
Starting the Search From Somewhere Other Than Position One
By default Oracle searches the target string starting from the first character in the target string. To start from a different position, REGEXP_INSTR, REGEXP_REPLACE, and REGEXP_SUBSTR accept a "position" parameter to set the starting point.
Starting from position 1 (no position parameter is used):
execute
set_target( 'abc1--abc2' )
select target ,
'abc.' as pattern ,
regexp_substr( target, 'abc.' ) as matched_value ,
regexp_instr( target, 'abc.' ) as position
from targets;
TARGET PATTERN MATCHED_VALUE POSITION
----------
--------------- --------------- --------
abc1--abc2
abc. abc1 1
Starting from position 5 (position parameter is set
to "5"):
Select target ,
'abc.' as pattern ,
regexp_substr( target, 'abc.', 5 ) as matched_value ,
regexp_instr( target, 'abc.', 5 ) as position
from targets;
TARGET PATTERN MATCHED_VALUE POSITION
----------
--------------- --------------- --------
abc1--abc2
abc. abc2 7
Note that, regardless of which location you start
the search at, REGEXP_INSTR always returns a position relative to the start of
the entire target. Unlike the three REGEXP functions, the REGEXP_LIKE condition does not accept a position parameter. The following examples show how to work around this. They all search for targets that have the string 'abc' anywhere at or after the fifth character in a string.
execute
set_target( 'abc' )
execute
add_target( 'abc---def' )
execute
add_target( 'def---abc' )
select
target from targets where regexp_like( target, '^.{4}.*abc' ) ;
TARGET
----------
def---abc
select
target
from targets
where regexp_like( substr( target, 5 ), 'abc' ) ;
TARGET
----------
def---abc
Specifying WHICH Occurrence
In this tutorial we examine ways of specifying WHICH occurrence of a pattern we wish to match.
Other
aspects of pattern matching will be kept constant throughout the tutorial,
namely:
"WHAT" will be a single character
"HOW MANY" will be one repetition of the text
"WHERE" will be from the first character in the target onward
REGEXP_INSTR, REGEXP_REPLACE, and REGEXP_SUBSTR
By default a regular expression always matches the
first occurrence of a pattern. To match the second, third, etc. occurrence of a
pattern the REGEXP_INSTR, REGEXP_REPLACE, and REGEXP_SUBSTR functions accept an
"occurrence" parameter. This example shows the default behavior (matching the first occurrence) when no occurrence parameter is specified.
execute
set_pattern( 'ab.' )
execute
set_target( 'ab1-ab2' )
execute
add_target( 'ab1-ab2-ab3-ab4' )
execute
add_target( 'def-def-def' )
execute
add_target( 'def-def-def-def' )
column
target format a15
select 'ab.' as pattern , target ,
regexp_substr( target, 'ab.' ) as matched_value ,
regexp_instr( target, 'ab.' ) as position
from targets;
PATTERN TARGET MATCHED_VALUE POSITION
---------------
--------------- --------------- --------
ab. ab1-ab2 ab1 1
ab1-ab2-ab3-ab4 ab1 1
def-def-def (null) 0
def-def-def-def (null) 0
This example shows how to match the third
occurrence of a pattern. The same technique can be applied to any number of
occurrences.
select 'ab.' as pattern , target ,
regexp_substr( target, 'ab.', 1, 3 ) as matched_value ,
regexp_instr( target, 'ab.', 1, 3 ) as
position
from targets;
PATTERN TARGET MATCHED_VALUE POSITION
---------------
--------------- --------------- --------
ab. ab1-ab2 (null) 0
ab1-ab2-ab3-ab4 ab3 9
def-def-def (null)
0
def-def-def-def (null) 0
Unlike the three REGEXP functions we just examined, REGEXP_LIKE does not accept an occurrence parameter. The following examples show ways of working around this.
This example searches for targets with *exactly* three occurrences of 'def'
select
target from targets where regexp_like( target, '(def.*){3}' ) and not regexp_like( target, '(def.*){4}') ;
TARGET
---------------
def-def-def
This example searches for targets with *at least*
three occurrences of 'def' select target from targets where regexp_like( target, '(def.*){3}' ) ;
TARGET
---------------
def-def-def
def-def-def-def
This example searches for targets with at least
three occurrences of 'ab.' select target from targets where regexp_like( target, '(ab..*){3}' ) ;
TARGET
---------------
ab1-ab2-ab3-ab4
Of course,
some of the requirements above could also be met without using REGEXP_LIKE by
simply using a REGEXP function, as in the following examples.
Targets with at least three occurrences of 'def': select target from targets where regexp_instr( target, 'def', 1, 3 ) != 0 ;
TARGET
---------------
def-def-def
def-def-def-def
Targets
with at least three occurrences of 'ab.'
select
target from targets where regexp_instr( target, 'ab.', 1, 3 ) != 0 ;
TARGET
---------------
ab1-ab2-ab3-ab4
Cleanup
Run the
code on this page to drop the sample tables, procedures, etc. created in
earlier parts of this section. To clear session state changes (e.g. those made
by SET, COLUMN, and VARIABLE commands) exit your SQL*Plus session after running
these cleanup commands.
drop
procedure add_target ;
drop
procedure set_target ;
drop
procedure add_pattern ;
drop
procedure set_pattern ;
drop
view test_results ;
drop
table targets ;
drop
table patterns ;
drop table
character_class_map ;
exit
No comments :
Post a Comment