Thursday, July 23, 2015

Regular Expressions 3

Specifying WHAT to Match - Part 1

In this tutorial we begin examining ways to specify WHAT text to match by exploring patterns for individual characters.
Matching Individual Characters
To match a single character string, e.g. the letter 'b', simply use it as the regular expression pattern.
execute set_pattern( 'b' )

execute set_target( 'aaa'  )
execute add_target( 'abc'  )
execute add_target( 'bbb'  )
execute add_target( 'bcd'  )
execute add_target( 'cde'  )
execute add_target( 'def'  )
execute add_target( 'efg'  )

create or replace view test_results as
select  pattern , target,  case  when regexp_like( target, pattern ) then 'Y'  else 'N' end    as match,
  regexp_substr( target, pattern ) as matched_value ,   regexp_instr( target, pattern ) as position
from   patterns,  targets
order by   patterns.sort_key,  target;

column pattern         format a15
column target          format a10
column match           format a5
column matched_value   format a15
column modifier        format a8
column replaced_target format a15
column position        format 999
column substring       format a15
break on pattern skip 1

set null     '(null)'
set linesize 100

select * from test_results ;

PATTERN
TARGET
MATCH
MATCHED_VALUE
POSITION
b
aaa
N

0
b
abc
Y
b
2
b
bbb
Y
b
1
b
bcd
Y
b
1
b
cde
N

0
b
def
N

0
b
efg
N

0
To match any one of a set of individual characters use character list syntax  '[ ]'.
The following example matches either 'a', 'c', or 'g'.
execute set_pattern( '[acg]' )
select * from test_results ;

PATTERN
TARGET
MATCH
MATCHED_VALUE
POSITION
[acg]
aaa
Y
a
1
[acg]
abc
Y
a
1
[acg]
bbb
N

0
[acg]
bcd
Y
c
2
[acg]
cde
Y
c
1
[acg]
def
N

0
[acg]
efg
Y
g
3
To exclude characters place '^' after the '['.
execute set_pattern( '[^acg]' )
select * from test_results ;

PATTERN
TARGET
MATCH
MATCHED_VALUE
POSITION
[^acg]
aaa
N

0
[^acg]
abc
Y
b
2
[^acg]
bbb
Y
b
1
[^acg]
bcd
Y
b
1
[^acg]
cde
Y
d
2
[^acg]
def
Y
d
1
[^acg]
efg
Y
e
1

Matching any one of a range of characters, e.g. 'a' through 'c', can be done with the range operator '-'.
execute set_pattern( '[a-c]' )

select * from test_results ;

PATTERN
TARGET
MATCH
MATCHED_VALUE
POSITION
[a-c]
aaa
Y
a
1
[a-c]
abc
Y
a
1
[a-c]
bbb
Y
b
1
[a-c]
bcd
Y
b
1
[a-c]
cde
Y
c
1
[a-c]
def
N

0
[a-c]
efg
N

0
Be careful when working with alphabets other than English since a range like '[a-z]' may not include all characters in your alphabet.

execute set_pattern( '[a-z]' )
execute set_target( 'aeiou'  )
execute add_target( 'àéîõü' )

select * from test_results ;

PATTERN
TARGET
MATCH
MATCHED_VALUE
POSITION
[a-z]
aeiou
Y
a
1
[a-z]
…‚Å’o
Y
o
4

POSIX character classes and equivalence classes (discussed in WHAT to Match - 2) offer a solution when working with alphabets other than English.

Case Insensitive Matching
By default, pattern matching with regular expressions is case sensitive. To perform case insensitive matches use the pattern matching modifier 'i' in REGEXP conditions and functions.
First, consider a case sensitive match.
execute set_target( 'a' )
execute add_target( 'b' )
execute add_target( 'c' )
execute add_target( 'A' )
execute add_target( 'B' )
execute add_target( 'C' )

select   '[b]' as pattern,  target ,  regexp_substr( target, '[b]' ) as matched_value from  targets;

PATTERN
TARGET
MATCHED_VALUE
[b]
a
 (null)
[b]
b
b
[b]
c
 (null)
[b]
A
 (null)
[b]
B
 (null)
[b]
C
 (null)

Now consider the same pattern in a case insensitive match. The 'i' modifier is specified in the fifth parameter of the REGEXP_SUBSTR function.

Select '[b]' as pattern, target, regexp_substr( target, '[b]', 1, 1, 'i' ) as matched_value from   targets

PATTERN
TARGET
MATCHED_VALUE
[b]
a
  (null)
[b]
b
b
[b]
c
 (null)
[b]
A
 (null)
[b]
B
B
[b]
C
  (null)

Escaping Special Characters
Attempting to match special characters, like '[', may cause problems if you fail to escape them. The pattern 'a[b', for example, will not work.

execute set_pattern( 'a[b' )
execute set_target( 'abc'   )
execute add_target( 'a[b[c' )
execute add_target( 'a\b\c' )

select * from test_results ;
select * from test_results
                          *
ERROR at line 1:
ORA-12726: unmatched bracket in regular expression
To escape special characters, prefix them with the '\' character.
execute set_pattern( 'a\[b' )

select * from test_results ;

PATTERN
TARGET
MATCH
MATCHED_VALUE
POSITION
a\[b
a[b[c
Y
a[b
1
a\[b
a\b\c
N
   (null)
0
a\[b
abc
N
   (null)
0

execute set_pattern( 'a\\b' )
select * from test_results ;

PATTERN
TARGET
MATCH
MATCHED_VALUE
POSITION
a\\b
a[b[c
N
   (null)
0
a\\b
a\b\c
Y
a\b
1
a\\b
abc
N
   (null)
0

Wildcard Character
To match any character at all, except null and the newline character, use the '.' character.
execute set_pattern( '.' );

execute set_target( 'abc' )
execute add_target( '#24' )
execute add_target( '123' )

select * from test_results ;

PATTERN
TARGET
MATCH
MATCHED_VALUE
POSITION
.
#24
Y
#
1
.
123
Y
1
1
.
abc
Y
a
1

Be careful with nulls, nuls, line feeds, and carriage returns. '.' does not match them all.
execute set_pattern( '.' );
begin set_target( chr(0) );  end; -- NUL character, do not confuse with NULL
begin add_target( chr(10) ); end; -- a Linefeed
begin add_target( chr(13) ); end; -- a Carriage Return
begin add_target( 'a' );     end;
begin add_target( null );    end;
begin add_target( '' );      end; -- an empty string is equivalent to NULL

select   pattern ,  ascii( target ) as target_ascii_code ,  match ,
  ascii( matched_value ) as matched_value_ascii_code ,  position
from   test_results;

PATTERN
TARGET
MATCH
MATCHED_VALUE
POSITION
.
0
Y
0
1
.
10
N
    (null)
0
.
13
Y
13
1
.
97
Y
97
1
.
    (null)
N
    (null)
    (null)
.
    (null)
N
    (null)
    (null)

If you need '.' to match the newline character you can use the 'n' modifier in REGEXP conditions and functions.
Select   '.' as pattern ,   ascii( target ) as target_ascii_code ,
  case    when regexp_like( target, '.', 'n' ) then 'Y'    else 'N'    end as match ,
  regexp_instr( target, '.', 1, 1, 0, 'n' ) as position
from  targets;

PATTERN
TARGET_ASCII_CODE
MATCH
POSITION
.
0
Y
1
.
10
Y
1
.
13
Y
1
.
97
Y
1
.
     (null)
N
     (null)
.
     (null)
N
     (null)

Nulls
Using a null for either a regular expression pattern or a search target always yields an unknown result.
execute set_pattern( null )
select * from test_results ;
 PATTERN
TARGET
MATCH
MATCHED_VALUE
POSITION
(null)

N
(null)
(null)
(null)

N
(null)
(null)
(null)

N
(null)
(null)
(null)
a
N
(null)
(null)
(null)
(null)
N
(null)
(null)
(null)
(null)
N
(null)
(null)

execute set_pattern( '.' )
execute set_target( 'abc' )
execute add_target( null  )

select * from test_results ;

PATTERN
TARGET
MATCH
MATCHED_VALUE 
POSITION
.
abc
Y
a
1
.
(null)
N
(null)
(null)

Gotchas
You cannot specify multi-character strings in character lists. A pattern like 'a[(bc)]' does NOT mean 'a' followed by the string 'bc'. It has the same meaning as 'a[bc]'.
execute set_pattern( 'a[(bc)]' )

execute set_target( 'abc' )
execute add_target( 'ab'  )
execute add_target( 'ac'  )
execute add_target( 'bc'  )

select * from test_results ;

PATTERN
TARGET
MATCH
MATCHED_VALUE 
POSITION
a[(bc)]
ab
Y
ab
1
a[(bc)]
abc
Y
ab
1
a[(bc)]
ac
Y
ac
1
a[(bc)]
bc
N
 (null)
0

execute set_pattern( 'a[bc]' )
select * from test_results ;

PATTERN
TARGET
MATCH
MATCHED_VALUE 
POSITION
a[bc)]
ab
Y
ab
1
a[bc)]
abc
Y
ab
1
a[bc)]
ac
Y
ac
1
a[bc)]
bc
N
 (null)
0
You cannot use 'not' to turn an expression with an unknown result into an expression with a true result. For example, neither of the following two queries return any rows, though some might expect the second one would.

set feedback on
select * from   dual where  regexp_like( null, 'abc' ) ;
no rows selected

No comments :

Post a Comment