[C#] Mastering Regular Expressions! + Removing Whitespace

by 준환이형님_ 2011. 8. 5.

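Aaargh!!! Every time I try to get anything done, I run into these darn regular expressions. I was about to forget the Google references, open a book, and just memorize the whole thing!!

But I changed my mind one more time and decided to write a post instead..

Along the way, as a bonus, I also found a [way to strip whitespace] using regular expressions.. Whitespace.. I had never heard the word before.. but somehow my heart understood it first..

"trim(), thanks for everything so far. I'm a bit of a neat freak.. so every time I snipped a string I felt uneasy. It felt like a burp was coming up (in Korean, 'trim' sounds like the word for burp; a shameless attempt at an old-fashioned gag)"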




Source: http://helloboy.tistory.com/entry/%EC%A0%95%EA%B7%9C-%ED%91%9C%ED%98%84%EC%8B%9D-%EC%98%88%EC%A0%9C1

Matching patterns in text: the basics

1. Character literals

/a/


Mary had a little lamb.
And everywhere that Mary
went, the lamb was sure
to go.

/Mary/

Mary had a little lamb.
And everywhere that Mary
went, the lamb was sure
to go.

2. "Escaped" characters literals

/.*/


Special characters must be escaped.*

/\.\*/
Special characters must be escaped.*
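
Since the highlighting is gone, here is a minimal Python sketch of the difference between the two patterns above. The variable names are only for illustration.

```python
import re

text = "Special characters must be escaped.*"

# Unescaped: "." matches any character and "*" repeats it,
# so the pattern swallows the whole line.
unescaped = re.search(r".*", text).group()

# Escaped: only the literal two-character sequence ".*" matches.
escaped = re.search(r"\.\*", text).group()

print(unescaped)
print(escaped)

# re.escape builds the escaped form from a literal string.
print(re.escape(".*"))
```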

3. Positional special characters

/^Mary/


Mary had a little lamb.
And everywhere that Mary
went, the lamb was sure
to go.

/Mary$/

Mary had a little lamb.
And everywhere that Mary
went, the lamb was sure
to go.
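
A rough sketch of the two anchors in Python, assuming the rhyme is held in one multi-line string (the name `rhyme` is mine). The examples above treat each line separately, which in Python corresponds to the `re.MULTILINE` flag.

```python
import re

rhyme = (
    "Mary had a little lamb.\n"
    "And everywhere that Mary\n"
    "went, the lamb was sure\n"
    "to go."
)

# With re.MULTILINE, ^ and $ anchor at every line boundary.
starts = [m.start() for m in re.finditer(r"^Mary", rhyme, re.MULTILINE)]
ends = [m.start() for m in re.finditer(r"Mary$", rhyme, re.MULTILINE)]

# Without the flag, $ only anchors at the end of the whole string.
no_flag = re.findall(r"Mary$", rhyme)

print(starts, ends, no_flag)
```

Only the first "Mary" sits at the start of a line, and only the second one sits at the end of a line.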

4. The "wildcard" character

/.a/ 


Mary had a little lamb.
And everywhere that Mary
went, the lamb was sure
to go.

5. Grouping regular expressions

/(Mary)( )(had)/ 


Mary had a little lamb.
And everywhere that Mary
went, the lamb was sure
to go.

6. Character classes

/[a-z]a/ 


Mary had a little lamb.
And everywhere that Mary
went, the lamb was sure
to go.

7. Complement operator

/[^a-z]a/ 


Mary had a little lamb.
And everywhere that Mary
went, the lamb was sure
to go.
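
To see what the wildcard, the character class, and its complement each pick out of the rhyme, something like the following should do. It is just a sketch with `re.findall`; the variable names are made up.

```python
import re

rhyme = (
    "Mary had a little lamb.\n"
    "And everywhere that Mary\n"
    "went, the lamb was sure\n"
    "to go."
)

# Wildcard: any character (except newline) followed by "a".
any_then_a = re.findall(r".a", rhyme)

# Character class: a lowercase letter followed by "a".
lower_then_a = re.findall(r"[a-z]a", rhyme)

# Complement: anything that is NOT a lowercase letter, followed by "a".
other_then_a = re.findall(r"[^a-z]a", rhyme)

print(any_then_a)
print(lower_then_a)
print(other_then_a)
```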

8. Alternation of patterns

/cat|dog|bird/

The pet store sold cats, dogs, and birds.

/=first|second=/

=first first= # =second second= # =first= # =second=

/(=)(first)|(second)(=)/

=first first= # =second second= # =first= # =second=

/=(first|second)=/

=first first= # =second second= # =first= # =second=
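
The three alternation patterns are easy to confuse, so here is a small sketch that lists what each one matches. The helper `matches` is hypothetical, written only for this illustration.

```python
import re

text = "=first first= # =second second= # =first= # =second="

def matches(pattern):
    # Return the full text of every match (group 0), ignoring subgroups.
    return [m.group(0) for m in re.finditer(pattern, text)]

# "|" has the lowest precedence: this means "=first" OR "second=".
loose = matches(r"=first|second=")

# Adding groups around the pieces does not change that precedence.
grouped_pieces = matches(r"(=)(first)|(second)(=)")

# Parentheses around the alternatives limit the "|" to the words.
strict = matches(r"=(first|second)=")

print(loose)
print(grouped_pieces)
print(strict)
```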

9. The basic abstract quantifier

/@(=\+=)*@/ 


Match with zero in the middle: @@
Subexpression occurs, but...: @=+=ABC@
Lots of occurrences: @=+==+==+==+==+=@
Must repeat entire pattern: @=+==+=+==+=@
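
A quick check of the four sample strings, as a sketch. Note that the literal plus sign is written as `\+` here, because an unescaped `+` would act as a quantifier on the preceding `=`.

```python
import re

# Zero or more repetitions of the whole group "=+=" between two "@".
pattern = re.compile(r"@(=\+=)*@")

samples = [
    "@@",
    "@=+=ABC@",
    "@=+==+==+==+==+=@",
    "@=+==+=+==+=@",
]

matched = [s for s in samples if pattern.search(s)]
print(matched)
```

Only the first and third strings match: the second has extra text inside, and the fourth breaks the "=+=" unit halfway through.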


Matching patterns in text: intermediate


1. More abstract quantifiers

/A+B*C?D/


AAAD
ABBBBCD
BBBCD
ABCCD
AAABBBC
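
Which of the five lines match? A small sketch to check, reading `+` as "one or more", `*` as "zero or more", and `?` as "zero or one".

```python
import re

lines = ["AAAD", "ABBBBCD", "BBBCD", "ABCCD", "AAABBBC"]

# One or more A, any number of B, an optional C, then a D.
pattern = re.compile(r"A+B*C?D")

matched = [line for line in lines if pattern.search(line)]
print(matched)
```

"BBBCD" has no A, "ABCCD" has one C too many, and "AAABBBC" has no D.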

2. Numeric quantifiers

/a{5} b{,6} c{4,8}/


aaaaa bbbbb ccccc
aaa bbb ccc
aaaaa bbbbbbbbbbbbbb ccccc

/a+ b{3,} c?/

aaaaa bbbbb ccccc
aaa bbb ccc
aaaaa bbbbbbbbbbbbbb ccccc

/a{5} b{6,} c{4,8}/

aaaaa bbbbb ccccc
aaa bbb ccc
aaaaa bbbbbbbbbbbbbb ccccc
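
The three numeric-quantifier patterns can be compared side by side with a sketch like this. One caveat: Python's `re` reads `{,6}` as `{0,6}`, but other engines may treat it differently, so `{0,6}` is the safer spelling if the pattern has to travel.

```python
import re

lines = [
    "aaaaa bbbbb ccccc",
    "aaa bbb ccc",
    "aaaaa bbbbbbbbbbbbbb ccccc",
]

patterns = [
    r"a{5} b{,6} c{4,8}",   # exactly 5 a, at most 6 b, 4 to 8 c
    r"a+ b{3,} c?",         # 1+ a, at least 3 b, optional c
    r"a{5} b{6,} c{4,8}",   # exactly 5 a, at least 6 b, 4 to 8 c
]

# For each pattern, collect the indexes of the lines it matches.
results = {
    p: [i for i, line in enumerate(lines) if re.search(p, line)]
    for p in patterns
}

for p, hits in results.items():
    print(p, "->", hits)
```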


3. Backreferences

/(abc|xyz) \1/


jkl abc xyz
jkl xyz abc
jkl abc abc
jkl xyz xyz

/(abc|xyz) (abc|xyz)/

jkl abc xyz
jkl xyz abc
jkl abc abc
jkl xyz xyz
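
A sketch of the difference: `\1` means "the same text the first group just matched", while repeating the group means "any of the alternatives again".

```python
import re

lines = ["jkl abc xyz", "jkl xyz abc", "jkl abc abc", "jkl xyz xyz"]

# Backreference: the second word must repeat the first one exactly.
same_word = [l for l in lines if re.search(r"(abc|xyz) \1", l)]

# Repeated group: the second word may be either alternative.
any_word = [l for l in lines if re.search(r"(abc|xyz) (abc|xyz)", l)]

print(same_word)
print(any_word)
```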

4. Don't match more than you want to

/th.*s/


-- I want to match the words that start
-- with 'th' and end with 's'.
this
thus
thistle
this line matches too much

5. Tricks for restraining matches

/th[^s]*./


-- I want to match the words that start
-- with 'th' and end with 's'.
this
thus
thistle
this line matches too much 
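
Here is roughly how the greedy pattern and the restrained one differ on the four word lines. This is only a sketch; the `--` comment lines are left out.

```python
import re

lines = ["this", "thus", "thistle", "this line matches too much"]

# Greedy: ".*" runs to the LAST "s" on the line.
greedy = [re.search(r"th.*s", l).group() for l in lines]

# Restrained: "[^s]*" cannot cross an "s", so the match stops at the first one.
restrained = [re.search(r"th[^s]*.", l).group() for l in lines]

print(greedy)
print(restrained)
```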


A literal-string modification example

s/cat/dog/g 

< The zoo had wild dogs, bobcats, lions, and other wild cats.
> The zoo had wild dogs, bobdogs, lions, and other wild dogs.

A pattern-match modification example

s/cat|dog/snake/g 

< The zoo had wild dogs, bobcats, lions, and other wild cats.
> The zoo had wild snakes, bobsnakes, lions, and other wild snakes.

s/[a-z]+i[a-z]*/nice/g 

< The zoo had wild dogs, bobcats, lions, and other wild cats.
> The zoo had nice dogs, bobcats, nice, and other nice cats.
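
The `s/.../.../g` notation maps onto `re.sub` in Python, which replaces every match by default. A minimal sketch reproducing the three results above:

```python
import re

zoo = "The zoo had wild dogs, bobcats, lions, and other wild cats."

# s/cat/dog/g : literal replacement
literal = re.sub(r"cat", "dog", zoo)

# s/cat|dog/snake/g : either alternative is replaced
either = re.sub(r"cat|dog", "snake", zoo)

# s/[a-z]+i[a-z]*/nice/g : any lowercase word with an "i" after its first letter
with_i = re.sub(r"[a-z]+i[a-z]*", "nice", zoo)

print(literal)
print(either)
print(with_i)
```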


Modification using backreferences

s/([A-Z])([0-9]{2,4}) /\2:\1 /g 

< A37 B4 C107 D54112 E1103 XXX
> 37:A B4 107:C D54112 1103:E XXX
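
In Python the same swap can be sketched with `re.sub`, where `\1` and `\2` in the replacement refer to the captured letter and digits.

```python
import re

codes = "A37 B4 C107 D54112 E1103 XXX"

# Capture one uppercase letter and 2 to 4 digits that are followed by a space,
# then write them back in swapped order with a colon in between.
swapped = re.sub(r"([A-Z])([0-9]{2,4}) ", r"\2:\1 ", codes)

print(swapped)
```

"B4" has too few digits and "D54112" has too many before the space, so both are left alone.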
    


Advanced regular expression extensions

Non-greedy quantifiers

/th.*s/

-- I want to match the words that start
-- with 'th' and end with 's'.
this line matches just right
this # thus # thistle

/th.*?s/

-- I want to match the words that start
-- with 'th' and end with 's'.
this # thus # thistle
this line matches just right

/th.*?s /

-- I want to match the words that start
-- with 'th' and end with 's'. (FINALLY!)
this # thus # thistle
this line matches just right
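
The three variants side by side, as a sketch on the "this # thus # thistle" line:

```python
import re

words = "this # thus # thistle"

# Greedy: one long match, up to the last "s" on the line.
greedy = re.findall(r"th.*s", words)

# Non-greedy: ".*?" stops at the first "s" it can.
lazy = re.findall(r"th.*?s", words)

# Non-greedy plus a trailing space: the "s" must end the word.
lazy_word_end = re.findall(r"th.*?s ", words)

print(greedy)
print(lazy)
print(lazy_word_end)
```

The last pattern drops "thistle" because its "s" is not followed by a space.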
    

Pattern-match modifiers

/M.*[ise] /

MAINE # Massachusetts # Colorado #
mississippi # Missouri # Minnesota #

/M.*[ise] /i

MAINE # Massachusetts # Colorado #
mississippi # Missouri # Minnesota #

/M.*[ise] /gis

MAINE # Massachusetts # Colorado #
mississippi # Missouri 
# Minnesota #
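
In Python the trailing modifiers become flags: `i` is `re.IGNORECASE`, `s` is `re.DOTALL`, and `g` roughly corresponds to using `re.findall` instead of a single search. A sketch on the two-line version of the text:

```python
import re

states = (
    "MAINE # Massachusetts # Colorado #\n"
    "mississippi # Missouri # Minnesota #"
)

# No flags: case-sensitive, and "." stops at the newline.
plain = re.findall(r"M.*[ise] ", states)

# i: "M" also matches "m", and [ise] also matches uppercase letters.
ignore_case = re.findall(r"M.*[ise] ", states, re.IGNORECASE)

# i + s: "." now crosses the newline, so one match spans both lines.
dot_all = re.findall(r"M.*[ise] ", states, re.IGNORECASE | re.DOTALL)

print(plain)
print(ignore_case)
print(dot_all)
```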
    

Changing backreference behavior

s/([A-Z])(?:-[a-z]{3}-)([0-9]*)/\1\2/g

< A-xyz-37 # B:abcd:142 # C-wxy-66 # D-qrs-93
> A37 # B:abcd:142 # C66 # D93
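
A sketch of the same substitution in Python. Because `(?:...)` groups without capturing, the digits end up in group 2 rather than group 3.

```python
import re

txt = "A-xyz-37 # B:abcd:142 # C-wxy-66 # D-qrs-93"
pattern = re.compile(r"([A-Z])(?:-[a-z]{3}-)([0-9]*)")

# Only two groups are numbered; the middle one is non-capturing.
first = pattern.search(txt)
print(first.groups())

cleaned = pattern.sub(r"\1\2", txt)
print(cleaned)
```

"B:abcd:142" does not fit the pattern, so it passes through unchanged.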
    

Naming backreferences

import re
txt = "A-xyz-37 # B:abcd:142 # C-wxy-66 # D-qrs-93"
# Raw strings keep the backslashes in \g<name> intact; print() is the Python 3 form.
print(re.sub(r"(?P<prefix>[A-Z])(-[a-z]{3}-)(?P<id>[0-9]*)",
             r"\g<prefix>\g<id>", txt))


A37 # B:abcd:142 # C66 # D93

Lookahead assertions

s/([A-Z]-)(?=[a-z]{3})([A-Za-z0-9]*)/\2\1/g

< A-xyz37 # B-ab6142 # C-Wxy66 # D-qrs93
> xyz37A- # B-ab6142 # C-Wxy66 # qrs93D-

s/([A-Z]-)(?![a-z]{3})([A-Za-z0-9]*)/\2\1/g

< A-xyz37 # B-ab6142 # C-Wxy66 # D-qrs93
> A-xyz37 # ab6142B- # Wxy66C- # D-qrs93
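
A sketch of positive and negative lookahead in Python. The patterns in the sketch capture `[A-Za-z0-9]*` with no trailing space, which is what reproduces the outputs listed above. A lookahead checks what comes next without consuming it, so group 2 still starts at the same position.

```python
import re

txt = "A-xyz37 # B-ab6142 # C-Wxy66 # D-qrs93"

# (?=...) : swap only when three lowercase letters follow the prefix.
positive = re.sub(r"([A-Z]-)(?=[a-z]{3})([A-Za-z0-9]*)", r"\2\1", txt)

# (?!...) : swap only when three lowercase letters do NOT follow.
negative = re.sub(r"([A-Z]-)(?![a-z]{3})([A-Za-z0-9]*)", r"\2\1", txt)

print(positive)
print(negative)
```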

Making regular expressions more readable

/                   # identify URLs within a text file
              [^="] # do not match URLs in IMG tags like:
                    # <img src="http://mysite.com/mypic.png">
(?:http|ftp|gopher) # make sure we find a resource type
              :\/\/ # ...needs to be followed by colon-slash-slash
          [^ \n\r]+ # stuff other than space, newline, carriage return is in URL
        (?=[\s\.,]) # assert: followed by whitespace/period/comma
/x

The URL for my site is: http://mysite.com/mydoc.html.  You
might also enjoy ftp://yoursite.com/index.html for a good
place to download files.
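
In Python the commented layout corresponds to the `re.VERBOSE` flag. The sketch below is my own adaptation, not the original author's code: the alternatives are grouped, a capture group is added around the URL, and a trailing period or comma is stripped afterwards, because the greedy `[^ \n\r]+` would otherwise keep it.

```python
import re

url_pattern = re.compile(r"""
    [^="]                    # skip URLs inside attributes such as src=...
    (                        # group 1: the URL itself
      (?:http|ftp|gopher)    # resource type
      ://                    # colon-slash-slash
      [^ \n\r]+              # anything but space, newline, carriage return
    )
    (?=[\s.,])               # followed by whitespace, period, or comma
""", re.VERBOSE)

text = (
    "The URL for my site is: http://mysite.com/mydoc.html.  You\n"
    "might also enjoy ftp://yoursite.com/index.html for a good\n"
    "place to download files."
)

# findall returns group 1 for each match; drop trailing punctuation.
urls = [u.rstrip(".,") for u in url_pattern.findall(text)]
print(urls)

# A URL inside an IMG tag is preceded by a quote, so it is skipped.
in_tag = url_pattern.findall('<img src="http://mysite.com/mypic.png">')
print(in_tag)
```

One limitation of this approach: since `[^="]` has to consume a character, a URL at the very start of the text would not be found.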