CL-PPCRE


Tags: Library, Regular Expressions, CL-PPCRE

CL-PPCRE is a library for regular-expression pattern matching, written by Edi Weitz.

It is the library most commonly used for regular-expression processing in Common Lisp.





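Representing regular expressions

A regular expression can be written either as a string or as an S-expression. A regular expression given as a string is ultimately converted to its S-expression form before it is processed. The S-expression syntax is easy to read, but it can require far more characters, so it is not well suited to writing by hand.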

;; "\d+" as a string
"\\d+"

;; "\d+" as an S-expression
'(:greedy-repetition 1 nil :digit-class)

To convert the string form into the S-expression form, use cl-ppcre:parse-string.
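
As a small illustration of the two representations, the sketch below parses the string form and then scans with each form. It assumes CL-PPCRE is already loaded (for instance through Quicklisp) and that the ppcre nickname is available, as in the other snippets on this page; the sample target string is made up.

```lisp
;; Assumes CL-PPCRE is already loaded, e.g. (ql:quickload "cl-ppcre").

;; Convert the string form into the S-expression (parse tree) form.
(ppcre:parse-string "\\d+")
;=> (:GREEDY-REPETITION 1 NIL :DIGIT-CLASS)

;; Both forms can be handed to the scanning functions directly.
(ppcre:scan "\\d+" "abc123")
;=> 3
;   6
;   #()
;   #()

(ppcre:scan '(:greedy-repetition 1 nil :digit-class) "abc123")
;=> 3
;   6
;   #()
;   #()
```

The first two values are the start and end of the match; the two empty vectors are the register positions, of which this pattern has none.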

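Caveats for the string representation

When a regular expression is written as a string, take care when using the predefined character classes or escaping metacharacters. For Common Lisp strings, the standard specifies: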

If a single escape character is seen, the single escape character is discarded,
the next character is accumulated, and accumulation continues.

Common Lisp HyperSpec: 2.4.5 Double-Quote

As a result, "\d" for example simply means the same thing as "d". The backslash therefore has to be escaped a second time, and the regular expression must be written as "\\d".
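
The effect of the reader rule can be checked at the REPL. This is a minimal sketch, again assuming CL-PPCRE is loaded under the ppcre nickname; the target string "a1" is only an example.

```lisp
;; Assumes CL-PPCRE is already loaded, e.g. (ql:quickload "cl-ppcre").

;; The reader discards the single backslash, so "\d" is a one-character string.
(length "\d")
;=> 1
(string= "\d" "d")
;=> T

;; "\\d" reaches CL-PPCRE as backslash + d, i.e. the digit class.
(ppcre:scan "\\d" "a1")
;=> 1
;   2
;   #()
;   #()

;; "\d" reaches CL-PPCRE as the literal letter d, which "a1" does not contain.
(ppcre:scan "\d" "a1")
;=> NIL
```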


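Registers and back-references

The CL-PPCRE documentation uses the term "register" for the record of what a grouped subpattern matched, which is saved so that it can be referred back to. Some functions in the CL-PPCRE API return the start and end positions of each register as arrays, and others return the contents of each register as an array of strings, but the one most often used for back-references is probably cl-ppcre:register-groups-bind.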

(ppcre:register-groups-bind (entire a b c)
    ("((a)(b)(c))" "abc")
  (values entire a b c))
;=> "abc"
;   "a"
;   "b"
;   "c"

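This macro binds the registers, in order, to the variables with the given names and then executes the body. If the regular expression does not match, the body is not executed and the form returns nil.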
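
For comparison, the array-returning functions mentioned above can be sketched as follows. The example assumes CL-PPCRE is loaded under the ppcre nickname; the patterns and target strings are invented for illustration.

```lisp
;; Assumes CL-PPCRE is already loaded, e.g. (ql:quickload "cl-ppcre").

;; SCAN returns the match start and end, then the start and end
;; positions of every register as two arrays.
(ppcre:scan "(a)(b)" "xab")
;=> 1
;   3
;   #(1 2)
;   #(2 3)

;; SCAN-TO-STRINGS returns the matched substring and an array holding
;; the contents of every register.
(ppcre:scan-to-strings "(\\d+)-(\\d+)" "tel 12-345")
;=> "12-345"
;   #("12" "345")
```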


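Performance tuning

By default, CL-PPCRE is configured to favor low memory use. Changing the values of certain special variables makes it run faster in exchange for using more memory.

Note that CL-PPCRE makes heavy use of compiler macros, so some care is needed to make sure the changed values of these special variables are actually seen. For example, when the regular expression is a constant, the scanner may be created at load time, outside the dynamic extent of a binding made around the call at run time.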
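
One way to sidestep the issue is to build the scanner explicitly while the binding is in effect and then reuse it. The sketch below takes that approach; make-fast-scanner and *commodo-scanner* are hypothetical names introduced only for this example, and CL-PPCRE is assumed to be loaded under the ppcre nickname.

```lisp
;; Assumes CL-PPCRE is already loaded, e.g. (ql:quickload "cl-ppcre").

(defun make-fast-scanner (regex)
  "Hypothetical helper: create a scanner for REGEX with BMH matchers enabled."
  ;; REGEX is a variable here, not a literal, so the scanner is created
  ;; at run time, while the binding below is in effect.
  (let ((ppcre:*use-bmh-matchers* t))
    (ppcre:create-scanner regex)))

;; Build the scanner once and reuse it for every search.
(defparameter *commodo-scanner* (make-fast-scanner "commodo"))

(ppcre:scan *commodo-scanner* "ex ea commodo consequat")
;=> 6
;   13
;   #()
;   #()
```

The benchmarks on this page follow the same pattern: create-scanner is called inside the let* that binds the special variable, and the resulting scanner is passed to ppcre:scan.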

cl-ppcre:*use-bmh-matchers*

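When this variable is true, constant strings are searched for with Horspool's algorithm (Boyer-Moore-Horspool), a variant of the Boyer-Moore method. Matching becomes faster, but memory use grows on the order of megabytes. The default is nil.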
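
To give an idea of what Horspool's algorithm does, here is a minimal, self-contained sketch in plain Common Lisp. It is not CL-PPCRE's implementation: bmh-search is a hypothetical function written for this page, it stores the shift table in a hash table, and it compares each window with string= rather than character by character from the right.

```lisp
(defun bmh-search (pattern text)
  "Return the index of the first occurrence of PATTERN in TEXT, or NIL."
  (let* ((m (length pattern))
         (n (length text))
         (skip (make-hash-table)))
    (when (zerop m)
      (return-from bmh-search 0))
    ;; Shift table: for every character of PATTERN except the last one,
    ;; the distance from its rightmost occurrence to the end of PATTERN.
    (loop for i from 0 below (1- m)
          do (setf (gethash (char pattern i) skip) (- m 1 i)))
    (loop with pos = 0
          while (<= (+ pos m) n)
          do (if (string= pattern text :start2 pos :end2 (+ pos m))
                 (return pos)
                 ;; Shift by the table entry for the character aligned with
                 ;; the end of the window; unknown characters shift by M.
                 (incf pos (gethash (char text (+ pos m -1)) skip m)))
          finally (return nil))))

(bmh-search "commodo" "ex ea commodo consequat")
;=> 6
(bmh-search "xyz" "abc")
;=> NIL
```

The speed-up comes from the shift table, which lets the search skip several positions at once after a mismatch instead of advancing one character at a time.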

cl-ppcre:*optimize-char-classes*

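When this variable is set, character-class lookups can be done in constant time. Matching is faster, but creating a scanner becomes slower and memory use increases. The default is nil; the examples on this page also try the values :hash-table, :charset, and :charmap.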

Example: the effect of *use-bmh-matchers*

(declaim (optimize (speed 3) (debug 0) (safety 0)))

;;; Memory usage

(let* ((ppcre:*use-bmh-matchers* nil))
  (time (ppcre:create-scanner "commodo")))
;-> (CL-PPCRE:CREATE-SCANNER "commodo") took 0 milliseconds (0.000 seconds) to run 
;                       with 2 available CPU cores.
;   During that period, 0 milliseconds (0.000 seconds) were spent in user mode
;                       0 milliseconds (0.000 seconds) were spent in system mode
;    832 bytes of memory allocated.

(let* ((ppcre:*use-bmh-matchers* t))
  (time (ppcre:create-scanner "commodo")))
;-> (CL-PPCRE:CREATE-SCANNER "commodo") took 125 milliseconds (0.125 seconds) to run 
;                       with 2 available CPU cores.
;   During that period, 156 milliseconds (0.156 seconds) were spent in user mode
;                       0 milliseconds (0.000 seconds) were spent in system mode
;   94 milliseconds (0.094 seconds) was spent in GC.
;    4,457,288 bytes of memory allocated.

;;; Speed (ITER is the macro from the iterate library)

(defvar *lipsum*
  "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum
dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident,
sunt in culpa qui officia deserunt mollit anim id est laborum.")

(let* ((ppcre:*use-bmh-matchers* nil)
       (scanner (ppcre:create-scanner "commodo")))
  (time (iter (repeat 100000) (ppcre:scan scanner *lipsum*))))
;-> (ITER (REPEAT 100000) (CL-PPCRE:SCAN SCANNER *LIPSUM*)) took 1,765 milliseconds (1.765 seconds) to run 
;                       with 2 available CPU cores.
;   During that period, 2,906 milliseconds (2.906 seconds) were spent in user mode
;                       16 milliseconds (0.016 seconds) were spent in system mode
;   500 milliseconds (0.500 seconds) was spent in GC.
;    64 bytes of memory allocated.

(let* ((ppcre:*use-bmh-matchers* t)
       (scanner (ppcre:create-scanner "commodo")))
  (time (iter (repeat 100000) (ppcre:scan scanner *lipsum*))))
;-> (ITER (REPEAT 100000) (CL-PPCRE:SCAN SCANNER *LIPSUM*)) took 422 milliseconds (0.422 seconds) to run 
;                       with 2 available CPU cores.
;   During that period, 579 milliseconds (0.579 seconds) were spent in user mode
;                       0 milliseconds (0.000 seconds) were spent in system mode
;   157 milliseconds (0.157 seconds) was spent in GC.
;    64 bytes of memory allocated.

Example: the effect of *optimize-char-classes*

(declaim (optimize (speed 3) (debug 0) (safety 0)))

;;; Memory usage

(let* ((ppcre:*optimize-char-classes* nil))
  (time (ppcre:create-scanner "[^\\x0-\\x10ffff]+")))
;-> (CL-PPCRE:CREATE-SCANNER "[^\\x0-\\x10ffff]+") took 0 milliseconds (0.000 seconds) to run 
;                       with 2 available CPU cores.
;   During that period, 0 milliseconds (0.000 seconds) were spent in user mode
;                       0 milliseconds (0.000 seconds) were spent in system mode
;    1,128 bytes of memory allocated.

(let* ((ppcre:*optimize-char-classes* :hash-table))
  (time (ppcre:create-scanner "[^\\x0-\\x10ffff]+")))
;-> (CL-PPCRE:CREATE-SCANNER "[^\\x0-\\x10ffff]+") took 1,687 milliseconds (1.687 seconds) to run 
;                       with 2 available CPU cores.
;   During that period, 1,828 milliseconds (1.828 seconds) were spent in user mode
;                       0 milliseconds (0.000 seconds) were spent in system mode
;   531 milliseconds (0.531 seconds) was spent in GC.
;    43,061,488 bytes of memory allocated.

(let* ((ppcre:*optimize-char-classes* :charset))
  (time (ppcre:create-scanner "[^\\x0-\\x10ffff]+")))
;-> (CL-PPCRE:CREATE-SCANNER "[^\\x0-\\x10ffff]+") took 375 milliseconds (0.375 seconds) to run 
;                       with 2 available CPU cores.
;   During that period, 547 milliseconds (0.547 seconds) were spent in user mode
;                       15 milliseconds (0.015 seconds) were spent in system mode
;   125 milliseconds (0.125 seconds) was spent in GC.
;    6,030,592 bytes of memory allocated.

(let* ((ppcre:*optimize-char-classes* :charmap))
  (time (ppcre:create-scanner "[^\\x0-\\x10ffff]+")))
;-> (CL-PPCRE:CREATE-SCANNER "[^\\x0-\\x10ffff]+") took 156 milliseconds (0.156 seconds) to run 
;                       with 2 available CPU cores.
;   During that period, 156 milliseconds (0.156 seconds) were spent in user mode
;                       0 milliseconds (0.000 seconds) were spent in system mode
;    1,264 bytes of memory allocated.

;;; Speed (ITER is the macro from the iterate library)

(defvar *lipsum*
  "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum
dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident,
sunt in culpa qui officia deserunt mollit anim id est laborum.")

(let* ((ppcre:*optimize-char-classes* nil)
       (scanner (ppcre:create-scanner "[^\\x0-\\x10ffff]+")))
  (time (iter (repeat 100000) (ppcre:scan scanner *lipsum*))))
;-> (ITER (REPEAT 100000) (CL-PPCRE:SCAN SCANNER *LIPSUM*)) took 891 milliseconds (0.891 seconds) to run 
;                       with 2 available CPU cores.
;   During that period, 891 milliseconds (0.891 seconds) were spent in user mode
;                       0 milliseconds (0.000 seconds) were spent in system mode
;    32 bytes of memory allocated.

(let* ((ppcre:*optimize-char-classes* :hash-table)
       (scanner (ppcre:create-scanner "[^\\x0-\\x10ffff]+")))
  (time (iter (repeat 100000) (ppcre:scan scanner *lipsum*))))
;-> (ITER (REPEAT 100000) (CL-PPCRE:SCAN SCANNER *LIPSUM*)) took 562 milliseconds (0.562 seconds) to run 
;                       with 2 available CPU cores.
;   During that period, 562 milliseconds (0.562 seconds) were spent in user mode
;                       0 milliseconds (0.000 seconds) were spent in system mode
;    32 bytes of memory allocated.

(let* ((ppcre:*optimize-char-classes* :charset)
       (scanner (ppcre:create-scanner "[^\\x0-\\x10ffff]+")))
  (time (iter (repeat 100000) (ppcre:scan scanner *lipsum*))))
;-> (ITER (REPEAT 100000) (CL-PPCRE:SCAN SCANNER *LIPSUM*)) took 235 milliseconds (0.235 seconds) to run 
;                       with 2 available CPU cores.
;   During that period, 234 milliseconds (0.234 seconds) were spent in user mode
;                       0 milliseconds (0.000 seconds) were spent in system mode
;    32 bytes of memory allocated.

(let* ((ppcre:*optimize-char-classes* :charmap)
       (scanner (ppcre:create-scanner "[^\\x0-\\x10ffff]+")))
  (time (iter (repeat 100000) (ppcre:scan scanner *lipsum*))))
;-> (ITER (REPEAT 100000) (CL-PPCRE:SCAN SCANNER *LIPSUM*)) took 265 milliseconds (0.265 seconds) to run 
;                       with 2 available CPU cores.
;   During that period, 266 milliseconds (0.266 seconds) were spent in user mode
;                       0 milliseconds (0.000 seconds) were spent in system mode
;    32 bytes of memory allocated.

Last modified : 2011/09/03 19:28:45 JST
CC0 1.0