Not logged in. Login

Symbolic Data Processing Case Study: icgrep

Case Study: Extensive Symbolic Computing in icgrep

As an extended example of what you can do with symbolic computing techniques, consider icgrep, an SFU-developed command-line tool for regular expression search.

  • currently written in C++, but many components prototyped in Haskell
  • designed as a modern high-performance replacement for the traditional grep program
  • implemented using parallel techniques
    • bitwise data parallel regular expression matching
    • SIMD parallelism
    • multithread parallelism
  • designed for outstanding performance with complicated regular expressions and big data applications

The Symbolic Data Domains

  • RE: regular expressions
  • Pablo: tiny programming language for unbounded bit streams
  • LLVM IR: portable intermediate representation/high-level assembly language
  • ASM: assembler language of the target machine.

When using icgrep, you can always display the symbolic output of the various phases using the command flags

  • -ShowREs
  • -ShowPablo
  • -ShowIR
  • -ShowASM

The Phases of icgrep

There are many symbolic computing phases in icgrep.

  • RE phases
    • parse REs into internal AST (abstract syntax tree form)
    • perform RE to RE transformations, for example
      • Case-insensitive transformation
      • Star Normal Form
    • compile REs to Pablo using the Parabix regular expression algorithm
  • Pablo phases
    • perform Pablo to Pablo transformations, for example
      • common subexpression elimination
      • dead-code elimination
      • sinking
    • compile Pablo into LLVM IR
  • LLVM IR passes
    • IR to IR transformations
    • compile LLVM IR to target machine architecture
  • ASM phases
    • ASM to ASM transformations, such as peephole optimization
    • generate final binary machine code

Example

./icgrep '(?i)a(?:[bc]*[d-f]*)*[a-z]{2,9}t' -ShowAllREs -ShowPablo -ShowIR -ShowASM

RE Passes

Parser:
Group(+i:(Seq[CC "CC_61" [97]/Unicode,
              Rep((Seq[Rep(CC "CC_62_63" [98-99]/Unicode,0,Unbounded), 
                       Rep(CC "CC_64_66" [100-102]/Unicode,0,Unbounded)]),0,Unbounded),
              Rep(CC "CC_61_7a" [97-122]/Unicode,2,9),
              CC "CC_74" [116]/Unicode]))
resolveGraphemeMode:
Group(+i:(Seq[CC "CC_61" [97]/Unicode,Rep((Seq[Rep(CC "CC_62_63" [98-99]/Unicode,0,Unbounded),Rep(CC "CC_64_66" [100-102]/Unicode,0,Unbounded)]),0,Unbounded),Rep(CC "CC_61_7a" [97-122]/Unicode,2,9),CC "CC_74" [116]/Unicode]))
resolveUnicodeProperties:
Group(+i:(Seq[CC "CC_61" [97]/Unicode,Rep((Seq[Rep(CC "CC_62_63" [98-99]/Unicode,0,Unbounded),Rep(CC "CC_64_66" [100-102]/Unicode,0,Unbounded)]),0,Unbounded),Rep(CC "CC_61_7a" [97-122]/Unicode,2,9),CC "CC_74" [116]/Unicode]))
resolveCaseInsensitiveMode:
(Seq[CC "CC_41,61" [65][97]/Unicode,
     Rep((Seq[Rep(CC "CC_42_43,62_63" [66-67][98-99]/Unicode,0,Unbounded),
              Rep(CC "CC_44_46,64_66" [68-70][100-102]/Unicode,0,Unbounded)]),0,Unbounded),
     Rep(CC "CC_41_5a,61_7a,17f,212a" [65-90][97-122][383][8490]/Unicode,2,9),
     CC "CC_54,74" [84][116]/Unicode])
excludeUnicodeLineBreak:
(Seq[CC "CC_41,61" [65][97]/Unicode,Rep((Seq[Rep(CC "CC_42_43,62_63" [66-67][98-99]/Unicode,0,Unbounded),Rep(CC "CC_44_46,64_66" [68-70][100-102]/Unicode,0,Unbounded)]),0,Unbounded),Rep(CC "CC_41_5a,61_7a,17f,212a" [65-90][97-122][383][8490]/Unicode,2,9),CC "CC_54,74" [84][116]/Unicode])
RemoveNullablePrefix:
(Seq[CC "CC_41,61" [65][97]/Unicode,Rep((Seq[Rep(CC "CC_42_43,62_63" [66-67][98-99]/Unicode,0,Unbounded),Rep(CC "CC_44_46,64_66" [68-70][100-102]/Unicode,0,Unbounded)]),0,Unbounded),Rep(CC "CC_41_5a,61_7a,17f,212a" [65-90][97-122][383][8490]/Unicode,2,9),CC "CC_54,74" [84][116]/Unicode])
RemoveNullableSuffix:
(Seq[CC "CC_41,61" [65][97]/Unicode,Rep((Seq[Rep(CC "CC_42_43,62_63" [66-67][98-99]/Unicode,0,Unbounded),Rep(CC "CC_44_46,64_66" [68-70][100-102]/Unicode,0,Unbounded)]),0,Unbounded),Rep(CC "CC_41_5a,61_7a,17f,212a" [65-90][97-122][383][8490]/Unicode,2,9),CC "CC_54,74" [84][116]/Unicode])
Star_Normal_Form:
(Seq[CC "CC_41,61" [65][97]/Unicode,
     Rep(CC "CC_42_46,62_66" [66-70][98-102]/Unicode,0,Unbounded),
     Rep(CC "CC_41_5a,61_7a,17f,212a" [65-90][97-122][383][8490]/Unicode,2,9),
     CC "CC_54,74" [84][116]/Unicode])
resolveNames:
(Seq[CC "CC_41,61" [65][97]/Unicode,Rep(CC "CC_42_46,62_66" [66-70][98-102]/Unicode,0,Unbounded),Rep(CC "CC_41_5a,61_7a,17f,212a" [65-90][97-122][383][8490]/Unicode,2,9),CC "CC_54,74" [84][116]/Unicode])
Simplifier:
(Seq[CC "CC_41,61" [65][97]/Unicode,Rep(CC "CC_42_46,62_66" [66-70][98-102]/Unicode,0,Unbounded),Rep(CC "CC_41_5a,61_7a,17f,212a" [65-90][97-122][383][8490]/Unicode,2,9),CC "CC_54,74" [84][116]/Unicode])

Pablo

Initial Pablo AST:
basis[0] = Extract basis, 0
basis[1] = Extract basis, 1
basis[2] = Extract basis, 2
basis[3] = Extract basis, 3
basis[4] = Extract basis, 4
basis[5] = Extract basis, 5
basis[6] = Extract basis, 6
basis[7] = Extract basis, 7
not = (~basis[7])
not_1 = (~basis[5])
not_2 = (~basis[3])
not_3 = (~basis[2])
not_4 = (~basis[1])
not_5 = (~basis[0])
and = (basis[6] & not)
and_1 = (basis[4] & not_1)
or = (basis[2] | basis[3])
not_6 = (~or)
or_1 = (basis[0] | basis[1])
not_7 = (~or_1)
and_2 = (and & and_1)
or_2 = (or | or_1)
not_8 = (~or_2)
LF = (and_2 & not_8)
...

LLVM IR

; ModuleID = 'FD_source16@8_O1_SSE2:B'
target triple = "x86_64-apple-darwin14.5.0"

%"FD_source16@8_O1" = type { i64, i32, { [1 x [8 x <2 x i64>]]*, i64, i64 }*, i8*, i64, { i64, i64** }*, i64, i64, i64, i64 }

; Function Attrs: nounwind
define void @"FD_source16@8_O1_Init"(%"FD_source16@8_O1"* %self, i32 %fileDescriptor, { [1 x [8 x <2 x i64>]]*, i64, i64 }* %sourceBuffer_bufferPtr, { i64, i64** }* %sourceBufferConsumerLocks) #0 {
entry:
  store %"FD_source16@8_O1" zeroinitializer, %"FD_source16@8_O1"* %self, align 8
  %0 = getelementptr %"FD_source16@8_O1", %"FD_source16@8_O1"* %self, i64 0, i32 1
  store i32 %fileDescriptor, i32* %0, align 4
  %1 = getelementptr %"FD_source16@8_O1", %"FD_source16@8_O1"* %self, i64 0, i32 2
  store { [1 x [8 x <2 x i64>]]*, i64, i64 }* %sourceBuffer_bufferPtr, { [1 x [8 x <2 x i64>]]*, i64, i64 }** %1, align 8
  %2 = getelementptr %"FD_source16@8_O1", %"FD_source16@8_O1"* %self, i64 0, i32 5
  store { i64, i64** }* %sourceBufferConsumerLocks, { i64, i64** }** %2, align 8
  %3 = icmp eq i32 %fileDescriptor, 0
  br i1 %3, label %initializeRead, label %initializeMMap

...

Intel/AMD x86_64 ASM

FD_source16@8_O1_Init:                  ## @"FD_source16@8_O1_Init"
## BB#0:                                ## %entry
	pushq	%rbp
	movq	%rsp, %rbp
	pushq	%r15
	pushq	%r14
	pushq	%rbx
	pushq	%rax
	movl	%esi, %r15d
	movq	%rdi, %rbx
	movq	$0, 72(%rbx)
	movq	$0, 64(%rbx)
	movq	$0, 56(%rbx)
	movq	$0, 48(%rbx)
	movq	$0, 40(%rbx)
	movq	$0, 32(%rbx)
	movq	$0, 24(%rbx)
	movq	$0, 16(%rbx)
	movl	$0, 8(%rbx)
	movq	$0, (%rbx)
	movl	%r15d, 8(%rbx)
	movq	%rdx, 16(%rbx)
	movq	%rcx, 40(%rbx)
	testl	%r15d, %r15d
	je	LBB0_1
## BB#3:                                ## %initializeMMap
	movabsq	$file_size, %rax
	movl	%r15d, %edi
	callq	*%rax
	movq	%rax, %r14
	movabsq	$mmap, %rax
	testq	%r14, %r14
	je	LBB0_4
## BB#5:                                ## %NonEmptyFile
	xorl	%edi, %edi
	movl	$1, %edx
	movl	$2, %ecx
	xorl	%r9d, %r9d
	movq	%r14, %rsi
	movl	%r15d, %r8d
	callq	*%rax
	movq	%r14, %r15
...

The icgrep example shows the different types of phases common to many symbolic computing applications:

  • parsing phases which transform a string representation into an internal AST (abstract syntax tree) for some symbolic domain
  • printing phases which print out ASTs to a string form
  • source-to-source transformation phases which perform a transformation mapping a symbolic expression into another symbolic expression in the same domain
  • translation phases which compile or otherwise convert a symbolic expression from one symbolic domain to another.

Performance Examples

  • Example: IP address regex
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?):(\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3}

Data source: 620 MB Wikibooks document set (15 languages)

Performance Results

ProgramProcessorSIMDInstructionsTime
i7-3770 @ 3.4 GHzSSE2232,079 M21.3 s
grep -Ei3-5010U @ 2.1 GHzAVX2232,423 M39.5 s
W-2102 @ 2.9 GHzAVX-512232,081 M25.6 s
i7-3770 @ 3.4 GHzSSE23,720 M0.49 s
icgrepi3-5010U @ 2.1 GHzAVX22,193 M0.49 s
W-2102 @ 2.9 GHzAVX-5121,349 M0.32 s
W-2102 (2 cores)AVX-5121,388 M0.20 s
  • Example: long-line search (500+ characters)
    • Regex: .{500}
ProgramProcessorSIMDInstructionsTime
grep -Ei3-5010U @ 2.1 GHzAVX25,265,058 M884.7 s
icgrepi3-5010U @ 2.1 GHzAVX28,472 M1.31 s
  • Example: Unicode category: Greek
    • Regex \p{Greek}
ProgramProcessorSIMDInstructionsTime
grep -Pi3-5010U @ 2.1 GHzAVX2241,655 M43.4 s
icgrepi3-5010U @ 2.1 GHzAVX23,960 M0.52 s
Updated Thu Oct. 11 2018, 08:57 by cameron.