Symbolic Data Processing Case Study: icgrep
Case Study: Extensive Symbolic Computing in icgrep
As an extended example of what you can do with symbolic computing techniques, consider icgrep, an SFU-developed command-line tool for regular expression search.
- currently written in C++, but many components prototyped in Haskell
- designed as a modern high-performance replacement for the traditional grep program
- supports:
- traditional Posix regular expressions (grep),
- Perl-compatible regular expressions (PCRE),
- full Unicode regular expressions of UTS #18: Unicode Regular Expressions
 
- implemented using parallel techniques
- bitwise data parallel regular expression matching
- SIMD parallelism
- multithread parallelism
 
- designed for outstanding performance with complicated regular expressions and big data applications
The Symbolic Data Domains
- RE: regular expressions
- Pablo: tiny programming language for unbounded bit streams
- LLVM IR: portable intermediate representation/high-level assembly language
- ASM: assembler language of the target machine.
When using icgrep, you can always display the symbolic output of the various phases using the command flags
- -ShowREs
- -ShowPablo
- -ShowIR
- -ShowASM
The Phases of icgrep
There are many symbolic computing phases in icgrep.
- RE phases
- parse REs into internal AST (abstract syntax tree form)
- perform RE to RE transformations, for example
- Case-insensitive transformation
- Star Normal Form
 
- compile REs to Pablo using the Parabix regular expression algorithm
 
- Pablo phases
- perform Pablo to Pablo transformations, for example
- common subexpression elimination
- dead-code elimination
- sinking
 
- compile Pablo into LLVM IR
 
- perform Pablo to Pablo transformations, for example
- LLVM IR passes
- IR to IR transformations
- compile LLVM IR to target machine architecture
 
- ASM phases
- ASM to ASM transformations, such as peephole optimization
- generate final binary machine code
 
Example
./icgrep '(?i)a(?:[bc]*[d-f]*)*[a-z]{2,9}t' -ShowAllREs -ShowPablo -ShowIR -ShowASM
RE Passes
Parser:
Group(+i:(Seq[CC "CC_61" [97]/Unicode,
              Rep((Seq[Rep(CC "CC_62_63" [98-99]/Unicode,0,Unbounded), 
                       Rep(CC "CC_64_66" [100-102]/Unicode,0,Unbounded)]),0,Unbounded),
              Rep(CC "CC_61_7a" [97-122]/Unicode,2,9),
              CC "CC_74" [116]/Unicode]))
resolveGraphemeMode:
Group(+i:(Seq[CC "CC_61" [97]/Unicode,Rep((Seq[Rep(CC "CC_62_63" [98-99]/Unicode,0,Unbounded),Rep(CC "CC_64_66" [100-102]/Unicode,0,Unbounded)]),0,Unbounded),Rep(CC "CC_61_7a" [97-122]/Unicode,2,9),CC "CC_74" [116]/Unicode]))
resolveUnicodeProperties:
Group(+i:(Seq[CC "CC_61" [97]/Unicode,Rep((Seq[Rep(CC "CC_62_63" [98-99]/Unicode,0,Unbounded),Rep(CC "CC_64_66" [100-102]/Unicode,0,Unbounded)]),0,Unbounded),Rep(CC "CC_61_7a" [97-122]/Unicode,2,9),CC "CC_74" [116]/Unicode]))
resolveCaseInsensitiveMode:
(Seq[CC "CC_41,61" [65][97]/Unicode,
     Rep((Seq[Rep(CC "CC_42_43,62_63" [66-67][98-99]/Unicode,0,Unbounded),
              Rep(CC "CC_44_46,64_66" [68-70][100-102]/Unicode,0,Unbounded)]),0,Unbounded),
     Rep(CC "CC_41_5a,61_7a,17f,212a" [65-90][97-122][383][8490]/Unicode,2,9),
     CC "CC_54,74" [84][116]/Unicode])
excludeUnicodeLineBreak:
(Seq[CC "CC_41,61" [65][97]/Unicode,Rep((Seq[Rep(CC "CC_42_43,62_63" [66-67][98-99]/Unicode,0,Unbounded),Rep(CC "CC_44_46,64_66" [68-70][100-102]/Unicode,0,Unbounded)]),0,Unbounded),Rep(CC "CC_41_5a,61_7a,17f,212a" [65-90][97-122][383][8490]/Unicode,2,9),CC "CC_54,74" [84][116]/Unicode])
RemoveNullablePrefix:
(Seq[CC "CC_41,61" [65][97]/Unicode,Rep((Seq[Rep(CC "CC_42_43,62_63" [66-67][98-99]/Unicode,0,Unbounded),Rep(CC "CC_44_46,64_66" [68-70][100-102]/Unicode,0,Unbounded)]),0,Unbounded),Rep(CC "CC_41_5a,61_7a,17f,212a" [65-90][97-122][383][8490]/Unicode,2,9),CC "CC_54,74" [84][116]/Unicode])
RemoveNullableSuffix:
(Seq[CC "CC_41,61" [65][97]/Unicode,Rep((Seq[Rep(CC "CC_42_43,62_63" [66-67][98-99]/Unicode,0,Unbounded),Rep(CC "CC_44_46,64_66" [68-70][100-102]/Unicode,0,Unbounded)]),0,Unbounded),Rep(CC "CC_41_5a,61_7a,17f,212a" [65-90][97-122][383][8490]/Unicode,2,9),CC "CC_54,74" [84][116]/Unicode])
Star_Normal_Form:
(Seq[CC "CC_41,61" [65][97]/Unicode,
     Rep(CC "CC_42_46,62_66" [66-70][98-102]/Unicode,0,Unbounded),
     Rep(CC "CC_41_5a,61_7a,17f,212a" [65-90][97-122][383][8490]/Unicode,2,9),
     CC "CC_54,74" [84][116]/Unicode])
resolveNames:
(Seq[CC "CC_41,61" [65][97]/Unicode,Rep(CC "CC_42_46,62_66" [66-70][98-102]/Unicode,0,Unbounded),Rep(CC "CC_41_5a,61_7a,17f,212a" [65-90][97-122][383][8490]/Unicode,2,9),CC "CC_54,74" [84][116]/Unicode])
Simplifier:
(Seq[CC "CC_41,61" [65][97]/Unicode,Rep(CC "CC_42_46,62_66" [66-70][98-102]/Unicode,0,Unbounded),Rep(CC "CC_41_5a,61_7a,17f,212a" [65-90][97-122][383][8490]/Unicode,2,9),CC "CC_54,74" [84][116]/Unicode])
Pablo
Initial Pablo AST: basis[0] = Extract basis, 0 basis[1] = Extract basis, 1 basis[2] = Extract basis, 2 basis[3] = Extract basis, 3 basis[4] = Extract basis, 4 basis[5] = Extract basis, 5 basis[6] = Extract basis, 6 basis[7] = Extract basis, 7 not = (~basis[7]) not_1 = (~basis[5]) not_2 = (~basis[3]) not_3 = (~basis[2]) not_4 = (~basis[1]) not_5 = (~basis[0]) and = (basis[6] & not) and_1 = (basis[4] & not_1) or = (basis[2] | basis[3]) not_6 = (~or) or_1 = (basis[0] | basis[1]) not_7 = (~or_1) and_2 = (and & and_1) or_2 = (or | or_1) not_8 = (~or_2) LF = (and_2 & not_8) ...
LLVM IR
; ModuleID = 'FD_source16@8_O1_SSE2:B'
target triple = "x86_64-apple-darwin14.5.0"
%"FD_source16@8_O1" = type { i64, i32, { [1 x [8 x <2 x i64>]]*, i64, i64 }*, i8*, i64, { i64, i64** }*, i64, i64, i64, i64 }
; Function Attrs: nounwind
define void @"FD_source16@8_O1_Init"(%"FD_source16@8_O1"* %self, i32 %fileDescriptor, { [1 x [8 x <2 x i64>]]*, i64, i64 }* %sourceBuffer_bufferPtr, { i64, i64** }* %sourceBufferConsumerLocks) #0 {
entry:
  store %"FD_source16@8_O1" zeroinitializer, %"FD_source16@8_O1"* %self, align 8
  %0 = getelementptr %"FD_source16@8_O1", %"FD_source16@8_O1"* %self, i64 0, i32 1
  store i32 %fileDescriptor, i32* %0, align 4
  %1 = getelementptr %"FD_source16@8_O1", %"FD_source16@8_O1"* %self, i64 0, i32 2
  store { [1 x [8 x <2 x i64>]]*, i64, i64 }* %sourceBuffer_bufferPtr, { [1 x [8 x <2 x i64>]]*, i64, i64 }** %1, align 8
  %2 = getelementptr %"FD_source16@8_O1", %"FD_source16@8_O1"* %self, i64 0, i32 5
  store { i64, i64** }* %sourceBufferConsumerLocks, { i64, i64** }** %2, align 8
  %3 = icmp eq i32 %fileDescriptor, 0
  br i1 %3, label %initializeRead, label %initializeMMap
...
Intel/AMD x86_64 ASM
FD_source16@8_O1_Init: ## @"FD_source16@8_O1_Init" ## BB#0: ## %entry pushq %rbp movq %rsp, %rbp pushq %r15 pushq %r14 pushq %rbx pushq %rax movl %esi, %r15d movq %rdi, %rbx movq $0, 72(%rbx) movq $0, 64(%rbx) movq $0, 56(%rbx) movq $0, 48(%rbx) movq $0, 40(%rbx) movq $0, 32(%rbx) movq $0, 24(%rbx) movq $0, 16(%rbx) movl $0, 8(%rbx) movq $0, (%rbx) movl %r15d, 8(%rbx) movq %rdx, 16(%rbx) movq %rcx, 40(%rbx) testl %r15d, %r15d je LBB0_1 ## BB#3: ## %initializeMMap movabsq $file_size, %rax movl %r15d, %edi callq *%rax movq %rax, %r14 movabsq $mmap, %rax testq %r14, %r14 je LBB0_4 ## BB#5: ## %NonEmptyFile xorl %edi, %edi movl $1, %edx movl $2, %ecx xorl %r9d, %r9d movq %r14, %rsi movl %r15d, %r8d callq *%rax movq %r14, %r15 ...
The icgrep example shows the different types of phases common to many symbolic computing applications:
- parsing phases which transform a string representation into an internal AST (abstract syntax tree) for some symbolic domain
- printing phases which print out ASTs to a string form
- source-to-source transformation phases which perform a transformation mapping a symbolic expression into another symbolic expression in the same domain
- translation phases which compile or otherwise convert a symbolic expression from one symbolic domain to another.
Performance Examples
- Example: IP address regex
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?):(\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3}
Data source: 620 MB Wikibooks document set (15 languages)
Performance Results
| Program | Processor | SIMD | Instructions | Time | 
|---|---|---|---|---|
| i7-3770 @ 3.4 GHz | SSE2 | 232,079 M | 21.3 s | |
| grep -E | i3-5010U @ 2.1 GHz | AVX2 | 232,423 M | 39.5 s | 
| W-2102 @ 2.9 GHz | AVX-512 | 232,081 M | 25.6 s | 
| i7-3770 @ 3.4 GHz | SSE2 | 3,720 M | 0.49 s | |
| icgrep | i3-5010U @ 2.1 GHz | AVX2 | 2,193 M | 0.49 s | 
| W-2102 @ 2.9 GHz | AVX-512 | 1,349 M | 0.32 s | |
| W-2102 (2 cores) | AVX-512 | 1,388 M | 0.20 s | 
- Example: long-line search (500+ characters)
- Regex: .{500}
 
- Regex: 
| Program | Processor | SIMD | Instructions | Time | 
|---|---|---|---|---|
| grep -E | i3-5010U @ 2.1 GHz | AVX2 | 5,265,058 M | 884.7 s | 
| icgrep | i3-5010U @ 2.1 GHz | AVX2 | 8,472 M | 1.31 s | 
- Example: Unicode category: Greek
- Regex \p{Greek}
 
- Regex 
| Program | Processor | SIMD | Instructions | Time | 
|---|---|---|---|---|
| grep -P | i3-5010U @ 2.1 GHz | AVX2 | 241,655 M | 43.4 s | 
| icgrep | i3-5010U @ 2.1 GHz | AVX2 | 3,960 M | 0.52 s | 
Updated Thu Oct. 11 2018, 08:57 by cameron.