Symbolic Data Processing Case Study: icgrep
Case Study: Extensive Symbolic Computing in icgrep
As an extended example of what you can do with symbolic computing techniques, consider icgrep, an SFU-developed command-line tool for regular expression search.
- currently written in C++, but many components prototyped in Haskell
- designed as a modern high-performance replacement for the traditional grep program
- supports:
- traditional Posix regular expressions (grep),
- Perl-compatible regular expressions (PCRE),
- full Unicode regular expressions of UTS #18: Unicode Regular Expressions
- implemented using parallel techniques
- bitwise data parallel regular expression matching
- SIMD parallelism
- multithread parallelism
- designed for outstanding performance with complicated regular expressions and big data applications
The Symbolic Data Domains
- RE: regular expressions
- Pablo: tiny programming language for unbounded bit streams
- LLVM IR: portable intermediate representation/high-level assembly language
- ASM: assembler language of the target machine.
When using icgrep, you can always display the symbolic output of the various phases using the command flags
- -ShowREs
- -ShowPablo
- -ShowIR
- -ShowASM
The Phases of icgrep
There are many symbolic computing phases in icgrep.
- RE phases
- parse REs into internal AST (abstract syntax tree form)
- perform RE to RE transformations, for example
- Case-insensitive transformation
- Star Normal Form
- compile REs to Pablo using the Parabix regular expression algorithm
- Pablo phases
- perform Pablo to Pablo transformations, for example
- common subexpression elimination
- dead-code elimination
- sinking
- compile Pablo into LLVM IR
- perform Pablo to Pablo transformations, for example
- LLVM IR passes
- IR to IR transformations
- compile LLVM IR to target machine architecture
- ASM phases
- ASM to ASM transformations, such as peephole optimization
- generate final binary machine code
Example
./icgrep '(?i)a(?:[bc]*[d-f]*)*[a-z]{2,9}t' -ShowAllREs -ShowPablo -ShowIR -ShowASM
RE Passes
Parser: Group(+i:(Seq[CC "CC_61" [97]/Unicode, Rep((Seq[Rep(CC "CC_62_63" [98-99]/Unicode,0,Unbounded), Rep(CC "CC_64_66" [100-102]/Unicode,0,Unbounded)]),0,Unbounded), Rep(CC "CC_61_7a" [97-122]/Unicode,2,9), CC "CC_74" [116]/Unicode])) resolveGraphemeMode: Group(+i:(Seq[CC "CC_61" [97]/Unicode,Rep((Seq[Rep(CC "CC_62_63" [98-99]/Unicode,0,Unbounded),Rep(CC "CC_64_66" [100-102]/Unicode,0,Unbounded)]),0,Unbounded),Rep(CC "CC_61_7a" [97-122]/Unicode,2,9),CC "CC_74" [116]/Unicode])) resolveUnicodeProperties: Group(+i:(Seq[CC "CC_61" [97]/Unicode,Rep((Seq[Rep(CC "CC_62_63" [98-99]/Unicode,0,Unbounded),Rep(CC "CC_64_66" [100-102]/Unicode,0,Unbounded)]),0,Unbounded),Rep(CC "CC_61_7a" [97-122]/Unicode,2,9),CC "CC_74" [116]/Unicode])) resolveCaseInsensitiveMode: (Seq[CC "CC_41,61" [65][97]/Unicode, Rep((Seq[Rep(CC "CC_42_43,62_63" [66-67][98-99]/Unicode,0,Unbounded), Rep(CC "CC_44_46,64_66" [68-70][100-102]/Unicode,0,Unbounded)]),0,Unbounded), Rep(CC "CC_41_5a,61_7a,17f,212a" [65-90][97-122][383][8490]/Unicode,2,9), CC "CC_54,74" [84][116]/Unicode]) excludeUnicodeLineBreak: (Seq[CC "CC_41,61" [65][97]/Unicode,Rep((Seq[Rep(CC "CC_42_43,62_63" [66-67][98-99]/Unicode,0,Unbounded),Rep(CC "CC_44_46,64_66" [68-70][100-102]/Unicode,0,Unbounded)]),0,Unbounded),Rep(CC "CC_41_5a,61_7a,17f,212a" [65-90][97-122][383][8490]/Unicode,2,9),CC "CC_54,74" [84][116]/Unicode]) RemoveNullablePrefix: (Seq[CC "CC_41,61" [65][97]/Unicode,Rep((Seq[Rep(CC "CC_42_43,62_63" [66-67][98-99]/Unicode,0,Unbounded),Rep(CC "CC_44_46,64_66" [68-70][100-102]/Unicode,0,Unbounded)]),0,Unbounded),Rep(CC "CC_41_5a,61_7a,17f,212a" [65-90][97-122][383][8490]/Unicode,2,9),CC "CC_54,74" [84][116]/Unicode]) RemoveNullableSuffix: (Seq[CC "CC_41,61" [65][97]/Unicode,Rep((Seq[Rep(CC "CC_42_43,62_63" [66-67][98-99]/Unicode,0,Unbounded),Rep(CC "CC_44_46,64_66" [68-70][100-102]/Unicode,0,Unbounded)]),0,Unbounded),Rep(CC "CC_41_5a,61_7a,17f,212a" [65-90][97-122][383][8490]/Unicode,2,9),CC "CC_54,74" [84][116]/Unicode]) Star_Normal_Form: (Seq[CC "CC_41,61" [65][97]/Unicode, Rep(CC "CC_42_46,62_66" [66-70][98-102]/Unicode,0,Unbounded), Rep(CC "CC_41_5a,61_7a,17f,212a" [65-90][97-122][383][8490]/Unicode,2,9), CC "CC_54,74" [84][116]/Unicode]) resolveNames: (Seq[CC "CC_41,61" [65][97]/Unicode,Rep(CC "CC_42_46,62_66" [66-70][98-102]/Unicode,0,Unbounded),Rep(CC "CC_41_5a,61_7a,17f,212a" [65-90][97-122][383][8490]/Unicode,2,9),CC "CC_54,74" [84][116]/Unicode]) Simplifier: (Seq[CC "CC_41,61" [65][97]/Unicode,Rep(CC "CC_42_46,62_66" [66-70][98-102]/Unicode,0,Unbounded),Rep(CC "CC_41_5a,61_7a,17f,212a" [65-90][97-122][383][8490]/Unicode,2,9),CC "CC_54,74" [84][116]/Unicode])
Pablo
Initial Pablo AST: basis[0] = Extract basis, 0 basis[1] = Extract basis, 1 basis[2] = Extract basis, 2 basis[3] = Extract basis, 3 basis[4] = Extract basis, 4 basis[5] = Extract basis, 5 basis[6] = Extract basis, 6 basis[7] = Extract basis, 7 not = (~basis[7]) not_1 = (~basis[5]) not_2 = (~basis[3]) not_3 = (~basis[2]) not_4 = (~basis[1]) not_5 = (~basis[0]) and = (basis[6] & not) and_1 = (basis[4] & not_1) or = (basis[2] | basis[3]) not_6 = (~or) or_1 = (basis[0] | basis[1]) not_7 = (~or_1) and_2 = (and & and_1) or_2 = (or | or_1) not_8 = (~or_2) LF = (and_2 & not_8) ...
LLVM IR
; ModuleID = 'FD_source16@8_O1_SSE2:B' target triple = "x86_64-apple-darwin14.5.0" %"FD_source16@8_O1" = type { i64, i32, { [1 x [8 x <2 x i64>]]*, i64, i64 }*, i8*, i64, { i64, i64** }*, i64, i64, i64, i64 } ; Function Attrs: nounwind define void @"FD_source16@8_O1_Init"(%"FD_source16@8_O1"* %self, i32 %fileDescriptor, { [1 x [8 x <2 x i64>]]*, i64, i64 }* %sourceBuffer_bufferPtr, { i64, i64** }* %sourceBufferConsumerLocks) #0 { entry: store %"FD_source16@8_O1" zeroinitializer, %"FD_source16@8_O1"* %self, align 8 %0 = getelementptr %"FD_source16@8_O1", %"FD_source16@8_O1"* %self, i64 0, i32 1 store i32 %fileDescriptor, i32* %0, align 4 %1 = getelementptr %"FD_source16@8_O1", %"FD_source16@8_O1"* %self, i64 0, i32 2 store { [1 x [8 x <2 x i64>]]*, i64, i64 }* %sourceBuffer_bufferPtr, { [1 x [8 x <2 x i64>]]*, i64, i64 }** %1, align 8 %2 = getelementptr %"FD_source16@8_O1", %"FD_source16@8_O1"* %self, i64 0, i32 5 store { i64, i64** }* %sourceBufferConsumerLocks, { i64, i64** }** %2, align 8 %3 = icmp eq i32 %fileDescriptor, 0 br i1 %3, label %initializeRead, label %initializeMMap ...
Intel/AMD x86_64 ASM
FD_source16@8_O1_Init: ## @"FD_source16@8_O1_Init" ## BB#0: ## %entry pushq %rbp movq %rsp, %rbp pushq %r15 pushq %r14 pushq %rbx pushq %rax movl %esi, %r15d movq %rdi, %rbx movq $0, 72(%rbx) movq $0, 64(%rbx) movq $0, 56(%rbx) movq $0, 48(%rbx) movq $0, 40(%rbx) movq $0, 32(%rbx) movq $0, 24(%rbx) movq $0, 16(%rbx) movl $0, 8(%rbx) movq $0, (%rbx) movl %r15d, 8(%rbx) movq %rdx, 16(%rbx) movq %rcx, 40(%rbx) testl %r15d, %r15d je LBB0_1 ## BB#3: ## %initializeMMap movabsq $file_size, %rax movl %r15d, %edi callq *%rax movq %rax, %r14 movabsq $mmap, %rax testq %r14, %r14 je LBB0_4 ## BB#5: ## %NonEmptyFile xorl %edi, %edi movl $1, %edx movl $2, %ecx xorl %r9d, %r9d movq %r14, %rsi movl %r15d, %r8d callq *%rax movq %r14, %r15 ...
The icgrep example shows the different types of phases common to many symbolic computing applications:
- parsing phases which transform a string representation into an internal AST (abstract syntax tree) for some symbolic domain
- printing phases which print out ASTs to a string form
- source-to-source transformation phases which perform a transformation mapping a symbolic expression into another symbolic expression in the same domain
- translation phases which compile or otherwise convert a symbolic expression from one symbolic domain to another.
Performance Examples
- Example: IP address regex
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?):(\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3}
Data source: 620 MB Wikibooks document set (15 languages)
Performance Results
Program | Processor | SIMD | Instructions | Time |
---|---|---|---|---|
i7-3770 @ 3.4 GHz | SSE2 | 232,079 M | 21.3 s | |
grep -E | i3-5010U @ 2.1 GHz | AVX2 | 232,423 M | 39.5 s |
W-2102 @ 2.9 GHz | AVX-512 | 232,081 M | 25.6 s |
i7-3770 @ 3.4 GHz | SSE2 | 3,720 M | 0.49 s | |
icgrep | i3-5010U @ 2.1 GHz | AVX2 | 2,193 M | 0.49 s |
W-2102 @ 2.9 GHz | AVX-512 | 1,349 M | 0.32 s | |
W-2102 (2 cores) | AVX-512 | 1,388 M | 0.20 s |
- Example: long-line search (500+ characters)
- Regex:
.{500}
- Regex:
Program | Processor | SIMD | Instructions | Time |
---|---|---|---|---|
grep -E | i3-5010U @ 2.1 GHz | AVX2 | 5,265,058 M | 884.7 s |
icgrep | i3-5010U @ 2.1 GHz | AVX2 | 8,472 M | 1.31 s |
- Example: Unicode category: Greek
- Regex
\p{Greek}
- Regex
Program | Processor | SIMD | Instructions | Time |
---|---|---|---|---|
grep -P | i3-5010U @ 2.1 GHz | AVX2 | 241,655 M | 43.4 s |
icgrep | i3-5010U @ 2.1 GHz | AVX2 | 3,960 M | 0.52 s |
Updated Thu Oct. 11 2018, 08:57 by cameron.