We present an algorithm for regular expression parsing and submatch extraction based on tagged deterministic finite automata. The algorithm works with different disambiguation policies. We give detailed pseudocode for the algorithm, covering important practical optimizations. All transformations from a regular expression to an optimized automaton are explained on a step-by-step example. We consider both ahead-of-time and just-in-time determinization and describe variants of the algorithm suited to each setting. We provide benchmarks showing that the algorithm is very fast in practice. Our research is based on two independent implementations: an open-source lexer generator RE2C and an experimental Java library.
翻译:本文提出了一种基于标记化确定性有限自动机的正则表达式解析与子匹配提取算法。该算法适用于不同的消歧策略。我们提供了详细的算法伪代码,涵盖了重要的实际优化技术。通过逐步示例,阐释了从正则表达式到优化自动机的所有转换过程。我们同时考虑了提前确定化与即时确定化,并描述了适用于每种场景的算法变体。基准测试表明,该算法在实际应用中具有极高的执行速度。本研究基于两个独立实现:开源词法分析器生成器RE2C与实验性Java库。