The focus of this paper is not in fact the pattern matching algorithm; although much care is given in it's explanation; but rather it's application in calculating the period of a stream; or some distance to it being periodic. The pattern matching algorithm is used as a tool to calculate these values.
The pattern matching algorithm runs in O(log |u| log n) bits of space and O(log|u|) per-item time, where n is the length of the stream, u the length of the pattern.
It is a fairly complex algorithm, but underpinning it are Rabin-Karp fingerprints. These are essentially elements of a large finite field. They can be computed in log(n) space, and allow strings to be compared in constant time (by comparing fingerprints). The fingerprints can be concatenated split up using constant time arithmetic; allowing for fast update. I did briefly look at the paper where these were introduced, and it demonstrates a straightforward linear time pattern matching algorithm; this requires storing the pattern to update the fingerprint quickly, however was a nice simple example of using fingerprints for pattern matching.
The fingerprints are used to extend a given (a naive solution) algorithm to match a longer pattern using log n extra bits. Essentially you store a constant number of fingerprints and construct the fingerprint for the longer pattern, starting at each original match, using these stored ones. This is done carefully so that no backtracking on the stream is done; of course, backtracking would ruin the streaming model.
Now that one can extend the pattern and find where the extended pattern matching, the pattern is taken in log parts: s[1,2^k]. You use a straightforward pattern matching for the first one, and then extend it for the larger k using the fingerprint method.
I believe the algorithm I shall eventually implement will be based on this. This algorithm in particular will require strict care to ensure that all the parts of it are computed at the right time; the exposition of the algorithm is such that parts to be computed in parallel are explained at various points.
I briefly read the algorithm for period calculation, the application of the string matching algorithm and calculation of frequency moments. However, given the other papers in the current reading list these parts seemed less relevant. My next port of call, therefore, is on the paper by Porat and Porat on which this string matching algorithm is based. This paper is mentioned here as requiring some pre-processing, and as being more complicated (inferred).
No comments:
Post a Comment