Tuesday, 16 August 2011

Sunday, 14 August 2011

Porat and Porat

The paper by Porat and Porat gives a different yet very similar algorithm for pattern matching in a stream. This one however does have a small chance of false negatives; that is it may miss an actual match.

As is seemingly usual you take 1..logm lengths of the pattern and tries to match the longer length from the previous one (i from i-1). To do this, if there is a match, the algorithm jumps by period(i-1) characters until you have no 'overlap'; that is the period(i) is longer that the remaining length of the substring of the you are on.

The details of how this 'process' above stitches together involves creating and then aborting processes 'just in case' you need the result from that process. If you later find it useless (this will be before it will finish) you just kill it and start a new one.

In comparison with the previous streaming method, which extends the pattern using fingerprints, rather than here using fingerprints generally and extend the pattern by relying on periodicity and jumps and 'checkpoints'.

The paper goes onto 1 and k mismatch pattern matching which uses the pattern matching algorithm as a black box. Once a method is implemented for general pattern matching, this could be a natural extension to the project here.

Next stop is a paper which claims solves the pattern matching in a stream problem using real time and log space. This algorithm could be replaced as the blackbox above, or for the periodicity calculating algorithm from the previous paper.

Friday, 5 August 2011

KMP

Having started reading the Porat and Porat paper for their streaming pattern matching algorithm, it was made clear that these algorithms share a parentage of sorts in the KMP algorithm. for completeness I have read the details of the algorithm from their original paper. These streaming algorithms essentially 'shift' the (sub)pattern by it's period. This pecisesly what KMP does. The streaming algorithm has extra complexity of fingerprints, but as KMP ensures one never backtracks it clear to see why the streaming algorithms bear this striking resemblance.

Monday, 1 August 2011

Periodicity in Streams - Ergun, Jowhari and Saglam

The focus of this paper is not in fact the pattern matching algorithm; although much care is given in it's explanation; but rather it's application in calculating the period of a stream; or some distance to it being periodic. The pattern matching algorithm is used as a tool to calculate these values.

The pattern matching algorithm runs in O(log |u| log n) bits of space and O(log|u|) per-item time, where n is the length of the stream, u the length of the pattern.

It is a fairly complex algorithm, but underpinning it are Rabin-Karp fingerprints. These are essentially elements of a large finite field. They can be computed in log(n) space, and allow strings to be compared in constant time (by comparing fingerprints). The fingerprints can be concatenated split up using constant time arithmetic; allowing for fast update. I did briefly look at the paper where these were introduced, and it demonstrates a straightforward linear time pattern matching algorithm; this requires storing the pattern to update the fingerprint quickly, however was a nice simple example of using fingerprints for pattern matching.

The fingerprints are used to extend a given (a naive solution) algorithm to match a longer pattern using log n extra bits. Essentially you store a constant number of fingerprints and construct the fingerprint for the longer pattern, starting at each original match, using these stored ones. This is done carefully so that no backtracking on the stream is done; of course, backtracking would ruin the streaming model.

Now that one can extend the pattern and find where the extended pattern matching, the pattern is taken in log parts: s[1,2^k]. You use a straightforward pattern matching for the first one, and then extend it for the larger k using the fingerprint method.

I believe the algorithm I shall eventually implement will be based on this. This algorithm in particular will require strict care to ensure that all the parts of it are computed at the right time; the exposition of the algorithm is such that parts to be computed in parallel are explained at various points.

I briefly read the algorithm for period calculation, the application of the string matching algorithm and calculation of frequency moments. However, given the other papers in the current reading list these parts seemed less relevant. My next port of call, therefore, is on the paper by Porat and Porat on which this string matching algorithm is based. This paper is mentioned here as requiring some pre-processing, and as being more complicated (inferred).