Blogging the Fourth Year Project

Tuesday, 16 August 2011

Sunday, 14 August 2011

The paper by Porat and Porat gives a different yet very similar algorithm for pattern matching in a stream. This one however does have a small chance of false negatives; that is it may miss an actual match.

As is seemingly usual you take 1..logm lengths of the pattern and tries to match the longer length from the previous one (i from i-1). To do this, if there is a match, the algorithm jumps by period(i-1) characters until you have no 'overlap'; that is the period(i) is longer that the remaining length of the substring of the you are on.

The details of how this 'process' above stitches together involves creating and then aborting processes 'just in case' you need the result from that process. If you later find it useless (this will be before it will finish) you just kill it and start a new one.

In comparison with the previous streaming method, which extends the pattern using fingerprints, rather than here using fingerprints generally and extend the pattern by relying on periodicity and jumps and 'checkpoints'.

The paper goes onto 1 and k mismatch pattern matching which uses the pattern matching algorithm as a black box. Once a method is implemented for general pattern matching, this could be a natural extension to the project here.

Next stop is a paper which claims solves the pattern matching in a stream problem using real time and log space. This algorithm could be replaced as the blackbox above, or for the periodicity calculating algorithm from the previous paper.

Friday, 5 August 2011

KMP

Having started reading the Porat and Porat paper for their streaming pattern matching algorithm, it was made clear that these algorithms share a parentage of sorts in the KMP algorithm. for completeness I have read the details of the algorithm from their original paper. These streaming algorithms essentially 'shift' the (sub)pattern by it's period. This pecisesly what KMP does. The streaming algorithm has extra complexity of fingerprints, but as KMP ensures one never backtracks it clear to see why the streaming algorithms bear this striking resemblance.

Monday, 1 August 2011

Periodicity in Streams - Ergun, Jowhari and Saglam

The focus of this paper is not in fact the pattern matching algorithm; although much care is given in it's explanation; but rather it's application in calculating the period of a stream; or some distance to it being periodic. The pattern matching algorithm is used as a tool to calculate these values.

The pattern matching algorithm runs in O(log |u| log n) bits of space and O(log|u|) per-item time, where n is the length of the stream, u the length of the pattern.

It is a fairly complex algorithm, but underpinning it are Rabin-Karp fingerprints. These are essentially elements of a large finite field. They can be computed in log(n) space, and allow strings to be compared in constant time (by comparing fingerprints). The fingerprints can be concatenated split up using constant time arithmetic; allowing for fast update. I did briefly look at the paper where these were introduced, and it demonstrates a straightforward linear time pattern matching algorithm; this requires storing the pattern to update the fingerprint quickly, however was a nice simple example of using fingerprints for pattern matching.

The fingerprints are used to extend a given (a naive solution) algorithm to match a longer pattern using log n extra bits. Essentially you store a constant number of fingerprints and construct the fingerprint for the longer pattern, starting at each original match, using these stored ones. This is done carefully so that no backtracking on the stream is done; of course, backtracking would ruin the streaming model.

Now that one can extend the pattern and find where the extended pattern matching, the pattern is taken in log parts: s[1,2^k]. You use a straightforward pattern matching for the first one, and then extend it for the larger k using the fingerprint method.

I believe the algorithm I shall eventually implement will be based on this. This algorithm in particular will require strict care to ensure that all the parts of it are computed at the right time; the exposition of the algorithm is such that parts to be computed in parallel are explained at various points.

I briefly read the algorithm for period calculation, the application of the string matching algorithm and calculation of frequency moments. However, given the other papers in the current reading list these parts seemed less relevant. My next port of call, therefore, is on the paper by Porat and Porat on which this string matching algorithm is based. This paper is mentioned here as requiring some pre-processing, and as being more complicated (inferred).

Wednesday, 20 July 2011

Muthukrishnan

I read (excluding the chapter on New Directions) the book by Muthukrishnan. Having briefly looked at his website it turns out I read what he describes as "Prelim". In looking at the now published book this preliminary paper seems to form the core foundation of the book, and so I don't think much has been lost, bar one or two extra examples.

This paper introduces both intuition and some formal notation and foundations on which study of data streaming can be exploited.

Particular of note is the idea of the stream describing a signal; that is a one dimensional function. The best way I have found to think of this is as the signal being some untouchable array. The stream the describes either just a 'printf' (called Time Series Model) of the array in sequential order; else updates to this array (Cash Register/Turnstile). Either way, the stream describes this array without you actually considering the array as a whole.
A good example from the paper is of a subway system, each element in the array is a count of the number of people arrived-departed from that turnstile, and the data stream is someone arriving/leaving; updating the array. Another example is each item in the data stream is a new IP packet; each element of the array (signal) is the packet.

It also talks about the essential goal as computing functions on the signal. E.g. "the number of X seen so far".

Pretty good set of notes; although I feel it is time to start getting full on into the papers on pattern matching in a stream.

Monday, 20 June 2011

Algorithm Conferences

I have had a browse of some theoretical computer science conferences. A list of the conferences was given here.

The FOCS 2011 conference is scheduled for October so no list of submitted abstracts was available today. The list of abstracts for the 2009 conferences was unavailable (broken link). From the '11 homepage, they highlight papers in many aspects of theoretical computer science from algorithms and data structures, complexity, networks, crypto, etc.

STOC 2011 ended about a week ago and consists of papers similar in topic to FOCS. There are many papers on graph problems, and on approximation algorithms for problems. A paper on data streams was in the fast estimation of moments (Kane, Nelson, Porat & Woodruff). Problems based on graphs seem to dominate the list of papers here, specifically multicut [pdf].

SODA (sponsored by SIAM) has many pure maths (combinatorics, algebra, number theory, graphs, metric spaces, etc) papers as well as computer science. Their last conference was in January 2011. The about page highlights their goal to link mathematics with other sciences. The papers again have graph problems, but also many are on approximations or solutions to mathematical problems.

ALGO 2011 is scheduled for September and is a collection of conferences. WABI seem to have a lot of genetic/biological papers (given also by their name!). IPEC (2010) had a lot on complexity. ESA didn't have a list of papers for this year, but in 2010 did a joint conference with WABI. The ESA topic list includes topics as detailed in FOCS/STOC.

Briefly, SPAA deals with parallelism in algorithms, IPCO with graphs, sets, etc; and COLT with machine learning.

In conclusion, approximation algorithms are prolific throughout all conferences, but there is also seems to be a lot of current interest in graph problems.

Friday, 17 June 2011

Streaming Algorithm Lecture Notes

I begin with a brief read of some lecture notes about Streaming Algorithms. The links to the documents are also added for completeness. I have only read the first couple of chapters of these to get an idea into thought processes and styles in this field.

Lecture Notes by Muthu Muthukrishna
http://www.cs.mcgill.ca/~denis/notes09.pdf

Talks initially about some streaming algorithms; estimating the minimum frequency item in a stream. Hash functions seem prevalent in this field so will need some work, and also some probability in the analysis of bounds. Will need to do a bit of work on these two aspects (particularly hashing) if I continue down this path.
Then talks about counting elements in a stream, again using hashing. Very good.

Lecture Notes from Dartmouth
http://www.cs.dartmouth.edu/~ac/Teach/CS85-Fall09/Notes/lecnotes.pdf

Gives a more precise model of computation for streaming algorithms.

Lecture from Stanford
http://theory.stanford.edu/~rajeev/cs361.html#read-massive

Starts with naive approach, and discusses memory issues when you stream data. The lecture notes above explain this all a bit better really, but I read it so the link is here.

Lectures (and paper) from Wisconsin
http://pages.cs.wisc.edu/~shuchi/courses/787-F07/

First of these lectures (17) just recaps briefly on model including frequency moments, and then moves to counting distinct elements (continued in Lecture 18). Median of means to take result after many trials.
Paper is the book below.

Book by S. Muthukrishnan
http://scholar.google.co.uk/scholar?hl=en&q=Data+Streams%3A+Algorithms+and+Applications&btnG=Search&as_sdt=0%2C5&as_ylo=&as_vis=1

Applications and motivations for the thoughts.