The first step is to do a 2D cross correlation on the image using a pre-defined mask. The output of this step is a series of overthreshold points, which may be part of markers. The next step is to group these points and to calculate the centroids of the markers. This gives a series of coordinate pairs of possible markers.

The 2D cross correlation consists of placing a pre-defined mask on a certain image area. Each point of the mask is then multiplied with its corresponding point on the image. All these products are added to obtain the correlation value.

Before the different ways of doing the cross correlation are discussed, it is necessary to look at the design of the mask.

**1. The Mask**

The choice of mask is very important, as it determines the effectiveness of the marker recognition algorithm. The mask is designed in such a way that, when it is correlated with the image of a marker, the cross correlation generates a peak value. The sum of all the points on the mask must be zero; this ensures that a zero correlation value is generated when the mask is correlated with a constant background. The mask must also be equal to or larger in size than the image of a marker, otherwise it may treat a large marker image as a constant background, which will not generate a high correlation value.

Below is an example of a marker image and the mask used to detect the marker:

During the cross correlation the mask is placed on the image of the marker and the products are calculated between the corresponding points on the image and the mask. The result is shown in the third table of the figure. The sum of all these products is 192.

It is not necessary to create the points for the mask manually; it is relatively easy to generate a mask from a marker image. A typical example of a marker is taken from the camera image, usually the biggest marker image. This image is then clipped to a square which contains the marker image. The mask will be the same size as the square. Filtering may now be applied to get rid of any noise; the following equation can be used to calculate the filtered intensity value, P_{filter}, for a certain pixel:

P_{filter} = (1-a)P + 0.125*a*G

Where P is the intensity of the pixel, G is the sum of the intensities of the 8 surrounding pixels and a is the filter ratio. If a=0, no filtering is applied, while if a=1, the pixel's own intensity is filtered out totally.

Due to non-uniformities in lighting, markers may not appear round in the camera images. For this reason it is a good idea to make the marker image symmetrical around both axes. This is done by taking a pixel and its mirror image, calculating the average of the two intensities and writing this value back to both the pixel and its mirror image pixel.

The mask will normally contain fewer intensity levels than the image, especially if integer maths is used for the correlation functions. It is therefore necessary to quantise the marker image pixels to fewer intensity levels. Section 4.6 explains the choice of intensity levels further.

Next the average value of the intensities of all the pixels in the marker image is calculated. This value is then subtracted from each pixel's intensity value to make the average intensity of the image zero. The generation of the mask is now complete and the mask may be saved.
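As a rough sketch of this mask-generation recipe (the function name and parameters are illustrative, not from the original system; numpy and a non-constant square marker image are assumed):

```python
import numpy as np

def generate_mask(marker, a=0.5, levels=5):
    """Generate a correlation mask from a square marker image.

    Follows the steps in the text: noise filter, symmetrise about both
    axes, quantise to a few intensity levels, then subtract the mean so
    the mask sums to (approximately) zero.
    """
    img = marker.astype(float)

    # Noise filter: P_filter = (1-a)*P + 0.125*a*G, where G is the sum
    # of the 8 neighbouring pixels (borders handled by zero padding).
    padded = np.pad(img, 1)
    G = sum(np.roll(np.roll(padded, dy, 0), dx, 1)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dy, dx) != (0, 0))[1:-1, 1:-1]
    img = (1 - a) * img + 0.125 * a * G

    # Symmetrise about both axes: average each pixel with its mirror.
    img = (img + img[::-1, :]) / 2
    img = (img + img[:, ::-1]) / 2

    # Quantise to a small number of intensity levels
    # (assumes the marker image is not completely constant).
    lo, hi = img.min(), img.max()
    img = np.round((img - lo) / (hi - lo) * (levels - 1))

    # Make the average intensity zero, so a constant background
    # correlates to zero.
    return img - img.mean()
```

With the practical system's 6-bit pixels, `levels` would correspond to the quantisation choice discussed above.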

**2. The 2D cross correlation**

The cross correlation may be done in two ways. The simplest is to do it in the time domain. If the mask and the search area are big, the time-domain method needs too much processing power; in such a case it is better to do the cross correlation in the frequency domain, where the more efficient FFT and IFFT algorithms can be used.

Let's assume that the size of the area in which the markers are looked for, the search area, is LxL pixels. Note that the search area in a practical setup need not be square. The size of the mask is MxM pixels. The mask must be smaller than the search area, therefore L>M. The image below shows the relationship between the image, the search area and the mask.

**2.1 Time-domain cross correlation**

The cross correlation is done on each pixel in the search area using the following equation:

R(h,k) = SUM_{i=0..M-1} SUM_{j=0..M-1} I(h+i, k+j) * W(i,j)

Where R(h,k) is the correlation value of the image point with coordinates (h,k), I is the image intensity and W is the mask.
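A minimal time-domain sketch of this per-pixel correlation (hypothetical names; numpy assumed, with a square LxL search area and MxM mask):

```python
import numpy as np

def correlate_time_domain(image, mask):
    """Time-domain 2D cross correlation.

    R[h, k] = sum over i, j of image[h+i, k+j] * mask[i, j],
    for every position where the mask fits inside the search area.
    """
    L = image.shape[0]
    M = mask.shape[0]
    R = np.zeros((L - M + 1, L - M + 1))
    for h in range(L - M + 1):
        for k in range(L - M + 1):
            R[h, k] = np.sum(image[h:h + M, k:k + M] * mask)
    return R
```

Note that a zero-sum mask gives a zero correlation value everywhere on a constant background, as required of a valid mask.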

When the correlation value for a pixel is known, the software should determine whether the point is an overthreshold point or not. All overthreshold points are saved for further processing. The simplest criterion to use is a comparison with a fixed threshold value. The main drawback of this method is that the correlation values depend on the general image brightness, especially if noise is present. A bright image will therefore generate more overthreshold points than a dim image. As the software will only handle a certain number of overthreshold points, the buffer may be filled with points which are not even on markers, which will definitely cause false marker positions. If there are too few overthreshold points, markers may be missed, and detected marker positions may be inaccurate.

A better method is to use a moving threshold value. This may be implemented in software by keeping a sorted buffer containing the N highest correlation values as overthreshold points. The minimum value of N is the number of markers, since each marker needs at least one overthreshold point. The larger N is, the more threshold points per marker can be saved, which makes the centroid calculations more accurate. If N is too small, one large, bright marker may fill up the buffer and no threshold points will be recorded for any other markers. The value of N should thus err on the large side. The only drawbacks of a big buffer are that it takes more memory and makes processing of the threshold points slower.

Each image element needs M^{2} real multiplications and M^{2} real summations. For the purpose of comparing the cross correlation methods, the assumption is made that a multiplication and a summation take the same number of processor clock cycles. The total number of calculations for the whole search area is then 2L^{2}M^{2}.

The advantages of the time-domain method are ease of implementation and flexibility. The disadvantage of this method is that it takes huge amounts of processing time, especially if the search area is big.

**2.2 Frequency-domain cross correlation**

The cross correlation can be done in the frequency domain. The advantage of this is that the processing time can be minimised by using the time efficient FFT algorithm.

A 2D FFT algorithm is first used to convert the search area to the frequency domain. This result is then multiplied by the 2D FFT of the mask. The 2D FFT of the mask is only done once, as it would not change. The product of the two FFTs is then converted back to the time-domain using a 2D IFFT algorithm. The correlation result is then scanned for overthreshold points.

A 2D FFT is done by calculating the FFT of each column, after which the FFT is done on each row of the result.

If it is assumed that the size of the search area is LxL and the size of the mask is MxM, then the size of the FFT must be NxN with N >= L+M-1 to prevent aliasing (circular wrap-around) in the correlation result. Furthermore the FFT algorithm used here requires N to be a power of two. If this is not the case, N should be rounded up to the next power of two and the image and mask should be zero padded.
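The frequency-domain procedure above can be sketched with numpy's FFT routines (names are illustrative; `numpy.fft` is assumed here rather than a hand-written radix-2 FFT):

```python
import numpy as np

def correlate_freq_domain(image, mask):
    """Cross correlation via the FFT.

    Both arrays are zero padded to the next power of two that is at
    least L + M - 1, so the circular correlation equals the linear one
    in the valid region.  The mask FFT could be precomputed once.
    """
    L, M = image.shape[0], mask.shape[0]
    N = 1
    while N < L + M - 1:
        N *= 2
    F_img = np.fft.fft2(image, (N, N))
    F_msk = np.fft.fft2(mask, (N, N))
    # Correlation theorem: R = IFFT( FFT(image) * conj(FFT(mask)) ).
    R = np.fft.ifft2(F_img * np.conj(F_msk)).real
    # Keep only positions where the mask fits fully inside the area.
    return R[:L - M + 1, :L - M + 1]
```

Because `numpy.fft` handles arbitrary sizes, the power-of-two rounding matters here only for the operation-count argument, not for correctness.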

A 1D FFT algorithm needs 5N log_{2} N real calculations. A 2D FFT or IFFT consists of 2N 1D FFTs, which gives a total of 10N^{2} log_{2} N real calculations. The multiplication stage needs another 6N^{2} calculations. The total number of calculations is therefore N^{2}(20 log_{2} N + 6). At this point the result has not yet been scanned for overthreshold points, which will also take some time.

The advantage of the frequency-domain method is that the number of calculations stays roughly constant for different search area and mask sizes, until the next power of two for N is exceeded. The drawback of this method is that it uses considerably more memory than the time-domain method. Furthermore it is not as flexible as the time-domain method.

**2.3 Comparison of the above methods**

If the search area is 692x692 pixels and the mask size is 16x16, then the time-domain correlation method needs 245 million calculations, while the frequency-domain method needs 216 million. If the mask size is reduced to 15x15, the two methods are roughly equal at 216 million calculations.

The reason for this is that the frequency-domain method's number of calculations stays constant for mask sizes from 1x1 (L+M=693) to 332x332 (L+M=1024): for this whole range of mask sizes, the smallest power of two larger than or equal to L+M is N=1024. The time-domain method's number of calculations, on the other hand, is highly dependent on the size of the mask. The graph below compares the two methods for different search area sizes with a mask size of 16.

The performance advantage of the frequency-domain method becomes more prominent for large search areas, while the time-domain method gives better performance on small search areas. This shows that the frequency domain is not always the fastest solution: by making the search area as small as possible, the time-domain method can be much faster than the frequency-domain method. A factor which has not been included in this comparison is the search for the overthreshold points in the frequency-domain method.
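The operation counts used in this comparison can be reproduced directly from the two formulas above (a small sketch; the function names are illustrative):

```python
import math

def time_domain_ops(L, M):
    # M^2 multiplications + M^2 additions per pixel, over L^2 pixels.
    return 2 * L**2 * M**2

def freq_domain_ops(L, M):
    # N is the smallest power of two that can hold L + M samples.
    N = 2 ** math.ceil(math.log2(L + M))
    # Two 2D transforms at 10*N^2*log2(N) each (the mask FFT is
    # precomputed), plus 6*N^2 for the multiplication stage.
    return N**2 * (20 * math.log2(N) + 6)
```

For L=692 and M=16 these give approximately 245 million and 216 million calculations, matching the figures quoted above.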

**3. Optimizing of the cross-correlation**

There are three factors which determine the speed of the marker recognition algorithm. The first is the amount of image data which needs to be downloaded or scanned. The second is the cross-correlation process itself. The third is the handling of the overthreshold points.

**3.1 The search of markers in limited areas**

It is not necessary to search the whole image area for the markers. This is true especially in marker tracking, where markers will not move much between frames. The software can thus search for markers only around their previous locations. The slower the markers move, or the higher the camera framerate, the smaller the search areas need to be.

It may happen that a marker moves too fast and leaves the search area; in this case the software needs to rescan the whole image. It is therefore better to make the search areas too big rather than too small.

**3.2 The use of a pre-scan routine**

Another method to reduce the search area is to use a pre-scan routine which does a fast search of the image to find locations on which cross-correlations can be done.

As markers are brighter than the background, the simplest method is to compare intensity values between adjacent pixels. If the intensity difference is more than a certain threshold value, a cross-correlation is performed on that point. Correlations are therefore done on a point basis, no longer on an area basis. The threshold value should be selected to ensure that cross-correlations are not done unnecessarily on the background, but not so high that markers are missed. The cross correlation must be done in the time domain in this case, as it is faster than the frequency domain for small areas.

A requirement of the pre-scan algorithm is that it should use much less processing time than the cross correlation algorithm, otherwise it is not worthwhile using it.
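A minimal pre-scan sketch under these assumptions (pure Python, row-wise adjacent-pixel differences; a practical routine might also scan column-wise):

```python
def pre_scan(image, threshold):
    """Fast pre-scan: flag pixels whose intensity jumps by more than
    `threshold` relative to the previous pixel on the same row.

    Returns a list of (row, col) candidate points on which a
    time-domain cross-correlation can then be run.
    """
    candidates = []
    for y, row in enumerate(image):
        for x in range(1, len(row)):
            if row[x] - row[x - 1] > threshold:
                candidates.append((y, x))
    return candidates
```

The per-pixel cost is one subtraction and one comparison, far below the 2M^{2} operations of a correlation at that point, which is exactly the requirement stated above.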

**3.3 The use of a pre-sample routine**

The pre-scan routine optimises the cross-correlation process. The cross-correlation is, however, not the only process which takes time; downloading data from the framegrabber also takes time. Limiting the search area alone cannot limit the amount of image data downloaded, as the search area depends on marker movement, etc.

The purpose of the pre-sample routine is to look only at every N-th pixel. It uses basically the same algorithm as the pre-scan routine, except that its output is areas containing possible markers.

The advantage of the pre-sample routine is that the search area can now be made much larger without losing performance. Even if a Bt848 based framegrabber is used, which automatically downloads the image into PC memory, the pre-sample routine still gives a performance increase, as not all the image pixels need to be processed.
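A simplified pre-sample sketch (illustrative only; here a plain intensity threshold on every `step`-th pixel stands in for the difference test described above, and each hit yields a small box to hand on to the later stages):

```python
def pre_sample(image, step, threshold):
    """Inspect only every `step`-th pixel in both directions; any
    sampled pixel above `threshold` yields a small search area
    (top, left, bottom, right) around it for the pre-scan and
    correlation stages."""
    areas = []
    for y in range(0, len(image), step):
        for x in range(0, len(image[0]), step):
            if image[y][x] > threshold:
                areas.append((max(0, y - step), max(0, x - step),
                              y + step, x + step))
    return areas
```

`step` must be small enough that no marker can fall entirely between two sampled pixels, i.e. it should not exceed the marker diameter in pixels.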

The first few steps of the marker recognition algorithm are as follows:

1. Grab the image from the camera. Output: image in memory.

2. Search only locations where markers might be. Output: large search areas in image data.

3. Pre-sample routine. Output: small search areas in image data.

4. Pre-scan routine. Output: single points in image data.

5. Cross-correlation. Output: overthreshold points.

**4. Handling of the overthreshold points**

When the cross-correlation of a point has been calculated, it is necessary to see if that point is an overthreshold point. As already explained the software saves the N highest value overthreshold points. These points are stored in a sorted list. The list is sorted from the highest to the lowest correlation values. When a new correlation value is known, it is added to the list in the right place. If the list is already full, the lowest correlation value will be removed from the list.

Below is a standard sorted list with a capacity of 5 elements. The values 9, 10, 6 and 5 have already been added to the list in the first example.

In the second example the value 4 is added. As there are no elements to its right, it is not necessary to shift any elements to make space for the new value. In the last example the value 7 is added to the list. The elements 6, 5 and 4 are first shifted to the right to make way for the new value. The smallest value, 4, is discarded.

This implementation of the list has several major disadvantages. Each time a new value is added to the list, it is first necessary to find its place by scanning the values already in the list. This process takes time. When the new value's place has been found, it may be necessary to shift all lower values in the list one position to make way for the new value, which may take even longer than the scanning process. This would not be a problem for small lists, but is a major problem for lists of the size needed by the marker recognition algorithm. The insertion point search can be made faster by using a binary search function, but data still needs to be shifted to insert new values.

A better list implementation is to use a binary tree structure. Each element in the list points to two other elements. The left pointer points to an element with a smaller value than the current one, while the right pointer points to an element with a higher value. The insertion point search is much easier: the software starts at the root element; if the new value is smaller, it goes left, else it goes right. The same test is applied at the next element, and the process is repeated until there are no more elements on the branch. The new value is then added to the end of that branch. If the list is full, it is necessary to remove the lowest value element from the list.

Below is a binary tree structure. The elements 9, 10, 6 and 5 have already been added in the first example.

When the value 4 is added in the second example, the scan algorithm starts at 9, goes left to 6, then left to 5. As the left side of element 5 is empty, the new element is added there. In the last example the value 7 is added. The algorithm starts at 9 and goes left to 6. As the right side of element 6 does not point to anything, the new value is added there. As a maximum of 5 elements is allowed in this list example, it is necessary to remove element 4 from the list.

The advantage of the binary tree structure is that it is not necessary to move elements; only the pointers are adjusted. Even for a complex operation like removing an element from the list, only the pointers of two elements need to be altered. The binary tree structure is thus much more efficient than the standard sorted list, as the time needed to process the list is much less.

The advantages of the binary tree structure only become apparent for larger list sizes. The binary tree structure is not perfect: if the values added are always larger or always smaller than the previous ones, the tree will contain only one branch, which means that the binary tree degenerates into the sorted list of the previous example. In the practical marker recognition system the values added are unsorted, so the tree does not consist of only one branch.

With the practical system it was found that for 1000 threshold points the maximum branch length was approximately 32. The theoretical optimum branch length for 1000 elements is approximately 10. When the binary tree structure was implemented instead of the sorted list, the speed of the marker recognition algorithm doubled.
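The same "keep the N highest values" behaviour can also be obtained from a bounded min-heap, sketched here with Python's `heapq` module (this is an alternative to the hand-built binary tree described above, not the structure used in the practical system):

```python
import heapq

class TopN:
    """Keep the N highest correlation values with their coordinates.

    A min-heap keeps the current lowest of the stored values at the
    root, so replacing it when a higher value arrives costs O(log N),
    the same bound as a balanced binary tree, with no element shifting
    and no risk of a degenerate single branch.
    """
    def __init__(self, n):
        self.n = n
        self.heap = []            # (value, x, y) tuples, minimum first

    def add(self, value, x, y):
        if len(self.heap) < self.n:
            heapq.heappush(self.heap, (value, x, y))
        elif value > self.heap[0][0]:
            # Buffer full: drop the current minimum for the new value.
            heapq.heapreplace(self.heap, (value, x, y))

    def points(self):
        return sorted(self.heap, reverse=True)
```

Running the document's example (adding 9, 10, 6, 5, 4, 7 with N=5) leaves the buffer holding 5, 6, 7, 9 and 10, with the value 4 discarded, just as in the sorted-list figures.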

**5. Sorting of the overthreshold points**

After the cross correlation process is finished, the overthreshold points must be processed. Below is a representation of some overthreshold points as they would appear on the camera image.

Overthreshold points are usually grouped into islands, as can be seen above. Each island is a possible marker. The software now needs to sort and group the overthreshold points. Points are considered to belong to the same island if they touch each other. After an island of points has been isolated, centroid calculations are made on that data to determine the position of the possible marker. The centroid of the island is calculated as the 2D centre of mass of the correlation values of the points. The following equations are used:

W = SUM_{i,j} R_{ij}

x_{c} = (1/W) SUM_{i,j} x_{i} R_{ij}

y_{c} = (1/W) SUM_{i,j} y_{j} R_{ij}

Where W is the total weight of the island, (x_{c};y_{c}) are the centroid co-ordinates of the island and R_{ij} is the correlation value of the overthreshold point with co-ordinates (x_{i};y_{j}). The results of the centroid calculations of each island are saved for further processing.

The software can save time by ignoring islands which do not have enough points to make up a marker. This saves the time of calculating the centroid. The minimum number of allowable points should be a user adjustable option.
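The grouping and centroid steps can be sketched as follows (illustrative names; points are stored as a dict from (x, y) co-ordinates to correlation values, and "touching" is taken as 8-connectivity):

```python
def group_islands(points):
    """Group overthreshold points into islands of touching points.

    `points` maps (x, y) -> correlation value R.  Two points belong to
    the same island if they are adjacent (8-connectivity).  Returns a
    list of islands, each a dict of that island's points.
    """
    remaining = dict(points)
    islands = []
    while remaining:
        stack = [next(iter(remaining))]     # seed a new island
        island = {}
        while stack:                        # flood fill over neighbours
            p = stack.pop()
            if p in remaining:
                island[p] = remaining.pop(p)
                x, y = p
                stack += [(x + dx, y + dy) for dx in (-1, 0, 1)
                          for dy in (-1, 0, 1)]
        islands.append(island)
    return islands

def centroid(island):
    """Weighted centroid: W = sum(R), xc = sum(x*R)/W, yc = sum(y*R)/W."""
    W = sum(island.values())
    xc = sum(x * R for (x, y), R in island.items()) / W
    yc = sum(y * R for (x, y), R in island.items()) / W
    return W, xc, yc
```

Islands smaller than the user-adjustable minimum point count can simply be dropped before `centroid` is called.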

The islands are sorted according to weight. An island with a heavy weight (high W value) has a bigger chance of being a marker than an island with a light weight.

It may happen that the overthreshold points of one marker lie close to each other, but do not touch. In this case the software will generate more than one island and may interpret one marker as several markers. To overcome this problem, the software scans the islands to see which ones lie too close to each other. The minimum allowable distance between markers should be user adjustable. One of two things can be done if islands lie too close to each other: one option is to combine the islands and to calculate a new combined weight and centroid; the second option is to keep the island with the highest weight and to ignore the rest.

The second option is the better one, as a normal marker will not easily be detected as more than one marker. What may happen is that a reflection or another object in the vicinity of the marker is recognised as an island. With the first option the resulting marker co-ordinates would be wrong, while with the second option the false island is thrown away, as its weight is less than the true marker's weight.

After all the markers have been recognised and their coordinates are known, the markers should be classified. Classification depends on the application and is therefore not discussed here.

**6. Practical implementation**

The cross correlation is done using integer maths, as floating-point maths takes more processing power (although this might not be the case with modern processors). Integer maths puts constraints on the allowable correlation results. In the practical system the image data was 6 bits wide. As the mask is derived from the image, it can be assumed that 6 bits is a good bit resolution for the mask. If no resolution is to be lost during multiplication, the results should be 12 bits wide.

One problem with integer maths, which is not a problem with floating-point maths, is that the MSB bits are lost if results are too big, which causes wrap-around of the resulting value. This may cause big, bright markers to have very small weights, so that the software does not consider them markers at all. For this reason calculations should be done using a larger data size, after which the result is reduced by clipping the value.

In the practical system 16-bit values were used for storage, while 32-bit values were used during the calculations. If a correlation value was above 32767, it was set to 32767 so that it could be correctly represented by a 16-bit value. This method works if an occasional correlation point is too high, but if too many points on a marker are too high, the centroid calculations will be incorrect.
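The difference between wrap-around and clipping can be illustrated in a few lines (pure Python; the helper names are made up for the example):

```python
def to_int16_wrap(v):
    """C-style cast to 16 bits: the high-order bits are simply lost,
    so a large positive sum wraps around to a negative value."""
    v &= 0xFFFF
    return v - 0x10000 if v >= 0x8000 else v

def to_int16_clip(v):
    """Accumulate in a wider type, then clip into the 16-bit range,
    as recommended in the text."""
    return max(-32768, min(32767, v))
```

A correlation sum of 40000, for example, wraps to -25536 with a plain cast but saturates to 32767 when clipped, preserving the point's high weight.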

If 16-bit integers are used, the correlation values must lie between -32768 and 32767. The maximum correlation value is achieved if both the mask's and the image's pixel intensities are at a maximum. The following equation can then be used to calculate the maximum correlation value:

R_{max} = M^{2} * M_{max} * P_{max}

with R_{max} the maximum correlation value (=32767)

M_{max} the maximum mask point value (=2, typical)

P_{max} the maximum pixel value (=63)

M the largest mask size (=16, typical)

The values between brackets are those used in the practical system. M depends on the size of the marker image used and cannot be changed, but M_{max} can be used to limit the correlation result.

Note that the above is the worst-case scenario and will never happen if a valid mask is used. As the mask's pixels must have an average intensity of zero, not all the pixels can have an intensity of M_{max}. If this worst-case scenario is used to determine the mask's bit-depth, overflow will never occur. There is one drawback to using this scenario: the mask may not have enough quantisation levels for markers to be recognised accurately.

In practice the maximum correlation for a certain mask occurs when that mask is placed on the marker image from which it was designed. It is therefore best to determine this value and then scale the mask pixel intensities until it is no longer too high. There may still be exceptions where overflow occurs, but this does not matter much, because only one or two points will overflow.

**7. Practical results**

The results were obtained using a time-domain cross correlation algorithm on a search area of 512x512 pixels with a mask size of 16x16 pixels. An image containing 9 markers was used. The algorithm was run on a Pentium 166MHz MMX based PC. Using the standard integer maths instructions, the algorithm took around 6.92 seconds. Using the MMX instruction set, the algorithm took only 2 seconds. If the pre-scan routine is used, the algorithm time is reduced by a factor of 10. If only one marker is used, the time may be reduced by a factor of 9. Theoretically marker recognition for one marker would take only 2.2 ms, which gives a framerate of 450 frames per second. A modern PC is therefore quite adequate for tracking markers. Note that these calculations do not take the downloading of image data from frame buffers or any other overhead into account.