Document Image Stitching for Mobile Developers

Mykola Fiantsev
12 min read · Aug 18, 2023


Image Stitching or Photo Stitching is the process of combining multiple images with overlapping fields of view to produce a segmented panorama or high-resolution image. The internet has good coverage of this topic for panoramas, but there is a lack of resources available on how to stitch document photos.

Panorama and document stitching are both techniques for merging multiple images into a single one, but there are some key differences between them. Panorama stitching typically requires spherical projection, which means the images are warped to fit onto a sphere. This is because panoramas usually capture a wide view of a scene, and spherical projection keeps the result seamless. Document stitching, on the other hand, does not require spherical projection: documents are flat objects, so there is no need to warp the images onto a sphere.

The other key difference is that you almost always know what you are stitching — usually a black-and-white document in the center of the photo. As a result, document stitching can be a simpler process that often produces better-quality results, but it has its own specifics. This article approaches these computer vision problems from the perspective of a mobile developer and provides a good flow to start with. For a computer vision domain expert, it may seem a bit basic.

As a mobile developer, with the help of document stitching, you can significantly enhance certain app experiences and provide impressive “magic” for your users — like scanning a long receipt, for example. So let’s define the task we are going to solve as “stitching multiple photos of a long receipt into a single image”.

Here Comes OpenCV

OpenCV (Open Source Computer Vision Library) is a powerful library of programming functions mainly used for real-time computer vision. It can be utilised for a wide variety of computer vision tasks and is available for various platforms. This means that you can use it to stitch images together on any platform you need — build a prototype in Jupyter Notebook, which is a popular environment for interactive computing; integrate it into iOS with its C++ implementation; or into Android, where OpenCV can be used from Kotlin. In this article, I am going to add C++ code snippets and attach a Jupyter Notebook at the end for you to experiment with.

If we are moving forward with OpenCV, I am going to introduce the first concept — cv::Mat. Mat is the basic data structure in OpenCV C++ that represents a matrix of pixels. It is used to store images, image channels, and other multidimensional data. So please keep in mind that when I refer to images in OpenCV, I am talking about Mat, not the iOS UIImage. Do not worry, though: since OpenCV 3, converting UIImage to Mat and Mat to UIImage is supported out of the box once you import opencv2/imgcodecs/ios.h.
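For reference, here is a minimal sketch of that conversion. It has to live in an Objective-C++ file (.mm), and sourceImage is a hypothetical UIImage* you already have:

#import <opencv2/imgcodecs/ios.h>

// UIImage -> cv::Mat
cv::Mat mat;
UIImageToMat(sourceImage, mat);

// ... process mat with OpenCV ...

// cv::Mat -> UIImage
UIImage *result = MatToUIImage(mat);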

Another important note is that OpenCV has a built-in stitcher with a “SCANS” mode, which you can create with:

cv::Stitcher::create(cv::Stitcher::SCANS)

It provides some opportunities for customisation, but from my point of view, it has significant disadvantages. The major issue is the lack of control — you just give it a vector<Mat> (an array of images) and it returns the stitched Mat as output. It can fail with a limited set of errors such as “Need More Images”, “Homography Fail”, and “Camera Params Adjust Failed”, but you do not really know what went wrong. Even worse, the chances of encountering “Out of Memory” are quite high. So, my point is: if you are working on a production solution, you should be in control of your code. That is why I decompose the process into steps, with an understanding of what each step does and why it is needed.
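Still, for completeness, here is a minimal sketch of what using the built-in stitcher looks like — only a few lines, but with no visibility into the intermediate steps. Here images is assumed to be a std::vector<cv::Mat> you have already filled:

#include <opencv2/stitching.hpp>

cv::Ptr<cv::Stitcher> stitcher = cv::Stitcher::create(cv::Stitcher::SCANS);
cv::Mat pano;
cv::Stitcher::Status status = stitcher->stitch(images, pano);
if (status != cv::Stitcher::OK) {
    // All the feedback you get is one of a few status codes, e.g.
    // ERR_NEED_MORE_IMGS, ERR_HOMOGRAPHY_EST_FAIL or ERR_CAMERA_PARAMS_ADJUST_FAIL.
}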

Stitching Process Steps

  1. Image preprocessing
  2. Feature detection
  3. Match finding
  4. Homography estimation
  5. Stitching the images by applying the homography

Some of these terms may be unfamiliar. I am going to describe each step in simple language and provide more detail about what it does and why it is important.

Step 1. Image Preprocessing

The crucial step to achieve good results and avoid “Out of Memory” is image preprocessing. In this context, image preprocessing means image downscaling, but I put it in a separate step because it is not as simple as it sounds. Downscaling achieves two important things here — it reduces the image size and gets rid of possible noise in document photos. Otherwise, random pixels produced by camera noise can take part in the stitching and affect the result.

The key point here is that we should use “INTER_AREA” interpolation (area-based resampling) in:

cv::resize(image, output, new_dimensions, 0, 0, cv::INTER_AREA)

INTER_AREA is a resampling method used to resize images. It works by averaging the pixels of the source image to create the pixels of the destination image. This is important for document images, as it ensures that the text in the image will not become blurry or pixelated. It also means that the resized image contains far less noise, because noise pixels are averaged out.
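As a minimal downscaling sketch — the target width of 1000 px is my own assumption, tune it for your documents:

#include <opencv2/imgproc.hpp>

cv::Mat downscale(const cv::Mat &image, int targetWidth = 1000) {
    if (image.cols <= targetWidth) return image;               // already small enough
    double scale = static_cast<double>(targetWidth) / image.cols;
    cv::Size newSize(targetWidth, static_cast<int>(image.rows * scale));
    cv::Mat output;
    cv::resize(image, output, newSize, 0, 0, cv::INTER_AREA);  // area-based resampling
    return output;
}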

The image on the left shows the result of resizing with the INTER_AREA interpolation, while the image on the right shows the result of the INTER_LINEAR interpolation method.
The INTER_AREA method preserves more detail in the image, while INTER_LINEAR loses a lot of it.

Step 2. Feature Detection

The next step is to find “features” on images. A feature is a point in an image that has a unique appearance. Features are often used for image matching, object recognition, and other computer vision tasks.

We use the SIFT feature detection algorithm, where SIFT stands for Scale-Invariant Feature Transform. It works by first detecting interest points in an image. Interest points are points in an image that are likely to be features, such as corners or blobs. Once the interest points have been detected, they are then described using a SIFT descriptor. For any object in an image, interesting points on the object can be extracted to provide a “feature description” of the object. This description, extracted from an image, can then be used to identify the object when attempting to locate the object in a test image containing many other objects.

To perform robust recognition, it is important that the features extracted from the image be detectable even under changes in image scale, noise, and illumination. Such points usually lie on high-contrast regions of the image, such as object edges.

SIFT can robustly identify objects even among clutter and under partial occlusion, as the SIFT feature descriptor is invariant to uniform scaling, rotation, and illumination changes, and partially invariant to affine distortion. Another important characteristic of these features is that the relative positions between them in the original scene should not change from one image to another. For example, if only the four corners of a door were used as features, they would work regardless of the door’s position; but the recognition would fail if the door were opened. The interesting thing about the keypoint location search is that the algorithm starts with an area of 4×4 pixels. That is why image downscaling was an important preprocessing step: it avoids dealing with high-resolution images and positively affects quality, speed, and memory consumption.
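A minimal detection sketch, assuming OpenCV 4.4+ where SIFT ships with the main features2d module:

#include <opencv2/features2d.hpp>

cv::Ptr<cv::SIFT> sift = cv::SIFT::create();
std::vector<cv::KeyPoint> keypoints;
cv::Mat descriptors;
sift->detectAndCompute(image, cv::noArray(), keypoints, descriptors);  // no mask for now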

Looks good, but…

One of the challenges of document image stitching is that the background of an image can also contain valid features. For example, if a document is being photographed on a textured table, the table texture could also be detected as features. This can lead to problems, as the algorithm may try to stitch the images together based on the background features, rather than the document features.

One way to address this problem is to use a mask to restrict the area of the image where features are detected. A mask is a binary image that defines the region of interest in the original image. Only pixels in the region of interest are considered when detecting features. OpenCV’s SIFT detectAndCompute allows you to pass a mask to the function. This will ensure that only features in the region of interest are detected.

To find a mask for a document image, you can use the following two steps: convert the image to grayscale and apply a threshold to the grayscale image to create a binary image with:

cv::threshold(image, mask, threshold, 255, cv::THRESH_BINARY)

You can start with a threshold equal to 127. The threshold is used to create a binary image that represents the region of interest in the original image — in this case, the document itself. A threshold value of 127 means that any pixel brighter than the threshold will be set to white, and any pixel darker than it will be set to black. You can experiment with different threshold values to find what works best for you. I settled on threshold = 150.
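Putting the two steps together, here is a sketch that builds the mask and passes it to the detector from the previous snippet (150 is the threshold I settled on; adjust it for your images):

#include <opencv2/imgproc.hpp>

cv::Mat gray, mask;
cv::cvtColor(image, gray, cv::COLOR_BGR2GRAY);
cv::threshold(gray, mask, 150, 255, cv::THRESH_BINARY);

std::vector<cv::KeyPoint> keypoints;
cv::Mat descriptors;
sift->detectAndCompute(image, mask, keypoints, descriptors);  // only pixels inside the mask are considered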

The image on the left shows the result of matching the found features without a document mask, while the image on the right shows the result with the document mask applied.

Step 3. Match Finding

After SIFT features are extracted from the two images, the next step is to find matches between the features of the two images. This is done using a Brute-Force Matcher. The Brute-Force Matcher compares each feature in the first image to all of the features in the second image. The matches with the smallest descriptor distances are considered the most likely matches.

It uses the function:

brute_force_matcher.knnMatch(image1_desc, image2_desc, matches, 2) 

This function takes four arguments: the descriptors from the first image, the descriptors from the second image, a reference to the output vector of matches, and the number of nearest neighbours to consider. The function provides a list of matches, where each match links the index of a feature in the first image to the index of a feature in the second image. The parameter “2” in the knnMatch call specifies that the function should return the two nearest neighbours for each feature. This is done because the SIFT descriptor is a 128-dimensional vector, and it is possible that two features in different images could have similar descriptors. By considering the two nearest neighbours, the Brute-Force Matcher can reduce the number of false matches. The matches with the smallest distances are considered the most likely matches and can then be used to stitch the two images together.

After finding all of the matches between the two images, the next step is to filter the matches a little bit more. This is done using a ratio test. The ratio test compares the distances to the two nearest neighbours for each match: if the distance to the first nearest neighbour is less than a certain threshold times the distance to the second nearest neighbour, the match is considered a good match. The ratio test is a simple but effective way to filter out false matches that passed knnMatch. The threshold value can be adjusted to control the number of good matches: a lower threshold keeps fewer matches, but they are more reliable, while a higher threshold keeps more matches at the cost of more false positives. I settled on threshold = 0.5.
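Here is a sketch of how the matching and the ratio test can look together; 0.5 is the ratio I settled on, and the variable names mirror the snippet above:

#include <opencv2/features2d.hpp>

cv::BFMatcher brute_force_matcher(cv::NORM_L2);  // L2 distance suits SIFT descriptors
std::vector<std::vector<cv::DMatch>> matches;
brute_force_matcher.knnMatch(image1_desc, image2_desc, matches, 2);

// Ratio test: keep a match only if it is clearly better than its runner-up.
std::vector<cv::DMatch> good_matches;
for (const auto &pair : matches) {
    if (pair.size() == 2 && pair[0].distance < 0.5f * pair[1].distance) {
        good_matches.push_back(pair[0]);
    }
}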

Now we are ready to use the good matches to stitch the two images together.

Step 4. Homography Estimation

Once the good matches have been found, the next step is to find the homography between the two images. A homography is a 3×3 matrix that maps points from one image onto another. It can be used to stitch images together, warp images, and solve other problems in computer vision.

The homography can be found using a robust estimator called RANSAC (RANdom SAmple Consensus), an algorithm for fitting a model to a set of data points that contains outliers. RANSAC works by iteratively sampling a small subset of the data points and fitting a model to that subset. Each candidate model is then checked against all of the points, and the model that agrees with the largest number of points (the inliers) is taken as the best estimate of the true homography.

RANSAC is a powerful tool for finding the homography between two images. It is robust to outliers and can be used to stitch images together even in the presence of noise and errors.

As always, the OpenCV function gives us the ability to set the threshold for RANSAC. The RANSAC threshold is a critical parameter, as it affects the accuracy of the homography estimation. A lower threshold is stricter: fewer points are counted as inliers, which reduces the chance of letting outliers into the estimation but may also discard useful matches. A higher threshold counts more points as inliers, but it also increases the chance of including outliers.

The optimal RANSAC threshold value depends on the data set and the application. A good way to find the optimal threshold value is to experiment with different values and see which value results in the best homography estimation.
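A sketch of the estimation itself; keypoints1 and keypoints2 are the SIFT keypoints from step 2, and the reprojection threshold of 5.0 px is just an assumed starting point to experiment with:

#include <opencv2/calib3d.hpp>

std::vector<cv::Point2f> points1, points2;
for (const auto &match : good_matches) {
    points1.push_back(keypoints1[match.queryIdx].pt);
    points2.push_back(keypoints2[match.trainIdx].pt);
}

// Map points of the second image onto the first; RANSAC discards outlier matches.
cv::Mat H = cv::findHomography(points2, points1, cv::RANSAC, 5.0);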

Step 5. Stitching the Images by Applying the Homography

The homography can be used to stitch the two images together by warping the second image onto the first one. The operations I implemented in C++ are: finding the frame size of the new stitched image together with a correction vector (the correction vector is used to offset the second image when it is warped onto the first), warping the second image onto the first, and copying one image onto the other.

The code for calculating the new frame size first reads the size of the secondary image. Then, it creates a matrix that contains the coordinates of the corners of the secondary image. This matrix is multiplied by the homography matrix to find the final coordinates of the corners after the transformation. The correction vector is used to offset the secondary image so that it is stitched to the first image in the correct location. It is calculated from the minimum and maximum values of the x and y coordinates of the transformed corners: if a minimum value is negative, the corresponding component of the correction vector is set to its absolute value.

The code for calculating the new frame size is a critical step in the image stitching process. The correct calculation of the new frame size ensures that the secondary image is stitched to the first image in the correct location. The correction vector is also important, as it ensures that the secondary image is not offset outside of the frame.
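A sketch of that calculation, following the description above (variable names are illustrative, not the article’s exact code; image1 and image2 are the downscaled first and second images):

#include <opencv2/core.hpp>
#include <algorithm>
#include <cmath>

// Corners of the secondary image, transformed by the homography.
std::vector<cv::Point2f> corners = {
    {0.0f, 0.0f},
    {(float)image2.cols, 0.0f},
    {(float)image2.cols, (float)image2.rows},
    {0.0f, (float)image2.rows}
};
std::vector<cv::Point2f> warped_corners;
cv::perspectiveTransform(corners, warped_corners, H);

// Extend the bounds of the first image with the warped corners.
float min_x = 0.0f, min_y = 0.0f;
float max_x = (float)image1.cols, max_y = (float)image1.rows;
for (const auto &p : warped_corners) {
    min_x = std::min(min_x, p.x);  max_x = std::max(max_x, p.x);
    min_y = std::min(min_y, p.y);  max_y = std::max(max_y, p.y);
}

// Negative minimums mean the warped image sticks out to the left/top,
// so the whole composition must be shifted by the correction vector.
cv::Point2f correction(min_x < 0 ? -min_x : 0.0f, min_y < 0 ? -min_y : 0.0f);
cv::Size frame_size((int)std::ceil(max_x + correction.x), (int)std::ceil(max_y + correction.y));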

The job is almost done — warp the secondary image and copy it onto the stitched frame. That is it…
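A sketch of the compositing step, reusing H, correction and frame_size from the previous snippet; the homography is pre-multiplied with a translation so that nothing falls outside the new frame:

#include <opencv2/imgproc.hpp>

// Translation that applies the correction vector on top of the homography.
cv::Mat shift = (cv::Mat_<double>(3, 3) << 1, 0, correction.x,
                                           0, 1, correction.y,
                                           0, 0, 1);

cv::Mat stitched;
cv::warpPerspective(image2, stitched, shift * H, frame_size);

// Paste the first image into its (shifted) place on top of the warped one.
image1.copyTo(stitched(cv::Rect((int)correction.x, (int)correction.y, image1.cols, image1.rows)));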

This result was created by stitching together four separate images. The previous result was used as an input for the next one.

One more thing: let’s address the black areas that appear when the second image is overlaid on the previous one. There are a few techniques for handling this, the most common of which is known as alpha blending. However, I prefer a simpler approach: copying pixel information from the background image wherever the stitched image has a black pixel.

The technique of alpha blending is a more sophisticated way of overlaying two images. It allows for the blending of the two images so that the edges of the stitched image are more seamless. However, alpha blending can be more computationally expensive than the simpler approach of copying pixel information from the background image.

To implement the “copy pixel” approach, we need to do the following. After warping the second image onto the first using the homography matrix, we can identify the black pixels by checking whether their colour values are (0, 0, 0). Once we find a black pixel, we simply copy the corresponding pixel from the background image onto the stitched image to fill in the black area. It is important to note that while this approach is simpler and computationally cheaper, it may not provide the same level of visual quality as alpha blending. Depending on the application and the level of seamlessness required in the final image, you may choose between the two techniques.
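A sketch of that loop, assuming 8-bit 3-channel Mats and that background is the previous result already placed in the same coordinate frame as stitched:

for (int y = 0; y < stitched.rows; ++y) {
    for (int x = 0; x < stitched.cols; ++x) {
        cv::Vec3b &pixel = stitched.at<cv::Vec3b>(y, x);
        if (pixel[0] == 0 && pixel[1] == 0 && pixel[2] == 0) {
            // Black pixel left by the warp: take the pixel from the background image instead.
            pixel = background.at<cv::Vec3b>(y, x);
        }
    }
}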

The final result:


Jupyter Notebook:

Used Links:
