Experimental demonstration of extended depth-of-field f/1.2 visible High Definition camera with jointly optimized phase mask and real-time digital processing

Increasing the depth of field (DOF) of compact high-resolution visible cameras while maintaining high imaging performance over the DOF range is crucial for applications such as night vision goggles or industrial inspection. In this paper, we present the end-to-end design and experimental validation of an extended depth-of-field visible High Definition camera with a very small f-number, combining a six-ring pyramidal phase mask in the aperture stop of the lens with digital deconvolution. The phase mask and the deconvolution algorithm are jointly optimized during the design step so as to maximize the quality of the deconvolved image over the DOF range. The deconvolution processing is implemented in real time on a Field-Programmable Gate Array and we show that it requires very low power consumption. By means of MTF measurements and imaging experiments, we experimentally characterize the performance of the cameras with and without the phase mask and thereby demonstrate a significant increase in depth of field by a factor of 2.5, as expected from the design step. [DOI: http://dx.doi.org/10.2971/jeos.2015.15046]


INTRODUCTION
Most imaging systems now include digital post-processing to enhance image quality and correct defects of the optical system. In classical design approaches, the lens is first optimized to yield the best possible optical quality, then post-processing algorithms are designed to correct residual aberrations. However, it has been shown that jointly optimizing the lens and the post-processing algorithm can yield imaging systems with better global performance, enhanced capabilities and/or lower complexity [1]-[4].
This joint optimization approach is particularly relevant for designing hybrid imaging systems based on wavefront coding. This technique consists in placing a phase mask in the aperture stop of the lens and applying deconvolution to the acquired image in order to recover contrast. Dowski and Cathey introduced wavefront coding to increase depth of field by designing phase masks that make the Modulation Transfer Function (MTF) of the lens insensitive to defocus [5]. Various types of masks have since been investigated and their through-focus MTFs optimized in order to make deconvolution easier [6]-[9]. For that purpose, different optimization criteria have been proposed, such as reducing the intensity variations along the focal line [10], ensuring the absence of zeros in the MTF [8], or maximizing the quality of the restored image by jointly optimizing the phase mask and the deconvolution algorithm [3]. The advantages of wavefront coding have been experimentally demonstrated to simplify the design and reduce the complexity of the optical system [11, 12], to mitigate focus variations due to temperature changes in imaging systems such as mirror-based telescopes [13], to improve the depth of field of thermal cameras [14] or of microscopes [15, 16], and to increase the acquisition volume of biometric iris imaging systems [17].
In this article we use the joint optimization approach to extend, for the first time to our knowledge, the depth of field of a standard frame rate High Definition (HD) f/1.2 camera operating in the visible and near-infrared spectral range. The design step consists in determining the parameters of a binary phase mask and of a deconvolution algorithm that jointly maximize the restored image quality [3]. The optimal mask is manufactured and inserted in an end-to-end imaging chain including the sensor and a real-time deconvolution algorithm implemented on a Field-Programmable Gate Array (FPGA) board. By experimentally estimating the MTF and image quality of the as-built wavefront-coded camera and comparing it to a conventional one, we finally demonstrate an increase in depth of field by a factor of 2.5 in operational conditions.

DESIGN AND OPTIMIZATION
Let us consider a hybrid imaging system based on a phase mask in the stop plane of the lens. The phase function of the mask depends on a set of parameters denoted ϕ. In this paper we consider an annular binary phase mask composed of six concentric rings whose phase alternates between zero and π radians (see Figure 1(a)) at a reference wavelength denoted λ_0. The mask is thus defined by the N − 1 outer radii, that is, ϕ = {r_n ; n = 1, ..., N − 1} with N = 6. In such a system the image acquired by the sensor can be modelled as:

I(r) = h_ϕ,ψ(r) * O(r) + n(r),   (1)

where * denotes convolution, r denotes the image spatial coordinates, O(r) is the perfect image of the scene, n(r) is the acquisition noise and h_ϕ,ψ(r) denotes the PSF of the lens, which depends on the phase mask parameters ϕ and on the defocus value ψ defined by:

ψ = (π R² / λ_0) (1/f − 1/d_0 − 1/d_i),   (2)

where R is the aperture stop radius, λ_0 is the reference wavelength, f is the focal length, d_0 is the object distance and d_i is the image distance. We take into account the low-pass effect of pixel integration, but we do not consider aliasing effects. The acquired image is then restored by applying a linear deconvolution filter of impulse response d_ϕ(r) as follows:

Ô(r) = d_ϕ(r) * I(r).   (3)

The goal is to determine the phase mask parameter set ϕ_opt that minimizes the mean square error (MSE) between the scene and the restored image over a set of K defocus values uniformly distributed in the defocus range [0; ψ_max]. Following [3], the expression of the MSE for a given value of the defocus ψ is:

MSE_ϕ(ψ) = ∫ ⟨|Õ(ν) − Ỗ(ν)|²⟩ dν = ∫ [ |1 − d̃_ϕ(ν) h̃_ϕ,ψ(ν)|² S_oo(ν) + |d̃_ϕ(ν)|² S_nn(ν) ] dν,   (4)

where ⟨·⟩ denotes statistical averaging, the superscript ~ refers to the Fourier transform, ν denotes the spatial frequency, S_oo(ν) denotes the Power Spectral Density (PSD) of the scene O and S_nn(ν) denotes the PSD of the noise n. The deconvolution filter that minimizes the MSE averaged over all defocus values is the following Wiener-like filter [3]:

d̃_ϕ(ν) = ⟨h̃*_ϕ,ψ(ν)⟩_ψ S_oo(ν) / [ ⟨|h̃_ϕ,ψ(ν)|²⟩_ψ S_oo(ν) + S_nn(ν) ],   (5)

where ⟨·⟩_ψ denotes the average over the K defocus values and the superscript * denotes complex conjugation. In practice, we compute this filter using a generic PSD model for S_oo(ν) [3, 6]. S_nn(ν) is a constant since the noise n is assumed to be white.
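The defocus parameter of Eq. (2) and the Wiener-like filter of Eq. (5) can be sketched numerically as follows. This is a minimal NumPy illustration with function names of our own choosing; in the actual design the OTF stack comes from the optical design software, not from this code.

```python
import numpy as np

def defocus_psi(R, lam0, f, d_obj, d_img):
    # Defocus parameter of Eq. (2): psi = (pi R^2 / lam0)(1/f - 1/d_obj - 1/d_img)
    return (np.pi * R**2 / lam0) * (1.0 / f - 1.0 / d_obj - 1.0 / d_img)

def average_wiener_filter(otfs, S_oo, S_nn):
    # Wiener-like filter of Eq. (5): otfs is a (K, ...) stack of OTFs
    # h~_phi,psi(nu) sampled over the K defocus values; S_oo and S_nn are
    # the scene and noise PSDs on the same frequency grid.
    num = np.mean(np.conj(otfs), axis=0) * S_oo
    den = np.mean(np.abs(otfs) ** 2, axis=0) * S_oo + S_nn
    return num / den
```

As a sanity check, when the image plane satisfies the lens conjugation relation 1/f = 1/d_obj + 1/d_img, Eq. (2) gives ψ = 0, and with K = 1 and no noise the filter of Eq. (5) reduces to the inverse filter 1/h̃.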
We then determine numerically the mask parameters that minimize the maximal MSE over the defocus values, that is:

ϕ_opt = argmin_ϕ { max_{ψ ∈ [0; ψ_max]} MSE_ϕ(ψ) },   (6)

where the Wiener-like filter in Eq. (5) is used to compute the MSE in Eq. (4).
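For a candidate mask, the inner part of this min-max criterion amounts to evaluating the MSE of Eq. (4) at each sampled defocus and keeping the worst value. A schematic NumPy sketch (function names are ours; the OTFs would be computed by the optical design software for each candidate set of radii):

```python
import numpy as np

def mse_at_defocus(otf, d_filt, S_oo, S_nn):
    # Discretized MSE of Eq. (4) for one defocus value:
    # sum over nu of |1 - d~ h~|^2 S_oo + |d~|^2 S_nn
    return np.sum(np.abs(1.0 - d_filt * otf) ** 2 * S_oo
                  + np.abs(d_filt) ** 2 * S_nn)

def worst_case_mse(otfs, d_filt, S_oo, S_nn):
    # Inner max of Eq. (6): maximal MSE over the K sampled defocus values
    return max(mse_at_defocus(h, d_filt, S_oo, S_nn) for h in otfs)
```

In the noiseless single-defocus case, choosing d_filt as the exact inverse of the OTF drives this criterion to zero, as expected.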
In this paper, we consider a 1296 × 972 HD camera with an f-number of 1.2, a focal length of 20 mm and a pixel pitch of 4.4 µm. The whole field of view is 16° × 12°. The sensor is sensitive to visible and near-infrared radiation. The conventional camera has been designed to resolve details up to 60 lp/mm and provides sharp images for objects from 12 m to infinity. The design of the hybrid imaging system is based on two steps: a traditional optical design of the lens followed by the joint design of the optical mask and the digital filter following Eq. (6). A model of the lens of this camera and of the phase mask are input in the CodeV® optical design software to perform accurate computation of the optical system PSFs across the object distance range. The optimization loop is managed by the Matlab® computing software.
Our goal is to increase the depth of field by a factor of 2.5 so that the system can provide sharp images over a defocus range from 4.8 m to infinity. The desired defocus range is sampled by K = 3 defocus values corresponding to 10 km, 9.6 m and 4.8 m. The input sensor-level signal-to-noise ratio, defined as:

SNR = 10 log_10 [ ∫ S_oo(ν) dν / ∫ S_nn(ν) dν ],   (7)

is set to 34 dB. Local optimization using the Nelder-Mead simplex algorithm has been performed on-axis at the reference wavelength λ_0 = 750 nm. The obtained optimal radii values r_i^nom of the phase mask are given in Table 1.
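As an illustration of this optimization step, the sketch below minimizes a toy stand-in for the worst-case MSE with SciPy's Nelder-Mead simplex method. The real cost function evaluates Eqs. (4)-(6) with CodeV-computed PSFs; the surrogate here is purely illustrative and its minimizer is chosen arbitrarily.

```python
from scipy.optimize import minimize

def worst_case_cost(radii):
    # Toy surrogate for the max-over-defocus MSE of Eq. (6); its unique
    # minimizer is radii = (0.3, 0.7), a value picked for illustration only.
    return max((radii[0] - 0.3) ** 2, (radii[1] - 0.7) ** 2)

# Derivative-free local optimization, as in the design step described above
res = minimize(worst_case_cost, x0=[0.5, 0.5], method='Nelder-Mead',
               options={'xatol': 1e-8, 'fatol': 1e-12})
```

Nelder-Mead is well suited here because the max-over-defocus criterion is not differentiable everywhere, which rules out simple gradient-based methods.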

FULL HYBRID IMAGING CHAIN INTEGRATION AND IMPLEMENTATION
The optimal phase mask has been manufactured on a ZnS plate by diamond turning. Since the refractive index of ZnS is 2.32, the mask transition height must be equal to h = 284 nm in order to yield transitions of π radians at λ_0. For ease of realization with the diamond turning process, we have manufactured a pyramidal mask in which each step height h_i^nom must be equal to h at each transition: it thus consists of a set of six concentric plateaus that lead to phase changes of 0, π radians, 2π radians, etc. (see Figure 1(b)). The diamond turning technique allows a plate roughness of 10 nm. Thanks to this technique, the rings are self-centred with respect to one another, and the position of the rings as a whole depends mainly on the plate position error of ±0.05 mm. Moreover, the manufacturing precision is ±15 nm on the radii and ±15 nm on the transition heights. The manufactured phase mask has been characterized using an optical profilometer. The transition heights h_i^meas and ring radii r_i^meas were estimated by averaging 8 acquired profiles of the phase mask. The measurement uncertainty is 10 nm on the transition heights and 0.01 mm on the ring radii. These measurement precisions are similar to the roughness of the manufactured phase plate, so the standard deviations of the radii and height measurements can be considered comparable to the measurement precisions. Results are presented in Table 1. Considering the manufacturing tolerances of the diamond turning process, the manufactured mask complies with the specifications. Moreover, we have checked that such a pyramidal shape does not introduce significant changes in the mask behavior over the spectral band relative to the nominal wavelength. The manufactured phase plate is inserted inside the camera lens so that its etched side is located in the stop plane.
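The 284 nm step height quoted above follows from the usual relation for a π phase step in transmission, h = λ_0 / (2(n − 1)):

```python
# A thickness h of material of refractive index n in air adds an optical
# path difference (n - 1) * h, so a pi phase shift (half a wave) at lam0
# requires h = lam0 / (2 * (n - 1)).
lam0 = 750e-9           # reference wavelength (m)
n_zns = 2.32            # refractive index of ZnS at lam0
h = lam0 / (2 * (n_zns - 1))
print(round(h * 1e9, 1))  # -> 284.1 (nm), matching the value in the text
```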
The implementation of the real-time HD post-processing is depicted in Figure 2. The camera sensor provides 1296 × 972 pixel images at 33.6 frames per second (FPS). The video stream is sent to a Xilinx ZYNQ ZC702 board containing a Zynq 7020 System-on-Chip (SoC), whose role is to acquire the video stream, perform the convolution and send the processed video stream to an HD display. The SoC embeds a dual-core ARM Cortex-A9 at 866 MHz coupled with an Artix-7 FPGA fabric that provides better performance and energy efficiency than the processor at the expense of programmability. The ARM processor acquires the video stream via an Ethernet link using the GigE Vision interface for high-performance industrial cameras and stores it in DDR memory, while the FPGA is used to retrieve the frames from DDR memory and to perform the computationally intensive convolution. We use a VHDL Intellectual Property block developed at Thales that convolves the input image with a deconvolution kernel whose parameters can be tuned at runtime. The kernel was obtained by computing the inverse Fourier transform of the Wiener-like filter expressed in Eq. (5) and cropping the result to an 11 × 11 pixel window. This setup performs the convolution of a 1296 × 972 pixel image by the 11 × 11 kernel at up to 119 FPS and therefore allows real-time imaging at the 33.6 FPS camera video rate. As a comparison, the same computation performed by a general-purpose processor would barely reach 1 FPS. The total FPGA power consumption has been measured to be 600 mW, including 430 mW for the convolution itself. This has to be compared with the nominal 5 W power consumption of the camera. The deconvolved image is finally displayed on an HD screen at 60 FPS.
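The kernel construction described above can be sketched as follows (a NumPy illustration of the inverse-transform-and-crop step; function and variable names are ours):

```python
import numpy as np

def crop_kernel(d_tilde, size=11):
    # Real-space deconvolution kernel: inverse FFT of the frequency-domain
    # Wiener-like filter of Eq. (5), centred with fftshift and cropped to
    # a size x size window for the FPGA convolution engine.
    d = np.real(np.fft.fftshift(np.fft.ifft2(d_tilde)))
    cy, cx = d.shape[0] // 2, d.shape[1] // 2
    half = size // 2
    return d[cy - half:cy + half + 1, cx - half:cx + half + 1]
```

Note that 119 FPS on 1296 × 972 images with an 11 × 11 kernel corresponds to roughly 1296 × 972 × 121 × 119 ≈ 1.8 × 10^10 multiply-accumulate operations per second, which is why the FPGA fabric, rather than the ARM cores, performs this step.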
The whole imaging chain is represented in Figure 3. Two identical cameras are set side by side. One of them includes the phase mask in the stop of its lens and is connected to the FPGA board for deconvolution; it will be referred to in the following as the 'hybrid camera'. The other camera includes in its stop plane a simple ZnS plate with the same average thickness as the phase mask; it will be referred to as the 'conventional camera'. Thanks to this setup, we can compare images of the same scene acquired at the same time by both cameras.

EXPERIMENTAL CHARACTERIZATION AND IMAGING DEMONSTRATION
To characterize and compare the imaging performance of both cameras, we first estimated their polychromatic MTF. Measurements were conducted indoors at the three following object distances: 4.8 m, 9.6 m and 47 m, the latter being the maximum reachable distance. The MTF measurement method consists in estimating the horizontal MTF from vertical bar targets corresponding to spatial frequencies on the sensor of 10 lp/mm, 20 lp/mm, 30 lp/mm, 40 lp/mm, 50 lp/mm and 60 lp/mm. All these targets have similar spatial extents in order to minimize phasing error in the MTF estimation. For calibration purposes, we acquired at each distance a full white target image and a full black target image. The target image has been normalized with respect to the white and black target images as follows:

t_0(r) = (t(r) − b(r)) / (w(r) − b(r)),   (8)

where t(r) denotes the bar target image at a given frequency, w(r) denotes the white image, b(r) denotes the black image, and t_0(r) denotes the calibrated bar target image. Horizontal profiles were extracted from the calibrated image and averaged to produce one single profile with reduced noise. To avoid side effects caused by its finite extent, the averaged profile has been apodized using a Hamming window function. The apodized profile is denoted {p_k}, k = 1..P, where P is the number of profile samples. The MTF at a target frequency ν_0 ≠ 0 was then estimated as:

MTF(ν_0) = |p̃(ν_0)| / |p̃(0)|,   (9)

where p̃ denotes the discrete Fourier transform of the profile. For the conventional and hybrid cameras, the focus has been set respectively at 24 m and 9.6 m in order to provide the best trade-off between depth of focus and image quality. All experiments have been carried out at full aperture (f/1.2). The measured MTFs of both cameras at the three object distances are represented in Figure 4. As expected, the conventional camera MTF (blue curves) decreases dramatically as the object distance gets smaller.
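The calibration and profile-based estimation steps described above can be sketched as follows. This is a NumPy illustration with our own function names; the DFT-ratio estimator returns a relative modulation measure, so the same estimator must be applied consistently to both cameras for the comparison to be meaningful.

```python
import numpy as np

def calibrate(t, w, b):
    # Flat-field normalization of Eq. (8): t0 = (t - b) / (w - b)
    return (t - b) / (w - b)

def estimate_mtf(profile, freq_bin):
    # Eq. (9): apodize the averaged profile with a Hamming window, then
    # take the ratio of DFT moduli at the target frequency bin and at DC.
    p = np.asarray(profile, dtype=float) * np.hamming(len(profile))
    spec = np.abs(np.fft.rfft(p))
    return spec[freq_bin] / spec[0]
```

A profile with stronger modulation at the target frequency yields a proportionally larger estimate, which is the monotonic behaviour the bar-target comparison relies on.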
The MTF of the hybrid camera without deconvolution (red curves) is lower than that of the conventional camera at 47 m but, as expected, it varies much less with the object distance, which allows us to perform deconvolution with a single filter. Using the same bar-target-based method, we measured the MTF of the hybrid camera with deconvolution.
The results are shown in Figure 4 (yellow curves). It can be noticed that the MTF of the hybrid camera is above that of the conventional camera at each frequency for all object positions, especially for the closest ones. As the deconvolution filter was designed to restore frequencies attenuated by the lens with the phase mask, an ideal hybrid system without noise would have an MTF equal to one over the whole frequency range, and would therefore be higher than the MTF of a diffraction-limited lens without post-processing. In practical conditions, the post-processed hybrid camera MTF does not reach 1 but has significantly higher values than the conventional system, especially at 9.6 m, as shown in Figure 4. This validates the performance of the hybrid camera.
As a further illustration, we compare in Figure 5 images of the same scene acquired by both cameras at full aperture. The scene is composed of two spoke targets positioned at 4.8 m and at 47 m. These targets contain all spatial frequencies, from 5 lp/mm at the periphery to infinity at the centre, and therefore allow one to observe continuous contrast variations with respect to spatial frequency. The hybrid system overcomes the conventional camera's inability to provide sharp images simultaneously at 4.8 m and 47 m for a given focus setting. For the conventional camera, the nearest spoke target is significantly blurred and shows two contrast inversions, indicating that the MTF goes to zero at least twice, as confirmed by the measured MTF in Figure 4(c) (blue curve). For the hybrid system, instead, both targets are sharp and have similar image quality.

CONCLUSION
In conclusion, we experimentally demonstrated the increase in depth of field of an f/1.2 HD visible and near-infrared camera using wavefront coding. We designed and built an end-to-end hybrid imaging system combining an optimized six-ring pyramidal phase mask placed in the aperture stop of the lens and a digital deconvolution algorithm. The performance of this hybrid camera has been compared to that of a conventional camera based on the same conventional optical design by performing MTF measurements and imaging experiments.
The results show an increase in depth of field by a factor of 2.5. This is, to the best of our knowledge, the first time that such a performance increase using wavefront coding has been designed, implemented and characterized for a system providing real-time High Definition images with a very small f-number.
This result shows that it is possible to increase the depth of field of traditionally designed optical systems simply by inserting a mask in the stop plane and using an FPGA-based deconvolution filter, while keeping the optical system as compact and lightweight as the traditional one. This could lead to many applications in domains such as embedded imaging systems and handheld or head-mounted devices, where compactness and weight are key parameters. For example, an attractive application of this approach would be to increase the performance of security and surveillance equipment at low light levels while reducing their cost and weight by avoiding mechanical focusing parts.