Constant Color API: Technology Review & Demos

Colors are Clearer Than Ever Before: Constant Color API Review

The human visual system adapts to a wide range of lighting conditions, from warm sunlight to the cool glow of office fixtures. A smartphone camera, by contrast, runs every shot through numerous system-level processing steps and enhancements.

As a result, the same color sample can appear differently under varying illumination or on different devices. In a professional environment, such inconsistency leads to significant waste of time and resources.

In this article, the It-Jim mobile app development team explores how smartphones process images and what factors influence color consistency, and examines the Constant Color API introduced by Apple.

In Search of Constant Color

Different color reproduction across devices based on the photos showcasing three towels

The root cause of color inconsistency lies in the hardware and software processing. Modern smartphone cameras rely on a series of automated adjustments known collectively as the 3A pipeline.

First, Auto-Focus analyzes contrast in the scene to lock onto the sharpest subject. Then Auto-Exposure measures overall brightness and adjusts shutter speed and sensor sensitivity (ISO), since most phone cameras have a fixed aperture. Finally, Auto-White-Balance estimates the scene’s color temperature, whether warm incandescent light or cool daylight, and applies a corrective tint so that whites appear neutral.

All of these decisions draw on built-in light meters and computer vision (CV) before the sensor data proceeds to multi-frame fusion and further enhancement.
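To make the white-balance step more concrete, here is a minimal sketch of the classic "gray-world" heuristic in plain Swift. It is a textbook illustration only, not the algorithm Apple or any other vendor ships; the `RGB` type and the `grayWorldBalanced` function are hypothetical names, and pixel values are assumed to be linear and in the 0...1 range.

```swift
// A linear RGB pixel with channels in the 0...1 range (hypothetical type).
struct RGB: Equatable {
    var r: Double
    var g: Double
    var b: Double
}

// Gray-world white balance: scale each channel so that the average color
// of the whole image becomes a neutral gray.
func grayWorldBalanced(_ pixels: [RGB]) -> [RGB] {
    guard !pixels.isEmpty else { return [] }
    let n = Double(pixels.count)
    let meanR = pixels.reduce(0.0) { $0 + $1.r } / n
    let meanG = pixels.reduce(0.0) { $0 + $1.g } / n
    let meanB = pixels.reduce(0.0) { $0 + $1.b } / n
    // A channel with zero mean cannot be rescaled; return the input unchanged.
    guard meanR > 0, meanG > 0, meanB > 0 else { return pixels }

    let gray = (meanR + meanG + meanB) / 3
    let gainR = gray / meanR
    let gainG = gray / meanG
    let gainB = gray / meanB

    // Apply the gains and clamp to the valid range.
    return pixels.map {
        RGB(r: min($0.r * gainR, 1),
            g: min($0.g * gainG, 1),
            b: min($0.b * gainB, 1))
    }
}

// Usage: a scene with a warm (reddish) cast.
let balanced = grayWorldBalanced([
    RGB(r: 0.8, g: 0.6, b: 0.4),
    RGB(r: 0.4, g: 0.3, b: 0.2)
])
print(balanced)
```

The heuristic fails whenever the scene is not gray on average (for example, a close-up of a red towel), which is one reason automatic white balance alone cannot deliver stable color measurements.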

3A color pipelines that include balanced color tone, brightness control, and sharpness

A mobile application that delivers a stable color signal regardless of lighting conditions can become a competitive advantage for both end users and enterprise customers.

Context of Existing Solutions

Let’s examine how modern smartphones process images “under the hood” and why this affects color consistency across devices.

The simplified flow diagram below illustrates the overall pipeline.

High-level diagram showing the color consistency in a camera device

Smartphone cameras begin by capturing light through a grid of red, green, and blue filters and then reconstruct a full-color image by filling in the missing data.

They automatically adjust focus, exposure, and white balance before blending multiple exposures and reducing noise to produce a clear, well-lit photo. While these steps make images look good, each phone’s unique processing can shift colors so that the same scene may appear differently on different devices.

Some newer smartphones even replace the entire sequence with a single deep-learning-based image-signal-processing model (DeepISP). One common workaround uses physical color targets, such as the X-Rite ColorChecker, or laboratory-grade spectrophotometers, which provide reference spectral data but are bulky and expensive.

Another approach is to calibrate the camera using a white and/or gray card of known reflectance. With the card as a reference, photographers can ensure that colors are reproduced accurately and that the image is correctly exposed.

However, this method requires manual setup and cannot guarantee perfect results, especially when the device is in motion or the lighting changes.
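The gray-card idea can be sketched in a few lines. The example below assumes linear RGB values in the 0...1 range and a card whose reflectance is known (0.18 is used purely as an illustrative value); `RGB`, `grayCardGains`, and `applyGains` are hypothetical names introduced for this sketch.

```swift
// A linear RGB triple in the 0...1 range (hypothetical type).
struct RGB { var r, g, b: Double }

// Per-channel gains that map the measured card color onto its known
// reflectance. This corrects both the color cast and the exposure.
// Returns nil if any measured channel is zero.
func grayCardGains(measuredCard card: RGB, knownReflectance: Double) -> RGB? {
    guard card.r > 0, card.g > 0, card.b > 0 else { return nil }
    return RGB(r: knownReflectance / card.r,
               g: knownReflectance / card.g,
               b: knownReflectance / card.b)
}

// Applies the gains to a pixel and clamps the result to 0...1.
func applyGains(_ gains: RGB, to pixel: RGB) -> RGB {
    RGB(r: min(pixel.r * gains.r, 1),
        g: min(pixel.g * gains.g, 1),
        b: min(pixel.b * gains.b, 1))
}

// Usage: the card looks too red and too dark in blue.
if let gains = grayCardGains(measuredCard: RGB(r: 0.24, g: 0.18, b: 0.12),
                             knownReflectance: 0.18) {
    let corrected = applyGains(gains, to: RGB(r: 0.4, g: 0.4, b: 0.4))
    print(corrected)
}
```

The gains are valid only for the lighting under which the card was photographed, which is exactly the limitation described above: once the light or the device position changes, the card has to be measured again.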

Grey reference reflector for camera calibration

In iOS 18, Apple introduced the Constant Color API framework, which activates a dedicated “studio” flash mode to capture images with a neutral white balance regardless of ambient light sources.

Four images of the same coffee package by using different camera settings

Conventional pipelines (3A, HDR fusion, denoising, and tone mapping) produce visually pleasing results for general viewers but are unsuitable for exact color measurement. Physical targets and spectrophotometers remain impractical for mobile applications.

The Constant Color API and similar “studio-lighting” approaches combine ease of use with accuracy, delivering both stable color captures and per-pixel confidence data. These outputs enable advanced features such as extracting the exact color of a selected region of interest.

Taming the Constant Color API

To obtain a “studio”-quality image free from color distortion, we selected the Constant Color API in AVCapturePhotoOutput, available from iOS 18 onward.

In this mode, the system fires the device’s built-in flash at a fixed spectrum and locks the white balance regardless of ambient lighting. In addition to the image itself, the API returns a confidence map that enables assessment of measurement accuracy within a selected region.

Samples of normal photo, constant color photo, and confidence map

It is important to note certain device limitations. The mode is supported only on hardware with a sufficiently powerful flash (iPhone 14 and newer). It disables manual exposure control and requires RAW capture to be turned off. In very low‐light conditions without enough reflected flash, the quality of the confidence map may degrade.

To leverage the Constant Color API, a specific AVCaptureSession configuration is required:


import AVFoundation

func setupCaptureSession() {
    captureSession.beginConfiguration()
    // Commit the configuration when the function exits, on every path
    defer { captureSession.commitConfiguration() }

    // Some default setup for AVCaptureSession
    captureSession.sessionPreset = .photo
    // Set up AVCaptureDevice, AVCaptureDeviceInput, AVCapturePhotoOutput,
    // depth data, and quality here

    // Special option for Constant Color API:
    // isConstantColorSupported is true only if the session's current
    // configuration allows photos to be captured with constant color.
    // Its value may change when switching cameras or formats, so check
    // it again after such changes.
    photoDataOutput.isConstantColorEnabled = photoDataOutput.isConstantColorSupported
}
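Enabling the mode on the output is only part of the setup: each capture request also needs matching photo settings. The fragment below is a hedged sketch based on the iOS 18 AVFoundation API, not code from our project. `captureConstantColorPhoto` is a hypothetical helper, and the flash requirement in particular is an assumption that should be checked against Apple's documentation and the output's supported flash modes.

```swift
import AVFoundation

// Hypothetical helper: requests a single capture, with Constant Color
// when the photo output has it enabled.
func captureConstantColorPhoto(
    with photoDataOutput: AVCapturePhotoOutput,
    delegate: AVCapturePhotoCaptureDelegate
) {
    let settings = AVCapturePhotoSettings()

    if photoDataOutput.isConstantColorEnabled {
        settings.isConstantColorEnabled = true
        // Also ask for a regular photo; the delegate can recognize it
        // through AVCapturePhoto.isConstantColorFallbackPhoto
        settings.isConstantColorFallbackPhotoDeliveryEnabled = true
        // Assumption: the mode relies on the flash, so it must not be .off
        settings.flashMode = .auto
    }

    photoDataOutput.capturePhoto(with: settings, delegate: delegate)
}
```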

In the AVCapturePhotoCaptureDelegate, the AVCapturePhoto object now exposes additional properties:

  • constantColorConfidenceMap – a pixel buffer with the same aspect ratio as the constant color photo, where each pixel value (unsigned 8-bit integer) indicates how fully the constant color effect has been achieved in the corresponding region of the constant color photo – 255 means full confidence, 0 means zero confidence.
  • constantColorCenterWeightedMeanConfidenceLevel – a single score summarizing the overall confidence level of a constant color photo, weighted toward the center of the frame.

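func photoOutput(
    _ output: AVCapturePhotoOutput,
    didFinishProcessingPhoto photo: AVCapturePhoto,
    error: Error?
) {
    // Convert AVCapturePhoto to UIImage; stop on capture errors
    guard error == nil,
          let data = photo.fileDataRepresentation(),
          let image = UIImage(data: data) else { return }

    if photo.isConstantColorFallbackPhoto {
        // Save the normal (fallback) photo, then return and wait for
        // the next callback with the Constant Color data
        normalPhotoImage = image
        return
    }

    // Save Constant Color image photo
    constantColorPhotoImage = image

    // Get Confidence Map pixel buffer (the property is optional)
    guard let photoConfidenceMap = photo.constantColorConfidenceMap else { return }

    // Convert CVPixelBuffer to a CGImage-backed UIImage so that
    // averageColor(rect:) can crop it
    let ciMap = CIImage(cvPixelBuffer: photoConfidenceMap)
    guard let cgMap = CIContext().createCGImage(ciMap, from: ciMap.extent) else { return }
    let mapImage = UIImage(cgImage: cgMap)
    confidenceMapImage = mapImage

    // Set parameters of ROI
    let roiSize = 30    // in pixels of the Constant Color photo
    var roiColor: UIColor? = nil
    var roiConfidence: Float? = nil

    // Get color of ROI
    guard let cgImage = image.cgImage else { return }

    // Init rect for ROI (zone in center of photo)
    let rect = CGRect(
        x: (cgImage.width - roiSize) / 2,
        y: (cgImage.height - roiSize) / 2,
        width: roiSize,
        height: roiSize
    )

    // averageColor(rect:) is our UIImage extension, shown later in the article
    roiColor = image.averageColor(rect: rect)

    // The confidence map shares the photo's aspect ratio but may differ
    // in resolution, so scale the ROI into map coordinates
    let scale = CGFloat(cgMap.width) / CGFloat(cgImage.width)
    let mapRect = CGRect(
        x: rect.minX * scale,
        y: rect.minY * scale,
        width: rect.width * scale,
        height: rect.height * scale
    )

    // Calculate confidence for ROI (0 = no confidence, 1 = full confidence)
    if let avgGray = mapImage.averageColor(rect: mapRect) {
        var white: CGFloat = 0
        var alpha: CGFloat = 0
        avgGray.getWhite(&white, alpha: &alpha)
        roiConfidence = Float(white)
    }

    // Return feedback for sharing info about photos and colors
}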

It is important to recognize that the size of the chosen region of interest (ROI) critically affects color accuracy.

Through experimentation, we discovered that regions smaller than 20×20 pixels yield technically correct readings but tend toward muted, pastel tones. Conversely, regions larger than 50×50 pixels preserve saturation more faithfully, yet the extracted color often blends toward gray and loses its distinctive hue.
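One way to combine the ROI with the confidence map is to weight every pixel by its confidence and skip pixels that fall below a threshold. The sketch below works on plain arrays so that it stays independent of UIKit; it is an illustration rather than the code used in our app. The `RGB` type, the function name, and the default threshold of 128 are assumptions, and the confidence map is assumed to have already been resampled to the photo's resolution.

```swift
// A color triple in the 0...1 range (hypothetical type).
struct RGB { var r, g, b: Double }

// Averages a square ROI in the center of the image. Each pixel is weighted
// by its confidence (0...255); pixels below `minConfidence` are skipped.
// `pixels` and `confidence` are row-major arrays of width * height elements.
// Returns nil for invalid input or when no pixel passes the threshold.
func confidenceWeightedColor(
    pixels: [RGB],
    confidence: [UInt8],
    width: Int,
    height: Int,
    roiSize: Int,
    minConfidence: UInt8 = 128
) -> RGB? {
    guard width > 0, height > 0,
          pixels.count == width * height,
          confidence.count == pixels.count,
          roiSize > 0, roiSize <= width, roiSize <= height else { return nil }

    let x0 = (width - roiSize) / 2
    let y0 = (height - roiSize) / 2
    var sum = RGB(r: 0, g: 0, b: 0)
    var weightSum = 0.0

    for y in y0..<(y0 + roiSize) {
        for x in x0..<(x0 + roiSize) {
            let i = y * width + x
            guard confidence[i] >= minConfidence else { continue }
            let w = Double(confidence[i]) / 255
            sum.r += pixels[i].r * w
            sum.g += pixels[i].g * w
            sum.b += pixels[i].b * w
            weightSum += w
        }
    }

    guard weightSum > 0 else { return nil }
    return RGB(r: sum.r / weightSum, g: sum.g / weightSum, b: sum.b / weightSum)
}
```

Returning nil when no pixel passes the threshold lets the caller ask the user to retake the photo instead of reporting a color that the system itself does not trust.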

To compute the region’s average color, we implemented a UIImage extension that accepts a CGRect parameter, applies a Core Image filter to the specified area, and returns the resulting UIColor.


import UIKit
import CoreImage

extension UIImage {
    func averageColor(rect: CGRect) -> UIColor? {
        let ciAvgFilterName = "CIAreaAverage"

        // Crop original CGImage to specified rect
        guard let cropped = cgImage?.cropping(to: rect) else {
            return nil
        }
        // Create CIImage from cropped CGImage for use with CIFilter
        let ciImage = CIImage(cgImage: cropped)

        // Init avg filter by name
        guard let filter = CIFilter(name: ciAvgFilterName) else {
            return nil
        }

        // Set the input image and average over its full extent
        // (set explicitly rather than relying on the filter's default extent)
        filter.setValue(ciImage, forKey: kCIInputImageKey)
        filter.setValue(CIVector(cgRect: ciImage.extent), forKey: kCIInputExtentKey)

        // Obtain the filter output, which is a 1×1 CIImage representing average color
        guard let output = filter.outputImage else {
            return nil
        }

        // Prepare buffer to hold RGBA8 pixel data
        var bitmap = [UInt8](repeating: 0, count: 4)

        // Render the CIImage into the buffer to extract pixel bytes
        // (in production code, reuse one CIContext instead of creating it per call)
        CIContext().render(
            output,
            toBitmap: &bitmap,
            rowBytes: 4,
            bounds: CGRect(x: 0, y: 0, width: 1, height: 1),
            format: .RGBA8,
            colorSpace: CGColorSpaceCreateDeviceRGB()
        )

        // Convert RGBA8 bytes into UIColor normalized to [0, 1] range
        return UIColor(
            red:   CGFloat(bitmap[0]) / 255,
            green: CGFloat(bitmap[1]) / 255,
            blue:  CGFloat(bitmap[2]) / 255,
            alpha: 1
        )
    }
}

Project Demo Video and Images

The specialized configuration phase is complete, and now we move on to the demo. To showcase the logic we have implemented, we will recreate a user interface composed of a Main View featuring a Capture Button and a separate Preview View.

In the Preview View, we will present four interactive cards: Color, Confidence Map, Constant Photo, and Normal Photo. On the Color card, we will add functionality to find the closest matching RAL palette color based on the captured RGB values.
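The palette lookup on the Color card can be approximated with a nearest-neighbor search. The sketch below uses squared Euclidean distance in RGB, which is the simplest possible metric and only an assumption about how such matching could work. The palette entries are placeholders with made-up values, not real RAL colors; an actual implementation would load reference values from an official RAL data source.

```swift
// One palette entry with sRGB components in the 0...255 range.
struct PaletteColor {
    let name: String
    let r, g, b: Double
}

// Returns the palette entry with the smallest squared Euclidean distance
// to the given color, or nil for an empty palette.
func closestColor(r: Double, g: Double, b: Double,
                  in palette: [PaletteColor]) -> PaletteColor? {
    func squaredDistance(_ c: PaletteColor) -> Double {
        let dr = c.r - r
        let dg = c.g - g
        let db = c.b - b
        return dr * dr + dg * dg + db * db
    }
    return palette.min(by: { squaredDistance($0) < squaredDistance($1) })
}

// Placeholder palette (hypothetical names and values, NOT real RAL data).
let samplePalette = [
    PaletteColor(name: "Sample red", r: 200, g: 30, b: 30),
    PaletteColor(name: "Sample green", r: 40, g: 160, b: 60),
    PaletteColor(name: "Sample blue", r: 30, g: 60, b: 180)
]

if let match = closestColor(r: 210, g: 40, b: 35, in: samplePalette) {
    print(match.name)
}
```

RGB distance is not perceptually uniform, so two colors that are close numerically can still look different. Converting both colors to CIELAB and comparing them with a ΔE formula is a common refinement.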

 

Interactive demo example comparing a normal photo and a constant photo of a cup with the It-Jim logo

 

Interactive demo example comparing a normal photo and a constant photo of a coffee package

 

Interactive demo example comparing a normal photo and a constant photo of a washing sponge

Final Thoughts on the Constant Color API

Accurate color measurement in the field remains a nontrivial challenge: the spectral characteristics of light sources, surface properties, and the camera’s internal processing all introduce their distortions.

Our implementation based on the Constant Color API shows that on modern devices, by using a controlled “studio” flash and a per-pixel confidence map, one can closely approximate the true hue: the resulting images render object and surface colors far more naturally, narrowing the gap between digital capture and human perception under neutral (diffuse) lighting.

Keep in mind that this method does not guarantee 100% correlation with the optical spectrum. In the real world, factors such as material, surface roughness, ambient light, and camera angle still require additional compensation. However, access to pixel-level confidence and the ability to programmatically filter out “weak” regions open new horizons for mobile color-measurement solutions.

Looking ahead, integration of machine-learning models for advanced spectral correction promises further gains – each year, these networks become more capable of inferring true colors despite variable lighting.

Yet even today, the Constant Color API represents a powerful tool for achieving far more natural color reproduction than previously available methods.

How would you apply this technology? Can our current handheld devices truly see and convey pure color to us?

Computer Vision Technology Costs: Key Factors & Use Cases

Computer Vision Cost: Understand Your Budget to Build Powerful Vision AI Solutions

How much does computer vision technology cost?

To make a long story short, the rough cost of the basic AI vision software or pilot project starts at $30,000. The more advanced computer vision solution costs around $100,000 or higher.

The overall cost of the computer vision project depends on its complexity, data acquisition processes, integration requirements, compliance and security matters, as well as specifications of hardware and software components. Additionally, consider price variations concerning industry-specific use cases, annual maintenance costs, and the selected team of CV experts working on the project. For accurate budgeting, it is essential to evaluate all these factors.

Thus, many unknown aspects make it challenging to calculate the precise development costs of R&D projects. This aspect leads to unpredictability and imprecision in estimates, particularly in the early stages.

In this comprehensive guide, our team will examine the cost of computer vision software and help you plan your investment accordingly. You’ll discover:

  • Key factors influencing the final cost of a computer vision project.
  • Whether computer vision is indeed expensive.
  • Specifics of software and hardware costs involved.
  • AI vision pricing options depending on the selected infrastructure setup.
  • Computer vision development cost breakdown for each project phase.
  • Use cases and cost of computer vision across industries.
  • Strategies to optimize your computer vision model costs.

Let’s start by exploring the specifics of computer vision technology.

Is Computer Vision Expensive to Implement?

The global AI vision market is estimated to be worth $15.85 billion and projected to reach $108.99 billion by 2033, representing a 24.1% annual growth rate.

Such an incredible demand for innovations is also supported by government initiatives that promote digital transformation and sustainability. As a result, entrepreneurs in various fields utilize modern technologies, including deep learning models and computer vision, to enhance their operations.

AI vision market size and growth forecast 2023-2033

The primary goal of the technology is efficiency, as it converts raw footage into informed business decisions. At the same time, economic feasibility plays a critical role; if the implementation costs of computer vision are too high, the business case falls apart.


“CV is expensive? Yes, if you’re solving the wrong problem or building the wrong solution. But when designed right, it replaces hours of manual work, reduces human error, and delivers long-term savings.”

Ievgen Gorovyi, PhD in Computer Vision & Founder at It-Jim


Thus, the investment in AI vision projects can be substantial, but it offers significant benefits. Emerging AI technology and computer vision aid in process automation, accuracy improvement, cost optimization, and enhanced efficiency.

Computer vision is a relatively new AI technology that needs a skilled pool of talent. It may be challenging to find genuine professionals with relevant expertise. Top-notch technicians, AI consultants, and solution architects are in high demand, and even a small team can be costly, becoming the project’s most significant expense.

Other challenges may lie in lighting, motion, hardware limitations, deployment environments, computational burden, and, most importantly, user experience and business optimization. If one piece is missed, the business ROI crumbles. As a result, even a highly experienced team needs to invest significant effort and time to turn a computer vision software idea into reality.

To conclude, computer vision algorithms are costly and require a significant amount of technical expertise to implement effectively. On the other hand, it doesn’t mean that computer vision implementation is out of reach for smaller businesses; it simply means that companies must be cautious when deciding how to deploy computer vision technology.


“Here’s my take: the biggest cost in computer vision isn’t the tech. It’s the gap between assumptions and reality.”

Ievgen Gorovyi, PhD in Computer Vision & Founder at It-Jim


Save time and connect with our experts by sharing your computer vision software idea. 

Factors Influencing Computer Vision Costs

When estimating the cost of a computer vision project, you must consider several key aspects. Let’s outline each of them one by one.

1. Project Scope & Complexity

The scope of the AI vision project directly influences cost. The solution may include advanced functionalities such as real-time processing, image recognition, object detection, multi-camera support, 3D modeling, and similar capabilities.

These computer vision tasks require more resources (hardware and technical expertise) than simpler functions. As a result, they drive the need to incorporate machine learning models and establish a high-performance computing infrastructure.

Additionally, software complexity isn’t just about algorithms. It encompasses the overall scale, interdependencies, and advanced technologies required to develop practical computer vision applications.

For example, basic object detection projects can range from $10,000 to $30,000, while custom model development can start at $50,000 and increase in cost depending on complexity.

Real-time video analysis projects typically cost between $40,000 and $100,000, while advanced 3D computer vision solutions can exceed $100,000. These figures clearly indicate that the price of computer vision projects varies significantly.

2. Data Amount & Quality

Gathering the necessary data helps train AI vision software to complete tasks with higher accuracy. The proper amount and quality of data are critical factors influencing the success of machine and deep learning models.

Obtaining high-quality, annotated data from a large dataset requires time and resources. You can either use applicable data from in-house sources (e.g., video footage, images) or public databases, or purchase it from a third-party provider.

The price of computer vision can vary depending on the chosen data acquisition method and the required quality. High-quality data annotation costs more, but it leads to better model accuracy and performance. Achieving higher accuracy often requires more complex algorithms and increased development costs and time.

3. Hardware Investments

The price of hardware components can also become a significant factor in the overall project cost, particularly for those with an edge-based approach.

It is necessary to invest a substantial amount of money in high-quality cameras, processing capacities, and other equipment that support the project objectives to capture and process visuals.

Some typical hardware components of vision AI systems include:

  • Industrial cameras or other types of sensors.
  • Graphics Processing Units (GPUs) for parallel image processing and network training. Access to sufficient GPU resources is essential for running deep learning models efficiently, especially when processing images or video frames on remote data center servers.
  • Edge devices for real-time processing, such as mobile devices, cameras, robots, or embedded IoT systems.
  • High-performance CPUs and RAM for complex tasks of image preprocessing and data augmentation. Powerful processors are essential for handling resource-intensive image processing tasks, directly impacting the efficiency and cost-effectiveness of computer vision.

It is crucial to capture and handle data accurately, with high security in mind. This element is essential, for instance, in healthcare projects, where privacy and data concerns are vital.

Additionally, factors such as environmental conditions and camera placement may impact the total investment in hardware. Sufficient physical space is necessary to accommodate hardware and ensure proper integration, especially in cluttered environments.

Regarding investment in camera equipment for computer vision projects, the price ranges from just $30 to $3,500 per unit. The cost varies depending on resolution, transfer speed capabilities, and other features.

| Camera type | Price range | Features |
|---|---|---|
| Basic | $30 – $200 | Standard resolution, basic transfer speeds |
| Professional | $200 – $1,500 | High-resolution, advanced features |
| Enterprise | $1,500 – $3,500 | Premium specs, industrial grade |
4. Software Frameworks & Tools 

Software costs in computer vision projects can differ substantially, especially when comparing proprietary and open-source options. The general advice is to look beyond the initial subscription or licensing fees. Take into consideration ongoing costs associated with hosting, software updates, and any necessary customization or integration.

Open-source tools such as TensorFlow, PyTorch, and OpenCV provide a robust and adaptable foundation for developing custom computer vision software. Integrating various machine learning platforms within a project can significantly impact system complexity, maintenance, and overall implementation costs.

These tools give access to source code and community resources, which are ideal for teams that need customization and budget management. However, developing and maintaining custom computer vision software can be resource-intensive, requiring significant processing power and specialized expertise.

In comparison, off-the-shelf AI vision solutions, such as MATLAB, offer better support and easy-to-use interfaces. Yet, these products come with substantial licensing fees, extra costs for support, and functionality that may not fit the specific use case.

Thus, many companies opt to develop custom AI vision solutions, as they offer improved accuracy and performance.


“Off-the-shelf models might get you 60-70% accuracy. Sounds fine until you realize that in production, 70% of the time, it fails. When a business problem is specific, your solution has to be too.”

Ievgen Gorovyi, PhD in Computer Vision & Founder at It-Jim


5. Integrations with Internal Systems

Integrating with existing systems or databases increases the total cost of building computer vision solutions. For seamless communication, we need custom API development, data mapping, and thorough testing to ensure that the AI vision service functions correctly.

In addition, architectural design choices and infrastructure setups have a significant impact on integration and costs. Complex architecture can raise costs, particularly when adding advanced features or ensuring the system integrates smoothly with existing workflows. We elaborate on the specifics of infrastructure costs later in this guide.

Using standard interfaces and protocols can facilitate seamless integration. Organizations should be cautious of technology lock-ins when utilizing computer vision systems. Relying on off-the-shelf solutions can limit flexibility and make future upgrades more challenging.

6. Personnel Expertise & Team Location

Top computer vision engineers earn high salaries since they offer a top level of expertise and knowledge. This adds to the costs of computer vision projects, especially for advanced solutions.

Additionally, the costs of implementing computer vision vary depending on the selected development model. Companies can choose from an in-house model, hiring an individual AI consultant, or working with a remote team (IT outsourcing). 

In-house development often requires additional equipment and increases staffing costs. Additionally, the location of an AI and CV software development company influences project pricing, as salaries can vary significantly across regions. For example, hourly rates of CV professionals on the local market can be 30-50% higher than those of a team of CV specialists from Eastern Europe.

Therefore, delegating computer vision software development to a remote team of professionals, such as It-Jim, is a wise decision. This way, you save on your budget and receive top-quality expertise. 

Our team has developed various business solutions that utilize computer vision technology for object detection, productivity monitoring, visual search recommendations, and more. 

Reach out to our CV experts and discuss the project from both technical and business perspectives to ensure a high ROI in your business case.

Infrastructure Computer Vision Costs: Cloud vs. Edge Computing

An illustration depicting various key factors influencing computer vision price, including technology and hardware costs

Infrastructure choices, including the need for cloud storage and processing resources, primarily drive software development costs. Choosing between cloud-based and edge computing has implications for project cost, efficiency, and latency.

Many overlook the architectural design when estimating the costs of computer vision.

Computer Vision Cost: Cloud-Based Solutions

Cloud-based solutions utilize popular systems such as AWS Rekognition, Azure Cognitive Services, or Google Cloud AI Vision. These services connect via APIs that send every image or camera frame (data) to a cloud server for processing. The API response usually includes detected classes or OCR data. These details are key for grasping API performance and cost.

These cloud services have flexible pricing. They charge based on units, detection, labels, or frames per second (FPS). For most CV projects, you need a mix of these services to cover all AI vision tasks and boost output accuracy. For example, a plate recognition system needs three services: car detection, plate number identification, and plate reading. Thus, estimating the cost of cloud-based computer vision with precision may be challenging.
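A rough estimate is still possible once the frame volume and the per-call prices are known. The sketch below multiplies the monthly number of frames by the price of every service in the chain. All prices and volumes are hypothetical placeholders rather than actual vendor rates, and the model assumes that every frame passes through every service, which is the worst case.

```swift
// One cloud vision service in the processing chain.
struct CloudService {
    let name: String
    let pricePerThousandCalls: Double  // USD, hypothetical value
}

// Monthly cost when every frame is sent to every service.
func monthlyCloudCost(framesPerDay: Int, days: Int, services: [CloudService]) -> Double {
    let calls = Double(framesPerDay * days)
    return services.reduce(0.0) { $0 + calls / 1000 * $1.pricePerThousandCalls }
}

// Usage: the three-step plate recognition chain with placeholder prices.
let plateChain = [
    CloudService(name: "Car detection", pricePerThousandCalls: 1.0),
    CloudService(name: "Plate number identification", pricePerThousandCalls: 1.5),
    CloudService(name: "Plate reading", pricePerThousandCalls: 1.0)
]
let estimate = monthlyCloudCost(framesPerDay: 10_000, days: 30, services: plateChain)
print(estimate)
```

In practice, later steps usually run only on frames where the earlier step found something, so the real bill tends to be lower than this upper bound. Tiered pricing and free quotas, which the sketch ignores, shift the number further.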

As a result, a cloud-based method offers flexibility and a lower initial investment. These solutions offer free trials for small PoC projects with low-volume testing. However, the price can rise significantly due to latency issues, higher processing volumes, or scalability needs. There is also a risk of bottlenecks, which can raise costs because the system requires a constant internet connection to work well.

Computer Vision Price: Edge Solutions 

Edge computing enables rapid data processing, removing the need for data transmission to a central server. The system operates on physical computers and servers with direct network connections. This decentralized method is very scalable. You can add or remove edge endpoints without affecting the others. Edge AI is crucial for real-time processing and privacy protection, and it works well in settings such as smart factories.

This method requires a larger upfront investment in hardware, such as local processors or AI accelerators. Despite high investment, edge computing can cut costs over time and improve efficiency. It processes data locally, which is especially helpful for large projects.
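The trade-off between a higher upfront cost and lower running costs can be expressed as a break-even point. The following sketch is a simplified model with hypothetical figures: it ignores hardware depreciation, staffing, and price changes, and `breakEvenMonths` is a name introduced only for this example.

```swift
// Number of months after which the cumulative edge cost is no higher than
// the cumulative cloud cost. Returns nil when the edge setup is not cheaper
// to run per month, because it then never pays off.
func breakEvenMonths(edgeUpfront: Double, edgeMonthly: Double, cloudMonthly: Double) -> Int? {
    guard cloudMonthly > edgeMonthly else { return nil }
    let months = edgeUpfront / (cloudMonthly - edgeMonthly)
    return Int(months.rounded(.up))
}

// Usage with placeholder figures (USD).
if let months = breakEvenMonths(edgeUpfront: 12_000, edgeMonthly: 200, cloudMonthly: 1_200) {
    print("Edge hardware pays off after \(months) months")
}
```

With these placeholder numbers, the edge setup saves 1,000 per month, so the 12,000 upfront investment is recovered after 12 months.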

Here’s a comparison table showing the main differences between cloud-based and edge-based AI:

| Factor | Cloud-Based AI Vision | Edge-Based AI Vision |
|---|---|---|
| Latency | Higher latency due to network transmission | Low latency with real-time processing on-device |
| Connectivity | Requires a stable internet connection | Works offline or with intermittent connectivity |
| Processing Location | Data is sent to the cloud for processing | Processing occurs locally on the edge device |
| Bandwidth Usage | High, as raw or semi-processed data is transmitted | Low, since data is processed and filtered locally |
| Hardware Requirements | Lightweight devices: heavy lifting is done in the cloud | Requires powerful edge devices (e.g., GPUs, TPUs) |
| Scalability | Easily scalable; resources can be added in the cloud | Scaling may require deploying and managing more edge devices |
| Security & Privacy | More risk; data is transmitted and stored remotely | Improved privacy; data remains local |
| Maintenance & Updates | Easier to update centrally | More effort is needed to update distributed edge devices |
| Cost Model | Ongoing costs for cloud services and data transfer | Higher upfront hardware cost but lower long-term cloud fees |
| Use Cases | Ideal for batch processing, analytics, or centralized monitoring | Best for time-sensitive tasks like real-time detection, control |

In conclusion, a good infrastructure choice lies somewhere in between, and many adopt a hybrid approach that balances cost efficiency and system performance. The optimal option depends on project size, performance requirements, and long-term scalability needs.

Computer Vision Cost Breakdown per Project Phase

Dividing software development into phases helps manage the project budget efficiently. You can break down the cost of a computer vision project into the following stages:

  • Planning the project.
  • Preparing the data.
  • Developing the computer vision model.
  • Implementing and deploying the system.
  • Testing and quality assurance.
  • Maintaining and updating the solution.

In this part, we will elaborate on these steps of developing an AI vision solution in greater detail.

1. Project Planning and Scope Definition

Clear goals and careful planning help companies establish a strong foundation for their software development projects. This stage typically accounts for approximately 10% to 15% of the total cost of the computer vision project. It may produce the following deliverables:

  • Defined project goals and success metrics.
  • Defined project functionality and scope.
  • Established agreements among stakeholders.
  • Allocated budget and personnel.
  • Estimated a rough project cost and timeline.
  • Set realistic milestones and deadlines.
  • Ensured adequate data availability for the CV model.

Early project discussions with clients are crucial for gathering information and requirements, establishing a clear roadmap, and preventing scope changes. With clear objectives, you can boost project success and get the expected results. The approach helps manage the development costs of a computer vision solution and optimizes the budget.


“Ironically, skipping early scoping is what usually delays the project later. We’ve learned that the best way to speed things up is to slow down just enough at the start.”

Ievgen Gorovyi, PhD in Computer Vision & Founder at It-Jim


2. Data Preparation & Model Creation

For the project to succeed, it is necessary to have sufficient, high-quality data (e.g., relevant images, video materials) for the system to analyze and learn from. Depending on the problem you want to solve, you can utilize public datasets, synthetic data, or custom image capturing.

Once the data pool is large enough, the next step is to label it correctly (e.g., with segmentation masks, classification tags, or bounding boxes). Proper labeling ensures that the computer vision model knows what to look for and what results to provide, thus directly influencing the system’s accuracy and performance. The choice of AI model architecture also plays a crucial role, as it can significantly affect both the accuracy and cost of the computer vision system.

Thus, the costs of data acquisition, annotation, and computational resources for model training can vary widely depending on the specific use case, ranging from 20% to 50% of the budget.

3. Project Implementation & Deployment

The development phase typically incurs the highest costs of a computer vision project, accounting for more than 50% of the total budget. This share reflects the need for engineering expertise, system integrations, and security measures.

Agile development approaches (e.g., Scrum and Kanban) help minimize costs by aligning implementation with project needs. Focusing on critical functionality can streamline timelines and prevent budget overruns.

Architectural design choices and infrastructure setups have a significant impact on the integration process and associated costs for computer vision. It is vital to deliver system compatibility with the existing workflow and ensure seamless integration of CV models into production. At this stage, MLOps becomes crucial. It aids in version control, CI/CD, performance monitoring, and scaling computer vision models for deployment in real-world settings.

Also, security is vital for protecting sensitive image data and intellectual property. If you need this functionality, be aware that it can be costly and requires investment in infrastructure hardening, data encryption, and continuous monitoring.


“Accurate cost estimation starts with understanding the unique data and infrastructure challenges of each business. Missing these details can lead to underestimations of 70%.”

Ievgen Gorovyi, PhD in Computer Vision & Founder at It-Jim


4. Testing & Quality Assurance

The testing and QA stage is crucial to ensuring the reliability and accuracy of AI vision systems. Rigorous testing methodologies and tools are used to identify issues early and provide scope for improvement.

Computer vision costs can increase due to custom API developments, data mapping, and extensive testing when integrating with existing systems. It is a wise strategy to initiate QA in the early development stages, as it enables refinements based on user feedback, ensuring a high level of accuracy and performance.

5. Ongoing Support & Maintenance

Regular updates and improvements keep a computer vision solution functionally current and secure. These updates typically incur an annual cost of approximately 20% of the original computer vision project cost.

Ongoing monitoring and technical support guarantee optimal system performance. Technical support helps resolve issues quickly, ensuring the system operates smoothly and efficiently. Monitoring helps identify problems early and prevent significant downtime, ensuring steady performance.

The table below provides a rough cost allocation for each project phase.

| Project Stage | Typical Cost Share (in %) | Key Considerations |
| --- | --- | --- |
| Project Planning | 10% – 15% | Define project scope, functionality, integrations, and infrastructure setup. |
| Data Preparation & Model Creation | 20% – 50% | Data collection, cleaning, and annotation. Algorithm selection, training, and validation. |
| Implementation & Deployment | 40% – 60% | System integration and deployment. |
| QA & Testing | 15% – 20% | System testing, scope for improvements, and quality assurance. |
| Ongoing Support & Maintenance (annually) | 10% – 20% | Ongoing support, updates, and scalability enhancements. |
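
To make the shares above more tangible, here is a minimal Swift sketch that turns the percentage ranges into dollar ranges. The total budget of $100,000 is a hypothetical figure chosen only for illustration, and the type and function names are ours, not part of any library.

```swift
import Foundation

// Cost shares copied from the table above (percent of total budget).
// Ongoing support is billed annually and is left out of this one-time split.
struct PhaseShare {
    let name: String
    let lowPercent: Double
    let highPercent: Double
}

let phases = [
    PhaseShare(name: "Project Planning", lowPercent: 10, highPercent: 15),
    PhaseShare(name: "Data Preparation & Model Creation", lowPercent: 20, highPercent: 50),
    PhaseShare(name: "Implementation & Deployment", lowPercent: 40, highPercent: 60),
    PhaseShare(name: "QA & Testing", lowPercent: 15, highPercent: 20),
]

// Convert a percentage range into a dollar range for a given total budget.
func dollarRange(for phase: PhaseShare, totalBudget: Double) -> (low: Double, high: Double) {
    (totalBudget * phase.lowPercent / 100, totalBudget * phase.highPercent / 100)
}

// Hypothetical total budget, chosen only for illustration.
let totalBudget = 100_000.0

for phase in phases {
    let range = dollarRange(for: phase, totalBudget: totalBudget)
    print("\(phase.name): $\(Int(range.low)) to $\(Int(range.high))")
}
```

Note that the ranges are rough and overlap: the lower bounds of the four one-time phases add up to 85% and the upper bounds to 145%, so not every phase can land at the top of its range in the same project.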

Estimated Timeline & Cost of a Computer Vision Project

As mentioned, the key drivers of computer vision cost include software complexity, industry requirements, integrations, data testing and annotation processes, deployment method, and the selected team of AI developers.

Taking all these cost factors into account, the total budget for developing a solid computer vision solution falls within the $100,000 to $350,000 range. But if you want to test the technology or implement a system with prioritized functionality, the cost starts at $60,000 for an MVP project.

The table below provides rough estimates based on the type of computer vision project.

| Project Complexity | Development Cost | Development Timeline | Specifications |
| --- | --- | --- | --- |
| Pilot, Simple Project | $10,000+ | 1-2+ months | PoC project to test the hypothesis |
| Basic AI Vision Software | $30,000+ | 2-3+ months | MVP project with basic features (e.g., OCR, simple classification) |
| Moderate CV-based Project | $60,000+ | 3-5+ months | Mid-level complexity, 1-2 complex features (e.g., object detection) |
| Complex Vision AI Software | $100,000+ | 6-12+ months | Advanced functionality (e.g., custom ML models, real-time tracking), enterprise-level |

Basic computer vision solutions cost around $30,000 and take 2-3 months to build. Designing and building solutions of medium complexity typically begins at $60,000, with a 3- to 5-month timeline. Pricing for advanced systems with increased precision can exceed $100,000, and development can take more than 6 months. Logically, the more complex the system is, the longer it takes to implement.

Important Note on Proof of Concept

Proof of concept (PoC) is a strategic step and one of the best ways to start a vision AI project. Because such projects involve many unknowns, pilot testing makes it possible to assess the project’s feasibility and refine the solution using real-world feedback.

A PoC project typically takes 1 to 3 months and costs only 10-20% of the computer vision budget. Here are the benefits you can expect:

  • Identify potential challenges before the project launches.
  • Understand methods to overcome burdens or limitations encountered.
  • Update the project scope based on feedback from real-world settings.
  • Validate performance standards and system metrics.
  • Reduce risks associated with full-scale vision AI implementation.

Want to estimate the cost of implementing your custom vision AI idea?

Contact our experts, and they can help analyze your project requirements and outline an initial budget. 

Computer Vision Price Across Industries

According to the recent McKinsey report, organizations are increasingly utilizing AI and computer vision technology across multiple business functions, including product and service development, service operations, and software engineering.

Computer vision utilization across industries

Computer vision enables organizations to automate tasks, reduce costs, enhance accuracy, and increase productivity. For instance, artificial intelligence and computer vision in healthcare are utilized to enhance diagnosis and reduce operational expenses.

New capabilities enabled by computer vision technology allow organizations to develop innovative solutions for operational challenges. The ROI of computer vision can differ by industry, use case, and implementation.

Many are already seeing impressive results with the following functions:

  • Manufacturing & Industrials: visual inspection, predictive maintenance, defect detection, quality control, safety, and workforce monitoring.
  • Logistics & Warehousing: package tracking, inventory detection, storage optimization, goods counting, object detection, automation.
  • Healthcare: medical imaging, segmentation, diagnostics support, patient monitoring.
  • Sports & Fitness: pose estimation, real-time movement tracking, athlete analysis.
  • Retail & E-commerce: shelf monitoring, customer behavior analysis,  product recognition, visual search, optical character recognition (OCR), inventory management.
  • Real Estate & Construction: 2D and 3D modeling, layout recognition, property measurements, virtual tours.

Computer vision costs vary across industries due to differences in data complexity, infrastructure requirements, system integration challenges, and regulatory demands.

Key cost drivers include the need for high-precision models, real-time processing, specialized hardware, and compliance with sector-specific standards such as HIPAA in healthcare or GDPR in retail. The scale of deployment and the solution’s integration with current systems also greatly affect the total investment.


At It-Jim, we don’t just build things that operate; instead, we create things that continue to work even when reality gets messy.

We deliver tailored, cost-effective CV solutions across various industries, including manufacturing, sports, healthcare, and retail. If you’re building a new AI product or struggling to get an existing one to perform, let’s talk.

How to Cut Down Computer Vision Software Development Costs

Employing cost-effective strategies maximizes the return on investment in computer vision projects. Focusing on key features, utilizing open-source tools, and rolling out updates in phases can help reduce costs.  

Here’s a helpful list of tips to save money on your next computer vision project:

Advice 1: Prioritize the functionality

To effectively manage and optimize your development budget, elaborate on the essential and secondary functionalities of your solution. Such prioritization helps launch a project within a defined timeline and start testing it in a real-world setting more quickly.

Advice 2: Plan the data collection process

The main challenge with data is ensuring that it covers the relevant use cases with sufficient quality and is labeled correctly, so that the system achieves a high level of accuracy. Therefore, ensure that you delegate this process to a reputable team of professionals, such as It-Jim.

Advice 3: Consult with experts before investing in hardware or software

Before purchasing hardware or other sensors to collect and process data, you’d better consult with experienced CV professionals like It-Jim to avoid pitfalls. Even high-quality cameras and hardware components may not be suitable for a project’s needs, and investing in them may result in a waste of money.

Choose the software to be used in the CV project carefully. Consider your team’s tech skills and the project’s long-term maintenance needs. Open-source tools can save money and provide flexibility. However, they may require additional resources for management and updates.

Advice 4: Leverage open-source technologies

Using open-source frameworks for computer vision projects provides flexibility, cost-effectiveness, and access to extensive community resources. Open-source tools can increase development speed and reduce resource needs, thereby enhancing efficiency.

Proprietary software often incurs licensing fees and limits customization options, leading to higher ongoing costs. Thus, leveraging open-source tools can lead to significant cost savings.

Advice 5: Follow a step-by-step implementation

Breaking down computer vision projects into manageable stages makes the process less overwhelming and more flexible. The phased implementation enables organizations to allocate their budgets more effectively and avoid significant upfront investments.

This method facilitates continuous learning, enabling businesses to adapt their strategies based on early-stage results and feedback. The gradual approach not only minimizes risks but also enhances overall project efficiency and effectiveness.

Advice 6: Start with PoC

Proof-of-concept projects help businesses improve their computer vision solutions. Many unknown factors exist in projects that use AI and computer vision technology. Pilot projects help refine solutions by using real-world feedback and data. This method reduces risks and enhances the system before full deployment.

Advice 7: Choose an IT outsourcing model

If you’re on a tight budget, consider remote or outsourced development as an option to lower your computer vision costs. Reach out to experts in Eastern Europe, who possess a high level of education and technical experience, with rates ranging from $100 to $150 per hour, compared to $300 in the USA.

Conclusion on Computer Vision Cost Estimation

Process of estimating the computer vision project cost

Measuring the return on investment (ROI) for computer vision projects can be tough.

In terms of immediate benefits, you can expect lower operational costs from automation, improved accuracy in quality control, and faster detection of defects or errors. Regarding the longer-term benefits, computer vision technology may lead to enhanced customer satisfaction, a stronger brand reputation, and access to new revenue streams with improved capabilities.

Thus, starting a computer vision project can be challenging, but it can also transform your business for the better. Success needs more than just technical skills. You also need clear goals, good data, and a solid plan from the start.

To sum things up, the key ideas derived from this extensive evaluation of computer vision pricing are as follows:

  1. Understanding key cost drivers, including project complexity, data collection, and integration needs, is essential for effective budget planning.
  2. The main cost drivers of a computer vision project include the complexity, industry-specific requirements, data acquisition, annotation, training, integrations, software and hardware specifications, and potentially other unknown factors. 
  3. Step-by-step implementation, prioritization of essential features, and leveraging open-source tools are proven strategies to minimize computer vision costs.
  4. The costs associated with computer vision vary significantly across different industries, depending on the specific application requirements.
  5. Real-world examples demonstrate the practical benefits and cost savings of computer vision technology. 
  6. Starting a project with proof of concept is a wise strategy to ensure feasibility and project effectiveness.

Why Choose It-Jim for AI & Computer Vision Development?

Partnering with It-Jim for AI and computer vision development offers several competitive advantages, namely:

  • Multidisciplinary team with 10+ Ph.D. holders across multiple scientific domains (Physics, Mathematics, Biophysics).
  • R&D company with a portfolio of 100+ successful projects in computer vision, image and signal processing, machine and deep learning.
  • Offers intellectual processing of visual information for advanced tech applications.
  • Delivery of tailored, cost-effective solutions that align with your business needs.

According to Clutch and our clients’ feedback, “It-Jim provides competitive pricing and good value for cost, as highlighted by clients who appreciated their budget fit and quality deliverables. Project investments ranged from $10,000 to $100,000, with a strong emphasis on cost efficiency and effective resource management.”

In short, by leveraging the expertise of It-Jim, businesses can optimize their costs for computer vision projects and achieve their desired outcomes.

When you’re ready to move forward, we can help bring your vision to life. We ensure your computer vision projects deliver maximum value and long-term ROI.

Extended Reality Project: Code Samples & Demos

Extended Reality – XR: A Gateway to Spatial Interfaces

The Augmented Reality (AR), Mixed Reality (MR), and Virtual Reality (VR) markets, collectively known as Extended Reality (XR), continue to evolve and grow rapidly. Once the stuff of science fiction, these technologies are now becoming part of everyday reality.

Precedence Research predicts rapid growth in the AR and VR market over the next decade, as illustrated in the graph below.

Graph representing size of AR&VR market 2025-2034

Users are increasingly interested in portable XR devices, driven by the emergence of spatial devices such as Apple Vision Pro and Meta Glasses. These platforms have normalized gesture-based interaction, especially the pinch gesture, as a natural way to control virtual content.

Images of Apple Vision Pro and Meta Glasses

At It-Jim, we’re inspired by the vision of a world where physical and virtual environments are seamlessly blended, and interaction with them is intuitive and unified.

In this small project, we aimed to determine if XR-type gestures can be achieved on a regular iPhone before XR glasses become widely adopted.

A key stage of the project involved technical research, where the task was to evaluate the feasibility of implementing such a system using only built-in iOS tools.

While third-party solutions were considered, deeper analysis revealed that all the necessary mechanisms are already available natively through Apple’s Vision Framework, ARKit, and RealityKit.

We already have experience with Hand Pose Detection and existing solutions that use it, including the demo featured in this article:

Example of hand pose detection using iPhone camera

Let’s define the key aspects of our task: tracking stability, recognition accuracy, minimal latency during video stream processing, and the ability to integrate this data into the AR scene.

Chosen Approach for Extended Reality Project Implementation

Based on the results of our research, we formulated the hypothesis that building a gesture-first AR application is entirely feasible, even without the use of large-scale ML models or external SDKs.

Instead of complex or multilayered solutions, it is sufficient to correctly combine the Vision Framework as a source of hand motion data with ARKit as the tool for rendering and handling the spatial scene.

This combination forms the foundation of the application. We defined the working scheme of future services and their communication flow.

Responsibilities are divided into dedicated services:

  • AR-related operations service: handles ARKit operations, manages the 3D scene, and provides ARFrame output at a defined FPS.
  • Hand tracking within provided frames: processes ARFrame data to analyze finger positions and send back control signals.

For gesture control, we focused on gestures that are both easy to track and naturally understood by users.

Communication Flow in the Extended Reality Project

During the initialization phase, the following sequence takes place.

First, a session is created (once), and the corresponding ARView is obtained to render the scene to the user.

Diagram showing the initial AR session setup

A continuous processing loop begins, where the AR Manager sends AR Frame data to the MixedRealityManager.

From there, it is forwarded to the Hand Tracking Manager for analysis. The results are then returned to the MixedRealityManager, which determines the appropriate changes to apply within the AR session. 

Diagram showcasing AR session with MixedRealityManager

With the idea, hypothesis, and signal flow defined, we are ready to begin building the actual implementation.

The Art of Controlling the AR Scene

To manage the AR scene within the application, we defined a dedicated service that implements the ARManager protocol. Its primary purpose is to provide the MixedRealityManager with abstract access to the required ARKit capabilities without coupling it to the framework’s internal details or any additional nested logic.


import ARKit
import Combine
import RealityKit

protocol ARManager {
    // MARK: - Publishers
    var eventPublisher: AnyPublisher<ARManagerEvents, Never> { get }
    
    // MARK: - Session Control
    func setupSession() -> ARView
    func startSession()
    func resetSession()
    func pauseSession()
    
    // MARK: - Scene Control
    func toggleSceneMeshPreview()

    // MARK: - Gestures Control
    func addPrimitiveObject(type: GeometricPrimitiveType)
    func moveObjectByPinch(screenPoint: CGPoint)
    func resetPinchGesture()
}

enum ARManagerEvents {
    case newARFrameForTracking(ARFrame)
}

One of the key communication channels is a Combine publisher that emits the newARFrameForTracking event. This allows ARManager to transmit each new ARFrame to other modules, primarily the MixedRealityManager, for further analysis by the HandTrackingManager.
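
As a rough illustration of this channel, the sketch below shows one common way to back such a publisher with a private PassthroughSubject. It is an assumption about the wiring, not the project's actual implementation: FrameEvent and FrameSource are hypothetical stand-ins, and the payload is a frame index instead of an ARFrame so that the example runs without ARKit.

```swift
import Combine

// Stand-in for ARManagerEvents: the payload is a frame index instead of an
// ARFrame so the sketch runs without ARKit.
enum FrameEvent: Equatable {
    case newFrameForTracking(Int)
}

final class FrameSource {
    // Private subject: only this class can emit events.
    private let eventSubject = PassthroughSubject<FrameEvent, Never>()

    // Public read-only stream, mirroring the `eventPublisher` requirement.
    var eventPublisher: AnyPublisher<FrameEvent, Never> {
        eventSubject.eraseToAnyPublisher()
    }

    func emit(frameIndex: Int) {
        eventSubject.send(.newFrameForTracking(frameIndex))
    }
}

let source = FrameSource()
var received: [FrameEvent] = []
var cancellables = Set<AnyCancellable>()

// A subscriber such as MixedRealityManager would forward each frame
// to hand tracking; here we only record what arrives.
source.eventPublisher
    .sink { event in received.append(event) }
    .store(in: &cancellables)

source.emit(frameIndex: 1)
source.emit(frameIndex: 2)
print(received.count)
```

Erasing the subject to AnyPublisher keeps the send side private, so subscribers can observe frames but cannot inject events of their own.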

Extended Reality Project: Session Setup

The ARConfiguration and ARView must be properly configured to ensure that objects remain anchored in the scene, physics simulation works correctly, and Person Segmentation is explicitly enabled. This allows for accurate depth layering of virtual content relative to real-world people, both visually and during interaction.


func setupSession() -> ARView {
    // Setup ARView with options
    arView.session.delegate = self
    arView.environment.sceneUnderstanding.options = []
    arView.environment.sceneUnderstanding.options.insert(.occlusion)
    arView.environment.sceneUnderstanding.options.insert(.physics)
    arView.debugOptions.insert(.showSceneUnderstanding)
    arView.renderOptions = [.disableDepthOfField, .disableMotionBlur]
    arView.automaticallyConfigureSession = false
    
    // Setup ARConfiguration with options
    configuration.environmentTexturing = .automatic
    configuration.sceneReconstruction = .meshWithClassification
    configuration.frameSemantics.insert(.personSegmentationWithDepth)
    configuration.planeDetection = [.horizontal, .vertical]
    
    return arView
}

Person Segmentation Feature

Below is the visual difference when using personSegmentationWithDepth.

By analyzing each ARFrame from the ARSession, the system automatically utilizes depth data and the associated depth map to determine the relative position of the user’s limbs within the scene.

As a result, the user’s hands are not incorrectly covered by virtual objects that sit farther from the camera, allowing for clearer orientation and smoother interaction with virtual elements.

App screens with and without person segmentation

Scene Understanding Feature

By enabling scene understanding through showSceneUnderstanding for the user preview and using sceneReconstruction as part of the session configuration, we provide additional environmental data and gain the ability to treat real-world surfaces as physical elements.

This allows 3D objects to interact with the physical environment when physics is enabled, deepening the overall experience. There is no need for rigid constraints or artificial boundaries; the real-world floor or tabletop becomes a natural constraint.

For the user, scene understanding is visually represented as a polygonal mesh with color gradients that reflect the depth map relative to the device.

App screens without and with scene understanding

Each camera frame is received through the ARKit session via the session(_:didUpdate:) delegate method.

In real-world conditions, processing all 60 FPS provided by the ARSession on devices like the iPhone 14 Pro is highly demanding and places a significant load on the CPU.

Therefore, we limit the target frame rate to 30 FPS to maintain performance and reduce system strain.


func session(
    _ session: ARSession,
    didUpdate frame: ARFrame
) {
    // Get the current timestamp
    let currentTime = Date()
    // Calculate interval between frames based on the desired FPS
    let fpsTime: Double = 1 / handTrackFps
    
    // Send ARFrame for hand tracking if enough time has passed
    if currentTime.timeIntervalSince(lastObservationTime) > fpsTime {
        lastObservationTime = currentTime
        self.eventSubject.send(.newARFrameForTracking(frame))
    }
}
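
The same interval check can be pulled out of the delegate so that it can be exercised without ARKit. The sketch below is a hypothetical variant, not the project's code: FrameThrottle is our own name, and it takes the timestamp as a parameter instead of calling Date().

```swift
import Foundation

// Decides whether a frame should be forwarded, given its timestamp.
// Mirrors the interval check in the delegate, but takes time as input.
struct FrameThrottle {
    let minInterval: TimeInterval
    private var lastAccepted: TimeInterval?

    init(targetFps: Double) {
        minInterval = 1.0 / targetFps
    }

    mutating func shouldForward(at timestamp: TimeInterval) -> Bool {
        // Skip the frame if not enough time has passed since the last one.
        if let last = lastAccepted, timestamp - last <= minInterval {
            return false
        }
        lastAccepted = timestamp
        return true
    }
}

var throttle = FrameThrottle(targetFps: 30)
// Hypothetical arrival times in seconds.
let timestamps: [TimeInterval] = [0.0, 0.01, 0.02, 0.04, 0.05, 0.08]
let accepted = timestamps.filter { throttle.shouldForward(at: $0) }
print(accepted)
```

With a 30 FPS target, only the frames at 0.0, 0.04, and 0.08 seconds are forwarded. In a real delegate, the timestamp property of ARFrame could be passed in instead of the wall-clock time.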

Gestures Control Feature

The flowchart below illustrates the whole logic of gesture-based interaction in our AR application. Starting from each incoming ARFrame, the system detects hands, analyzes finger positions, and identifies gestures.

Based on the recognized gesture, the system either creates a new object (“Index Up”) or initiates the movement of an existing one (“Pinch”).

Flowchart showing the entire logic of gesture-based interaction in an AR application

Hand Tracking Using Vision

For gesture-driven AR control, the key component is the hand detector provided via VNDetectHumanHandPoseRequest. By default, this request detects up to two hands in the frame (the limit is configurable through maximumHandCount) and returns landmark points for each, including the position of every finger joint.

This enables the development of real-time finger tracking without the need for external sensors or depth hardware. Vision returns the points in normalized image coordinates (0 to 1, with the origin at the lower-left corner), so they can be mapped to UIView or ARKit screen space with a simple conversion.
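
A minimal sketch of such a mapping is shown below. Vision reports points with the origin at the lower-left corner, while UIKit views place it at the top-left, so the y-axis has to be flipped. The function name convertToViewSpace is hypothetical, and the sketch assumes the processed image fills the view exactly, with no aspect-fill cropping.

```swift
import Foundation

// Converts a Vision-style normalized point (origin at the lower-left corner,
// both axes in 0...1) into view coordinates (origin at the top-left corner).
// Assumes the processed image fills the view exactly, with no cropping.
func convertToViewSpace(_ normalized: CGPoint, viewSize: CGSize) -> CGPoint {
    CGPoint(
        x: normalized.x * viewSize.width,
        y: (1 - normalized.y) * viewSize.height
    )
}

// Hypothetical view size in points.
let viewSize = CGSize(width: 400, height: 800)
let bottomLeft = convertToViewSpace(CGPoint(x: 0, y: 0), viewSize: viewSize)
let center = convertToViewSpace(CGPoint(x: 0.5, y: 0.5), viewSize: viewSize)
print(bottomLeft, center)
```

When the camera image is scaled or cropped to fill the screen, an additional offset and scale would be needed on top of this flip.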

To implement this functionality, we define our service using the HandTrackingManager protocol. This service handles incoming ARFrames from the ARSession, generates corresponding gesture events, and provides a UIView overlay for visualization.


protocol HandTrackingManager {
    // MARK: - Publisher
    var eventPublisher: AnyPublisher<HandTrackingManagerEvents, Never> { get }
    
    // MARK: - Funcs
    func getHandOverlayView() -> UIView
    func processHands(_ frame: ARFrame)
}

enum HandTrackingManagerEvents {
    case indexFingerGestureActive
    case pinchGestureActive(onScreenPoint: CGPoint)
    case pinchGestureInactive
}

Its main entry point is the processHands function, which takes an ARFrame as input. Each time processHands is called, the frame is processed and passed through a VNDetectHumanHandPoseRequest. The results of this request are then handled by the processObservation() method.


func processHands(_ frame: ARFrame) {
    // Extract pixel buffer from the AR frame
    let pixelBuffer = frame.capturedImage
    // Create Vision request handler with set orientation
    // ARKit provides camera feed in .right orientation
    let imageRequestHandler = VNImageRequestHandler(
        cvPixelBuffer: pixelBuffer,
        orientation: .right,
        options: [:]
    )

    // Perform the hand pose detection request
    try? imageRequestHandler.perform([handPoseRequest])

    // Check if at least one hand was detected
    guard
        let results = handPoseRequest.results,
        let observation = results.first
    else {
        return
    }

    // Process the detected hand observation
    processObservation(observation)
}

The processObservation() call follows a straightforward structure. After retrieving the keypoints for the hands, it triggers the visualization overlay and checks for recognized gestures.

Since we selected simple and intuitive gestures, namely a “pinch” (similar to visionOS) and an “index finger up” gesture, it’s enough to check for these in a prioritized sequence.


func processObservation(_ observation: VNHumanHandPoseObservation) {
    // Try to extract all landmarks from detected observation
    guard
       let recognizedPoints = try? observation.recognizedPoints(.all)
    else {
       return
    }

    handVisualize(points: recognizedPoints)
        
    // Check for pinch gesture and emit corresponding event if detected
    if checkPinchGesture(recognizedPoints: recognizedPoints) {
        return
    }
    // Check for index finger pointing gesture and emit event if detected
    else if indexFingerGesture(recognizedPoints: recognizedPoints) {
        return
    }
}

It’s worth noting that the visualization layer supports multiple display modes, which we will use throughout the application. For preview purposes, we include three options: All Hand, Thumb + Index Fingers, and Turn Off (disable overview).

App screens with gesture recognition

Gesture recognition is based on analyzing the key joint points of the hand. For the “pinch” gesture, we specifically check the distance between the tips of the index finger and the thumb.

Since these coordinates are two-dimensional and normalized rather than physical distances, we must define a trigger threshold, meaning the distance at which the gesture is considered active.


func checkPinchGesture(
   recognizedPoints: [VNHumanHandPoseObservation.JointName: VNRecognizedPoint]
) -> Bool {
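    // Try to get the positions and confidence values
    // of the thumb and index finger tips
    guard
        let thumbPoint = recognizedPoints[.thumbTip],
        let indexPoint = recognizedPoints[.indexTip],
        // Check confidences; `minConfidence` is an assumed property
        // (the original elides this check and its threshold)
        thumbPoint.confidence > minConfidence,
        indexPoint.confidence > minConfidence
    else {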
        // If any point is missing or not confident enough, 
        // consider pinch inactive
        self.eventSubject.send(.pinchGestureInactive)
        return false
    }
        
    // Calculate distance between thumb and index finger tips
    let dx = thumbPoint.location.x - indexPoint.location.x
    let dy = thumbPoint.location.y - indexPoint.location.y
    let distance = sqrt(dx * dx + dy * dy)
        
    // If the distance is small enough, consider it a pinch gesture
    if distance < expectedDistance {
        // Convert the index tip point to screen coordinates
        let screenPoint = convertToScreenSpace(indexPoint.location)
        // Notify system that pinch gesture is active
        self.eventSubject.send(
            .pinchGestureActive(onScreenPoint: screenPoint)
        )
        return true
    } else {
        // Otherwise, treat it as inactive
        self.eventSubject.send(.pinchGestureInactive)
        return false
    }
}
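
To tune the threshold without a device, the distance check can be reproduced on plain values. The sketch below is a simplified stand-in rather than the project's code: Joint and isPinching are hypothetical names, and both default thresholds are illustrative assumptions, since the article does not state the actual values.

```swift
import Foundation

// A detected joint reduced to what the pinch check needs.
struct Joint {
    let location: CGPoint   // normalized image coordinates
    let confidence: Float
}

// Returns true when both tips are confident enough and close together.
// Both thresholds are illustrative assumptions, not values from the article.
func isPinching(
    thumbTip: Joint,
    indexTip: Joint,
    maxDistance: CGFloat = 0.05,
    minConfidence: Float = 0.3
) -> Bool {
    guard thumbTip.confidence > minConfidence,
          indexTip.confidence > minConfidence else {
        return false
    }
    let dx = thumbTip.location.x - indexTip.location.x
    let dy = thumbTip.location.y - indexTip.location.y
    return (dx * dx + dy * dy).squareRoot() < maxDistance
}

let thumb = Joint(location: CGPoint(x: 0.50, y: 0.50), confidence: 0.9)
let closeIndex = Joint(location: CGPoint(x: 0.52, y: 0.51), confidence: 0.9)
let farIndex = Joint(location: CGPoint(x: 0.70, y: 0.60), confidence: 0.9)
print(isPinching(thumbTip: thumb, indexTip: closeIndex))
print(isPinching(thumbTip: thumb, indexTip: farIndex))
```

Because the distance is measured in normalized image units, the same physical pinch looks smaller when the hand is farther from the camera. One possible refinement is to scale the threshold by an estimate of hand size in the frame.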

The processing for indexFingerGesture() is even simpler. It only requires checking the alignment of three consecutive joint points along the index finger to determine if the finger is extended and pointing.


func indexFingerGesture(
    recognizedPoints: [VNHumanHandPoseObservation.JointName: VNRecognizedPoint]
) -> Bool {
    // Try to get the required index finger joints
    guard
        let indexTip = recognizedPoints[.indexTip],
        let indexDIP = recognizedPoints[.indexDIP],
        let indexPIP = recognizedPoints[.indexPIP],
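        // Check confidences; `minConfidence` is an assumed property
        // (the original elides this check and its threshold)
        indexTip.confidence > minConfidence,
        indexDIP.confidence > minConfidence,
        indexPIP.confidence > minConfidence
    else {
        // If any point is missing or not confident enough, 
        // gesture is not valid
        return false
    }
    
    // Collect horizontal x-values of index finger joints
    let xValues: [CGFloat] = [
        indexTip.location.x,
        indexDIP.location.x,
        indexPIP.location.x
    ]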
    
    // Ensure we can compute the spread of x-values
    guard let maxX = xValues.max(), let minX = xValues.min() else {
        return false
    }
    
    // If finger is mostly vertically aligned (x spread is small),
    // it's considered an active index finger gesture
    if maxX - minX <= expectedRange {
        self.eventSubject.send(.indexFingerGestureActive)
        return true
    } else {
        return false
    }
}
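
The alignment test itself reduces to a spread check on a few numbers, which makes it easy to verify in isolation. The following sketch uses a hypothetical helper name, isFingerVertical, and an illustrative default threshold that is not taken from the article.

```swift
import Foundation

// Returns true when the x-coordinates of the given joints stay within
// `maxSpread`, i.e. the finger is roughly vertical in the image.
// `maxSpread` is an illustrative assumption, not a value from the article.
func isFingerVertical(jointXs: [CGFloat], maxSpread: CGFloat = 0.03) -> Bool {
    guard let maxX = jointXs.max(), let minX = jointXs.min() else {
        return false
    }
    return maxX - minX <= maxSpread
}

// Tip, DIP, and PIP x-values for a finger pointing straight up.
let straightUp: [CGFloat] = [0.50, 0.51, 0.505]
// The same joints for a finger tilted strongly to the side.
let tilted: [CGFloat] = [0.40, 0.50, 0.60]
print(isFingerVertical(jointXs: straightUp))
print(isFingerVertical(jointXs: tilted))
```

A small x-spread alone does not tell whether the finger points up or down. If that distinction matters, one option is to also require that the tip lies above the DIP and PIP joints, keeping in mind that y grows upward in Vision's normalized coordinates.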

This solution fully isolates the gesture recognition logic from the rest of the application. The ARManager simply provides frames, while the HandTrackingManager is responsible for analyzing them and making decisions based on finger tracking.

Extended Reality: Combination of Elements

It’s time to bring together the services we’ve built into a complete solution that enables real-time gesture-based interaction with the AR scene.

Let’s recall the signal flow diagram shown below.

At the center is the MixedRealityManager, which acts as the coordinating layer. Using Combine, we can subscribe to event updates from our services and organize the desired sequence of operations accordingly.

Diagram showcasing AR session with MixedRealityManager

Step 1: Obtain the ARFrame

The first step is obtaining the ARFrame. ARKit automatically generates ARFrame objects during each session update.

The ARManager intercepts these frames via the session(_:didUpdate:) delegate method described earlier and sends them through a Combine stream to the MixedRealityManager at a defined FPS. These frames serve as the foundation for gesture detection.

Step 2: Pass the Frame

The second step is to pass the frame to the HandTrackingManager. The MixedRealityManager calls the processHands() method and provides the latest ARFrame.

Step 3: Recognize the Gesture

The third step is gesture recognition. If the HandTrackingManager detects one of the expected gestures (such as pinch or index finger), it publishes an event through the Combine stream.

The MixedRealityManager, which is subscribed to these events, executes the corresponding logic, such as adding a new object or activating sandbox movement.

Final Step: Issue the Command

The final step is issuing a command to change the AR scene. Depending on the recognized gesture, the MixedRealityManager triggers the appropriate function to interact with the scene.

A key aspect of the implementation is that all scene control events are routed back to the ARManager. For instance, during a pinch gesture, the screen coordinates are converted into 3D space, and the object’s position is updated accordingly.

The MixedRealityManager does not contain any direct logic for modifying the scene, as that responsibility lies entirely with the ARManager. This separation of layers makes it easy to adjust behavior, introduce new gesture types, or update the UI without affecting the low-level logic.
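
The routing described in these steps can be sketched with plain Swift types. This is a simplified illustration under our own assumptions, not the project's implementation: HandEvent, SceneCommands, GestureRouter, and RecordingScene are hypothetical stand-ins, the primitive type parameter is omitted, and we assume that the index finger gesture adds an object while the pinch gesture moves one.

```swift
import Foundation

// Stand-ins for the article's types; points are plain CGPoint values.
enum HandEvent {
    case indexFingerGestureActive
    case pinchGestureActive(onScreenPoint: CGPoint)
    case pinchGestureInactive
}

// The subset of scene commands the coordinator needs.
protocol SceneCommands: AnyObject {
    func addPrimitiveObject()
    func moveObjectByPinch(screenPoint: CGPoint)
    func resetPinchGesture()
}

// Coordinator: maps gesture events to scene commands and holds no scene logic.
final class GestureRouter {
    private let scene: SceneCommands
    init(scene: SceneCommands) { self.scene = scene }

    func handle(_ event: HandEvent) {
        switch event {
        case .indexFingerGestureActive:
            scene.addPrimitiveObject()
        case .pinchGestureActive(let point):
            scene.moveObjectByPinch(screenPoint: point)
        case .pinchGestureInactive:
            scene.resetPinchGesture()
        }
    }
}

// Recording fake used to check the routing without ARKit.
final class RecordingScene: SceneCommands {
    var log: [String] = []
    func addPrimitiveObject() { log.append("add") }
    func moveObjectByPinch(screenPoint: CGPoint) { log.append("move") }
    func resetPinchGesture() { log.append("reset") }
}

let scene = RecordingScene()
let router = GestureRouter(scene: scene)
router.handle(.indexFingerGestureActive)
router.handle(.pinchGestureActive(onScreenPoint: CGPoint(x: 120, y: 300)))
router.handle(.pinchGestureInactive)
print(scene.log)
```

Because the router depends only on a protocol, the scene side can be swapped for a fake in tests, and a new gesture only requires a new event case and one more branch in the switch.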

Extended Reality Project Results: Demos

Below are the final demos showcasing the selected gestures and deeper scene interaction, where the entire surrounding environment becomes part of the AR scene, enabling physical interaction with virtual objects.

 

It’s worth highlighting the accuracy of the visualization provided to the user through ARView. In addition to generating a polygonal mesh at the start of the ARSession, the mesh is dynamically updated as the device or real-world objects move. This enables the system to avoid phantom boundaries, resulting in a smoother and more immersive experience.

 

 

This is a truly exceptional and unique experience today. The ideas demonstrated above represent not just a step, but a new direction in the evolution of user experience. With modern processing power, high-quality camera sensors, lightweight models, and rapidly advancing tools, we can now create experiences that were once considered science fiction.

Final Word on the Extended Reality Project

Gesture-based interaction in AR is no longer just a technical challenge; it is a real step beyond traditional UX thinking.

This project successfully combined the power of computer vision, via the Vision Framework, with the spatial capabilities of ARKit to create a path toward XR experiences that are free of physical interfaces.

From a business perspective, such solutions introduce new interface models for AR apps, especially in the emerging market of wearable consumer tech. This isn’t innovation for its own sake; it’s a new entry point into digital interaction, a bridge to expanded toolsets and digital learning.

The scalability of these approaches extends beyond B2C, offering tremendous potential in the B2B sector.

  • In manufacturing, a mechanic assembling a vehicle can visualize a 3D part model directly in their workspace.
  • In healthcare, surgeons can navigate pre-op environments without physical contact.
  • In logistics, workers can manage alerts, cargo, and automation without being tethered to a console.

The full potential is yet to be uncovered.

Technically, this project delivers a working foundation that is modular, scalable, and reliable. With clearly separated services, multithreaded execution, and architecture built for extension, it enables both experimentation and product development.

Our prototype is not just a showcase. It’s a foundation. A step, not sideways, but forward on a path that leads into a new market and a new interface paradigm. We’re already here, and we’re moving ahead.

What kind of experience are you expecting from personal MR devices?

Your idea could be the next stage in this journey.

3D Reconstruction on iOS: Ultimate Guide with Code Samples

Ultimate Tutorial to 3D Reconstruction on iOS: Key Techniques, Differences, & Workflow

In the fast-changing world of mobile technology, 3D model reconstruction on handheld devices is a big leap forward. Algorithms like SLAM, Voxel-Based Reconstruction, and Point Cloud Reconstruction help create 3D models from captured images.

Traditionally, these computationally intensive processes required desktop computers, while mobile devices were limited to data collection due to their constrained CPU/GPU, memory, and storage.

Apple’s ObjectCapture has changed the game. It enables high-quality 3D model creation directly on mobile devices.

When the technology was introduced at WWDC21, reconstruction ran on macOS only, and iOS devices were used for data collection. Since WWDC23 and iOS 17, ObjectCapture has supported full 3D reconstruction on iPhones and iPads.

This comprehensive guide on building 3D reconstruction solutions will cover:

  • ObjectCapture’s features for capturing objects and creating 3D reconstructions.
  • Different output data structures of ObjectCapture.
  • Limitations encountered during ObjectCapture integration.
  • Real-world use cases of ObjectCapture for 3D reconstruction on iOS.
  • Alternative data capture methods: RoomPlan, AVCaptureSession, Photogrammetry.
  • Code samples for each data capture method.
  • A detailed comparison of data-capturing methods for the best results.

Let’s start by examining the workflow and specifics of ObjectCapture.

Overview of ObjectCapture Workflow for 3D Reconstruction

It is essential to understand the general workflow for creating a 3D object directly on an iPhone or iPad using the ObjectCapture API.

The entire process can be divided into two main stages:

  1. Capturing the input data
  2. Reconstructing the object with the captured data

In the first stage, you use your device’s camera to take many photos of the object from various angles. The quality and coverage of these images directly impact the accuracy of the final 3D reconstruction model.

Scheme of 3D reconstruction flow

During the second stage, the ObjectCapture API processes the captured images. The API checks the photos and combines them to make a detailed 3D visualization of the object, including precise texture, color, and shape.



3D Reconstruction on iOS with ObjectCapture API

ObjectCapture API is a tool for high-quality data capture. 

Data capture is the essential step in 3D object reconstruction. This process is not just about snapping a few photos. It is about preparing the foundation for a precise and detailed 3D model.

The accuracy and quality of the final 3D reconstruction on iOS are directly tied to how well the images are taken. Recording every angle and detail of the object ensures the most accurate and realistic result.  

The capturing process splits into “scan passes.” These substages create images of the object from different angles and collect extra data. ObjectCapture’s UI shows areas where more images are needed and gives tips to improve shot quality. 

ObjectCapture has an easy-to-use interface. It helps users collect data and offers visual cues to navigate around the object. It captures frames, records camera poses, and creates depth maps automatically. This makes the 3D reconstruction process easy and accessible for users.

This functionality is integrated into an app through two components:

  1. ObjectCaptureView, which provides a user interface that guides the user through the capturing flow.
  2. ObjectCaptureSession, which performs data capturing and prepares the data source for further reconstruction.

Behind the scenes, ObjectCaptureSession relies on ARSession.

High-level view of the 3D object capture process

1. Object Capture View

ObjectCaptureView is a high-level SwiftUI view that encapsulates the entire image capture experience. It provides built-in guidance, visual instructions, and progress tracking as the user walks around an object or environment.

Although ObjectCaptureView is a SwiftUI view, Apple has made it easy to integrate this interface into UIKit-based apps using UIHostingController.

This aspect is beneficial for projects that still rely on UIKit but want to take advantage of the latest AR and 3D technologies provided by SwiftUI.

Here’s a simple example of how to embed ObjectCaptureView into a UIViewController. First, wrap it in a SwiftUI view:


struct CaptureView: View {
    // MARK: - Properties
    private let session: ObjectCaptureSession
    
    // MARK: - Init
    init(session: ObjectCaptureSession) {
        self.session = session
    }
    
    // MARK: - Body
    var body: some View {
        ZStack {
            ObjectCaptureView(
                session: session,
                cameraFeedOverlay: {
                    CameraFeedOverlayView()
                }
            )
        }
    }
}

Then, create a UIHostingController with CaptureView as the rootView and add UIHostingController’s view as a subview to your view. 


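// Host the SwiftUI view and register it as a child view controller.
let hostingController = await UIHostingController(rootView: CaptureView(session: session))
addChild(hostingController)
view.addSubview(hostingController.view)
hostingController.didMove(toParent: self)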

2. Object Capture Session

The ObjectCaptureSession class manages the image capture workflow, dividing the process into structured stages to ensure optimal data collection.

Users are guided through each stage via the ObjectCaptureView, which overlays helpful instructions and feedback directly onto the camera interface.

Let’s take a look at an example of how to utilize ObjectCaptureSession to capture images for 3D reconstruction on iOS.

For this purpose, we created an ObjectCaptureService, which is responsible for managing all stages of data capture using ObjectCaptureSession:


protocol ObjectCaptureService {
    // MARK: - Publishers
    var eventPublisher: AnyPublisher<ObjectCaptureServiceEvent, Never> { get }
    
    // MARK: - Functions
    func setImagesFolder(folder: URL)
    
    func getPointCloudView()
    
    func isFlippableObject(completion: @escaping (Bool) -> ())
    func start()
    func finish()
    func pause()
    func resume()
    func cancel()
    func startDetecting()
    func resetDetecting()
    func startCapturing()
    func beginNewScanPass()
    func beginNewScanPassAfterFlip()
}

First, we need to initialize the ObjectCaptureSession. 

To start the session, we provide a path to the folder where the captured images will be saved.


func start() {
        Task { [weak self] in
            if self?.session != nil {
                self?.resetSession()
            }
            self?.session = await ObjectCaptureSession()
            
            guard
                let session = self?.session,
                let imagesFolderUrl = self?.imagesFolderUrl
            else {
                self?.eventSubject.send(.failed(errorMessage: "Unable to create session"))
                return
            }
            
            var configuration = ObjectCaptureSession.Configuration()
            configuration.isOverCaptureEnabled = true
            
            await session.start(
                imagesDirectory: imagesFolderUrl,
                configuration: configuration
            )
            
            self?.setupBindings()
             
            await self?.eventSubject.send(.captureView(view: .init(rootView: .init(session: session))))
        }
    }

We must set up bindings to receive updates on camera tracking, session state, and completed scan passes. This action helps us manage the session and give users the proper instructions. 


func setupBindings() {
        tasks.append(
            Task { [weak self] in
                guard let session = self?.session else {
                    return
                }
                for await cameraTracking in await session.cameraTrackingUpdates {
                    self?.cameraTrackingState = cameraTracking
                }
            }
        )
        
        tasks.append(
            Task { [weak self] in
                guard let session = self?.session else {
                    return
                }
                for await sessionState in await session.stateUpdates {
                    self?.sessionState = sessionState
                }
            }
        )
        
        tasks.append(
            Task { [weak self] in
                guard let session = self?.session else {
                    return
                }
                for await scanPassUpdate in await session.userCompletedScanPassUpdates {
                    self?.eventSubject.send(.scanPassCompleted(success: scanPassUpdate))
                }
            }
        )
    }

3. Object Mode

In Object mode, ObjectCapture focuses on 3D scanning distinct items placed on a surface.

This mode is perfect for digitizing individual products or artifacts. The bounding box becomes particularly important here, helping to estimate the object’s real-world dimensions and ensuring accurate scaling.

Object mode is most effective when the target item is well-lit, visually distinct from the background, and positioned so the user can easily walk around it. The mode supports single-side or multi-side captures based on the object’s orientation and complexity.

After selecting an object to capture, it is necessary to define its bounding box. ObjectCaptureView allows users to easily adjust the box’s position and size to ensure sufficient coverage. This stage is critical for keeping the size of the produced model close to the real-life one. It also helps guide the user through the rest of the capture flow.

Therefore, capturing the object in 3D involves three steps:

  • Selecting a target object.
  • Defining a bounding box.
  • Capturing an object.

Process of 3D object capturing on iOS

 

These steps correspond to the methods of ObjectCaptureSession:

Steps of an object capture session
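
As a rough illustration of that mapping, the sketch below models the three steps as a small state-to-action lookup. CaptureStep and NextAction are hypothetical types that mirror only a subset of the session's states; the actions correspond to the startDetecting() and startCapturing() methods that the service protocol above wraps.

```swift
// Hypothetical, simplified stand-in for the capture states relevant to the three steps.
enum CaptureStep {
    case ready      // the user is selecting a target object
    case detecting  // the user is adjusting the bounding box
    case capturing  // frames are being collected
}

// The session call the app would make to advance from a given step.
enum NextAction: Equatable {
    case startDetecting
    case startCapturing
    case none
}

func nextAction(for step: CaptureStep) -> NextAction {
    switch step {
    case .ready:
        return .startDetecting   // target selected, begin bounding-box detection
    case .detecting:
        return .startCapturing   // bounding box confirmed, begin capturing
    case .capturing:
        return .none             // capture runs until the scan pass completes
    }
}

// Usage: decide which button action to expose for the current step.
let actionWhileDetecting = nextAction(for: .detecting)
```

Keeping this lookup in one place makes it easier to drive a single "Continue" button from the session state instead of scattering the calls across the UI.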

In Object mode, ObjectCapture indicates if an object is flippable, prompting users to rotate the object and recapture it. Although this process requires redefining the bounding box, it ensures that the 3D reconstruction fully captures all sides of the object. 

4. Area Mode

Area mode expands ObjectCaptureView’s scanning capabilities beyond single objects, enabling users to capture large physical spaces such as rooms, hallways, large installations, and entire environments.

This mode is helpful for applications that require a spatial understanding of surroundings, such as interior design, architecture, construction, and real estate.

In this mode, the user is guided to move around a space, capturing overlapping images from different angles and heights.

Unlike Object mode, where the subject is central and isolated, Area mode requires broader spatial scanning and more extensive user movement.

Area mode in iOS 3D scanning

 

In Area mode, there is no need to define a bounding box, which simplifies the capturing process into two steps:

Steps in Area mode

3D Reconstruction on iOS: Pros & Cons of the ObjectCapture API

While the ObjectCapture API simplifies image capturing and provides a user-friendly experience, there are also some limitations to be aware of when integrating it into apps.

The advantages of using the ObjectCapture API for 3D reconstruction on iOS include:

1. Integrated visual guidance

ObjectCapture provides real-time visual cues that help users properly scan an object or scene. It highlights areas that require more image coverage and offers feedback on image quality.

2. Flippable object support 

The API detects whether an object should be flipped to capture unseen areas. This feature leads to more complete reconstructions, especially for complex shapes.

3. Automatic frame capturing

Frames are captured automatically when optimal angles and stability are detected. This functionality reduces motion blur and ensures even spacing, simplifying the workflow and improving output quality.

4. Platform-optimized and energy-efficient

Object Capture is aware of system resources, dynamically adjusting capture behavior to maintain efficiency on iPhone and iPad.

The disadvantages of creating 3D reconstruction solutions with ObjectCapture are as follows:

1. Fixed image format

All captured images are stored in HEIC format. While efficient, this is limiting if your pipeline requires a different image format.

2. Limited customization of capture flow

Developers cannot modify camera behavior, such as frame capture rate, focus, or exposure.

3. No real-time frame access

Captured frames are not accessible in real time, which restricts the ability to run custom processing (such as machine learning or computer vision tasks). There are workarounds for accessing frames during capture, but ObjectCapture itself does not provide an API for this.

4. Non-customizable capture UI

The default ObjectCaptureView has a fixed appearance and user interaction flow. Developers cannot modify its styling, which can be limiting for apps that require a customized or branded UI.

Data Capturing with RoomPlan

Apple RoomPlan API is a robust framework that helps developers capture and map indoor environments accurately.

It leverages the power of iPhone sensors, such as LiDAR, to create 3D reconstructions of room layouts, including structures such as walls, furniture, and doors.

The framework provides RoomCaptureSession, which allows developers to capture an entire room or environment seamlessly. This technology is ideal when the goal is to map a whole indoor space and understand the relationship between different objects within that space rather than focusing on a specific object.

RoomCaptureSession builds on ARSession, which it exposes through its arSession property, and adds the capability to scan and map entire indoor environments.

This scan produces a 3D reconstruction that captures the space’s general structure and geometry. You can utilize PhotogrammetrySession to achieve a more detailed reconstruction with fine textures, subtle color variations, and intricate details.

Using this approach, we can capture frames with ARSession and process them with PhotogrammetrySession while obtaining the data that RoomCaptureSession captured.

Room Plan Benefits of 3D Reconstruction on iOS

By combining these datasets, developers can significantly enrich their models. This combined approach offers the following advantages:

1. Incorporating texture and color

RoomCaptureSession provides structural data of a room, while PhotogrammetrySession can capture detailed textures and colors. This approach makes the environment feel more lifelike and visually appealing. This can be particularly useful for interior design apps, architectural 3D visualizations, and furniture previews.

2. Reconstructing entire rooms

This approach creates immersive AR experiences where users can interact with the entire environment rather than just isolated objects.

RoomPlan: Code Implementation and Key Considerations

To integrate RoomCaptureSession for capturing objects in 3D, we created a separate service called RoomCaptureService.


protocol RoomCaptureService {
    // MARK: - Publishers
    var eventPublisher: AnyPublisher<RoomCaptureServiceEvent, Never> { get }
    
    // MARK: - Functions
    func setup(configuration: RoomCaptureConfiguration)
    
    func start()
    func pause()
    func stop()
}

We need to conform our service to two delegates:

  • RoomCaptureSessionDelegate to receive CapturedRoomData.
  • ARSessionDelegate to handle individual frames.

// MARK: - RoomCaptureSessionDelegate
extension RoomCaptureServiceImpl: RoomCaptureSessionDelegate {
    func captureSession(
        _ session: RoomCaptureSession,
        didEndWith data: CapturedRoomData,
        error: (any Error)?
    ) {
        guard error == nil else {
            DispatchQueue.main.async { [weak self] in
                self?.prepareCaptureView(reset: true)
                self?.roomCaptureView?.captureSession.run(configuration: .init())
            }
            return
        }
        captureFrames = false
        eventSubject.send(.didEnd(data: data))
    }
    
    func captureSession(
        _ session: RoomCaptureSession,
        didUpdate room: CapturedRoom
    ) {
        if !captureFrames {
            captureFrames = true
        }
    }
}
// MARK: - ARSessionDelegate
extension RoomCaptureServiceImpl: ARSessionDelegate {
    func session(
        _ session: ARSession,
        didUpdate frame: ARFrame
    ) {
        guard isValid(frame: frame), captureFrames else {
            return
        }
        updateFrame()
    }
}

To avoid capturing redundant frames, we implemented frame validation logic. There are several options to do that. 

One approach is to compare the camera transform’s position and angle of the current frame with the previous one. 

If they are nearly identical, the frame is skipped; if they differ, the frame is saved. This method significantly reduces frame count while preserving key frames.

 

func isValidFrame(currentTransform: simd_float4x4) -> Bool {
    // The first frame has nothing to compare against, so always keep it.
    guard let previousTransform else {
        self.previousTransform = currentTransform
        return true
    }

    let angle = currentTransform.angle(to: previousTransform)
    let distance = currentTransform.distance(to: previousTransform)

    // Keep the frame only if the camera rotated or moved past a threshold.
    // The rotation threshold is converted from degrees to radians.
    guard
        angle > (RoomCaptureConstants.rotationThreshold / 180) * .pi ||
        distance > RoomCaptureConstants.distanceThreshold
    else {
        return false
    }
    self.previousTransform = currentTransform

    return true
}
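
The angle(to:) and distance(to:) helpers used above are not shown in the article. The sketch below is one possible way to compute the same two quantities, under stated assumptions: CameraPose is a hypothetical simplified type, and the threshold values are placeholders rather than the project's RoomCaptureConstants. With ARKit, the position would typically be read from the translation column of the camera transform and the forward vector from its negated Z axis.

```swift
import Foundation

// Hypothetical, simplified camera pose: a position and a viewing direction.
struct CameraPose {
    var position: SIMD3<Float>
    var forward: SIMD3<Float>
}

// Euclidean distance between two camera positions, in meters.
func distance(from a: CameraPose, to b: CameraPose) -> Float {
    let delta = b.position - a.position
    return (delta * delta).sum().squareRoot()
}

// Angle between two viewing directions, in radians.
func angle(from a: CameraPose, to b: CameraPose) -> Float {
    let lengthA = (a.forward * a.forward).sum().squareRoot()
    let lengthB = (b.forward * b.forward).sum().squareRoot()
    guard lengthA > 0, lengthB > 0 else { return 0 }
    let cosine = (a.forward * b.forward).sum() / (lengthA * lengthB)
    // Clamp to guard against rounding slightly outside [-1, 1].
    return acos(min(max(cosine, -1), 1))
}

// Placeholder thresholds (assumptions, not the article's values).
let rotationThresholdDegrees: Float = 10
let distanceThresholdMeters: Float = 0.1

// True when the camera rotated or moved enough to justify saving a new frame.
func hasMovedEnough(from previous: CameraPose, to current: CameraPose) -> Bool {
    let rotationThreshold = rotationThresholdDegrees / 180 * .pi
    return angle(from: previous, to: current) > rotationThreshold
        || distance(from: previous, to: current) > distanceThresholdMeters
}

// Usage: a 1 cm nudge is skipped, a 90-degree turn is kept.
let start = CameraPose(position: [0, 0, 0], forward: [0, 0, -1])
let nudged = CameraPose(position: [0.01, 0, 0], forward: [0, 0, -1])
let turned = CameraPose(position: [0, 0, 0], forward: [-1, 0, 0])
```

Comparing forward vectors ignores roll around the viewing axis. If roll matters for your capture flow, comparing full rotations (for example, via quaternions) would be the safer choice.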

Another option is to use the frame timestamp and compare it against a specified FPS (frames per second) to reduce the number of frames captured.


func isValidFrame(currentTimestamp: Double, fps: Int) -> Bool {
    // The first frame has nothing to compare against, so always keep it.
    guard let previousTimestamp else {
        self.previousTimestamp = currentTimestamp
        return true
    }

    // Minimum time, in seconds, that must pass between two saved frames.
    let difference = currentTimestamp - previousTimestamp
    let framesCapturingTimeDelta = 1 / Double(fps)
    if difference >= framesCapturingTimeDelta {
        self.previousTimestamp = currentTimestamp
        return true
    } else {
        return false
    }
}

Table capturing with RoomCaptureSession

Pros & Cons: 3D Reconstruction on iOS with RoomPlan

Now, let’s examine the advantages and disadvantages of using RoomCaptureSession and its underlying ARSession to capture data for photogrammetry and 3D object reconstruction.

Advantages of 3D reconstruction with RoomPlan:

  • Provides rich spatial data: meshes, camera positions, and scene structure.
  • Includes RGB and depth data.
  • Works well for capturing and reconstructing entire rooms and spaces.
  • Gives real-time access to frames for custom processing.

Disadvantages of using RoomPlan for 3D reconstruction:

  • Not optimized for isolated object capture.
  • 3D reconstructions may lack fine details or have sections that appear blurry.
  • Developers must implement custom logic for frame validation and capture flow to ensure usable photogrammetry input.
  • Developers need to implement user guidance for high-quality results.

Data Capturing with AVCaptureSession

AVCaptureSession is a core component of the AVFoundation framework that grants full access to camera input on iOS devices.

This technology allows developers to create highly customizable and versatile capture experiences, providing the ability to capture still images, record videos, and handle metadata with complete control.

With AVCaptureSession, you can fine-tune nearly every aspect of image and video capture, such as image resolution, exposure, white balance, and focus. These features allow developers to adapt the capture process to meet specific requirements. Depending on your needs, AVCaptureSession can be tailored to provide manual or automatic frame capturing.

AVCaptureSession can capture high-quality, still photos from various angles for 3D object reconstruction using PhotogrammetrySession. Unlike ObjectCaptureSession, it allows you to fully customize the user interface (UI) and the capture experience.

Benefits of Using AVCaptureSession for 3D Reconstruction

You can design your UI to match the specific needs of your application, whether it’s guiding users through the scanning process or providing manual controls for advanced users, namely:

  • On-screen overlays and instructions

You can implement custom on-screen overlays that guide users step-by-step through 3D scanning on iOS. This approach can include visual cues, like highlighting the area to focus on, showing the ideal positioning for objects, or displaying a progress bar indicating when enough frames have been captured.

  • Interactive experience

Developers can add interactive elements that allow users to manually adjust camera settings such as focus, exposure, or resolution.

  • Automatic or manual capture modes

Developers can use AVCaptureSession to create different user experiences. If your app lets users move around an object, it can capture frames automatically.

If they need to take photos manually, AVCaptureSession can handle that, too. This flexibility lets you design the capture flow that gives the best experience.

AVFoundation provides access to supplementary data, such as depth information and camera calibration, along with captured frames. You can use this for advanced processing or specific data capture needs.

AVCaptureSession: Code Implementation and Key Considerations

To implement frame capturing with AVCaptureSession, we created a service called AVSessionManager.


protocol AVSessionManager: AnyObject {
    // MARK: - Publishers
    var eventPublisher: AnyPublisher&amp;amp;lt;AVSessionManagerEvent, Never&amp;amp;gt; { get }
    
    // MARK: - Functions
    func setup(mode: AVCaptureMode) -> AVCaptureVideoPreviewLayer
    
    func start()
    func stop()
    
    func capture()
    func focus(on point: CGPoint)
}

There are several options for capturing frames for object reconstruction using AVCaptureSession.

The first option is to use AVCapturePhotoOutput for manual capture. Here, the user must take the needed number of photos.

AVCapturePhotoOutput provides high-quality images and allows customization of photo settings (e.g., format). It can also capture depth data if available. When you save the photo with fileDataRepresentation, it also automatically saves the metadata and depth data.

When users take photos manually, they might miss some parts of the object. This can lead to not having enough common points for photogrammetry. Additionally, this method can be inconvenient for end-users.

Thus, to capture frames using AVCapturePhotoOutput, we must do a few things:

  • Add photo output to the capture session.
  • Set up the photo settings.
  • Implement the AVCapturePhotoCaptureDelegate protocol to manage the captured photos.

// MARK: - Photo settings
private extension AVSessionManagerImpl {
    func getPhotoSettings() -> AVCapturePhotoSettings {
        var settings = AVCapturePhotoSettings()
        
        if photoDataOutput.availablePhotoCodecTypes.contains(.hevc) {
            settings = AVCapturePhotoSettings(format: [AVVideoCodecKey: AVVideoCodecType.hevc])
        }
        
        settings.embedsDepthDataInPhoto = true
        settings.photoQualityPrioritization = .quality
        settings.isDepthDataDeliveryEnabled = photoDataOutput.isDepthDataDeliverySupported
        
        return settings
    }
}




// MARK: - AVCapturePhotoCaptureDelegate
extension AVSessionManagerImpl: AVCapturePhotoCaptureDelegate {
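    func photoOutput(
        _ output: AVCapturePhotoOutput,
        didFinishProcessingPhoto photo: AVCapturePhoto,
        error: Error?
    ) {
        // Skip photos that failed to process; they carry no usable image data.
        guard error == nil else {
            return
        }
        eventSubject.send(.photo(photo: photo))
    }
}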

Save the captured photo using fileDataRepresentation in the image folder. This way, it can be used later in the PhotogrammetrySession.
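
A minimal sketch of that saving step might look as follows. It assumes the bytes come from the photo's fileDataRepresentation(), which returns nil when the photo cannot be encoded; saveCapturedPhoto, PhotoSaveError, and the file naming scheme are hypothetical and not part of the article's project. The .heic extension also assumes the HEVC photo settings shown earlier were applied.

```swift
import Foundation

enum PhotoSaveError: Error {
    case missingData
}

// Writes encoded photo bytes into `folder` under a zero-padded, sortable name.
// `data` is assumed to be the result of calling fileDataRepresentation() on the
// captured photo, so it may be nil.
func saveCapturedPhoto(data: Data?, index: Int, in folder: URL) throws -> URL {
    guard let data else {
        throw PhotoSaveError.missingData
    }
    // Create the images folder on first use.
    try FileManager.default.createDirectory(at: folder, withIntermediateDirectories: true)
    let fileName = String(format: "photo_%04d.heic", index)
    let url = folder.appendingPathComponent(fileName)
    // Atomic write avoids leaving a half-written file if the app is interrupted.
    try data.write(to: url, options: .atomic)
    return url
}
```

In the delegate flow above, the subscriber of the photo event would call this helper with the photo's encoded data and an incrementing index, then pass the folder to the reconstruction step once capture is finished.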

Capturing a speaker 3D model with AVCaptureSession

 

Another option is to use AVCaptureVideoDataOutput with a specified frame rate.

In this case, frames are captured and saved automatically. The user just needs to move the camera around the object they want to capture.

Yet, additional setup is required to capture depth data along with the RGB frames.

Furthermore, when you save an image from a CMSampleBuffer, the metadata and depth data necessary for 3D reconstruction on iOS aren’t automatically saved with the image. We must handle this explicitly during the saving process.

We need to do a few things to save everything correctly:

  • First, convert the CMSampleBuffer to a CGImage. 
  • Then, extract the metadata from the CMSampleBuffer. 
  • Finally, save the image, metadata, and depth data using CGImageDestination to the designated image folder.

 func saveImageWithMetadata(
        to url: URL,
        cgImage: CGImage,
        metadata: [String: Any],
        depth: AVDepthData?
    ) -> URL? {
        guard
            let destination = CGImageDestinationCreateWithURL(url as CFURL, AVFileType.heic as CFString, 1, nil)
        else {
            return nil
        }
        CGImageDestinationAddImage(destination, cgImage, metadata as CFDictionary)
        if var depthDict = depth?.dictionaryRepresentation(forAuxiliaryDataType: nil) {
            depthDict.removeValue(forKey: kCGImageAuxiliaryDataInfoMetadata)
            CGImageDestinationAddAuxiliaryDataInfo(
                destination,
                kCGImageAuxiliaryDataTypeDisparity,
                depthDict as CFDictionary
            )
        }
        
        if !CGImageDestinationFinalize(destination) {
            return nil
        }
        
        return url
    }


 func process(
        buffer: CMSampleBuffer,
        depth: AVDepthData?,
        index: Int
    ) {
        guard let outputFolder else {
            return
        }
        
        processingQueue.addOperation { [weak self] in
            let frameName = "\(index)_\(OutputProcessingConstants.frameFileName)"
            let url = outputFolder.appendingPathComponent(frameName)
            guard let cgImage = buffer.cgImage else {
                return
            }
            
            let metadata = CMCopyDictionaryOfAttachments(
                allocator: kCFAllocatorDefault,
                target: buffer,
                attachmentMode: kCMAttachmentMode_ShouldPropagate
            ) as? [String: Any] ?? [:]
            guard
                let rgbFile = self?.saveImageWithMetadata(
                    to: url,
                    cgImage: cgImage,
                    metadata: metadata, 
                    depth: depth
                )
            else {
                self?.eventSubject.send(.error(message: "Failed to save image"))
                return
            }
            
            self?.eventSubject.send(.output(url: rgbFile))
        }
    }

 

Scanning an object to create 3D model

Pros & Cons of AVCaptureSession for 3D Reconstruction

When using AVCaptureSession for 3D object reconstruction, it is vital to consider its flexibility and challenges.

Advantages of using AVCaptureSession as a 3D reconstruction solution:

  • Complete control over image capture parameters, allowing for precise customization.
  • Ideal solution for custom capture workflows.
  • Easy to integrate with custom user interfaces or capture guidance overlays.
  • Real-time access to frames for custom processing.
  • Capability to capture and save additional data, such as depth data and camera calibration.
  • It supports saving frames in various image formats, including HEIC, JPEG, etc.

Disadvantages of utilizing AVCaptureSession for 3D reconstruction on iOS:

  • It requires manual implementation and organization of image saving.
  • No automatic mesh generation or object detection.
  • More development effort is needed to create a fully functional scanning experience.
  • It may not produce the highest quality 3D models compared to results obtained with ObjectCaptureSession.

3D Reconstruction on iOS: Comparison Table of Object Capturing Methods 

Let’s make a final comparison to summarize the main distinctions among ObjectCaptureSession, RoomCaptureSession, and AVCaptureSession.

The table below provides a clear overview and helps you determine which 3D reconstruction solution best fits your automation, flexibility, and data richness needs.

| Criteria | ObjectCaptureSession | RoomCaptureSession | AVCaptureSession |
|---|---|---|---|
| User Guidance | Built-in visual guidance and quality feedback | Semantic feedback only (e.g., walls, doors highlighted); no active guidance | Fully custom implementation required |
| Automation | Auto frame capture at optimal angles | Manual frame capture and logic implementation | Manual frame capture and logic implementation |
| Real-time Frame Access | No | Yes | Yes |
| Camera Parameter Control | No control | No control | Full control (focus, exposure, etc.) |
| UI Customization | Limited | Limited | Fully customizable |
| Data Richness | Only RGB images in HEIC format | RGB, depth, mesh, camera transform | RGB, depth, calibration data |
| Supported Image Formats | Only HEIC format | HEIC, JPEG, and more | HEIC, JPEG, and more |
| Scene Coverage | Supports flippable object logic for full coverage | Great for full-room reconstruction | Requires manual logic to ensure sufficient coverage |
| Mesh Generation | No | Provides a mesh of the room, environment | No |
| Ideal For | Isolated object reconstruction with minimal setup | Room-scale scanning and reconstruction | Custom capture workflows with high flexibility |
| Development Effort | Minimal, high-level API | Moderate, custom logic needed | High, everything must be implemented manually |
| Output Quality | High quality for objects, moderate for areas | Moderate, can lack fine detail | Varies, depends on implementation and captured images |

Photogrammetry: 3D Reconstruction Technology 

Photogrammetry is a technique for reconstructing 3D models. It analyzes multiple overlapping 2D images of an object or environment to find key points, measure their relative positions, and rebuild the object’s shape and texture in 3D.

Photogrammetry transforms flat photos into detailed and accurate 3D visualizations by combining geometric algorithms with photometric consistency.

The 3D reconstruction process using photogrammetry involves multiple stages, which are performed to produce a precise model. These stages include:

1. Pre-processing

During this stage, it is crucial to check image quality, make corrections, set the camera’s internal parameters, and handle other preparations.

2. Image alignment

This phase involves adjusting and coordinating multiple images. They need to overlap and match correctly in 3D space.

3. Point cloud generation

This stage creates a 3D representation of an object by collecting and analyzing data from multiple images. It transforms 2D image information into a spatially accurate 3D model.

4. Mesh generation

This step involves converting a point cloud into a detailed 3D surface mesh. The process creates a polygonal model that represents the surface geometry of the scanned object.

5. Texture mapping

The texture mapping stage adds detailed textures, like color and surface details, to a 3D mesh. This method helps create a realistic look for the scanned object.

6. Optimization

This stage refers to refining the parameters of a 3D model. The goal is to reach the best accuracy and quality.

Photogrammetry on iOS: How to Use 3D Reconstruction

You can use the PhotogrammetrySession from the ObjectCapture API for 3D reconstruction on iOS with photogrammetry.

As a result of reconstruction, PhotogrammetrySession can produce several types of output data, which can then be used in more complex processing pipelines.

Let’s consider how we can reconstruct 3D models from a series of captured images.

We created a separate ReconstructionService, which is responsible for managing the photogrammetry process:


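import Combine
import Foundation

// Abstraction over the photogrammetry workflow used by the app.
protocol ReconstructionService {
    // MARK: - Publishers
    // Emits progress, completion, and failure events.
    var eventPublisher: AnyPublisher<ReconstructionServiceEvent, Never> { get }

    // MARK: - Functions
    // Folder that receives the reconstruction results.
    func setOutputFolder(outputFolder: URL)

    // Location of the reconstructed model file, if one is available.
    func getModelFilePath() -> URL?

    func start(configuration: ReconstructionServiceConfiguration)
    func cancel()
}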
 

To start a session, we need to specify a configuration and the path to the folder where all captured images are stored.

We also need to identify the output requests we are interested in.


func start(configuration: ReconstructionServiceConfiguration) {
    // Map the service-level options onto the session configuration.
    var sessionConfiguration = PhotogrammetrySession.Configuration()
    sessionConfiguration.featureSensitivity = configuration.featureSensitivity
    sessionConfiguration.sampleOrdering = configuration.sampleOrdering
    sessionConfiguration.isObjectMaskingEnabled = configuration.isObjectMaskingEnabled

    // The session needs the captured-images folder and a destination for the model.
    guard
        let imagesFolderUrl,
        let modelFilePath,
        let session = try? PhotogrammetrySession(
            input: imagesFolderUrl,
            configuration: sessionConfiguration
        )
    else {
        eventSubject.send(.failed(errorMessage: "Session creation failed"))
        return
    }

    photogrammetrySession = session

    // Start observing outputs before submitting the requests.
    startObserving(outputs: session.outputs)

    do {
        // Request the model file plus the point cloud and per-image poses.
        try session.process(requests: [
            .modelFile(url: modelFilePath),
            .pointCloud,
            .poses
        ])
    } catch {
        eventSubject.send(.failed(errorMessage: error.localizedDescription))
    }
}

Available PhotogrammetrySession.Request types and their corresponding output data include:

  1. modelFile – a USDZ file with the reconstructed object.
  2. modelEntity – an in-memory 3D object that can be used directly in the app.
  3. bounds – the precise bounding box of the reconstructed object.
  4. pointCloud – the point cloud created during reconstruction.
  5. poses – the estimated 6DOF (six degrees of freedom) pose of each input sample.

Available PhotogrammetrySession.Configuration options include:

  1. featureSensitivity
  2. sampleOrdering
  3. isObjectMaskingEnabled

Configurations of Photogrammetry Session on iPhone
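
The article does not show ReconstructionServiceConfiguration itself. The sketch below is one possible shape for it, not the authors' implementation. So that it compiles on its own, FeatureSensitivity and SampleOrdering are declared here as local stand-ins that mirror the RealityKit options (normal/high and unordered/sequential); in an app you would use the PhotogrammetrySession.Configuration types directly.

```swift
// Hypothetical stand-ins for the RealityKit enums so the sketch compiles on its own.
enum FeatureSensitivity { case normal, high }
enum SampleOrdering { case unordered, sequential }

// Assumed shape of the article's ReconstructionServiceConfiguration (not shown in the source).
struct ReconstructionServiceConfiguration {
    var featureSensitivity: FeatureSensitivity = .normal
    var sampleOrdering: SampleOrdering = .unordered
    var isObjectMaskingEnabled: Bool = true
}

// Example: images were captured in order around a low-texture object.
let configuration = ReconstructionServiceConfiguration(
    featureSensitivity: .high,
    sampleOrdering: .sequential
)
print(configuration)
```

Keeping the options in a plain value type like this lets the UI layer build a configuration without importing the reconstruction framework.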

To receive updates and output data, we need to start observing the session’s outputs:

 

func startObserving(outputs: PhotogrammetrySession.Outputs) {
    Task { [weak self] in
        guard let self = self else {
            return
        }

        // Wrap the outputs so iteration ends when processing finishes (helper defined elsewhere).
        let outputs = UntilProcessingCompleteFilter(input: outputs)

        for await output in outputs {
            switch output {
            case .requestError(let request, let error):
                // Only a failed model file request is reported as a failure.
                if case .modelFile = request {
                    self.eventSubject.send(.failed(errorMessage: error.localizedDescription))
                }

            case .requestComplete(_, let result):
                // Persist the auxiliary outputs as they arrive.
                switch result {
                case .pointCloud(let pointCloud):
                    self.savePointCloud(output: pointCloud)

                case .poses(let poses):
                    self.savePoses(output: poses)

                default:
                    continue
                }

            case .processingComplete:
                self.saveCapturedImagesMetadata()
                self.eventSubject.send(.completed(output: self.outputFolderUrl))
                self.photogrammetrySession = nil

            case .processingCancelled:
                self.photogrammetrySession = nil

            case .inputComplete:
                break

            case .requestProgress(let request, let fractionComplete):
                if case .modelFile = request {
                    self.eventSubject.send(.progress(value: Float(fractionComplete)))
                }

            case .requestProgressInfo(let request, let progressInfo):
                // Forward the remaining-time estimate and the current stage to the UI.
                if case .modelFile = request {
                    let remainingTime = progressInfo.estimatedRemainingTime
                    self.eventSubject.send(.remainingTime(interval: remainingTime))

                    let processingStage = progressInfo.processingStage?.processingStageString
                    self.eventSubject.send(
                        .processingStage(description: processingStage ?? Strings.Reconstruction.processing)
                    )
                }

            default:
                continue
            }
        }
    }
}

We can retrieve valuable information from the session outputs. This includes the current processing stage and the estimated time left for processing. This data helps improve user experience by providing real-time feedback on the 3D reconstruction progress.

By adding these insights to the user interface, we allow users to stay informed and engaged throughout the reconstruction workflow.
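
As an illustration of the feedback described above, the sketch below turns an optional remaining-time estimate into a short label. The function name, the fallback text, and the label format are assumptions made for this example; they are not part of the article's code or of the ObjectCapture API.

```swift
// Hypothetical helper: converts the optional remaining-time estimate (in seconds)
// into a short UI label. The fallback text is an assumption.
func remainingTimeLabel(_ seconds: Double?) -> String {
    guard let seconds = seconds, seconds.isFinite, seconds >= 0 else {
        return "Estimating…"
    }
    let total = Int(seconds.rounded())
    let minutes = total / 60
    let secs = total % 60
    // Zero-pad the seconds so 65 s reads as 1:05 rather than 1:5.
    let padded = secs < 10 ? "0\(secs)" : "\(secs)"
    return "\(minutes):\(padded) remaining"
}

print(remainingTimeLabel(95))   // 1:35 remaining
print(remainingTimeLabel(nil))  // Estimating…
```

Handling the missing estimate explicitly matters because the session may not provide a value early in processing.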

Once the 3D reconstruction process is done, ObjectCapture provides a range of detailed outputs. You can access or export these for further use:

1. 3D Model 

The primary output is a high-quality 3D model of the scanned object or area, exported in USDZ format.

2. Bounds

The precise bounding box of the reconstructed object represents its size and spatial limits in 3D space.

3. Captured Images

All source images used during the photogrammetry session are preserved and can be exported for further processing, analysis, or archiving.

4. Image Metadata

Each captured image contains embedded metadata. This metadata can be extracted and saved separately as a text file.

5. PointCloud 

A point cloud representing the key visual features identified during image alignment can be exported as a plain text file, which is helpful for 3D visualization or custom processing pipelines.

6. Poses 

You can retrieve pose data for each captured image, including translation, rotation, and the extrinsic matrix. This information can be saved to a text file and used in custom processing or workflow analysis.

Exporting a 3D reconstruction of a captured object on an iPhone
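
The pose export mentioned in item 6 can be sketched as follows. This is a simplified illustration, not the article's savePoses implementation: SamplePose is a hypothetical container built from plain arrays, whereas the real API exposes poses through RealityKit types, and the text layout is an assumption.

```swift
// Hypothetical pose container; the real API exposes poses through RealityKit types.
struct SamplePose {
    var rotation: [[Double]]    // 3x3 rotation matrix, row-major
    var translation: [Double]   // x, y, z
}

// Builds the 4x4 matrix [R | t; 0 0 0 1] and renders it as one text line per row.
func poseText(sampleID: Int, pose: SamplePose) -> String {
    var rows: [[Double]] = []
    for r in 0..<3 {
        rows.append(pose.rotation[r] + [pose.translation[r]])
    }
    rows.append([0, 0, 0, 1])
    let body = rows
        .map { row in row.map { String($0) }.joined(separator: " ") }
        .joined(separator: "\n")
    return "sample \(sampleID)\n\(body)"
}

let identity: [[Double]] = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
let examplePose = SamplePose(rotation: identity, translation: [0.5, 0, -2])
print(poseText(sampleID: 3, pose: examplePose))
```

Whether the matrix maps camera to world coordinates or the reverse depends on the convention of the consuming tool, so it is worth recording that choice alongside the file.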

 

Comparison of 3D Reconstruction Solutions on iPhone and Mac

While you can now easily perform a full 3D reconstruction entirely on an iPhone, you can also carry out the photogrammetry process on a Mac.

The core photogrammetry APIs work on both platforms. However, differences in performance, output quality, and features can affect results and user experience based on the device.

The iPhone has notable limitations compared to macOS. Specifically, the iPhone version lacks support for:

  • Multiple mesh types.
  • Different detail levels.
  • Custom detail specifications (e.g., maximum polygon count, texture format selection,  output texture maps, etc.).

These advanced features are available on macOS. This makes the Mac version more flexible. It is better for workflows that need fine-tuning and enhanced control over the final output.

The diagram below showcases how the same set of images can lead to different results based on whether the 3D reconstruction is done on an iPhone or a Mac. This analysis helps developers decide where to run the reconstruction workflow in their apps or pipelines.

To evaluate the differences in performance and output quality, we ran a series of tests on two devices: an iPhone 13 Pro Max and a Mac mini M1 (8 GB RAM). The same image sets and the reduced detail level were used for every reconstruction task. We measured how long each device took to complete the 3D reconstruction.

 

Graph photogrammetry 3D reconstruction time iPhone vs Mac

On average, the Mac performed about 4% faster than the iPhone. However, this average hides the fact that performance differences become especially noticeable when scanning larger areas, such as full rooms or complex interior spaces.

For small or single-object scans, the performance on iPhone and Mac was quite close. In some cases, the iPhone even performed faster.
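
To see how a small average can hide large per-scene differences, consider the sketch below. The timings are invented purely for illustration and are not the measurements from our tests; the function name is also an assumption.

```swift
// Hypothetical timings in seconds (NOT the article's measurements).
let timings: [(scene: String, iPhone: Double, mac: Double)] = [
    ("small object", 60, 63),
    ("medium object", 120, 120),
    ("full room", 400, 340),
]

// Relative speedup of the Mac over the iPhone, in percent (positive = Mac faster).
func macSpeedup(iPhone: Double, mac: Double) -> Double {
    return (iPhone - mac) / iPhone * 100
}

let perScene = timings.map { macSpeedup(iPhone: $0.iPhone, mac: $0.mac) }
let average = perScene.reduce(0, +) / Double(perScene.count)

for (entry, speedup) in zip(timings, perScene) {
    print(entry.scene, speedup)
}
print("average", average)
```

With these made-up numbers the average stays in the low single digits, even though the room-scale scene alone differs by about 15% and the small object is slightly slower on the Mac.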

This makes the Mac especially useful for workflows that involve room-scale reconstruction or larger environments, where processing time can grow significantly.

In these cases, the ability to process 3D reconstructions more quickly can improve productivity and reduce bottlenecks, especially in professional applications or iterative scanning tasks.

Regarding quality, both platforms produce visually and structurally similar results when working with simple objects. However, the Mac’s reconstruction results are often slightly better, especially in scenarios involving:

  • Complex geometry.
  • Fine surface details.
  • Irregular or organic shapes.

This means that the iPhone alone is often sufficient and convenient for quick on-site scanning of simple objects, while the Mac can deliver better results for room-scale or complex object scanning.

3D Reconstruction on iOS: To Sum Things Up

ObjectCapture transforms 3D reconstruction on iPhones and iPads, replacing bulky desktops. 

The ObjectCapture API simplifies the 3D reconstruction on iOS into a guided, user-friendly experience, allowing even beginners to produce high-quality 3D models effortlessly.

Object mode ensures precision for small objects, while Area mode offers spatial scanning of larger areas, architecture, or interiors. Despite fixed image formats and limited real-time frame access, it is ideal for AR solutions, product digitization, and more.

RoomCaptureSession excels in precise spatial mapping of large environments, while AVCaptureSession offers fine-tuned camera control for detailed object captures. Both of these image acquisition methods require more management and setup but provide greater customization.

Thus, object-capturing tools empower programmers to enhance computer vision development services and 3D scanning apps on iOS across diverse use cases.

The general recommendation is to choose:

  • ObjectCapture for quick, reliable models.
  • RoomCaptureSession for spatial accuracy.
  • AVCaptureSession for detailed reconstructions. 

Ultimately, the choice of the technique depends on the desired outcome, whether it is creating high-quality 3D models of small objects, large environments, or anything in between.

ObjectCapture produces a dense textured mesh. That raw output still needs retopology, UV cleanup, and material refinement before it is usable in a game engine, product renderer, or manufacturing workflow. The same cleanup problems arise with AI-generated geometry. Our article AI 3D Generation: From Prototype to Production covers that post-processing pipeline in detail. Most of the same steps apply here.




 

Fiducial Markers Overview: Types, Use Cases, & Comparison Table

Guide to Fiducial Markers: Exploring Types, Applications, and Key Differences

Accurate data tracking and measurement are constant challenges in numerous use cases. Can fiducial markers become a solution? Let’s find out. 

For instance, the medical industry requires extreme accuracy: even a 1-millimetre deviation can jeopardize a surgery’s outcome.

Misaligned virtual surfaces in AR can disorient users. For instance, some objects may appear closer than their physical counterparts. 

Fiducial markers represent a powerful tool to address these pain points for various applications and computer vision tasks, such as object detection, camera pose estimation, and anything that requires a robust source of image features. 

People often mistakenly think of fiducial markers only as square binary codes, which limits their understanding of their true potential. Fiducials are designed for easy detection in different lighting, angles, and distances, making them reliable tools for real-world settings.

This comprehensive guide delves into the topic and highlights fiducial marker benefits and the following aspects:

  • Types of fiducial markers and their properties. 
  • Applications of fiducial markers across different industries.
  • Comparison of fiducial markers with their strengths and limitations for better decision-making.

Let’s start by defining what a fiducial marker is.

What Are Fiducial Markers and Their Benefits

Fiducial markers are artificial objects, such as black-and-white grids, checkerboards, or shapes with specific patterns. They are placed in an environment or scene to give imaging systems known reference points.

The term “fiducial” comes from the Latin fiducia, meaning trust, reflecting their function as dependable reference points for spatial measurements.

Designed for easy detection by cameras and algorithms, these markers enable precise 3D tracking. Typically, each fiducial marker is part of a system that also defines a coding scheme and a detection algorithm. A detected marker generally yields its location in the image, its orientation, and its unique ID.

To make things even more straightforward, here is a simple explanation. Since images lose information about the captured scene depth, it is difficult to estimate the dimensions of an existing object properly. 

This issue may be solved by placing an object with well-known dimensions, such as a ruler, in the field of view. In this case, the ruler is a reference point and stands for a fiducial marker.

In computer vision development services, fiducial markers serve a similar purpose and go further: they also make it possible to estimate camera geometry. Cameras can detect and interpret these markers to calculate position, orientation, and scale.
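
The ruler idea can be sketched in a few lines. This is a minimal illustration under strong assumptions: the reference and the measured object lie in the same plane, roughly parallel to the image sensor, and lens distortion is ignored. The function names and the numbers are made up for the example.

```swift
// Millimetres represented by one pixel, derived from a reference of known size.
func millimetresPerPixel(referenceLengthMM: Double, referenceLengthPx: Double) -> Double? {
    guard referenceLengthMM > 0, referenceLengthPx > 0 else { return nil }
    return referenceLengthMM / referenceLengthPx
}

// Real-world length of an object lying in the same plane as the reference.
func objectLengthMM(objectLengthPx: Double, scale: Double) -> Double {
    return objectLengthPx * scale
}

// A 50 mm reference spans 200 px, so an object spanning 480 px is 120 mm long.
if let scale = millimetresPerPixel(referenceLengthMM: 50, referenceLengthPx: 200) {
    print(objectLengthMM(objectLengthPx: 480, scale: scale)) // 120.0
}
```

Real marker systems replace this single scale factor with a full pose estimate, which removes the same-plane restriction.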




To sum things up, the fiducial marker benefits are as follows:

  • Accuracy: offers reliable reference points for precise positioning, alignment, and tracking. This feature boosts spatial accuracy in complex systems like imaging devices, robots, and AR platforms.
  • Automation: streamlines calibration and alignment. This fiducial marker property enables machines to operate with minimal human help in robotics and automated inspection processes.
  • Repeatability: ensures consistent results in repeated imaging. This use case is vital in medical imaging, 3D scanning, and automated manufacturing.
  • Simplification: makes tasks like object detection, 3D reconstruction, and spatial navigation easier.
  • Real-time tracking: provides instant feedback for applications such as motion capture, drone navigation, and interactive simulations. 
  • Cost-effectiveness: provides affordable, high-value solutions for enhanced functionality and performance.

These fiducial marker benefits make them invaluable in both research and commercial applications. 

Once correctly applied and set up, they help with tracking, localization, camera calibration, and object detection in applications like robotics, augmented reality, and manufacturing.

Types of Fiducial Markers

In typical computer vision, many fiducial marker systems exist. They differ primarily in their appearance and coding systems.

Generally, we can group all markers by their shape: circular, square, and topological. DL-based fiducial markers have also evolved in recent years, leading to another subclass.

Next, we will explore the most common fiducial marker types, their designs, and unique features for 3D computer vision services.

Existing fiducial marker systems

1. Circular Fiducial Markers

According to the studies, most round markers rely on the relative positions of inner circles, such as CCC, Cho, and CCTag. Based on their foundations, developers created more advanced Knyaz and InterSense systems. These novel fiducial markers use more complex coding.

Examples of circular markers

Circular markers are less popular now. This is because they are less accurate and do not help with 2D point localization.

According to the ResearchGate publication, one of the most successful circular markers is RuneTag, which uses a large number of points. This feature boosts its pose estimation and resistance to occlusions. Yet, it does slow down performance.

Thus, circular markers are primarily used in pose estimation tasks. This is because these tasks often deal with occlusion in the scene.

2. Square Fiducial Markers

The most common fiducials are square markers, also called binary or checkerboard markers. They encode information in an internal binary grid. Another advantage is that a detection returns complete information, including corner positions, pose, and ID.
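
The principle of reading an ID from a binary grid can be sketched as below. This is a deliberately simplified illustration, not the decoding algorithm of ArUco, AprilTag, or any other specific system: real detectors also handle the marker border, sampling noise, and bit-error correction. All names and the one-entry dictionary are assumptions for the example.

```swift
typealias BitGrid = [[Int]]   // N x N cells, 1 = white, 0 = black

// Rotates the grid 90 degrees clockwise.
func rotated(_ g: BitGrid) -> BitGrid {
    let n = g.count
    var out = g
    for r in 0..<n {
        for c in 0..<n {
            out[c][n - 1 - r] = g[r][c]
        }
    }
    return out
}

// Packs the grid row by row into a single integer code.
func code(_ g: BitGrid) -> UInt64 {
    return g.flatMap { $0 }.reduce(UInt64(0)) { ($0 << 1) | UInt64($1) }
}

// Tries all four orientations; returns the marker ID and the number of
// clockwise quarter turns applied to the observed grid before it matched.
func decode(_ observed: BitGrid, dictionary: [UInt64: Int]) -> (id: Int, quarterTurns: Int)? {
    var g = observed
    for turns in 0..<4 {
        if let id = dictionary[code(g)] { return (id, turns) }
        g = rotated(g)
    }
    return nil
}

// Hypothetical one-marker dictionary: the reference pattern maps to ID 7.
let reference: BitGrid = [[1, 0, 0], [0, 1, 0], [0, 1, 1]]
let dictionary: [UInt64: Int] = [code(reference): 7]

// The camera sees the marker rotated by a quarter turn.
if let hit = decode(rotated(reference), dictionary: dictionary) {
    print("id \(hit.id), turns \(hit.quarterTurns)") // id 7, turns 3
}
```

The number of quarter turns needed for a match is what gives the detector the marker's in-plane orientation in addition to its ID.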

Examples of square markers

The first square marker ideas were implemented in systems like Matrix, CyberCode, and VisualCode. These days, their work principles are outdated and inefficient. 

Currently, the top markers in this category are ARToolkitPlus, ARTag, AprilTag, and ArUco. ARToolkitPlus is a modern evolution of ARToolkit, which initially introduced the concept of image binarization.

All further systems were gradually improving versions of the previous ones:

  • ARToolkitPlus and ARTag are enhanced versions of ARToolKit.
  • AprilTag and ArUco are enhanced versions of ARTag.

Another interesting example is the ChromaTag, a colorized version of the AprilTag. According to the publication, its significant advantage over similar versions is its fast detection speed while keeping the same level of accuracy. 

On the other hand, this marker is more sensitive to a large angle of view and long distances. Therefore, even the authors of ChromaTag recommend using AprilTag in these use cases.

As a result, AprilTag and ArUco markers are regarded as some of the most reliable and high-performing fiducial markers available. They operate on the same principles but use different algorithms to generate their marker dictionaries.

ArUco markers are especially popular since OpenCV has included their implementation as a submodule.

3. Topological Markers

This type of fiducial marker has a more complex and diverse structure. D-Touch and ReactVision were the very first examples and are no longer relevant.

A more recent development among topological markers is TopoTag. It uses an inner binary structure similar to that of checkerboard markers.

Three TopoTag marker examples

TopoTag’s authors achieve high robustness and near-perfect detection accuracy. These markers offer more feature correspondences for better pose estimation. Compared to square markers, they are also better at resisting occlusions.

In the evolution of fiducial markers, topological patterns struggled against other types. However, recent studies show they may outshine even the steadfast ArUco.

4. DL-based Markers

The previous markers used traditional computer vision methods for detection. In contrast, the DL-based systems utilize trained models.

Few DL-based systems can match the best classical markers yet, but the field is still evolving. Recent work includes E2ETag and DeepFormableTag.

E2ETag, appearance (left), complex case detection (right)

E2ETag markers are generated automatically and consist of various textures with diverse forms and colors. E2ETag can handle tough scenes with poor exposure, motion blur, and noise.

DeepFormableTag uses RGB information. Its markers can be detected on convex surfaces, which is tough for non-DL-based fiducials. However, neither system supports pose estimation.

DeepFormableTag, appearance (left), complex case detection (right)

Deep learning approaches that support the existing markers mentioned earlier have also been developed. One recent proposal is DeepTag, a deep learning-based framework for designing and detecting fiducial markers.

Its authors showed experimentally that DeepTag can detect fiducials more precisely than classical methods, even at difficult viewing angles. The framework also extracts more keypoints from a marker’s internal structure, making pose estimation more accurate.

DeepTag, qualitative detection results

Another enhancement is DeepArUco++, which improves upon classical ArUco markers by integrating convolutional networks for robust detection, corner refinement, and decoding. It particularly excels under adverse lighting conditions where traditional pipelines often fail.

DeepArUco++ framework

A recent innovation is YoloTag, a real-time detection system built on YOLOv8, primarily aimed at UAV navigation. Rather than designing a new marker structure, it treats the fiducial markers as generic objects. These are detected using object detection and localized via a PnP pose estimation algorithm.

This system enables efficient, marker-based localization in large-scale outdoor environments without relying on precise marker geometry.

5. Non-visual Markers

Not all markers are meant to be seen. A growing line of research explores fiducials that operate outside the visible spectrum, quietly supporting perception where cameras may struggle or aesthetics matter.

Scene with iMarkers highlighted and magnified (b)

 

iMarkers, introduced in 2025, are designed to blend in. They are invisible to the human eye, yet detectable by specialized sensors. Such markers offer a discreet way to embed localization cues into everyday spaces, which is useful in environments like homes or public installations where visual clutter is unwelcome.

L-PR, on the other hand, speaks to machines in 3D. Developed for LiDAR-based systems, it encodes information into geometric patterns that remain effective even when views are sparse or misaligned. When visual cues fall short, it is a practical tool for robotics, mapping, and 3D reconstruction.

Key fiducial marker properties:

  • Circular markers, such as CCTag and the more advanced RuneTag, excel in precision tasks like camera calibration due to their robustness to perspective distortion.
  • Square markers, like ArUco and AprilTags, are widely used for their simplicity and effectiveness in AR and robotics, though they may struggle with occlusions.
  • Topological markers, exemplified by TopoTag, offer high robustness and scalability, supporting thousands to millions of unique IDs for complex applications.
  • DL-based markers, like those using DeepTag or DeepArUco++, leverage deep learning for flexible detection. They may provide greater robustness, but demand increased computational resources.
  • Non-visual markers, such as iMarkers and L-PR, operate beyond the visible spectrum through infrared, LiDAR, or other sensing modalities. They enable detection where vision fails or visibility is not an option. 



Comparison of Fiducial Markers

Different applications require different fiducial marker properties. 

The following table provides the pros and cons of fiducial markers, namely square, circular, topological, DL-based, and non-visual:

| Marker type | Circular | Square | Topological | DL-based | Non-visual |
| --- | --- | --- | --- | --- | --- |
| Examples | CCTag, CCC, Cho | Matrix, CyberCode, VisualCode, ARToolkitPlus, ARTag, AprilTag, ChromaTag | D-Touch, ReactVision | E2ETag, DeepFormableTag, YoloTag | iMarkers, L-PR |
| Top examples | RuneTag | ArUco | TopoTag | DeepTag, DeepArUco++ | iMarkers |
| Design | concentric circles or dot patterns | square with binary patterns | topological patterns (connectivity-based) | custom patterns (consisting of various textures, diverse forms, and colors) or copies of existing markers | non-human-visible markers (e.g., infrared, LiDAR geometry) |
| Detection method | traditional CV (e.g., ellipse detection) | traditional CV (e.g., edge detection) | topological and geometrical analysis | deep learning (e.g., CNNs or similar trained models) | sensor-specific (e.g., infrared, LiDAR feature matching) |
| Robustness to occlusion | high (resistant to distortion and blur) | moderate (sensitive to partial occlusion) | very high (handles partial occlusion well) | very high (adapts to occlusions) | high (not affected by visible light occlusion) |
| Speed | moderate | fast | moderate | slow | variable (depends on sensor type and data processing) |
| Advantages | robust to perspective distortion, ideal for precision | simple to implement, widely supported, and efficient | near-perfect detection accuracy, scalable (millions of IDs), robust in dynamic settings | robust, flexible, performs reliably even in challenging conditions such as poor lighting, motion blur, and image noise | unaffected by lighting, aesthetics preserved, work in darkness or clutter |
| Limitations | limited marker diversity, limited in 2D point localization, requires careful placement in cluttered environments | limited by occlusion, extreme angles, and long distances | requires specialized algorithms | requires training, sometimes high resources (e.g., hardware) | requires specialized sensors and hardware |
| Computational cost | moderate | low to moderate | moderate to high | high | moderate to high (depends on sensing and decoding method) |
| Use cases (main applications) | pose estimation, calibration, precision tracking, etc. | AR, robotics, camera calibration, etc. | AR, robot navigation, biomedical imaging, warehouse automation, etc. | advanced applications, medical imaging, research, complex environments, etc. | robotics in low-light or cluttered areas, 3D mapping, AR |

By analyzing these properties of fiducial markers, developers and engineers can tailor their systems for optimal performance.

You may consider the size, shape, and detectability of fiducials as well as the following factors when selecting the appropriate fiducial marker system:

1. Environment

Answer the question: “Will the fiducial be used indoors, outdoors, or in low-light conditions?”. 

Consider durability, which can be especially important in long-term or harsh environments. Square markers suit controlled settings, while circular and topological markers excel in challenging conditions. DL-based markers offer maximum flexibility for extreme variability.

2. Precision Needs

ChArUco boards (which combine a chessboard with ArUco markers) or circular markers like CCTag are best for sub-pixel accuracy (e.g., calibration). For general tracking, ArUco or AprilTags suffice. Topological and DL-based markers provide robust alternatives for complex scenes.

3. Speed Requirements

Square markers are the fastest for real-time applications, followed by circular and topological markers. DL-based markers are the slowest but are improving with optimization.

4. Scalability

Topological markers are ideal for applications needing millions of unique IDs, while square and circular markers support smaller sets.

5. Computational resources

If you are working with limited processing power (e.g., mobile devices), ArUco is more efficient than AprilTags. Square markers are lightweight; circular markers are moderately demanding; topological markers require specialized algorithms, and DL-based markers need significant computational power.

6. Cost

Square markers benefit from mature libraries (e.g., OpenCV), while topological and DL-based systems may require custom development. Passive markers like ArUco or QR codes are inexpensive, while DL-based markers require investment in hardware.

Fiducial Markers: Applications & Use Cases

Fiducial markers are versatile. They can be used in many industries and research areas. Below are some typical applications of fiducial markers within the computer vision domain:

1. Augmented and Virtual Reality

In our experience, fiducial markers are widely used in augmented reality services. They enable the integration of digital content, such as virtual 3D objects, into real-world environments.

The main goal of AR apps is to analyze the live camera feed and accurately overlay virtual elements onto the real scene using tracking data. By using fiducial markers with known patterns and sizes, AR systems can also accurately determine real-world position, orientation, and scale.

Marker-based tracking is a widely adopted method in AR. It offers high precision by using visual references, ensuring a stable alignment between the virtual and real-world visuals.
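
A known marker size is what makes metric scale recoverable. As a rough sketch, the pinhole camera model gives the distance to a marker that faces the camera from its physical size and its size in the image. The function name and the numbers are assumptions for this example; production AR systems estimate the full 6DOF pose from the marker corners instead.

```swift
// Pinhole model: distance = focalLengthPx * markerSizeM / markerSizePx.
// Assumes the marker faces the camera and lens distortion is negligible.
func distanceToMarker(focalLengthPx: Double, markerSizeM: Double, markerSizePx: Double) -> Double? {
    guard focalLengthPx > 0, markerSizeM > 0, markerSizePx > 0 else { return nil }
    return focalLengthPx * markerSizeM / markerSizePx
}

// A 12.5 cm marker that appears 100 px wide with a 1600 px focal length is 2 m away.
if let distance = distanceToMarker(focalLengthPx: 1600, markerSizeM: 0.125, markerSizePx: 100) {
    print(distance) // 2.0
}
```

The same relation explains why a marker printed at the wrong size makes virtual content appear closer or farther than intended.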

Use cases: AR gaming, training simulations, interactive museum exhibits, etc.

2. Robotics and Automation 

Fiducial markers play a key role in improving robotic skills. They help with localization, object recognition, and path planning.

High-contrast patterns like ArUco help with navigation. They are camera-detectable and work well where feature detection algorithms may fail. Research highlights how they boost robot autonomy in the industrial, medical, and logistics fields.

Use cases: Warehouse robotics, drone and auto navigation, robotic arms

3. Manufacturing and Quality Control 

Fiducial markers are used to maintain high-quality manufacturing standards. They boost efficiency, reduce errors, and guarantee high-quality results in many fields, especially in electronics manufacturing.

They help improve assembly by guiding where to place components. They also help inspect and verify product quality and assist with calibrating machines for accurate measurements.

Use cases: 3D printing calibration, parts and products verification

4. Motion Capture and Animation 

Fiducial markers are common in motion capture (mocap) and animation. They help record human and object movements accurately. This data is used in areas like film production, sports science, and biomechanics.

High-speed cameras detect their positions in 3D space, enabling detailed motion reconstruction.

Use cases: Animation, athletic performance analysis, etc.

To Conclude About Fiducial Markers

The fiducial marker technology is a foundational tool for the interface between physical and digital systems. Fiducial marker systems impress with their variety of shapes, appearances, and detection approaches. 

In this article, we reviewed some popular markers: square, circular, topological, DL-based, and even invisible. These systems cater to different needs, with no single type being universally superior. The choice depends on your specific application. Developers and engineers can select or design the optimal solution tailored to their particular needs through an informed comparison of fiducial markers.

To conclude, fiducial marker technology continues to evolve, integrating advances in materials science, computer vision, and custom AI solutions. These innovations promise even greater benefits in emerging fields like personalized medicine, autonomous vehicles, and immersive computing.



RoomPlan is Awful and it’s Great!

RoomPlan is a powerful framework from Apple designed for the fast and convenient creation of 3D models of rooms, using augmented reality (AR) technologies and LiDAR scanning capabilities. In our previous article, we reviewed the basic functions of RoomPlan, such as session setup, the structure of core components, and the specifics of output data. We explored how this tool can interact with the surrounding space to transform your rooms into a 3D model.

At first glance, RoomPlan is an impeccable tool for modeling rooms and indoor spaces. Its features might seem exhaustive for many tasks: automatic object recognition, real-time 3D model creation, and export capabilities. All this provides broad possibilities for developers, interior designers, and AR enthusiasts seeking a tool for quick and efficient work with room spaces, visualization, and presentation.

RoomPlan Framework by Apple

However, like many modern technologies, RoomPlan has darker sides worth considering. Despite its progressive features, this framework has several limitations and drawbacks that can significantly impact the final result and may require developers to put in extra effort to overcome them. In this article, we will look at the key issues one might encounter when working with RoomPlan and explain why this tool may not be as perfect as it appears.

Today, we’ll attempt to look beyond the mirror of RoomPlan and examine its limitations. This is an important step for everyone planning to use this tool in their projects, as understanding RoomPlan’s shortcomings will help you prepare for potential problems in advance and devise ways to address them.

Approximately correct, almost accurate

Although RoomPlan is positioned as a tool for professional spatial measurement tasks, in practice, its capabilities are limited by several important aspects that affect the final accuracy of the models.

Apple claims:

“RoomPlan outputs in USD or USDZ file formats that include dimensions of each component recognized in the room, such as walls or cabinets, as well as the type of furniture detected.” (https://developer.apple.com/augmented-reality/roomplan/)

In practice, various factors greatly distort the scanning results.

Limited Object Recognition

Although RoomPlan offers automatic object recognition, its capabilities in this area are quite limited. The tool can only identify basic interior elements, such as tables, chairs, sofas, and some household appliances.

Object Category Detection by RoomPlan

However, more complex or less common objects – like air conditioners, boilers, shelves, wall lamps, or decorative elements – remain beyond RoomPlan’s detection capabilities. Consequently, these objects simply do not appear in the model or are replaced with simplified shapes, leading to significant detail loss and affecting the overall spatial accuracy.

Example of Limited Object Recognition by RoomPlan Apple

Rectangular Simplifications

A significant limitation of RoomPlan is that the system attempts to reduce all objects and surfaces to a set of rectangles. This approach ensures processing speed but significantly impacts the quality and detail of the 3D model.

For instance, unique architectural elements, such as semicircular arches and sloped or non-flat walls, are simplified into primitive rectangular blocks, which noticeably distorts the model and reduces its actual accuracy.

Additionally, there is an issue with handling height variations, sloped ceilings, moldings, and baseboards, as these elements are almost always ignored when creating the model.

Rectangular Simplifications by RoomPlan

Ceilings and Skylights

RoomPlan does not capture any ceiling data, meaning you won’t be able to include ceilings in your model. This limitation is especially critical for tasks involving lighting design or calculations of room volume, as ceiling data is essential for these applications.

Ceilings and Skylights Recognition by RoomPlan

Furthermore, RoomPlan does not detect skylights, which are often integral to the functionality and aesthetics of attic or loft spaces. This lack of ceiling and skylight recognition further reduces RoomPlan’s applicability for projects requiring comprehensive architectural detail.

Measurement Errors

RoomPlan has accuracy issues when absolute precision, rather than relative precision, is required, resulting in dimensional discrepancies. An error of ±5 cm in a 1-meter wall may seem minor, but it’s important to remember that such errors accumulate. For example, in a space with multiple partitions, a divided bathroom, or a hallway, the deviation in each wall/window/door compounds, leading to a much more pronounced distortion overall.

In the example below, you can see the dimensions of a wall with a window embedded within it.

Measurement Errors by RoomPlan Framework

For the demo space in this article, the length deviation reached more than 37 cm, with the actual length being 6.45 meters, compared to RoomPlan’s measurement of 6.821 meters.
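
To put this deviation into numbers, the arithmetic can be sketched in a few lines of Swift. The `MeasurementError` type and its property names are illustrative and are not part of RoomPlan; the two lengths are the ones quoted above.

```swift
import Foundation

// Illustrative helper (not a RoomPlan type): compares a scanned length
// with a reference measurement taken by hand.
struct MeasurementError {
    let actualMeters: Double
    let scannedMeters: Double

    // Signed deviation in centimeters (positive means the scan is too long).
    var deviationCentimeters: Double {
        (scannedMeters - actualMeters) * 100
    }

    // Deviation as a percentage of the actual length.
    var relativePercent: Double {
        (scannedMeters - actualMeters) / actualMeters * 100
    }
}

let wall = MeasurementError(actualMeters: 6.45, scannedMeters: 6.821)
print(String(format: "%.1f cm, %.2f%%", wall.deviationCentimeters, wall.relativePercent))
```

For the demo wall this gives roughly 37.1 cm, or a little under 6% of the actual length, which is in line with the ±5 cm per meter figure mentioned earlier.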

Incorrect Wall Thickness Representation

RoomPlan sometimes fails to capture the actual thickness of walls and simplifies them to standard partitions (~16 cm). Only when merging is performed can the thickness be increased to better match the actual geometry.

Additionally, all exterior walls in your space are guaranteed to be represented as 16 cm. As a result, thick exterior or interior walls appear too thin in the model, which can distort scale and other aspects of the model critical for accurate interior planning.

Incorrect Wall Thickness Representation by RoomPlan

Incorrect Wall Thickness Representation by Apple RoomPlan Framework

Issues with Doors and Windows

When it comes to working with doors, whether they are combined door-window units or double doors, RoomPlan may interpret them as a single plane or merge them incorrectly, compromising the model’s realism. Although RoomPlan does differentiate between doors and openings, this distinction is not visually represented in the 3D model. In 3D, an “opening” is merely a hole in the wall, while a “door” is intended to represent an actual door. However, in practice, both appear identical, offering no distinction in the data or model view.

Apple RoomPlan Issues with Doors and Windows

To get data on openings (their sizes, positions, and parent component), you need to work with the CapturedRoom JSON data file.
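
A minimal sketch of reading openings from such a file is shown below. The key names (`openings`, `identifier`, `parentIdentifier`, `dimensions`, `transform`) and the layout of `transform` as 16 column-major values are assumptions about the exported JSON, so check them against a file from your own scan. `OpeningRecord`, `RoomFile`, and `loadOpenings` are hypothetical helpers, not RoomPlan API.

```swift
import Foundation

// Hypothetical mirror of the fields needed from one opening entry.
struct OpeningRecord: Decodable {
    let identifier: String
    let parentIdentifier: String?   // assumed: identifier of the containing wall
    let dimensions: [Double]        // assumed: width, height, depth in meters
    let transform: [Double]         // assumed: 4x4 matrix, 16 values, column-major

    // Translation part of the transform (last column), used as the position.
    var position: (x: Double, y: Double, z: Double)? {
        guard transform.count == 16 else { return nil }
        return (x: transform[12], y: transform[13], z: transform[14])
    }
}

// Hypothetical top-level wrapper; all other keys in the file are ignored.
struct RoomFile: Decodable {
    let openings: [OpeningRecord]
}

func loadOpenings(from data: Data) throws -> [OpeningRecord] {
    try JSONDecoder().decode(RoomFile.self, from: data).openings
}

// Hand-written sample in the assumed layout, for illustration only.
let sampleJSON = """
{"openings": [{"identifier": "A", "parentIdentifier": "W1",
  "dimensions": [0.9, 2.1, 0.0],
  "transform": [1,0,0,0, 0,1,0,0, 0,0,1,0, 1.5,1.05,-2,1]}]}
"""
let openings = (try? loadOpenings(from: Data(sampleJSON.utf8))) ?? []
for opening in openings {
    print(opening.identifier, opening.parentIdentifier ?? "no parent", opening.dimensions)
}
```

Decoding into small custom structs keeps the parsing independent of the RoomPlan framework, which is convenient when the JSON is processed on a server or in a desktop tool.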

Additionally, for doors, factors such as the direction they swing open or even the exact placement within the opening are not captured. This impacts the model’s accuracy and can create mismatched expectations, as knowing the door’s orientation and position is crucial for many professional applications. The lack of this information diminishes the usefulness of the model, as the distinction between doors and openings becomes almost meaningless when there are no visual or data differences.

A further complication arises with double doors when one side is open and the other closed; in this case, RoomPlan often visualizes the closed side as part of the wall. Conversely, if both doors are open, creating a wide passage, it may register this as an opening rather than a door. This leads to inconsistencies in the representation, affecting both the visual model and spatial data.

For windows, RoomPlan often trims frames if they are sectioned or multi-level.

In cases where doors have a complex configuration or non-standard design, the tool may fail to represent them accurately, adding difficulties in further work with the model.

Doors and Windows Recognition by Apple Roomplan

Large Mirror Surfaces

Floor-to-ceiling mirrors and mirrored wardrobe doors pose a particular challenge for RoomPlan. Due to their optical properties, LiDAR often fails to accurately process these reflective surfaces, resulting in significant distortions or errors in the scan.

For example, large mirrors can cause “gaps” in the model, missing objects (as if the wardrobe isn’t there), or phantom objects that don’t exist in the real space.

Each of these issues reduces the accuracy and reliability of models created using RoomPlan and requires developers to invest additional effort to refine and adjust the completed 3D scenes.

Walls Encroaching on Space

In iOS 17, walls in RoomPlan may encroach on the interior space, covering objects that are placed closely against them. This is especially noticeable when furniture or other items are flush with the walls.

This behavior has been improved in iOS 18, where wall boundaries are handled more accurately.

Wall Thickness Limitations

RoomPlan has a restriction on wall thickness, which cannot exceed approximately 50 cm. Walls that are thicker than this limit are treated as two separate thin walls, which can result in incorrect structural representation for spaces with very thick walls.

Inconsistent Wall Heights

Wall heights within a single room can vary, especially at corners where walls of different heights may converge. This issue is primarily seen in rooms with decorative elements, arches, or transitions near the ceiling, which cause height discrepancies.

Inconsistent Wall Heights by RoomPlan

Curved Walls and Floor Gaps

RoomPlan struggles with accurately representing curved walls. The system simplifies floors by aligning to the wall’s extreme points, resulting in gaps between the wall and the floor where a curve exists.

Curved Walls and Floor Gaps by RoomPlan

Simplification of Columns and Niches

Columns, niches, and other structural details are typically simplified or removed entirely in the RoomPlan model, which affects the accuracy of the final representation and loses critical architectural elements.

Native merge

One of RoomPlan’s features is the automatic process of merging individual elements of a room or space into a unified 3D model. However, while this function seems beneficial, in practice, it introduces considerable distortions, as RoomPlan attempts to optimize the final model’s appearance, often at the expense of accuracy. As a result, individual rooms may appear reasonably accurate and detailed after scanning, but the combined model often exhibits serious distortions. This makes the final 3D model less suitable for professional use, where precise measurements and proportions are critical.

Merging Floors of Different Rooms

RoomPlan automatically combines all floors into a single plane, which can significantly compromise the model’s realism. This merging largely depends on wall parameters and on how accurately the walls are combined into a shared space.

Merging Floors of Different Rooms with RoomPlan by Apple

Another issue arises from how RoomPlan treats level differences—it does not account for steps or platforms within rooms. In these cases, each room may look reasonably accurate, but upon merging, all these simplifications create additional discrepancies and mismatches between the separate areas. The combined floor gives the impression that all rooms are on the same level and share a uniform appearance.

Lack of Support for Multi-level Structures

RoomPlan is limited to working within a single floor, with merging possible only within a single horizontal plane. This means that for multi-story buildings, it is necessary to create separate models for each floor, treating each as an independent model.

The inability to merge floors into a single model complicates projects where it’s essential to represent all levels of a structure. This limitation makes RoomPlan less convenient for tasks requiring an overall view or when calculating volumes across multiple floors.

Automatic Wall Angle Alignment

RoomPlan automatically adjusts wall angles to make them perpendicular if there are minor deviations, even if, in real space, the angles are not perfectly right. This optimization is aimed at standardizing the model, but it often distorts the geometry of the room. Consequently, the model loses unique architectural features that may be essential for preserving the individuality and accuracy of the space.

Automatic Wall Angle Alignment with Apple RoomPlan

The problem becomes even more pronounced when dealing with spaces featuring complex structures or non-standard wall geometries, such as oval or slanted walls (like those in attics), where automatic angle straightening changes the room’s appearance and is not suitable.

Thus, although RoomPlan’s automatic merging aims to simplify and streamline the model creation process, in practice, it can significantly reduce accuracy. This requires users to put in extra effort to adjust the merged model so that it aligns with real conditions and architectural requirements.

Developers’ suffering

Preview Customization

RoomPlan provides a built-in preview view during scanning, but it is fixed and does not support customization. By default, you always get an AR session with a visualization of the scanned space and a preview at the bottom center. You can only add elements on top of the standard view, such as buttons, indicators, etc.

For real-world tasks, you might want to go beyond the standard RoomCaptureView. In that case, you can create your own custom view from scratch (we’ve already presented this in a previous article).

That is, you can completely define the appearance, corners, and colors, for example, by coloring the floor and walls separately, or ignore objects if you are only interested in the outline of the room.

Preview Customization with RoomPlan

Export Issues

When attempting to export data after working with RoomPlan, be prepared for potential errors if file names start with numbers, such as “1234,” or if UUIDs are used for name generation. This issue results in failed exports.

To fix this, just add any Latin letter or word to the beginning of the file name, for example, “export_”.

While this bug is resolved starting with iOS 18, earlier versions still exhibit this problem, so it’s important to be cautious with file naming when exporting RoomPlan data on older iOS versions.
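
As a sketch of this workaround, the helper below adds a prefix whenever a name does not start with a Latin letter. The function name `safeExportName` is hypothetical, and the default prefix is simply the “export_” example from the text.

```swift
import Foundation

// Returns a file name that starts with a Latin letter: if the given name
// starts with anything else (for example, a digit), a fixed prefix is added.
func safeExportName(_ name: String, prefix: String = "export_") -> String {
    guard let first = name.unicodeScalars.first else { return prefix }
    let startsWithLatinLetter = first.isASCII && CharacterSet.letters.contains(first)
    return startsWithLatinLetter ? name : prefix + name
}

// UUID-based names may start with a digit, so they go through the same check.
let fileName = safeExportName(UUID().uuidString) + ".usdz"
print(fileName)
```

If UUID-based names still fail in your setup even when they start with a letter, adding the prefix unconditionally is the simpler option.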

Custom AR Session problem

If you want to integrate a custom AR session to work with your own configurations and pass it into the RoomCaptureView initializer, you may encounter several issues once your application runs, including:

  • Incorrect operation due to missing depth data
  • Stuttering and lag
  • Premature session termination if the app is minimized

This bug is also resolved starting from iOS 18, but it remains on earlier versions. If you need to use a custom AR session, it may be best to create a fully custom preview to ensure stable functionality.

Separate Coordinate Systems for Rooms

Each room scanned by RoomPlan has its own local coordinate system, which complicates integrating rooms within a unified space.

Developers must resort to workaround solutions to handle these transformations, making it challenging to work with multiple rooms cohesively in a single environment.
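
One possible workaround is sketched below, assuming every room is gravity-aligned with the Y axis pointing up and that you already know each room’s rotation about that axis and its offset in a shared frame (for example, from a doorway both scans have in common). `Point3` and `RoomPlacement` are hypothetical types; nothing here is provided by RoomPlan itself.

```swift
import Foundation

struct Point3 {
    var x: Double
    var y: Double
    var z: Double
}

// Hypothetical rigid transform from one room's local frame to a shared frame:
// a rotation about the vertical (Y) axis followed by a translation.
struct RoomPlacement {
    var yawRadians: Double
    var offset: Point3

    func toShared(_ p: Point3) -> Point3 {
        let c = cos(yawRadians)
        let s = sin(yawRadians)
        return Point3(
            x: c * p.x + s * p.z + offset.x,
            y: p.y + offset.y,
            z: -s * p.x + c * p.z + offset.z
        )
    }
}

// Example: a room rotated by 90 degrees and shifted 4 m along X.
let kitchen = RoomPlacement(yawRadians: .pi / 2, offset: Point3(x: 4, y: 0, z: 0))
let corner = kitchen.toShared(Point3(x: 1, y: 0, z: 0))
print(corner)
```

Applying the same placement to every wall corner and object position of a room brings its geometry into the shared frame; estimating the yaw and offset is the part that requires a custom solution.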

Summary

RoomPlan is an innovative framework that offers the ability to quickly create 3D models of spaces but brings with it many significant challenges. Although it is marketed as a convenient tool for design and visualization, its functions have notable limitations that should be considered.

The simplification of shapes, measurement inaccuracies, merging issues, and lack of easy preview customization make RoomPlan less versatile than it might initially seem. For professional use, where high precision and detail are required, RoomPlan may prove insufficiently reliable and demand additional processing of the generated models or even the development of custom post-processing solutions.

Fortunately, there are ways to enhance RoomPlan’s capabilities. By combining RoomPlan’s output with raw data from iOS sensors, refining RoomPlan’s data structures through custom C++ integrations, or applying advanced computer vision algorithms, it’s possible to achieve higher accuracy and improve the reliability of the generated models. Some solutions addressing these issues are already emerging, providing a pathway for those looking to maximize RoomPlan’s potential in their applications.

It’s worth noting, however, that this tool is relatively new, and Apple continues to improve it. Even now, we see a significant difference in RoomPlan’s performance between iOS 17 and iOS 18, with the latter offering noticeable improvements. Despite current shortcomings, RoomPlan has great potential and will likely become more functional as technology advances and updates are released.

Thus, using RoomPlan today requires a thorough assessment of its capabilities and limitations, as well as a willingness from developers to adapt to its specific requirements. For those prepared to put in the extra effort, this tool may still open up new possibilities in creating interactive and rich AR experiences.

Barcode Safari: Exploring the iOS Scan Frontier

Recently, we encountered a task in one of our projects involving the development of a product management system for a large warehouse. The system needed accurate and efficient barcode detection to streamline inventory tracking, reduce human errors, and optimize workflows.

We have different options to tackle this problem. Should we use a dedicated barcode detection technology, or integrate barcode detection within an Optical Character Recognition (OCR) framework? Let’s try both and find out!

After thorough investigation, we selected four libraries for detailed research: Vision, MLKit, ZXingObjC, and SwiftyTesseract.  The main challenge was ensuring that the system could scan and identify multiple types of barcodes quickly and with high accuracy. Given the scale of the warehouse operations, performance and reliability were critical factors.

During our investigation, we faced several challenges, including:

  • Accurately identifying different types of barcodes
  • Determining the position of barcodes in photos
  • Handling scenarios where multiple barcodes appear in the same frame
  • Achieving high performance with minimal lag during scanning
  • Ensuring that the selected solution is well-supported and actively maintained for compatibility with future Swift and iOS updates
  • Considering cross-platform compatibility for potential future Android implementation

Picking the right barcode detection solution is key. Every project has its own needs, and by understanding them, we can decide on the best technology for barcode detection in iOS.

Vision

The Vision framework, provided by Apple, offers built-in support for barcode detection, allowing easy implementation with minimal code and no additional dependencies. It integrates seamlessly with AVCaptureSession, making it straightforward to add barcode scanning capabilities to iOS apps.

One of the major advantages of Vision is its seamless integration with the Apple ecosystem, so you don’t need to rely on external libraries or frameworks. It also provides high performance, with an average barcode detection processing time of just 0.07 seconds, and it generally offers high accuracy. However, in some cases, Vision may add a leading zero to the decoded value, especially when the barcode itself starts with zero, so the result begins with two zeros. This behavior may require additional handling.

For example, the barcode in the picture below is detected as 0036000291452.

Barcode example
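
One way to handle the extra zero is sketched below. It rests on an assumption: a 13-digit result starting with “0” is treated as a 12-digit UPC-A code reported in 13-digit EAN-13 form, so the first zero is dropped. The function name `normalizedPayload` is hypothetical.

```swift
import Foundation

// Assumption: a 13-digit payload starting with "0" is a 12-digit UPC-A code
// reported in EAN-13 form, so the first zero can be dropped.
// Any other payload is returned unchanged.
func normalizedPayload(_ payload: String) -> String {
    let isAllDigits = !payload.isEmpty && payload.allSatisfy { $0.isASCII && $0.isNumber }
    guard isAllDigits, payload.count == 13, payload.hasPrefix("0") else {
        return payload
    }
    return String(payload.dropFirst())
}

print(normalizedPayload("0036000291452"))
```

Whether to strip the zero depends on how codes are stored in your system: if the database keeps 13-digit values, it is safer to leave the payload as reported.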

For the demo app, we created a minimalist UI with a choice of recognition modes and an on-screen display of results or errors for instant feedback.

The framework supports a wide variety of barcode formats, including both linear and 2D barcodes, and provides useful extra details, such as the bounding box and symbology of detected barcodes.
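
Vision reports the bounding box in normalized coordinates (0 to 1) with the origin in the lower-left corner, so it usually has to be converted before drawing over an image. The sketch below shows that conversion to pixel coordinates with a top-left origin; `pixelRect` is a hypothetical helper, and the image is assumed to be upright (no orientation handling).

```swift
import Foundation

// Converts a normalized rectangle with a lower-left origin into pixel
// coordinates with a top-left origin.
func pixelRect(fromNormalized box: CGRect, imageWidth: CGFloat, imageHeight: CGFloat) -> CGRect {
    // Top edge of the box in normalized, lower-left based coordinates.
    let normalizedTop = box.origin.y + box.size.height
    return CGRect(
        x: box.origin.x * imageWidth,
        y: (1 - normalizedTop) * imageHeight,
        width: box.size.width * imageWidth,
        height: box.size.height * imageHeight
    )
}

let normalizedBox = CGRect(x: 0.25, y: 0.5, width: 0.5, height: 0.25)
let overlayRect = pixelRect(fromNormalized: normalizedBox, imageWidth: 400, imageHeight: 200)
print(overlayRect)
```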

Another benefit is that Vision allows the detection of multiple barcodes within the same frame, which can be crucial for scanning large volumes of barcodes. Furthermore, Vision gives you the ability to specify a region of interest for barcode detection, which removes the need to crop the image beforehand.

You can also customize the barcode detection to focus on specific barcode symbologies or image orientation, which helps reduce unnecessary processing and false positives. With abundant resources like tutorials and official documentation available, integration and troubleshooting are made easier.

Barcodes Recognition with Apple Vision Framework

However, Vision does come with some limitations. It is exclusive to Apple platforms, so if you’re aiming for cross-platform compatibility, it may not be the best fit. Additionally, handling edge cases, such as damaged barcodes, can be challenging. Also, the issue of leading zeros in certain barcodes might require extra coding effort to ensure accuracy in all cases.

To use barcode detection in your app, you only need to import the Vision framework and add the code for barcode detection.

Here’s a simple example that demonstrates how to implement barcode detection with Vision.


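import UIKit
import Vision

// Detects barcodes in a photo stored on disk and reports the first result
// (payload string and normalized bounding box), or nil if nothing is found.
func detectWithVision(
        photo: URL,
        completion: @escaping ((String, CGRect)?) -> ()
    ) {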
        guard
            let image = UIImage(contentsOfFile: photo.path),
            let cgImage = image.cgImage
        else {
            completion(nil)
            return
        }
        
        let request = VNDetectBarcodesRequest { (request, error) in
            guard
                let results = request.results as? [VNBarcodeObservation],
                error == nil
            else {
                completion(nil)
                return
            }
            
            let detectedBarcodes: [(String, CGRect)] = results.compactMap {
                guard let payloadStringValue = $0.payloadStringValue else {
                    return nil
                }
                return (payloadStringValue, $0.boundingBox)
            }
            
            completion(detectedBarcodes.first)
        }
        
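        // Note: `cgImagePropertyOrientation` is not a built-in UIImage property.
        // It is assumed here to be a project extension that maps
        // UIImage.Orientation to CGImagePropertyOrientation.
        let handler = VNImageRequestHandler(
            cgImage: cgImage,
            orientation: image.cgImagePropertyOrientation
        )
        try? handler.perform([request])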
    }

MLKit

MLKit, developed by Google, provides robust barcode detection for both iOS and Android, offering a cross-platform solution that supports multiple barcode formats.

One of its standout features is the ability to handle multiple barcodes in a single frame, making it ideal for scanning several items at once.

In addition to detecting barcodes, MLKit also provides detailed information for each result, including the barcode’s frame, format, and any specific data it contains – such as URLs, phone numbers, emails, or Wi-Fi credentials. It supports a wide range of barcode formats, covering both linear and 2D types.

The framework provides solid performance with an average processing time of around 0.16 seconds, which is still relatively fast. It has high accuracy and, unlike Vision, does not add extra leading zeros to barcodes. Additionally, it performs well in detecting damaged barcodes, making it a versatile choice for real-world scenarios. MLKit also offers comprehensive documentation and is regularly updated by Google. You can also specify the specific barcode formats you’re interested in, helping optimize performance and reduce unnecessary processing.

For example, with damaged barcodes like those shown below, MLKit still works reliably, whereas other solutions might struggle.

Damaged Barcode Detection with MLKit by Google

However, MLKit does come with some drawbacks. For iOS, integration requires using CocoaPods, as it is not available through Swift Package Manager (SPM), which can make the initial setup more complicated.

Additionally, while it supports multiple barcode detections, if you need to specify a region of interest, you will either have to crop the image beforehand to focus on that area or implement additional filtering logic after detection. This extra step can increase the complexity of the handling process.
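
The post-detection filtering option can be sketched as follows. `Detection` is a hypothetical stand-in for whatever result type you map the MLKit output to, and the rule used here (keep a barcode if the center of its frame lies inside the region) is only one possible choice.

```swift
import Foundation

// Hypothetical detection result: decoded value plus its frame in image coordinates.
struct Detection {
    let value: String
    let frame: CGRect
}

// Keeps only detections whose frame center lies inside the region of interest.
func detections(_ all: [Detection], inside region: CGRect) -> [Detection] {
    all.filter { detection in
        let centerX = detection.frame.origin.x + detection.frame.size.width / 2
        let centerY = detection.frame.origin.y + detection.frame.size.height / 2
        return centerX >= region.origin.x
            && centerX <= region.origin.x + region.size.width
            && centerY >= region.origin.y
            && centerY <= region.origin.y + region.size.height
    }
}

let found = [
    Detection(value: "A", frame: CGRect(x: 10, y: 10, width: 20, height: 10)),
    Detection(value: "B", frame: CGRect(x: 300, y: 300, width: 20, height: 10))
]
let kept = detections(found, inside: CGRect(x: 0, y: 0, width: 100, height: 100))
print(kept.map { $0.value })
```

Filtering after detection still processes the whole image, so cropping beforehand remains the better option when processing time matters more than code simplicity.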

To integrate MLKit into your iOS project, follow the official documentation. Once MLKit is integrated into your project, you can implement barcode detection using the following code example.

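import MLKitBarcodeScanning
import MLKitVision
import UIKit

// Detects barcodes in a photo stored on disk with MLKit and reports the first
// result (display value and frame), or nil if nothing is found.
func detectWithMLKit(
        photo: URL,
        completion: @escaping ((String, CGRect)?) -> ()
    ) {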
        guard let image = UIImage(contentsOfFile: photo.path) else {
            completion(nil)
            return
        }
        
        let visionImage = VisionImage(image: image)
        visionImage.orientation = image.imageOrientation
        
        let barcodeScanner = BarcodeScanner.barcodeScanner()
        
        barcodeScanner.process(visionImage) { (barcodes, error) in
            guard let barcodes = barcodes, error == nil else {
                completion(nil)
                return
            }
            
            var detectedBarcodes: [(String, CGRect)] = []
            detectedBarcodes = barcodes.compactMap {
                guard let value = $0.displayValue else {
                    return nil
                }
                return (value, $0.frame)
            }
            
            completion(detectedBarcodes.first)
        }
    }

ZXingObjC

ZXingObjC is an open-source library for barcode scanning on iOS, and it’s part of the broader ZXing (Zebra Crossing) project. It supports a wide range of barcode formats – including some not covered by Vision or MLKit – such as RSS14 and Maxicode, making it a good fit for projects that need specialized or legacy barcode support.

To integrate ZXingObjC, you can use CocoaPods or Carthage. For barcode-focused apps, the ZXCapture class offers a straightforward way to implement real-time scanning without setting up your own AVCapture session.

However, the integration process is more complex compared to other solutions. ZXingObjC can also add leading zeros to barcodes and struggles with detecting damaged barcodes. Its performance is slower than Vision and MLKit, with an average processing time of 0.3 seconds. Additionally, the accuracy of barcode detection can be inconsistent, especially when the barcode is at a non-optimal angle, so the user may need to adjust the angle for detection. ZXingObjC does not support detecting multiple barcodes simultaneously, and it lacks the ability to specify image rotation. Furthermore, while it can provide the coordinates of a detected barcode, it only returns two points, meaning you don’t get the full bounding box or frame of the barcode.

Another concern is that ZXingObjC is no longer actively maintained, and there have been no updates for some time, which raises questions about its long-term reliability and future compatibility. As shown in the example below, the detection results can vary depending on the angle, lighting, and visibility of smaller elements.

Barcodes scanning on iOS with ZXingObjC Open-Source Library

To add ZXingObjC to your project, follow the instructions in the GitHub repository. Once you have ZXingObjC integrated into your project, you can use the following code example to implement barcode detection.

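import UIKit
import ZXingObjC

// Decodes a single EAN-13 barcode from a photo stored on disk.
// No bounding box is available, so the rectangle is reported as .null.
func detectWithZXing(
        photo: URL,
        completion: @escaping ((String, CGRect)?) -> ()
    ) {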
        guard
            let image = UIImage(contentsOfFile: photo.path),
            let cgImage = image.cgImage else {
            completion(nil)
            return
        }
        
        DispatchQueue.global().async {
            let source = ZXCGImageLuminanceSource(cgImage: cgImage)
            let binarizer = ZXHybridBinarizer(source: source)
            let bitmap = ZXBinaryBitmap(binarizer: binarizer)
            let reader = ZXMultiFormatReader()
            
            let hints = ZXDecodeHints()
            hints.tryHarder = true
            hints.addPossibleFormat(kBarcodeFormatEan13)
            
            reader.hints = hints
            
            do {
                let result = try reader.decode(bitmap, hints: hints)
                if let value = result?.text {
                    completion((value, .null))
                } else {
                    completion(nil)
                }
            } catch {
                log.error(error: error)
                completion(nil)
            }
        }
    }

SwiftyTesseract

SwiftyTesseract, built on Google’s Tesseract OCR, is primarily designed for optical character recognition (OCR), but it can be adapted for extracting barcode numbers when that is the main goal. It integrates easily with Swift Package Manager (SPM), but it requires additional setup, such as downloading the appropriate language training files and adding them to your project. Since SwiftyTesseract is not specifically tailored for barcode detection, its capabilities are quite limited in this context. To achieve optimal results, the image must first be cropped to the region containing the barcode, and it should be free of additional text. Furthermore, the image quality must be high; otherwise, the results may be inconsistent or inaccurate.

However, even when the image is cropped properly and of good quality, it may still miss some numbers or produce completely inaccurate results. Its performance is also a major concern, with an average processing time of around 2 seconds for a cropped image and approximately 12 seconds for the original image, making it unsuitable for real-time or high-performance barcode detection.

Additionally, it cannot be used for non-text-based barcodes. The library is quite old and is no longer actively maintained, further limiting its reliability and support.

In the example below, it sometimes reads a text-based barcode correctly, but other times it produces an entirely incorrect result.

Barcodes Recognition on iOS with SwiftyTesseract

To integrate this library into your project, follow the steps outlined in its GitHub repository. Be sure to pay attention to the “Additional configuration” section, as you will need to add language training files to your project.

After completing the setup, you can use the following code example to implement barcode detection.

func detectWithTesseract(
        photo: URL,
        rectOfInterest: CGRect,
        completion: @escaping ((String, CGRect)?) -> ()
    ) {
        guard
            let image = UIImage(contentsOfFile: photo.path),
            let croppedImage = image.cropping(to: rectOfInterest)
        else {
            completion(nil)
            return
        }

        let tesseract = Tesseract(
            language: .english,
            dataSource: Bundle.main
        )
        tesseract.allowList = "0123456789"

        DispatchQueue.global().async {
            let result = tesseract.performOCR(on: croppedImage)
            switch result {
            case .success(let text):
                completion((text, .null))
            case .failure(let error):
                log.error(error: error)
                completion(nil)
            }
        }
    }
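
Because OCR output can contain missing or wrong digits, one possible safeguard is to validate the recognized string before passing it on. The sketch below assumes the barcodes of interest are EAN-13 (the format the ZXingObjC example above restricts decoding to) and checks the standard EAN-13 check digit. The function names `isValidEAN13` and `validatedBarcode(fromOCR:)` are hypothetical helpers, not part of SwiftyTesseract.

```swift
import Foundation

/// Returns true when `candidate` is exactly 13 ASCII digits and its last digit
/// matches the EAN-13 check digit computed from the first twelve.
func isValidEAN13(_ candidate: String) -> Bool {
    let digits = candidate.compactMap { $0.isASCII ? $0.wholeNumberValue : nil }
    guard candidate.count == 13, digits.count == 13 else { return false }

    // Weights alternate 1, 3, 1, 3, ... over the first twelve digits.
    let weightedSum = digits.prefix(12).enumerated().reduce(0) { sum, pair in
        sum + pair.element * (pair.offset % 2 == 0 ? 1 : 3)
    }
    let checkDigit = (10 - weightedSum % 10) % 10
    return checkDigit == digits[12]
}

/// Hypothetical helper: strips whitespace and newlines that OCR output
/// often contains, then returns the text only if it is a valid EAN-13.
func validatedBarcode(fromOCR text: String) -> String? {
    let cleaned = text.filter { !$0.isWhitespace }
    return isValidEAN13(cleaned) ? cleaned : nil
}
```

A result that fails this check could be discarded instead of being reported to the user. The check catches many single-digit misreads, but it does not make the OCR approach any faster.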

 

Final Comparison

Based on the challenges we faced and the requirements for our barcode detection system, we developed a list of criteria to compare each technology.

After researching and testing the selected technologies, we are able to conduct a comparative analysis of their performance.

Let’s do this in the form of a bar chart, with the horizontal axis showing the time taken to process the image, and the vertical axis showing the selected technologies and their results:

Solutions Comparison for Barcode Recognition on iOS

The difference between the results is significant and, in some cases, critical.

Projecting these results onto the user experience, Vision and MLKit both show high performance and can confidently be recommended for inclusion in a project. ZXingObjC, by contrast, takes about 300 ms per image, which is significantly slower than the two libraries above but can still provide a comfortable user experience in real time.

SwiftyTesseract shows the worst frame processing time, so it cannot be used in real-time applications, although it can still be used with saved photos or in background tasks. The slowness follows from the general OCR approach: the engine recognizes all characters in the image and only then filters down to the ones we have selected.

Below is a detailed comparison of Vision, MLKit, ZXingObjC, and SwiftyTesseract based on key factors:

Criteria | Vision | MLKit | ZXingObjC | SwiftyTesseract
Ease of integration | High | Medium | Medium | Medium
Supported formats | Codabar, Code 39, Code 93, Code 128, EAN-8, EAN-13, ITF, UPC-A, UPC-E, Aztec, Data Matrix, PDF417, QR code | Codabar, Code 39, Code 93, Code 128, EAN-8, EAN-13, ITF, UPC-A, UPC-E, Aztec, Data Matrix, PDF417, QR code | Codabar, Code 39, Code 93, Code 128, EAN-8, EAN-13, ITF, UPC-A, UPC-E, Aztec, Data Matrix, PDF417, QR code, Maxicode, RSS-14 | Only text-based
Performance | 0.07 sec | 0.16 sec | 0.3 sec | 2 sec
Accuracy | High | High | Medium | Low
Cross-platform | No | Yes | Yes | Yes
Additional info | Barcode format + frame | Barcode format + frame | Only barcode format | None
Multiple detection | Yes | Yes | No | No
Tutorials / docs | High | High | Low | Medium
Library support and updates | Yes | Yes | No | No

Barcodes Recognition on iOS: Conclusion

Each barcode detection library for iOS has its advantages and disadvantages, making the choice dependent on specific project requirements.

Vision: Ideal for projects that prioritize ease of integration, high performance, and simplicity over cross-platform support and ultra-high accuracy. It offers a seamless experience with good results, making it the best choice for applications that don’t require support for multiple platforms and where barcode detection is essential but not necessarily perfect.

MLKit: The go-to solution for cross-platform applications, especially when accuracy is critical and the ability to detect even damaged barcodes is required. It is highly supported with comprehensive documentation and frequent updates, making it an excellent choice for applications that need reliable performance across both iOS and Android.

ZXingObjC: A solid option for projects needing support for barcode formats not available in Vision or MLKit, such as Maxicode and RSS-14. However, the integration is more complex, and the lack of ongoing support could lead to issues in the future. It is a good option for projects with specific barcode format requirements but less ideal for projects requiring long-term stability and maintenance.

SwiftyTesseract: Not recommended for traditional barcode detection. It’s more suitable for projects where OCR is the primary focus, with barcode detection as a secondary task. It can handle only text-based barcodes and has slower performance, making it unsuitable for high-performance barcode scanning.

Ultimately, the choice depends on your project’s goals and constraints. Will you opt for the simplicity and speed of Vision, the cross-platform power of MLKit, the extended format support of ZXingObjC, or the OCR focus of SwiftyTesseract? The decision is yours.

This exploration has been a real challenge, showing us that a seemingly simple question can lead to complex answers. Which solution would you choose?

Apple’s ARKit vs. Eye Fatigue

In today’s world, digital devices dominate our daily lives, with significant time spent in front of screens: computers, smartphones, tablets, and so on. While this lifestyle is an inevitable part of modern life, it also places substantial strain on our eyes. For many, eye fatigue has become routine, and if ignored, it can result in serious health issues. Key symptoms include poor sleep, light sensitivity, and reduced productivity.

Anyone experiencing these symptoms should, first and foremost, reduce their screen time. However, this is not always possible. Another option is eye exercises, and an application that guides a person through a set of them would be beneficial. That’s what we’re going to create today.

Eye Tracking

The key feature of an eye-training app is eye tracking. Tracking eye movement lets the app accurately assess whether an exercise was completed and give the user appropriate feedback.

To implement the eye tracking function, we compared several potential solutions:

Tracking Type | Vision | MLKit | ARKit
Process Time* | ±7.3 ms | ±14.25 ms | ±8.6 ms
Output Data Type | 2D | 2D | 3D
Individual Pupil Tracking | - | - | +
Setup Code | Small | Large | Small
Guides and Tutorials | Many | Few | Many
Multiplatform | - | + | -

* – 1080p 60 fps iPhone 14 Pro, Front Camera, median

  • Vision Framework: Provides extensive capabilities for 2D face tracking and keypoint detection, such as eye tracking. However, its accuracy and functionality when working with pupils are limited compared to ARKit.
  • Google ML Kit: A cross-platform solution with basic face and eye area tracking capabilities. The main drawbacks include slower frame processing on iOS compared to native tools and challenges in working with pupil tracking.
  • ARKit (ARFaceTracking): An Apple platform offering powerful tools for eye tracking in a 3D space. ARKit delivers precise data through the use of the TrueDepth camera and provides the best native implementation for pupil tracking.

Currently, there is no requirement for cross-platform implementation, as our focus is solely on iOS, where frame processing speed is critical. Additionally, ARKit’s output in a 3D format offers a more advanced implementation, providing deeper visualization options, better customization, and a more comprehensive picture of user actions.

Based on the above considerations, we have chosen ARKit (ARFaceTracking) to implement the eye tracking service.

First, we will define the ARSessionManager protocol and data models for processing results.

We will create the EyeTrackingData model to store data about the position of each eye in all expected states, enabling us to process the results from ARFaceAnchor and retain them:

final class EyeTrackingData {
    // MARK: - Properties
    var eyeLookInLeft: Float
    var eyeLookOutLeft: Float
    var eyeLookInRight: Float
    var eyeLookOutRight: Float
    var eyeLookUpLeft: Float
    var eyeLookDownLeft: Float
    var eyeLookUpRight: Float
    var eyeLookDownRight: Float
    var eyeBlinkLeft: Float
    var eyeBlinkRight: Float
    var eyeWideLeft: Float
    var eyeWideRight: Float
    
    // MARK: - Init
    init(...) { ... }
}

Now let’s describe the ARSessionManager protocol and the ARSessionManagerDelegate delegate, which will return the results for further use:

protocol ARSessionManager: AnyObject {
    // MARK: - Funcs
    func setDelegate(_ delegate: ARSessionManagerDelegate)
    func setupSession() -> ARSCNView
    func startSession()
    func pauseSession()
}

protocol ARSessionManagerDelegate: AnyObject {
    func didUpdateEyeTrackingData(_ data: EyeTrackingData)
}

When implementing ARSessionManager, it is important to consider the following configurations:

  • Using arSessionQueue to isolate the service’s operation queue from the UI, preventing interface blocking;
  • Using ARFaceTrackingConfiguration to explicitly specify the type of tracking.

final class ARSessionManagerImpl: NSObject, ARSessionManager {
    // MARK: - Delegate
    // Held weakly to avoid a retain cycle between the manager and its owner.
    private weak var delegate: ARSessionManagerDelegate?
    
    // MARK: - Properties
    private var configurations: ARConfiguration?
    private let arSessionQueue = DispatchQueue(
        label: "ar-session-queue",
        qos: .userInitiated,
        attributes: [],
        autoreleaseFrequency: .workItem
    )
    
    // MARK: - ARSceneView
    private var sceneARView = ARSCNView()
    
    // MARK: - Set
    func setDelegate(_ delegate: ARSessionManagerDelegate) {
        self.delegate = delegate
    }
    
    func setupSession() -> ARSCNView {
        configurations = ARFaceTrackingConfiguration()
        sceneARView.delegate = self
        return sceneARView
    }
}

The methods startSession() and pauseSession() are provided for session management:

// MARK: - Controls
extension ARSessionManagerImpl {
    func startSession() {
        arSessionQueue.async {
            guard let config = self.configurations else { return }
            self.sceneARView.session.run(config, options: [
                .resetTracking, .removeExistingAnchors
            ])
        }
    }
    
    func pauseSession() {
        arSessionQueue.async {
            self.sceneARView.session.pause()
        }
    }
}

To accomplish the primary function, tracking the user’s eye state and transmitting the relevant data, we implement the renderer(_:didUpdate:for:) method of ARSCNViewDelegate. This method gives access to the updated ARFaceAnchor and its associated data set.

One of the key components returned by ARFaceAnchor is blendShapes. These are a set of parameters that describe specific facial positions and states, such as blinking, eye movements, or changes in mouth shape. Each of these positions is represented as a numeric value ranging from 0.0 to 1.0, indicating the intensity of a particular action or position.

BlendShapes are crucial for accurately determining the user’s eye state. For instance, the parameters eyeBlinkLeft and eyeBlinkRight indicate the blinking level of the left and right eyes, while eyeLookUpLeft or eyeLookOutRight show the gaze direction. Apple provides visualizations and documentation for these parameters, which greatly simplifies their integration into application development.

Eye Blinking

// MARK: - ARSCNViewDelegate
extension ARSessionManagerImpl: ARSCNViewDelegate {
    func renderer(
        _ renderer: SCNSceneRenderer,
        didUpdate node: SCNNode,
        for anchor: ARAnchor
    ) {
        guard let faceAnchor = anchor as? ARFaceAnchor else { return }
        let blendShapes = faceAnchor.blendShapes
        
        let eyeTrackingData = EyeTrackingData(
            eyeLookInLeft: blendShapes[.eyeLookInLeft]?.floatValue,
            eyeLookOutLeft: blendShapes[.eyeLookOutLeft]?.floatValue,
            eyeLookInRight: blendShapes[.eyeLookInRight]?.floatValue,
            eyeLookOutRight: blendShapes[.eyeLookOutRight]?.floatValue,
            eyeLookUpLeft: blendShapes[.eyeLookUpLeft]?.floatValue,
            eyeLookDownLeft: blendShapes[.eyeLookDownLeft]?.floatValue,
            eyeLookUpRight: blendShapes[.eyeLookUpRight]?.floatValue,
            eyeLookDownRight: blendShapes[.eyeLookDownRight]?.floatValue,
            eyeBlinkLeft: blendShapes[.eyeBlinkLeft]?.floatValue,
            eyeBlinkRight: blendShapes[.eyeBlinkRight]?.floatValue,
            eyeWideLeft: blendShapes[.eyeWideLeft]?.floatValue,
            eyeWideRight: blendShapes[.eyeWideRight]?.floatValue
        )
        
        delegate?.didUpdateEyeTrackingData(eyeTrackingData)
    }
}
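
To illustrate how the 0.0 to 1.0 blink coefficients might be turned into discrete events, here is a minimal sketch of a blink counter with hysteresis. It is not the app’s actual implementation: the type `BlinkCounter` and both threshold values are assumptions chosen for illustration. It takes plain Float values, so it can be fed from eyeBlinkLeft and eyeBlinkRight.

```swift
import Foundation

/// Counts blinks in a stream of eye-closure coefficients (0.0 = open, 1.0 = closed).
struct BlinkCounter {
    // Hypothetical thresholds; they would need tuning per user, e.g. during calibration.
    let closedThreshold: Float
    let openThreshold: Float
    private(set) var blinkCount = 0
    private var isClosed = false

    init(closedThreshold: Float = 0.7, openThreshold: Float = 0.3) {
        self.closedThreshold = closedThreshold
        self.openThreshold = openThreshold
    }

    /// Feed one frame. Both eyes must pass a threshold before the state changes.
    mutating func process(left: Float, right: Float) {
        if !isClosed, min(left, right) >= closedThreshold {
            isClosed = true
        } else if isClosed, max(left, right) <= openThreshold {
            isClosed = false
            blinkCount += 1   // a blink is counted when the eyes reopen
        }
    }
}
```

Using two thresholds instead of one keeps a coefficient that jitters around a single cut-off value from being counted as several blinks.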

We have created the EyeTrackingData model and defined the complete logic for ARSessionManager, which works with ARFaceTrackingConfiguration and provides the expected data. Now, we will focus on implementing the service that will process the results and determine whether the selected exercises have been completed.

To begin, it is necessary to create appropriate working models to describe the exercises and the criteria for their completion, such as eye positions. In our case, exercises will define the direction of the gaze relative to the center, meaning that the exercise name and the eye position can match:

enum EyeExercise: CaseIterable {
    case right
    case left
    case up
    case down
    case topLeft
    case topRight
    case bottomLeft
    case bottomRight
    case blink
}

Next, we need to define the criteria for the ExerciseService, i.e., its protocol. In our case, it will have combined functionality, meaning it will both create the training sequence and verify whether the current exercise is completed, then switch to the next one.


protocol ExerciseService {
    func regenerateExercises(type: TrainingSetType)
    func isCurrentExerciseCompleted(
        inputData: EyeTrackingData,
        user: UserData?
    ) -> Bool
}

The implementation of the isCurrentExerciseCompleted() method is critical to the functionality of our app, as this method determines whether the current exercise has been successfully completed:

func isCurrentExerciseCompleted(
    inputData: EyeTrackingData,
    user: UserData?
) -> Bool {
    /// We’ll check the input data value of each eye separately and determine
    /// its position to make sure that the exercise is being completed.
    /// For blinks, we will check whether the eyes were closed
    /// (i.e., no pupils are visible)
}
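
One way the per-eye check described in the comments could look is sketched below. This is an assumption-laden illustration rather than the app’s real logic: `GazeSample` is a hypothetical stand-in for the gaze fields of EyeTrackingData, `GazeDirection` mirrors the directional cases of EyeExercise, and the 0.4 threshold is arbitrary. It also assumes that "left" and "right" refer to the user’s own eyes. With a mirrored front-camera preview, the horizontal sign may need to be flipped.

```swift
import Foundation

/// Hypothetical stand-in for the article's EyeTrackingData (gaze fields only).
struct GazeSample {
    var lookInLeft: Float = 0, lookOutLeft: Float = 0
    var lookInRight: Float = 0, lookOutRight: Float = 0
    var lookUpLeft: Float = 0, lookDownLeft: Float = 0
    var lookUpRight: Float = 0, lookDownRight: Float = 0
}

enum GazeDirection: Equatable {
    case center, right, left, up, down
    case topLeft, topRight, bottomLeft, bottomRight
}

/// Classifies a sample by averaging both eyes on each axis.
/// "In" means toward the nose, so the left eye looking in and the right eye
/// looking out both indicate a gaze toward the user's right.
func classify(_ s: GazeSample, threshold: Float = 0.4) -> GazeDirection {
    let horizontal = ((s.lookInLeft + s.lookOutRight) - (s.lookOutLeft + s.lookInRight)) / 2
    let vertical = ((s.lookUpLeft + s.lookUpRight) - (s.lookDownLeft + s.lookDownRight)) / 2

    // Reduce each axis to -1, 0, or 1 depending on which side of the threshold it falls.
    let h = horizontal > threshold ? 1 : (horizontal < -threshold ? -1 : 0)
    let v = vertical > threshold ? 1 : (vertical < -threshold ? -1 : 0)

    switch (h, v) {
    case (1, 0): return .right
    case (-1, 0): return .left
    case (0, 1): return .up
    case (0, -1): return .down
    case (-1, 1): return .topLeft
    case (1, 1): return .topRight
    case (-1, -1): return .bottomLeft
    case (1, -1): return .bottomRight
    default: return .center
    }
}
```

Under these assumptions, an exercise could be considered completed when the classified direction equals the expected one. In a real app, the fixed threshold would be replaced by per-user values.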

 

In our specific case, we employ the MVP architectural pattern, where data from ARSessionManager is returned via a delegate to the Presenter. In the Presenter, the data is processed using the ExerciseService class, which is responsible for structuring the training sequence and verifying the completion of the current exercise. These results are then processed to provide the user with appropriate feedback.
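
The data flow described above can be sketched with a few stub types. Everything here is hypothetical and simplified for illustration: `EyeSample`, `ExerciseChecking`, `TrainingView`, and `TrainingPresenter` are invented names, and the real Presenter, service, and view would carry more state.

```swift
import Foundation

// Hypothetical, trimmed-down stand-ins for the article's types.
struct EyeSample {
    var blinkLeft: Float
    var blinkRight: Float
}

protocol ExerciseChecking: AnyObject {
    func isCurrentExerciseCompleted(inputData: EyeSample) -> Bool
}

protocol TrainingView: AnyObject {
    func showExerciseCompleted()
}

final class TrainingPresenter {
    private let service: ExerciseChecking
    private weak var view: TrainingView?   // weak: the view owns the presenter

    init(service: ExerciseChecking, view: TrainingView) {
        self.service = service
        self.view = view
    }

    /// Called from the session manager's delegate callback for every update.
    func didUpdateEyeTrackingData(_ data: EyeSample) {
        guard service.isCurrentExerciseCompleted(inputData: data) else { return }
        // In a UIKit app, this call would need to be dispatched to the main thread.
        view?.showExerciseCompleted()
    }
}
```

Keeping the completion check inside the service means the Presenter only decides what feedback to show, which makes both parts easier to test in isolation.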

Calibration: A Crucial Step

Before a user begins using the app regularly, it is critical to perform a calibration process. Each individual is unique, with different eye positions, varying limits on rotation and movement, varying eye depth in the skull, and other physiological differences.

To ensure the comprehensive and high-quality functionality of our app, we must include a dedicated calibration feature. This involves creating a specific training sequence — a set of exercises that accounts for a maximum number of positions and states.

Additionally, an informational Best Practices screen should be implemented to educate and guide the user effectively.

At the end of the calibration (as with every workout), it’s worth adding a rewards screen to highlight the end of the workout and give the user a sense of accomplishment.

Best Calibration Tips

To achieve this, we will proceed with the following steps:

  1. Perform two cycles of EyeExercise with a pause of 5-10 seconds between each exercise. This will allow us to determine typical eye deviations and their positions for each exercise.
  2. Save these results in the corresponding values of UserData with a coefficient of 0.8. This adjustment will account for the natural imperfections in human movements and the variability of results.
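
The two steps above can be sketched as a small helper. Averaging the peak values from the calibration cycles is an assumption made for this example; the function name `calibratedThreshold` and its parameters are hypothetical. Only the 0.8 coefficient comes from the steps above.

```swift
import Foundation

/// Derives a per-user completion threshold for one gaze direction.
/// `peaks` holds the strongest coefficient (0.0...1.0) observed in each
/// calibration cycle for that direction; two cycles give two values.
func calibratedThreshold(peaks: [Float], coefficient: Float = 0.8) -> Float? {
    guard !peaks.isEmpty else { return nil }
    let mean = peaks.reduce(0, +) / Float(peaks.count)
    // Scaling down leaves headroom for natural variation between attempts.
    return mean * coefficient
}
```

For example, if a user reaches 0.9 and 0.7 when looking up during the two cycles, the stored value would be about 0.64, so later attempts do not have to match the calibration maximum exactly.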

Eye Tracking 1

Eye Tracking 2

After this, the user is guided through a set of exercises in which they move their eyes in all directions.

More About the Application

Data Input Form and Its Purpose

For personalized user interaction and efficient data storage and management, we utilize Apple’s CoreData framework. This allows for seamless operation with a local database and offers flexibility in handling data.

We create a UserData model to store essential user information, along with child entities to manage and track exercises (see the relationship diagram below):

Data Input Schemes

During the initial setup (onboarding), the user is prompted to enter the following information:

  • Working hours: Start time and duration of the workday spent at the computer;
  • Working days: The days of the week when the user is actively working.

 

Application Onboarding Setup

This data is essential for personalizing notifications to align with the user’s work schedule and ensure they are not intrusive during non-working hours.

Notifications

Regular breaks and exercises are really important, so a simple feature like scheduled reminders throughout the day is a must.

To handle notification creation and management, we first define a protocol NotificationService, where we outline the required functionality:

protocol NotificationService: AnyObject {
    func scheduleNotifications(user: UserData, timeReminder: Int)
    func rescheduleNotifications(user: UserData)
}

 

Next, we will implement the methods scheduleNotifications() and rescheduleNotifications(), which will handle creating notifications based on the user’s onboarding questionnaire and updating them if the user completes eye exercises between reminders.

func scheduleNotifications(
    user: UserData,
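    timeReminder: Int   /// number of hours between notifications
) {
    /// Note: `lastWorkout` and `workDays` are not declared in this snippet;
    /// they are assumed to be provided by the enclosing service.
    /// stride(from:to:by:) traps on a zero step, so reject non-positive intervals.
    guard timeReminder > 0 else { return }
    let workingHours = Int(user.workingTime)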
    let startHour = Calendar.current.component(.hour, from: lastWorkout)
    UNUserNotificationCenter.current().removeAllPendingNotificationRequests()
    
    for day in workDays {
        for hour in stride(
            from: startHour + timeReminder,
            to: startHour + workingHours,
            by: timeReminder
        ) {
            addNotification(day: day, hour: hour, lastWorkout: lastWorkout)
        }
    }
}

A private method addNotification() has been added to create a request. This method provides the context and trigger for the notification and adds it to the general notification pool.

private func addNotification(day: Int, hour: Int, lastWorkout: Date) {
    var dateComponents = DateComponents()
    dateComponents.weekday = day
    dateComponents.hour = hour
    
    if let notificationDate = Calendar.current.nextDate(
        after: lastWorkout,
        matching: dateComponents,
        matchingPolicy: .nextTime
    ) {
        /// Set notification content
        let content = UNMutableNotificationContent()
        content.title = Strings.NotificationService.title
        content.body = Strings.NotificationService.body
        
        /// Set notification trigger
        let trigger = UNCalendarNotificationTrigger(
            dateMatching: Calendar.current.dateComponents(
                [.year, .month, .day, .hour, .minute, .second],
                from: notificationDate
            ),
            repeats: false
        )
        
        let request = UNNotificationRequest(
            identifier: UUID().uuidString,
            content: content,
            trigger: trigger
        )
        
        UNUserNotificationCenter.current().add(request) { (error) in
            if let error = error {
                /// handling the error
            }
        }
    }
}

The implementation of rescheduleNotifications() remains similar, with the consideration that current notifications will be recreated for the remainder of the workday.

For example, if a user works from 9:00 AM to 5:00 PM with a reminder interval of every 2 hours, notifications will be sent at 11:00 AM, 1:00 PM, and 3:00 PM. Notifications will not be sent during non-working hours or days, ensuring they are non-intrusive and aligned with the user’s personal schedule.
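
The arithmetic behind this example can be checked with a small pure function that reuses the stride call from scheduleNotifications(). The name `reminderHours` is hypothetical, and dropping hours past midnight is a simplification made for this sketch.

```swift
import Foundation

/// Returns the hours (0...23) at which reminders fire on a working day.
func reminderHours(startHour: Int, workingHours: Int, interval: Int) -> [Int] {
    // stride(from:to:by:) traps on a zero step, so guard against it.
    guard interval > 0, workingHours > 0 else { return [] }
    return stride(from: startHour + interval, to: startHour + workingHours, by: interval)
        .filter { $0 < 24 }   // hours past midnight are dropped in this sketch
}

// A 9:00 AM start, an 8-hour day, and a 2-hour interval give 11, 13, and 15.
let hours = reminderHours(startHour: 9, workingHours: 8, interval: 2)
```

Because the upper bound of the stride is exclusive, no reminder is scheduled at the very end of the workday (5:00 PM in this case).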

Colors

Last but not least is the UI color scheme. User interface design and user experience are critical for eye health applications, as the right color scheme can reduce eye strain and enhance user perception (DevTo). UI colors for the app were chosen based on the principles of color psychology and their impact on users (MockFlow, HappyDesign).

Eye Tracking App Colors

Conclusion

In today’s world, digital devices dominate our lives, yet we often overlook the long-term impact of prolonged screen time on our eyes. Symptoms like migraines, disrupted sleep, light sensitivity, and reduced productivity may begin subtly but can escalate into significant health issues. Apps like ours aim to address these challenges proactively, promoting better eye health and well-being.

Eye Tracking Meme

Building an app to combat eye fatigue requires more than technical expertise; it demands thoughtful design. Eye-tracking technology must balance performance, accuracy, and platform compatibility for seamless integration. Equally vital is the user experience – interfaces should reduce eye strain with adaptive color schemes and feel intuitive to use. Notifications play a key role in encouraging regular breaks, fostering healthier habits.

Challenges remain, such as hardware limitations (e.g., TrueDepth camera availability) and the need for robust onboarding and calibration processes to personalize the experience. User education is also critical, ensuring awareness of the importance of eye care and exercises.

Our app leverages ARKit with ARFaceTracking for precise, efficient three-dimensional eye tracking. The ARSessionManager isolates session handling, ensuring smooth data flow to the Presenter, where exercises are monitored. Adaptive color schemes reduce strain, while smart notifications remind users to take breaks, tailored to their schedules.

This demonstrates how technology can address real-world health issues. However, opportunities abound – whether through integrating third-party platforms or enhancing functionality with machine learning for greater precision and personalization.

How would you implement eye tracking in your app? 

Perhaps it’s time to explore the possibilities that machine learning could bring to the table. After all, the future of eye tracking is only limited by the scope of our imagination.