Our digital world has become increasingly visual. Firms rely on images to convey brand identity, signal quality, evoke emotions, and influence consumer decisions across digital advertising, social media, and e-commerce platforms. At the same time, consumers actively generate and share images of themselves, products, and experiences on social media and online review sites. Visual content also plays a central role on platforms such as Airbnb, LinkedIn, charity crowdfunding sites, freelancing marketplaces, dating apps, and resale platforms, where images shape outcomes ranging from bookings and hiring to donations and sales.
As a result, the ability to systematically analyze visual content has become essential for both managers and academic researchers. Image analytics enables firms to move beyond subjective evaluations of creative assets. By quantifying visual characteristics at scale, firms can evaluate, optimize, and personalize visual communication strategies across markets and customer segments. For researchers, image analytics provides a way to incorporate visual data into empirical analysis. By transforming images into structured, analyzable variables, researchers can investigate how visual elements shape consumer behavior and market outcomes across contexts, thereby advancing theory in domains where visual design plays a central role.
However, unlike structured data, images do not come with predefined variables. Researchers and managers must first decide which aspects of visual content matter for a given outcome and how to extract those variables from unstructured images in a reliable and scalable way.
Image variables can be broadly organized into three categories based on the level of visual meaning:
- Low-level features capture visual properties derived directly from pixel values, such as color, brightness, and composition.
- Mid-level features capture what is present in the image, such as objects, people, logos, or scenes.
- High-level features capture how images are interpreted or evaluated, such as perceived emotion, aesthetics, or brand personality.
From a managerial perspective, these levels can be viewed as part of a broader decision process that links business outcomes, feature selection, measurement methods, and validation, as shown in Figure 1. In practice, the appropriate level depends on the objective of the image analytics task, as summarized in Table 1.
This article provides a structured framework to guide feature selection and measurement, synthesizing recent research from Journal of Marketing Research and related marketing and information systems journals to show how image analytics can inform theory and managerial actions.
Figure 1: Feature Selection and Measurement Framework for Image Analytics

Table 1: Choosing the Right Level of Visual Meaning
| If Your Goal Is To… | Choose This Level | Rationale |
| --- | --- | --- |
| Control for basic visual differences across images | Low-level | Fast, objective, scalable |
| Measure what appears in the image | Mid-level | Directly links to content decisions |
| Measure consumer evaluations or perceptions | High-level | Captures psychological meaning |
Low-Level Features: Color, Composition, and Basic Visual Properties
Objective image attributes such as file size, resolution, orientation (portrait vs. landscape), and foundational visual features drawn from photography research, including color, composition, and figure–ground relationships, provide a natural starting point for image analytics. These attributes are either directly observable or easy to extract, follow standardized definitions, and yield consistent values for the same image, enabling systematic comparison at scale.
Among intrinsic attributes, color has received the most sustained attention in marketing research (see Labrecque [2020] for a comprehensive review). Color is commonly characterized along three dimensions: hue (e.g., red, blue), saturation (intensity or richness), and value (lightness versus darkness), together often referred to as the hue–saturation–value (HSV) space. Consistent with prior reviews of color research in marketing (Labrecque 2020), much of the foundational evidence on color effects comes from controlled laboratory experiments that offer strong internal validity but limited scalability. Today, image analytics enables researchers to extend color research to large-scale field datasets by measuring color properties directly from images. Image processing packages such as Python Pillow allow researchers to extract pixel-level color values, preferably in HSV space to facilitate interpretation, and construct theory-consistent color variables. For example, dominant colors are often identified using k-means clustering to group pixels with similar color characteristics and represent each cluster by its centroid (Labrecque et al. 2025). Other studies operationalize image clarity as the proportion of pixels exceeding a brightness threshold (Zhang et al. 2022) or measure colorfulness as one minus the combined pixel share of the three most dominant colors (Dang et al. 2026; Li and Xie 2020). Although specific parameter choices vary across studies, results are generally robust to reasonable alternative specifications. Major commercial vision platforms such as Google Vision AI, Amazon Web Services (AWS) Rekognition, and Microsoft Azure offer similar color detection capabilities.
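The measures described above can be sketched in a few lines of code. The brightness threshold and the number of dominant colors below are illustrative choices, not values mandated by the cited studies, and exact pixel values stand in for the k-means clusters that published work typically uses to group similar shades:

```python
# Minimal sketch of low-level color measures using only the standard library.
# In practice, the pixel list would come from an image library such as
# Pillow, e.g., list(Image.open(path).convert("RGB").getdata()).
import colorsys
from collections import Counter

def color_features(pixels, brightness_threshold=0.7, top_k=3):
    """pixels: list of (r, g, b) tuples with values in 0-255."""
    hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in pixels]
    n = len(hsv)

    # Image clarity: share of pixels brighter than a threshold
    clarity = sum(v > brightness_threshold for _, _, v in hsv) / n

    # Colorfulness: 1 minus the combined pixel share of the top_k most
    # frequent colors (exact values here; k-means clustering of similar
    # shades would be used in a full implementation)
    counts = Counter(pixels)
    top_share = sum(c for _, c in counts.most_common(top_k)) / n
    colorfulness = 1 - top_share

    return {
        "mean_saturation": sum(s for _, s, _ in hsv) / n,
        "mean_brightness": sum(v for _, _, v in hsv) / n,
        "clarity": clarity,
        "colorfulness": colorfulness,
    }
```

The resulting dictionary of variables can then enter regressions or machine learning models directly, one row per image.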
Once constructed, color-related attributes can enter regression or machine learning models directly to explain or predict outcomes. For example, Li and Xie (2020) find that colorfulness affects user engagement on social media in a category-contingent manner, while Zhang et al. (2022) show that Airbnb photos with warmer hues, greater image clarity, and more balanced color properties are associated with higher demand. Using gradient-boosted regression trees, Dzyabura et al. (2023) demonstrate that nuanced color composition clusters exert substantial predictive power for product returns, indicating that subtle shade differences can meaningfully affect consumer behavior. However, Zhang and Luo (2022) find that photographic attributes such as composition and brightness are less useful in predicting restaurant survival compared to the content of the photos.
Beyond their direct effects, color-related attributes also serve as building blocks or explanatory factors for higher-level perceptual constructs. Hou, Zhang, and Zhang (2023) show that warmer hues, higher saturation, and greater brightness evoke more positive emotions in charity crowdfunding images, while Zhang et al. (2022) find that professionally verified Airbnb photos achieve higher perceived quality in part due to more appealing color properties. Yu et al. (2026) further demonstrate that saturation, brightness, brightness contrast, and image clarity increase picture-evoked arousal but not valence in restaurant reviews. Complementing these findings, Labrecque et al. (2025) show that marketers systematically pair highly saturated product images with language emphasizing potency and efficacy.
Taken together, this body of work highlights the dual role of color-related attributes in image analytics. Color properties directly explain variation in consumer responses and market outcomes, while also serving as foundational visual factors that must be accounted for when studying higher-level creative strategies. In practice, failing to control for color can lead managers and researchers to misattribute performance differences to creative content or messaging when those differences instead reflect underlying visual properties (Labrecque 2020; Li and Xie 2020; Zhang et al. 2022). This risk is particularly pronounced when images vary substantially in brightness or colorfulness, when creative elements such as people presence or emotional expressions systematically co-occur with specific color patterns, or when analyses span product categories with distinct color conventions.
Mid-Level Features: Objects and Human Faces
Mid-level visual features capture what appears in the image rather than only how it looks at the pixel level. Common examples include detected objects (e.g., product, food, car, logo, sports equipment) and the presence and characteristics of human faces (e.g., face presence, number of faces, facial expressions). These features matter because they map directly onto content elements that consumers notice and interpret, and they often proxy for managerial choices about what to show, such as products versus lifestyle contexts or people versus objects.
Researchers can extract object- and face-based features using pretrained computer vision systems that return labels and localized regions. For example, Google Vision can identify multiple objects and faces and provide bounding boxes and facial attributes, including emotion likelihoods. Similar outputs are available from platforms such as AWS Rekognition, Microsoft Azure, and Clarifai, enabling straightforward feature construction ranging from simple presence or count measures to composition-based measures that rely on object size and location.
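Once a vision API has returned its labels and bounding boxes, feature construction reduces to simple aggregation. The sketch below assumes detections have already been normalized into a plain list of dicts; this format is a simplified stand-in for the actual response objects returned by platforms such as Google Vision or AWS Rekognition:

```python
# Illustrative mid-level feature construction from detector output.
# The `detections` format (label plus bounding box normalized to [0, 1])
# is an assumption for this sketch, not any platform's native schema.
def midlevel_features(detections):
    """detections: list of dicts like
    {"label": "face", "box": (x_min, y_min, x_max, y_max)}."""
    def area(box):
        x0, y0, x1, y1 = box
        return max(0.0, x1 - x0) * max(0.0, y1 - y0)

    faces = [d for d in detections if d["label"] == "face"]
    return {
        "has_face": len(faces) > 0,          # presence measure
        "num_faces": len(faces),             # count measure
        "face_area_share": sum(area(d["box"]) for d in faces),  # composition
        "num_objects": len(detections),
    }
```

Presence, count, and area-share measures like these map directly onto the creative decisions discussed below, such as whether to feature people at all and how prominently.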
When standard outputs are insufficient, researchers can fine-tune pretrained deep learning models for task-specific classification. For example, Hartmann et al. (2021) train a Visual Geometry Group (VGG-16) convolutional neural network (CNN) model to distinguish consumer selfies from brand selfies, and Li et al. (2022) use transfer learning with a Residual Network (ResNet-50) architecture to classify room types in Airbnb photos. Table 2 summarizes the key conceptual steps in fine-tuning pretrained CNNs for such tasks. In practice, fine-tuning is most commonly used for mid-level (e.g., object or content detection) and high-level (e.g., perception or evaluation prediction) image analytics tasks, whereas low-level visual properties such as color or brightness are typically extracted using standard image processing tools.
Table 2: Conceptual Overview of Fine-Tuning a Pretrained Convolutional Neural Network (CNN)
| Step | Key Decision | Purpose |
| --- | --- | --- |
| Define the task | Specify what is being classified or detected (e.g., consumer selfie vs. brand selfie, room type, emotion, quality) | Aligns model outputs with the theoretical construct of interest |
| Prepare labeled data | Assemble a labeled image set consistent with the task definition | Ensures the model learns meaningful visual distinctions |
| Initialize pretrained model | Start from a CNN architecture (e.g., VGG-16, ResNet-50) pretrained on a large image corpus (e.g., ImageNet) | Leverages general visual representations and limits data requirements |
| Adapt and fine-tune model layers | Decide which layers’ parameters to hold fixed and which to fine-tune, based on factors such as task complexity and training sample size, and adjust the final classification layer to match the task categories | Balances generalization with task specificity and ensures model outputs align with the labeled data |
| Validate and extract outputs | Assess out-of-sample performance and extract predictions | Establishes measurement reliability for downstream analysis |
In marketing contexts, including people in images is a common and highly consequential creative decision. However, evidence across settings shows that the effects of human presence are highly context dependent. Li and Xie (2020) find that images with human faces increase attention and engagement on Twitter but not on Instagram. Lu, Jung, and Peck (2024) show that in identity-relevant contexts such as vacations or weddings, including another person can reduce liking and preference by triggering psychological ownership concerns. In social media branding, Hartmann et al. (2021) document a similar trade-off: Consumer selfies generate more likes and comments, whereas product-focused brand selfies elicit stronger brand engagement and purchase intentions. In online reviews, Guan et al. (2023) find that reviewer face disclosure increases subsequent product ratings by reducing uncertainty about product fit. Together, these findings indicate that human presence does not uniformly enhance image effectiveness; its impact depends on platform norms, consumption goals, and the role the image plays in the decision process. Beyond direct effects, human and object presence also shape image effectiveness indirectly by influencing emotional responses. In charity crowdfunding, Hou, Zhang, and Zhang (2023) show that images featuring people heighten excitement while suppressing awe and selectively amplify or reduce negative emotions, whereas images with animals evoke a distinct emotional profile. Taken together, this work underscores the importance of explicitly modeling human and object presence in image analytics, both as direct predictors and as drivers of emotional and evaluative mechanisms that influence downstream outcomes.
High-Level Features: Emotion, Quality, Aesthetics, and Beyond
High-level visual features capture how people respond to and evaluate an image or an object within an image. These features reflect subjective interpretations, such as the emotions an image evokes (Hou, Zhang, and Zhang 2023), the quality or aesthetic appeal it conveys (Guan et al. 2023; Zhang et al. 2022), and person-related attributes such as celebrity potential (Feng et al. 2025) or attractiveness (Malik, Singh, and Srinivasan 2023).
Prior research demonstrates that these perceptual constructs play an important role across a wide range of contexts. For example, images that evoke specific emotions influence engagement and donation behavior in charity crowdfunding (Hou, Zhang, and Zhang 2023). Perceived visual quality and aesthetic appeal shape evaluations in hospitality and online review settings (Guan et al. 2023; Zhang et al. 2022). Face-related attributes inferred from images, such as celebrity potential or attractiveness, affect influencer selection, hiring decisions, and long-term career outcomes (Feng et al. 2025; Malik, Singh, and Srinivasan 2023; Troncoso and Luo 2022). Together, these findings show that high-level visual features capture meaningful variation in how images shape evaluations and decisions, even though their effects often depend on context and task.
High-level visual features offer three key advantages for image analytics:
- They align measurement with how marketing theory conceptualizes decision making. Many theories emphasize perceptions and judgments as the link between marketing stimuli and outcomes.
- They provide a compact way to summarize complex visual information, improving stability and making comparisons easier across platforms, categories, and contexts.
- They improve interpretability for both researchers and managers by translating visual variation into psychologically meaningful constructs that are easier to explain and act on.
Researchers typically construct high-level visual features using the same CNN-based framework applied to other task-specific image analytics, consistent with the workflow summarized in Table 2. The main difference lies in task definition and labeling: Instead of predicting objects or content categories, models infer perceptual judgments or attributes based on human evaluations or validated proxies. The resulting predictions then serve as quantitative measures that can be incorporated directly into empirical models.
Advertising and consumer behavior research has long emphasized that visual and verbal elements are processed jointly and that their congruence shapes consumer responses (Heckler and Childers 1992). Advances in image analytics now allow researchers and managers to measure these relationships directly and at scale. Methodologically, this stream of research uses deep learning models to generate representations for images and text and then constructs measures that capture how visual and verbal content relates across modalities. These measures can be validated against human judgments and incorporated into empirical models to study how visual and verbal cues jointly shape perceptions and decisions.
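A common cross-modal measure is the cosine similarity between an image embedding and a text embedding in a shared representation space (e.g., one produced by a model such as CLIP). The sketch below assumes the embeddings have already been extracted; it shows only the congruence computation itself:

```python
# Image-text congruence as cosine similarity between two embedding
# vectors assumed to live in a shared space (e.g., from CLIP).
import math

def cosine_similarity(u, v):
    """Returns the cosine of the angle between vectors u and v:
    1.0 for perfectly aligned embeddings, 0.0 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

The resulting congruence score can be validated against human judgments of image-text fit and then entered into empirical models alongside the low-, mid-, and high-level features discussed above.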
A central insight from this literature is that consumer responses depend critically on how image and text content relate to one another. Shin et al. (2020) show that image–text similarity substantially improves prediction of social media content popularity and consumer engagement. Li and Xie (2020) find that stronger image–text fit increases user engagement on Twitter but not on Instagram, underscoring platform-specific processing differences. In online reviews, Ceylan, Diehl, and Proserpio (2024) and Yu et al. (2026) show that alignment between photos and text in both content and emotional valence and arousal improves review helpfulness by enhancing processing fluency. Extending beyond reinforcement, Cao, Li, and Zhang (2025) uncover a U-shaped effect of image–text congruence in product representations, showing that both high congruence driven by relevance and deliberate incongruence driven by surprise can enhance consumer preference. Together, this stream of work highlights the importance of coordinating visual and verbal cues rather than optimizing them in isolation.
Putting the Framework into Practice: A Multilevel View of a Marketing Image
Figure 2: Example of Multilevel Image Measurement in a Social Media Post

To illustrate how image analytics can support managerial decision making, consider the Nike social media post shown in Figure 2. The same image can be analyzed at multiple levels depending on the business objective. Rather than extracting every possible visual feature, the goal is to select image variables and measurements that match the decision being supported.
At the low level, managers can measure visual style and properties such as color distribution, brightness, contrast, and background uniformity using standard image-processing tools (e.g., Python image libraries, vision APIs). These measures help ensure visual consistency within and across campaigns and help isolate the effects of higher-level creative decisions. In this example, the dark background and strong contrast visually isolate the product and increase visual salience. More broadly, by quantifying background tone and contrast across posts, Nike can test when high-contrast, minimalist imagery enhances engagement or conversion relative to visuals featuring brighter or more visually complex backgrounds. These insights allow managers to tailor visual style across platforms, product categories, and campaign objectives.
At the mid-level, managers can measure what is present in the image. Object detection or classification models can identify the product, logo, product components, and the presence or absence of humans. These measures support decisions about product-focused versus lifestyle-centered creative strategy, brand visibility, and content tagging. In this example, the absence of people and the sole visual focus on the shoe signal a product-centric creative strategy that emphasizes technology and performance. Prior research shows that such content choices have meaningful consequences: Hartmann et al. (2021) find that consumer selfies generate more likes and comments, whereas product-focused brand selfies elicit stronger brand engagement and purchase intentions. By quantifying whether images feature products alone or include people, Nike can align its creative strategy with campaign objectives. Product-centric imagery can be emphasized when the goal is to strengthen purchase intent and brand evaluation. In contrast, incorporating human elements may be more effective when the objective is to increase social interaction.
At the high level, managers can gauge how consumers interpret the image. Custom models or human-assisted coding can be used to measure perceived excitement, performance intensity, or innovation cues. These constructs are directly linked to engagement, purchase intent, and brand perception (Guan et al. 2023; Hou, Zhang, and Zhang 2023; Zhang et al. 2022). In the Nike example, the dark background, strong contrast, and focused presentation of the shoe collectively convey a high-performance and technologically advanced impression. Such measures allow managers to evaluate whether an image communicates the intended brand meaning or emotional tone before deployment at scale.
When text is present, managers can also evaluate image–text alignment. Here, the performance-focused copy aligns closely with the high-energy technical visual, reinforcing the product message. This alignment strengthens the overall consumer interpretation. Prior research demonstrates that alignment between visual and verbal cues can enhance engagement and processing fluency, while strategic incongruence may also increase attention in some contexts (Cao, Li, and Zhang 2025; Ceylan, Diehl, and Proserpio 2024; Shin et al. 2020; Li and Xie 2020; Yu et al. 2026). Extending this approach across posts allows Nike to evaluate when alignment between images and text enhances engagement and conversion outcomes and when alternative strategies may be more effective.
Together, these levels provide complementary insights, moving from visual style, to content strategy, to consumer interpretation and market outcomes.
Summary
This article offers a practical framework for making sense of visual content in digital marketing. It shows how images can be analyzed at three levels—basic visual properties, content elements such as products and people, and higher-level perceptions such as emotion and quality—and explains when each level is most useful for understanding performance. The article also highlights the growing importance of analyzing images together with text, since consumers often interpret visual and verbal cues jointly. By synthesizing recent research and outlining scalable analytic approaches, the framework helps managers and researchers choose the right visual features, avoid misleading conclusions, and design images that communicate more effectively across platforms and contexts.
References
Cao, Jingcun, Xiaolin Li, and Lingling Zhang (2025), “Is Relevancy Everything? A Deep-Learning Approach to Understand the Effect of Image-Text Congruence,” Management Science, 71 (12), 10579–10602. https://doi.org/10.1287/mnsc.2022.01896
Ceylan, Gizem, Kristin Diehl, and Davide Proserpio (2024), “Words Meet Photos: When and Why Photos Increase Review Helpfulness,” Journal of Marketing Research, 61 (1), 5–26. https://doi.org/10.1177/00222437231169711
Dang, Ivy Chu, Canice M.C. Kwan, Jayson S. Jia, and Yang Shi (2026), “When Words Meet Visuals: How Content Composition Drives Social Media Engagement for Marketer-Generated Content,” Journal of Marketing Research, 63 (1), 167–90. https://doi.org/10.1177/00222437251373042
Dzyabura, Daria, Siham El Kihal, John R. Hauser, and Marat Ibragimov (2023), “Leveraging the Power of Images in Managing Product Return Rates,” Marketing Science, 42 (6), 1125–42. https://doi.org/10.1287/mksc.2023.1451
Feng, Xiaohang, Shunyuan Zhang, Xiao Liu, Kannan Srinivasan, and Cait Lamberton (2025), “An AI Method to Score Celebrity Visual Potential,” Journal of Marketing Research, 62 (5), 757–75. https://doi.org/10.1177/00222437251323238
Guan, Yue, Yong Tan, Qiang Wei, and Guoqing Chen (2023), “When Images Backfire: The Effect of Customer-Generated Images on Product Rating Dynamics,” Information Systems Research, 34 (4), 1641–63. https://doi.org/10.1287/isre.2023.1201
Hartmann, Jochen, Mark Heitmann, Christina Schamp, and Oded Netzer (2021), “The Power of Brand Selfies,” Journal of Marketing Research, 58 (6), 1159–77. https://doi.org/10.1177/00222437211037258
Heckler, Susan E. and Terry L. Childers (1992), “The Role of Expectancy and Relevancy in Memory for Verbal and Visual Information: What Is Incongruency?” Journal of Consumer Research, 18 (4), 475–92. https://doi.org/10.1086/209275
Hou, Jian-Ren, Jie Zhang, and Kunpeng Zhang (2023), “Pictures That Are Worth a Thousand Donations: How Emotions in Project Images Drive the Success of Online Charity Fundraising Campaigns? An Image Design Perspective,” MIS Quarterly, 47 (2), 535–84. https://doi.org/10.25300/MISQ/2022/17164
Labrecque, Lauren (2020), “Color Research in Marketing: Theoretical and Technical Considerations for Conducting Rigorous and Impactful Color Research,” Psychology & Marketing, 37 (7), 855–63. https://doi.org/10.1002/mar.21359
Li, Hanwei, David Simchi-Levi, Michelle Xiao Wu, and Weiming Zhu (2022), “Estimating and Exploiting the Impact of Photo Layout: A Structural Approach,” Management Science, 69 (9), 5209–33. https://doi.org/10.1287/mnsc.2022.4616
Li, Yiyi and Ying Xie (2020), “Is a Picture Worth a Thousand Words? An Empirical Study of Image Content and Social Media Engagement,” Journal of Marketing Research, 57 (1), 1–19. https://doi.org/10.1177/0022243719881113
Lu, Zoe Y., Suyeon Jung, and Joann Peck (2024), “It Looks Like ‘Theirs’: When and Why Human Presence in the Photo Lowers Viewers’ Liking and Preference for an Experience Venue,” Journal of Consumer Research, 51 (2), 321–41. https://doi.org/10.1093/jcr/ucad059
Malik, Nikhil, Param Vir Singh, and Kannan Srinivasan (2023), “When Does Beauty Pay? A Large-Scale Image-Based Appearance Analysis on Career Transitions,” Information Systems Research, 35 (4), 1524–45. https://doi.org/10.1287/isre.2021.0559
Shin, Donghyuk, Shu He, Gene Moo Lee, Andrew B. Whinston, Suleyman Centintas, and Kuang-Chih Lee (2020), “Enhancing Social Media Analysis with Visual Data Analytics: A Deep Learning Approach,” MIS Quarterly, 44 (4), 1459–92. https://doi.org/10.25300/MISQ/2020/14870
Troncoso, Isamar and Lan Luo (2022), “Look the Part? The Role of Profile Pictures in Online Labor Markets,” Marketing Science, 42 (6), 1080–1100. https://doi.org/10.1287/mksc.2022.1425
Yu, Yifan, Xinyao Wang, Jinghua Huang, and Yong Tang (2026), “The Pleasant Visual Path to Review Helpfulness: Picture-Evoked Emotional Valence and Picture-Text Alignment,” MIS Quarterly, 50 (1), 243–68. https://doi.org/10.25300/MISQ/2025/17965
Zhang, Mengxia and Lan Luo (2022), “Can Consumer-Posted Photos Serve as a Leading Indicator of Restaurant Survival? Evidence from Yelp,” Management Science, 69 (1), 25–50. https://doi.org/10.1287/mnsc.2022.4359
Zhang, Shunyuan, Dokyun Lee, Param Vir Singh, and Kannan Srinivasan (2022), “What Makes a Good Image? Airbnb Demand Analytics Leveraging Interpretable Image Features,” Management Science, 68 (8), 5644–66. https://doi.org/10.1287/mnsc.2021.4175