Urban Feature Extraction For Building Performance Studies

Author: Nada Tarkhan
Affiliation: MIT
Date: 12/09/2021

In this post, a brief review and experiments on facade parsing methods for building performance evaluation are presented.

Overview

In this post, deep learning methods for Urban feature extraction are presented and discussed. Computer Vision methods have greatly affected the way in which we measure contextual attributes from our urban environments.

Extensive literature has assessed the application of different CNN (Convolutional Neural Network)-based methods to extract facade elements and segment different features. However, how do these methods translate to building performance studies, specifically with the kinds of assessments that are typically used to evaluate different building performance phenomena (such as building energy use)? Moreover, what is the kind of data that needs to be collected and at what resolution? The post intends to break this down along with a discussion on data acquisition. Finally, a case study will be presented for window extraction.

Challenge 1: Urban Diversity and Scale

The Urban environment is very diverse. There is no universal way to standardize the rules needed to interpret different contexts and architecture around the world. Some of those diversities are represented in the form of the following parameters; building heights, density, urban vegetation, design styles, construction period and building materials.

Challenge 2: Method Matching Needed

The challenge here lies in selecting extraction methods that match the task at hand. This depends on the granularity of information that is needed such as precise area, dimensions and location.. This is expanded upon in the experiments section when an example extraction task is carried out and assessed.

Challenge 3: Data Quality

As with most machine learning tasks, models are data hungry. As this application represents a more domain specific application, the data quality and quantity is imperative to make sound assumptions in simulaiton models. In this application, street view imagery has been the most promising data source, yet does not come without challenges and biases. In the Data Acquisition section, the data retrieval and labelling process is discussed.

Image Source

The below blog aims to lay out the following:

  • Provide an overview of the building attributes that are required to conduct different types of urban level analyses
  • Give a general breakdown of scene segmentation and extraction methods
  • Outline a brief history of building and facade parsing literature
  • Highlight the main parts of the data acquisition procedures
  • Conduct an experiment in window feature extraction from buildings
  • Discuss main insights, assessment techniques and results
  • Urban & Building Analyses: Input Features

    In this section, the main parameters needed for 3 distinct types of building performance studies are outlined. For each study, a set of inputs are required that are often very tedious to collect through manual methods. At the same time, in the absence of this data many high level assumptions need to be made which can be inaccurate. Hence, automating this data retrieval in the context of urban environments is imperative to the accuracy of these assessment models.

    Expand the nodes to view the building attributes required for each analysis!

    Energy

    This mainly involves whole building/urban energy simulations that looks to quantify annual energy use over the course of a year. These simulations take in various building properties as well as the specific urban context and climate files to conduct multi-physics calculations

    Daylighting

    This assessment involves simulating building models to evaluate the percentage of occupied time in which a certain daylight threshold is met or exceeded. This is highly dependent on window design and placement as well as annual sun exposure.

    Outdoor Thermal Comfort

    This assessment factors in thermal properties of an environment to give a measure of thermal satisfaction based on human physiological models.

    Indoor Thermal Comfort

    This study aims to achieve the same goal as the above but with internal thermal properties while utilizing indoor thermal comfort thresholds.

    Extraction Methods

    As noted earlier, many computer vision methods have been applied to the area of urban feature extraction. Below are the most notable contributions to this area, categorized under their respective methods and models:

    Grammar Based Methods

    These approaches center on procedural shape grammar definitions. At a high level, this method is usually made up of a set of hand-crafted rules of basic shapes which represents structured geometries encoded as parametric sub-divisions. These rules can then be applied to images to make deductions on hierarchies and the presence of certain elements. Although intuitive to construct, hand-crafted features, based on methods such as sobel operators, usually suffer when handling more diverse conditions, especially when dealing with noisy street view imagery and architectural styles that fall outside the constrained set of pre-defined rules. In addition to this, the extensive computational time does not justify the limited inference capability.

    Learning Based Methods
    CNNs

    Recently, Convolutional Neural Networks (CNNs) have dominated image analysis tasks due to their superior performance across classification, object detection, semantic segmentation and image segmentation tasks. CNNs have been applied extensively to facade parsing tasks with varying architectures and model set-ups. CNNs take in an input image, assign learnable weights and biases to various features in the image and output segmented images based on the detected classes. This is mainly carried out through the application of relevant filters. Below, the variations of CNNs applied to facade parsing applications is covered.

    Polygon RNN

    Contrasting current object segmentation tasks under CNNs where they are treated as pixel-labeling problems, Polygon RNN casts this as a polygon prediction task, matching the annotation style of most annotation datasets.

    Panoptic Segmentation

    Panoptic segmentation unifies the typically distinct tasks of semantic segmentation (assign a class label to each pixel) and instance segmentation (detect and segment each object instance).

    Hover on the below to view the respective method category and click to be navigated to the research on the method!



    A History of Facade Parsing

    More specific to facade parsing, below is a brief literature review that gives an outlook into the methods utilized and the dataset used. Below, the transfer learning component is highlighted as this has been shown to improve efficieny as it reduces the resources and amount of labelled data required to train new models.

    Architecture Categorization of Main framework
    Grammar
    -based
    CNN
    Mask
    R-CNN
    Dataset Sources
    Transfer
    Learning
    Annotated
    Dataset

    Zhao, et al., 2010

    proposed parsing using procedural grammars

    Wendel, et al., 2010

    Utilized Scale–Invariant Feature Trans-form (SIFT)

    Mathias, 2011

    Utilized inverse procedural modeling

    Schmitz, et al., 2016

    Transfer Learning and small model size

    Fathalla, et al., 2017

    Introduced used appearance &layout cues + VGG model.

    Koch, et al., 2018

    Utilizes multi-scale patch-based pattern extraction + CNN

    , Bacharidis et al., 2020

    Combined generative adversarial networks (GANs)

    Hu, et al., 2020

    Uses bounding boxes detection by using
    YOLO architecture in real-time

    Wenguang et al., 2020

    FasterR-CNN architecture for window detection.

    Nordmark et al., 2021

    Utilizes Regional Proposal Network

    Method Matching & Trends

    The brief history of facade parsing methods shows the following trends:

  • A heavy emphasis on CNN networks due to their performance in learning capacity and inference capabilities
  • Mask R-CNN provides additional capabilities when it comes to providing a segmentation mask as well for area extraction
  • Datasets are becoming richer as more facade classes are also making appearances in public datasets (such as balconies, cornices, staircases etc.)
  • Transfer Learning and data augmentation are becoming more standard practices in the field
  • To present a degree of specificity to the subject matter, the below presents the customized extraction task of window to wall ratios for energy and daylight studies. In these studies, the window to wall ratio is a large driver behind thermal envelope performance as well as daylight access. For more granular assessments, deriving the exact window area would be needed hence, bounding box methods indicating location of windows (with imprecise annotation geometry) would not be sufficient in this case. Similarly, we can argue that for applications requiring height estimation, it would be beneficial to have calibrated (actual) building heights rather than a ratio/proportion between the different object heights in a scene (Trees, buildings etc.)

    Data Set: Acquisition & Labelling

    Automating the collection of semantic data regarding building features is required to carry out this task. Manual data acquisition methods can be extremely challenging and time consuming. For example, information regarding location, shapes and sizes, materials of building structures can change over time and hence datasets would need to be both reliable and up to date. In this section we discuss the data resolutions, properties and how it is sourced.

    Data Considerations

    RELIABILITY

    Street View imagery is often quite noisy and requires heavy preprocessing and rectification. The process of data acquisition starts with automating the extraction of frontal facade images. This is done through calculating the camera bearing angle and identifying the centerpoint of the building of interest. Next, a rectification procedure requiring homography is applied such that an orthogonal fronto-parallel image is produced.


    AVAILABILITY

    Drawing on the previous point, while several platforms have enabled wide access to street-view data, data can sometimes still be quite sparse in various locations around the world. In the case of this example, the data was obtained from Google Street View Imagery.


    QUALITY/QUANTITY

    As the goal is to collect data for building performance studies, we must be presented with a sufficient data set that widely encompasses different building typologies, uses, construction periods and neighbourhoods. As with all learning problems, sparcity of data may result in biased deductions and inferences. The below diagram summarizes additional data challenges that are specific to urban contexts and buildings.

    imgCH
    Labelling

    The training data included annotated images that were sourced from the following public facade datasets; TSG-20○2, TSG-60○3, ZuBuD○4, CMP and ECP. To improve model performance, additional labeled data was added. These manual annotations of window geometry were labelled using the labelme tool.Retaining the order of corner points was imperative to this labelling process- an example on a facade is shown below:

    As we can see, this can be a very time-consuming process that could benefit from automation and publicly avaialable annotated datasets.

    Experiments

    Model Description

    To experiment with the use of CNNS in window detection, we utilize a framework developed by Li, Chuan-Kang, et al [1]. The method employs a key-point detection framework that utilizes CNNs to make predictions on the presence of window geometry by identifying corners. Resnet is utilized as the backbone network followed by 2 prediction branches. One provides heatmaps (for corner points) while the other provides tagmaps (based on grouping of windows and center verification).

    The input image is fed into the CNN architecture, called the backbone. It is a pre-trained standard convolutional neural network, using ResNet to detect low resolution features. The backbone network acts as a feature extractor over an entire image.

    This workflow along with the model parameters is highlighted below.

    imgCH

    Summary of CNN detection method utilized

    Training

    The training process involved a few iterations. Below, the iterations that yielded performance improvements are plotted. The number of training epochs against the loss are displayed along with the specific iteration changes. The iterations mainly involved different data augmentation methods, more annotated data as inputs and longer training durations. We see a steady decline in loss values as we increase training time and feed the model more data, which is as expected.

    imgCH imgCH

    Model Training Iterations for CNN Model

    Quantitative Scoring

    Following the model training procedures, this experiment involved segmenting the data into different batches of assessment and then measuring performance on these batches. The batches of images were organized into three categories; Design variations, Environmental variations and Contextual Variations. These are explained further in the table below.

    DESIGN Description: Includes design variations including conditions where model showed low prediction performance. This includes; small windows and dense facades, fully glazed facades, large buildings, special conditions such as window insets, heavy facade textures etc.
    ENVIRONMENT Description: Includes environmental variations such as lighting conditions, shadow occlusions and low resolution images from street view imagery.
    CONTEXT Description: Even though diversity in architectural styles is a core goal, not all design styles can be captured. The training data images have facades from all around the world however since we want to conduct analysis in specific locations, data from New York and Lisbon were a bigger focus.

    CNN Model Performance Scores- Comparison of Iteration 1 and 4

    IOU Metric:The Intersection over Union (IoU) metric is a method to quantify the percent overlap between the target mask and our prediction output. A higher IOU indicates better prediction performance

    Precision (over 50% IoU): The amount of positive predictions that were correct.Calculation: true positives / number of predicted positives.

    Recall (over 50% IoU): The percentage of positive cases that were caught. Calculation: true positives / number of actual positives

    Results:The results we see here compare the above scores from 2 models; iteration 1 and iteration 4 (as highlighted under the training section). We see significant performance across all three metrics especially when looking at the IoU scores.

    Qualitative Scoring

    How then can we evaluate the CNN method performance against more traditional non-learning based methods? To answer this question a comparison between two methods is carried out; Method 1 uses grammar based sobel operators for edge detection as carried out by Szcześniak et al [2]. And Method 2 utilizes CNNs as highlighted under model description. Below is a checklist of evaluation criteria that captures how effective the method is at sufficiently evaluating important attributes.

    Criteria

    Method 1: Pixel Counting

    Method 2: Keypoint Detection using CNNs

    Non-uniform window alignment Windows must be on uniform gridding Achieved ✓
    Non-uniform Window geometry Angled non-rectangular forms not detected Arched forms detected as orthogonal
    Multi-material facades Misclassification as windows can occur Achieved ✓
    Textures and patterns Texture lines may be mis-classified as edges Achieved ✓
    Lighting conditions and shadow occlusions Achieved ✓ Achieved ✓
    Fully Glazed Facades Achieved ✓ Additional training data was required to improve detection in this condition
    Façade obstruction elements (balconies, staircases) Mis-classification as windows can occur Achieved ✓
    Densely populated facades Achieved ✓ Additional training data was required to improve detection in this condition
    Depth perception (inset windows) This may sometimes not be detected by algorithm Achieved ✓
    Visual Inspection & Results

    Below is a set of facades of different conditions and diversities. The Green box annotation captures the detected window extents. The grammar based method represents the utilization of sobel operators. As summarized in the table above, we see the pixel based method failing in more complex facade designs, where other facade elements are mistaken for windows.

    Method Comparison Results

    Images showing detection differences between Grammar based method and CNN Model

    Results showing detection in Facade Diversity

    Results from CNN Model

    Results showing CNN- Model Improvement Results

    Detection differences across CNN models- from Iteration 1 (Old Model) to Iteration 4 (Improved Model)

    Insights


    The format of this blog intends to reflect on the methods while carrying out an experiment to compare a grammar based method against a CNN based method. The discussion intends to reflect on the outputs and their relevance to the building performance field. The below citations and linked studies provide information on more specific model architectures and parameters. Based on the above literature review and experiments, below are some discussion points for method enhancements and expanding the application of this work:

  • Extracting window areas in urban contexts provides key insight into energy, daylighting and thermal comfort studies. To expand this work and improve the resolution of urban analysis models, the detection of building heights and materials could be investigated.
  • The tested CNN method provides enhanced performance over rigid rule-based frameworks yet also displays some weaknesses. Circular window geometries cannot be detected and annotated with exact geometry. Methods such as panoptic segmentation could add this level of granularity to detection
  • Data is becoming more accessible in urban environments, yet noise and low image quality issues may still persist. The selection of datasets must be given close attention as this is a large determinant of model and prediction performance.
  • The largest model performance boost came from feeding in additional labelled data and training for longer periods of time (higher number of Epochs). The experiments section provides an example of testing images against categorical batches to make sure model improvements are enhancing detection in more diverse facade conditions.
  • Citations

    Methods used for Experimentation:

    [1] TLi, Chuan-Kang, et al. "Window Detection in Facades Using Heatmap Fusion." Journal of Computer Science and Technology 35.4 (2020): 900-912.

    [2] Szcześniak, Jakub T., et al. "A method for using street view imagery to auto-extract window-to-wall ratios and its relevance for urban-level daylighting and energy simulations." Building and Environment (2021): 108108.

    Other references:

    Zhao, Peng et al. (2010). “Rectilinear parsing of architecture in urban environment”. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, pp. 342–349

    Wendel, Andreas, Michael Donoser, and Horst Bischof (2010). “Unsupervised facade segmentation using repetitive patterns”. In: Joint Pattern Recognition Symposium. Springer, pp. 51–60.

    Mathias, Markus et al. (2011). “Automatic architectural style recognition”. In: ISPRS-International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 3816, pp. 171– 176.

    Schmitz, Matthias and Helmut Mayer (2016). “A convolutional network for semantic facade segmentation and interpretation”. In: The International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences 41, p. 709.

    Koch, David et al. (2018). “Visual estimation of building condition with patch-level ConvNets”. In: Proceedings of the 2018 ACM Workshop on Multimedia for Real Estate Tech, pp. 12–17.

    Bacharidis, Konstantinos, Froso Sarri, and Lemonia Ragia (2020). “3D building façade reconstruction using deep learning”. In: ISPRS International Journal of Geo-Information 9(5), p. 322.

    Hu, Han et al. (2020). “Fast and Regularized Reconstruction of Building Fa\c {c} ades from Street-View Images using Binary Integer Programming”. In: arXiv preprint arXiv:2002.08549

    Ma, Wenguang and Wei Ma (2020). “Deep window detection in street scenes”. In: KSII Transactions on Internet and Information Systems (TIIS) 14(2), pp. 855–870.

    Castrejon, Lluis, et al. "Annotating object instances with a polygon-rnn." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

    Kirillov, Alexander, et al. "Panoptic segmentation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.

    Liu, Hantang, et al. "Deepfacade: A deep learning approach to facade parsing." (2017).

    Koziński, Mateusz, and Renaud Marlet. "Image parsing with graph grammars and Markov Random Fields applied to facade analysis." IEEE Winter Conference on Applications of Computer Vision. IEEE, 2014.

    The following is the sourcerepository.