class: inverse, center, middle
<!-- background-image: url(figs/p_and_p_cover.png) -->
<!-- background-size: cover -->

# Exploring automatic evaluation of statistical graphics

<!-- <img src="" width="150px"/> -->
<img src="figures/Red_camera_eye.svg.png" width="25%" />

.large[Adam Loy | Graphics Group @ ISU | 3 Oct 2019]

---
class: center, middle

# Have you ever fit a linear mixed-effects model?

--

# Did you check residual plots?

--

# Have you ever seen a residual plot you weren't sure how to interpret?

---

# Example

Suppose you fit a standard two-level, continuous-response linear mixed-effects model and see the following residual plots

.pull-left[
.center[ `\(\widehat{\varepsilon}_i\)` vs. x ]
<img src="img/residual4.png" width="35%" style="display: block; margin: auto;" />
.center[ `\(\widehat{\varepsilon}_i\)` vs. x ]
<img src="img/homogeneous-dots-icon.png" width="35%" style="display: block; margin: auto;" />
]

.pull-right[
.center[ `\(\widehat{\varepsilon}_i\)` by group ]
<img src="img/cyclone-good-icon.png" width="35%" style="display: block; margin: auto;" />
.center[ distribution of `\(\widehat{b}_j\)` ]
<img src="img/residual1.png" width="35%" style="display: block; margin: auto;" />
]

---
class: middle

# apophenia

### the tendency to perceive a connection or meaningful pattern between unrelated or random things (such as objects or ideas)

.footnote[
"apophenia," Merriam-Webster Dictionary Online, September 2019, merriam-webster.com
]

---
background-image: url(figures/usual_suspects.jpg)
background-size: cover

# <font color="DimGray">The lineup protocol</font>

---

## Which residual plot is not like the others?

<img src="img/radon_cyclone10.png" width="70%" style="display: block; margin: auto;" />

???
Observed plot: 10

---
class: center, middle

# What did we just do?

--

## We compared the **data plot** with **null plots** of samples where, by construction, there really is nothing going on

--

## This allows us to make decisions from our graphics on a firm foundation

---
class: middle, center

# That's a neat idea, but where does it fit into my workflow?

---

# Applications of visual inference

.large[
1. Model diagnostics
2. Interpreting unfamiliar plots
3. When large-sample theory breaks down
4. Conducting research in (statistical) graphics
]

---

# Example from Hofmann et al. (2012)

.pull-left[
<img src="img/hofmann-polar.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
1. Create lineup data
2. Create lineups from competing designs
3. Evaluate lineups (via MTurk)
4. Evaluate competing designs

`\begin{eqnarray} Y_i &\sim& \text{Bernoulli}(\pi_i) \nonumber\\ g(\pi_i) &=& \mu + \underbrace{\tau_{j(i)}}_{\text{plot design}} + \underbrace{\nu_{s(i)}}_{\text{sample size}} + \cdots\nonumber\\ &+& \underbrace{u_{u(i)}}_{\substack{\text{individual} \\ \text{ability}}} + \underbrace{d_{d(i)}}_{\substack{\text{lineup} \\ \text{difficulty}}} \nonumber \end{eqnarray}`
]

---
class: middle, center

# That's great!

--

# Can you train a computer to do this?

---

# How?

.large[
We have to turn plot evaluation into a classification problem...

.pull-left[
.center[Good residual plot]
<img src="iastate_talk_oct2019_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto;" />
]

.pull-right[
.center[Bad residual plot]
<img src="iastate_talk_oct2019_files/figure-html/bad residuals-1.svg" style="display: block; margin: auto;" />
]
]
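
---

# Aside: simulating "good" and "bad" residual plots

A minimal sketch (not from the original talk) of how the two classes above could be simulated: a "good" residual plot has constant spread around zero, while a "bad" one lets the spread grow with `x`.

```r
# Hedged sketch: simulate one "good" and one "bad" residual plot
library(ggplot2)

set.seed(2019)
n <- 100
x <- runif(n, 0, 10)

sim_resids <- data.frame(
  x     = rep(x, 2),
  resid = c(rnorm(n, sd = 1),          # good: constant variance
            rnorm(n, sd = 0.3 * x)),   # bad: variance grows with x
  class = rep(c("good", "bad"), each = n)
)

ggplot(sim_resids, aes(x, resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2) +
  facet_wrap(~ class)
```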

---

# How can we model this?

.large[
1. Statistical paradigm
  + Logistic regression
  + .bold[Random forests]
  + Support vector machines (SVMs)
  + ...and more
2. Computer vision paradigm
  + Convolutional neural networks
]

---
class: middle, center

# Statistical modeling

## work with Cari Comnick, Logan Crowl, Sophia Gunn, Aidan Mullan

---

# Major questions

.pull-left[
1. What's the response?
2. **What are the predictors?**
3. How are we going to relate the two?
4. How well does this model perform?
]

<br>

Good or bad class label

--

Hmm... that's hard

--

We'll use a random forest

--

<br>

Need to gauge predictive accuracy

---
class: middle, center

# How can we summarize key features/characteristics of scatterplots?

--

# Summary statistics are not the answer!

---
class: center, middle
background-image: url(figures/Datasaurus12.png)
background-size: cover

---

# Scagnostics (scatterplot diagnostics)

.large[
- Originally proposed by Tukey & Tukey (1985)
- Original idea was to provide indices to help guide exploration of a large scatterplot matrix
- Wilkinson, Anand, & Grossman (2005) proposed 9 graph-theoretic metrics
- Wilkinson & Wills (2008) explored the distribution of scagnostics on different classes of scatterplots
]

---

# Geometric graph

- A graph is a set of vertices `\(V\)` together with a set of edges `\(E\)`, where each edge `\(e(v,w)\)` connects a pair of vertices `\(v, w \in V\)`
- A geometric graph can be represented as points and lines in a metric space `\(S\)`

.pull-left[
<img src="img/geometric_graph.png" width="100%" />
]

.pull-right[
`\(V = \lbrace A,B,C,D,E \rbrace\)`

`\(E = \lbrace(A,B), (A,C), (B,C), (C,D), (D,E)\rbrace\)`

`\(S = 2\)` dimensional space
]

---

# Graphs for Scagnostics

.pull-left[
- Undirected
- Simple
- Planar
- Straight
- Finite
]

.pull-right[
<img src="img/geometric_graph.png" width="100%" />
]

---

# Measuring features

.content-box-blue[
**Length(e)** is the Euclidean distance between the vertices of an edge `\(e\)`

**Length(G)** is the total length of all edges of a graph `\(G\)`
]

.pull-left[
<img src="img/graph_length.png" width="100%" />
]

.pull-right[
`\(Length(AB) = 3\)`

`\(Length(G) = 16\)`
]

---

# Measuring features

.content-box-blue[
A **path** is a list of vertices such that each successive pair is connected by an edge
]

.pull-left[
<img src="img/graph_path.png" width="70%" />
]

.pull-right[
<img src="img/graph_closed_path.png" width="70%" />
]

.content-box-blue[
A **polygon** is the boundary of a closed path
]

---

# Measuring features

.content-box-blue[
.bold[Area(P)] is the area of polygon `\(P\)`

.bold[Perimeter(P)] is the length of the boundary of polygon `\(P\)`
]

<img src="img/graph_area.png" width="50%" style="display: block; margin: auto;" />

---

# Scagnostic graphs

.pull-left[
<img src="img/convex_hull.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
.bolder[Convex hull]

- all possible edges connecting any pair of points lie entirely in the interior of the polygon or form its boundary
- Intuition: stretch a rubber band around the points
- captures a lot of white space
]

---

# Scagnostic graphs

.pull-left[
<img src="img/alpha_hull.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
.bolder[Alpha hull]

- an edge exists between any pair of points that can be touched by a disk of radius `\(\alpha\)` that contains no other points
- Intuition: roll a coin of radius `\(\alpha\)` around your scatterplot, connect any two points that the coin touches at the same time
- captures overall shape
]

---

# Scagnostic graphs

.pull-left[
<img src="img/mst.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
.bolder[Minimum spanning tree]

- all vertices are connected by the minimum number of edges possible and any pair of vertices are connected by exactly one path
- lowest total edge length
- focuses on the interior structure
]
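
---

# Computing scagnostics in R

These three graphs underlie the nine measures defined next. As a hedged sketch (an illustration, not the exact code used for the work in this talk), the measures can be computed with the CRAN `scagnostics` package, a Java-based implementation of Wilkinson and Anand's algorithm:

```r
# Hedged sketch: the nine scagnostics for a single scatterplot,
# assuming the CRAN `scagnostics` package (requires Java/rJava)
library(scagnostics)

set.seed(2019)
x <- rnorm(200)
y <- x^2 + rnorm(200, sd = 0.5)

scagnostics(x, y)
# Named measures -- Outlying, Skewed, Clumpy, Sparse, Striated,
# Convex, Skinny, Stringy, Monotonic -- usable as model predictors
```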

---

.left-column[
# Shape
## Stringy
]

.right-column[
<br>
<br>

The diameter of an MST is the longest shortest path

`\(c_{\text{stringy}} = \dfrac{\text{diameter}(T)}{\text{length}(T)}\)`

`\(0 \le c_{\text{stringy}} \le 1\)`

<img src="img/stringyLow.png" width="33%" /><img src="img/stringyMedium.png" width="33%" /><img src="img/stringyHigh.png" width="33%" />
]

---

.left-column[
# Shape
## Stringy
## Skinny
]

.right-column[
<br>
<br>

`\(c_{\text{skinny}} = 1 - \dfrac{\sqrt{4\pi \cdot \text{area}(A)}}{\text{perimeter}(A)}\)`

circle: `\(c_{\text{skinny}} = 0\)`; square: `\(c_{\text{skinny}} \approx 0.12\)`

<img src="img/skinnyLow.png" width="33%" /><img src="img/skinnyMedium.png" width="33%" /><img src="img/skinnyHigh.png" width="33%" />
]

---

.left-column[
# Shape
## Stringy
## Skinny
## Convex
]

.right-column[
<br>
<br>

Ratio of the areas of the alpha and convex hulls

`\(c_{\text{convex}} = \dfrac{\text{area}(A)}{\text{area}(H)}\)`

<img src="img/convexLow.png" width="33%" /><img src="img/convexMedium.png" width="33%" /><img src="img/convexHigh.png" width="33%" />
]

---

.left-column[
# Association
## Monotonic
]

.right-column[
<br>
<br>

`\(\qquad c_{\text{monotonic}} = r^2_{\text{Spearman}}\)`

<img src="img/monotonicLow.png" width="33%" /><img src="img/monotonicMedium.png" width="33%" /><img src="img/monotonicHigh.png" width="33%" />
]

---

.left-column[
# Density
## Outlying
]

.right-column[
<br>
<br>

Proportion of the total edge length due to extremely long edges connected to singletons

Outlier criterion: `\(q_{.75} +1.5(q_{.75} - q_{.25})\)`

`\(c_{\text{outlying}} = \dfrac{\text{length}(T_{outliers})}{\text{length}(T)}\)`

<img src="img/outlyingLow.png" width="33%" /><img src="img/outlyingMedium.png" width="33%" /><img src="img/outylingHigh.png" width="33%" />
]

---

.left-column[
# Density
## Outlying
## Sparse
]

.right-column[
<br>
<br>

Measures whether points in a 2D scatterplot are confined to a lattice or a small number of locations on the plane

`\(c_{\text{sparse}} = q_{.9}(T)\)`

<img src="img/sparseLow.png" width="33%" /><img src="img/sparseMedium.png" width="33%" /><img src="img/sparseHigh.png" width="33%" />
]

---

.left-column[
# Density
## Outlying
## Sparse
## Skewed
]

.right-column[
<br>
<br>

`\(c_{\text{skew}} = \dfrac{q_{.9}(T) - q_{.5}(T)}{q_{.9}(T) - q_{.1}(T)}\)`

<img src="img/skewedLow.png" width="33%" /><img src="img/skewedMedium.png" width="33%" /><img src="img/skewedHigh.png" width="33%" />
]

---

.left-column[
# Density
## Outlying
## Sparse
## Skewed
## Clumpy
]

.right-column[
<br>
<br>

Based on the RUNT statistic

`\(c_{\text{clumpy}} = \displaystyle\max_j \left[ 1 - \dfrac{\max_k \text{length}(e_k)}{\text{length}(e_j)}\right]\)`

<img src="img/clumpyLow.png" width="33%" /><img src="img/clumpyMedium.png" width="33%" /><img src="img/clumpyHigh.png" width="33%" />
]

---

.left-column[
# Density
## Outlying
## Sparse
## Skewed
## Clumpy
## Striated
]

.right-column[
<br>
<br>

Proportion of vertices whose adjacent edges are nearly collinear (cosine of the angle between them less than −0.75)

`\(c_{\text{striated}} = \dfrac{1}{|V|} \displaystyle \sum_{v \in V^{(2)}} I(\cos \theta_{e(v,a)e(v,b)} < -0.75)\)`

<img src="img/striatedLow.png" width="33%" /><img src="img/striatedMedium.png" width="33%" /><img src="img/StriatedHigh.png" width="33%" />
]

---

.pull-left[
<img
src="iastate_talk_oct2019_files/figure-html/unnamed-chunk-26-1.svg" width="75%" style="display: block; margin: auto;" /> <table> <tbody> <tr> <td style="text-align:left;color: black !important;background-color: white !important;"> Outlying </td> <td style="text-align:right;color: black !important;background-color: white !important;"> 0.0865 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: silver !important;"> Skewed </td> <td style="text-align:right;font-weight: bold;color: black !important;background-color: silver !important;"> 0.7921 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: silver !important;"> Clumpy </td> <td style="text-align:right;font-weight: bold;color: black !important;background-color: silver !important;"> 0.2708 </td> </tr> <tr> <td style="text-align:left;color: black !important;background-color: white !important;"> Sparse </td> <td style="text-align:right;color: black !important;background-color: white !important;"> 0.0835 </td> </tr> <tr> <td style="text-align:left;color: black !important;background-color: white !important;"> Striated </td> <td style="text-align:right;color: black !important;background-color: white !important;"> 0.1639 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: gainsboro !important;"> Convex </td> <td style="text-align:right;font-weight: bold;color: black !important;background-color: gainsboro !important;"> 0.2138 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: silver !important;"> Skinny </td> <td style="text-align:right;font-weight: bold;color: black !important;background-color: silver !important;"> 0.7661 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: silver !important;"> Stringy </td> <td style="text-align:right;font-weight: bold;color: black !important;background-color: silver !important;"> 0.5314 </td> </tr> <tr> <td style="text-align:left;color: black !important;background-color: white !important;"> Monotonic </td> <td style="text-align:right;color: black !important;background-color: white !important;"> 0.0051 </td> </tr> </tbody> </table> ] .pull-right[ <img src="iastate_talk_oct2019_files/figure-html/unnamed-chunk-28-1.svg" width="75%" style="display: block; margin: auto;" /> <table> <tbody> <tr> <td style="text-align:left;color: black !important;background-color: white !important;"> Outlying </td> <td style="text-align:right;color: black !important;background-color: white !important;"> 0.1231 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: gainsboro !important;"> Skewed </td> <td style="text-align:right;font-weight: bold;color: black !important;background-color: gainsboro !important;"> 0.6137 </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: gainsboro !important;"> Clumpy </td> <td style="text-align:right;font-weight: bold;color: black !important;background-color: gainsboro !important;"> 0.0898 </td> </tr> <tr> <td style="text-align:left;color: black !important;background-color: white !important;"> Sparse </td> <td style="text-align:right;color: black !important;background-color: white !important;"> 0.0814 </td> </tr> <tr> <td style="text-align:left;color: black !important;background-color: white !important;"> Striated </td> <td 
style="text-align:right;color: black !important;background-color: white !important;"> 0.0703 </td> </tr>
<tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: silver !important;"> Convex </td> <td style="text-align:right;font-weight: bold;color: black !important;background-color: silver !important;"> 0.5133 </td> </tr>
<tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: gainsboro !important;"> Skinny </td> <td style="text-align:right;font-weight: bold;color: black !important;background-color: gainsboro !important;"> 0.5628 </td> </tr>
<tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: gainsboro !important;"> Stringy </td> <td style="text-align:right;font-weight: bold;color: black !important;background-color: gainsboro !important;"> 0.3490 </td> </tr>
<tr> <td style="text-align:left;color: black !important;background-color: white !important;"> Monotonic </td> <td style="text-align:right;color: black !important;background-color: white !important;"> 0.0041 </td> </tr>
</tbody> </table>
]

---
background-image: url(figures/ex_random_forest.png)
background-size: contain

# Random forests

---

# Training data: 14865 signal plots

<img src="iastate_talk_oct2019_files/figure-html/unnamed-chunk-31-1.svg" width="100%" />

---

# Training data: 14865 null plots

<img src="iastate_talk_oct2019_files/figure-html/unnamed-chunk-32-1.svg" style="display: block; margin: auto;" />

---

# Model evaluation

.large[
- Generated 1000 signal plots
- Calculated the 9 scagnostics on each
- Used the random forest to classify each as "signal" or "noise"
]

--

.content-box-blue[
.large[972 of the 1000 signal plots were correctly classified]
]

--

.content-box-yellow[
.large[
Also evaluated on 1000 lineups of size 20, with unspecified number of signal plots

- 93.5% of linear plots identified as signal; 5.3% false positive rate
]
]
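
---

# Sketch: fitting the random forest

A hedged sketch of the modeling step, not the exact code behind the results above. It assumes data frames `scag_train` and `scag_lineup` (hypothetical names) holding the nine scagnostics for each plot, with `scag_train` also carrying a factor `class` with levels "signal" and "null".

```r
# Hedged sketch: random forest on scagnostic features
library(randomForest)

rf_fit <- randomForest(class ~ ., data = scag_train,
                       ntree = 500, importance = TRUE)

rf_fit$confusion      # out-of-bag confusion matrix
varImpPlot(rf_fit)    # which scagnostics drive the classification?

# Score the 20 plots of a new lineup (scagnostics in `scag_lineup`)
predict(rf_fit, newdata = scag_lineup, type = "prob")
```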

---
class: middle, center

# Computer vision models

---

# Motivation: Giora Simchoni's blog post

<img src="figures/giora-blog.png" width="789" />

---

## Giora trained a computer vision model two ways

.Large[
- classification: significant correlation vs. not
- regression: to predict the correlation
]

---
class: clear

## Success, picked plot 16

<img src="img/simchoni_test1.png" width="100%" />

---
class: clear

## Success, failed to pick plot 4

<img src="img/simchoni_test2.png" width="100%" />

---
class: clear

## Fail! Doesn't see the strong nonlinear association. Picks the most linear.

<img src="img/simchoni_test3.png" width="90%" />

---

# Deep learning

<img src="img/image_classification.png" width="95%" style="display: block; margin: auto;" />

.footnote[
Source: [Abdellatif Abdelfattah](https://medium.com/@tifa2up/image-classification-using-deep-neural-networks-a-beginner-friendly-approach-using-tensorflow-94b0a090ccd4)
]

---

## Neural networks

- `\(x_i\)` = input variable
- `\(v_j\)` = function of linear combinations of the inputs (e.g. sigmoid)
- `\(y_k\)` = function of linear combinations of the `\(v_j\)` (e.g. softmax)

<img src="img/nn_simple.png" width="70%" style="display: block; margin: auto;" />

.footnote[
.scriptsize[Source: [Cheng & Titterington (1994)](https://projecteuclid.org/euclid.ss/1177010638)]
]

???
Derived features Zm are created from linear combinations of the inputs, and then the target Yk is modeled as a function of linear combinations of the Zm

---

# Regression as a neural network

<img src="img/nn_regression.png" width="70%" style="display: block; margin: auto;" />

.footnote[
.scriptsize[Source: [Cheng & Titterington (1994)](https://projecteuclid.org/euclid.ss/1177010638)]
]

---

# Images as data

<img src="img/Corgi3.png" width="100%" style="display: block; margin: auto;" />

---
background-image: url(img/nn_cat.png)
background-size: cover

---

# Filtering patterns

<img src="img/nn_filter.png" width="100%" style="display: block; margin: auto;" />

.footnote[
Source: [Deep Learning with R](https://www.manning.com/books/deep-learning-with-r)
]

---

# Pooling to find spatial hierarchies

<img src="img/covnet2.jpg" width="100%" style="display: block; margin: auto;" />

.footnote[
Source: Di Cook's [DSSV slides](http://www.dicook.org/files/dssv19/slides#1)
]

???
to induce spatial-filter hierarchies

---

# Rinse & repeat to reveal other hierarchies

<img src="img/covnet1.jpg" width="80%" style="display: block; margin: auto;" />

.footnote[
Source: Di Cook's [DSSV slides](http://www.dicook.org/files/dssv19/slides#1)
]

---

# Approach I

.large[
- Save the scatterplots as images, then train your CNN
- Di Cook and Shuofan Zhang at Monash University worked on this to detect linearity
- Trained a Keras model with 60,000 training data sets for each class: linear vs. not
- Accuracy with simulated test data, 93%
  + null error 0.0179
  + linear error 0.1176
]

---

# Approach II

.large[
- Create images showing the "shape" of the scatterplot, then train your CNN

.pull-left[
<img src="img/nn_hulls1.png" width="70%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="img/nn_hulls2.png" width="70%" style="display: block; margin: auto;" />
]

- Elliot Pickens and I worked on this last spring
]

---

# Approach II

.large[
Trained a Keras model on 300 simulated plots for each class (4200 total plots in the training set)

.pull-left[
- **Uniform**
- **Spherical**
- **Binormal**
- **Funnel**
- **Exponential**
]

.pull-right[
- **Quadratic**
- **Clustered**
- **Doughnut**
- **Stripe**
- **Sparse**
]
]

???
1. **Uniform** (2D Poisson process)
2. **Spherical** (spherical normal)
3. **Binormal** (bivariate normal with `\(\rho = \pm 0.6\)`)
4. **Funnel** (bivariate log-normal with `\(\rho = \pm 0.6\)`)
5. **Exponential** (exponential growth/decay plus random error)
6. **Quadratic** (positive/negative quadratic function plus random error)
7. **Clustered** (three separated spherical normals at the vertices of an equilateral triangle)
8. **Doughnut** (two polar uniforms separated by a moat of white space)
9. **Stripe** (product of Uniform and integer [1, 5])
10. **Sparse** (product of integer [1, 3] with itself)

---

## Approach II

.pull-left[
.large[
- 2100 images in the testing set (150 per class)
- Precision = positive predictive value
- Recall = sensitivity
]
]

.pull-right[
Class | Precision | Recall
------|-----------|-------
Uniform | 0.97 | 0.99
Spherical | 0.70 | 0.61
Binormal | 0.85 | 0.81
N. Binormal | 0.96 | 0.87
Funnel | 0.96 | 0.90
N. Funnel | 0.94 | 0.93
N. Expo | 0.97 | 0.99
Expo | 0.97 | 0.97
Quadratic | 0.96 | 0.97
Clustered | 0.73 | 0.83
Doughnut | 0.73 | 0.83
Stripe | 0.89 | 0.98
Sparse | 1.00 | 1.00
Logarithmic | 0.99 | 0.92
]

???
Avg. precision and recall is about 90%

Precision = proportion of plots classified as uniform that are actually uniform

Recall = proportion of uniform plots that were actually predicted to be uniform
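
---

# Sketch: a small CNN in R (keras)

A hedged sketch of the kind of convolutional network used in Approaches I and II, not the exact architecture from either project. It assumes the plot images are stored as 128 x 128 x 1 arrays in `x_train` with 0/1 labels in `y_train` (hypothetical names).

```r
# Hedged sketch: a small CNN for two-class scatterplot images
library(keras)

model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(128, 128, 1)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")   # e.g. "signal" vs. "null"

model %>% compile(
  optimizer = "adam",
  loss = "binary_crossentropy",
  metrics = "accuracy"
)

model %>% fit(x_train, y_train, epochs = 10, validation_split = 0.2)
```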

---

# Discussion

.large[
- It's possible to automate the evaluation of plots, but the training sets are key
- If you use a statistical model/algorithm, you need to carefully consider your predictors
- Computers haven't beaten human ability to detect plot type (Cook & Zhang)
- Promising results for model diagnosis, exploring large data sets, and prototyping new statistical graphics
]

---

# Joint work

.large[
#### Deep learning: [Giora Simchoni](http://giorasimchoni.com/2018/02/07/2018-02-07-book-em-danno/); [Di Cook](http://www.dicook.org/) & Shuofan Zhang; Elliot Pickens

#### Inference: Di Cook, Heike Hofmann, [Mahbub Majumder](http://mamajumder.github.io/html/experiments.html), Andreas Buja, Hadley Wickham, Eric Hare, [Susan Vanderplas](https://srvanderplas.netlify.com/), Niladri Roy Chowdhury, Nat Tomasetti

#### Contact:
[aloy.rbind.io](https://aloy.rbind.io/) | aloy@carleton.edu |
[github.com/aloy](https://github.com/aloy)
]

---

# Further reading

.large[
- Buja et al. (2009) Statistical Inference for Exploratory Data Analysis and Model Diagnostics, *Phil. Trans. R. Soc. A*
- Majumder et al. (2013) Validation of Visual Statistical Inference, Applied to Linear Models, *JASA*
- Wickham et al. (2010) Graphical Inference for Infovis, *InfoVis*
- Hofmann et al. (2012) Graphical Tests for Power Comparison of Competing Designs, *InfoVis*
- Loy et al. (2017) Model Choice and Diagnostics for Linear Mixed-Effects Models Using Statistics on Street Corners, *JCGS*
]