# DiSIA Seminars

## Abstract

Unless otherwise indicated, seminars will be held in meeting room 205 (formerly 32).

18/06/2024 at 11.00

### New Menger-like dualities in digraphs and applications to half-integral linkages

### Raul Lopes

In the k-Directed Disjoint Paths (k-DDP) problem we receive a digraph D together with a set of pairs of terminal vertices (s_1, t_1), …, (s_k, t_k), and the goal is to decide if D contains a set of pairwise vertex-disjoint paths P_1, …, P_k such that each P_i is a path from s_i to t_i. The k-DDP problem finds applications in the design of VLSI circuits, high-speed network routing, collision-free routing of agents, and other areas. Although this problem is hard to solve in general even for k=2 paths, polynomial-time algorithms are known for fixed k and some variations of the problem. A common relaxation allows for some degree of congestion in the vertices of the given digraph. In the context of providing algorithms for the congested version of k-DDP, we present new min-max relations in digraphs between the number of paths satisfying certain conditions and the minimum order of an object intersecting all such paths. Applying our tools, we manage to improve and simplify several previous results regarding relaxations of k-DDP in particular classes of digraphs.
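To make the problem statement concrete, the following is a minimal brute-force sketch of a k-DDP decision procedure for tiny digraphs. It is not the method presented in the talk: it simply enumerates simple paths with backtracking and takes exponential time in general. The function names and the two toy digraphs (`routable`, `bottleneck`) are illustrative assumptions.

```python
def simple_paths(adj, source, target, blocked=frozenset()):
    """Yield every simple path from source to target that avoids `blocked` vertices."""
    stack = [(source, (source,))]
    while stack:
        node, path = stack.pop()
        if node == target:
            yield path
            continue
        for nxt in adj.get(node, ()):
            if nxt not in path and nxt not in blocked:
                stack.append((nxt, path + (nxt,)))


def disjoint_linkage(adj, pairs, used=frozenset()):
    """Return pairwise vertex-disjoint paths, one per (s, t) pair, or None if none exist."""
    if not pairs:
        return []
    (s, t), rest = pairs[0], pairs[1:]
    if s in used or t in used:
        return None  # a terminal was consumed by an earlier path: backtrack
    for path in simple_paths(adj, s, t, blocked=used):
        tail = disjoint_linkage(adj, rest, used | set(path))
        if tail is not None:
            return [path] + tail
    return None


# Toy digraph (hypothetical) in which both pairs can be routed disjointly.
routable = {"s1": ["a"], "a": ["t1", "b"], "s2": ["b"], "b": ["t2"]}
# Toy digraph (hypothetical) in which every path must cross the single vertex "c".
bottleneck = {"s1": ["c"], "s2": ["c"], "c": ["t1", "t2"]}

pairs = [("s1", "t1"), ("s2", "t2")]
yes_instance = disjoint_linkage(routable, pairs)
no_instance = disjoint_linkage(bottleneck, pairs)
```

In the second digraph the vertex `c` intersects every terminal-to-terminal path, so no two disjoint paths exist. This is the kind of path/cut duality the abstract refers to, shown here only on a toy instance.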

Contact: Prof. Andrea Marino

29/05/2024 at 12.00

### Studying gambling behavior with Structural Equation Models

### Kimmo Vehkalahti (Centre for Social Data Science, University of Helsinki, Finland)

In a recent paper (authored jointly with Maria Anna Donati and Caterina Primi from UniFi) we used Structural Equation Models (SEM) to study the gambling behavior of Italian high school students. We specified path models and tested indirect (serial mediation) hypotheses of how selected cognitive variables (correct knowledge of gambling and gambling-related cognitive distortions) and affective variables (positive economic perception of gambling, and expectation, enjoyment, and arousal towards gambling) are related to gambling frequency and gambling problem severity. SEMs conducted with adolescent gamblers attested two indirect effects from knowledge to problem gambling: one through gambling-related cognitive distortions and one through gambling frequency. Overall, our results confirmed that adolescent problem gambling is a complex phenomenon explained by multiple and different factors. In this talk, we will discuss the assumptions, choices, practices, and results of the SEM modeling process.

Contact: Chiara Bocci

24/05/2024 at 12.00 - On-site and online seminar

### Causal Modelling in Space and Time

### Marco Scuteri (Dalle Molle Institute)

The assumption that data are independent and identically distributed samples from a single underlying population is pervasive in statistical modelling. However, most data do not satisfy this assumption. Regression models have been extended to deal with structured data collected over time, space, and different populations. But what about causal network models, which are built on regression? In this talk, we will discuss how to produce causal models that can answer crucial causal questions in environmental sciences, epidemiology and other challenging domains that produce data with complex structures.

Contact: Florence Center for Data Science

10/05/2024 at 14.00 - On-site and online seminar

### Does the supply network shape the firm size distribution? The Japanese case

### Corrado Di Guilmi (University of Florence)

The paper presents an investigation of how the upward transmission of demand shocks in the Japanese supply network influences the growth rates of firms and, consequently, shapes their size distribution. Through an empirical analysis, analytical decomposition of the growth rates’ volatility, and numerical simulations, we obtain several original results. We find that the Japanese supply network has a bow-tie structure in which firms located in the upstream layers display larger volatility in their growth rates. As a result, Gibrat’s law breaks down for upstream firms whereas downstream firms are more likely to be located in the power law tail of the size distribution. This pattern is determined by the amplification of demand shocks hitting downstream firms, and the magnitude of this amplification depends on the network structure and on the relative market power of downstream firms. Finally, we observe that in an almost complete network, in which there are no upstream or downstream firms, the power-law tail in firm size distribution disappears. An important implication of our results is that aggregate demand shocks can affect the economy both directly, through the reduction in output for downstream firms, and also indirectly by shaping the firm size distribution.

Contact: Florence Center for Data Science

07/05/2024 at 12.00

### Does access to regular work affect immigrants’ integration outcomes? Evidence from an Italian amnesty program

### Chiara Pronzato (Università di Torino, Collegio Carlo Alberto)

Economic inclusion is often seen as a tool for social inclusion and integration of immigrants. In this paper, we estimate the impact of regular work, within one year of arriving in Italy, on the long-term integration of immigrant individuals, after a period of approximately 10 years. How important is it to guarantee a solid start for their integration and, therefore, for the social balance of the society as a whole? To answer this question, we analyze a sample of immigrants involved in the ISTAT Social Condition and Integration of Foreign Citizens survey that took place in Italy in 2011-12. Our impact estimates are based on instrumental variables, exploiting a 2002 amnesty that improved the probability of getting a regular job depending on the time of arrival. We find beneficial effects of early engagement in regular employment on various integration indicators, including trust in institutions, language proficiency, and cultural assimilation.

Contact: Raffaele Guetto

03/05/2024 at 11.30

### Laying the Foundations for the Design and Analysis of Experiments with Large Amounts of Ancillary Data: Part 2

### Geoff Vining (Department of Statistics, Virginia Tech)

The origins of the design and analysis of experiments required the analyst to evaluate the effects of treatments applied to properly defined experimental units. Fisher’s fundamental principles underlying the proper design of an experiment required: randomization, replication, and local control of error. Randomization assured that each experimental unit available for the experiment had exactly the same probability of being selected for each of the possible treatments. Replication allowed the analyst to evaluate the effects of the treatments by comparing treatment means. Local control of error represented the attempt to minimize the impact of other possible sources of variation. The fact that Fisher could not directly observe the effect of the “chance causes” of the variation forced the focus on comparing treatment means within his overall framework. Modern sensor technology allows the experimenter to observe the effects of many of the chance causes that Fisher could not. However, incorporating this information requires the analyst to model the data through proper linear or non-linear models, not by comparing treatment means. The resulting implications for the proper analysis, taking into account the available ancillary variables, are fascinating, with far-reaching implications for the future of the design and analysis of experiments. Part 2 extends the theoretical foundation to “pseudo” experiments based on the examples in Part 1. Part 2 “extracts” an experimental design from one example to illustrate the analysis as if it were a properly conducted experiment. The second example illustrates how to plan an experiment to follow up on the other example in Part 1 for a future 2^4 experiment with four center runs.
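As a small illustration of the design mentioned at the end of the abstract, the sketch below builds the run list of a 2^4 full factorial with four center runs in coded units (-1 for low, +1 for high, 0 for center). The function name and the coding convention are assumptions made for illustration; the actual factors and run order of the speaker's experiment are not given in the abstract.

```python
from itertools import product


def two_level_factorial_with_centers(n_factors=4, n_center=4):
    """Return the runs of a 2^n_factors full factorial plus center runs, in coded units."""
    # All combinations of low (-1) and high (+1) levels: 2**n_factors corner runs.
    corner_runs = [list(levels) for levels in product((-1, 1), repeat=n_factors)]
    # Center runs place every factor at its mid level (coded 0).
    center_runs = [[0] * n_factors for _ in range(n_center)]
    return corner_runs + center_runs


design = two_level_factorial_with_centers()
n_runs = len(design)  # 16 corner runs + 4 center runs
```

The runs are listed in standard order here; in practice the run order would be randomized before the experiment is carried out, in line with the randomization principle described above.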

Contact: Rossella Berni

30/04/2024 at 10.30

### Laying the Foundations for the Design and Analysis of Experiments with Large Amounts of Ancillary Data: Part 1

### Geoff Vining (Department of Statistics, Virginia Tech)

The origins of the design and analysis of experiments required the analyst to evaluate the effects of treatments applied to properly defined experimental units. Fisher’s fundamental principles underlying the proper design of an experiment required: randomization, replication, and local control of error. Randomization assured that each experimental unit available for the experiment had exactly the same probability of being selected for each of the possible treatments. Replication allowed the analyst to evaluate the effects of the treatments by comparing treatment means. Local control of error represented the attempt to minimize the impact of other possible sources of variation. The fact that Fisher could not directly observe the effect of the “chance causes” of the variation forced the focus on comparing treatment means within his overall framework. Modern sensor technology allows the experimenter to observe the effects of many of the chance causes that Fisher could not. However, incorporating this information requires the analyst to model the data through proper linear or non-linear models, not by comparing treatment means. The resulting implications for the proper analysis, taking into account the available ancillary variables, are fascinating, with far-reaching implications for the future of the design and analysis of experiments. Part 1 lays the theoretical foundation for a modern approach to the analysis of the experiment, taking full advantage of standard linear and non-linear model theory. Two real examples illustrate the concepts. In the process, it becomes quite clear why many people have recently noted serious issues with hypothesis tests in general.

Contact: Rossella Berni

17/04/2024 at 12.00

### Learning Gaussian graphical models for paired data with the pdglasso

### Saverio Ranciati

In this talk we present the pdglasso, an approach for statistical inference with Gaussian graphical models on paired data, that is, when there are exactly two dependent groups and the interest lies in learning the two networks together with their across-graph association structure. The modeling framework contains coloured graphical models and, more precisely, a subfamily of RCON models suited to deal with paired data. Algorithmic implementation, relevant submodels, and maximum likelihood estimates are discussed. We also illustrate the associated R package 'pdglasso', its main contents and usage. Results on simulated and real-data environments are discussed at the end.

Contact: Maria Francesca Marino

11/04/2024 at 11.00

### Double machine learning for sample selection models

### Michela Bia (LISER and University of Luxembourg)

This paper considers the evaluation of discretely distributed treatments when outcomes are only observed for a subpopulation due to sample selection or outcome attrition. For identification, we combine a selection-on-observables assumption for treatment assignment with either selection-on-observables or instrumental variable assumptions concerning the outcome attrition/sample selection process. We also consider dynamic confounding, meaning that covariates that jointly affect sample selection and the outcome may (at least partly) be influenced by the treatment. To control in a data-driven way for a potentially high dimensional set of pre- and/or post-treatment covariates, we adapt the double machine learning framework for treatment evaluation to sample selection problems. We make use of (a) Neyman-orthogonal, doubly robust, and efficient score functions, which imply the robustness of treatment effect estimation to moderate regularization biases in the machine learning-based estimation of the outcome, treatment, or sample selection models and (b) sample splitting (or cross-fitting) to prevent overfitting bias. We demonstrate that the proposed estimators are asymptotically normal and root-n consistent under specific regularity conditions concerning the machine learners and investigate their finite sample properties in a simulation study. We also apply our proposed methodology to the Job Corps data for evaluating the effect of training on hourly wages which are only observed conditional on employment. The estimator is available in the causalweight package for the statistical software R.

Contact: Alessandra Mattei

28/03/2024 at 12.00 - On-site and online seminar

### Interplay between Privacy and Explainable AI

### Anna Monreale (Università di Pisa)

In recent years we have witnessed the diffusion of AI systems based on powerful machine learning models, which find application in many critical contexts such as medicine, financial markets, and credit scoring. In such contexts, it is particularly important to design Trustworthy AI systems while guaranteeing the interpretability of their decisional reasoning, as well as privacy protection and awareness. In this talk, we will explore the possible relationships between these two relevant ethical values to take into consideration in Trustworthy AI. We will answer research questions such as: how can explainability help privacy awareness? Can explanations jeopardize individual privacy protection?

Contact: Florence Center for Data Science

20/03/2024 at 12.00

### Two tales of the information matrix test

### Gabriele Fiorentini (University of Florence)

The talk is based on the results of two related notes in which we derive explicit expressions for the information matrix test of two rather popular models: the multinomial logit and the finite mixture of multivariate Gaussians.

**Information matrix tests for multinomial logit models**

In this paper we derive the information matrix test for multinomial logit models in which the explanatory variables are common across categories, but their effects are not. We show that the vectorised sum of the outer product of the score and the Hessian matrix coincides with the Kronecker product of the outer product of the generalised residuals minus their covariance matrix conditional on the explanatory variables times the outer product of those variables. Therefore, we can reinterpret it as a multivariate version of White's (1980) heteroskedasticity test, which agrees with Chesher's (1983) interpretation of the information matrix test as a Lagrange multiplier test for unobserved heterogeneity. Our Monte Carlo experiments confirm that using the theoretical expressions for the covariance matrices of the influence functions involved leads to substantial reductions in the size distortions of our testing procedures in finite samples relative to the outer product of the score versions, and that the parametric bootstrap practically eliminates them. We also show that the information matrix test has good power against various misspecification alternatives.

**The information matrix test for Gaussian mixtures**

In incomplete data models the EM principle implies the moments the Information Matrix test assesses are the expectation given the observations of the moments it would assess were the underlying components observed. This principle also leads to interpretable expressions for their asymptotic covariance matrix adjusted for sampling variability in the parameter estimators under correct specification. Monte Carlo simulations for finite Gaussian mixtures indicate that the parametric bootstrap provides reliable finite sample sizes and good power against various misspecification alternatives. We confirm that 3-component Gaussian mixtures accurately describe cross-sectional distributions of per capita income in the 1960-2000 Penn World Tables.

Contact: Monia Lupparelli

15/03/2024 at 11.00 - On-site and online seminar

### Bayesian modelling for spatially misaligned health areal data

### Silvia Liverani (Queen Mary University of London)

The objective of disease mapping is to model data aggregated at the areal level. In some contexts, however (e.g. residential histories, general practitioner catchment areas), when data arise from a variety of sources, not necessarily at the same spatial scale, it is possible to specify spatial random effects, or covariate effects, at the areal level by using a multiple membership principle. In this talk I will investigate the theoretical underpinnings of this application of the multiple membership principle to the CAR prior, in particular with regard to parameterisation, properness and identifiability, and I will present the results of an application of the multiple membership model to diabetes prevalence data in South London, together with strategic implications for public health considerations.

Contact: Florence Center for Data Science

11/03/2024 at 14.00

### Trajectories of loneliness in later life – Evidence from a 10-year English panel study

### Giorgio Di Gessa (University College London)

Loneliness is generally defined as the discrepancy between individuals’ desired and actual social interactions and emotional support. Although the prevalence of loneliness is high among older people and is projected to rise, few studies have examined longitudinal patterns of loneliness. Moreover, most studies have focused on more “objective” risk factors for loneliness such as partnership status and frequency of contact, overlooking the quality of the relationship with and support from family and friends. Using data from six waves of the English Longitudinal Study of Ageing (2008/09 to 2018/19, N=4740), we used group-based trajectory modelling to identify distinctive trajectories of loneliness. Multinomial regression models were then used to examine characteristics associated with these trajectories, with a particular focus on size, support, closeness, and frequency of contact with social network members. We identified 5 groups of loneliness trajectories in later life, representing “stable low” (40% of the sample), “medium/low” (26%), “stable high” (11%), and “increasing” (14%) or “decreasing” (9%) levels of loneliness over time. Although there are socioeconomic and demographic differences across these trajectories of loneliness, health and relationship quality are their main drivers. Respondents with poor and deteriorating health were more likely to be classified as having “stable high” or “increasing” loneliness. Even if not having social networks is undoubtedly associated with higher risks of persistent loneliness, having friends and family is not enough: Respondents with low quality of relationships with both friends and family were also significantly more likely to be classified as having “stable high” or “increasing” levels of loneliness.

Contact: Raffaele Guetto

27/02/2024 at 12.00

### Ersilia Lucenteforte: Spreading evidence: a heterogeneous journey in Medical Statistics

### Chiara Marzi: Interdisciplinary biomedical research: exploring Brain Complexity, Machine Learning, and Environmental Epidemiology

### Welcome seminar: Ersilia Lucenteforte, Chiara Marzi

**Ersilia Lucenteforte:**

My past research has explored various aspects of medical statistics, spanning from cancer epidemiology to pharmacoepidemiology and clinical research, with a strong emphasis on Evidence-Based Medicine. In this welcome seminar, I will provide a brief overview of my past activities and discuss my recent focus on a crucial aspect of pharmacoepidemiology: the analysis of medication adherence.

**Chiara Marzi:**

This seminar offers a brief journey through the diverse facets of interdisciplinary biomedical research, as seen through the eyes of a young researcher. I will delve into my main research themes - past, present, and future - spanning from understanding the complexity of the brain to exploring the practical applications of machine learning in medicine, and investigating the impacts of environmental factors on health.

Contact: Raffaele Guetto

23/02/2024 at 12.00 - On-site and online seminar

### Nonhomogeneous hidden semi-Markov models for environmental toroidal data

### Francesco Lagona (University of Roma Tre)

A novel hidden semi-Markov model is proposed to segment bivariate time series of wind and wave directions according to a finite number of latent regimes and, simultaneously, estimate the influence of time-varying covariates on the process’ survival under each regime. The model integrates survival analysis and directional statistics by postulating a mixture of toroidal densities, whose parameters depend on the evolution of a semi-Markov chain, which is in turn modulated by time-varying covariates through a proportional hazards assumption. Parameter estimates are obtained using an EM algorithm that relies on an efficient augmentation of the latent process. Fitted on a time series of wind and wave directions recorded in the Adriatic Sea, the model offers a clear-cut description of sea state dynamics in terms of latent regimes and captures the influence of time-varying weather conditions on the duration of such regimes.

Contact: Florence Center for Data Science

15/02/2024 at 12.00

### Do Intergeneration Household Structures Reflect Differences in American Middle School Students' School Experiences and Engagement in Schoolwork?

### Peter Brandon (University at Albany)

American children grow up in a variety of household structures. Across these households, resources, parenting styles, household composition, and surrounding neighborhoods can vary. Studies suggest that the intermingling of these social, economic, and demographic factors affects children’s well-being and later transitions into adulthood. Thus, households in which children find themselves are consequential and shape their future opportunities. Among the households in which American children grow up, two of the more significant types are three- and skipped-generation households. Our understanding of these particular households has expanded, but there is still much to learn, especially about the everyday experiences of children growing up in these two types of households. Among those everyday experiences worth investigating further are those related to schooling. Positive schooling experiences and a child’s interest in learning are crucial to their development and identity. Preliminary findings from this study suggest schooling experiences and engagement in schoolwork, outside of the classroom, for children in intergenerational households may differ from their peers growing up in other households. The study speculates about interventions focused on the home environment or at the school that might ensure children in intergenerational households are not educationally disadvantaged.

Contact: Giammarco Alderotti

12/01/2024 at 14.30 - Please register here to participate online: https://docs.google.com/forms/d/e/1FAIpQLSdkfhnDMP2j5cI32B38DC4oACXej9W7pKj2keSwVDPtybvahw/viewform?usp=pp_url

### A multi-fidelity method for uncertainty quantification in engineering problems

### Lorenzo Tamellini (CNR-IMATI Pavia)

Computer simulations, which are nowadays a fundamental tool in every field of science and engineering, need to be fed with parameters such as physical coefficients, initial states, geometries, etc. This information is however often plagued by uncertainty: values might be e.g. known only up to measurement errors, or be intrinsically random quantities (such as winds or rainfalls). Uncertainty Quantification (UQ) is a research field devoted to dealing efficiently with uncertainty in computations. UQ techniques typically require running simulations for several (carefully chosen) values of the uncertain input parameters (modeled as random variables/fields), and computing statistics of the outputs of the simulations (mean, variance, higher order moments, pdf, failure probabilities), to provide decision-makers with quantitative information about the reliability of the predictions. Since each simulation run typically requires solving one or more Partial Differential Equations (PDE), which can be a very expensive operation, it is easy to see how these techniques can quickly become very computationally demanding. In recent years, multi-fidelity approaches have been devised to lessen the computational burden: these techniques explore the bulk of the variability of the outputs of the simulation by means of low-fidelity/low-cost solvers of the underlying PDEs, and then correct the results by running a limited number of high-fidelity/high-cost solvers. They also provide the user with a so-called “surrogate-model” of the system response, that can be used to approximate the outputs of the system without actually running any further simulation. In this talk we illustrate a multi-fidelity method (the so-called multi-index stochastic collocation method) and its application to a couple of engineering problems. If time allows, we will also briefly touch on the issue of coming up with good probability distributions for the uncertain parameters, e.g. by Bayesian inversion techniques.
References:

1. C. Piazzola, L. Tamellini. The Sparse Grids Matlab Kit – a Matlab Implementation of Sparse Grids for High-Dimensional Function Approximation and Uncertainty Quantification. ACM Transactions on Mathematical Software, 2023.

2. C. Piazzola, L. Tamellini, R. Pellegrini, R. Broglia, A. Serani, and M. Diez. Comparing Multi-Index Stochastic Collocation and Multi-Fidelity Stochastic Radial Basis Functions for Forward Uncertainty Quantification of Ship Resistance. Engineering with Computers, 2022.

3. M. Chiappetta, C. Piazzola, L. Tamellini, A. Reali, F. Auricchio, M. Carraturo. Data-informed uncertainty quantification for laser-based powder bed fusion additive manufacturing. arXiv:2311.03823.

Contact: Florence Center for Data Science - Prof. Anna Gottard

21/12/2023 at 11.00 - The seminar will be held in Room 003 of DiSIA

### From Descriptive Statistics to Statistical Decision Theory

### Bruno Chiandotto (DISIA, Università di Firenze)

A description of the evolution of Statistics: no longer *“Statistics for decisions”* but *“Statistics and decisions”*.

*Statistical decision theory*, on the one hand, highlights the limits of both classical and Bayesian statistical inference; on the other hand, it provides a satisfactory and operational solution to the problems of:

1. optimal determination of the sample size;

2. searching for causal links among the variables analysed;

3. activating interventions able to modify, to one's own advantage, the natural evolution of the phenomena analysed.

Contact: Raffaele Guetto

15/12/2023 at 12.00

### Forking paths, fishing expeditions and sensitivity analysis

### Andrea Saltelli (UPF Barcelona School of Management)

Based on recent work on the potential multiplicity of results from statistical or mathematical models (Breznau et al., 2022), we show how sensitivity analysis (Saltelli et al., 2021) and the related approach of sensitivity auditing can help in a quest that we call “Modelling of the modelling process.” We shall revisit some general principles and practices of sensitivity analysis, then introduce sensitivity auditing, also following a recent volume published by Oxford in August 2023 (Saltelli and Di Fiore, 2023).

Contact: Raffaele Guetto

30/11/2023 at 14.00

### Working from home and work-family conflict revisited: Longitudinal evidence from Australia pre- and post-pandemic

### Inga Lass (Federal Institute for Population Research - BiB)

The COVID-19 pandemic saw a marked increase in the incidence of working from home in many countries. Yet, longitudinal evidence on whether working from home affects the conflict between the demands of the work sphere and the family sphere is still scarce. Using 19 waves of data from the Household, Income and Labour Dynamics in Australia Survey covering the period 2001 to 2021, this study investigates the association between working from home and work–family conflict among parents (9,859 persons, 54,893 observations). It focuses on both directions of conflict, namely work-to-family conflict (WTFC) and family-to-work conflict (FTWC). It also investigates whether the relationships were affected by the pandemic. Fixed-effects regression models reveal that working from home is associated with lower WTFC for both genders. By contrast, the association between working from home and FTWC differs between mothers and fathers, with FTWC increasing for fathers but decreasing for mothers when working at home. These associations mostly did not change during the pandemic, with the exception of both WTFC and FTWC becoming more severe for mothers with small children who worked from home during COVID-19.

Contact: Raffaele Guetto

27/11/2023 at 15.00

### Third training module: the Third Mission in departmental planning

### Dr. Rosa Gini - Eng. Juri Bellucci

Programme:

1. Third Mission: definition and objectives.

2. The Third Mission seen from outside. Case Study 1: ARS - Dr. Rosa Gini; Case Study 2: Morfo Design - Eng. Juri Bellucci.

3. The present and future of the Third Mission for DiSIA: Third Mission experiences in the Department and ideas for new projects.

4. Closing remarks and coffee break.

Contact: Monia Lupparelli - Maria Cecilia Verri

24/11/2023 at 16.00 - Please register here to participate online: https://us02web.zoom.us/meeting/register/tZAlcO2rqTosHt24vJGv1h1M9sJcO_jGzpDO#/registration

### General Artificial Intelligence can be OK, Artificial General Intelligence cannot

### Flavio Soares Correa da Silva (University of São Paulo – Brazil)

Machine Learning (ML) results have been propelled forward in recent years, attracting a lot of attention from the general press and large corporations. ML is one subfield belonging to the broader scientific and technological initiative named Artificial Intelligence (AI) in the mid-1950s. In this presentation, we shall briefly go through the history and foundations of AI, in order to (1) appreciate where it came from and how, to some extent, it has preserved methodological coherence since its infancy, (2) attempt to interpret recent results from a broader perspective, and (3) formulate possible future steps that can ensure that the relevance of AI continues to grow and be judged as positive from both scientific and technological perspectives. Along the way we shall dispel some myths about the practicality of pursuing the development of an Artificial General Intelligence.

Contact: Florence Center for Data Science - Prof. Anna Gottard

16/11/2023 at 11.00

### Average-case behaviour of mixing times for Gibbs samplers through Bayesian asymptotics

### Filippo Ascolani (Department of Decision Sciences, Bocconi University, Milan)

Gibbs samplers are popular algorithms to approximate posterior distributions arising from Bayesian hierarchical models. Despite their popularity and good empirical performance, however, there are still relatively few quantitative theoretical results on their scalability or lack thereof, e.g. far fewer than for gradient-based sampling methods. We introduce a novel technique to analyse the asymptotic behaviour of mixing times of Gibbs samplers, based on tools of Bayesian asymptotics. Our methodology applies to high-dimensional regimes where both the number of datapoints and the number of parameters increase, under random data-generating assumptions. This allows us to provide a fairly general framework to study the complexity of Gibbs samplers fitting Bayesian hierarchical models. The methodology is illustrated on two-level hierarchical models with generic likelihood. For this class, we are able to provide dimension-free convergence results for Gibbs samplers under mild conditions. If time permits, extensions to other coordinate-wise schemes (e.g. Metropolis-within-Gibbs) will also be discussed.
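For readers unfamiliar with the algorithm class, here is a minimal sketch of a Gibbs sampler for one very simple two-level hierarchical model. It is not the speaker's methodology and says nothing about mixing times. It assumes a Gaussian model with known variances, y_ij ~ N(theta_j, sigma^2) and theta_j ~ N(mu, tau^2), with a flat prior on mu. The function name, the default settings and the toy data are illustrative assumptions.

```python
import random
from statistics import mean


def gibbs_two_level(groups, sigma=1.0, tau=1.0, n_iter=5000, burn_in=500, seed=0):
    """Gibbs sampler for y_ij ~ N(theta_j, sigma^2), theta_j ~ N(mu, tau^2), flat prior on mu.

    Returns the post-burn-in draws of the global mean mu.
    """
    rng = random.Random(seed)
    n_groups = len(groups)
    mu = mean(mean(g) for g in groups)  # start at the mean of the group means
    mu_draws = []
    for it in range(n_iter):
        # Draw each group mean from its Gaussian full conditional given mu and the data.
        thetas = []
        for g in groups:
            precision = len(g) / sigma**2 + 1.0 / tau**2
            cond_mean = (sum(g) / sigma**2 + mu / tau**2) / precision
            thetas.append(rng.gauss(cond_mean, precision ** -0.5))
        # Draw the global mean from its full conditional given the group means.
        mu = rng.gauss(mean(thetas), tau / n_groups ** 0.5)
        if it >= burn_in:
            mu_draws.append(mu)
    return mu_draws


# Toy balanced data (hypothetical): three groups with sample means 2, 5 and 8.
toy_groups = [[1.0, 2.0, 3.0, 2.0], [4.0, 5.0, 6.0, 5.0], [7.0, 8.0, 9.0, 8.0]]
mu_draws = gibbs_two_level(toy_groups)
posterior_mean_mu = mean(mu_draws)
```

With balanced groups and a flat prior on mu, the exact posterior mean of mu is the average of the group means (5 for the toy data), so the Monte Carlo average should land close to that value.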

Contact: Veronica Ballerini

10/11/2023 at 16.00 - Please register here to participate online: https://us02web.zoom.us/meeting/register/tZUvduGsqDMuHNH4L_VZ7r6lTMMPltX3WwZu#/registration

### When statistics meets AI: Bayesian modeling of spatial biomedical data

### Qiwei Li (University of Texas at Dallas)

Statistics relies more on human analyses with computer aids, while AI relies more on computer algorithms with aids from humans. Nevertheless, expanding the statistics concourse at each milestone provides new avenues for AI and creates new insights in statistics. Findings initiated on either side, statistics or AI, end up benefiting the other. In this talk, I will demonstrate how the marriage between spatial statistics and AI leads to more explainable and predictable paths from raw spatial biomedical data to conclusions. The first part concerns the spatial modeling of AI-reconstructed pathology images. Recent developments in deep-learning methods have enabled us to identify and classify individual cells from digital pathology images at a large scale. The randomly distributed cells can be considered as arising from a marked point process. I will present two novel Bayesian models for characterizing spatial correlations in a multi-type spatial point pattern. The new method provides a unique perspective for understanding the role of cell-cell interactions in cancer progression, demonstrated through a lung cancer case study. The second part concerns the spatial modeling of the emerging spatially resolved transcriptomics data. Recent technology breakthroughs in spatial molecular profiling have enabled the comprehensive molecular characterization of single cells while preserving their spatial and morphological contexts. This new bioinformatics scenario advances our understanding of molecular and cellular spatial organizations in tissues, fueling the next generation of scientific discovery. I will focus on how to integrate information from AI tools into Bayesian models to address some key questions in this field, such as spatial domain identification and gene expression reconstruction at the single-cell level.

Referente: Florence Center for Data Science - Prof.ssa Anna Gottard

27/10/2023 ore 16.00

### Proximal MCMC for Approximate Bayesian Inference of Constrained and Regularized Estimation

### Eric Chi (Rice University)

In this talk I will introduce some extensions to the proximal Markov Chain Monte Carlo (Proximal MCMC) – a flexible and general Bayesian inference framework for constrained or regularized parametric estimation. The basic idea of Proximal MCMC is to approximate nonsmooth regularization terms via the Moreau-Yosida envelope. Initial proximal MCMC strategies, however, fixed nuisance and regularization parameters as constants, and relied on the Langevin algorithm for the posterior sampling. We extend Proximal MCMC to the full Bayesian framework with modeling and data-adaptive estimation of all parameters including regularization parameters. More efficient sampling algorithms such as the Hamiltonian Monte Carlo are employed to scale Proximal MCMC to high-dimensional problems. Our proposed Proximal MCMC offers a versatile and modularized procedure for the inference of constrained and non-smooth problems that is mostly tuning parameter free. We illustrate its utility on various statistical estimation and machine learning tasks.
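To make the smoothing idea concrete, the sketch below works through the Moreau-Yosida envelope of the absolute value, a standard textbook example that is not taken from the talk. For a function f and a parameter lam > 0, the envelope is the minimum over y of f(y) + (x - y)^2 / (2 * lam); for f(y) = |y| the minimizer is the soft-thresholding operator. Function names are illustrative assumptions.

```python
def prox_abs(x, lam):
    """Proximal operator of f(y) = |y| with parameter lam (soft-thresholding)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def moreau_envelope_abs(x, lam):
    """Moreau-Yosida envelope of |.| at x: f(p) + (x - p)^2 / (2 * lam), p = prox."""
    p = prox_abs(x, lam)
    return abs(p) + (x - p) ** 2 / (2.0 * lam)


def grad_moreau_envelope_abs(x, lam):
    """Gradient of the envelope, (x - prox(x)) / lam; it is defined everywhere."""
    return (x - prox_abs(x, lam)) / lam


for x in (-2.0, -0.2, 0.0, 0.2, 2.0):
    print(x, moreau_envelope_abs(x, lam=0.5), grad_moreau_envelope_abs(x, lam=0.5))
```

The envelope is quadratic near zero and linear in the tails, so the kink of |x| at the origin is replaced by a differentiable surrogate, which is what makes gradient-based samplers applicable.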

Referente: Florence Center for Data Science - Prof.ssa Anna Gottard

13/10/2023 ore 16.00 - Please register here to participate online: https://us02web.zoom.us/meeting/register/tZ0rceyhqzIoE90cdni3tyeY49-yKlvdbLl0#/registration

### Nonparametric Copula Models for Multivariate, Mixed, and Missing Data

### Daniel Kowal (Rice University)

Modern datasets commonly feature both substantial missingness and many variables of mixed data types, which present significant challenges for estimation and inference. Complete case analysis, which proceeds using only the observations with fully-observed variables, is often severely biased, while model-based imputation of missing values is limited by the ability of the model to capture complex dependencies among (possibly many) variables of mixed data types. To address these challenges, we develop a novel Bayesian mixture copula for joint and nonparametric modelling of multivariate count, continuous, ordinal, and unordered categorical variables, and deploy this model for inference, prediction, and imputation of missing data. Most uniquely, we introduce a new and computationally efficient strategy for marginal distribution estimation that eliminates the need to specify any marginal models yet delivers posterior consistency for each marginal distribution and the copula parameters under missingness-at-random. Extensive simulation studies demonstrate exceptional modelling and imputation capabilities relative to competing methods, especially with mixed data types, complex missingness mechanisms, and nonlinear dependencies. We conclude with a data analysis that highlights how improper treatment of missing data can distort a statistical analysis, and how the proposed approach offers a resolution.

Referente: Florence Center for Data Science - Prof.ssa Anna Gottard

04/10/2023 ore 12.00

### Spatially constrained co-clustering to detect latent patterns in mass spectrometry imaging

### Giulia Capitoli (Department of Medicine and Surgery, University Milan-Bicocca, Monza)

Over the last ten years, Matrix-Assisted Laser Desorption Ionisation (MALDI) Mass Spectrometry Imaging (MSI) has been one of the key technologies for cancer biomarker discovery directly in situ. This technology partitions a tissue sample into a grid of pixels, and a mass spectrum is acquired for each pixel. However, the standard statistical methods do not fully address the pixel spatial dependencies. In this work, we investigate a co-clustering statistical model that partitions the molecules' spatial expression profiles, considering the spatial dependencies between neighbouring pixels. We aim to detect groups of pixels that present similar patterns to extract interesting insights, such as anomalies that one cannot capture from the original tissue morphology. We perform co-clustering via Non-negative Matrix Tri-Factorization (NMTF) to infer the latent block structure of the data and to induce two types of clustering: 1) of the molecules, using their expression across the tissue, and 2) of the image areas, using the coordinates of the pixels. Our proposed methodology is investigated with a series of simulation experiments. We evaluate the ability of the model to recover the subgroups (pixels) and the variables (molecules) that drive the clustering in a real-case application study.

Referente: Monia Lupparelli

28/09/2023 ore 12.00

### KnowledgePanel Europe. Set-up and development challenges of Ipsos random probability panel

### Livia Ridolfi (Ipsos Italia)

In today's digital age and high demands for speed and cost-effectiveness, traditional random probability methods are under pressure. These methods are often associated with high costs, are time-consuming, and show increasing difficulty in reaching hard-to-reach subgroups (low income, low education, youth, minorities, offline population). While the acceptance of online access panels came about because of the need for faster turnaround time, the struggle to publish results that can withstand academic scrutiny remains. KnowledgePanel covers many of these aspects and differentiates from traditional opt-in online panels by its underlying methodology. KnowledgePanel is a random-probability online panel and has been operating in the US since 1999 and in the UK since 2020. Building on this work, we are now expanding the offering across Europe. Although probability-based recruitment is considered the gold standard in research, it is not without its challenges. Recruiting hard-to-reach populations (e.g. youth/65+, low-educated, those not internet-savvy), while maintaining a random probability strategy, remains an ongoing challenge, as well as panel engagement and attrition. An additional challenge for KnowledgePanel in Europe is related to its cross-national design. While the cross-national set-up makes it possible to unify the approach, differences between countries - such as local framework, available sampling resources and constraints - must also be taken into account. Solutions that work in some countries are not feasible in others, emphasizing the need for adopting a more flexible approach. In this seminar, we highlight the methodological, practical and cross-national challenges faced during the KnowledgePanel set-up and development.

Referente: Raffaele Guetto

07/09/2023 ore 14:00

### Costationary Bootstrap Inference for Locally Stationary Time Series

### Alessandro Cardinali (University of Plymouth, UK)

In this presentation we illustrate some bootstrap methods to estimate and test time-varying parameters of locally stationary time series models. These methods are based on costationary combinations, that is, time-varying deterministic linear combinations of (possibly) locally stationary time series that are second-order stationary. We first review the concept of costationarity and the estimation of the coefficient vectors, which is obtained by optimizing suitable loss functions. Costationary Bootstrap is based on repeating the optimization from random starting points, obtaining a system of costationary series that can be used to devise further inferential methodologies. We use this framework for both estimation and testing purposes. First we illustrate an efficient estimator for the time-varying covariance of locally stationary time series. We show that the new covariance estimator has smaller variance than the available estimators exclusively based on the time-varying cross-periodogram, and is therefore appealing in a large number of applications. We confirm our findings through a simulation experiment. We also show that, when applied to real data, our new estimator compares favorably with existing approaches, being capable of highlighting well-known economic shocks in a clearer manner. We also use the above framework to derive a bootstrap stationarity test for locally stationary time series. The finite-sample performance of this test is assessed through simulations showing that our method successfully controls rejection rates for stationary and locally stationary processes with both Gaussian and Student-t distributed innovations. When we apply our test to financial returns, we are able to associate the detection of non-stationarities with the occurrence of known economic shocks.

Referente: Monia Lupparelli

06/09/2023 ore 12.00

### Giammarco Alderotti: A journey, just begun, into family demography

Marco Cozzani: The heterogeneous consequences of adverse events

Maria Veronica Dorgali: Understanding health behaviours through the lens of psychology

### Welcome seminar:

Giammarco Alderotti, Marco Cozzani, Maria Veronica Dorgali

**Giammarco Alderotti:**

My past research has explored various aspects related to fertility, fertility intentions (especially their correlation with employment uncertainty), migrant fertility, union formation, and union dissolution (with a particular emphasis on grey divorces). In this welcome seminar, I will present the key findings from my previous research endeavours and offer insights into my ongoing and future projects in the field of family demography.

**Marco Cozzani:**

In this seminar, I will provide an overview of my past research, focusing on three studies on the heterogeneous consequences of exogenous adverse events on socio-demographic outcomes such as birth outcomes and fertility. The first study examines the impact of a terrorist attack to estimate the effect of prenatal maternal stress on birth outcomes such as low birth weight and preterm births. The second study explores the effect of extreme temperatures on birth outcomes. The third study centers on the implications of the COVID-19 pandemic for fertility and birth outcomes. Finally, I will provide a brief preview of my new research.

**Maria Veronica Dorgali:**

My previous research primarily focused on studying human behaviour in relation to health. Specifically, I am interested in combining psychological and statistical methods to achieve a more holistic comprehension of factors influencing health and overall human behaviours. I will provide a broad overview of my previous research works during this seminar. The discussion will focus on studies investigating individual choices regarding vaccination and overall health behaviours. The talk will then focus on the PNRR project, which aims to study housing characteristics’ role in older adults’ perceived health. While the presentation will emphasise past research projects, an overall outline of my current research interests will also be offered.

Referente: Raffaele Guetto

19/06/2023 ore 14.00

### On Optimal Correlation-Based Prediction

### George Luta (Georgetown University)

We consider the problem of obtaining predictors of a random variable by maximizing the correlation between the predictor and the predictand. For the case of Pearson’s correlation, the class of such predictors is uncountably infinite and the least-squares predictor is a special element of that class. By constraining the means and the variances of the predictor and the predictand to be equal, a natural requirement for some situations, the unique predictor that is obtained has the maximum value of Lin’s concordance correlation coefficient (CCC) with the predictand among all predictors. Since the CCC measures the degree of agreement, the new predictor is called the maximum agreement predictor. The two predictors are illustrated for three special distributions: the multivariate normal distribution; the exponential distribution, conditional on covariates; and the Dirichlet distribution.
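As a small illustration of why agreement differs from correlation, the sketch below computes Lin's concordance correlation coefficient using its usual definition, 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2), with moments computed with divisor n. The function name and the toy data are illustrative assumptions, not material from the talk.

```python
def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient (moments with divisor n)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    # The squared mean difference penalises location shifts that
    # Pearson's correlation ignores.
    return 2.0 * cxy / (vx + vy + (mx - my) ** 2)


observed = [1.0, 2.0, 3.0, 4.0]
shifted = [value + 1.0 for value in observed]  # perfectly correlated, but biased by +1

print(concordance_correlation(observed, observed))
print(concordance_correlation(observed, shifted))
```

The shifted predictor has Pearson correlation 1 with the observed values, yet its CCC is below 1 because its mean differs, which is why constraining means and variances matters when agreement is the goal.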

15/06/2023 ore 12.00 - The seminar will be held in Aula 003 (formerly Aula A)

### L'indagine ITA.LI quantitativa: Disegno di campionamento, pesi campionari e varianza degli stimatori

### Maurizio Pisati e Mario Lucchini (Università di Milano-Bicocca)

The aim of this seminar is to illustrate some essential aspects of the first wave of the quantitative ITA.LI survey. First, the sampling design will be described, focusing on the characteristics of each stage. Next, the procedure for constructing the sampling weights will be presented, with particular attention to the adjustments for nonresponse and to post-stratification calibration. The last part will be devoted to the uncertainty of the estimators from a design-based perspective.

Referente: Raffaele Guetto

08/06/2023 ore 12.00

### Modeling grouped data via finite nested mixture models: an application to calcium imaging data

### Laura D'Angelo (Dipartimento di Economia, Metodi Quantitativi e Strategie d’Impresa, Università Milano Bicocca)

Recent advancements in miniaturized fluorescence microscopy have made it possible to investigate neuronal responses to external stimuli in awake behaving animals through the analysis of intracellular calcium signals. We propose a nested Bayesian finite mixture specification that allows estimating the underlying spiking activity and, simultaneously, reconstructing the distributions of the calcium transient spikes' amplitudes under different experimental conditions. The proposed model leverages two nested layers of random discrete mixture priors to borrow information between experiments and discover similarities in the distributional patterns of neuronal responses to different stimuli. We show that nested finite mixtures provide a valid alternative to priors based on infinite formulations and can even lead to better performances in some scenarios. We derive several prior properties and compare them with other well-known nonparametric nested models analytically and via simulation.

Referente: Matteo Pedone

05/06/2023 ore 14.00

### An introduction to Quantitative Algebras

### Matteo Mio (CNRS, ENS-Lyon)

Equational reasoning and equational manipulations are widespread in all areas of computer science.
Consider, for example, the optimisation steps performed by a compiler which replaces blocks of code with "equivalent", but more efficient, blocks.
In recent years it has become apparent that sometimes "approximate" equational reasoning techniques are useful and/or necessary.
A block of code might be replaced by another block which is not truly equivalent but, say, equivalent 99% of the time (in a certain statistical sense).
In this talk I will present the basic ideas of the mathematical framework of "Quantitative Algebras", recently proposed by Mardare et al. in [1], which aims at formally developing some of the intuitions mentioned above.

[1] Radu Mardare, Prakash Panangaden, Gordon D. Plotkin: Quantitative Algebraic Reasoning. LICS 2016: 700-709

Referente: Michele Boreale

01/06/2023 ore 12:00

### Fiammetta Menchetti: From high school creativity to cultural heritage conservation: a journey in causal inference

Marta Pittavino: A tale on statistical methods, and their applications, developed around Europe

### Welcome Seminar: Fiammetta Menchetti & Marta Pittavino

**Fiammetta Menchetti:** From high school creativity to cultural heritage conservation: a journey in causal inference

In this talk, I will provide an overview of my research activity in causal inference for time series data. Starting with C-ARIMA and Bayesian multivariate structural time series models that were part of my PhD thesis, I'll then give you a glimpse into my recent collaborations, including a randomized controlled trial to assess the impact of FABLAB's courses on the creativity of Italian high-school students and a machine learning method for counterfactual forecasting for short panels in the absence of controls. The talk will then focus on the PNRR research activity, which aims to study the evolution of the web cracks on Brunelleschi's Santa Maria del Fiore Dome as part of an ongoing and fascinating project on cultural heritage conservation.

**Marta Pittavino:** A tale on statistical methods, and their applications, developed around Europe

In this talk, I will present some statistical methods that I exploited for my research. I will begin by presenting the Additive Bayesian Network (ABN) multivariate methodology, a data-driven technique particularly suitable for inter-dependent data. I will show two applications of ABN in the veterinary epidemiology field. Then, I will move to the illustration of a Bayesian hierarchical model applied to nutritional epidemiology. Specifically, this model relates a measurement error model with a disease model via an exposure model. Afterwards, I will provide examples of quantile regression and forecasting techniques applied to a specific philanthropic-social dataset on charitable deductions for tax incentives for the Canton of Geneva population. Last, but not least, I will conclude this statistical modelling journey by introducing the current demographic project "Forecasting kinship networks and kinless individuals" and showing the first preliminary results on kinless individuals in countries around Europe.

Referente: Monia Lupparelli

22/05/2023 ore 11.00 - The seminar will also be held online; please register here to participate online: https://us02web.zoom.us/webinar/register/WN_4-1IeZgrS6WpY-En-JKoZw

### Young Researchers Seminar FDS: Matt DosSantos DiSorbo - Harvard Business School

### YR Seminar Series FDS:

Matt DosSantos DiSorbo: TBA (Harvard Business School, Boston)

Referente: Florence Center for Data Science

16/05/2023 ore 12:00

### Bayesian time-interaction point process

### Rosario Barone (University of Rome "Tor Vergata")

In the temporal point process framework, behavioral effects correspond to conditional dependence of event rates on times of previous events. This can result in self-exciting processes, where the occurrence of an event at a time point increases the probability of occurrence of a later event, or in self-correcting processes, where the occurrence of an event at a time point decreases the probability of occurrence of a later event. Altieri et al. (2022) defined a time-interaction process as a temporal point process that combines self-exciting and self-correcting point processes, allowing each event to increase and/or decrease the likelihood of future ones. From the Bayesian perspective, we generalize the existing model in several directions: we account for covariates and propose a nonparametric baseline, which guarantees more flexibility and allows us to control for heterogeneity. Also, we let the model parameters be modulated by a discrete-state continuous-time latent Markov process. Posterior inference is performed via efficient Markov chain Monte Carlo (MCMC) sampling, avoiding the implementation of discretization methods like the Forward-Backward or Viterbi algorithms. Indeed, by extending Hobolth and Stone (2009), we propose a data augmentation approach that allows us to simulate the continuous-time latent Markov trajectories. We present applications to simulated data and to terrorist attack data.

References

Altieri, L., Farcomeni, A., and Fegatelli, D. A. (2022). Continuous time-interaction processes for population size estimation, with an application to drug dealing in Italy. Biometrics.

Hobolth, A. and Stone, E. A. (2009). Simulation from endpoint-conditioned, continuous-time Markov chains on a finite state space, with applications to molecular evolution. The Annals of Applied Statistics, 3(3): 1204-1231.
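The contrast between self-exciting and self-correcting behaviour described in the abstract above can be sketched with two commonly used textbook intensities: a Hawkes-type intensity with exponential decay, and an exponential self-correcting intensity. These are generic forms chosen for illustration, not the time-interaction model of Altieri et al. (2022) or its Bayesian extension; all function and parameter names are assumptions.

```python
import math


def self_exciting_intensity(t, history, mu, alpha, beta):
    """Hawkes-type rate: each past event adds alpha * exp(-beta * elapsed time)."""
    return mu + sum(alpha * math.exp(-beta * (t - s)) for s in history if s < t)


def self_correcting_intensity(t, history, mu, beta, alpha):
    """Self-correcting rate: grows with time, scaled by exp(-alpha) per past event."""
    n_past = sum(1 for s in history if s < t)
    return math.exp(mu + beta * t - alpha * n_past)


events = [1.0, 2.0]  # hypothetical event times
t = 2.5

# Past events raise the self-exciting rate above its baseline mu...
print(self_exciting_intensity(t, events, mu=0.5, alpha=0.8, beta=1.0))
# ...and lower the self-correcting rate relative to an empty history.
print(self_correcting_intensity(t, events, mu=0.0, beta=0.3, alpha=0.4))
print(self_correcting_intensity(t, [], mu=0.0, beta=0.3, alpha=0.4))
```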

Referente: Veronica Ballerini

11/05/2023 ore 16.30

### Estimand and analysis strategies for recurrent event endpoints in the presence of a terminal event

### Special Guest Seminar Series: Björn Bornkamp (Statistical Methodologist at Novartis)

Recurrent event endpoints are commonly used in clinical drug development. One example is the number of recurrent heart failure hospitalizations, which is used in the context of clinical trials in the chronic heart failure (CHF) indication. A challenge in this context is that patients with CHF are at an increased risk of dying. For patients who died, further heart failure hospitalizations can no longer be observed. As a treatment may affect both mortality and the number of hospitalizations, a naive comparison of the number of hospitalizations across treatment arms can be misleading even in a randomized clinical trial. An investigational treatment may, for example, reduce mortality compared to a control, but this may lead to more observed hospitalizations if severely ill patients with high risk of repeated hospitalizations die earlier under the control treatment. In this talk we will review this issue and different estimand and analysis strategies. We will then describe a Bayesian modelling strategy to target a principal stratum estimand in detail. The model relies on joint modelling of the recurrent event and death processes with a frailty term accounting for within-subject correlation. The analysis is illustrated in the context of a recent randomized clinical trial in the CHF indication.

Referente: Florence Center for Data Science

08/05/2023 ore 10.00

### Youtube, Giochi e Scratch per imparare meglio

### Tierno Guzman

Can we use YouTube, Shorts, games and the Scratch language to help students learn mathematics and computer science better? I will share my experience: a book of games, a YouTube channel of mathematical challenges, and the teaching of a programming language in middle school.
**Short bio.** Guzman studied mathematics and earned a PhD in mathematics at Sapienza. After a second degree in computer science and 4 years of work as a software engineer, Guzman decided to teach mathematics in middle school, to try to pass on to students a passion for mathematics and coding. Guzman collaborated with Zanichelli on the Computer Science chapters of the book Contaci, and wrote for Zanichelli the book "Giochi e Sfide" and the book "Tutorial Completo di Scratch" (forthcoming). Free-time interests include writing, creating computer programs, keeping foreign languages alive and mountain biking. Guzman runs a YouTube channel dedicated to mathematical games: GuzMat.

Referente: Cecilia Verri

17/04/2023 ore 12

### Daniele Castellana: Machine Learning in Structured Domains

Andrea Marino: Algorithms for the Analysis of (Temporal) Graphs

Francesco Tiezzi: A Tale on Domain-Specific System Engineering: the Case of Multi-Robot Systems

### Welcome Seminar:

Daniele Castellana, Andrea Marino, Francesco Tiezzi

**Daniele Castellana:**

Machine Learning (ML) for structured data aims to build ML models which can handle structured data (e.g. sequences, trees, and graphs). In this setting, common ML approaches for flat data (i.e. vectors) cannot be applied directly. In this seminar, we show the challenges of learning from structured data and discuss how to overcome them. We introduce a tensor framework for learning from trees, highlighting how tensors naturally arise in this context and the role of tensor decompositions in reducing the computational complexity of such an approach. Interestingly, this framework can be applied to both probabilistic and neural models for trees. Lastly, we talk about learning from graphs. In particular, we introduce a probabilistic model for graphs based on a Bayesian Non-Parametric technique, showing its effectiveness in inferring suitable hyper-parameters.

**Andrea Marino:**

In a society in which we are almost always “connected”, the capability of acquiring different information has become a source of huge amounts of data to be elaborated and analysed. Graphs are data structures that allow modeling the relationships in these data, effectively representing millions of nodes and billions of connections between nodes as edges. For example, a social network can be seen as a graph whose nodes are the users and whose edges correspond to the friendships between them. In such a context, discovering the most important or influential users, which communities they belong to, and what the structure of their society is can be translated into discovering suitable patterns on graphs. In some scenarios, additional constraints must be considered: for instance, in temporal graphs, connections are available only at prescribed times, in the same way as connections of a public transportation system are available according to a time schedule. In temporal graphs, walks and paths make sense only if the legs of the trip are time consistent, i.e. the departure of a leg is after the arrival of the previous one, and structural patterns must take this constraint into account. Our work is devoted to understanding whether such patterns can be computed efficiently, i.e. in time polynomial with respect to the size of the graph, hopefully linear if the considered graphs are big. During the talk we will see a high-level overview of our contributions in the field.

**Francesco Tiezzi:**

The first part of the seminar introduces challenges and methodologies concerning the (possibly formal) engineering of domain-specific distributed systems. The second part focuses on a specific case: the development of software for robotics applications involving multiple heterogeneous robots. Programming such distributed software requires coordinating multiple cooperating tasks to avoid unsatisfactory emerging behaviors. We then present a programming language devised to face this challenge. The language supports an approach for programming multi-robot systems at a high abstraction level, allowing developers to focus on the system behavior and to achieve readable, reusable, and maintainable code.

Referente: Andrea Marino

04/04/2023 ore 15:00

### Kernel regression estimation for a circular response and different types of covariates

### Andrea Meilan Vila (Universidad Carlos III de Madrid)

The analysis of a variable of interest that depends on other variable(s) is a typical issue appearing in many practical problems. Regression analysis provides the statistical tools to address this type of problem. This topic has been deeply studied, especially when the variables in the study are of Euclidean type. However, there are situations where the data present certain kinds of complexities, for example, the involved variables are of circular or functional type, and the classical regression procedures designed for Euclidean data may not be appropriate. In these scenarios, these techniques would have to be conveniently modified to provide useful results. Moreover, it might occur that the variables of interest can present a certain type of dependence. For example, they can be spatially correlated, where observations that are close in space tend to be more similar than observations that are far apart. This work aims to design and study new approaches to deal with regression function estimation for models with a circular response and different types of covariates. For an R^d-valued covariate, nonparametric proposals to estimate the circular regression function are provided and studied, under the assumption of independence and also for spatially correlated errors. These estimators are also adapted for regression models with a functional covariate. In the above-mentioned frameworks, the asymptotic bias and variance of the proposed estimators are calculated. Some guidelines for their practical implementation are provided, checking their sample performance through simulations. Finally, the behavior of the estimators is also illustrated with real data sets.

Referente: Agnese Panzera, Anna Gottard

30/03/2023 ore 12.00

### Co-clustering Models for Spatial Transcriptomics

### Andrea Sottosanti (Università di Padova)

Spatial transcriptomics is a cutting-edge technology that, differently from traditional transcriptomic methods, allows researchers to recover the spatial organization of cells within tissues and to map where genes are expressed in space. By examining previously hidden spatial patterns of gene expression, researchers can identify distinct cell types and study the interactions between cells in different tissue regions, leading to a deeper understanding of several key biological mechanisms, such as cell-cell communication or tumour-microenvironment interaction. During this talk, we will be presenting novel statistical tools that exploit the previously unavailable spatial information in transcriptomics to coherently group cells and genes. First, we will introduce SpaRTaCo, a new model that clusters the gene expression profiles according to a partition of the tissue. This is accomplished by performing a co-clustering, that is, a simultaneous clustering of the genes using their expression across the tissue, and of the image areas using the gene expression in the locations where the RNA is collected. Then, we will show how to use SpaRTaCo when a previous annotation of the cell types is available, incorporating biological knowledge into the statistical analysis. Last, we will discuss a new modelling solution that exploits recent advances in sparse Bayesian estimation of covariance matrices to reconstruct the spatial covariance of the data in a sparse and flexible manner, significantly reducing the computational cost of the model estimation. By applying these tools to tissue samples processed with recent spatial transcriptomic technologies, we can gain interesting and promising biological insights. 
The models are in fact able to reveal the presence of specific variation patterns in some restricted areas of the tissue that cannot be directly uncovered using other methods in the literature, and, inside each image cluster, they can detect genes that carry out specific and relevant biological functions.

Referente: Monia Lupparelli

28/03/2023 ore 11.30 - Auditorium A, Polo Didattico, Viale Morgagni 40 - Università degli Studi di Firenze, Firenze, Italy

### “Combining Experimental and Observational Data”

### Special Guest Lecture - Economics Lecture by Nobel Prize 2021: Guido W. Imbens (Stanford University)

In the social sciences there has been an increase in interest in randomized experiments to estimate causal effects, partly because their internal validity tends to be high, but they are often small and contain information on only a few variables. At the same time, as part of the big data revolution, large, detailed, and representative administrative data sets have become more widely available. However, the credibility of estimates of causal effects based on such data sets alone can be low.

In this paper, we develop statistical methods for systematically combining experimental and observational data to improve the credibility of estimates of the causal effects.

We focus on a setting with a binary treatment where we are interested in the effect on a primary outcome that we only observe in the observational sample. Both the observational and experimental samples contain data about a treatment, observable individual characteristics, and a secondary (often short term) outcome.

To estimate the effect of a treatment on the primary outcome, while accounting for the potential confounding in the observational sample, we propose a method that makes use of estimates of the relationship between the treatment and the secondary outcome from the experimental sample. We interpret differences in the estimated causal effects on the secondary outcome between the two samples as evidence of unobserved confounders in the observational sample, and develop control function methods for using those differences to adjust the estimates of the treatment effects on the primary outcome.

We illustrate these ideas by combining data on class size and third grade test scores from the Project STAR experiment with observational data on class size and both third and eighth grade test scores from the New York school system.

Co-authors: Susan Athey and Raj Chetty

Referente: Florence Center for Data Science - DISIA - European University Institute EUI

23/03/2023 ore 12.00

### Concentration of discrepancy-based ABC via Rademacher complexity

### Sirio Legramanti (University of Bergamo)

There has been increasing interest in summary-free versions of approximate Bayesian computation (ABC), which replace distances among summaries with discrepancies between the whole empirical distributions of the observed data and the synthetic samples generated under the proposed parameter values. The success of these solutions has motivated theoretical studies on the concentration properties of the induced posteriors. However, current results are often specific to the selected discrepancy, and mostly rely on existence arguments which are typically difficult to verify and provide bounds that are not readily interpretable. We address these issues via a novel bridge between the concept of Rademacher complexity and recent concentration theory for discrepancy-based ABC. This perspective yields a unified and interpretable theoretical framework that relates the concentration of ABC posteriors to the behavior of the Rademacher complexity associated with the chosen discrepancy in the broad class of integral probability semimetrics. This class extends summary-based ABC, and includes the widely-implemented Wasserstein distance and maximum mean discrepancy (MMD), which admit interpretable bounds for the corresponding Rademacher complexity along with constructive sufficient conditions for the existence of such bounds. Therefore, this unique bridge crucially contributes towards an improved understanding of ABC, as further clarified through a focus of this theory on the MMD setting and via an illustrative simulation. (Joint work with Daniele Durante and Pierre Alquier) ArXiv: https://arxiv.org/abs/2206.06991
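To make the idea of discrepancy-based ABC concrete, here is a minimal sketch of rejection ABC that compares whole samples through a squared MMD with a Gaussian kernel. Everything specific in it is an assumption for illustration and is not taken from the talk: the Gaussian location model, the uniform prior on [-5, 5], the unit kernel bandwidth, and the rule of keeping the draws with the smallest discrepancy instead of using a fixed threshold.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel between two real numbers (bandwidth is an assumed choice)."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys):
    """Average kernel value over all pairs (x, y)."""
    return sum(gaussian_kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd_squared(xs, ys):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def abc_mmd(observed, n_draws=500, keep=25, seed=0):
    """Rejection ABC: keep the `keep` prior draws whose synthetic sample is closest in MMD."""
    rng = random.Random(seed)
    scored = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)  # assumed uniform prior on the location
        synthetic = [rng.gauss(theta, 1.0) for _ in observed]  # assumed model N(theta, 1)
        scored.append((mmd_squared(observed, synthetic), theta))
    scored.sort()  # smallest discrepancy first
    return [theta for _, theta in scored[:keep]]


# Hypothetical observed data: 30 draws from N(2, 1).
data_rng = random.Random(1)
observed = [data_rng.gauss(2.0, 1.0) for _ in range(30)]
accepted = abc_mmd(observed)
print("ABC posterior mean:", sum(accepted) / len(accepted))
```

No summary statistic is chosen anywhere: the discrepancy acts on the full empirical distributions, which is the setting whose concentration properties the abstract studies.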

Referente: Monia Lupparelli

20/03/2023 ore 15.30

### Shewhart and Profile Monitoring for Industry 4.0

### G. Geoffrey Vining (Virginia Tech, USA)

Please see the attached file.

Documenti: Abstract

Referente: Prof.ssa Rossella Berni

17/03/2023 ore 14.30 - The seminar will be on-line, please register here to participate online: https://us02web.zoom.us/webinar/register/WN_DIlCeuHERia0L7SLrmuWGQ

### D2 Seminar Series:

Alberto Cassese: “Bayesian negative binomial mixture regression models for the analysis of sequence count and methylation data”

Chiara Bocci: “Sampling design for large-scale geospatial phenomena using remote sensing data”

### Doppio Seminario FDS:

Alberto Cassese & Chiara Bocci - DISIA, University of Florence

Alberto Cassese:

“Bayesian negative binomial mixture regression models for the analysis of sequence count and methylation data”

A Bayesian hierarchical mixture regression model is developed for studying the association between a multivariate response, measured as counts on a set of features, and a set of covariates. We have available RNASeq and DNA methylation data on breast cancer patients at different stages of the disease. We account for heterogeneity and over-dispersion of count data by considering a mixture of negative binomial distributions and incorporate the covariates into the model via a linear modeling construction on the mean components. Our modeling construction employs selection techniques allowing the identification of a small subset of features that best discriminate the samples, simultaneously selecting a set of covariates associated with each feature. Additionally, it incorporates known dependencies into the feature selection process via Markov random field priors. On simulated data, we show how incorporating existing information via the prior model can improve the accuracy of feature selection. In the case study, we incorporate knowledge on relationships among genes via a gene network, extracted from the KEGG database. Our data analysis identifies genes that are discriminatory of cancer stages and simultaneously selects significant associations between those genes and DNA methylation sites. A biological interpretation of our findings reveals several biomarkers that can help to understand the effect of DNA methylation on gene expression transcription across cancer stages.

Chiara Bocci:

“Sampling design for large-scale geospatial phenomena using remote sensing data”

Referente: Florence Center for Data Science

16/03/2023 ore 14.30

### How women’s employment instability affects birth transitions? The moderating role of family policies in 27 European Countries

### Chen-Hao Hsu (University of Bamberg)

Why are women in some countries more likely than others to postpone childbirth when facing employment instability? This study uses 2010-2019 EU-SILC panel data to explore whether women’s unemployment and temporary employment affect their first- and second-birth transitions and how such patterns differ across 27 European countries. Results show that while unemployment and temporary employment generally delay women’s first- and second-birth transition, such effects vary across European countries and depend on the levels of family policy provisions. More generous family cash benefits may mitigate the negative effects of women’s unemployment on the first birth and temporary employment on the second birth transitions. On the other hand, the effect of women’s employment instability depends less on the length of paid maternity/parental leaves. Most strikingly, higher childcare coverage rates are associated with more negative effects of women’s temporary employment on the first birth and unemployment on the second birth transitions.

06/03/2023 ore 11.00 - In presenza in aula 205 (ex32)

### A Theory of (co-lex) Ordered Regular Languages

### Nicola Prezza (Università Ca' Foscari, Venezia)

**Abstract:** NFAs are inherently unordered objects, but they represent regular languages on which one can very naturally define a total order: for example, the co-lexicographic order in which words are compared alphabetically from right to left. In this talk I will show that interesting things happen when one tries to map this total order to the states of an accepting NFA for the language: the resulting order of the states is a partial pre-order whose width p turns out to be an important parameter for NFAs and regular languages. For example, take the classic powerset determinization algorithm for converting an NFA of size n into an equivalent DFA: while a straightforward analysis shows that the size of the resulting DFA is at most 2^n, we prove that it is actually at most (n-p+1)*2^p. This implies that PSPACE-complete problems such as NFA equivalence or universality are actually easy on NFAs of small width p (the case p=1 - total order - is particularly interesting). Another implication of this theory is that we can compress NFAs to just O(log p) bits per transition while supporting fast membership queries in the substring closure of the language.
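The powerset determinization mentioned above can be sketched in a few lines. The example NFA below is hypothetical, and `width_bound` only evaluates the formula (n-p+1)*2^p quoted in the abstract; computing the co-lex width p of an NFA is not shown here.

```python
from collections import deque


def determinize(nfa, start, alphabet):
    """Powerset construction: return the reachable DFA states and the DFA transitions.

    `nfa` maps (state, symbol) to a set of successor states; each DFA state
    is a frozenset of NFA states.
    """
    initial = frozenset([start])
    dfa_states = {initial}
    transitions = {}
    queue = deque([initial])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            # Union of the NFA successors of every state in `current`.
            target = frozenset(t for s in current for t in nfa.get((s, symbol), ()))
            transitions[(current, symbol)] = target
            if target not in dfa_states:
                dfa_states.add(target)
                queue.append(target)
    return dfa_states, transitions


def width_bound(n, p):
    """Evaluate the bound (n - p + 1) * 2^p for an NFA with n states and width p."""
    return (n - p + 1) * 2 ** p


# Hypothetical 4-state NFA over {a, b}: words whose third-to-last letter is 'a'.
nfa = {
    (0, "a"): {0, 1}, (0, "b"): {0},
    (1, "a"): {2}, (1, "b"): {2},
    (2, "a"): {3}, (2, "b"): {3},
}
states, _ = determinize(nfa, 0, "ab")
print(len(states), "reachable DFA states, against the naive bound", 2 ** 4)

# Pure arithmetic: for n = 20 and a small width p = 2 the two bounds differ widely.
print(width_bound(20, 2), "versus", 2 ** 20)
```

The second print shows why a small width matters: the bound grows exponentially only in p, not in n.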

**Biosketch:** Nicola Prezza is an Associate professor at Ca' Foscari University of Venice, Italy. He received a PhD in Computer Science from the University of Udine in 2017 with a thesis on dynamic compressed data structures. After that, he worked as post-doc researcher at the universities of Pisa and Copenhagen (DTU) and as Assistant professor at LUISS (Rome). His current research is focused on the relations existing between data structures, data compression, and regular languages. In 2018, he received the "Best Italian Young Researcher in Theoretical Computer Science" award from the Italian chapter of EATCS. In 2021, he won an ERC starting grant on the topic of regular language compression and indexing.

Referente: Andrea Marino

03/03/2023 ore 14.30 - The seminar will be on-line, please register here to participate online: https://us02web.zoom.us/webinar/register/WN_GprXsGF7Ti6sQ8uyNfqKpQ

### D2 Seminar Series

Augusto Cerqua & Marco Letta: “Losing control (group)? The Machine Learning Control Method for counterfactual forecasting”

### Doppio seminario FDS:

Augusto Cerqua & Marco Letta - Department of Social Sciences and Economics, Sapienza University of Rome

The standard way of estimating treatment effects relies on the availability of a similar group of untreated units. Without it, the most widespread counterfactual methodologies cannot be applied. We tackle this limitation by presenting the Machine Learning Control Method (MLCM), a new causal inference technique for aggregate data based on counterfactual forecasting via machine learning. The MLCM is suitable for the estimation of individual, average, and conditional average treatment effects in evaluation settings with short panels and no controls. The method is formalized within Rubin’s Potential Outcomes Model and comes with a full set of diagnostic, performance, and placebo tests. We illustrate our methodology with an empirical application on the short-run impacts of the COVID-19 crisis on income inequality in Italy, which reveals a striking heterogeneity in the inequality effects of the pandemic across the Italian local labor markets.

Referente: Florence Center for Data Science

28/02/2023 ore 14.30 - The seminar will be on-line, please register here to participate online: https://us02web.zoom.us/webinar/register/WN_O2wv8qTvRBWYKcyZw2qOrQ

### Special Guest Seminar:

Riccardo Michielan: “Is there geometry in real networks?”

### Special Guest Seminar Series:

Riccardo Michielan - University of Twente

Riccardo Michielan:

“In the past decade, many geometric network models have been developed, assuming that each vertex is associated with a position in some underlying topological space. Geometric models formalize the idea that similar vertices are naturally likely to connect. Moreover, these models are able to reproduce many properties which are commonly observed in real networks. On the other hand, it is not always possible to infer the presence of geometry in real networks, if the edge connections are the only observables. The aim of this talk is to formalize a simple statistic which counts weighted triangles: this statistic discounts the triangles that are almost surely not caused by geometry. Then, using weighted triangles we will be able to elaborate a robust technique to distinguish whether real networks are embedded in a geometric space or not.”

Referente: Florence Center for Data Science

17/02/2023 ore 14.30 - The Seminar will be available also online. Please register here to participate online: https://us02web.zoom.us/webinar/register/WN_SLRoRT_DQL-nCqVPJb6xLQ

### D2 Seminar Series

Elena Stanghellini: “Causal effects for binary variables: parametric formulation and sensitivity”

Gianluca Iannucci: “The interaction between emission tax and insurance in an evolutionary oligopoly”

### Doppio seminario FDS:

Elena Stanghellini - Department of Economics, University of Perugia & Gianluca Iannucci - Department of Economics and Management, University of Florence

Elena Stanghellini:

“The talk will focus on causal effects of a treatment on a binary outcome. I shall review some results for one single binary mediator, and show how these can be extended to the multiple mediator case. Particular focus shall be put on two mediators, with the aim to isolate sensitivity parameters against the identifying assumptions. If time permits, extensions to outcome dependent sampling schemes will also be addressed. This talk is based on joint work with: Paolo Berta, Marco Doretti, Minna Genbäck, Martina Raggi.”

Gianluca Iannucci:

“It is now commonly accepted that polluting companies deeply contribute to climate change. Environmental losses significantly impact companies’ profits, so they have to manage them through different strategies to survive on the market. The model assumes two types of firms, polluting and non-polluting, playing a Cournot-Nash game. Due to their different impact on the environment, polluting firms have to pay an emission tax. Both types of firms are risk averse and can cover the potential climate change loss by choosing insurance coverage. From the comparative static analysis computed at the equilibrium, a substitution effect emerges between insurance and taxation. Moreover, insurance can help clean firms to compete with dirty ones. Finally, we endogenize the market structure through an evolutionary setting and perform comparative dynamics to confirm the interplay of taxation and insurance that arises from the analytical results in order to nudge an ecological transition.”

Referente: Florence Center for Data Science

07/02/2023 ore 10.00

### Double truncation method for controlling local false discovery rate in case of spiky null

### Jaesik Jeong (Chonnam National University, Korea)

Many multiple test procedures that control the false discovery rate (FDR) have been developed to identify cases (e.g. genes) showing statistically significant differences between groups. A highly spiky null is often reported in data sets from practice. When it occurs, existing methods have difficulty controlling the type I error due to the ‘inflated false positives’. No attention has been given to this in previous literature. Recently, some of us encountered the problem in the analysis of SET4 gene deletion data and proposed to model the null with a scale mixture normal distribution. However, its use is very limited due to the strong assumptions on the spiky peak (e.g. symmetric peak with respect to 0). In this paper, we propose a new multiple test procedure that can be applied to any type of spiky peak data, even to situations with no spiky peak or with more than one spiky peak. In our procedure, we truncate the central statistics around 0, which mainly contribute to the spike of the null, as well as the two tails that are possibly contaminated by the alternative. We call this the ‘double truncation method’. After double truncation, the null density estimation is done by the doubly truncated maximum likelihood estimator (DTMLE). We numerically show that the proposed method controls the false discovery rate at the aimed level on simulated data. We also apply our method to two real data sets: SET protein data and peony data.

Referente: Monia Lupparelli

03/02/2023 ore 14.30 - The seminar will be on-line, please register here to participate online: https://us02web.zoom.us/webinar/register/WN_mHAHeMr0RkKgv-eXiUsyzQ

### D2 Seminar Series

Nicola Del Sarto: “One size does not fit all. Business models heterogeneity among Internet of Things architecture layers”

Andrea Mercatanti: “A Regression Discontinuity Design for ordinal running variables: evaluating Central Bank purchases of corporate bonds.”

### Doppio seminario FDS:

Nicola Del Sarto - Department of Economics and Management, University of Florence & Andrea Mercatanti - Department of Statistical Sciences, Sapienza University of Rome

Nicola Del Sarto:

“The new paradigm known as the Internet of Things (IoT) is expected to have a significant impact on business in the coming years, as it leads to the connection of physical objects and the interaction between the digital and physical worlds. While prior literature addressing the business implications arising from this paradigm has largely considered IoT as an integrated technology, in this study we examine different components of IoT and assess whether firms concerned with the development of IoT solutions have adopted original business models to exploit the opportunities offered by the specific IoT architecture layer they operate in. In turn, based on primary survey data collected on a sample of IoT Italy association’s members, we explore different dimensions of the business model and offer a reinterpretation of the business model Canvas framework adapted to the IoT environment. We show that the specificities of each IoT layer require firms to adopt ad hoc business models and focus on different dimensions of the business model Canvas. We believe our research provides some important contributions for both academics and practitioners. For the latter, we provide a tool useful for making decisions on how to design the business model for IoT applications.”

Andrea Mercatanti:

“Regression discontinuity (RD) is a widely used quasi-experimental design for causal inference. In the standard RD, the assignment to treatment is determined by a continuous pretreatment variable (i.e., running variable) falling above or below a pre-fixed threshold. Recent applications increasingly feature ordered categorical or ordinal running variables, which pose challenges to RD estimation due to the lack of a meaningful measure of distance. We propose an RD approach for ordinal running variables under the local randomization framework. The proposal first estimates an ordered probit model for the ordinal running variable. The estimated probability of being assigned to treatment is then adopted as a latent continuous running variable and used to identify a covariate-balanced subsample around the threshold. Assuming local unconfoundedness of the treatment in the subsample, an estimate of the effect of the program is obtained by employing a weighted estimator of the average treatment effect. Two weighting estimators (overlap weights and ATT weights), as well as their augmented versions, are considered. We apply the method to evaluate the causal effects of the corporate sector purchase programme (CSPP) of the European Central Bank, which involves large-scale purchases of securities issued by corporations in the euro area. We find a statistically significant and negative effect of the CSPP on corporate bond spreads at issuance.”
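The two weighting schemes named in the abstract can be illustrated with a small sketch. It assumes the standard definitions (overlap weights: 1 - e for treated units and e for controls; ATT weights: 1 for treated units and e / (1 - e) for controls) and a normalized difference of weighted means. The propensities e are taken as given here, whereas in the talk they come from an ordered probit model; the function name and all numbers are hypothetical, and the augmented versions are not shown.

```python
def weighted_effect(outcomes, treated, propensity, scheme="overlap"):
    """Difference of weighted outcome means between treated and control units."""
    num = {True: 0.0, False: 0.0}
    den = {True: 0.0, False: 0.0}
    for y, z, e in zip(outcomes, treated, propensity):
        if scheme == "overlap":
            # Overlap weights: each unit is weighted by the probability
            # of being assigned to the opposite group.
            w = 1.0 - e if z else e
        elif scheme == "att":
            # ATT weights: treated units count fully, controls are
            # reweighted by the odds of treatment.
            w = 1.0 if z else e / (1.0 - e)
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        num[z] += w * y
        den[z] += w
    return num[True] / den[True] - num[False] / den[False]


# Hypothetical subsample around the threshold.
outcomes = [3.0, 2.5, 2.0, 1.0, 1.5, 0.5]
treated = [True, True, True, False, False, False]
propensity = [0.8, 0.6, 0.5, 0.5, 0.4, 0.2]

print("overlap-weighted effect:", weighted_effect(outcomes, treated, propensity))
print("ATT-weighted effect:", weighted_effect(outcomes, treated, propensity, "att"))
```

Overlap weights emphasize units whose propensity is close to 0.5, which fits the idea of concentrating on a balanced subsample near the threshold.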

Referente: Florence Center for Data Science

26/01/2023 ore 12.00

### Monitoring hidden populations and evaluating policies in the field of addictions

### Sabrina Molinaro and Elisa Benedetti (Consiglio Nazionale delle Ricerche-CNR)

The seminar will focus on the monitoring methods used to study hidden populations in the field of substance and behavioural addictions (e.g. people with substance use disorders, pathological gamblers). We will describe the ad hoc studies developed by the Laboratory of Epidemiology and Health Services Research of IFC-CNR, with particular attention to those used to estimate prevalence, considering both ad hoc surveys and ecological studies that arise from the integration of different data sources. We will then turn to the available databases, with the aim of developing new methods of analysis to estimate the phenomenon and analyse its characteristics. Some examples of health policy evaluation studies developed using the data produced will then be presented. The integration of different data sources for the impact evaluation of public policies in the field of addictions is indeed the most recent challenge the laboratory has taken on. The goal is to develop effective multidisciplinary communication between the world of epidemiology and that of other disciplines, such as social statistics and economics, in order to provide elements for the development of evidence-based public policies.

Referente: Raffaele Guetto

20/01/2023 ore 14.30 - The Seminar will be available also online. Please register here to participate online: https://us02web.zoom.us/webinar/register/WN_mEFLIP8NRFeKE8mQh8BcNw

### D2 Seminar Series

Giacomo Toscano: “Central limit theorems for the Fourier-transform estimator of the volatility of volatility”

Gabriele Fiorentini: “Specification tests for non-Gaussian structural vector autoregressions”

### Doppio seminario FDS:

Giacomo Toscano – DISEI, University of Florence & Gabriele Fiorentini – DISIA, University of Florence

Giacomo Toscano:

“We study the asymptotic normality of two feasible estimators of the integrated volatility of volatility based on the Fourier methodology, which does not require the pre-estimation of the spot volatility. We show that the bias-corrected estimator reaches the optimal rate n^(1/4), while the estimator without bias correction has a slower convergence rate and a smaller asymptotic variance. Additionally, we provide simulation results that support the theoretical asymptotic distribution of the rate-efficient estimator and show the accuracy of the latter in comparison with a rate-optimal estimator based on the pre-estimation of the spot volatility. Finally, using the rate-optimal Fourier estimator, we reconstruct the time series of the daily volatility of volatility of the S&P500 and EUROSTOXX50 indices over long samples and provide novel insight into the existence of stylized facts about the volatility of volatility dynamics.”

Gabriele Fiorentini:

We propose specification tests for independent component analysis and structural vector autoregressions that assess the assumed cross-sectional independence of the non-Gaussian shocks. Our tests effectively compare their joint cumulative distribution with the product of their marginals at discrete or continuous grids of values for its arguments, the latter yielding a consistent test. We explicitly consider the sampling variability from using consistent estimators to compute the shocks. We study the finite sample size of our tests in several simulation exercises, with special attention to resampling procedures. We also show that they have non-negligible power against a variety of empirically plausible alternatives.

Referente: Florence Center for Data Science

17/01/2023 ore 12.00

### Nedka Nikiforova: Design of experiments for technology and for consumers’ preferences

Valentina Tocchioni: A snapshot of my research: from childlessness to higher education research

Pamela Vignolini: Crocus sativus L. flowers valorisation as sources of bioactive compounds

### Welcome seminar:

Nedka Nikiforova, Valentina Tocchioni, Pamela Vignolini

**Nedka Nikiforova:**

Design of experiments (DoE) is a wide and fundamental methodology in statistical theory. It plays a relevant role in improving and solving issues in the fields of technology and consumers’ behaviour. In this seminar, I will present a general overview of my research related to DoE. First, I will focus on a study related to a split-plot design in the technological field. Next, I will address computer experiments and Kriging modelling to solve complex engineering and technological issues, for which physical experimentation could be too costly or, in certain cases, impossible to perform. Lastly, I will present my research related to innovative approaches to build optimal designs for the technological field, and for choice experiments to analyze consumers’ preferences. A further research topic related to the field of quantitative marketing will also be briefly outlined during the talk.

**Valentina Tocchioni:**

During this seminar I will give a general overview of my past research. In particular, I will concentrate on four socio-demographic topics I have been dealing with, such as childlessness, family dynamics – family formation, fertility, and divorce – and their interrelationship with economic uncertainty, sexual behaviours, and higher education in terms of students’ university tracks and PhD students’ and graduates’ work trajectories. Most of the presentation will be based on previously published research, but some hints of current and future directions of my research will be given.

**Pamela Vignolini:**

The application of circular economy principles is of particular interest for the agricultural and agri-food sector, given the large amount of waste matrix of some plant species. In recent decades the attention towards the cultivation of saffron (Crocus sativus L.) has been rediscovered. The saffron produced from dried stigmas of Crocus sativus L. has been known since ancient times for its numerous therapeutic properties. The spice is obtained from the stigmas of the flowers, while petals and stamens are 90% waste material.
Given the considerable amount of polyphenols with high antioxidant activity present in this matrix, recovering the flowers allows their use for innovative purposes in different product sectors such as foods, cosmetics and biomedical applications. In this context, the present work evaluated the polyphenol content in flowers of C. sativus grown in Tuscany, in order to characterize this product from a qualitative-quantitative point of view for various product sectors. The quali-quantitative analysis of the extracts was carried out by HPLC/DAD/MS analysis. Given the potential of this matrix, another aspect of the research consists in evaluating the possible tumor growth inhibition activity of the petal extracts on bladder cancer cell lines.

Referente: Raffaele Guetto

12/01/2023 ore 12.00

### A fresh look to Multiple Frame Surveys for a multi data source world

### Fulvia Mecatti (Università di Milano-Bicocca)

Multiple Frame (MF) Surveys have been around since the 1960s as an effective tool to deal with traditional challenges in sample surveys: to reduce costs and improve population coverage, to cope with "imperfect" (or even non-existent) sampling frames that cannot directly represent the target population, and to increase sample size for sub-populations of interest. In recent years multiple-frame surveys are increasingly considered to deal with newer challenges and emerging needs in our multi data source world. In this seminar we will take a fresh look at MF surveys and at their potential to serve as an organising framework to untangle modern complex multi-structured data problems. Building upon the multiplicity approach as a simplified, unifying and principled approach to MF estimation, we will illustrate how the MF paradigm and reasoning can help with the general issue of integrating data from different sources, and in particular to produce good estimates from complex panel data from large scale longitudinal surveys such as SHARE (http://www.share-project.org/home0.html).
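One way to read the multiplicity approach is that each sampled unit's contribution is divided by the number of frames it belongs to, so that units reachable through several frames are not over-counted. The sketch below assumes a Horvitz-Thompson-type estimator of a population total with this multiplicity adjustment; the estimator form, the function name and all the data are illustrative assumptions, not material from the talk.

```python
def multiplicity_total(samples):
    """Estimate a population total from samples drawn independently from several frames.

    `samples` maps a frame label to a list of (y, inclusion_probability,
    multiplicity) triples, where multiplicity is the number of frames
    that contain the unit.
    """
    total = 0.0
    for units in samples.values():
        for y, pi, multiplicity in units:
            # Inverse-probability weight, shared out across the frames
            # through which the unit could have been selected.
            total += y / (pi * multiplicity)
    return total


# Hypothetical two-frame survey: the unit with y = 20 belongs to both frames.
samples = {
    "frame_A": [(10.0, 0.5, 1), (20.0, 0.5, 2)],
    "frame_B": [(20.0, 0.25, 2), (8.0, 0.25, 1)],
}
print("estimated total:", multiplicity_total(samples))
```

Only the multiplicity of the sampled units is needed, not a full description of how the frames overlap, which is part of what makes the approach simple to apply when combining data sources.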

Referente: Daniele Vignoli

Ultimo aggiornamento 16 maggio 2024.