# Publications

## Job market papers

### 1. Network regression and supervised centrality estimation

##### [ Abstract ] [ Paper ] [ arXiv ]

Networks are ubiquitous and play a crucial role in our lives. The position of an agent in the network, usually captured by its “centrality”, has implications for the agent’s behaviour and serves as an important intermediary of network effects. Centrality is therefore often incorporated into regression models to elucidate the network effect on an outcome variable of interest. In empirical studies, researchers often adopt a two-stage procedure to estimate the centrality and to infer the network effect: they first estimate the centrality from the observed network and then use the estimated centrality in the regression for estimation and inference. Despite its prevalence, this naive two-stage procedure lacks theoretical backing and can fail in both estimation and inference. We therefore propose a unified framework that combines a network model and a network regression model, under which we prove the shortcomings of the two-stage procedure in centrality estimation and the undesirable consequences for the network regression. We then propose a novel supervised network centrality estimation (SuperCENT) methodology that simultaneously combines the information from the two models. SuperCENT universally dominates the two-stage procedure in estimating the centrality and the true underlying network. In addition, SuperCENT yields superior estimation of the network effect and provides valid confidence intervals that are narrower than those from the two-stage procedure. We apply our method to predict the currency risk premium based on the global trade network. We show that a trading strategy based on SuperCENT centrality estimates yields a return three times as high as that of the two-stage method, and that the inference drawn by SuperCENT verifies an economic theory via rigorous statistical testing, while the two-stage procedure cannot.
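As a rough illustration of the naive two-stage procedure described in this abstract (not of SuperCENT itself), the sketch below first estimates a centrality from a noisy observed network and then regresses an outcome on that estimate. Everything here is an assumption made for illustration: the centrality is taken to be the leading eigenvector of a symmetric rank-one network, and the names `eigenvector_centrality`, `ols_fit`, and `two_stage`, as well as the toy data, are hypothetical and do not come from the paper.

```python
import random


def eigenvector_centrality(adj, iters=200):
    """Stage 1: leading eigenvector of a symmetric matrix via power iteration."""
    n = len(adj)
    u = [1.0] * n
    for _ in range(iters):
        v = [sum(adj[i][j] * u[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        u = [x / norm for x in v]  # keep the iterate at unit length
    return u


def ols_fit(x, y):
    """Stage 2: slope and intercept of a simple OLS regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def two_stage(observed_adj, y):
    """Plug the estimated centrality into the regression as if it were exact."""
    u_hat = eigenvector_centrality(observed_adj)
    return u_hat, ols_fit(u_hat, y)


# Toy data (assumed): rank-one network signal plus symmetric Gaussian noise.
rng = random.Random(0)
n, beta_true = 30, 2.0
raw = [rng.uniform(0.5, 1.5) for _ in range(n)]
scale = sum(r * r for r in raw) ** 0.5
u_true = [r / scale for r in raw]  # unit-norm true centrality

adj = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i, n):
        adj[i][j] = adj[j][i] = raw[i] * raw[j] + rng.gauss(0, 0.3)

y = [beta_true * u + rng.gauss(0, 0.05) for u in u_true]

u_hat, (slope, intercept) = two_stage(adj, y)
print("two-stage slope estimate:", slope, "(true value used in simulation:", beta_true, ")")
```

Because the second stage treats `u_hat` as if it were observed without error, the noise in the network carries over into the regression, which is the kind of issue the abstract says the two-stage procedure suffers from.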

### 2. Ownership network and firm growth: What do forty million companies tell about the Chinese economy?

##### [ Abstract ] [ Paper ] [ SSRN ]

The finance–growth nexus has been a central question in understanding the unprecedented success of the Chinese economy. Using unique data on all the registered firms in China, we build extensive firm-to-firm equity ownership networks. Entering a network and increasing network centrality lead to higher firm growth, and the effect of global centralities strengthens over time. The RMB 4 trillion stimulus launched by the Chinese government in 2008 partially “crowded out” the positive network effects. Equity ownership networks and bank credit tend to act as substitutes for state-owned enterprises, but as complements for private firms, in promoting growth.

## Journal publications

### 4. Valid post-selection inference in model-free linear regression

##### [ Abstract ] [ Paper ] [ Published version ]

Modern data-driven approaches to modeling make extensive use of covariate/model selection. Such selection incurs a cost: it invalidates classical statistical inference. A conservative remedy to the problem was proposed by Berk et al. (2013) and further extended by Bachoc et al. (2016). These proposals, labeled “PoSI methods”, provide valid inference after arbitrary model selection. They are computationally NP-hard and have certain limitations in their theoretical justifications. We therefore propose computationally efficient PoSI confidence regions and prove large-$p$ asymptotics for them. We do this for linear OLS regression allowing misspecification of the normal linear model, for both fixed and random covariates, and for independent as well as some types of dependent data. We start by proving a general equivalence result for the post-selection inference problem and a simultaneous inference problem in a setting that strips inessential features still present in a related result of Berk et al. (2013). We then construct valid PoSI confidence regions that are the first to have vastly improved computational efficiency in that the required computation times grow only quadratically rather than exponentially with the total number $p$ of covariates. These are also the first PoSI confidence regions with guaranteed asymptotic validity when the total number of covariates $p$ diverges (almost exponentially) with the sample size $n$. Under standard tail assumptions, we only require $(\log p)^7 = o(n)$ and $k = o(\sqrt{n/\log p})$ where $k (\le p)$ is the largest number of covariates (model size) considered for selection. We study various properties of these confidence regions, including their Lebesgue measures, and compare them (theoretically) with those proposed previously.

### 5. Statistical theory powering data science

##### [ Abstract ] [ Paper ] [ Published version ]

Statisticians are finding their place in the emerging field of data science. However, many issues considered “new” in data science have long histories in statistics. Examples of using statistical thinking are illustrated, which range from exploratory data analysis to measuring uncertainty to accommodating nonrandom samples. These examples are then applied to service networks, baseball predictions and official statistics.

## Preprints

### 8. Nonparametric empirical Bayes estimation and testing for sparse and heteroscedastic signals

##### [ Abstract ] [ arXiv ]

Large-scale modern data often involves estimation and testing for high-dimensional unknown parameters. It is desirable to identify the sparse signals, “the needles in the haystack”, with accuracy and false discovery control. However, the unprecedented complexity and heterogeneity in modern data structure require new machine learning tools to effectively exploit commonalities and to robustly adjust for both sparsity and heterogeneity. In addition, estimates for high-dimensional parameters often lack uncertainty quantification. In this paper, we propose a novel Spike-and-Nonparametric mixture prior (SNP): a spike to promote sparsity and a nonparametric structure to capture signals. In contrast to state-of-the-art methods, the proposed method solves the estimation and testing problems at once, with several merits: 1) accurate sparsity estimation; 2) point estimates with a shrinkage/soft-thresholding property; 3) credible intervals for uncertainty quantification; 4) an optimal multiple testing procedure that controls the false discovery rate. Our method exhibits promising empirical performance on both simulated data and a gene expression case study.

### 9. Centralization or decentralization? The evolution of state-ownership in China

##### [ Abstract ] [ Paper ]

In this paper, we anatomize the state sector and its role in the Chinese economy. We propose a measure of Chinese SOEs (and partial SOEs) based on firm-to-firm equity investment relationships. We are the first to identify all SOEs among the more than 40 million registered Chinese firms. Our measure captures a significantly larger number of SOEs than the existing measure. By 2017, the aggregated capital of all (partial) SOEs had climbed to 85%, and the total state capital in all SOEs had increased to 31%, both as shares of total capital in the economy. State ownership shows parallel trends of decentralization (authoritarian hierarchy) and indirect control (ownership hierarchy) over time. In addition, we find that mixed ownership is associated with higher firm growth and performance, while hierarchical distance to governments is associated with better firm performance but lower growth. Drawing a stark distinction between SOEs and privately-owned enterprises (POEs) could lead to misperceptions of the role of state ownership in the Chinese economy.

### 11. All of Linear Regression

##### [ Abstract ] [ Paper ] [ arXiv ]

Least squares linear regression is one of the oldest and most widely used data analysis tools. Although the theoretical analysis of the ordinary least squares (OLS) estimator is as old, several fundamental questions are yet to be answered. Suppose regression observations $(X_1,Y_1),\ldots,(X_n,Y_n)$ (not necessarily independent) are available. Some of the questions we deal with are as follows: Under what conditions does the OLS estimator converge, and what is the limit? What happens if the dimension is allowed to grow with $n$? What happens if the observations are dependent, with dependence possibly strengthening with $n$? How can we do statistical inference under these kinds of misspecification? What happens to the OLS estimator under variable selection? How can we do inference under misspecification and variable selection? We answer all the questions raised above with one simple deterministic inequality which holds for any set of observations and any sample size. This implies that all our results are finite sample (non-asymptotic) in nature. In the end, one only needs to bound certain random quantities under specific settings of interest to get concrete rates, and we derive these bounds for the case of independent observations. In particular, the problem of inference after variable selection is studied, for the first time, when $d$, the number of covariates, increases (almost exponentially) with the sample size $n$. We provide comments on the “right” statistic to consider for inference under variable selection and on efficient computation of quantiles.

## Working papers

### 12. Valid post-selection inference for the average treatment effect with covariate adjustment in randomized experiments

##### [ Abstract ]

Randomized experiments are fundamental tools for evaluating treatment effects in many fields. Prior to treatment assignment, baseline covariates are often collected and can be incorporated into the analysis to improve estimation efficiency. The efficiency gain from covariate adjustment might encourage attempts to hunt for covariates that maximize the efficiency of the treatment effect estimate. Such “significance hunting” can invalidate statistical inference due to the data-dependent selection. Fortunately, randomization provides an exception: we show that, within a class of unbiased estimators of the average treatment effect, inference remains valid after selecting the estimator with minimum variance, provided that a consistent standard error is used. We adopt a model-free approach that imposes no parametric outcome model and depends solely on the randomization in treatment assignment.

### 13. Generalized Cp and a predictive model selection test in assumption-lean framework

##### [ Abstract ]

The classical methods of variable selection based on estimates of the out-of-sample prediction risk are designed under the Gauss-Markov model and thus are not justifiable under misspecification. The customary elbow rule based on the scree plot can be misleading, and a formal testing procedure with accompanying confidence intervals is more desirable. We propose a model-free analog of Cp, generalized Cp (GCp), and a predictive model selection test based on GCp. This estimator can be shown to be asymptotically equivalent to the test error based on an independent sample and also to the leave-one-out cross-validation estimator of the out-of-sample prediction risk. We are currently investigating the optimality and other properties of the model selection test.

### 14. Computation of PoSI statistics

##### [ Abstract ]

The use of covariate selection in modern data-driven modelling invalidates classical statistical inference. The "PoSI methods" of Berk et al. (2013) and Bachoc et al. (2016) provide valid inference after arbitrary model selection but are computationally inefficient because they involve inference simultaneously over all models. Even in the linear regression problem, the number of operations required is $O(p2^{p-1})$, which is prohibitive for large $p$. We propose a continuum relaxation of the PoSI statistic. This relaxation allows the use of various maximization algorithms for functions on a continuous convex set, which require computation at most logarithmic in the total number of models, with guaranteed approximation error bounds.
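To give a feel for where a count of order $p2^{p-1}$ can come from, the sketch below enumerates every non-empty submodel of $p$ covariates and counts one statistic per covariate per submodel containing it. This reading of the operation count is an assumption made for illustration, not something the abstract states, and the function name `count_posi_statistics` is hypothetical.

```python
from itertools import combinations


def count_posi_statistics(p):
    """Count (covariate, submodel) pairs by brute-force enumeration."""
    total = 0
    for size in range(1, p + 1):
        for model in combinations(range(p), size):
            total += len(model)  # one statistic per covariate in this submodel
    return total


# The brute-force count agrees with the closed form p * 2^(p-1).
for p in (3, 5, 10):
    assert count_posi_statistics(p) == p * 2 ** (p - 1)
    print(p, count_posi_statistics(p))
```

The enumeration itself becomes infeasible quickly as $p$ grows, which is the motivation the abstract gives for a relaxation that avoids visiting every model.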

### 15. Common versus idiosyncratic risk

##### [ Abstract ]

It is of great interest to dissect the driving forces of common movements, or co-movement, among correlated objects, such as asset prices and product sales. Two models are commonly used to explain co-movement: a common factor model and a network model. However, no existing literature examines the relative importance of these two mechanisms simultaneously. We develop a flexible model incorporating both common factors and networks. We investigate conditions under which the common factors and the network effects can be simultaneously identified. Applying our model to asset pricing, we evaluate the relative importance of the two mechanisms in the co-movement of asset returns.

### 16. Self-reported Chinese company data: Can it be trusted?

##### [ Abstract ]

The Annual Industrial Survey (AIS), dubbed the "census data", has been used as the gold-standard source for empirical firm-level economic and operational research. It covers a long time span (from as early as 1992) and provides rich information, including identification information, stocks, and flows. However, its self-reported nature casts doubt on its credibility. The goal of this paper is to determine the reliability of AIS by comparing it with Orbis, another firm-level data source that has the largest collection of firms with detailed ownership information. Firms' ownership is of particular interest to researchers and serves as one of the most important control variables in their analyses. We therefore examine the disparities in ownership type, namely state-owned, privately-owned or foreign, between AIS and Orbis. Among the firms that have ownership information in both sources, the matching rate of ownership information is as high as 90%, which proves the credibility of AIS ownership information. Careful comparisons of several control variables between the cohort of matched firms and the AIS general population show that there is no systematic bias in the matched cohort.