
Publications

Journal publications

1. State ownership in China: An equity network perspective

Junhui Cai, Xian Gu, Linda Zhao, Wu Zhu (2023)
The Arc of the Chinese Economy, edited by Hanming Fang and Marshall Meyer

State ownership is the pillar of China’s economy. One cannot understand China’s economy without understanding state ownership. Existing measures of state-owned enterprises (SOEs), largely self-reported, are limited to industrial firms covered by the Annual Industrial Survey (AIS). We provide a new lens by constructing a novel dynamic equity ownership network of all 40 million registered firms in China. Based on the network, we propose a new dynamic SOE metric. Our analysis reveals systematic and large-scale discrepancies between our method and the existing measures, with ours identifying a notably larger pool of SOEs. By the end of 2017, state capital had increased to 31% among all the in-network firms, while the total capital of all SOEs, including partial SOEs, had climbed to 85%. Our findings suggest that state ownership exhibits both decentralization and indirect control trends over time, offering new insights for future research.

2. Hierarchical vintage sparse PCA. Discussion on the paper by Rohe and Zeng

Junhui Cai, Dan Yang, Wu Zhu, Linda Zhao (2023)
Journal of the Royal Statistical Society. Series B: Statistical Methodology
[ Paper ] [ Published version ]

3. Practical issues concerning assumption-lean inference for generalized linear models. Discussion on the paper by Vansteelandt and Dukes

Elizabeth Ogburn, Junhui Cai, Arun Kumar Kuchibhotla, Richard Berk, Andreas Buja (2021)
Journal of the Royal Statistical Society. Series B: Statistical Methodology
[ Paper ] [ Published version ]

4. Valid post-selection inference in model-free linear regression

Arun Kumar Kuchibhotla, Lawrence D. Brown, Andreas Buja, Junhui Cai, Edward I. George, Linda Zhao (2019)
Annals of Statistics
[ Paper ] [ Published version ]

Modern data-driven approaches to modeling make extensive use of covariate/model selection. Such selection incurs a cost: it invalidates classical statistical inference. A conservative remedy to the problem was proposed by Berk et al. (2013) and further extended by Bachoc et al. (2016). These proposals, labeled "PoSI methods", provide valid inference after arbitrary model selection. They are computationally NP-hard and have certain limitations in their theoretical justifications. We therefore propose computationally efficient PoSI confidence regions and prove large-$p$ asymptotics for them. We do this for linear OLS regression allowing misspecification of the normal linear model, for both fixed and random covariates, and for independent as well as some types of dependent data. We start by proving a general equivalence result for the post-selection inference problem and a simultaneous inference problem in a setting that strips inessential features still present in a related result of Berk et al. (2013). We then construct valid PoSI confidence regions that are the first to have vastly improved computational efficiency in that the required computation times grow only quadratically rather than exponentially with the total number $p$ of covariates. These are also the first PoSI confidence regions with guaranteed asymptotic validity when the total number of covariates $p$ diverges (almost exponentially) with the sample size $n$. Under standard tail assumptions, we only require $(\log p)^7 = o(n)$ and $k = o(\sqrt{n/\log p})$ where $k (\le p)$ is the largest number of covariates (model size) considered for selection. We study various properties of these confidence regions, including their Lebesgue measures, and compare them (theoretically) with those proposed previously.

5. Statistical theory powering data science

Junhui Cai, Avishai Mandelbaum, Chaitra H Nagaraja, Haipeng Shen, Linda Zhao (2019)
Statistical Science
[ Paper ] [ Published version ]

Statisticians are finding their place in the emerging field of data science. However, many issues considered “new” in data science have long histories in statistics. Examples of using statistical thinking are illustrated, which range from exploratory data analysis to measuring uncertainty to accommodating nonrandom samples. These examples are then applied to service networks, baseball predictions and official statistics.

Preprints

6. Doubly high-dimensional contextual bandits: An interpretable model with applications to assortment/pricing

Junhui Cai, Ran Chen, Martin Wainwright, Linda Zhao (2023)
Submitted to Management Science
[ SSRN ]

Key challenges in running a retail business include how to select products to present to consumers (the assortment problem), and how to price products (the pricing problem) to maximize revenue or profit. Instead of considering these problems in isolation, we propose a joint approach to assortment-pricing based on contextual bandits. Our model is doubly high-dimensional, in that both context vectors and actions are allowed to take values in high-dimensional spaces. In order to circumvent the curse of dimensionality, we propose a simple yet flexible model that captures the interactions between covariates and actions via a (near) low-rank representation matrix. The resulting class of models is reasonably expressive while remaining interpretable through latent factors, and includes various structured linear bandit and pricing models as particular cases. We propose a computationally tractable procedure that combines an exploration/exploitation protocol with an efficient low-rank matrix estimator, and we prove bounds on its regret. Simulation results show that this method has lower regret than state-of-the-art methods applied to various standard bandit and pricing models. We also illustrate the gains achievable using our method through two case studies on real-world assortment-pricing problems, one for an industry-leading instant noodles company and one for a smaller beauty start-up. In each case, we show both the gains in revenue achievable by our bandit methods and the interpretability of the latent factor models that are learned.

7. Network regression and supervised centrality estimation

Junhui Cai, Dan Yang, Wu Zhu, Haipeng Shen, Linda Zhao (2023)
R&R at Journal of the American Statistical Association
[ Paper ] [ arXiv ]

Networks are ubiquitous and play a crucial role in our lives. The position of an agent in the network, usually captured by the “centrality”, has implications for the agent’s behaviour and serves as an important intermediary of network effects. Therefore, the centrality is often incorporated in regression models to elucidate the network effect on an outcome variable of interest. In empirical studies, researchers often adopt a two-stage procedure to estimate the centrality and to infer the network effect: they first estimate the centrality from the observed network and then employ the estimated centrality in the regression for estimation and inference. Despite its prevalent adoption, this naive two-stage procedure lacks theoretical backing and can fail in both estimation and inference. We therefore propose a unified framework that combines a network model and a network regression model, under which we prove the shortcomings of the two-stage procedure in centrality estimation and the undesirable consequences in the network regression. We then propose a novel supervised network centrality estimation (SuperCENT) methodology that simultaneously combines the information from the two models. SuperCENT dominates the two-stage procedure in the estimation of the centrality and the true underlying network universally. In addition, SuperCENT yields superior estimation of the network effect and provides valid and narrower confidence intervals than those from the two-stage procedure. We apply our method to predict the currency risk premium based on the global trade network. We show that a trading strategy based on SuperCENT centrality estimates yields a return three times as high as the two-stage method, and the inference drawn by SuperCENT verifies an economic theory via rigorous statistical testing while the two-stage procedure cannot.
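
For orientation, the naive two-stage procedure mentioned in the abstract can be sketched roughly as below. This is not the authors' SuperCENT code. Using the leading singular vectors of the adjacency matrix as centralities, the function name `two_stage`, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def two_stage(A, y):
    """Naive two-stage procedure: estimate centralities from A, then regress y on them."""
    n = A.shape[0]
    # Stage 1: leading left/right singular vectors of the observed network,
    # rescaled to norm sqrt(n). Treating singular vectors as centralities is
    # an assumption of this sketch.
    U, _, Vt = np.linalg.svd(A)
    u_hat = U[:, 0] * np.sqrt(n)
    v_hat = Vt[0, :] * np.sqrt(n)
    if u_hat.sum() < 0:  # singular vectors are defined only up to a joint sign flip
        u_hat, v_hat = -u_hat, -v_hat
    # Stage 2: OLS of the outcome on an intercept and the estimated centralities.
    X = np.column_stack([np.ones(n), u_hat, v_hat])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return u_hat, v_hat, beta_hat


# Toy data (all values hypothetical): a rank-one network observed with small noise.
rng = np.random.default_rng(0)
n = 200
u = rng.uniform(0.5, 1.5, n)
v = rng.uniform(0.5, 1.5, n)
u *= np.sqrt(n) / np.linalg.norm(u)
v *= np.sqrt(n) / np.linalg.norm(v)
A = np.outer(u, v) + 0.01 * rng.standard_normal((n, n))
y = 1.0 + 2.0 * u + 3.0 * v + 0.01 * rng.standard_normal(n)

u_hat, v_hat, beta_hat = two_stage(A, y)
```

In this toy setting the network noise is deliberately small, so the second-stage coefficients land close to the simulated values. The abstract's concern is with settings where the estimated centrality carries enough error to distort the regression.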

8. Ownership network and firm growth: What do forty million companies tell about the Chinese economy?

Franklin Allen, Junhui Cai, Xian Gu, Jun Qian, Linda Zhao, Wu Zhu (2023)
In submission to Journal of Financial Economics
China Financial Research Conference (CFRC) 2021 Best Paper Award (3 out of 534 papers).
[ Paper ] [ SSRN ]

The finance–growth nexus has been a central question in understanding the unprecedented success of the Chinese economy. Using unique data on all the registered firms in China, we build extensive firm-to-firm equity ownership networks. Entering a network and increasing network centrality lead to higher firm growth, and the effect of global centralities strengthens over time. The RMB 4 trillion stimulus launched by the Chinese government in 2008 partially “crowded out” the positive network effects. Equity ownership networks and bank credit tend to act as substitutes for state-owned enterprises, but as complements for private firms in promoting growth.

9. Centralization or decentralization? The evolution of state-ownership in China

Franklin Allen, Junhui Cai, Xian Gu, Jun Qian, Linda Zhao, Wu Zhu (2023)
China International Conference in Finance (CICF) 2021 XiYue Best Paper Award (2 out of 2065 papers).
[ Paper ] [ SSRN ] [ VoxChina ]

In this paper, we anatomize the state sector and its role in the Chinese economy. We propose a measure of Chinese SOEs (and partial SOEs) based on firm-to-firm equity investment relationships. We are the first to identify all SOEs among the more than 40 million registered Chinese firms. Our measure captures a significantly larger number of SOEs than the existing measure. By 2017, the aggregated capital of all (partial) SOEs had climbed to 85% of total capital in the economy, and the total state capital in all SOEs had increased to 31%. State ownership shows parallel trends of decentralization (authoritarian hierarchy) and indirect control (ownership hierarchy) over time. In addition, we find that mixed ownership is associated with higher firm growth and performance, while hierarchical distance to governments is associated with better firm performance but lower growth. Drawing a stark distinction between SOEs and privately-owned enterprises (POEs) could lead to misperceptions of the role of state ownership in the Chinese economy.

10. Personalized reinforcement learning: With applications to recommender system

Junhui Cai, Ran Chen, Martin Wainwright, Linda Zhao (2023)

Reinforcement learning (RL) has achieved remarkable success across various domains; however, its applicability is often hampered by challenges in practicality and interpretability. Many real-world applications, such as in healthcare and business settings, have large and/or continuous state and action spaces and demand personalized solutions. In addition, the interpretability of the model is crucial to decision-makers so as to guide their decision-making process while incorporating their domain knowledge. To bridge this gap, we propose a personalized reinforcement learning framework that integrates personalized information into the state-transition and reward-generating mechanisms. We develop an online RL algorithm for our framework. Specifically, our algorithm learns the embeddings of the personalized state-transition distribution in a Reproducing Kernel Hilbert Space (RKHS) by balancing the exploitation-exploration trade-off. We further provide the regret bound of the algorithm and demonstrate its effectiveness in recommender systems.

11. Optimal assortment and pricing via generalized MNL models with Poisson arrival

Junhui Cai, Ran Chen, Qitao Huang, Martin Wainwright, Linda Zhao, Wu Zhu (2023)

The Multinomial Logit (MNL) / discrete choice model is a traditional model for lightweight assortment problems. However, it fails to consider the dynamic nature of customer arrivals, which significantly impacts practical decision-making based on real-time periods like weeks, months, and years. To bridge this gap, we extend the MNL model by incorporating a Poisson distribution, where the arrival rate is dependent on the current assortment and pricing, thus modeling customer arrival patterns over time. The key challenge lies in balancing popularity in the Poisson model, which attracts more customers, against profitability in the MNL model, which increases conversion and revenue for each visit, so as to maximize the cumulative reward over a real-time period. We propose an efficient algorithm to jointly solve the Poisson and MNL models. We provide a non-asymptotic bound for regret and show that our algorithm is optimal up to log factors.
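
As a rough illustration of the trade-off described above, the sketch below computes the expected revenue per period as the Poisson mean number of arrivals times the expected MNL revenue per visit. It is not the authors' algorithm. The function names, the outside option with utility 0, and the treatment of the arrival rate as a given number (rather than a function of assortment and prices) are assumptions of this sketch.

```python
import numpy as np


def mnl_choice_probs(utilities):
    """MNL purchase probabilities, assuming an outside (no-purchase) option with utility 0."""
    expu = np.exp(np.asarray(utilities, dtype=float))
    return expu / (1.0 + expu.sum())


def expected_period_revenue(arrival_rate, utilities, prices):
    """Expected revenue per period: mean Poisson arrivals times expected revenue per visit."""
    probs = mnl_choice_probs(utilities)
    revenue_per_visit = float(np.dot(probs, np.asarray(prices, dtype=float)))
    return arrival_rate * revenue_per_visit


# Hypothetical example: two products with equal utility, 10 expected visits per period.
probs = mnl_choice_probs([0.0, 0.0])
revenue = expected_period_revenue(10.0, [0.0, 0.0], [3.0, 6.0])
```

Raising prices increases revenue per purchase but, in a model where utilities and the arrival rate fall with price, lowers both conversion and traffic. That is the tension the abstract refers to.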

12. Towards a holistic representation of online customer journeys: A tensor-based framework

Xinyuan Zhang, Junhui Cai, Jingjing Li, Ahmed Abbassi (2023)

Understanding online user journeys has become crucial for explaining and predicting digital behavior. Existing methodologies often rely on principled feature engineering; the resulting features, while successful in predicting and interpreting customer journeys, are constructed artificially and thus present certain limitations. A more holistic and parsimonious framework is needed to fully comprehend omni-channel customer journeys. In this paper, we propose a tensor-based framework to capture users' digital channel interactions over time. We represent customer journeys through the lens of a three-dimensional user-channel-time tensor. We adopt tensor decomposition to extract interpretable latent factors. These factors capture digital trace patterns with explanatory and predictive power. For prediction, we incorporate the tensor into a deep learning architecture to learn the nonlinear and temporal convolutional patterns in customers' journeys. We evaluate our framework on 24 million raw user clickstreams and show that our methodology not only enhances our understanding of customer decision-making processes in purchases, but also significantly improves conversion prediction.

13. Nonparametric empirical Bayes estimation and testing for sparse and heteroscedastic signals

Junhui Cai, Xu Han, Ya'acov Ritov, Linda Zhao (2021)
arXiv:2106.08881
[ arXiv ]

Large-scale modern data often involves estimation and testing for high-dimensional unknown parameters. It is desirable to identify the sparse signals, "the needles in the haystack", with accuracy and false discovery control. However, the unprecedented complexity and heterogeneity in modern data structure require new machine learning tools to effectively exploit commonalities and to robustly adjust for both sparsity and heterogeneity. In addition, estimates for high-dimensional parameters often lack uncertainty quantification. In this paper, we propose a novel Spike-and-Nonparametric mixture prior (SNP): a spike to promote the sparsity and a nonparametric structure to capture signals. In contrast to the state-of-the-art methods, the proposed methods solve the estimation and testing problem at once with several merits: 1) an accurate sparsity estimation; 2) point estimates with shrinkage/soft-thresholding property; 3) credible intervals for uncertainty quantification; 4) an optimal multiple testing procedure that controls false discovery rate. Our method exhibits promising empirical performance on both simulated data and a gene expression case study.

14. Microscopic dynamics of equity ownership networks in China

Junhui Cai, Xian Gu, Linda Zhao, Wu Zhu (2021)

15. All of Linear Regression

Arun Kumar Kuchibhotla, Lawrence D. Brown, Andreas Buja, Junhui Cai (2019)
arXiv:1910.06386
[ Paper ] [ arXiv ]

Least squares linear regression is one of the oldest and most widely used data analysis tools. Although the theoretical analysis of the ordinary least squares (OLS) estimator is as old, several fundamental questions are yet to be answered. Suppose regression observations $(X_1,Y_1),...,(X_n,Y_n)$ (not necessarily independent) are available. Some of the questions we deal with are as follows: Under what conditions does the OLS estimator converge and what is the limit? What happens if the dimension is allowed to grow with $n$? What happens if the observations are dependent with dependence possibly strengthening with $n$? How to do statistical inference under these kinds of misspecification? What happens to the OLS estimator under variable selection? How to do inference under misspecification and variable selection? We answer all the questions raised above with one simple deterministic inequality which holds for any set of observations and any sample size. This implies that all our results are finite sample (non-asymptotic) in nature. At the end, one only needs to bound certain random quantities under specific settings of interest to get concrete rates and we derive these bounds for the case of independent observations. In particular, the problem of inference after variable selection is studied, for the first time, when $d$, the number of covariates, increases (almost exponentially) with sample size $n$. We provide comments on the "right" statistic to consider for inference under variable selection and efficient computation of quantiles.

Working papers

16. Ownership structure in China's real estate sector

Junhui Cai, Xian Gu, Wu Zhu, Linda Zhao (2021)

17. Valid post-selection inference for the average treatment effect with covariate adjustment in randomized experiments

Junhui Cai, Arun Kumar Kuchibhotla, Linda Zhao (2020)

Randomized experiments are the fundamental tools to evaluate the treatment effect in many fields. Prior to the treatment assignment, the baseline covariates are often collected and can be incorporated into the analysis to improve the estimation efficiency. The efficiency gain from covariate adjustment might encourage attempts to hunt for covariates that maximize the efficiency of the treatment effect estimate. This kind of "significance hunting" can invalidate statistical inference due to the data-dependent selection. Luckily, randomization provides an exception: we show that under a class of unbiased estimators of the average treatment effect, the inference remains valid after selecting the estimator with minimum variance, provided a consistent standard error is available. We adopt a model-free approach that imposes no parametric outcome model and depends solely on the randomization in treatment assignment.

18. Generalized Cp and a predictive model selection test in assumption-lean framework

Junhui Cai, Lawrence D. Brown, Arun Kumar Kuchibhotla, Linda Zhao (2020)

The classical methods of variable selection based on the estimate of the out-of-sample prediction risk are designed under the Gauss-Markov model and thus are not justifiable under misspecification. The customary elbow rule based on the scree plot can be misleading, and a formal testing procedure accompanied by confidence intervals would be more desirable. We propose a model-free analog of Cp, generalized Cp (GCp), and a predictive model selection test based on GCp. This estimator can be shown to be asymptotically equivalent to the testing error based on an independent sample and is also asymptotically equivalent to the leave-one-out cross-validation estimator of the out-of-sample prediction risk. We are currently pursuing the optimality and properties of the model selection test.

19. Computation of PoSI statistics

Arun Kumar Kuchibhotla, Junhui Cai (2020)

The use of covariate selection in modern data-driven modelling invalidates classical statistical inference. The "PoSI methods" of Berk et al. (2013) and Bachoc et al. (2016) provide valid inference after arbitrary model selection but are computationally inefficient because they involve inference simultaneously over all models. Even in the linear regression problem, the number of operations required therein is $O(p2^{p-1})$, which is prohibitive for large $p$. We propose a continuum relaxation of the PoSI statistic. This relaxation allows the use of various maximization algorithms for functions on a continuous convex set, which require a number of operations at most logarithmic in the total number of models, with guaranteed approximation error bounds provided.
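
One way to read the $p2^{p-1}$ count is as the number of (submodel, coefficient) pairs over all nonempty subsets of $p$ covariates, since $\sum_{k=1}^{p} k\binom{p}{k} = p2^{p-1}$. The brute-force enumeration below checks that identity for small $p$. The function name is hypothetical, and this interpretation of the operation count is an assumption rather than something stated in the abstract.

```python
from itertools import combinations


def count_model_coefficient_pairs(p):
    """Count (submodel, coefficient) pairs over all nonempty subsets of p covariates."""
    total = 0
    for k in range(1, p + 1):
        for model in combinations(range(p), k):
            total += len(model)  # one pair per coefficient in this submodel
    return total


# Compare the brute-force count with the closed form p * 2**(p - 1) for small p.
counts = {p: count_model_coefficient_pairs(p) for p in range(1, 11)}
```

Because the enumeration itself visits every submodel, it becomes infeasible quickly as $p$ grows, which is the practical obstacle the abstract describes.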

20. Common versus idiosyncratic risk

Junhui Cai, Wu Zhu, Linda Zhao (2020)

It is of great interest to dissect the driving forces of common movements, or co-movement, among correlated objects, such as asset prices and product sales. Two models are commonly used to explain the co-movement: a common factor model and a network model. However, no existing literature simultaneously examines the relative importance of these two mechanisms. We develop a flexible model incorporating both common factors and networks. We investigate conditions under which the common factors and the network effects can be simultaneously identified. Applying our model to asset pricing, we evaluate the relative importance of the two mechanisms in the co-movement of asset returns.

21. Self-reported Chinese company data: Can it be trusted?

Junhui Cai, Edward Cai, Ann Harrison, Marshall Meyer, Linda Zhao, Minyuan Zhao (2018)

The Annual Industrial Survey (AIS), dubbed the "census data", has been used as the golden source for empirical firm-level economic and operational research. It covers a long time span (starting as early as 1992) and provides rich information including identification information, stocks, and flows. However, its self-reported nature casts doubt on its credibility. The goal of this paper is to determine the reliability of AIS by comparing it with Orbis, another firm-level data source that has the largest collection of firms with detailed ownership information. Firm ownership is of particular interest to researchers and serves as one of the most important control variables in their analyses. We therefore examine the disparities in ownership type (state-owned, privately owned, or foreign) between AIS and Orbis. Among the firms that have ownership information on both sides, the matching rate of ownership information is as high as 90%, which supports the credibility of AIS ownership information. Careful comparisons of several control variables between the cohort of matched firms and the AIS general population show there is no systematic bias in the matched cohort.

Non-refereed publications

22. 2021 IMS membership survey: report

Junhui Cai, Nicole Pashley, Linda Zhao (2021)
IMS Bulletin
[ Paper ]

23. Project on glacier recognition with SIFT and CNN

Junhui Cai (2015)
The Arctic Explorers from S.H. Ho College
[ Paper ]