Abstract
Classification with high-dimensional data is of widespread interest and often involves dealing with imbalanced data. Bayesian classification approaches are hampered by the fact that current Markov chain Monte Carlo algorithms for posterior computation become inefficient as the number p of predictors or the number n of subjects to classify gets large, because of the increasing computational time per step and worsening mixing rates. One strategy is to employ a gradient-based sampler to improve mixing while using data subsamples to reduce the per-step computational complexity. However, the usual subsampling breaks down when applied to imbalanced data. Instead, we generalize piecewise-deterministic Markov chain Monte Carlo algorithms to include importance-weighted and mini-batch subsampling. These maintain the correct stationary distribution with arbitrarily small subsamples and substantially outperform current competitors. We provide theoretical support for the proposed approach and demonstrate its performance gains in simulated data examples and an application to cancer data.
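The importance-weighted subsampling idea described in the abstract can be illustrated outside the piecewise-deterministic setting. The sketch below is not the paper's sampler; it is a minimal, assumed implementation of the underlying estimator: draw mini-batch indices with non-uniform probabilities and reweight each term by the inverse of its inclusion probability, so the batch average is unbiased for the full logistic log-likelihood gradient for any strictly positive probabilities. All function names and the specific choice of weights are illustrative.

```python
import numpy as np

def full_gradient(beta, X, y):
    """Exact gradient of the logistic log-likelihood at beta."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return X.T @ (y - p)

def weighted_subsample_gradient(beta, X, y, probs, batch_size, rng):
    """Importance-weighted mini-batch gradient estimate.

    Indices are drawn with replacement, index i with probability probs[i];
    each per-observation gradient term is reweighted by 1 / probs[i], so
    the batch average is unbiased for the full gradient whenever all
    entries of probs are positive.
    """
    n = X.shape[0]
    idx = rng.choice(n, size=batch_size, p=probs)
    p = 1.0 / (1.0 + np.exp(-X[idx] @ beta))
    terms = (X[idx] * (y[idx] - p)[:, None]) / probs[idx][:, None]
    return terms.mean(axis=0)
```

For imbalanced data, one might assign higher sampling probability to minority-class observations: unbiasedness is preserved for any valid probability vector, while variance drops when the probabilities track the magnitudes of the per-observation gradient terms.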
| Original language | English |
|---|---|
| Pages (from-to) | 1005-1012 |
| Number of pages | 8 |
| Journal | Biometrika |
| Volume | 107 |
| Issue number | 4 |
| DOIs | |
| Publication status | Published - 1 Dec 2020 |
Bibliographical note
Publisher Copyright: © 2020 Biometrika Trust.
Keywords
- Imbalanced data
- Logistic regression
- Piecewise-deterministic Markov process
- Scalable inference
- Subsampling
ASJC Scopus subject areas
- Statistics and Probability
- General Mathematics
- Agricultural and Biological Sciences (miscellaneous)
- General Agricultural and Biological Sciences
- Statistics, Probability and Uncertainty
- Applied Mathematics