NeurIPS Poster Coresets for Decision Trees of Signals

Poster

Coresets for Decision Trees of Signals

Ibrahim Jubran · Ernesto Evgeniy Sanches Shayda · Ilan I Newman · Dan Feldman


Abstract: A $k$-decision tree $t$ (or $k$-tree) is a recursive partition of a matrix (2D-signal) into $k\geq 1$ block matrices (axis-parallel rectangles, leaves), where each rectangle is assigned a real label. Its regression or classification loss to a given matrix $D$ of $N$ entries (labels) is the sum of squared differences between every label in $D$ and the label assigned to it by $t$. Given an error parameter $\varepsilon\in(0,1)$, a $(k,\varepsilon)$-coreset $C$ of $D$ is a small summarization that provably approximates this loss to every such tree, up to a multiplicative factor of $1\pm\varepsilon$. In particular, the optimal $k$-tree of $C$ is a $(1+\varepsilon)$-approximation to the optimal $k$-tree of $D$. We provide the first algorithm that outputs such a $(k,\varepsilon)$-coreset for every such matrix $D$. The size $|C|$ of the coreset is polynomial in $k\log(N)/\varepsilon$, and its construction takes $O(Nk)$ time. This is achieved by forging a link between decision trees from machine learning and partition trees in computational geometry. Experimental results on sklearn and lightGBM show that applying our coresets to real-world datasets speeds up the computation of random forests and their parameter tuning by up to $10\times$, while keeping similar accuracy. Full open source code is provided.
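To make the loss defined in the abstract concrete, here is a minimal sketch in Python that evaluates the sum of squared differences between a small matrix $D$ and a given $2$-tree. The function name `tree_loss` and the leaf encoding (half-open row and column ranges plus a label) are assumptions made for illustration; they are not taken from the authors' released code, and the sketch does not construct a coreset.

```python
import numpy as np


def tree_loss(D, leaves):
    """Sum of squared differences between the entries of D and their leaf labels.

    `leaves` is a list of (row_start, row_end, col_start, col_end, label)
    tuples with half-open ranges. The rectangles are assumed to partition D;
    this hypothetical helper does not verify that.
    """
    total = 0.0
    for r0, r1, c0, c1, label in leaves:
        block = D[r0:r1, c0:c1]
        total += float(np.sum((block - label) ** 2))
    return total


# A 2 x 3 signal with N = 6 entries.
D = np.array([[1.0, 1.0, 5.0],
              [1.0, 3.0, 5.0]])

# A 2-tree: one vertical split after the second column.
# Each leaf is labeled with the mean of its block, which minimizes
# the squared loss within that leaf.
leaves = [
    (0, 2, 0, 2, 1.5),  # left block [[1, 1], [1, 3]], mean 1.5
    (0, 2, 2, 3, 5.0),  # right block [[5], [5]], mean 5.0
]

loss = tree_loss(D, leaves)
print(loss)  # 3.0
```

Under the definition in the abstract, a $(k,\varepsilon)$-coreset $C$ would let this loss be estimated from $C$ alone, within a factor of $1\pm\varepsilon$, for every $k$-tree.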
