Clustering Information Distance by Nested-Lattice Compression

By: Su Gao, Luhai Fan

Format: Article
Published: IEEE 2019-01-01

Description

Clustering of mixed datasets is a key issue in complex system architecture decomposition and many other pattern recognition applications. Recent developments in machine learning, especially deep neural network models, have significantly boosted clustering performance, yet theoretical analysis explaining why these methods work and how to improve them further remains limited in the literature. From an information-theoretic point of view, the information distance based on Kolmogorov complexity captures non-feature similarities and serves as an absolute information measure. It is therefore valuable to model the clustering of mixed discrete and continuous datasets with this generalized similarity measure, and to investigate its practical implementation, the compression distance, so that it works within machine-learning-enabled clustering schemes. In this paper, a generative adversarial network (GAN)-based deep clustering scheme is modified to compare the similarities of the original data, reducing the deviation of projected features from real data characteristics. The compression distance is extended to be computable for continuous data with the assistance of the Wyner-Ziv setup, where the theoretical limit is given via asymptotically optimal nested lattices. A flatness-based analysis is carried out, and a distance upper bound is derived for a non-vanishing flatness factor. A practical distance calculation is then derived for incorporation into neural-network-based clustering. Simulations with practical lattice codes illustrate the consistency of the design and show its potential effectiveness in clustering applications.
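
As general background (from the compression-distance literature, not specific to this article), the information distance E(x, y) = max{K(x|y), K(y|x)} is uncomputable, so in practice it is approximated by the normalized compression distance of a real compressor. A minimal sketch, using Python's bz2 purely as a stand-in compressor:

    import bz2

    def ncd(x: bytes, y: bytes) -> float:
        """Normalized compression distance: a computable proxy for the
        Kolmogorov-complexity-based information distance, where the
        compressed length C(.) stands in for complexity K(.)."""
        cx = len(bz2.compress(x))
        cy = len(bz2.compress(y))
        cxy = len(bz2.compress(x + y))  # C(xy): joint compressed length
        return (cxy - min(cx, cy)) / max(cx, cy)

    # Similar strings compress well together, so their NCD is small.
    print(ncd(b"abcabcabc" * 100, b"abcabcabd" * 100))
    print(ncd(b"abcabcabc" * 100, b"zqxwvu" * 150))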
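To illustrate the nested-lattice idea behind the Wyner-Ziv extension, here is a one-dimensional scalar sketch; the article's analysis relies on high-dimensional, asymptotically optimal lattices and the flatness factor, and the function names below are hypothetical. The encoder quantizes a continuous sample with a fine lattice and transmits only its coset modulo a coarse lattice; the decoder resolves the coset using correlated side information:

    import numpy as np

    def nested_lattice_encode(x, q_fine, nesting_ratio):
        """Quantize x on the fine lattice (step q_fine), then reduce
        modulo the coarse lattice (step q_fine * nesting_ratio); only
        the coset value inside one coarse cell is transmitted."""
        q_coarse = q_fine * nesting_ratio
        fine_point = q_fine * np.round(x / q_fine)
        return np.mod(fine_point, q_coarse)

    def nested_lattice_decode(coset, y, q_fine, nesting_ratio):
        """Pick the coarse-lattice translate of the coset closest to the
        side information y; this recovers the fine-lattice point when
        |y - fine_point| is within half a coarse cell."""
        q_coarse = q_fine * nesting_ratio
        return coset + q_coarse * np.round((y - coset) / q_coarse)

    # Correlated side information y ~ x lets the decoder recover the
    # fine quantization of x from the short coset description alone.
    x, y = 7.7, 7.6
    coset = nested_lattice_encode(x, q_fine=0.5, nesting_ratio=8)
    print(nested_lattice_decode(coset, y, q_fine=0.5, nesting_ratio=8))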