Data Synthetization and Feature Analysis: A Study in Bladder Cancer Recurrence Data

AHFE International

Accelerating Open Access Science in Human Factors Engineering and Human-Centered Computing

Open Access

Article

Conference Proceedings

Authors: Sandi Baressi Šegota, Ivan Lorencin, Nikola Anđelić, Vedran Mrzljak, Antun Gršković, Juraj Ahel, Klara Smolić, Dean Markić

Abstract: The application of synthetic data within the biomedical domain is rapidly gaining momentum, driven by the growing need for robust datasets suitable for machine learning (ML) and statistical modeling. In scenarios where access to real patient data is limited due to privacy concerns or scarcity, synthetic data offers an attractive alternative. These artificially generated datasets aim to mimic the statistical characteristics of original data, enabling researchers to conduct exploratory analysis, develop predictive models, or validate findings without compromising patient confidentiality. However, the increasing use of synthetic data raises several methodological and interpretative challenges, particularly regarding the correct sequence and context for applying statistical analyses. One of the central issues identified in contemporary literature concerns the timing of data analysis relative to the synthetic data generation process. Some studies conduct statistical or ML analyses directly on real datasets and use synthetic data for validation or augmentation. Others, conversely, perform all stages of analysis, including feature importance estimation, correlation assessment, and model training, on synthetic data. This inconsistency raises the question of whether statistical analysis conducted solely on synthetic datasets yields reliable insights, or whether it constitutes a methodological flaw. The prevailing assumption is that analysis should ideally be performed on real data to preserve statistical integrity, but empirical evaluation of this notion remains limited. In the current study, the authors address this issue by applying a synthetic data generation method, specifically the Tabular Variational Autoencoder (TVAE), to a biomedical dataset focused on bladder cancer recurrence. This dataset includes various diagnostic variables, and the primary goal is to assess how well synthetic data replicates analytical insights drawn from the original data.
To achieve this, the authors conduct both correlational analysis and machine learning-based feature importance estimation. The results derived from synthetic datasets of varying sizes are then compared to those obtained from the original data. The findings indicate that while synthetic data can approximate general trends observed in the original dataset, there are notable differences depending on the analytical technique employed. In particular, models such as Random Forest appear more sensitive to variations introduced during the synthetization process. This sensitivity manifests as shifts in feature importance rankings and variability in predictive performance, especially when working with smaller synthetic datasets. On the other hand, simpler statistical methods such as correlation coefficients display more stability, suggesting that some analytical approaches may be more robust to data generation artifacts than others. These observations underscore the importance of methodological caution when interpreting results based on synthetic biomedical data. While synthetic datasets hold considerable promise for advancing data-driven research in biomedicine, they are not a one-size-fits-all solution. The sequence in which synthetic data is introduced into the research pipeline, whether before or after statistical analysis, can significantly influence the validity of the findings. As such, researchers must critically assess the suitability of synthetic data for specific analytical tasks and ensure transparency in reporting their methodological choices. Future work should further explore the impact of different generative models and dataset properties on the reliability of synthetic-data-driven insights.
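The comparison described in the abstract, checking whether an analysis run on synthetic data agrees with the same analysis run on the original data, can be illustrated with a minimal sketch. This is not the authors' code: the column names, the toy data generator, and the helper functions (`target_correlations`, `compare`, `make_toy_table`) are all hypothetical, and the "synthetic" tables here are simply fresh draws from the same toy process, used as a stand-in for the output of a generator such as TVAE. The sketch computes each feature's Pearson correlation with the target on both tables, then reports the largest gap and the Spearman rank agreement between the two sets of correlation magnitudes.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation, computed as Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))


def target_correlations(table, target):
    """Correlation of every non-target column with the target column."""
    return {name: pearson(col, table[target])
            for name, col in table.items() if name != target}


def compare(real, synthetic, target):
    """Return (largest absolute correlation gap, rank agreement of |correlation|)."""
    r = target_correlations(real, target)
    s = target_correlations(synthetic, target)
    features = sorted(r)
    max_gap = max(abs(r[f] - s[f]) for f in features)
    agreement = spearman([abs(r[f]) for f in features],
                         [abs(s[f]) for f in features])
    return max_gap, agreement


def make_toy_table(n, seed):
    """Hypothetical toy data: three features, target driven mostly by feature_a."""
    rng = random.Random(seed)
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [rng.gauss(0, 1) for _ in range(n)]
    c = [rng.gauss(0, 1) for _ in range(n)]
    target = [2.0 * x + 0.5 * y + rng.gauss(0, 1) for x, y in zip(a, b)]
    return {"feature_a": a, "feature_b": b, "feature_c": c, "target": target}


real = make_toy_table(200, seed=0)
# Stand-in for generator output at several synthetic dataset sizes (NOT TVAE).
for size in (50, 500, 5000):
    synthetic = make_toy_table(size, seed=size)
    gap, agreement = compare(real, synthetic, "target")
    print(f"size={size}: max |r_real - r_syn| = {gap:.3f}, rank agreement = {agreement:.3f}")
```

The same `compare` pattern could be applied to feature importance scores from a model such as Random Forest by replacing `target_correlations` with a function that returns one importance value per feature.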

Keywords: Synthetic data, Biomedical analytics, Machine learning, Tabular Variational Autoencoder (TVAE), Urinary bladder cancer

DOI: 10.54941/ahfe1006801
