Common column identification for table similarity detection in electrified transportation data lakes

Abstract:

Electrified transportation often requires researchers and operators to work with datasets from a wide range of sources and disciplines, such as transportation, power systems, public health, policy, and regulation. These datasets vary in quality and format, which makes them difficult to understand and preprocess, and makes it hard to identify the key columns that represent real-world entities or values for indexing and joining. This can negatively impact downstream analysis and operation. Existing solutions are limited because they require extensive manual customization or data expertise to use. In this article, we propose a multi-layered approach that automatically identifies key columns to expedite preprocessing and aid analysis of electrified transportation data. Our method leverages a dynamic ontology to identify common fields and an information theory-based strategy for edge cases that are difficult to generalize. Evaluations on a number of datasets from data.gov and kaggle.com show that our methods outperform several baseline techniques, and our ablation analyses illustrate the efficacy of the individual components of our method. Our case studies also demonstrate that our methods have the potential to improve analysis of electrified transportation data and aid in the automatic integration of such datasets.
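
As a rough illustration of what an information theory-based signal for key columns can look like, the sketch below ranks the columns of a small table by normalized Shannon entropy: a column whose values are all distinct scores 1.0, and a constant column scores 0.0. This is not the method from the publication, which the abstract does not detail. The function names (normalized_entropy, rank_key_candidates), the scoring rule, and the sample table are all assumptions made for the example.

```python
import math
from collections import Counter


def normalized_entropy(values):
    """Shannon entropy of a column's non-null values, scaled to [0, 1].

    Hypothetical scoring rule for illustration only: 1.0 means every
    value is distinct (key-like), 0.0 means the column is constant.
    """
    non_null = [v for v in values if v is not None]
    n = len(non_null)
    if n <= 1:
        return 0.0
    counts = Counter(non_null)
    # H = sum over distinct values of p * log2(1 / p), with p = count / n
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
    # Dividing by log2(n), the maximum possible entropy, gives a [0, 1] score
    return entropy / math.log2(n)


def rank_key_candidates(table):
    """Return (column name, score) pairs, most key-like column first."""
    scores = {name: normalized_entropy(col) for name, col in table.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up sample table of charging stations (assumption, not real data)
stations = {
    "station_id": ["S1", "S2", "S3", "S4"],
    "city": ["Denver", "Denver", "Boulder", "Boulder"],
    "connector": ["CCS", "CCS", "CCS", "CCS"],
}

ranking = rank_key_candidates(stations)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this toy table the all-distinct station_id column ranks first and the constant connector column ranks last. A practical system would likely combine such a score with other evidence, such as the ontology-based matching of common fields that the abstract describes.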

See publication:
https://www.sciencedirect.com/science/article/pii/S095741742504148X
This publication pertains to:
Systems of Systems
Publication Authors:
  • Spencer Paulissen
  • Matthew Bruchon
  • Jason Lustbader
  • Qin Lv
It appeared in:
Peer-reviewed technical journal
Keywords:
Electrified transportation, Key identification, Data integration, Data fusion, Ontology matching