Databases and Data Analytics

In today's world, data is multimodal, ubiquitous, and overwhelming. It is a great challenge to go from data to information, and then to knowledge and actionable intelligence. The area of databases and data analytics addresses this challenge and studies problems such as 1) how to efficiently organize data and query information from it; 2) how to mine and discover knowledge from data and aggregate intelligence obtained from multiple sources; 3) how to learn from data so that a computer can perform cognitive and reasoning tasks at a level of intelligence seen in humans. The members of the databases and data analytics group work on the theory and practice of data management, data mining, data visualization, knowledge discovery, machine learning and inference, and artificial intelligence. Their research builds foundations that may lead to business intelligence, facilitate discoveries in the sciences, and help achieve artificial intelligence.

Faculty

Feng Chen: Cloud Storage, Data-center Storage

Jianhua Chen: Databases, Machine Learning, Data Mining

Sukhamay Kundu: Social Networks

Rahul Shah: Databases, Top-K Retrieval, Data Compression

Mingxuan Sun: Statistical Machine Learning, Information Retrieval, Visualization

Evangelos Triantaphyllou: Data Mining, Knowledge Discovery, Decision Making

Qingyang Wang: Big Data Processing and Management

Jian Zhang: Machine Learning, Data Analysis, Massive Data

Specific Projects

Automatic ontology extraction: J. Chen develops a set of techniques for automatically extracting ontological components (concepts, taxonomic and non-taxonomic relations) from domain texts. A combination of information retrieval metrics, lexical knowledge bases, machine learning, and statistical approaches yields an effective approach to the ontology learning task.

Behavior-based malware analytics: Malicious software (malware) is one of the most serious threats to cyber security. Zhang designs and investigates machine learning methods that distinguish the behavior of malware from that of legitimate software and distinguish different types of malware from one another. His approaches to identifying malware are based on the behavior of the software (e.g., its execution trace). Supported by NSF.

Critical nuggets of data: Triantaphyllou develops new ways of improving classification accuracy by identifying critical nuggets of data. Given a dataset, the question is how to identify small subsets of the data (called critical nuggets) that play a critical role in classification analysis. That is, when they are removed, the classification model derived from the remaining data is significantly different from the model derived when all the data are considered.
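The definition above can be illustrated with a minimal sketch. It is not Triantaphyllou's method: the nearest-centroid classifier, the toy one-dimensional data, and the function names fit_centroids, predict, and nugget_score are all assumptions made for illustration. The sketch measures how much a model changes when a candidate subset is removed, as the fraction of probe points whose predicted label flips.

```python
from statistics import mean

def fit_centroids(points, labels):
    """Hypothetical stand-in model: one mean vector (centroid) per class."""
    groups = {}
    for p, y in zip(points, labels):
        groups.setdefault(y, []).append(p)
    return {y: tuple(mean(c) for c in zip(*ps)) for y, ps in groups.items()}

def predict(model, p):
    """Assign p to the class whose centroid is closest (squared Euclidean distance)."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(p, model[y])))

def nugget_score(points, labels, subset, probes):
    """Fraction of probe points whose predicted label changes
    when the training points indexed by `subset` are removed."""
    full = fit_centroids(points, labels)
    keep = [i for i in range(len(points)) if i not in subset]
    reduced = fit_centroids([points[i] for i in keep], [labels[i] for i in keep])
    changed = sum(predict(full, q) != predict(reduced, q) for q in probes)
    return changed / len(probes)

# Toy data (assumed): class "b" has one point, index 5, lying close to class "a".
points = [(0.0,), (1.0,), (2.0,), (9.0,), (10.0,), (4.0,)]
labels = ["a", "a", "a", "b", "b", "b"]
probes = [(4.5,), (5.0,), (3.0,), (8.0,)]

score_boundary = nugget_score(points, labels, {5}, probes)  # remove the point near the boundary
score_interior = nugget_score(points, labels, {1}, probes)  # remove a point at the class mean
print(score_boundary, score_interior)
```

In this toy setting, removing the point near the class boundary changes the predictions for half of the probes, while removing a point at the center of its class changes none, so the former behaves like a critical nugget under this assumed measure.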

Cloud and data-center storage management: F. Chen investigates cloud-based and data-center storage systems for large-scale data management. His research includes developing mechanisms to enhance the user experience with cloud storage, integrating emerging storage technologies into data-center environments, and optimizing data-intensive applications in enterprise systems. Supported by Louisiana BoR and NSF.

Data processing/management for enterprise applications: Wang studies efficient data processing and management for enterprise applications running in the cloud. He is developing tools that efficiently manage heterogeneous streaming data generated by different system layers or sources and filter out useful information for performance diagnosis and management of the target enterprise application. Supported by Louisiana BoR and NSF.

Deep learning in recommender systems: Sun investigates ways to integrate collaborative filtering models with deep learning in order to incorporate heterogeneous content, including text, images, and user session data, at large scale, with the goal of improving personalization and alleviating cold-start problems. With 2.5 years of industry experience in large-scale online music recommendation as a Senior Scientist at Pandora, she brings unique insights to these research challenges.

Disease diagnosis using gene expression data: Advanced DNA microarrays measure the expression levels of thousands of genes, which opens the door to better diagnosis of cancers and other diseases but also makes diagnosis extremely complex. Zhang designs and investigates statistical and machine learning methods to identify patterns of relevant genes that are effective indicators of disease.

Learnability of datasets: Triantaphyllou explores the learnability of datasets and finds ways to optimize classification accuracy. It has been observed many times that accurate classification models are easy to build from some training datasets and very hard to build from others, while still other datasets are of intermediate difficulty. These observations lead to the formulation of some critical questions, whose answers can help one optimize classification.

Mining and discovering knowledge for materials science: Zhang designs and investigates knowledge-discovery methods to model material properties (at both the macro and micro levels), to verify conjectures about those properties, and to discover salient events and information (of which the researchers may not be aware) in massive simulation data sets.

Scalable knowledge discovery: J. Chen develops novel adaptive statistical sampling methods and an intelligent data analysis tool for efficient, scalable knowledge discovery from massive data sets using ensemble learning. These methods significantly reduce sample size while maintaining competitive prediction accuracy, and can be useful for huge data sets in science and engineering. Supported by Louisiana BoR.

Social networks: Kundu develops efficient graph-theoretic and combinatorial algorithms for social network analysis and data mining.

Uncertain and top-k databases: Errors and uncertainty are inherent in big data. When extracting relevant information from data, ranking the results is needed not only for user convenience but also for reporting results efficiently. Shah's research builds the foundations for handling numeric, categorical, and string uncertain data, along with tools for top-k result retrieval. Supported by NSF-CCF.

Visualizing and analyzing big preference data: Sun develops a statistically interpretable and computationally efficient framework and machine learning algorithms for the analysis of big preference data, with applications in a wide range of disciplines including public health, social science, e-commerce, and education. Supported by Louisiana BoR.