Methods and techniques

Select completed and ongoing projects that develop new big data methods and techniques are briefly described below. If you need further information, please contact us.

Probabilistic graphical modeling

This funded research develops probabilistic models for understanding complex domains with large amounts of uncertainty. It involves developing methods and algorithms for representation, inference, and learning with probabilistic graphical models, e.g., Bayesian networks and Markov networks. The resulting techniques are applied to real-world problems such as heterogeneous data integration, imbalanced data learning, and big data learning.
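
As a minimal illustration of inference in a Bayesian network, the sketch below computes a posterior by enumerating the joint distribution of a three-variable network (rain, sprinkler, wet grass). The network, its probability tables, and all function names are hypothetical textbook-style choices for illustration; they are not taken from the project.

```python
from itertools import product

# Hypothetical conditional probability tables (illustrative numbers only).
P_RAIN = {True: 0.2, False: 0.8}
# P(sprinkler | rain): outer key is rain, inner key is sprinkler.
P_SPRINKLER = {True: {True: 0.01, False: 0.99},
               False: {True: 0.4, False: 0.6}}
# P(wet = True | sprinkler, rain): key is (sprinkler, rain).
P_WET = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.8, (False, False): 0.0}


def joint(rain, sprinkler, wet):
    """Joint probability factorized along the network structure."""
    p_wet = P_WET[(sprinkler, rain)]
    return (P_RAIN[rain]
            * P_SPRINKLER[rain][sprinkler]
            * (p_wet if wet else 1.0 - p_wet))


def posterior_rain_given_wet():
    """P(rain = True | wet = True), summing out the sprinkler variable."""
    numerator = sum(joint(True, s, True) for s in (True, False))
    evidence = sum(joint(r, s, True)
                   for r, s in product((True, False), repeat=2))
    return numerator / evidence


if __name__ == "__main__":
    print(round(posterior_rain_given_wet(), 4))
```

Enumeration is exponential in the number of variables, which is one reason dedicated inference algorithms are needed for large models.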

Mining Twitter data: From content to connections

Microblogging has quickly grown into a leading form of social interaction. Though many websites such as FriendFeed, Dailybooth, and Tumblr support microblogging, Twitter is the most favored microblogging platform. Its 500 million registered users post more than 400 million tweets every day. Twitter’s ability to propagate real-time information to a wide set of users makes it a potential system for disseminating vital information.

About our Twitter database and infrastructure at WSU: We collect streaming data from Twitter’s firehose API, which gives us about 10 percent of all Twitter data. We obtain about 5 GB of data and about 19 million tweets each day. Because our data is extremely “big and growing”, we have established a fully distributed database that can run parallel queries through the API. Our setup is designed to minimize query time for big data processing. The system can retrieve and analyze a wide array of information from the Twitter data, such as the retweet network, follower and friend networks, Twitter Lists, geo-location-based statistics, and topic models of tweets.

Recent Work
Location-specific tweet detection and topic summarization in Twitter: We developed a novel framework to identify and summarize tweets that are specific to a particular geographical location. Our new weighting scheme, called Location Centric Word Co-occurrence (LCWC), uses the content of the tweets and the network information of the “twitterers” to identify tweets that are location-specific. Using our approach, the topics that are specific to a particular location of interest are summarized and presented to end users. In our analysis, we found that (a) top trending tweets from a location are poor descriptors of location-specific tweets, (b) ranking tweets by users’ geo-location cannot ascertain the location specificity of the tweets, and (c) the users’ network information plays an important role in determining the location-specific characteristics of the tweets.
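
The exact LCWC weighting is not defined in this summary, so the sketch below only illustrates the underlying building block, word co-occurrence counting, under stated assumptions. It counts word pairs that appear in the same tweet and reports what share of each pair's occurrences comes from a location of interest. The function names, the whitespace tokenization, and the share-based score are all hypothetical and are not the LCWC scheme itself; network information is not modeled here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tweets):
    """Count unordered word pairs that occur together in the same tweet."""
    counts = Counter()
    for text in tweets:
        words = sorted(set(text.lower().split()))  # naive tokenization (assumption)
        counts.update(combinations(words, 2))
    return counts


def location_share(local_tweets, all_tweets):
    """Fraction of each pair's co-occurrences that come from the local tweets.

    all_tweets is assumed to include local_tweets.
    """
    local = cooccurrence_counts(local_tweets)
    overall = cooccurrence_counts(all_tweets)
    return {pair: local[pair] / overall[pair] for pair in local if overall[pair]}


if __name__ == "__main__":
    local = ["tigers win downtown", "downtown traffic jam"]
    elsewhere = ["traffic jam everywhere", "tigers win again"]
    for pair, share in sorted(location_share(local, local + elsewhere).items()):
        print(pair, share)
```

A pair seen only in the local tweets gets a share of 1.0, while a pair that is equally common elsewhere gets a lower share.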

Low-rank approximation-based spectral clustering for big data analytics

Spectral clustering is a well-known graph-theoretic approach to finding natural groupings in a given dataset. Today, digital data are accumulating faster than ever in various fields, such as the Web, science, engineering, biomedicine, and real-world sensing. It is not uncommon for a dataset to contain tens of thousands of samples and/or features, and spectral clustering generally becomes infeasible for analyzing data at this scale. In this project, we propose a Low-rank Approximation-based Spectral (LAS) clustering method for big data analytics. By integrating low-rank matrix approximations, i.e., approximations to the affinity matrix and its subspace as well as to the Laplacian matrix and the Laplacian subspace, LAS gains substantial efficiency in both computation and memory when processing big data. In addition, we propose various fast sampling strategies to select data samples efficiently. From a theoretical perspective, we mathematically prove the correctness of LAS and analyze its approximation error and computational complexity.
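
To make the idea of a low-rank approximation inside spectral clustering concrete, the sketch below uses the standard Nyström approximation with uniformly sampled landmark points. It is not the LAS algorithm, whose details are not given in this summary; the Gaussian affinity, the uniform sampling, the small k-means routine, and every function and parameter name are assumptions made for illustration. NumPy is assumed to be available.

```python
import numpy as np


def kmeans(E, k, iters=20):
    """Small Lloyd's k-means with deterministic farthest-first initialization."""
    centers = [E[0]]
    for _ in range(1, k):
        dist = np.min([((E - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(E[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(iters):
        d2 = ((E[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(d2, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = E[labels == j].mean(axis=0)
    return labels


def nystrom_spectral_clustering(X, k, m, sigma=1.0, seed=0):
    """Cluster the rows of X into k groups using m landmark points."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=m, replace=False)  # uniform sampling (assumption)
    L = X[idx]

    # C: Gaussian affinities between all n points and the m landmarks (n x m).
    sq = ((X[:, None, :] - L[None, :, :]) ** 2).sum(axis=2)
    C = np.exp(-sq / (2.0 * sigma ** 2))
    W_mm = C[idx]  # m x m affinities among the landmarks

    # Build G so that G @ G.T = C @ pinv(W_mm) @ C.T, the Nystrom approximation
    # of the full n x n affinity matrix, which is never formed explicitly.
    vals, vecs = np.linalg.eigh(W_mm)
    keep = vals > 1e-8 * vals.max()  # drop near-zero eigenvalues
    G = C @ (vecs[:, keep] / np.sqrt(vals[keep]))

    # Approximate degrees, then the normalized affinity D^{-1/2} W D^{-1/2}.
    d = np.maximum(G @ (G.T @ np.ones(n)), 1e-12)
    G = G / np.sqrt(d)[:, None]

    # Leading eigenvectors of G @ G.T are the leading left singular vectors of G.
    U, _, _ = np.linalg.svd(G, full_matrices=False)
    E = U[:, :k]
    E = E / np.maximum(np.linalg.norm(E, axis=1, keepdims=True), 1e-12)
    return kmeans(E, k)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(5.0, 0.3, (50, 2))])
    print(nystrom_spectral_clustering(X, k=2, m=20))
```

In this sketch only the n-by-m matrix C is ever stored, so memory grows linearly in the number of samples for a fixed number of landmarks, instead of quadratically as with the full affinity matrix.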

Addressing big data challenges in genome sequencing and RNA interaction prediction

1. Single-cell genome sequencing: 

Enormous progress towards ubiquitous DNA sequencing has brought a whole new realm of exciting applications within reach, one of which is genomic analysis at single-cell resolution. Single-cell genome sequencing holds great promise for various areas of biology, including tumor phylogenetics and environmental biology, where it enables the study of a myriad of uncultivable bacteria in environments ranging from the human body to the oceans. The Algorithmic Biology Lab (ABL) has developed two single-cell genome assembly tools, Velvet-SC and HyDA, that can process terabyte-scale DNA sequencing data sets. We do not call that Big Data, even though some researchers may consider a single DNA sequencing data set big. Big Data challenges emerge when we have to deal with a sample that contains millions, and sometimes billions, of single cells. Our key observation is that such a sample is highly redundant, because many cells are biological replicates. Funded by NSF ABI, our group works on compressive sensing algorithms to extract all the genomes in a sample with minimal sequencing cost and computational effort. The size of such data can reach a few petabytes.

2. RNA structure and RNA-RNA interaction prediction:

RNA has found a new key role in the research arena after astonishing discoveries of regulatory mechanisms of non-coding RNAs in the late 1990s. The importance of those discoveries was recognized when the Nobel Prize was awarded in 2006, only a few years later, to Andrew Fire and Craig Mello for their discovery of RNA interference, i.e., gene silencing by double-stranded RNA. The ABL develops RNA secondary structure and RNA-RNA interaction prediction algorithms. Although the input data sets are RNA sequences, which are not large, our O(n^6) running-time and O(n^4)-memory algorithms require several hundred gigabytes of memory even for small RNA sequences. Therefore, the intermediary data generated by the algorithm, along the way from sequence data to structure or interaction information, poses Big Data challenges. For instance, machine learning algorithms and the mining of folding pathways from such intermediary data sets need deep computational tricks, e.g., topology-preserving dimensionality reduction, to become tractable on today's machines.
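
To see why an O(n^4) memory footprint becomes a big data problem even for short inputs, the following back-of-the-envelope calculation may help. It assumes a single dense table with n^4 entries of 8 bytes each; both the single table and the entry size are assumptions for illustration, and actual implementations may keep several tables or exploit sparsity.

```python
def table_gigabytes(n, bytes_per_entry=8):
    """Size in GB (10**9 bytes) of a dense table with n**4 entries."""
    return n ** 4 * bytes_per_entry / 1e9


def growth_factors(scale):
    """Factors by which O(n^6) time and O(n^4) memory grow when n is multiplied by scale."""
    return scale ** 6, scale ** 4


if __name__ == "__main__":
    for n in (100, 300, 500):
        print(n, "nucleotides:", table_gigabytes(n), "GB")
    print("doubling n:", growth_factors(2))
```

Under these assumptions a sequence of length 500 already needs about 500 GB, and doubling the sequence length multiplies memory by 16 and running time by 64.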

Detecting qualitative changes in living systems

Currently, early detection of complex diseases is achieved only after the physiological traits of the phenotype are present; in the case of cancer, for instance, only when the tumor is already present. Qualitative changes detected at the genomic level can help prevent the evolution of complex diseases right at the onset. Instead of detecting the presence of cancer, we aim to detect the departure from the healthy state. The big data challenge here comes from the added need to monitor and continuously analyze the expression levels of 30,000 genes and more than 100,000 proteins over many time points, leading to a major data explosion. Results on several model organisms using simulated and real data show that our method can accurately detect intervals when the biological system (i.e., the cell) changes from one qualitative state to another. To the best of our knowledge, this would be the first tool able to pinpoint, using high-throughput transcriptome data and signaling pathways, a moment in time when a system significantly changes its state in a qualitative way.
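
As a simplified illustration of locating a state change in time-course expression data, the sketch below finds the single time point that best splits a set of gene profiles into two constant-level segments by minimizing the summed squared error. This is a generic change-point heuristic and not the method described above, which also uses signaling pathways; the function names and the toy data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean of values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_change_point(profiles):
    """Return the time index t that best splits all profiles into [0, t) and [t, end).

    profiles is a list of equal-length expression time series, one per gene.
    """
    n_times = len(profiles[0])
    if n_times < 2:
        raise ValueError("need at least two time points")
    best_t, best_cost = None, float("inf")
    for t in range(1, n_times):
        # Cost of modeling every gene as constant before t and constant after t.
        cost = sum(sse(p[:t]) + sse(p[t:]) for p in profiles)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t


if __name__ == "__main__":
    toy_profiles = [
        [1.0, 1.1, 0.9, 1.0, 5.0, 5.1, 4.9, 5.0],  # gene that switches on
        [3.0, 3.1, 2.9, 3.0, 0.5, 0.4, 0.6, 0.5],  # gene that switches off
    ]
    print(best_change_point(toy_profiles))
```

This sketch always returns some split, even for data with no real change, so a practical detector would also need a significance test before reporting a change.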