Dr. Barna Saha of AT&T Research will speak on "Clean Data and Unlimited Resources: A Probabilistic Way to a Fantasy World."
Over the last decade, while the quantity of digital data skyrocketed, the quality of data
remained far from perfect. Use of automated tools and fallible humans for data collection
and integration, complex storage, and unreliable channels for data processing and
communication all contribute to low quality of data. Since poor data quality can have
serious consequences on the results of data analyses, the importance of veracity, the fourth
'V' of Big Data, is increasingly being recognized. On the other hand, with the rise of data
centers as a popular computing platform with intensive demand, the need for proper
allocation of data center resources is more important than ever.
In the first part of this talk, we develop scalable (near-linear time) algorithms to repair
semi-structured data, such as XML, popular for data interchange and storage. While more
than half of Web data is XML, a large fraction of it is erroneous, with the major source of
errors being mismatch between open and close tags. This problem of repairing XML data
leads to a broad class of problems that we call the ``Language Edit Distance Problem’’:
given a string σ over alphabet Σ and a grammar G defined over Σ, compute the minimum
number of repairs (insertions, deletions and substitutions), that are required to map σ into a
valid member of G. The language edit distance problem is a significant generalization of
the widely used string edit distance problem and has many applications beyond data
quality, in compiler optimization, natural language processing, biological sciences, etc.
Randomization here plays a crucial role in the design and analysis of algorithms.
In the second part of the talk, we summarize our contributions in resource allocation
problems. Specifically, we present our work on a powerful probabilistic method, the
Lovász Local Lemma, and show how this leads to resolving a major open question, and the
very first set of algorithms for a large class of combinatorial optimization problems.
Barna Saha is a Senior Member of Technical Staff Research at AT&T-Shannon Labs, and has been there since August 2011 after completing her Ph.D. from the University of Maryland, College Park working with Samir Khuller. Her research interests span algorithm design and analysis, discrete optimization, and foundational aspects of databases and data management. She received a Dean's Dissertation Fellowship Award for excellence in dissertation research. She is the recipient of the best paper award at the Very Large Data Bases Conference (VLDB'09) for her work on uncertain ranking. Her work on designing scalable algorithms for data quality problems was the
finalist for best paper awards at IEEE International Conference on Data Engineering (ICDE'12).
Events are free and open to the public unless otherwise noted.