Sanguthevar Rajasekaran, University of Connecticut
In this presentation we summarize some of the novel algorithms we have recently proposed in the context of record linkage. Blocking is a technique that is typically used to speed up record linkage algorithms. Recently, we introduced a novel blocking algorithm called SuperBlocking and created record linkage algorithms that employ it. Experimental comparisons reveal that our algorithms outperform state-of-the-art record linkage algorithms. We have also developed parallel versions of our record linkage algorithms, and they obtain close to linear speedups. We will provide details on these algorithms in this presentation.

Each record can be thought of as a string of characters, and numerous distance metrics for strings can be found in the literature; the performance of a record linkage algorithm may depend on the metric used. Popular examples include edit distance (also known as the Levenshtein distance), q-gram distance, and Hausdorff distance. The Jaro distance is another metric that is widely used in applications such as record linkage. The best-known prior algorithms for computing the Jaro distance between two strings took quadratic time. Recently, we presented a linear-time algorithm for Jaro distance computation, which we will also summarize in this presentation.
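For reference, a minimal sketch of the standard quadratic-time Jaro similarity computation follows (the widely used textbook definition, not the linear-time algorithm summarized in the talk): two characters match if they are equal and no farther apart than floor(max(|s1|, |s2|)/2) - 1 positions, and transpositions are counted among the matched characters.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Standard O(|s1|*|s2|) Jaro similarity: 1.0 for identical strings, 0.0 for no match."""
    if not s1 or not s2:
        return 1.0 if s1 == s2 else 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    s1_used = [False] * len(s1)
    s2_used = [False] * len(s2)

    # Count characters of s1 that match an unused character of s2 within the window.
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not s2_used[j] and s2[j] == c:
                s1_used[i] = s2_used[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: half the number of positions where the matched characters disagree in order.
    a = [c for i, c in enumerate(s1) if s1_used[i]]
    b = [c for j, c in enumerate(s2) if s2_used[j]]
    transpositions = sum(x != y for x, y in zip(a, b)) / 2

    return (matches / len(s1) + matches / len(s2) + (matches - transpositions) / matches) / 3
```

For example, jaro_similarity("MARTHA", "MARHTA") evaluates to about 0.944 (six matching characters, one transposition).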
Xiaowei Xu, University of Arkansas, Little Rock; Xingqiao Wang, Vivekanandan Gunasekaran
The development of Artificial Intelligence has led to sophisticated
language models that rival human writing. However, their use in specialized
areas can yield unsafe, biased, or factually incorrect outputs. Our
innovative AI framework adopts a 'Train Once, Apply Anywhere' (TOAA)
approach, modifying these foundation models for safer, more robust use
across different domains.
Our method involves transferring knowledge from a foundational Large
Language Model into a Customized Language Model (CLLM). This process
significantly reduces the model's size while maintaining its performance,
enabling efficient operation on consumer-grade computers. The CLLM offers
fast processing, cost-effectiveness, and enhanced accuracy.
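The abstract does not specify the transfer mechanism; purely as a point of reference, the sketch below shows one common way such knowledge transfer is implemented, a standard knowledge-distillation loss in which a small student model is trained to match a larger teacher's output distribution. The function, hyperparameters, and tensor names here are hypothetical, not the authors' method.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend a soft-target KL term (match the teacher) with the usual
    cross-entropy on ground-truth labels. Hyperparameters are illustrative."""
    # Soften both distributions so the student learns the teacher's relative
    # preferences over tokens, not just its top-1 prediction.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kl + (1 - alpha) * ce
```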
A key feature of our CLLM is its single-training, multi-domain application
capability, contrasting with traditional AI models limited to their
training domain. This flexibility marks a significant shift in AI
methodologies.
We tested our TOAA framework using various foundation models, including
GPT-3.5, Dolly, and LLAMA, focusing on entity matching, an essential task
for data integrity. Our CLLM, trained on one dataset, excelled across
multiple domains, showcasing superior accuracy, a 50-fold increase in
speed, and linear cost savings compared to using foundation models.
This study confirms the TOAA framework's effectiveness for domain-specific
tasks, advancing the practicality, safety, and efficiency of AI deployment
toward an Omni Trust AI.
Vivek Gunasekaran, UALR; Xiaowei Xu, UALR
The proliferation of Large Language Models (LLMs) has significantly transformed the landscape of natural language processing, content
generation, and information retrieval. However, their widespread adoption
raises concerns regarding potential vulnerabilities that can be exploited
for malicious purposes.
This study provides an in-depth exploration of LLM vulnerability
implementation, encompassing a thorough analysis of theoretical
foundations, practical implications, and proactive mitigation strategies.
The research identifies key factors, such as model architectures, training
data, and deployment scenarios, that can introduce inherent weaknesses in
LLMs. In addition, these vulnerabilities can be introduced during various
phases, such as the design, development, deployment, maintenance, and
operations of LLM-based applications. Ongoing monitoring and iterative
model updates are also discussed as essential components of a dynamic and
adaptive security strategy.
In conclusion, this study offers a comprehensive examination of LLM
vulnerability implementation and its mitigation. By addressing these
vulnerabilities, it contributes to the development of more secure,
responsible, and trustworthy LLMs that foster confidence in their
applications across various domains.
Demo of OMNIMatch and Guidance
Xingqiao Wang, UALR; Xiaowei Xu, UALR; Vivek Gunasekaran, UALR
In the evolving landscape of data management, entity matching stands as a
critical yet challenging task. OMNIMatch emerges as a revolutionary
solution, harnessing the power of Large Language Models (LLMs) to redefine
entity matching. This demo introduces OMNIMatch, highlighting its role in
simplifying and enhancing the accuracy of entity matching processes.
The demonstration will guide viewers through a variety of real-world
scenarios, showcasing OMNIMatch's applicability across multiple tasks. It
will highlight the tool's proficiency in processing two types of datasets:
one similar to US Census data, and another comprising simulated household
data. Each scenario is selected to demonstrate OMNIMatch's capability in
managing these intricate data structures and its versatile functionality.
Designed for data professionals and business analysts alike, OMNIMatch's
applications span sectors, offering transformative benefits in data
quality and insights. This demo invites you to witness firsthand the future
of entity matching, showcasing how OMNIMatch stands at the forefront of
data management innovation.
Beatrix Haddock, Institute for Health Metrics and Evaluation, University of Washington; Alix Pletcher, Institute for Health Metrics and Evaluation, University of Washington; Nathaniel Blair-Stahn, Institute for Health Metrics and Evaluation, University of Washington; Os Keyes, Institute for Health Metrics and Evaluation, University of Washington; Matt Kappel, Institute for Health Metrics and Evaluation, University of Washington; Steve Bachmeier, Institute for Health Metrics and Evaluation, University of Washington; Syl Lutze, Institute for Health Metrics and Evaluation, University of Washington; James Albright, Institute for Health Metrics and Evaluation, University of Washington; Alison Bowman, Institute for Health Metrics and Evaluation, University of Washington; Caroline Kinuthia, Institute for Health Metrics and Evaluation, University of Washington; Rajan Mudambi, Institute for Health Metrics and Evaluation, University of Washington; Abraham D. Flaxman, Institute for Health Metrics and Evaluation, University of Washington; Zeb Burke-Conte (pronouns: he/him), Institute for Health Metrics and Evaluation, University of Washington
Entity resolution (also known as record linkage) is the data science
challenge of determining which records correspond to the same real-life
entity, such as a person, business, or establishment.
The United States Census Bureau regularly performs entity resolution on
administrative lists containing hundreds of millions to billions of
records. However, these administrative lists contain personally identifiable information (PII) and are highly
confidential, preventing those outside the Bureau from understanding the
Bureau's entity resolution challenges in detail.
In this session, we present pseudopeople, an open-source Python package
that generates simulated datasets with hundreds of millions of records,
which resemble the administrative lists linked by the Census Bureau.
pseudopeople is based on an individual-based microsimulation of the United
States population, including dynamics such as migration, mortality, and
fertility. pseudopeople users can customize the noise present in the
datasets generated.
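As a rough illustration of how such customization might look, the sketch below follows the generate_* entry points and dict-based noise configuration described in the pseudopeople documentation; the exact function, option, and key names should be treated as assumptions rather than a definitive API reference.

```python
import pseudopeople as psp

# Increase the rate at which first names contain typos, as an example of
# customizing the noise applied to a generated dataset (key names assumed).
noise_config = {
    "decennial_census": {
        "column_noise": {
            "first_name": {"make_typos": {"cell_probability": 0.05}},
        },
    },
}

# With no `source` argument, the package generates a small sample population;
# the full-scale simulated US population requires separately obtained input data.
census = psp.generate_decennial_census(seed=0, year=2020, config=noise_config)
print(census.head())
```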
pseudopeople data can be used to create authentic entity resolution tasks
for testing new methods or software. We present an example of an entity
resolution pipeline emulating the methods used by the Census Bureau, using
only freely available open-source software and simulated data.
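The example pipeline itself is not reproduced here, but the following toy sketch illustrates the general shape of such a pipeline on pseudopeople-style records: block candidate pairs on a cheap key, then compare names within each block. Column names, comparator, and threshold are illustrative only; a real pipeline would use proper string comparators and probabilistic linkage.

```python
import pandas as pd
from difflib import SequenceMatcher

# Two small record sets with pseudopeople-style demographic columns (illustrative).
census = pd.DataFrame({
    "record_id": [1, 2], "first_name": ["Martha", "Jon"],
    "last_name": ["Diaz", "Smith"], "zipcode": ["98101", "72201"],
})
taxes = pd.DataFrame({
    "record_id": [10, 11], "first_name": ["Marhta", "Jane"],
    "last_name": ["Diaz", "Smith"], "zipcode": ["98101", "72201"],
})

def name_similarity(a: str, b: str) -> float:
    # Stand-in comparator; a real pipeline would use Jaro-Winkler or similar.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Blocking: only compare record pairs that agree on last name and ZIP code.
pairs = census.merge(taxes, on=["last_name", "zipcode"], suffixes=("_a", "_b"))

# Comparison and a simple threshold decision within each block.
pairs["score"] = [
    name_similarity(a, b) for a, b in zip(pairs["first_name_a"], pairs["first_name_b"])
]
matches = pairs[pairs["score"] > 0.8][["record_id_a", "record_id_b", "score"]]
print(matches)
```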
Onais Khan Mohammed, University of Arkansas at Little Rock; John R. Talburt, University of Arkansas at Little Rock; Adeeba Tarannum, University of Arkansas at Little Rock; Abdul Kareem Khan Kashif, University of Arkansas at Little Rock; Salman Khan, University of Arkansas at Little Rock; Khizer Syed, University of Arkansas at Little Rock
This work describes the research and development of a tool that parses demographic items into a standard set of fields to achieve metadata alignment, using a technique based on token pattern mappings augmented by active learning. Input strings are tokenized, and a token mask is created by replacing each token with a single-character code indicating the token's potential function in the input string. A user-created mapping then directs each token represented in the mask to its correct functional category. Testing has shown the system to be as accurate as, and in some cases more accurate than, comparable parsing systems. The primary advantage of this approach over other systems is that when an input does not conform to any previously encoded mapping, a user can simply add a new mapping instead of reprogramming system parsing rules or retraining a supervised parsing machine learning model.

The parsed address components, obtained by identifying and separating the street address, city, state, and postal code, are essential for the use of HiPER indices, Boolean rules, and scoring rules, which play a crucial role in the implementation of various data preparation functions. These components are stored in a structured format, allowing them to be easily retrieved and used in various applications.
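As a rough illustration of the token-mask idea described above (not the actual tool), the sketch below uses hypothetical single-character codes and a hypothetical user-created mapping from masks to functional categories.

```python
import re

# Hypothetical single-character codes for a token's potential function.
def token_code(token: str) -> str:
    if re.fullmatch(r"\d{5}(-\d{4})?", token):
        return "Z"                      # ZIP / postal code pattern
    if token.isdigit():
        return "N"                      # other number (e.g., house number)
    if token.upper() in {"ST", "AVE", "RD", "BLVD", "LN", "DR", "STREET", "AVENUE"}:
        return "T"                      # street-type word
    if len(token) == 2 and token.isalpha() and token.isupper():
        return "S"                      # two-letter state abbreviation
    return "A"                          # generic alphabetic token

# Hypothetical user-created mapping: token mask -> functional category per token.
MASK_MAP = {
    "NATSZ": ["street_number", "street_name", "street_type", "state", "postal_code"],
    "NATASZ": ["street_number", "street_name", "street_type", "city", "state", "postal_code"],
}

def parse(address: str) -> dict:
    tokens = address.replace(",", " ").split()
    mask = "".join(token_code(t) for t in tokens)
    categories = MASK_MAP.get(mask)
    if categories is None:
        # In the described approach, this is where the user would be asked to
        # supply a new mask-to-category mapping (the active learning step).
        raise KeyError(f"No mapping for mask {mask!r}; a new mapping is needed")
    return dict(zip(categories, tokens))

print(parse("401 Main St AR 72201"))
# {'street_number': '401', 'street_name': 'Main', 'street_type': 'St',
#  'state': 'AR', 'postal_code': '72201'}
```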