Computation and Language 78
☆ BLINK: Multimodal Large Language Models Can See but Not Perceive
Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, Ranjay Krishna
We introduce Blink, a new benchmark for multimodal language models (LLMs)
that focuses on core visual perception abilities not found in other
evaluations. Most of the Blink tasks can be solved by humans "within a blink"
(e.g., relative depth estimation, visual correspondence, forensics detection,
and multi-view reasoning). However, we find these perception-demanding tasks
pose significant challenges for current multimodal LLMs because they resist
mediation through natural language. Blink reformats 14 classic computer vision
tasks into 3,807 multiple-choice questions, paired with single or multiple
images and visual prompting. While humans get 95.70% accuracy on average, Blink
is surprisingly challenging for existing multimodal LLMs: even the
best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only
13.17% and 7.63% higher than random guessing, indicating that such perception
abilities have not "emerged" yet in recent multimodal LLMs. Our analysis also
highlights that specialist CV models could solve these problems much better,
suggesting potential pathways for future improvements. We believe Blink will
stimulate the community to help multimodal LLMs catch up with human-level
visual perception.
comment: Multimodal Benchmark, Project Url: https://zeyofu.github.io/blink/
☆ Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
Aitor Ormazabal, Che Zheng, Cyprien de Masson d'Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, Isaac Ong, Kaloyan Aleksiev, Lei Li, Matthew Henderson, Max Bain, Mikel Artetxe, Nishant Relan, Piotr Padlewski, Qi Liu, Ren Chen, Samuel Phua, Yazheng Yang, Yi Tay, Yuqi Wang, Zhongkai Zhu, Zhihui Xie
We introduce Reka Core, Flash, and Edge, a series of powerful multimodal
language models trained from scratch by Reka. Reka models are able to process
and reason with text, images, video, and audio inputs. This technical report
discusses details of training some of these models and provides comprehensive
evaluation results. We show that Reka Edge and Reka Flash are not only
state-of-the-art but also outperform many much larger models, delivering
outsized value for their respective compute class. Meanwhile, our most capable
and largest model, Reka Core, approaches the best frontier models on both
automatic evaluations and blind human evaluations. On image question answering
benchmarks (e.g. MMMU, VQAv2), Core performs competitively with GPT4-V.
Meanwhile, on multimodal chat, Core ranks as the second most preferred model
under a blind third-party human evaluation setup, outperforming other models
such as Claude 3 Opus. On text benchmarks, Core not only performs competitively
with other frontier models on a set of well-established benchmarks (e.g. MMLU,
GSM8K) but also outperforms GPT4-0613 on human evaluation. On video question
answering (Perception-Test), Core outperforms Gemini Ultra. Models are shipped
in production at http://chat.reka.ai . A showcase of non cherry picked
qualitative examples can also be found at http://showcase.reka.ai .
☆ When LLMs are Unfit Use FastFit: Fast and Effective Text Classification with Many Classes NAACL
We present FastFit, a method and a Python package designed to provide fast and
accurate few-shot classification, especially for scenarios with many
semantically similar classes. FastFit utilizes a novel approach integrating
batch contrastive learning and token-level similarity scores. Compared to
existing few-shot learning packages, such as SetFit, Transformers, or few-shot
prompting of large language models via API calls, FastFit significantly
improves multiclass classification performance in speed and accuracy across
FewMany, our newly curated English benchmark, and multilingual datasets.
FastFit demonstrates a 3-20x improvement in training speed, completing training
in just a few seconds. The FastFit package is now available on GitHub and PyPi,
presenting a user-friendly solution for NLP practitioners.
comment: Accepted to NAACL
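A rough sketch of the token-level similarity idea mentioned above (not the FastFit package's actual API; the function name, shapes, and scoring details are assumptions): each token of the input is matched against its most similar token in a few-shot class example, and the per-token best matches are pooled into a class score.

    import torch
    import torch.nn.functional as F

    def token_level_similarity(query_tokens: torch.Tensor, class_tokens: torch.Tensor) -> torch.Tensor:
        """query_tokens: (Tq, d) contextual embeddings of the input text;
        class_tokens: (Tc, d) embeddings of a few-shot example of a class."""
        q = F.normalize(query_tokens, dim=-1)
        c = F.normalize(class_tokens, dim=-1)
        sim = q @ c.T                         # (Tq, Tc) pairwise cosine similarities
        return sim.max(dim=-1).values.mean()  # best match per query token, averaged

In a batch-contrastive setup, scores like this would be pushed up for the correct class and down for the other classes in the batch.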
☆ Large Language Models in Targeted Sentiment Analysis
In this paper we investigate the use of decoder-based generative transformers
for extracting sentiment towards the named entities in Russian news articles.
We study sentiment analysis capabilities of instruction-tuned large language
models (LLMs). We consider the dataset of RuSentNE-2023 in our study. The first
group of experiments was aimed at the evaluation of zero-shot capabilities of
closed- and open-source LLMs. The second covers the fine-tuning of
Flan-T5 using the "chain-of-thought" (CoT) three-hop reasoning framework
(THoR). We found that the results of the zero-shot approaches are similar to
the results achieved by baseline fine-tuned encoder-based transformers
(BERT-base). Reasoning capabilities of the fine-tuned Flan-T5 models with THoR
achieve at least a 5% improvement with the base-size model compared to the
results of the zero-shot experiment. The best results of sentiment analysis on
RuSentNE-2023 were achieved by fine-tuned Flan-T5-xl, which surpassed the
results of previous state-of-the-art transformer-based classifiers. Our CoT
application framework is publicly available:
https://github.com/nicolay-r/Reasoning-for-Sentiment-Analysis-Framework
comment: Fine-tuned Flan-T5-xl outperforms the top results of
  transformer-based classifiers in the RuSentNE-2023 competition; to appear in
  the Lobachevskii Journal of Mathematics No.8/2024 proceedings
☆ Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment
Aligning language models (LMs) based on human-annotated preference data is a
crucial step in obtaining practical and performant LM-based systems. However,
multilingual human preference data are difficult to obtain at scale, making it
challenging to extend this framework to diverse languages. In this work, we
evaluate a simple approach for zero-shot cross-lingual alignment, where a
reward model is trained on preference data in one source language and directly
applied to other target languages. On summarization and open-ended dialog
generation, we show that this method is consistently successful under
comprehensive evaluation settings, including human evaluation: cross-lingually
aligned models are preferred by humans over unaligned models on up to >70% of
evaluation instances. We moreover find that a different-language reward model
sometimes yields better aligned models than a same-language reward model. We
also identify best practices when there is no language-specific data for even
supervised finetuning, another component in alignment.
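One simple way to use such a transferred reward model, sketched below under assumptions (the `reward_model` callable is hypothetical, and the paper's alignment procedure may differ): rank candidate responses in the target language with a reward model trained only on source-language preference data and keep the best one.

    from typing import Callable, List

    def best_of_n(prompt: str, candidates: List[str],
                  reward_model: Callable[[str, str], float]) -> str:
        """Score target-language candidates with a source-language-trained RM
        and return the highest-scoring one (best-of-n reranking)."""
        scores = [reward_model(prompt, c) for c in candidates]
        return candidates[max(range(len(candidates)), key=lambda i: scores[i])]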
☆ Simultaneous Interpretation Corpus Construction by Large Language Models in Distant Language Pair
In Simultaneous Machine Translation (SiMT) systems, training with a
simultaneous interpretation (SI) corpus is an effective method for achieving
high-quality yet low-latency systems. However, it is very challenging to curate
such a corpus due to limitations in the abilities of annotators, and hence,
existing SI corpora are limited. Therefore, we propose a method to convert
existing speech translation corpora into interpretation-style data, maintaining
the original word order and preserving the entire source content using large
language models, yielding the LLM-SI-Corpus. We demonstrate that fine-tuning SiMT models in
text-to-text and speech-to-text settings with the LLM-SI-Corpus reduces
latencies while maintaining the same level of quality as the models trained
with offline datasets. The LLM-SI-Corpus is available at
\url{https://github.com/yusuke1997/LLM-SI-Corpus}.
comment: 23 pages, 9 figures
☆ Augmenting emotion features in irony detection with Large language modeling
This study introduces a novel method for irony detection, applying Large
Language Models (LLMs) with prompt-based learning to facilitate emotion-centric
text augmentation. Traditional irony detection techniques typically fall short
due to their reliance on static linguistic features and predefined knowledge
bases, often overlooking the nuanced emotional dimensions integral to irony. In
contrast, our methodology augments the detection process by integrating subtle
emotional cues, generated through LLMs, into three benchmark pre-trained NLP
models - BERT, T5, and GPT-2 - which are widely recognized as foundational in
irony detection. We assessed our method using the SemEval-2018 Task 3 dataset
and observed substantial enhancements in irony detection capabilities.
comment: 11 pages, 3 tables, 2 figures. Submitted to the 25th Chinese Lexical
Semantics Workshop
☆ Resilience through Scene Context in Visual Referring Expression Generation
Scene context is well known to facilitate humans' perception of visible
objects. In this paper, we investigate the role of context in Referring
Expression Generation (REG) for objects in images, where existing research has
often focused on distractor contexts that exert pressure on the generator. We
take a new perspective on scene context in REG and hypothesize that contextual
information can be conceived of as a resource that makes REG models more
resilient and facilitates the generation of object descriptions, and object
types in particular. We train and test Transformer-based REG models with target
representations that have been artificially obscured with noise to varying
degrees. We evaluate how properties of the models' visual context affect their
processing and performance. Our results show that even simple scene contexts
make models surprisingly resilient to perturbations, to the extent that they
can identify referent types even when visual information about the target is
completely missing.
☆ Enhancing Embedding Performance through Large Language Model-based Text Enrichment and Rewriting
Embedding models are crucial for various natural language processing tasks
but can be limited by factors such as limited vocabulary, lack of context, and
grammatical errors. This paper proposes a novel approach to improve embedding
performance by leveraging large language models (LLMs) to enrich and rewrite
input text before the embedding process. By utilizing ChatGPT 3.5 to provide
additional context, correct inaccuracies, and incorporate metadata, the
proposed method aims to enhance the utility and accuracy of embedding models.
The effectiveness of this approach is evaluated on three datasets:
Banking77Classification, TwitterSemEval 2015, and Amazon Counterfactual
Classification. Results demonstrate significant improvements over the baseline
model on the TwitterSemEval 2015 dataset, with the best-performing prompt
achieving a score of 85.34 compared to the previous best of 81.52 on the
Massive Text Embedding Benchmark (MTEB) Leaderboard. However, performance on
the other two datasets was less impressive, highlighting the importance of
considering domain-specific characteristics. The findings suggest that
LLM-based text enrichment shows promise for improving embedding performance,
particularly in certain domains, and can help avoid many of the limitations
inherent in the embedding process.
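A minimal sketch of the enrich-then-embed pipeline described above, assuming a hypothetical `llm_rewrite` callable in place of the ChatGPT 3.5 call and the sentence-transformers package for the embedding step:

    from typing import Callable, List
    from sentence_transformers import SentenceTransformer

    def enrich_and_embed(texts: List[str], llm_rewrite: Callable[[str], str]):
        """Rewrite each text with an LLM (adding context, fixing errors), then embed."""
        enriched = [llm_rewrite("Rewrite the following text, adding missing context "
                                "and correcting any errors: " + t) for t in texts]
        embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any off-the-shelf embedding model
        return embedder.encode(enriched)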
☆ Advancing the Robustness of Large Language Models through Self-Denoised Smoothing NAACL 2024
Jiabao Ji, Bairu Hou, Zhen Zhang, Guanhua Zhang, Wenqi Fan, Qing Li, Yang Zhang, Gaowen Liu, Sijia Liu, Shiyu Chang
Although large language models (LLMs) have achieved significant success,
their vulnerability to adversarial perturbations, including recent jailbreak
attacks, has raised considerable concerns. However, the increasing size of
these models and their limited access make improving their robustness a
challenging task. Among various defense strategies, randomized smoothing has
shown great potential for LLMs, as it does not require full access to the
model's parameters or fine-tuning via adversarial training. However, randomized
smoothing involves adding noise to the input before model prediction, so the
final model's robustness largely depends on its performance on these
noise-corrupted data; its effectiveness is therefore often limited by the
model's sub-optimal performance on noisy inputs. To address this issue, we propose to
leverage the multitasking nature of LLMs to first denoise the noisy inputs and
then to make predictions based on these denoised versions. We call this
procedure self-denoised smoothing. Unlike previous denoised smoothing
techniques in computer vision, which require training a separate denoiser
model, our method offers significantly better
efficiency and flexibility. Our experimental results indicate that our method
surpasses existing methods in both empirical and certified robustness in
defending against adversarial attacks for both downstream tasks and human
alignments (i.e., jailbreak attacks). Our code is publicly available at
https://github.com/UCSB-NLP-Chang/SelfDenoise
comment: Accepted by NAACL 2024. Jiabao, Bairu, Zhen, Guanhua contributed
equally. This is an updated version of the paper: arXiv:2307.07171
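An illustrative sketch of the self-denoised smoothing procedure summarized above (not the authors' released code; `llm_denoise` and `llm_classify` are hypothetical callables): random words are masked, the LLM itself reconstructs each noisy copy, and the final label is a majority vote over the predictions on the denoised copies.

    import random
    from collections import Counter
    from typing import Callable

    def self_denoised_predict(text: str,
                              llm_denoise: Callable[[str], str],
                              llm_classify: Callable[[str], str],
                              mask_rate: float = 0.3, n_samples: int = 8) -> str:
        words = text.split()
        votes = []
        for _ in range(n_samples):
            noisy = " ".join("[MASK]" if random.random() < mask_rate else w for w in words)
            denoised = llm_denoise(noisy)          # the LLM fills in the masked words itself
            votes.append(llm_classify(denoised))   # predict on the self-denoised copy
        return Counter(votes).most_common(1)[0][0]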
☆ FedEval-LLM: Federated Evaluation of Large Language Models on Downstream Tasks with Collective Wisdom
Federated Learning (FL) has emerged as a promising solution for collaborative
training of large language models (LLMs). However, the integration of LLMs into
FL introduces new challenges, particularly concerning the evaluation of LLMs.
Traditional evaluation methods that rely on labeled test sets and
similarity-based metrics cover only a subset of the acceptable answers, thereby
failing to accurately reflect the performance of LLMs on generative tasks.
Meanwhile, although automatic evaluation methods that leverage advanced LLMs
present potential, they face critical risks of data leakage due to the need to
transmit data to external servers and suboptimal performance on downstream
tasks due to the lack of domain knowledge. To address these issues, we propose
a Federated Evaluation framework of Large Language Models, named FedEval-LLM,
that provides reliable performance measurements of LLMs on downstream tasks
without the reliance on labeled test sets and external tools, thus ensuring
strong privacy-preserving capability. FedEval-LLM leverages a consortium of
personalized LLMs from participants as referees to provide domain knowledge and
collective evaluation capability, thus aligning to the respective downstream
tasks and mitigating uncertainties and biases associated with a single referee.
Experimental results demonstrate a significant improvement in the evaluation
capability of personalized evaluation models on downstream tasks. When applied
to FL, these evaluation models exhibit strong agreement with human preference
and RougeL-score on meticulously curated test sets. FedEval-LLM effectively
overcomes the limitations of traditional metrics and the reliance on external
services, making it a promising framework for the evaluation of LLMs within
collaborative training scenarios.
comment: In Progress
☆ Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing
Despite the impressive capabilities of Large Language Models (LLMs) on
various tasks, they still struggle with scenarios that involve complex
reasoning and planning. Recent work proposed advanced prompting techniques and
the necessity of fine-tuning with high-quality data to augment LLMs' reasoning
abilities. However, these approaches are inherently constrained by data
availability and quality. In light of this, self-correction and self-learning
emerge as viable solutions, employing strategies that allow LLMs to refine
their outputs and learn from self-assessed rewards. Yet, the efficacy of LLMs
in self-refining their responses, particularly for complex reasoning and
planning tasks, remains uncertain. In this paper, we introduce AlphaLLM for the
self-improvement of LLMs, which integrates Monte Carlo Tree Search (MCTS) with
LLMs to establish a self-improving loop, thereby enhancing the capabilities of
LLMs without additional annotations. Drawing inspiration from the success of
AlphaGo, AlphaLLM addresses the unique challenges of combining MCTS with LLM
for self-improvement, including data scarcity, the vast search spaces of
language tasks, and the subjective nature of feedback in language tasks.
AlphaLLM comprises a prompt synthesis component, an efficient MCTS approach
tailored for language tasks, and a trio of critic models for precise feedback.
Our experimental results in mathematical reasoning tasks demonstrate that
AlphaLLM significantly enhances the performance of LLMs without additional
annotations, showing the potential for self-improvement in LLMs.
☆ CMNEE: A Large-Scale Document-Level Event Extraction Dataset based on Open-Source Chinese Military News LREC
Extracting structured event knowledge, including event triggers and
corresponding arguments, from military texts is fundamental to many
applications, such as intelligence analysis and decision assistance. However,
event extraction in the military field faces the data scarcity problem, which
impedes the research of event extraction models in this domain. To alleviate
this problem, we propose CMNEE, a large-scale, document-level open-source
Chinese Military News Event Extraction dataset. It contains 17,000 documents
and 29,223 events, which are all manually annotated based on a pre-defined
schema for the military domain including 8 event types and 11 argument role
types. We designed a two-stage, multi-turn annotation strategy to ensure the
quality of CMNEE and reproduced several state-of-the-art event extraction
models with a systematic evaluation. The experimental results on CMNEE are
markedly lower than those on datasets from other domains, demonstrating that
event extraction in the military domain poses unique challenges and requires
further research efforts. Our code and data can be obtained from
https://github.com/Mzzzhu/CMNEE.
comment: 13 pages, 7 figures, accepted to LREC-COLING 2024
☆ Introducing v0.5 of the AI Safety Benchmark from MLCommons
Bertie Vidgen, Adarsh Agrawal, Ahmed M. Ahmed, Victor Akinwande, Namir Al-Nuaimi, Najla Alfaraj, Elie Alhajjar, Lora Aroyo, Trupti Bavalatti, Borhane Blili-Hamelin, Kurt Bollacker, Rishi Bomassani, Marisa Ferrara Boston, Siméon Campos, Kal Chakra, Canyu Chen, Cody Coleman, Zacharie Delpierre Coudert, Leon Derczynski, Debojyoti Dutta, Ian Eisenberg, James Ezick, Heather Frase, Brian Fuller, Ram Gandikota, Agasthya Gangavarapu, Ananya Gangavarapu, James Gealy, Rajat Ghosh, James Goel, Usman Gohar, Sujata Goswami, Scott A. Hale, Wiebke Hutiri, Joseph Marvin Imperial, Surgan Jandial, Nick Judd, Felix Juefei-Xu, Foutse Khomh, Bhavya Kailkhura, Hannah Rose Kirk, Kevin Klyman, Chris Knotz, Michael Kuchnik, Shachi H. Kumar, Chris Lengerich, Bo Li, Zeyi Liao, Eileen Peters Long, Victor Lu, Yifan Mai, Priyanka Mary Mammen, Kelvin Manyeki, Sean McGregor, Virendra Mehta, Shafee Mohammed, Emanuel Moss, Lama Nachman, Dinesh Jinenhally Naganna, Amin Nikanjam, Besmira Nushi, Luis Oala, Iftach Orr, Alicia Parrish, Cigdem Patlak, William Pietri, Forough Poursabzi-Sangdeh, Eleonora Presani, Fabrizio Puletti, Paul Röttger, Saurav Sahay, Tim Santos, Nino Scherrer, Alice Schoenauer Sebag, Patrick Schramowski, Abolfazl Shahbazi, Vin Sharma, Xudong Shen, Vamsi Sistla, Leonard Tang, Davide Testuggine, Vithursan Thangarasa, Elizabeth Anne Watkins, Rebecca Weiss, Chris Welty, Tyler Wilbers, Adina Williams, Carole-Jean Wu, Poonam Yadav, Xianjun Yang, Yi Zeng, Wenhui Zhang, Fedor Zhdanov, Jiacheng Zhu, Percy Liang, Peter Mattson, Joaquin Vanschoren
This paper introduces v0.5 of the AI Safety Benchmark, which has been created
by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been
designed to assess the safety risks of AI systems that use chat-tuned language
models. We introduce a principled approach to specifying and constructing the
benchmark, which for v0.5 covers only a single use case (an adult chatting to a
general-purpose assistant in English), and a limited set of personas (i.e.,
typical users, malicious users, and vulnerable users). We created a new
taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark.
We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024.
The v1.0 benchmark will provide meaningful insights into the safety of AI
systems. However, the v0.5 benchmark should not be used to assess the safety of
AI systems. We have sought to fully document the limitations, flaws, and
challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes
(1) a principled approach to specifying and constructing the benchmark, which
comprises use cases, types of systems under test (SUTs), language and context,
personas, tests, and test items; (2) a taxonomy of 13 hazard categories with
definitions and subcategories; (3) tests for seven of the hazard categories,
each comprising a unique set of test items, i.e., prompts. There are 43,090
test items in total, which we created with templates; (4) a grading system for
AI systems against the benchmark; (5) an openly available platform, and
downloadable tool, called ModelBench that can be used to evaluate the safety of
AI systems on the benchmark; (6) an example evaluation report which benchmarks
the performance of over a dozen openly available chat-tuned language models;
(7) a test specification for the benchmark.
☆ Length Generalization of Causal Transformers without Position Encoding
Generalizing to longer sentences is important for recent Transformer-based
language models. Besides algorithms manipulating explicit position features,
the success of Transformers without position encodings (NoPE) provides a new
way to overcome the challenge. In this paper, we study the length
generalization property of NoPE. We find that although NoPE can extend to
longer sequences than the commonly used explicit position encodings, it still
has a limited context length. We identify a connection between the failure of
NoPE's generalization and the distraction of attention distributions. We
propose a parameter-efficient tuning method that searches for each attention
head's best temperature hyper-parameter, which substantially expands NoPE's
context size.
Experiments on long sequence language modeling, the synthetic passkey retrieval
task and real-world long context tasks show that NoPE can achieve competitive
performance with state-of-the-art length generalization algorithms. The source
code is publicly accessible.
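A minimal sketch of the per-head temperature mechanism described above (an illustration of the idea, not the paper's implementation): the attention logits of each head are divided by a tunable temperature before the softmax, and only these temperatures need to be searched or tuned.

    import torch

    def causal_attention_with_temperature(q, k, v, temperature):
        """q, k, v: (heads, seq, dim); temperature: (heads, 1, 1), tunable per head."""
        scale = q.shape[-1] ** 0.5
        logits = (q @ k.transpose(-2, -1)) / (scale * temperature)  # per-head temperature
        mask = torch.tril(torch.ones(q.shape[-2], k.shape[-2], dtype=torch.bool))
        logits = logits.masked_fill(~mask, float("-inf"))           # causal mask; no position encoding (NoPE)
        return torch.softmax(logits, dim=-1) @ v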
☆ OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data
Instruction fine-tuning pretrained LLMs for diverse downstream tasks has
demonstrated remarkable success and has captured the interest of both academics
and practitioners. To ensure such fine-tuned LLMs align with human preferences,
techniques such as RLHF and DPO have emerged. At the same time, there is
increasing interest in smaller parameter counts for models. In this work, using
OpenLLaMA 3Bv2 as a base model, we describe the recipe used to fine-tune the
OpenBezoar family of models. In this recipe: We first generate synthetic
instruction fine-tuning data using an open and commercially non-restrictive
instruction fine-tuned variant of the Falcon-40B model under three schemes
based on: LaMini-LM, WizardLM/Evol-Instruct (with databricks-dolly-15k as a
seed dataset) and Orca (with the Flan Collection as a seed dataset), then
filter these generations using GPT-4 as a human proxy. We then perform
cost-effective QLoRA-based supervised fine-tuning sequentially with each
scheme. The resulting checkpoint is further fine-tuned with a subset of the
HH-RLHF dataset to minimize distribution shift prior to using the DPO loss to
obtain the final checkpoint. Evaluation is done with the LM Eval Harness
tasks/metrics as well as on MT-Bench using the "LLM-as-a-judge" framework with
Claude 2.1, with the finding that the final checkpoint,
"OpenBezoar-HH-RLHF-DPO", demonstrates superior performance over many models at
the 3B parameter scale, even outperforming the top model in one of the
categories on the Huggingface Open LLM Leaderboard. We release
"OpenBezoar-SFT", "OpenBezoar-HH-RLHF-SFT", "OpenBezoar-HH-RLHF-DPO"
checkpoints, alongside our generated datasets on HuggingFace at
https://huggingface.co/collections/SurgeGlobal/open-bezoar-6620a24923e12127e9e2b9cc
and our codebase at
https://bitbucket.org/paladinanalytics/workspace/projects/OP.
comment: 25 pages, 27 Figures, 8 Tables
☆ EuSQuAD: Automatically Translated and Aligned SQuAD2.0 for Basque
The widespread availability of Question Answering (QA) datasets in English
has greatly facilitated the advancement of the Natural Language Processing
(NLP) field. However, the scarcity of such resources for minority languages,
such as Basque, poses a substantial challenge for these communities. In this
context, the translation and alignment of existing QA datasets play a crucial
role in narrowing this technological gap. This work presents EuSQuAD, the first
initiative dedicated to automatically translating and aligning SQuAD2.0 into
Basque, resulting in more than 142k QA examples. We demonstrate EuSQuAD's value
through extensive qualitative analysis and QA experiments supported with
EuSQuAD as training data. These experiments are evaluated with a new
human-annotated dataset.
comment: Under review in the journal of Procesamiento de Lenguaje Natural
☆ Claim Check-Worthiness Detection: How Well do LLMs Grasp Annotation Guidelines?
The increasing threat of disinformation calls for automating parts of the
fact-checking pipeline. Identifying text segments requiring fact-checking is
known as claim detection (CD) and claim check-worthiness detection (CW), the
latter incorporating complex domain-specific criteria of worthiness and often
framed as a ranking task. Zero- and few-shot LLM prompting is an attractive
option for both tasks, as it bypasses the need for labeled datasets and allows
verbalized claim and worthiness criteria to be directly used for prompting. We
evaluate the LLMs' predictive and calibration accuracy on five CD/CW datasets
from diverse domains, each utilizing a different worthiness criterion. We
investigate two key aspects: (1) how best to distill factuality and worthiness
criteria into a prompt and (2) what amount of context to provide for each
claim. To this end, we experiment with varying the level of prompt verbosity
and the amount of contextual information provided to the model. Our results
show that optimal prompt verbosity is domain-dependent, adding context does not
improve performance, and confidence scores can be directly used to produce
reliable check-worthiness rankings.
☆ Stance Detection on Social Media with Fine-Tuned Large Language Models
Stance detection, a key task in natural language processing, determines an
author's viewpoint based on textual analysis. This study evaluates the
evolution of stance detection methods, transitioning from early machine
learning approaches to the groundbreaking BERT model, and eventually to modern
Large Language Models (LLMs) such as ChatGPT, LLaMa-2, and Mistral-7B. While
ChatGPT's closed-source nature and associated costs present challenges,
open-source models like LLaMa-2 and Mistral-7B offer an encouraging
alternative. Initially, our research focused on fine-tuning ChatGPT, LLaMa-2,
and Mistral-7B using several publicly available datasets. Subsequently, to
provide a comprehensive comparison, we assess the performance of these models
in zero-shot and few-shot learning scenarios. The results underscore the
exceptional ability of LLMs in accurately detecting stance, with all tested
models surpassing existing benchmarks. Notably, LLaMa-2 and Mistral-7B
demonstrate remarkable efficiency and potential for stance detection, despite
their smaller sizes compared to ChatGPT. This study emphasizes the potential of
LLMs in stance detection and calls for more extensive research in this field.
☆ FecTek: Enhancing Term Weight in Lexicon-Based Retrieval with Feature Context and Term-level Knowledge
Lexicon-based retrieval has gained significant popularity in text retrieval
due to its efficient and robust performance. To further enhance the performance
of lexicon-based retrieval, researchers have been diligently incorporating
state-of-the-art methodologies like neural retrieval and text-level contrastive
learning approaches. Nonetheless, despite the promising outcomes, current
lexicon-based retrieval methods have received limited attention in exploring
the potential benefits of feature context representations and term-level
knowledge guidance. In this paper, we introduce an innovative method built on
FEature Context and TErm-level Knowledge modules (FecTek). To
effectively enrich the feature context representations of term weight, the
Feature Context Module (FCM) is introduced, which leverages the power of BERT's
representation to determine dynamic weights for each element in the embedding.
Additionally, we develop a term-level knowledge guidance module (TKGM) for
effectively utilizing term-level knowledge to intelligently guide the modeling
process of term weight. Evaluation of the proposed method on the MS MARCO benchmark
demonstrates its superiority over the previous state-of-the-art approaches.
☆ Aligning language models with human preferences
Language models (LMs) trained on vast quantities of text data can acquire
sophisticated skills such as generating summaries, answering questions or
generating code. However, they also manifest behaviors that violate human
preferences, e.g., they can generate offensive content, falsehoods or
perpetuate social biases. In this thesis, I explore several approaches to
aligning LMs with human preferences. First, I argue that aligning LMs can be
seen as Bayesian inference: conditioning a prior (base, pretrained LM) on
evidence about human preferences (Chapter 2). Conditioning on human preferences
can be implemented in numerous ways. In Chapter 3, I investigate the relation
between two approaches to finetuning pretrained LMs using feedback given by a
scoring function: reinforcement learning from human feedback (RLHF) and
distribution matching. I show that RLHF can be seen as a special case of
distribution matching but distributional matching is strictly more general. In
chapter 4, I show how to extend the distribution matching to conditional
language models. Finally, in chapter 5 I explore a different route: conditioning
an LM on human preferences already during pretraining. I show that involving
human feedback from the very start tends to be more effective than using it
only during supervised finetuning. Overall, these results highlight the room
for alignment techniques different from and complementary to RLHF.
comment: PhD thesis
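A one-line rendering of the Bayesian view sketched in this abstract (notation illustrative): the pretrained LM acts as the prior and a preference model supplies the likelihood,

    p_{\text{aligned}}(x) \;=\; p(x \mid \text{preferred}) \;\propto\; p_{\text{LM}}(x)\, p(\text{preferred} \mid x),

so that different alignment methods amount to different ways of approximating or sampling from this posterior.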
☆ From Form(s) to Meaning: Probing the Semantic Depths of Language Models Using Multisense Consistency
The staggering pace with which the capabilities of large language models
(LLMs) are increasing, as measured by a range of commonly used natural language
understanding (NLU) benchmarks, raises many questions regarding what
"understanding" means for a language model and how it compares to human
understanding. This is especially true since many LLMs are exclusively trained
on text, casting doubt on whether their stellar benchmark performances are
reflective of a true understanding of the problems represented by these
benchmarks, or whether LLMs simply excel at uttering textual forms that
correlate with what someone who understands the problem would say. In this
philosophically inspired work, we aim to create some separation between form
and meaning, with a series of tests that leverage the idea that world
understanding should be consistent across presentational modes - inspired by
Fregean senses - of the same meaning. Specifically, we focus on consistency
across languages as well as paraphrases. Taking GPT-3.5 as our object of study,
we evaluate multisense consistency across five different languages and various
tasks. We start the evaluation in a controlled setting, asking the model for
simple facts, and then proceed with an evaluation on four popular NLU
benchmarks. We find that the model's multisense consistency is lacking and run
several follow-up analyses to verify that this lack of consistency is due to a
sense-dependent task understanding. We conclude that, in this aspect, the
understanding of LLMs is still quite far from being consistent and human-like,
and deliberate on how this impacts their utility in the context of learning
about human language and understanding.
☆ Enhancing Suicide Risk Assessment: A Speech-Based Automated Approach in Emergency Medicine
Shahin Amiriparian, Maurice Gerczuk, Justina Lutz, Wolfgang Strube, Irina Papazova, Alkomiet Hasan, Alexander Kathan, Björn W. Schuller
The delayed access to specialized psychiatric assessments and care for
patients at risk of suicidal tendencies in emergency departments creates a
notable gap in timely intervention, hindering the provision of adequate mental
health support during critical situations. To address this, we present a
non-invasive, speech-based approach for automatic suicide risk assessment. For
our study, we have collected a novel dataset of speech recordings from $20$
patients from which we extract three sets of features, including wav2vec,
interpretable speech and acoustic features, and deep learning-based spectral
representations. We proceed by conducting a binary classification to assess
suicide risk in a leave-one-subject-out fashion. Our most effective speech
model achieves a balanced accuracy of $66.2\,\%$. Moreover, we show that
integrating our speech model with a series of patients' metadata, such as the
history of suicide attempts or access to firearms, improves the overall result.
The metadata integration yields a balanced accuracy of $94.4\,\%$, marking an
absolute improvement of $28.2\,\%$, demonstrating the efficacy of our proposed
approaches for automatic suicide risk assessment in emergency medicine.
☆ Ethical-Lens: Curbing Malicious Usages of Open-Source Text-to-Image Models
The burgeoning landscape of text-to-image models, exemplified by innovations
such as Midjourney and DALLE 3, has revolutionized content creation across
diverse sectors. However, these advancements bring forth critical ethical
concerns, particularly with the misuse of open-source models to generate
content that violates societal norms. Addressing this, we introduce
Ethical-Lens, a framework designed to facilitate the value-aligned usage of
text-to-image tools without necessitating internal model revision. Ethical-Lens
ensures value alignment in text-to-image models across toxicity and bias
dimensions by refining user commands and rectifying model outputs. Systematic
evaluation metrics, combining GPT4-V, HEIM, and FairFace scores, assess
alignment capability. Our experiments reveal that Ethical-Lens enhances
alignment capabilities to levels comparable with or superior to commercial
models like DALLE 3, ensuring user-generated content adheres to ethical
standards while maintaining image quality. This study indicates the potential
of Ethical-Lens to ensure the sustainable development of open-source
text-to-image tools and their beneficial integration into society. Our code is
available at https://github.com/yuzhu-cai/Ethical-Lens.
comment: 42 pages, 17 figures, 29 tables
☆ LongEmbed: Extending Embedding Models for Long Context Retrieval
Embedding models play a pivotal role in modern NLP applications such as IR and
RAG. While the context limit of LLMs has been pushed beyond 1 million tokens,
embedding models are still confined to a narrow context window not exceeding 8k
tokens, keeping them from application scenarios requiring long inputs such as
legal contracts. This paper explores context window extension of existing
embedding models, pushing the limit to 32k without requiring additional
training. First, we examine the performance of current embedding models for
long context retrieval on our newly constructed LongEmbed benchmark. LongEmbed
comprises two synthetic tasks and four carefully chosen real-world tasks,
featuring documents of varying length and dispersed target information.
Benchmarking results underscore huge room for improvement in these models.
Based on this, comprehensive experiments show that training-free context window
extension strategies like position interpolation can effectively extend the
context window of existing embedding models by several folds, regardless of
their original context being 512 or beyond 4k. Furthermore, for models
employing absolute position encoding (APE), we show the possibility of further
fine-tuning to harvest notable performance gains while strictly preserving
original behavior for short inputs. For models using rotary position embedding
(RoPE), significant enhancements are observed when employing RoPE-specific
methods, such as NTK and SelfExtend, indicating RoPE's superiority over APE for
context window extension. To facilitate future research, we release E5-Base-4k
and E5-RoPE-Base, along with the LongEmbed benchmark.
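A minimal sketch of linear position interpolation, one of the training-free extension strategies mentioned above (RoPE-specific variants such as NTK and SelfExtend work differently; details here are assumptions): position indices of a long input are rescaled into the range the model was trained on, so the existing position embeddings can be reused.

    import torch

    def interpolated_positions(seq_len: int, trained_max: int = 512) -> torch.Tensor:
        positions = torch.arange(seq_len, dtype=torch.float32)
        if seq_len > trained_max:
            positions = positions * (trained_max - 1) / (seq_len - 1)  # squeeze into trained range
        return positions  # possibly fractional positions within [0, trained_max - 1]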
☆ TIMIT Speaker Profiling: A Comparison of Multi-task learning and Single-task learning Approaches
This study employs deep learning techniques to explore four speaker profiling
tasks on the TIMIT dataset, namely gender classification, accent
classification, age estimation, and speaker identification, highlighting the
potential and challenges of multi-task learning versus single-task models. The
motivation for this research is twofold: firstly, to empirically assess the
advantages and drawbacks of multi-task learning over single-task models in the
context of speaker profiling; secondly, to emphasize the undiminished
significance of skillful feature engineering for speaker recognition tasks. The
findings reveal challenges in accent classification, and multi-task learning is
found advantageous for tasks of similar complexity. Non-sequential features are
favored for speaker recognition, but sequential ones can serve as starting
points for complex models. The study underscores the necessity of meticulous
experimentation and parameter tuning for deep learning models.
☆ RAGAR, Your Falsehood RADAR: RAG-Augmented Reasoning for Political Fact-Checking using Multimodal Large Language Models ACL
The escalating challenge of misinformation, particularly in the context of
political discourse, necessitates advanced solutions for fact-checking. We
introduce innovative approaches to enhance the reliability and efficiency of
multimodal fact-checking through the integration of Large Language Models
(LLMs) with Retrieval-augmented Generation (RAG)-based advanced reasoning
techniques. This work proposes two novel methodologies, Chain of RAG (CoRAG)
and Tree of RAG (ToRAG). The approaches are designed to handle multimodal
claims by reasoning about the next questions that need to be answered based on
previous evidence. Our approaches improve the accuracy of veracity predictions
and the generation of explanations over the traditional fact-checking approach
of sub-question generation with chain of thought veracity prediction. By
employing multimodal LLMs adept at analyzing both text and images, this
research advances the capability of automated systems in identifying and
countering misinformation.
comment: 8 pages, submitted to ACL Rolling Review
☆ Constituents Correspond to Word Sequence Patterns among Sentences with Equivalent Predicate-Argument Structures: Unsupervised Constituency Parsing by Span Matching
Unsupervised constituency parsing is about identifying word sequences that
form a syntactic unit (i.e., constituents) in a target sentence. Linguists
identify the constituent by evaluating a set of Predicate-Argument Structure
(PAS)-equivalent sentences, in which the constituent corresponds to
frequent word sequences. However, such information is unavailable to previous
parsing methods which identify the constituent by observing sentences with
diverse PAS. In this study, we empirically verify that \textbf{constituents
correspond to word sequence patterns in the PAS-equivalent sentence set}. We
propose a frequency-based method \emph{span-overlap}, applying the word
sequence pattern to computational unsupervised parsing for the first time.
Parsing experiments show that the span-overlap parser outperforms
state-of-the-art parsers in eight out of ten languages. Further discrimination
analysis confirms that the span-overlap method can non-trivially separate
constituents from non-constituents. This result highlights the utility of the
word sequence pattern. Additionally, we discover a multilingual phenomenon:
\textbf{participant-denoting constituents are more frequent than event-denoting
constituents}. The phenomenon indicates a behavioral difference between the two
constituent types, laying the foundation for future labeled unsupervised
parsing.
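An illustrative sketch of the span-overlap idea described above (normalization and how spans are assembled into a tree are omitted and differ from the paper): each candidate span of the target sentence is scored by how many PAS-equivalent sentences contain the same word sequence, and frequent spans are treated as constituent candidates.

    from typing import List, Tuple

    def span_overlap_scores(target: List[str],
                            pas_equivalents: List[List[str]]) -> List[Tuple[Tuple[int, int], int]]:
        sentences = [" " + " ".join(s) + " " for s in pas_equivalents]
        scores = []
        for i in range(len(target)):
            for j in range(i + 1, len(target) + 1):
                span = " " + " ".join(target[i:j]) + " "       # pad with spaces to respect word boundaries
                scores.append(((i, j), sum(span in s for s in sentences)))
        return sorted(scores, key=lambda item: -item[1])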
☆ emrQA-msquad: A Medical Dataset Structured with the SQuAD V2.0 Framework, Enriched with emrQA Medical Information
Machine Reading Comprehension (MRC) holds a pivotal role in shaping Medical
Question Answering Systems (QAS) and transforming the landscape of accessing
and applying medical information. However, the inherent challenges in the
medical field, such as complex terminology and question ambiguity, necessitate
innovative solutions. One key solution involves integrating specialized medical
datasets and creating dedicated datasets. This strategic approach enhances the
accuracy of QAS, contributing to advancements in clinical decision-making and
medical research. To address the intricacies of medical terminology, a
specialized dataset was integrated, exemplified by a novel span extraction
dataset derived from emrQA but restructured into 163,695 questions and 4,136
manually obtained answers; this new dataset is called the emrQA-msquad dataset.
Additionally, for ambiguous questions, a dedicated medical dataset for the Span
extraction task was introduced, reinforcing the system's robustness. The
fine-tuning of models such as BERT, RoBERTa, and Tiny RoBERTa for medical
contexts significantly improved response accuracy within the F1-score range of
0.75 to 1.00, from 10.1% to 37.4%, 18.7% to 44.7%, and 16.0% to 46.8%,
respectively. Finally, the emrQA-msquad dataset is publicly available at
https://huggingface.co/datasets/Eladio/emrqa-msquad.
comment: The dataset is available in
https://huggingface.co/datasets/Eladio/emrqa-msquad
☆ Exploring Boundaries and Intensities in Offensive and Hate Speech: Unveiling the Complex Spectrum of Social Media Discourse
The prevalence of digital media and evolving sociopolitical dynamics have
significantly amplified the dissemination of hateful content. Existing studies
mainly focus on classifying texts into binary categories, often overlooking the
continuous spectrum of offensiveness and hatefulness inherent in the text. In
this research, we present an extensive benchmark dataset for Amharic,
comprising 8,258 tweets annotated for three distinct tasks: category
classification, identification of hate targets, and rating offensiveness and
hatefulness intensities. Our study highlights that a considerable majority of
tweets belong to the less offensive and less hate intensity levels,
underscoring the need for early interventions by stakeholders. The prevalence
of ethnic and political hatred targets, with significant overlaps in our
dataset, emphasizes the complex relationships within Ethiopia's sociopolitical
landscape. We build classification and regression models and investigate the
efficacy of models in handling these tasks. Our results reveal that hate and
offensive speech cannot be addressed by a simplistic binary classification,
instead manifesting as variables across a continuous range of values. The
Afro-XLMR-large model exhibits the best performances achieving F1-scores of
75.30%, 70.59%, and 29.42% for the category, target, and regression tasks,
respectively. The 80.22% correlation coefficient of the Afro-XLMR-large model
indicates strong alignment.
☆ Can We Catch the Elephant? The Evolvement of Hallucination Evaluation on Natural Language Generation: A Survey
Hallucination in Natural Language Generation (NLG) is like the elephant in
the room, obvious but often overlooked until recent achievements significantly
improved the fluency and grammatical accuracy of generated text. For Large
Language Models (LLMs), hallucinations can happen in various downstream tasks
and casual conversations, which need accurate assessment to enhance reliability
and safety. However, current studies on hallucination evaluation vary greatly,
and people still find it difficult to sort out and select the most appropriate
evaluation methods. Moreover, as NLP research gradually shifts to the domain of
LLMs, it brings new challenges to this direction. This paper provides a
comprehensive survey on the evolvement of hallucination evaluation methods,
aiming to address three key aspects: 1) Diverse definitions and granularity of
facts; 2) The categories of automatic evaluators and their applicability; 3)
Unresolved issues and future directions.
comment: 19 pages in total, with 9 pages as main body. Under review as a
conference paper at CoLM 2024
☆ Uncovering Safety Risks in Open-source LLMs through Concept Activation Vector
Current open-source large language models (LLMs) often undergo careful
safety alignment before public release. Some attack methods have also been
proposed that help check for safety vulnerabilities in LLMs to ensure alignment
robustness. However, many of these methods have moderate attack success rates.
Even when successful, the harmfulness of their outputs cannot be guaranteed,
leading to suspicions that these methods have not accurately identified the
safety vulnerabilities of LLMs. In this paper, we introduce an LLM attack method
utilizing concept-based model explanation, where we extract safety concept
activation vectors (SCAVs) from LLMs' activation space, enabling efficient
attacks on well-aligned LLMs like LLaMA-2, achieving a near-100% attack success
rate, as if the LLMs were completely unaligned. This suggests that LLMs, even after
thorough safety alignment, could still pose potential risks to society upon
public release. To evaluate the harmfulness of outputs resulting from various
attack methods, we propose a comprehensive evaluation method that reduces the
potential inaccuracies of existing evaluations, and further validate that our
method causes more harmful content. Additionally, we discover that the SCAVs
show some transferability across different open-source LLMs.
☆ Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration
Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao
Large language models (LLMs) have recently shown remarkable performance
across a wide range of tasks. However, the substantial number of parameters in
LLMs contributes to significant latency during model inference. This is
particularly evident when utilizing autoregressive decoding methods, which
generate one token in a single forward process, thereby not fully capitalizing
on the parallel computing capabilities of GPUs. In this paper, we propose a
novel parallel decoding approach, namely \textit{hidden transfer}, which
decodes multiple successive tokens simultaneously in a single forward pass. The
idea is to transfer the intermediate hidden states of the previous context to
the \textit{pseudo} hidden states of the future tokens to be generated, and
then the pseudo hidden states pass through the following transformer layers,
thereby assimilating more semantic information and achieving superior
predictive accuracy for the future tokens.
In addition, we use a tree attention mechanism to simultaneously generate
and verify multiple candidate output sequences, which ensures lossless
generation and further improves the generation efficiency of our method.
Experiments demonstrate the effectiveness of our method, and we conduct
extensive analytic experiments to validate our motivation. In terms of
acceleration metrics, we outperform all single-model acceleration techniques,
including Medusa and Self-Speculative decoding.
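The lossless guarantee rests on a draft-then-verify principle, sketched below in a generic form (the `greedy_next` callable is hypothetical; the paper's hidden-transfer drafting and tree attention are more elaborate): drafted future tokens are accepted only while they match what the base model would have produced anyway.

    from typing import Callable, List

    def accept_losslessly(prefix: List[int], draft: List[int],
                          greedy_next: Callable[[List[int]], int]) -> List[int]:
        accepted: List[int] = []
        for tok in draft:
            if greedy_next(prefix + accepted) != tok:
                break                      # first mismatch: fall back to normal decoding
            accepted.append(tok)
        return accepted                    # output is identical to standard greedy decoding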
☆ Enhance Robustness of Language Models Against Variation Attack through Graph Integration COLING 2024
The widespread use of pre-trained language models (PLMs) in natural language
processing (NLP) has greatly improved performance outcomes. However, these
models' vulnerability to adversarial attacks (e.g., camouflaged hints from drug
dealers), particularly in the Chinese language with its rich character
diversity/variation and complex structures, raises serious concerns. In this
study, we propose a novel method, CHinese vAriatioN Graph Enhancement (CHANGE),
to increase the robustness of PLMs against character variation attacks in
Chinese content. CHANGE presents a novel approach for incorporating a Chinese
character variation graph into the PLMs. Through designing different
supplementary tasks utilizing the graph structure, CHANGE essentially enhances
PLMs' interpretation of adversarially manipulated text. Experiments conducted
in a multitude of NLP tasks show that CHANGE outperforms current language
models in combating adversarial attacks and serves as a valuable
contribution to robust language model research. These findings contribute to
the groundwork on robust language models and highlight the substantial
potential of graph-guided pre-training strategies for real-world applications.
comment: 12 pages, 4 figures, accepted by COLING 2024
☆ Sequential Compositional Generalization in Multimodal Models NAACL
The rise of large-scale multimodal models has paved the pathway for
groundbreaking advances in generative modeling and reasoning, unlocking
transformative applications in a variety of complex tasks. However, a pressing
question that remains is their genuine capability for stronger forms of
generalization, which has been largely underexplored in the multimodal setting.
Our study aims to address this by examining sequential compositional
generalization using \textsc{CompAct} (\underline{Comp}ositional
\underline{Act}ivities)\footnote{Project Page:
\url{http://cyberiada.github.io/CompAct}}, a carefully constructed,
perceptually grounded dataset set within a rich backdrop of egocentric kitchen
activity videos. Each instance in our dataset is represented with a combination
of raw video footage, naturally occurring sound, and crowd-sourced step-by-step
descriptions. More importantly, our setup ensures that the individual concepts
are consistently distributed across training and evaluation sets, while their
compositions are novel in the evaluation set. We conduct a comprehensive
assessment of several unimodal and multimodal models. Our findings reveal that
bi-modal and tri-modal models exhibit a clear edge over their text-only
counterparts. This highlights the importance of multimodality while charting a
trajectory for future research in this domain.
comment: Accepted to the main conference of NAACL (2024) as a long paper
☆ ParaFusion: A Large-Scale LLM-Driven English Paraphrase Dataset Infused with High-Quality Lexical and Syntactic Diversity
Paraphrase generation is a pivotal task in natural language processing (NLP).
Existing datasets in the domain lack syntactic and lexical diversity, resulting
in paraphrases that closely resemble the source sentences. Moreover, these
datasets often contain hate speech and noise, and may unintentionally include
non-English language sentences. This research introduces ParaFusion, a
large-scale, high-quality English paraphrase dataset developed using Large
Language Models (LLM) to address these challenges. ParaFusion augments existing
datasets with high-quality data, significantly enhancing both lexical and
syntactic diversity while maintaining close semantic similarity. It also
mitigates the presence of hate speech and reduces noise, ensuring a cleaner and
more focused English dataset. Results show that ParaFusion offers at least a
25% improvement in both syntactic and lexical diversity, measured across
several metrics for each data source. The paper also aims to set a gold
standard for paraphrase evaluation as it contains one of the most comprehensive
evaluation strategies to date. The results underscore the potential of
ParaFusion as a valuable resource for improving NLP applications.
☆ Variational Multi-Modal Hypergraph Attention Network for Multi-Modal Relation Extraction
Multi-modal relation extraction (MMRE) is a challenging task that aims to
identify relations between entities in text leveraging image information.
Existing methods are limited by their neglect of the multiple entity pairs in
one sentence sharing very similar contextual information (i.e., the same text and
image), resulting in increased difficulty in the MMRE task. To address this
limitation, we propose the Variational Multi-Modal Hypergraph Attention Network
(VM-HAN) for multi-modal relation extraction. Specifically, we first construct
a multi-modal hypergraph for each sentence with the corresponding image, to
establish different high-order intra-/inter-modal correlations for different
entity pairs in each sentence. We further design the Variational Hypergraph
Attention Networks (V-HAN) to obtain representational diversity among different
entity pairs using Gaussian distribution and learn a better hypergraph
structure via variational attention. VM-HAN achieves state-of-the-art
performance on the multi-modal relation extraction task, outperforming existing
methods in terms of accuracy and efficiency.
☆ Token-level Direct Preference Optimization
Fine-tuning pre-trained Large Language Models (LLMs) is essential to align
them with human values and intentions. This process often utilizes methods like
pairwise comparisons and KL divergence against a reference LLM, focusing on the
evaluation of full answers generated by the models. However, the generation of
these responses occurs at the token level, following a sequential,
auto-regressive fashion. In this paper, we introduce Token-level Direct
Preference Optimization (TDPO), a novel approach to align LLMs with human
preferences by optimizing policy at the token level. Unlike previous methods,
which face challenges in divergence efficiency, TDPO incorporates forward KL
divergence constraints for each token, improving alignment and diversity.
Utilizing the Bradley-Terry model for a token-based reward system, TDPO
enhances the regulation of KL divergence, while preserving simplicity without
the need for explicit reward modeling. Experimental results across various text
tasks demonstrate TDPO's superior performance in balancing alignment with
generation diversity. Notably, fine-tuning with TDPO strikes a better balance
than DPO in the controlled sentiment generation and single-turn dialogue
datasets, and significantly improves the quality of generated responses
compared to both DPO and PPO-based RLHF methods. Our code is open-sourced at
https://github.com/Vance0124/Token-level-Direct-Preference-Optimization.
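A rough sketch of a token-level preference loss in the spirit described above (shapes, the weighting, and the exact placement of the KL terms are assumptions; see the released code for the actual objective): per-token log-probability ratios against a frozen reference model are combined with a per-token KL penalty inside a Bradley-Terry-style margin.

    import torch
    import torch.nn.functional as F

    def token_level_dpo_loss(pi_logits_w, ref_logits_w, tokens_w,
                             pi_logits_l, ref_logits_l, tokens_l,
                             beta: float = 0.1, alpha: float = 0.1) -> torch.Tensor:
        """*_logits: (T, vocab) from the policy (pi) and frozen reference (ref)
        for the chosen (w) and rejected (l) responses; tokens: (T,) target ids."""
        def per_sequence(pi_logits, ref_logits, tokens):
            pi_lp = F.log_softmax(pi_logits, dim=-1)
            ref_lp = F.log_softmax(ref_logits, dim=-1)
            log_ratio = (pi_lp.gather(-1, tokens[:, None])
                         - ref_lp.gather(-1, tokens[:, None])).sum()
            kl = (ref_lp.exp() * (ref_lp - pi_lp)).sum(-1).sum()  # KL(ref || pi), summed over tokens
            return log_ratio, kl
        r_w, kl_w = per_sequence(pi_logits_w, ref_logits_w, tokens_w)
        r_l, kl_l = per_sequence(pi_logits_l, ref_logits_l, tokens_l)
        margin = beta * (r_w - r_l) - alpha * (kl_l - kl_w)       # KL difference regularizes the margin
        return -F.logsigmoid(margin)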
☆ EVIT: Event-Oriented Instruction Tuning for Event Reasoning
Events refer to specific occurrences, incidents, or happenings that take
place under a particular background. Event reasoning aims to infer events
according to certain relations and predict future events. The cutting-edge
techniques for event reasoning play a crucial role in various natural language
processing applications. Large language models (LLMs) have made significant
advancements in event reasoning owing to their wealth of knowledge and
reasoning capabilities. However, smaller instruction-tuned models currently in
use do not consistently demonstrate exceptional proficiency in managing these
tasks. This discrepancy arises from the absence of explicit modeling of events
and of the interconnections among them within their instruction data. Consequently,
these models face challenges in comprehending event structures and semantics
while struggling to bridge the gap between their interpretations and human
understanding of events. Additionally, their limitations in grasping event
relations lead to constrained event reasoning abilities to effectively deduce
and incorporate pertinent event knowledge. In this paper, we propose
Event-Oriented Instruction Tuning (EvIT) to train our LLM. Specifically, we
first propose a novel structure named event quadruple which contains the
structure and semantics of events and is complete in the event representation.
We then design event-relation learning based on the structures. We encapsulate
the learning into the instruction-tuning formulation to better stimulate the
event reasoning capacity of our model. We design a heuristic unsupervised
method to mine event quadruples from a large-scale corpus. Finally, we fine-tune
a Llama model with our Event-Oriented Instruction Tuning. We conduct extensive
experiments on event reasoning tasks on several datasets. Automatic and human
evaluations demonstrate that EvIT achieves competitive performance on event
reasoning.
☆ Aligning Language Models to Explicitly Handle Ambiguity
Hyuhng Joon Kim, Youna Kim, Cheonbok Park, Junyeob Kim, Choonghyun Park, Kang Min Yoo, Sang-goo Lee, Taeuk Kim
In spoken languages, utterances are often shaped to be incomplete or vague
for efficiency. This can lead to varying interpretations of the same input,
based on different assumptions about the context. To ensure reliable user-model
interactions in such scenarios, it is crucial for models to adeptly handle the
inherent ambiguity in user queries. However, conversational agents built upon
even the most recent large language models (LLMs) face challenges in processing
ambiguous inputs, primarily due to the following two hurdles: (1) LLMs are not
directly trained to handle inputs that are too ambiguous to be properly
managed; (2) the degree of ambiguity in an input can vary according to the
intrinsic knowledge of the LLMs, which is difficult to investigate. To address
these issues, this paper proposes a method to align LLMs to explicitly handle
ambiguous inputs. Specifically, we introduce a proxy task that guides LLMs to
utilize their intrinsic knowledge to self-disambiguate a given input. We
quantify the information gain from the disambiguation procedure as a measure of
the extent to which the models perceive their inputs as ambiguous. This measure
serves as a cue for selecting samples deemed ambiguous from the models'
perspectives, which are then utilized for alignment. Experimental results from
several question-answering datasets demonstrate that the LLMs fine-tuned with
our approach are capable of handling ambiguous inputs while still performing
competitively on clear questions within the task.
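A minimal sketch of how such an information-gain signal might be computed (hypothetical `sample_answers` and `disambiguate` wrappers; the paper's actual proxy task and measure may differ): sample answers before and after the model's own disambiguation of the query and compare answer entropy.

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (bits) over a list of sampled answer strings."""
    counts = Counter(answers)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def ambiguity_score(sample_answers, disambiguate, query, k=10):
    """Hypothetical ambiguity measure: entropy drop after the model rewrites
    the query into a self-disambiguated form and answers it again.
    `sample_answers(q, k)` and `disambiguate(q)` are assumed model wrappers."""
    before = answer_entropy(sample_answers(query, k))
    after = answer_entropy(sample_answers(disambiguate(query), k))
    return before - after  # larger gain = the model saw the input as ambiguous
```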
☆ P-NAL: an Effective and Interpretable Entity Alignment Method
Entity alignment (EA) aims to find equivalent entities between two Knowledge
Graphs. Existing embedding-based EA methods usually encode entities as
embeddings, triples as embeddings' constraint and learn to align the
embeddings. The structural and side information are usually utilized via
embedding propagation, aggregation or interaction. However, the underlying
logical inference steps of the alignment process are usually left implicit,
resulting in an inadequate inference process. In this paper, we introduce
P-NAL, an entity alignment method that captures two types of logical inference
paths with Non-Axiomatic Logic (NAL). Type 1 is the bridge-like inference path
between to-be-aligned entity pairs, consisting of two relation/attribute
triples and a similarity sentence between the other two entities. Type 2 links
the entity pair by their embeddings. P-NAL iteratively aligns entities and
relations by integrating the conclusions of the inference paths. Moreover, our
method is logically interpretable and extensible due to the expressiveness of
NAL. Our proposed method is suitable for various EA settings. Experimental
results show that our method outperforms state-of-the-art methods in terms of
Hits@1, achieving 0.98+ on all three datasets of DBP15K in both supervised and
unsupervised settings. To our knowledge, we present the first in-depth
analysis of entity alignment's basic principles from a unified logical
perspective.
comment: 13 pages, 2 figures
☆ CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge Alignment
Multilingual proficiency presents a significant challenge for large language
models (LLMs). English-centric models are usually suboptimal in other
languages, particularly those that are linguistically distant from English.
This performance discrepancy mainly stems from the imbalanced distribution of
training data across languages during pre-training and instruction tuning
stages. To address this problem, we propose a novel approach called CrossIn,
which utilizes a mixed composition of cross-lingual instruction tuning data.
Our method leverages the compressed representation shared by various languages
to efficiently enhance the model's task-solving capabilities and multilingual
proficiency within a single process. In addition, we introduce a multi-task and
multi-faceted benchmark to evaluate the effectiveness of CrossIn. Experimental
results demonstrate that our method substantially improves performance across
tasks and languages, and we provide extensive insights into the impact of
cross-lingual data volume and the integration of translation data on enhancing
multilingual consistency and accuracy.
comment: 11 pages
☆ SKIP: Skill-Localized Prompt Tuning for Inference Speed Boost-Up
Prompt-tuning methods have shown performance comparable to
parameter-efficient fine-tuning (PEFT) methods in various natural language
understanding tasks. However, existing prompt-tuning methods still utilize the
entire model architecture; thus, they fail to accelerate inference in real
applications. In this paper, we propose a novel approach called SKIll-localized
Prompt tuning (SKIP), which is extremely efficient in inference time. Our
method significantly enhances inference efficiency by investigating and
utilizing a skill-localized subnetwork in a language model. Surprisingly, our
method improves inference speed by up to 160% while pruning 52% of the
parameters. Furthermore, we demonstrate that our method is applicable across
various transformer-based architectures, thereby confirming its practicality
and scalability.
comment: 6 pages
☆ TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
With large language models (LLMs) widely deployed in long content generation
recently, there has emerged an increasing demand for efficient long-sequence
inference support. However, key-value (KV) cache, which is stored to avoid
re-computation, has emerged as a critical bottleneck by growing linearly in
size with the sequence length. Due to the auto-regressive nature of LLMs, the
entire KV cache will be loaded for every generated token, resulting in low
utilization of computational cores and high latency. While various compression
methods for KV cache have been proposed to alleviate this issue, they suffer
from degradation in generation quality. We introduce TriForce, a hierarchical
speculative decoding system that is scalable to long sequence generation. This
approach leverages the original model weights and dynamic sparse KV cache via
retrieval as a draft model, which serves as an intermediate layer in the
hierarchy and is further speculated by a smaller model to reduce its drafting
latency. TriForce not only delivers impressive speedups for Llama2-7B-128K,
achieving up to 2.31$\times$ on an A100 GPU, but also scales to even longer
contexts. In the offloading setting on two RTX 4090 GPUs, TriForce achieves
0.108 s/token, only half as slow as the auto-regressive baseline on an A100,
corresponding to a 7.78$\times$ speedup on our optimized offloading system.
Additionally, TriForce runs 4.86$\times$ faster than DeepSpeed-Zero-Inference
on a single RTX 4090 GPU. TriForce's robustness is
highlighted by its consistently outstanding performance across various
temperatures. The code is available at
https://github.com/Infini-AI-Lab/TriForce.
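A schematic of the two-level hierarchy described above (a sketch under assumed interfaces, not the released code at the repository): a lightweight model drafts for an intermediate model, imagined here as the target weights with a retrieval-based sparse KV cache, whose proposals are then verified by the full-cache target model. Each model is assumed to be a callable returning logits of shape [sequence length, vocab].

```python
import torch

def greedy_propose(model, ids, k):
    """Autoregressively draft k tokens by greedy argmax."""
    out = list(ids)
    for _ in range(k):
        logits = model(torch.tensor(out))
        out.append(int(logits[-1].argmax()))
    return out[len(ids):]

def greedy_verify(model, ids, proposal):
    """Accept the longest prefix of `proposal` matching the model's own greedy
    continuation (standard greedy speculative verification), using one
    parallel forward pass over the concatenated sequence."""
    logits = model(torch.tensor(ids + proposal))
    preds = logits.argmax(dim=-1).tolist()
    accepted = []
    for i, tok in enumerate(proposal):
        if preds[len(ids) - 1 + i] != tok:
            break
        accepted.append(tok)
    return accepted

def triforce_style_decode(target, mid_draft, small_draft, prompt,
                          max_new=64, k=4):
    """Sketch of a TriForce-style hierarchy: a small model drafts for a
    mid-level draft model (target weights with a sparse/retrieved KV cache),
    whose accepted tokens are in turn verified by the full target model."""
    ids = list(prompt)
    while len(ids) < len(prompt) + max_new:
        draft = greedy_propose(small_draft, ids, k)        # level 1: tiny drafter
        mid_ok = greedy_verify(mid_draft, ids, draft)      # checked by sparse-cache model
        proposal = mid_ok or greedy_propose(mid_draft, ids, 1)
        accepted = greedy_verify(target, ids, proposal)    # level 2: full target
        ids += accepted or greedy_propose(target, ids, 1)  # guarantee progress
    return ids
```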
☆ Enhancing Length Extrapolation in Sequential Models with Pointer-Augmented Neural Memory
We propose Pointer-Augmented Neural Memory (PANM) to help neural networks
understand and apply symbol processing to new, longer sequences of data. PANM
integrates an external neural memory that uses novel physical addresses and
pointer manipulation techniques to mimic human and computer symbol processing
abilities. PANM facilitates pointer assignment, dereference, and arithmetic by
explicitly using physical pointers to access memory content. Remarkably, it can
learn to perform these operations through end-to-end training on sequence data,
powering various sequential models. Our experiments demonstrate PANM's
exceptional length extrapolating capabilities and improved performance in tasks
that require symbol processing, such as algorithmic reasoning and Dyck language
recognition. PANM helps Transformers achieve up to 100% generalization accuracy
in compositional learning tasks and significantly better results in
mathematical reasoning, question answering and machine translation tasks.
comment: Preprint
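A toy sketch of the pointer idea behind PANM (illustrative only, not the paper's architecture): sequence items are written to physical address slots, a pointer is a soft distribution over addresses, dereference is a weighted read, and pointer arithmetic shifts the address distribution.

```python
import torch

class PointerMemory:
    """Toy pointer-style memory: content slots indexed by physical addresses,
    with soft-pointer dereference and a simple "next address" operation."""
    def __init__(self, contents):                  # contents: [slots, dim]
        self.mem = contents

    def dereference(self, pointer):                # pointer: [slots] soft address
        return pointer @ self.mem                  # weighted read of memory content

    def increment(self, pointer, step=1):
        return torch.roll(pointer, shifts=step)    # move pointer to the next slot

# usage: store a sequence, point at slot 0, then walk it via pointer arithmetic
seq = torch.randn(5, 8)
mem = PointerMemory(seq)
ptr = torch.nn.functional.one_hot(torch.tensor(0), num_classes=5).float()
first = mem.dereference(ptr)                   # content at address 0
second = mem.dereference(mem.increment(ptr))   # content at address 1
```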
☆ Challenging Negative Gender Stereotypes: A Study on the Effectiveness of Automated Counter-Stereotypes LREC
Gender stereotypes are pervasive beliefs about individuals based on their
gender that play a significant role in shaping societal attitudes, behaviours,
and even opportunities. Recognizing the negative implications of gender
stereotypes, particularly in online communications, this study investigates
eleven strategies to automatically counteract and challenge these views. We
present AI-generated gender-based counter-stereotypes to (self-identified) male
and female study participants and ask them to assess their offensiveness,
plausibility, and potential effectiveness. The strategies of counter-facts and
broadening universals (i.e., stating that anyone can have a trait regardless of
group membership) emerged as the most robust approaches, while humour,
perspective-taking, counter-examples, and empathy for the speaker were
perceived as less effective. Also, differences in ratings were more pronounced
across stereotype targets than across the genders of the raters. Alarmingly,
many AI-generated counter-stereotypes were perceived
as offensive and/or implausible. Our analysis and the collected dataset offer
foundational insight into counter-stereotype generation, guiding future efforts
to develop strategies that effectively challenge gender stereotypes in online
interactions.
comment: LREC-COLING2024
☆ AdvisorQA: Towards Helpful and Harmless Advice-seeking Question Answering with Collective Intelligence
As the integration of large language models into daily life is on the rise,
there is a clear gap in benchmarks for advising on subjective and personal
dilemmas. To address this, we introduce AdvisorQA, the first benchmark
developed to assess LLMs' capability in offering advice for deeply personalized
concerns, utilizing the LifeProTips subreddit forum. This forum features a
dynamic interaction where users post advice-seeking questions, receiving an
average of 8.9 pieces of advice per query and 164.2 upvotes from hundreds of
users, embodying a collective intelligence framework. We therefore construct a
benchmark encompassing daily-life questions, diverse corresponding responses,
and majority vote ranking to train our helpfulness metric. Baseline experiments
validate the efficacy of AdvisorQA through our helpfulness metric, GPT-4, and
human evaluation, analyzing phenomena beyond the trade-off between helpfulness
and harmlessness. AdvisorQA marks a significant leap in enhancing QA systems
for providing personalized, empathetic advice, showcasing LLMs' improved
understanding of human subjectivity.
comment: 19 pages, 11 figures
☆ Sharing Parameter by Conjugation for Knowledge Graph Embeddings in Complex Space COLING 2022
A Knowledge Graph (KG) is the directed graphical representation of entities
and relations in the real world. KG can be applied in diverse Natural Language
Processing (NLP) tasks where knowledge is required. The need to scale up and
complete KGs automatically has given rise to Knowledge Graph Embedding (KGE),
a class of shallow machine learning models that suffer from high memory and
training-time costs. To mitigate the computational load, we propose a
parameter-sharing method, i.e., using conjugate parameters for the complex
numbers employed in KGE models. Our method improves memory efficiency by 2x in relation
embedding while achieving comparable performance to the state-of-the-art
non-conjugate models, with faster, or at least comparable, training time. We
demonstrate the generalizability of our method on two of the best-performing KGE
models $5^{\bigstar}\mathrm{E}$ and $\mathrm{ComplEx}$ on five benchmark
datasets.
comment: 8 pages, 1 figure, 6 tables, accepted at TextGraphs-16 workshop held
in conjunction with COLING 2022
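To make the conjugation idea concrete, here is a rough sketch (assuming a ComplEx-style scorer; the paper's exact parameterization for $5^{\bigstar}\mathrm{E}$ and ComplEx may differ): only half of each relation's complex dimensions are stored, and the remaining half is tied to their complex conjugates, roughly halving relation-embedding memory.

```python
import torch

class ConjugateRelationEmbedding(torch.nn.Module):
    """Sketch of parameter sharing by conjugation: store only half of each
    relation's complex dimensions and tie the other half to their conjugates."""
    def __init__(self, num_relations, dim):
        super().__init__()
        assert dim % 2 == 0
        self.re = torch.nn.Embedding(num_relations, dim // 2)
        self.im = torch.nn.Embedding(num_relations, dim // 2)

    def forward(self, rel_idx):
        re, im = self.re(rel_idx), self.im(rel_idx)
        # second half of the relation vector is the conjugate of the first half
        return torch.complex(torch.cat([re, re], -1), torch.cat([im, -im], -1))

def complex_score(h, r, t):
    """ComplEx-style score Re(<h, r, conj(t)>) over complex embeddings."""
    return torch.real((h * r * torch.conj(t)).sum(dim=-1))
```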
♻ ☆ Visually grounded few-shot word learning in low-resource settings
We propose a visually grounded speech model that learns new words and their
visual depictions from just a few word-image example pairs. Given a set of test
images and a spoken query, we ask the model which image depicts the query word.
Previous work has simplified this few-shot learning problem by either using an
artificial setting with digit word-image pairs or by using a large number of
examples per class. Moreover, all previous studies were performed using English
speech-image data. We propose an approach that can work on natural word-image
pairs but with fewer examples, i.e. fewer shots, and then illustrate how this
approach can be applied for multimodal few-shot learning in a real low-resource
language, Yor\`ub\'a. Our approach involves using the given word-image example
pairs to mine new unsupervised word-image training pairs from large collections
of unlabelled speech and images. Additionally, we use a word-to-image attention
mechanism to determine word-image similarity. With this new model, we achieve
better performance with fewer shots than previous approaches on an existing
English benchmark. Many of the model's mistakes are due to confusion between
visual concepts co-occurring in similar contexts. The experiments on Yor\`ub\'a
show the benefit of transferring knowledge from a multimodal model trained on a
larger set of English speech-image data.
comment: Accepted to TASLP. arXiv admin note: substantial text overlap with
arXiv:2305.15937
♻ ☆ Exploring Automated Distractor Generation for Math Multiple-choice Questions via Large Language Models NAACL 2024
Wanyong Feng, Jaewook Lee, Hunter McNichols, Alexander Scarlatos, Digory Smith, Simon Woodhead, Nancy Otero Ornelas, Andrew Lan
Multiple-choice questions (MCQs) are ubiquitous in almost all levels of
education since they are easy to administer and grade, and are a reliable
format for assessments and practice. One of the most important aspects of MCQs is the
distractors, i.e., incorrect options that are designed to target common errors
or misconceptions among real students. To date, the task of crafting
high-quality distractors largely remains a labor- and time-intensive process for
teachers and learning content designers, which has limited scalability. In this
work, we study the task of automated distractor generation in the domain of
math MCQs and explore a wide variety of large language model (LLM)-based
approaches, from in-context learning to fine-tuning. We conduct extensive
experiments using a real-world math MCQ dataset and find that although LLMs can
generate some mathematically valid distractors, they are less adept at
anticipating common errors or misconceptions among real students.
comment: NAACL 2024 findings
♻ ☆ JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks
With the rapid advancements in Multimodal Large Language Models (MLLMs),
securing these models against malicious inputs while aligning them with human
values has emerged as a critical challenge. In this paper, we investigate an
important and unexplored question of whether techniques that successfully
jailbreak Large Language Models (LLMs) can be equally effective in jailbreaking
MLLMs. To explore this issue, we introduce JailBreakV-28K, a pioneering
benchmark designed to assess the transferability of LLM jailbreak techniques to
MLLMs, thereby evaluating the robustness of MLLMs against diverse jailbreak
attacks. Utilizing a dataset of 2,000 malicious queries, also proposed in this
paper, we generate 20,000 text-based jailbreak prompts using advanced jailbreak
attacks on LLMs, alongside 8,000 image-based jailbreak inputs from recent MLLM
jailbreak attacks; the resulting comprehensive dataset includes 28,000 test
cases across a spectrum of adversarial scenarios. Our evaluation of 10
open-source MLLMs reveals a notably high Attack Success Rate (ASR) for attacks
transferred from LLMs, highlighting a critical vulnerability in MLLMs that
stems from their text-processing capabilities. Our findings underscore the
urgent need for future research to address alignment vulnerabilities in MLLMs
from both textual and visual inputs.
♻ ☆ Efficient Sentiment Analysis: A Resource-Aware Evaluation of Feature Extraction Techniques, Ensembling, and Deep Learning Models
In the pursuit of NLP systems that maximize accuracy, other important
metrics of system performance are often overlooked. Prior models are easily
forgotten despite their possible suitability in settings where large computing
resources are unavailable or relatively more costly. In this paper, we perform
a broad comparative evaluation of document-level sentiment analysis models with
a focus on resource costs that are important for the feasibility of model
deployment and general climate consciousness. Our experiments consider
different feature extraction techniques, the effect of ensembling,
task-specific deep learning modeling, and domain-independent large language
models (LLMs). We find that while a fine-tuned LLM achieves the best accuracy,
some alternate configurations provide huge (up to 24,283x) resource savings
for a marginal (<1%) loss in accuracy. Furthermore, we find that for smaller
datasets, the differences in accuracy shrink while the difference in resource
consumption grows further.
♻ ☆ SYNFAC-EDIT: Synthetic Imitation Edit Feedback for Factual Alignment in Clinical Summarization
Prakamya Mishra, Zonghai Yao, Parth Vashisht, Feiyun Ouyang, Beining Wang, Vidhi Dhaval Mody, Hong Yu
Large Language Models (LLMs) such as GPT & Llama have demonstrated
significant achievements in summarization tasks but struggle with factual
inaccuracies, a critical issue in clinical NLP applications where errors could
lead to serious consequences. To counter the high costs and limited
availability of expert-annotated data for factual alignment, this study
introduces an innovative pipeline that utilizes >100B parameter GPT variants
like GPT-3.5 & GPT-4 to act as synthetic experts to generate high-quality
synthetic feedback aimed at enhancing factual consistency in clinical note
summarization. Our research primarily focuses on edit feedback generated by
these synthetic feedback experts without additional human annotations,
mirroring and optimizing the practical scenario in which medical professionals
refine AI system outputs. Although such 100B+ parameter GPT variants have
demonstrated expertise in various clinical NLP tasks, such as the Medical
Licensing Examination, there is scant research on their capacity to act as
synthetic feedback experts and deliver expert-level edit feedback for improving
the generation quality of weaker (<10B parameter) LLMs like GPT-2 (1.5B) &
Llama 2 (7B) in the clinical domain. In this work, we leverage 100B+
GPT variants to act as synthetic feedback experts offering expert-level edit
feedback, which is used to reduce hallucinations and align weaker (<10B
parameter) LLMs with medical facts using two distinct alignment algorithms (DPO
& SALT), endeavoring to narrow the divide between AI-generated content and
factual accuracy. This highlights the substantial potential of LLM-based
synthetic edits in enhancing the alignment of clinical factuality.
comment: Equal contribution for the first two authors
♻ ☆ Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens
Large language models (LLMs) have demonstrated remarkable capabilities across
various tasks. However, their widespread application is hindered by the
resource-intensive decoding process. To address this challenge, current
approaches have incorporated additional decoding heads to enable parallel
prediction of multiple subsequent tokens, thereby achieving inference
acceleration. Nevertheless, the accuracy of these decoding heads falls short of
that of the auto-regressive decoding approach.
In light of these limitations, we propose Chimera, a novel framework
specifically designed for speculative sampling. Within this framework, we
introduce a lightweight draft model that effectively utilizes previously
generated tokens to predict subsequent words. To ensure both accuracy and
efficiency, we present two strategies within the lightweight draft model.
Firstly, we focus on capturing short-range dependencies at the bottom layer.
Secondly, we leverage the readily available representations from the original
LLM. Through empirical evaluation on the Vicuna and LLaMA-2 series, Chimera
demonstrates impressive results, achieving an average latency speedup ratio of
2.7x compared to the vanilla auto-regressive decoding approach. This highlights
the potential of our proposed framework in significantly improving the
efficiency of large language models during the decoding process.
♻ ☆ InstructIE: A Bilingual Instruction-based Information Extraction Dataset
Honghao Gui, Shuofei Qiao, Jintian Zhang, Hongbin Ye, Mengshu Sun, Lei Liang, Jeff Z. Pan, Huajun Chen, Ningyu Zhang
Large language models can perform well on general natural language tasks, but
their effectiveness is still not optimal for information extraction. Recent
works indicate that the main reason lies in the lack of extensive data on
information extraction instructions. Note that the existing datasets on
information extraction instructions not only have limited coverage but also
involve high construction costs. To address this issue, we introduce
InstructIE, a bilingual instruction-based information extraction dataset, which
covers 12 diverse domains. Specifically, we propose KG2Instruction, a framework
specifically for the automatic generation of such datasets. Experimental
results demonstrate that large language models trained with InstructIE can not
only obtain better information extraction capabilities but also enhance
zero-shot performance compared with baselines.
comment: Work in progress; project homepage:
https://www.zjukg.org/project/InstructIE/ dataset:
https://huggingface.co/datasets/zjunlp/InstructIE
♻ ☆ Can We Edit Multimodal Large Language Models? EMNLP 2023
In this paper, we focus on editing Multimodal Large Language Models (MLLMs).
Compared to editing single-modal LLMs, multimodal model editing is more
challenging, which demands a higher level of scrutiny and careful consideration
in the editing process. To facilitate research in this area, we construct a new
benchmark, dubbed MMEdit, for editing multimodal LLMs and establish a suite
of innovative metrics for evaluation. We conduct comprehensive experiments
involving various model editing baselines and analyze the impact of editing
different components for multimodal LLMs. Empirically, we notice that previous
baselines can edit multimodal LLMs to some extent, but the effect
is still barely satisfactory, indicating the potential difficulty of this task.
We hope that our work can provide the NLP community with insights. Code and
dataset are available in https://github.com/zjunlp/EasyEdit.
comment: EMNLP 2023. Add the Exact Match/Accuracy results of Reliability and
T-Generality
♻ ☆ Hint-enhanced In-Context Learning wakes Large Language Models up for knowledge-intensive tasks ICASSP 2024
In-context learning (ICL) ability has emerged with the increasing scale of
large language models (LLMs), enabling them to learn input-label mappings from
demonstrations and perform well on downstream tasks. However, under the
standard ICL setting, LLMs may sometimes neglect query-related information in
demonstrations, leading to incorrect predictions. To address this limitation,
we propose a new paradigm called Hint-enhanced In-Context Learning (HICL) to
explore the power of ICL in open-domain question answering, an important form
of knowledge-intensive task. HICL leverages LLMs' reasoning ability to extract
query-related knowledge from demonstrations, then concatenates the knowledge to
prompt LLMs in a more explicit way. Furthermore, we track the source of this
knowledge to identify specific examples, and introduce a Hint-related Example
Retriever (HER) to select informative examples for enhanced demonstrations. We
evaluate HICL with HER on 3 open-domain QA benchmarks, and observe average
performance gains of 2.89 EM score and 2.52 F1 score on gpt-3.5-turbo, 7.62 EM
score and 7.27 F1 score on LLaMA-2-Chat-7B compared with standard setting.
comment: Accepted by ICASSP 2024
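As a rough illustration of the hint-enhanced prompting flow (hypothetical prompt templates and `llm` wrapper; the paper's exact wording and the HER retriever are not reproduced here), the idea is to first ask the LLM to extract query-relevant knowledge from the demonstrations and then prepend that knowledge as an explicit hint:

```python
def build_hicl_prompt(llm, demonstrations, question):
    """Sketch of hint-enhanced in-context learning: extract query-related
    knowledge from the demonstrations with the LLM itself, then prepend it as
    an explicit hint before the standard ICL prompt. `llm(prompt)` is an
    assumed text-completion wrapper; `demonstrations` is a list of (q, a)."""
    demo_block = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in demonstrations)
    hint = llm(
        f"{demo_block}\n\nExtract the facts from the examples above that are "
        f"relevant to answering the new question.\nNew question: {question}\nFacts:"
    )
    return f"Hint: {hint}\n\n{demo_block}\n\nQ: {question}\nA:"
```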
♻ ☆ Language Imbalance Can Boost Cross-lingual Generalisation
Multilinguality is crucial for extending recent advancements in language
modelling to diverse linguistic communities. To maintain high performance while
representing multiple languages, multilingual models ideally align
representations, allowing what is learned in one language to generalise to
others. Prior research has emphasised the importance of parallel data and
shared vocabulary elements as key factors for such alignment. In this study, we
investigate an unintuitive novel driver of cross-lingual generalisation:
language imbalance. In controlled experiments on perfectly equivalent cloned
languages, we observe that the existence of a predominant language during
training boosts the performance of less frequent languages and leads to
stronger alignment of model representations across languages. Furthermore, we
find that this trend is amplified with scale: with large enough models or long
enough training, we observe that bilingual training data with a 90/10 language
split yields better performance on both languages than a balanced 50/50 split.
Building on these insights, we design training schemes that can improve
performance in all cloned languages, even without altering the training data.
As we extend our analysis to real languages, we find that infrequent languages
still benefit from frequent ones, yet whether language imbalance causes
cross-lingual generalisation in this setting remains inconclusive.
♻ ☆ Vesper: A Compact and Effective Pretrained Model for Speech Emotion Recognition
This paper presents a paradigm that adapts general large-scale pretrained
models (PTMs) to speech emotion recognition task. Although PTMs shed new light
on artificial general intelligence, they are constructed with general tasks in
mind, and thus, their efficacy for specific tasks can be further improved.
Additionally, employing PTMs in practical applications can be challenging due
to their considerable size. These limitations spawn another research direction,
namely, optimizing large-scale PTMs for specific tasks to generate
task-specific PTMs that are both compact and effective. In this paper, we focus
on the speech emotion recognition task and propose an improved emotion-specific
pretrained encoder called Vesper. Vesper is pretrained on a speech dataset
based on WavLM and takes into account emotional characteristics. To enhance
sensitivity to emotional information, Vesper employs an emotion-guided masking
strategy to identify the regions that need masking. Subsequently, Vesper
employs hierarchical and cross-layer self-supervision to improve its ability to
capture acoustic and semantic representations, both of which are crucial for
emotion recognition. Experimental results on the IEMOCAP, MELD, and CREMA-D
datasets demonstrate that Vesper with 4 layers outperforms WavLM Base with 12
layers, and the performance of Vesper with 12 layers surpasses that of WavLM
Large with 24 layers.
comment: This paper was accepted by IEEE Transactions on Affective Computing
2024
♻ ☆ Can LLMs perform structured graph reasoning?
Pretrained Large Language Models (LLMs) have demonstrated various reasoning
capabilities through language-based prompts alone, particularly in unstructured
task settings (tasks purely based on language semantics). However, LLMs often
struggle with structured tasks, because of the inherent incompatibility of
input representation. Reducing structured tasks to uni-dimensional language
semantics often renders the problem trivial. Keeping the trade-off between LLM
compatibility and structure complexity in mind, we design various graph
reasoning tasks as a proxy to semi-structured tasks in this paper, in order to
test the ability to navigate through representations beyond plain text in
various LLMs. Particularly, we design 10 distinct problems of graph traversal,
each representing increasing levels of complexity, and benchmark 5 different
instruct-finetuned LLMs (GPT-4, GPT-3.5, Claude-2, Llama-2 and Palm-2) on the
aforementioned tasks. Further, we analyse the performance of models across
various settings such as varying sizes of graphs as well as different forms of
k-shot prompting. We highlight various limitations, biases and properties of
LLMs through this benchmarking process, such as an inverse relation to the
average degrees of freedom of traversal per node in graphs, the overall
negative impact of k-shot prompting on graph reasoning tasks, and a positive
response bias which prevents LLMs from identifying the absence of a valid
solution. Finally, we introduce a new prompting technique specially designed
for graph traversal tasks (PathCompare), which demonstrates a notable increase
in the performance of LLMs in comparison to standard prompting techniques such
as Chain-of-Thought (CoT).
♻ ☆ Knowledgeable Preference Alignment for LLMs in Domain-specific Question Answering
Deploying large language models (LLMs) to real scenarios for domain-specific
question answering (QA) is a key thrust for LLM applications, which poses
numerous challenges, especially in ensuring that responses are both
accommodating to user requirements and appropriately leveraging domain-specific
knowledge bases. These are two major difficulties for LLM application that
vanilla fine-tuning falls short of addressing. Combining these requirements, we
conceive of them as the requirement for the model's preference to be
harmoniously aligned with humans'. Thus, we introduce Knowledgeable Preference
AlignmenT (KnowPAT), which constructs two kinds of preference sets to tackle
the two issues. Besides, we design a new alignment objective to align the LLM
preference with different human preferences uniformly, aiming to optimize LLM
performance in real-world, domain-specific QA settings. Extensive experiments
and comprehensive comparisons with 15 baseline methods illustrate that our
KnowPAT is a superior pipeline for real-scenario domain-specific QA with LLMs.
comment: Work in progress. Code is available at
https://github.com/zjukg/KnowPAT
♻ ☆ LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models
While Large Language Models (LLMs) have demonstrated commendable performance
across a myriad of domains and tasks, existing LLMs still exhibit a palpable
deficit in handling multimodal functionalities, especially for the Spoken
Question Answering (SQA) task which necessitates precise alignment and deep
interaction between speech and text features. To address the SQA challenge on
LLMs, we initially curated the free-form and open-ended LibriSQA dataset from
Librispeech, comprising Part I with natural conversational formats and Part II
encompassing multiple-choice questions followed by answers and analytical
segments. Both parts collectively include 107k SQA pairs that cover various
topics. Given the evident paucity of existing speech-text LLMs, we propose a
lightweight, end-to-end framework to execute the SQA task on LibriSQA,
achieving significant results. By reformatting ASR into the SQA format, we
further substantiate our framework's capability in handling ASR tasks. Our
empirical findings bolster the LLMs' aptitude for aligning and comprehending
multimodal information, paving the way for the development of universal
multimodal LLMs. The dataset and demo can be found at
https://github.com/ZihanZhaoSJTU/LibriSQA.
♻ ☆ Gaining More Insight into Neural Semantic Parsing with Challenging Benchmarks
The Parallel Meaning Bank (PMB) serves as a corpus for semantic processing
with a focus on semantic parsing and text generation. Currently, we witness an
excellent performance of neural parsers and generators on the PMB. This might
suggest that such semantic processing tasks have by and large been solved. We
argue that this is not the case and that performance scores from the past on
the PMB are inflated by non-optimal data splits and test sets that are too
easy. In response, we introduce several changes. First, instead of the prior
random split, we propose a more systematic splitting approach to improve the
reliability of the standard test data. Second, except for the standard test
set, we also propose two challenge sets: one with longer texts including
discourse structure, and one that addresses compositional generalization. We
evaluate five neural models for semantic parsing and meaning-to-text
generation. Our results show that model performance declines (in some cases
dramatically) on the challenge sets, revealing the limitations of neural models
when confronting such challenges.
♻ ☆ A Family of Pretrained Transformer Language Models for Russian LREC
Dmitry Zmitrovich, Alexander Abramov, Andrey Kalmykov, Maria Tikhonova, Ekaterina Taktasheva, Danil Astafurov, Mark Baushenko, Artem Snegirev, Vitalii Kadulin, Sergey Markov, Tatiana Shavrina, Vladislav Mikhailov, Alena Fenogenova
Transformer language models (LMs) are fundamental to NLP research
methodologies and applications in various languages. However, developing such
models specifically for the Russian language has received little attention.
This paper introduces a collection of 13 Russian Transformer LMs, which spans
encoder (ruBERT, ruRoBERTa, ruELECTRA), decoder (ruGPT-3), and encoder-decoder
(ruT5, FRED-T5) architectures. We provide a report on the model architecture
design and pretraining, and the results of evaluating their generalization
abilities on Russian language understanding and generation datasets and
benchmarks. By pretraining and releasing these specialized Transformer LMs, we
aim to broaden the scope of the NLP research directions and enable the
development of industrial solutions for the Russian language.
comment: to appear in LREC-COLING-2024
♻ ☆ ViLLM-Eval: A Comprehensive Evaluation Suite for Vietnamese Large Language Models
The rapid advancement of large language models (LLMs) necessitates the
development of new benchmarks to accurately assess their capabilities. To
address this need for Vietnamese, this work introduces ViLLM-Eval, a
comprehensive evaluation suite designed to measure the advanced knowledge and
reasoning abilities of foundation models within a Vietnamese context.
ViLLM-Eval consists of multiple-choice questions and next-word prediction tasks
spanning various difficulty levels and diverse disciplines, ranging from
humanities to science and engineering. A thorough evaluation of the most
advanced LLMs on ViLLM-Eval revealed that even the best performing models have
significant room for improvement in understanding and responding to Vietnamese
language tasks. ViLLM-Eval is believed to be instrumental in identifying key
strengths and weaknesses of foundation models, ultimately promoting their
development and enhancing their performance for Vietnamese users. This paper
provides a thorough overview of ViLLM-Eval as part of the Vietnamese Large
Language Model shared task, held within the 10th International Workshop on
Vietnamese Language and Speech Processing (VLSP 2023).
comment: arXiv admin note: text overlap with arXiv:2305.08322 by other authors
♻ ☆ Octopus v3: Technical Report for On-device Sub-billion Multimodal AI Agent
A multimodal AI agent is characterized by its ability to process and learn
from various types of data, including natural language, visual, and audio
inputs, to inform its actions. Despite advancements in large language models
that incorporate visual data, such as GPT-4V, effectively translating
image-based data into actionable outcomes for AI agents continues to be
challenging. In this paper, we introduce a multimodal model that incorporates
the concept of functional tokens specifically designed for AI agent
applications. To ensure compatibility with edge devices, our model is optimized
to a compact size of less than 1B parameters. Like GPT-4, our model can process
both English and Chinese. We demonstrate that this model is capable of
operating efficiently on a wide range of edge devices, including ones as
constrained as a Raspberry Pi.
♻ ☆ Self-Polish: Enhance Reasoning in Large Language Models via Problem Refinement EMNLP 2023
To enhance the multi-step reasoning capabilities of large language models,
researchers have extensively explored prompting methods, notably the
Chain-of-Thought (CoT) method which explicitly elicits human-like rationales.
However, they have inadvertently overlooked the potential of enhancing model
reasoning performance by formulating higher-quality problems. In this work, we
start from the problem side and propose Self-Polish (SP), a novel method that
facilitates the model's reasoning by guiding it to progressively refine the
given problems to be more comprehensible and solvable. We also explore several
automatic prompting variants and propose the Self-Polish prompt bank for the
community. SP is orthogonal to all other answer/reasoning-side prompting
methods like CoT, allowing for seamless integration with state-of-the-art
techniques for further improvement. Thorough experiments show that the proposed
method attains notable and consistent effectiveness on five reasoning
benchmarks across different models. Furthermore, our method also showcases
impressive performance on robustness evaluation. Codes and prompts are
available at https://github.com/WooooDyy/Self-Polish.
comment: Accepted to EMNLP 2023 Findings. Codes and prompts are available at
https://github.com/WooooDyy/Self-Polish
♻ ☆ KTRL+F: Knowledge-Augmented In-Document Search
We introduce a new problem KTRL+F, a knowledge-augmented in-document search
task that necessitates real-time identification of all semantic targets within
a document with the awareness of external sources through a single natural
query. KTRL+F addresses the following unique challenges for in-document search:
(1) utilizing knowledge outside the document to provide additional information
about targets, and (2) balancing real-time applicability with performance. We
analyze various baselines in KTRL+F and find
limitations of existing models, such as hallucinations, high latency, or
difficulties in leveraging external knowledge. Therefore, we propose a
Knowledge-Augmented Phrase Retrieval model that shows a promising balance
between speed and performance by simply augmenting external knowledge in phrase
embedding. We also conduct a user study to verify whether solving KTRL+F can
enhance the search experience for users. It demonstrates that even with our simple
model, users can reduce search time with fewer queries and fewer extra visits
to other sources for collecting evidence. We encourage the
research community to work on KTRL+F to enhance more efficient in-document
information access.
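A minimal sketch of the "augment external knowledge in phrase embedding" idea (illustrative only; the fusion rule and normalization are assumptions, not the paper's exact design): each candidate phrase embedding is combined with an embedding of its linked external knowledge before inner-product search against the query.

```python
import numpy as np

def knowledge_augmented_search(query_vec, phrase_vecs, knowledge_vecs,
                               alpha=0.5, top_k=5):
    """Sketch of knowledge-augmented phrase retrieval: fuse each in-document
    phrase embedding with an embedding of its linked external knowledge
    (e.g. an encyclopedia snippet), then rank by inner product with the query.
    All inputs are assumed to be L2-normalized numpy arrays."""
    fused = (1 - alpha) * phrase_vecs + alpha * knowledge_vecs  # additive fusion
    scores = fused @ query_vec                                  # inner-product relevance
    return np.argsort(-scores)[:top_k]                          # indices of best phrases
```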
♻ ☆ Measuring Social Norms of Large Language Models
We present a new challenge to examine whether large language models
understand social norms. In contrast to existing datasets, our dataset requires
a fundamental understanding of social norms to solve. Our dataset features the
largest set of social norm skills, consisting of 402 skills and 12,383
questions covering a wide set of social norms ranging from opinions and
arguments to culture and laws. We design our dataset according to the K-12
curriculum. This enables the direct comparison of the social understanding of
large language models to humans, more specifically, elementary students. While
prior models achieve nearly random accuracy on our benchmark, recent large
language models such as GPT3.5-Turbo and LLaMA2-Chat are able to improve the
performance significantly, only slightly below human performance. We then
propose a multi-agent framework based on large language models to improve the
models' ability to understand social norms. This method further improves large
language models to be on par with humans. Given the increasing adoption of
large language models in real-world applications, our finding is particularly
important and presents a unique direction for future improvements. The proposed
method and dataset are available at
https://huggingface.co/datasets/socialdataset2024/social.
♻ ☆ RoNID: New Intent Discovery with Generated-Reliable Labels and Cluster-friendly Representations DASFAA 2024
New Intent Discovery (NID) strives to identify known and reasonably deduce
novel intent groups in the open-world scenario. However, current methods face issues
with inaccurate pseudo-labels and poor representation learning, creating a
negative feedback loop that degrades overall model performance, including
accuracy and the adjusted rand index. To address the aforementioned challenges,
we propose a Robust New Intent Discovery (RoNID) framework optimized by an
EM-style method, which focuses on constructing reliable pseudo-labels and
obtaining cluster-friendly discriminative representations. RoNID comprises two
main modules: reliable pseudo-label generation module and cluster-friendly
representation learning module. Specifically, the pseudo-label generation
module assigns reliable synthetic labels by solving an optimal transport
problem in the E-step, which effectively provides high-quality supervised
signals for the input of the cluster-friendly representation learning module.
To learn cluster-friendly representation with strong intra-cluster compactness
and large inter-cluster separation, the representation learning module combines
intra-cluster and inter-cluster contrastive learning in the M-step to feed more
discriminative features into the generation module. RoNID can be performed
iteratively to ultimately yield a robust model with reliable pseudo-labels and
cluster-friendly representations. Experimental results on multiple benchmarks
demonstrate our method brings substantial improvements over previous
state-of-the-art methods by a large margin of +1~+4 points.
comment: DASFAA 2024
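To illustrate the E-step's reliable pseudo-label generation (a sketch only; RoNID's actual optimal-transport formulation and constraints are described in the paper), a few Sinkhorn iterations can assign samples to intent clusters under a balanced marginal constraint:

```python
import torch

def sinkhorn_pseudo_labels(logits, n_iters=3, eps=0.05):
    """Sketch of optimal-transport pseudo-label assignment: rescale the
    sample-by-cluster affinity matrix so samples and clusters both have
    uniform marginals, then take the argmax as the pseudo-label.
    `logits` is a [num_samples, num_clusters] affinity matrix."""
    q = torch.exp(logits / eps)
    q = q / q.sum()
    n, k = q.shape
    for _ in range(n_iters):
        q = q / q.sum(dim=0, keepdim=True) / k   # balance cluster marginals
        q = q / q.sum(dim=1, keepdim=True) / n   # balance sample marginals
    return q.argmax(dim=1)                        # balanced pseudo-labels
```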
♻ ☆ Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception ICLR 2024
Mobile device agent based on Multimodal Large Language Models (MLLM) is
becoming a popular application. In this paper, we introduce Mobile-Agent, an
autonomous multi-modal mobile device agent. Mobile-Agent first leverages visual
perception tools to accurately identify and locate both the visual and textual
elements within the app's front-end interface. Based on the perceived vision
context, it then autonomously plans and decomposes the complex operation task,
and navigates the mobile Apps through operations step by step. Different from
previous solutions that rely on XML files of Apps or mobile system metadata,
Mobile-Agent allows for greater adaptability across diverse mobile operating
environments in a vision-centric way, thereby eliminating the necessity for
system-specific customizations. To assess the performance of Mobile-Agent, we
introduced Mobile-Eval, a benchmark for evaluating mobile device operations.
Based on Mobile-Eval, we conducted a comprehensive evaluation of Mobile-Agent.
The experimental results indicate that Mobile-Agent achieved remarkable
accuracy and completion rates. Even with challenging instructions, such as
multi-app operations, Mobile-Agent can still complete the requirements. Code
and model will be open-sourced at https://github.com/X-PLUG/MobileAgent.
comment: Accepted by ICLR 2024 Workshop in Large Language Model (LLM) Agents
♻ ☆ Leveraging Domain Knowledge for Efficient Reward Modelling in RLHF: A Case-Study in E-Commerce Opinion Summarization
Swaroop Nath, Tejpalsingh Siledar, Sankara Sri Raghava Ravindra Muddu, Rupasai Rangaraju, Harshad Khadilkar, Pushpak Bhattacharyya, Suman Banerjee, Amey Patil, Sudhanshu Shekhar Singh, Muthusamy Chelliah, Nikesh Garera
Reinforcement Learning from Human Feedback (RLHF) has become a dominating
strategy in aligning Language Models (LMs) with human values/goals. The key to
the strategy is learning a reward model ($\varphi$), which can reflect the
latent reward model of humans. While this strategy has proven effective, the
training methodology requires a lot of human preference annotation (usually in
the order of tens of thousands) to train $\varphi$. Such a large-scale
annotation is justifiable when it's a one-time effort, and the reward model is
universally applicable. However, human goals are subjective and depend on the
task, requiring task-specific preference annotations, which can be impractical
to fulfill. To address this challenge, we propose a novel approach to infuse
domain knowledge into $\varphi$, which reduces the amount of preference
annotation required (by $21\times$), avoids the Alignment Tax, and provides some
interpretability. We validate our approach in E-Commerce Opinion Summarization,
with a significant reduction in dataset size (to just $940$ samples) while
advancing the SOTA ($\sim4$-point ROUGE-L improvement, preferred by humans over
the SOTA $68\%$ of the time). Our contributions include a novel Reward
Modeling technique and two new datasets: PromptOpinSumm (supervised data for
Opinion Summarization) and OpinPref (a gold-standard human preference dataset).
The proposed methodology opens up avenues for efficient RLHF, making it more
adaptable to applications with varying human values. We release the artifacts
(Code: github.com/efficient-rlhf. PromptOpinSumm: hf.co/prompt-opin-summ.
OpinPref: hf.co/opin-pref) for usage under MIT License.
comment: 19 pages, 6 figures, 21 tables
♻ ☆ Monotonic Paraphrasing Improves Generalization of Language Model Prompting
The performance of large language models (LLMs) may vary with different prompts
or instructions even for the same task. One commonly recognized factor for this
phenomenon is the model's familiarity with the given prompt or instruction,
which is typically estimated by its perplexity. However, finding the prompt
with the lowest perplexity is challenging, given the enormous space of possible
prompting phrases. In this paper, we propose monotonic paraphrasing (MonoPara),
an end-to-end decoding strategy that paraphrases given prompts or instructions
into their lower perplexity counterparts based on an ensemble of a paraphrase
LM for prompt (or instruction) rewriting, and a target LM (i.e. the prompt or
instruction executor) that constrains the generation for lower perplexity. The
ensemble decoding process can efficiently paraphrase the original prompt
without altering its semantic meaning, while monotonically decreasing the
perplexity of each generation as calculated by the target LM. We explore in
detail both greedy and search-based decoding as two alternative decoding
schemes of MonoPara. Notably, MonoPara does not require any training and can
monotonically lower the perplexity of the paraphrased prompt or instruction,
leading to improved performance of zero-shot LM prompting as evaluated on a
wide selection of tasks. In addition, MonoPara is also shown to effectively
improve LMs' generalization on perturbed and unseen task instructions.
comment: Under review at ARR 2024 April
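A rough sketch of the ensemble decoding step (illustrative; the exact ensembling coefficient and search procedure are described in the paper): at each step, next-token scores from the paraphrase LM and the target LM are combined so the rewrite stays a fluent paraphrase while its perplexity under the target LM decreases.

```python
import torch
import torch.nn.functional as F

def monopara_step(para_logits, target_logits, lam=0.5):
    """One greedy step of MonoPara-style ensemble decoding (sketch): combine
    the paraphrase LM's next-token log-probs (keeps the rewrite faithful) with
    the target LM's log-probs (pushes toward low perplexity under the
    executor). Both inputs are [vocab]-shaped logits for the next position."""
    para_logp = F.log_softmax(para_logits, dim=-1)
    target_logp = F.log_softmax(target_logits, dim=-1)
    combined = (1 - lam) * para_logp + lam * target_logp
    return int(combined.argmax())   # chosen next-token id
```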
♻ ☆ FIZZ: Factual Inconsistency Detection by Zoom-in Summary and Zoom-out Document
Through the advent of pre-trained language models, there have been notable
advancements in abstractive summarization systems. Simultaneously, a
considerable number of novel methods for evaluating factual consistency in
abstractive summarization systems have been developed. However, these
evaluation approaches have substantial limitations, especially regarding
refinement and interpretability. In this work, we propose a highly effective
and interpretable factual inconsistency detection metric, Factual Inconsistency
Detection by Zoom-in Summary and Zoom-out Document (FIZZ), for abstractive
summarization systems, based on fine-grained atomic fact decomposition.
Moreover, we align
atomic facts decomposed from the summary with the source document through
adaptive granularity expansion. These atomic facts represent a more
fine-grained unit of information, facilitating detailed understanding and
interpretability of the summary's factual inconsistency. Experimental results
demonstrate that our proposed factual consistency checking system significantly
outperforms existing systems.
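A condensed sketch of an atomic-fact consistency check in this spirit (hypothetical helper functions `decompose_into_atomic_facts` and `nli_entails`; the paper's adaptive granularity expansion is not reproduced): decompose the summary into atomic facts and score each fact against the source document.

```python
def atomic_fact_consistency(summary, document,
                            decompose_into_atomic_facts, nli_entails):
    """Sketch of zoom-in/zoom-out factual inconsistency detection: break the
    summary into fine-grained atomic facts, check whether each fact is
    entailed by the document, and report both an overall score and the
    specific unsupported facts for interpretability. The two helpers are
    assumed components (e.g. an LLM decomposer and an NLI model)."""
    facts = decompose_into_atomic_facts(summary)
    unsupported = [f for f in facts if not nli_entails(document, f)]
    score = 1.0 - len(unsupported) / max(len(facts), 1)
    return score, unsupported
```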
♻ ☆ Explaining latent representations of generative models with large multimodal models ICLR 2024
Learning interpretable representations of data generative latent factors is
an important topic for the development of artificial intelligence. Large
multimodal models, which have recently risen to prominence, can align images
with text to generate answers. In this work, we propose a framework to comprehensively explain each
latent variable in the generative models using a large multimodal model. We
further measure the uncertainty of our generated explanations, quantitatively
evaluate the performance of explanation generation among multiple large
multimodal models, and qualitatively visualize the variations of each latent
variable to learn the disentanglement effects of different generative models on
explanations. Finally, we discuss the explanatory capabilities and limitations
of state-of-the-art large multimodal models.
comment: ICLR 2024 Workshop on Reliable and Responsible Foundation Models
♻ ☆ Llama-VITS: Enhancing TTS Synthesis with Semantic Awareness LREC
Recent advancements in Natural Language Processing (NLP) have seen
Large-scale Language Models (LLMs) excel at producing high-quality text for
various purposes. Notably, in Text-To-Speech (TTS) systems, the integration of
BERT for semantic token generation has underscored the importance of semantic
content in producing coherent speech outputs. Despite this, the specific
utility of LLMs in enhancing TTS synthesis remains considerably limited. This
research introduces an innovative approach, Llama-VITS, which enhances TTS
synthesis by enriching the semantic content of text using LLM. Llama-VITS
integrates semantic embeddings from Llama2 with the VITS model, a leading
end-to-end TTS framework. By leveraging Llama2 for the primary speech synthesis
process, our experiments demonstrate that Llama-VITS matches the naturalness of
the original VITS (ORI-VITS) and of those incorporating BERT (BERT-VITS) on the
LJSpeech dataset, a substantial collection of neutral, clear speech. Moreover,
our method significantly enhances emotive expressiveness on the EmoV_DB_bea_sem
dataset, a curated selection of emotionally consistent speech from the EmoV_DB
dataset, highlighting its potential to generate emotive speech.
comment: 9 pages, 2 figures, 4 tables; accepted at LREC-COLING 2024
♻ ☆ A Survey on Open Information Extraction from Rule-based Model to Large Language Model
Pai Liu, Wenyang Gao, Wenjie Dong, Lin Ai, Ziwei Gong, Songfang Huang, Zongsheng Li, Ehsan Hoque, Julia Hirschberg, Yue Zhang
Open Information Extraction (OpenIE) represents a crucial NLP task aimed at
deriving structured information from unstructured text, unrestricted by
relation type or domain. This survey paper provides an overview of OpenIE
technologies spanning from 2007 to 2024, emphasizing a chronological
perspective absent in prior surveys. It examines the evolution of task settings
in OpenIE to align with the advances in recent technologies. The paper
categorizes OpenIE approaches into rule-based, neural, and pre-trained large
language models, discussing each within a chronological framework.
Additionally, it highlights prevalent datasets and evaluation metrics currently
in use. Building on this extensive review, the paper outlines potential future
directions in terms of datasets, information sources, output formats,
methodologies, and evaluation metrics.
comment: The first five authors contributed to this work equally. Names are
ordered randomly
♻ ☆ SelectLLM: Can LLMs Select Important Instructions to Annotate?
Instruction tuning benefits from large and diverse datasets; however, creating
such datasets involves a high cost of human labeling. While synthetic datasets
generated by large language models (LLMs) have partly solved this issue, they
often contain low-quality data. One effective solution is selectively
annotating unlabelled instructions, especially given the relative ease of
acquiring unlabeled instructions or texts from various sources. However, how to
select unlabelled instructions is not well-explored, especially in the context
of LLMs. Further, traditional data selection methods, relying on input
embedding space density, tend to underestimate instruction sample complexity,
whereas those based on model prediction uncertainty often struggle with
synthetic label quality. Therefore, we introduce SelectLLM, an alternative
framework that leverages the capabilities of LLMs to more effectively select
unlabeled instructions. SelectLLM consists of two key steps: Coreset-based
clustering of unlabelled instructions for diversity and then prompting an LLM to
identify the most beneficial instructions within each cluster. Our experiments
demonstrate that SelectLLM matches or outperforms other state-of-the-art
methods in instruction tuning benchmarks. It exhibits remarkable consistency
across human and synthetic datasets, along with better cross-dataset
generalization, as evidenced by a 10% performance improvement on the Cleaned
Alpaca test set when trained on Dolly data. All code and data are publicly
available (https://github.com/minnesotanlp/select-llm).
comment: First Authors: Ritik Sachin Parkar and Jaehyung Kim | Second Author:
Jong Inn Park | PI: Dongyeop Kang
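A compressed sketch of the two-step SelectLLM pipeline (illustrative; the embedding model, clustering settings, and prompt wording are assumptions): cluster unlabelled instructions for diversity, then prompt an LLM to pick the most beneficial instruction within each cluster.

```python
from sklearn.cluster import KMeans

def select_instructions(instructions, embed, llm, n_clusters=10):
    """Sketch of a SelectLLM-style selection loop: (1) coreset-like diversity
    via clustering of instruction embeddings, (2) an LLM choosing the most
    useful instruction inside each cluster. `embed(texts)` returns a 2-D array
    and `llm(prompt)` returns text; both are assumed wrappers."""
    vectors = embed(instructions)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)
    selected = []
    for c in range(n_clusters):
        members = [t for t, l in zip(instructions, labels) if l == c]
        numbered = "\n".join(f"{i}. {t}" for i, t in enumerate(members))
        reply = llm("Pick the single most useful instruction to annotate, "
                    "by number:\n" + numbered)
        idx = int("".join(ch for ch in reply if ch.isdigit()) or 0)
        selected.append(members[min(idx, len(members) - 1)])
    return selected
```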