Paper Content:
Page 1:
LLM-IE: A Python Package for Generative Information Extraction with Large
Language Models
Enshuo Hsu1, 2, MS; Kirk Roberts1, PhD
1McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston,
Houston, Texas, USA.
2Enteprise Development and Integration , University of Texas MD Anderson Cancer Center, Houston, TX ,
USA.
Corresponding author: Kirk Roberts, kirk.roberts@uth.tmc.edu
Keywords:
Natural language processing, Large language models, Named entity recognition, Relation extraction
Word Count:
Total: 2199
Abstract: 149
ABSTRACT
Objectives
Despite the recent adoption of large language models (LLMs) for biomedical information extraction, challenges in
prompt engineering and algorithms persist, with no dedicated software available. To address this, we developed
LLM-IE: a Python package for building complete information extraction pipelines. Our key innovation is an
interactive LLM agent to support schema definition and prompt design.
Materials and Methods
The LLM-IE supports named entity recognition, entity attribute extraction, and relation extraction tasks. We
benchmarked on the i2b2 datasets and conducted a system evaluation .
Results
The sentence -based prompting algorithm resulted in the best performance while requir ing a longer inferen ce time.
System evaluation provided intuitive visualization .
Discussion
LLM-IE was designed from practical NLP experience in healthcare and has been adopted in internal projects. It
should hold great value to the biomedical NLP community.
Conclusion
We developed a Python package, LLM-IE, that provides building blocks for robust information extraction pipeline
construction.
Page 2:
BACKGROUND AND SIGNIFICANCE
The use of large language models (LLMs) for
information extraction in natural language processing
(NLP) has gained increasing popularity [1]. There are
several benefits including 1) low annotation
requirement through zero -shot and few -shot learning
[2,3], 2) comparable performance to fully fine -tuned
models [3], and 3) end -to-end entity span and relation
extraction [4]. In the biomedical field where manually
labeled gold standards are expensive and information
extraction schema s are often complex, LLM -based
information extraction methods show great promise.
Recent works have been focusing on 1) LLM
inferencing infrastructures [5–7], 2) LLM prompting
algorithms [3,4,8–14], and 3) prompt engineering [15].
However, for NLP practitioners, challenge s persist as
the inference engines are difficult to configure and
depend heavily on computing environment . Further,
prompt engineering requires experience, domain
knowledge, and effort in iterative development. Finally,
despite some studies releas ing source code, to our
knowledge no software integrates multiple systems
and methods and provides a comprehensive toolkit
for the LLM-based information extraction pipeline
building. Therefore, we developed a Python packag e,
LLM-IE, for the clinical NLP community .
Our work has the following significance:
1. We provide a uniform interface for different
LLM inference engines which avoids the
complexity of configuration.
2. We implement popular prompting algorithms
published in the biomedical domain and the
open domain and provide simple APIs.
3. We build an LLM agent (“Prompt Editor”) to
help users write and polish prompt templates .
OBJECTIVE
We published a Python package on the Python
Package Index (PyPi) repository and the GitHub
repository .
METHODS
LLM-IE is a comprehensive toolkit that provides
building blocks for the construction of LLM-based
information extraction pipelines. The package and
documentation are available on PyPi
(https://pypi.org/project/llm -ie/ ) and GitHub
(https://github.com/daviden1013/llm -ie ) Usage
LLM-IE covers the life cycle of an NLP information
extraction pipeline: 1) task definition, 2) prompt
design, 3) named entity extraction, 4) entity attributes
extraction, 5) relation extraction, 6) data management,
and 7) visualization.
In the task definition and prompt design phases,
users work closely with the Prompt Editor, is an LLM
agent with access to many pre -stored prompt
templates and guidelines . Users choose an
information extraction algorithm (“ extractor”) and
start chatting with the Prompt Editor via terminal or
IPython (e.g., Jupyter Notebooks) . On the backend,
the Prompt Editor analyzes the users’ requests using
the relevant templates and prompt -writing guidelines
and generates a prompt template with specific task
description s, schema definition, output format
definition, and input placeholders.
The system prompt for the Prompt Editor:
You are an AI assistant specializing
in prompt writing and improvement.
Your role is to help users refine,
rewrite, and generate effective
prompts based on guidelines
provided…
The chat prompt template includes a placeholder for
prompt guidelines and examples:
# Task description
Chat with the user following the
prompt guideline below.
# Prompt guideline
{{prompt_guideline}}
Users are encouraged to iteratively develop with the
Prompt Editor until a final prompt template is
prepared. In the named entity extraction and entity
attributes extraction phases, the frame extractor
applies the prompt template for end-to-end entity
spans and attribute extraction on the target
documents . The LLM outputs strings following the
JSON schema specified in the prompt template. A
post-processing method then conver ts them into
structured frames with frame ID, entity text, entity
spans, and a set o f attributes. The relation extraction
phase involves the extracted frames from the previous
step and a relation extraction prompt template which
can be constructed by working with the Prompt Editor.
The relation extractor s apply the prompt template on
pairs of frames to detect relation existence (i.e., binary
Page 3:
relations) and relation types (i.e., multi -class
relations). To reduce computation for LLM inferencing,
users are encouraged to provide a pre -processing
function (i.e., possible_relation_types_func ) that
applies decision rules. For example, if the two frames
in the pair are “drug” and “dosage”, the possible
relation types are “Dosage -Drug” and “No-relation”, while “dosage” and “dosage” frames must be “No -
relation” and thus do not require LLM inferencing.
After extraction, the built -in data types (e.g.,
LLMInformationExtraction Document) process, store,
and visualize the frames and relations via a Flask App
or HTML rendering (Figure 1).
Figure 1: Usage flowchart . Users start by providing a simple description of the task to the LLM agent. The LLM agent
generates standard prompt templates with Task description, schema definition, output format definition, and input
placeholders. Users iteratively develop prompt templates with the LLM agent until a high -quality prompt template is
prepared. The FrameExtractors use the prompt template to extract entities and attributes (“frames”). The
RelationExtractors extract the relation and relation types between f rames. The built -in visualization tools render the
frames and relations on a web App.
Page 4:
System Design
Our system design follows four principles: 1)
Efficiency, in which recent and successful inference
engines and prompting algorithms are supported (e.g.,
Ollama [5], HuggingFace -hub [16], Llama.cpp [6],
vLLM [7], OpenAI API) . 2) Flexibility, in which
fundamental functions are implemented as modules
and classes (e.g., Inference Engines, Frame Extractors,
Relation Extractors) for easy customization . 3)
Transparency, in which all the prompt templates , LLM
inputs, and outputs are accessible to users. 4)
Minimalism, in which the package has few
dependencies. Users only install dependencies for
functions they use.
The LLM -IE package is composed of four Python
modules: 1) Engines, 2) Extractors, 3) Data types, and
4) Prompt Editor. The Engines module defines
interface classes that support popular open -source
(e.g., Ollama, HuggingFace -hub) and closed -source
(e.g., OpenAI API) LLM inference engines . They work
for the Prompt Editor and extractors . The Extractors
module defines prompting algorithms (“ extractors”)
for frame and relation extraction. The Basic frame
extractor prompts LLM directly and outputs a list of frames. The Review frame extractor prompts LLM to
generate initial outputs and prompt again for
amendment and correction. The Sentence frame
extractor splits the target document into sentences
and prompt s sentence by sentence to improve recall
and entity span detection accuracy . The binary
relation extractor prompts LLM to review and detect
relations between a pair of frames. The multi -class
relation extractor prompts LLM to classify relation
types betwee n a pair of frames. The algorithm sources
are summarized in Table 1. More implementation
details are shown in Table SX . The Data types module
defines data management classes for frames and
relations storage, validation, and visualization. A
document is packaged into a self -contained object.
The validation checks for overlaps and redundancy
and ensures that relations are linking two existing
frames. For minimalism, we implemented the
visualization methods (i.e., viz_serve, viz_render) by
internally calling o ur plug-in Python package, “ie -viz”.
The Prompt editor module defines a Prompt Editor
class that serves as a n LLM agent for prompt
development. It has access to pre -stored prompt -
writing guidelines and examples for each extractor
(Figure 2).
Page 5:
Figure 2: Conceptual c lass diagram. The Engines module defines InferenceEngine classes that host LLM and
provides an interface for inference. The Extractors module defines FrameExtractors and RelationExtractors that
process apply prompt tem plates, prompt LLM for information extraction, and post -process outputs. The Data types
module defines containers for text, entities, and relations management and visualization. The PromptEditor module
defines a PromptEditor AI Agent to write and comment on prompt templates.
Table 1: Prompting algorithm sources
Task Algorithms (implemented in
extractors) Algorithm reference s
Named entity recognition BasicFrameExtractor [3]
ReviewFrameExtractor [8,9]
SentenceFrameExtractor [10,11,17]
Entity attribute extraction All above FrameExtractors [4]
Relation extraction BinaryRelationExtractor [12,13]
MultiClassRelationExtractor [12–14]
Benchmarking and System Evaluation
We benchmarked our package on three clinical NLP
datasets for named entity recognition (NER), entity
attribute extraction (E A), and relation extraction (RE).
We adopted the 2012 [18] and 2014 [19] Integrating
Biology and the Bedside (i2b2), and 2018 National NLP Clinical Challenges (n2c2) [20] Natural Language
Processing Challenge. All experiments were evaluated
with the Llama -3.1-70B [21] in an 8-shot prompting
setting and conducted with the vLLM [7] inference
engine on a GPU server with 8 NVIDIA A100 GPUs.
Details and source code are discussed on our GitHub
Page 6:
page ( https://github.com/daviden1013/LLM -
IE_Benchmark ).
The i2b2/ n2c2 data user agreement prohibits public
sharing of the text content. Therefore, w e performed a
system evaluation and visualize d the extraction on a synthesized clinical note. The task is to extract drugs,
conditions, and adverse drug events (ADEs) with
corresponding attributes and relations.
Implementation details are available on our GitHub
page (LLM -IE_Benchmark).
RESULTS
Benchmarking For the NER and EA tasks, the Sentence Frame
Extractor achieved the best F1 scores, while
consuming more GPU time. The Review Frame
Extractor had higher recall th an the Basic Frame
Extractor on all NER tasks.
Table 2: Benchmark on the i2b2/ n2c2 datasets for NER, EA, and RE tasks
Tasks Algorithm GPU time
(s)/ Note Benchmarks
Named
Entity
Recognition 2012 Temporal Relations Challenge
EVENT TIMEX
Precision Recall F1 Precision Recall F1
Basic 67.5 0.9406 0.2841 0.4364 0.9595 0.3516 0.5147
Review 84.0 0.8965 0.3995 0.5527 0.9352 0.5473 0.6905
Sentence 132.9 0.9101 0.6824 0.7799 0.8891 0.739 0.8071
2014 De-identification Challenge
Strict Relaxed
Precision Recall F1 Recall Precision F1
Basic 9.4 0.7154 0.4813 0.5755 0.7172 0.4826 0.5769
Review 15.7 0.5649 0.5454 0.555 0.5667 0.5471 0.5567
Sentence 20.7 0.6683 0.7379 0.7014 0.6703 0.7401 0.7035
2018 (Track 2) ADE and Medication Extraction Challenge
Strict Lenient
Precision Recall F1 Recall Precision F1
Basic 44.3 0.7384 0.3534 0.478 0.8537 0.4034 0.5479
Review 63.2 0.7209 0.427 0.5363 0.8416 0.4918 0.6208
Sentence 114.1 0.852 0.6166 0.7154 0.963 0.692 0.8053
Entity
Attribute
Extraction 2012 Temporal Relations Challenge
EVENT TIMEX
Type Polarity Modality Type Value Modifier
Basic 67.5 0.2589 0.2707 0.2737 0.3236 0.2835 0.3198
Review 84.0 0.358 0.3799 0.3828 0.4934 0.4209 0.4857
Sentence 132.9 0.6056 0.642 0.6432 0.678 0.5505 0.667
Relation
Extraction 2018 (Track 2) ADE and Medication Extraction Challenge
Precision Recall F1
Multi-class 213.9 0.3831 0.978 0.5505
System Evaluation
We utilized the LLE -IE package to build an information
extraction pipeline for the drug, condition, and ADE
entities, attributes, and relations. For all the frames
extracted by the Frame extractor, the attribute “Type”
represents the frame type as one of the “Drug”,
“Condition”, or “ADE”. If the Type is “Drug”, “Dosage” and “Frequency” are extracted as additional attributes .
If the Type is “Condition”, an “Assertion” attribute is
assigned. The relations between a “Condition” frame
and a “Drug” frame and bet ween an “ADE” frame and
a “Drug” frame are extracted by the Relation extractor.
We visualized the results with the viz_render method
and displayed them on a browser (Figure 3).
Page 7:
Figure 3: System performance and visualization . The frames are highlighted based on the attribute “Type” as Drug,
Condition, or ADE. For the Drug frames, attributes “Dosage” and “Frequency” are extracted. For the Condition frames,
the attribute “Assertion” is extracted. The relations Condition -Drug and ADE -Drug are visualized as paths. Note that
for publication purposes, only a few entity attributes are displayed in this figure.
DISCUSSION
We developed the LLM-IE Python package for LLM -
based information extraction. The usage (i.e., building
block classes and pipelines) is designed based on our
practical NLP experience in the healthcare industry.
We have been adopting it internally for NLP projects.
Therefore, we believe it is relevant to other NLP
practitioners in the biomedical field. The system
design in which inference engines and extractors are
placed in modules with well -organized inherent
relationships enables continuous development as
new infrastructures and prompting algorithms are
released in the future . Our visualization features
provide an intuitive way to validat e (e.g., error analysis ,
performance evaluation ) outputs with a complex
schema which would be cumbersome otherwise. The benchmark results are reasonable compared to
our recent publication [22]. In some cases, the few -
shot LLM performance was below fully supervised
models, as previously reported [15].
Despite the great features, our LLM-IE package has a
few limitations: 1) it is in an active development phase.
More practical adoption and evaluation are needed. 2)
Like all LLM -based systems, prompt engineering plays
an important role in providing domain knowledge and
task-specific definitions. Despite our Prompt Editor
LLM agent, it is up to the users to finalize the prompt
templates. Some familiarity with prompt writing is still
necessary. 3) The post -processing relies on the LLM to
output in the correct format. Inconsistent elements in
the JSON list are discarded. Thus, it is important to
choose instructed LLMs with good instruction -
following performance. 4) Our benchmarking and
system evaluation used Llama 3.1 to represent the
state-of-the-art open -source LLM at this point.
Further evaluation is needed for other LLMs.
CONCLUSIONS
To fill in the gaps between the latest LLM technology
and biomedical NLP practices, we developed a Python package, LLM-IE, that provides building blocks
for robust information extraction pipeline
construction.
Page 8:
REFERENCES
1 Xu D, Chen W, Peng W, et al. Large Language
Models for Generative Information Extraction: A
Survey. 2024.
2 Brown T, Mann B, Ryder N, et al. Language Models
are Few-Shot Learners. Advances in Neural
Information Processing Systems . 2020;33:1877 –
901.
3 Agrawal M, Hegselmann S, Lang H, et al. Large
Language Models are Few -Shot Clinical
Information Extractors. 2022.
4 Dagdelen J, Dunn A, Lee S, et al. Structured
information extraction from scientific text with
large language models. Nat Commun .
2024;15:1418. doi: 10.1038/s41467 -024-45563-x
5 Ollama. https://ollama.com (accessed 9 August
2024)
6 Gerganov G. ggerganov/llama.cpp. 2024.
7 Efficient Memory Management for Large Language
Model Serving with PagedAttention | Proceedings
of the 29th Symposium on Operating Systems
Principles.
https://dl.acm.org/doi/10.1145/3600006.3613165
(accessed 9 October 2024)
8 Renze M, Guven E. Self -Reflection in LLM Agents:
Effects on Problem -Solving Performance. 2024.
9 Harrington F, Rosenthal E, Swinburne M. Mitigating
Hallucinations in Large Language Models with
Sliding Generation and Self -Checks.
10 Wang X, Zhou W, Zu C, et al. InstructUIE:
Multi-task Instruction Tuning for Unified
Information Extraction. 2023.
11 Xie T, Li Q, Zhang Y, et al. Self-Improving for
Zero-Shot Named Entity Recognition with Large
Language Models. 2024.
12 Jahan I, Laskar MTR, Peng C, et al. Evaluation
of ChatGPT on Biomedical Tasks: A Zero -Shot
Comparison with Fine -Tuned Generative
Transformers. The 22nd Workshop on Biomedical
Natural Language Processing and BioNLP Shared
Tasks. Toronto, Canada: Association for
Computational Linguistics 2023:326 –36. 13 Wadhwa S, Amir S, Wallace B. Revisiting
Relation Extraction in the era of Large Language
Models. In: Rogers A, Boyd -Graber J, Okazaki N,
eds. Proceedings of the 61st Annual Meeting of the
Association for Computational Linguistics (Volume
1: Long Papers) . Toronto, Canada: Association for
Computational Linguistics 2023:15566 –89.
14 Karkera N, Acharya S, Palaniappan SK.
Leveraging pre -trained language models for mining
microbiome -disease relationships. BMC
Bioinformatics . 2023;24:290. doi:
10.1186/s12859 -023-05411-z
15 Hu Y, Chen Q, Du J, et al. Improving large
language models for clinical named entity
recognition via prompt engineering. Journal of the
American Medical Informatics Association .
2024;31:1812 –20. doi: 10.1093/jamia/ocad259
16 Hugging Face Hub documentation.
https://huggingface.co/docs/hub/en/index
(accessed 25 October 2024)
17 Meoni S, De la Clergerie E, Ryffel T. Large
Language Models as Instructors: A Study on
Multilingual Clinical Entity Extraction. The 22nd
Workshop on Biomedical Natural Language
Processing and BioNLP Shared Tasks . Toronto,
Canada: Association for Computational
Linguistics 2023:178 –90.
18 Sun W, Rumshisky A, Uzuner O. Evaluating
temporal relations in clinical text: 2012 i2b2
Challenge. J Am Med Inform Assoc . 2013;20:806 –
13. doi: 10.1136/amiajnl -2013-001628
19 Stubbs A, Kotfila C, Uzuner O. Automated
systems for the de -identification of longitudinal
clinical narratives: Overview of 2014
i2b2/UTHealth shared task Track 1. J Biomed
Inform. 2015;58:S11 –9. doi:
10.1016/j.jbi.2015.06.007
20 Henry S, Buchan K, Filannino M, et al. 2018
n2c2 shared task on adverse drug events and
medication extraction in electronic health records.
J Am Med Inform Assoc . 2019;27:3 –12. doi:
10.1093/jamia/ocz166
21 Dubey A, Jauhri A, Pandey A, et al. The Llama
3 Herd of Models. 2024.
Page 9:
22 Hsu E, Roberts K. Leveraging Large Language
Models for Knowledge -free Weak Supervision in
Clinical Natural Language Processing. 2024.