Detecting Legal Definitions

Cornell Legal Information Institute (LII) is an organization at Cornell whose goal is to help the public find and understand the law. LII runs a search engine that lets users search US federal and state law. The organization wanted to know whether a system could detect definitions in its corpus, so that the excerpts the engine returns could hyperlink to those definitions. As part of a research project, I developed machine learning models to extract legal definitions from a corpus of state regulations. Some of my contributions were:

creating a custom data labeling tool and building a dataset of over 1,000 annotated examples
generating synthetic data using a translation-based model to improve coverage and robustness
experimenting with various architectures, including RNNs, LSTMs, and transformer-based models
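
The synthetic-data step above does not spell out how the translation-based model was used. One common reading is back-translation: translate a labeled sentence into a pivot language and back, then keep the paraphrase under the original label. The sketch below illustrates only that loop under this assumption. The names `back_translate`, `fake_to_pivot`, and `fake_from_pivot` are hypothetical, the two translators are string-replacing stand-ins for a real translation model, and the example sentences are invented.

```python
from typing import Callable, List, Tuple

# A translator is any function mapping a sentence to a sentence.
Translator = Callable[[str], str]
Example = Tuple[str, int]  # (sentence, label), where 1 = contains a definition


def back_translate(
    examples: List[Example],
    to_pivot: Translator,
    from_pivot: Translator,
) -> List[Example]:
    """Paraphrase each sentence via a round trip through a pivot language.

    Assumes labels are per sentence, so a paraphrase can reuse the label.
    Paraphrases identical to an existing sentence are skipped.
    """
    seen = {text for text, _ in examples}
    augmented: List[Example] = []
    for text, label in examples:
        paraphrase = from_pivot(to_pivot(text))
        if paraphrase not in seen:
            seen.add(paraphrase)
            augmented.append((paraphrase, label))
    return augmented


# Stand-in translators (hypothetical): a real pipeline would call a
# translation model here instead of replacing strings.
def fake_to_pivot(text: str) -> str:
    return text.replace("means", "signifie")


def fake_from_pivot(text: str) -> str:
    return text.replace("signifie", "refers to")


examples = [
    ("'Vehicle' means any motorized conveyance.", 1),
    ("This section takes effect in July.", 0),
]
new_examples = back_translate(examples, fake_to_pivot, fake_from_pivot)
```

If the annotations mark spans rather than whole sentences, the paraphrase would also need its span offsets realigned, which this sketch does not attempt.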

I ultimately achieved an F1 score of ~95% and accuracy over 97% in definition extraction tasks. My tech stack was:

Python
Python’s data science libraries (pandas, NumPy, scikit-learn)
PyTorch
Hugging Face Transformers library
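
For readers unfamiliar with the two metrics quoted above, here is a minimal sketch of how accuracy and F1 are computed for a binary "definition / not a definition" labeling. It is not the project's evaluation code. The function name `binary_metrics` and the toy labels are hypothetical, and the sketch assumes one 0/1 label per unit (sentence or token), which the write-up does not specify.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for 0/1 labels (1 = definition)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, invented for illustration only (not project data).
y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

In this toy case accuracy comes out higher than F1 because true negatives count toward accuracy but not toward F1. The same effect can make accuracy exceed F1 whenever non-definitions outnumber definitions. With scikit-learn, `accuracy_score` and `f1_score` in `sklearn.metrics` compute the same quantities.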