Detecting Legal Definitions

The Cornell Legal Information Institute (LII) is an organization at Cornell whose goal is to help the public know and understand the law. LII runs a search engine that lets users search US federal and state law. The organization wanted to know whether a system could detect definitions in its corpus, so that it could add hyperlinks to those definitions in the excerpts the search engine returns. As part of a research project, I developed machine learning models to extract legal definitions from a corpus of state regulations. Some of my contributions were:

  • creating a custom data labeling tool and building a dataset of over 1,000 annotated examples

  • generating synthetic data using a translation-based model to improve coverage and robustness

  • experimenting with various architectures, including RNNs, LSTMs, and transformer-based models

I ultimately achieved an F1 score of about 95% and an accuracy above 97% on definition extraction tasks. My tech stack was:

  • Python

  • Python’s data science libraries (pandas, NumPy, scikit-learn)

  • PyTorch

  • Hugging Face Transformers library
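
For readers unfamiliar with the two metrics quoted above, here is a minimal sketch of how F1 and accuracy are computed for a binary "definition / not a definition" labeling. It is only an illustration, not the project's evaluation code: the function name `binary_metrics` and the toy labels are made up for this example and are unrelated to the real dataset or results.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) for binary labels.

    Assumption for this sketch: label 1 means "definition", label 0 means
    "not a definition".
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# Toy labels, invented for illustration only.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]

accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Reporting both numbers is useful when one class is much rarer than the other: accuracy counts correct negatives, while F1 only rewards correctly found positives. In practice, scikit-learn's `accuracy_score` and `f1_score` compute the same quantities.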