Chart the use of a word or expression by the Supreme Court of Canada over time:
This website allows you to examine how words have fallen in and out of favour with the Supreme Court of Canada since 1877, the earliest year covered by the underlying dataset.
For this project, I relied on the comprehensive dataset of Supreme Court of Canada judgements compiled by Professor Sean Rehaag. This open-access dataset is the first comprehensive collation of the Court’s appellate decisions that is made available to allow scholarly inquiry using machine learning techniques.
It is a collection of approximately 15,500 decisions (1877-2022). For the purposes of this small project, I focus on English-language judgements. That said, before the 1970s some of the Court’s judgements were written in both official languages. For an example of how bilingual judgements used to look, see Roncarelli v. Duplessis,  SCR 121.
The dataset is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). For reference, here is the full citation: Sean Rehaag, “Supreme Court of Canada Bulk Decisions Dataset” (2023), online: Refugee Law Laboratory (https://refugeelab.ca/bulk-data/scc/).
To process the data, I removed all numbers and grammatical features from the words in the decisions. I decided not to lemmatize the words (reduce them to their root), which means that variations of the same word (“right” and “rights”) are treated as separate words. After that, I used the open-source Python library scikit-learn to tabulate the number of words and expressions.
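A minimal sketch of what a cleaning step like this might look like, using only Python's standard library. It assumes that "grammatical features" means punctuation and that text is lowercased first; both are guesses, and the function name `clean_text` and the sample sentence are hypothetical. The actual pipeline may differ.

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, drop digits and punctuation, and collapse whitespace.

    Assumption: "grammatical features" is read here as punctuation.
    No lemmatization is applied, so "right" and "rights" stay distinct.
    Note: the pattern keeps only a-z, so accented letters are dropped too.
    """
    text = text.lower()
    # Replace anything that is not a lowercase letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse the runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical sample sentence, not taken from the dataset.
sample = "The accused's rights under s. 7 were engaged; the right was infringed."
print(clean_text(sample))
```

Because nothing is lemmatized, the cleaned sample still contains both "right" and "rights" as separate tokens.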
This method is not perfect. Typos, optical character recognition errors, and similar problems will all affect the results. Look for trends; don't accept these numbers as gospel.
I used two different methods to represent the changing word use trends: count vectorization and tf-idf vectorization. The count vectorizer data is pretty straightforward: it is just a count of how many times each word appears per year. It's useful (especially for spotting sudden jumps in word usage), but it has a limitation: the number of words the Court issued went up a lot in the twentieth century. Put differently, a graph of raw counts can be misleading because the Court did not issue the same number of words each year. A rising tide lifts all boats and, for some words, an increase over time might not reflect greater usage, just the fact that the Court is issuing more words overall.
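To make the "rising tide" problem concrete, here is a small sketch using scikit-learn's `CountVectorizer` with default settings. The two-year toy corpus and the helper names `raw_count` and `relative_frequency` are hypothetical. The per-year share computed at the end is only there to illustrate the limitation; it is not the normalization used on this site.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: one string per year, standing in for all of the
# decisions issued that year joined together.
texts_by_year = {
    1900: "the appeal is dismissed",
    1950: "the appeal is allowed and the charter argument fails the appeal succeeds",
}

years = sorted(texts_by_year)
vectorizer = CountVectorizer()  # default settings; an assumption
# Rows are years, columns are words, cells are raw counts.
counts = vectorizer.fit_transform([texts_by_year[y] for y in years]).toarray()
vocab = vectorizer.vocabulary_  # maps each word to its column index

def raw_count(word, year):
    """Number of times `word` appears in the text for `year`."""
    return int(counts[years.index(year), vocab[word]])

def relative_frequency(word, year):
    """Share of that year's counted words accounted for by `word`."""
    row = counts[years.index(year)]
    return raw_count(word, year) / row.sum()

for y in years:
    print(y, raw_count("appeal", y), round(relative_frequency("appeal", y), 3))
```

In this toy data the raw count of "appeal" doubles between the two years, yet its share of all words falls, because the later year simply contains more text.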
That's where the tf-idf vectorizer comes in. I grouped all the decisions in a decade together and compared the decades with one another. This gave me a score per word per decade, a relative measure that helps balance out the fact that the number of words issued by the Court increased over time.
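A rough sketch of per-decade tf-idf scoring with scikit-learn's `TfidfVectorizer`, treating each decade's combined text as one document. The toy corpus, the default vectorizer settings, and the helper `tfidf_score` are assumptions for illustration; the settings actually used for this site are not stated here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: one string per decade, standing in for all of the
# decisions from that decade joined together.
texts_by_decade = {
    1900: "the appeal is dismissed with costs",
    1980: "the charter right is infringed the appeal is allowed",
    1990: "the charter claim fails the appeal is dismissed",
}

decades = sorted(texts_by_decade)
vectorizer = TfidfVectorizer()  # default settings; an assumption
# Rows are decades, columns are words, cells are tf-idf scores.
scores = vectorizer.fit_transform([texts_by_decade[d] for d in decades]).toarray()
vocab = vectorizer.vocabulary_  # maps each word to its column index

def tfidf_score(word, decade):
    """tf-idf score of `word` within the combined text of `decade`."""
    return float(scores[decades.index(decade), vocab[word]])

for d in decades:
    print(d, round(tfidf_score("charter", d), 3), round(tfidf_score("the", d), 3))
```

With default settings, a word's count within a decade is weighted up when few decades contain it, and each decade's row is then scaled to unit length, so scores do not grow merely because a decade contains more text. In the toy data, "costs" (found in one decade) outscores "the" (found in all three) within the 1900 row.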
You can also query the database for two-word expressions (e.g. reasonable doubt). I used the same count vectorizer to find them. An expression is defined as a two-word sequence that appears in the Court's jurisprudence in at least ten different years.
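The ten-year threshold can be sketched as follows. This version uses only the standard library so that the filtering logic is explicit; it is not the code behind this site. The function names and the toy corpus are hypothetical. With scikit-learn, a similar filter can be expressed as `CountVectorizer(ngram_range=(2, 2), min_df=10)` when each document holds one year's text.

```python
from collections import defaultdict

MIN_YEARS = 10  # threshold stated in the text

def bigrams(text):
    """Return consecutive two-word sequences from whitespace-split text."""
    words = text.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

def expressions(texts_by_year, min_years=MIN_YEARS):
    """Keep bigrams that appear in at least `min_years` different years."""
    years_seen = defaultdict(set)
    for year, text in texts_by_year.items():
        for bigram in bigrams(text):
            years_seen[bigram].add(year)
    return {b for b, yrs in years_seen.items() if len(yrs) >= min_years}

# Hypothetical toy corpus: "reasonable doubt" appears in ten different years,
# "manifest error" in only three.
texts_by_year = {1950 + i: "beyond reasonable doubt" for i in range(10)}
for year in (1950, 1951, 1952):
    texts_by_year[year] += " manifest error"

print(sorted(expressions(texts_by_year)))
```

Counting distinct years rather than total occurrences means a phrase that is repeated heavily in a handful of decisions does not qualify on that basis alone.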
Please feel free to get in touch with questions. I can be reached at simonwallace [at] osgoode.yorku.ca.