With all the analytics companies being bought up recently, I made the above comment to Doug Austin while discussing his column from a few weeks back on the new NIST standards for developing trust in Artificial Intelligence programs (https://ediscoverytoday.com/2021/06/29/nist-publishes-new-study-establishes-model-for-trust-and-artificial-intelligence/).

But maybe the analogy isn't a good one. A better one may be that this is a case where we DO want to see the sausage being made. I've talked about this before (https://www.digitalwarroom.com/blog/is-ai-the-fight-club-of-legal-technology-and-ediscovery), but the NIST paper dives into some detailed analysis.

NIST talks about trust as a key element in getting lawyer buy-in on AI. True, but first it would be nice if vendors explained to us what they actually mean by AI. Keep in mind that eDiscovery is only one of many uses for AI. The ABA lists the possible uses in legal as:

  • e-Discovery
  • expertise automation
  • legal research
  • document management
  • contract and litigation document analytics and generation, and
  • predictive analytics

(see the full article online at https://www.americanbar.org/groups/professional_responsibility/publications/professional_lawyer/27/1/the-future-law-firms-and-lawyers-the-age-artificial-intelligence/)

So explaining how “revolutionary” or “groundbreaking” your AI is only helps me if I know specifically how it works in my particular use case.  Specifically works.

Think I'm exaggerating the definition problem? Here are some examples of what vendors say about AI, taken right off their websites.

 AI is the next frontier.

AI is the future of eDiscovery

Our AI uses cutting edge artificial intelligence and machine learning.

Our AI gives extraordinary results.

AI … Believe the Hype

Embrace the groundbreaking magic of artificial intelligence with (name deleted)

Precise predictions in a fraction of the time required for traditional review

Infusing AI across the entire E-Discovery process

(name deleted) artificial intelligence capabilities are built on top of the latest innovations in Deep Learning, Natural Language Processing, compute intensive hardware processing and other related architecture approaches (editor's note: "compute" is not a typo)

And my personal favorite

yeah … this is what the future feels like

Perhaps part of the problem is that people aren't really sure how these programs work. I don't say that to be critical but to point out how difficult this subject is. Tess Blair of Morgan Lewis does a wonderful presentation on the ethics of AI, which includes an equally wonderful overview of the history of the subject. In it, she has a slide with a quote from an MIT Technology Review article that says "No one really knows how the most advanced algorithms do what they do. That could be a problem." Duh!! Can't put anything over on those Techies.

You can see her entire slide deck at https://www.morganlewis.com/-/media/files/publication/presentation/webinar/2020/mayrathon/tech-mayrathon_ethics-of-ai-for-legal-profession_21may20.pdf. I highly recommend you take a look at the whole thing.

In his article, Doug uses the ATM example: "I trust it because it always gives me money." But unlike eDiscovery, when I use the ATM, the results are completely predictable. I know I want 40 bucks, I know I have at least that much in my account, and I know that if the ATM is working it will give me 40 bucks.

My trust in the ATM is well established because it is a simple system based on how much I have in my bank account. I know what the result is going to be, and I also know if I have a problem, I can call the bank and resolve it right away.

Perhaps that describes how lawyers work in eDiscovery more than we want to admit, retrieving what we already know is there, but is it really a test for trusting AI? If I know the expected result, do I need AI? Wouldn't simple searches help me find what I already know is the right answer? Sort of like the "Check your balance" key on the ATM. OK, yup, I have 60 bucks. Give me 40. Thanks.

Many years ago, a colleague in the Federal Defenders office was being given a demo of a new review tool. After he saw all the searches and filters and reports he could generate, the rep said to him, "So what do you think?"

He paused for a minute and said, "I think I need a button on the main screen that lets me search for my client's name." He knew exactly what he needed from the documents when they first came in and wanted a simple way to get the result. Same thing.

The other analogy people like to use for AI is the comparison to online music streaming services. I hate this one. Why? Because the music in these programs is specifically curated to give you the information you want.

"Curated." As in already searched, tagged, indexed, and ready for you to scroll through. When Pandora got into the streaming business, they used what they called the "Music Genome Project," a review of 700,000 songs by 80,000 artists. The idea was to have musicians listen to and decode a song's "DNA" and then categorize it according to different musical qualities such as meter and tonality. Numbers are assigned to over 450 different attributes, and when you search, Pandora's algorithm goes to work, finding other songs that match those same qualities.
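To make the contrast with eDiscovery concrete, here is a minimal sketch of the kind of attribute matching that curation makes possible. The songs, attributes, and scores below are invented for illustration; they are not Pandora's actual genome data.

```python
import numpy as np

# Hypothetical "genome" scores: each song has already been rated by a human
# analyst on a handful of musical attributes (Pandora uses 450+; three shown).
# Attribute order: [minor tonality, syncopation, acoustic guitar prominence]
catalog = {
    "Song A": np.array([0.9, 0.2, 0.7]),
    "Song B": np.array([0.1, 0.8, 0.1]),
    "Song C": np.array([0.8, 0.3, 0.6]),
}

def similar_songs(seed, catalog):
    """Rank the other songs by cosine similarity to the seed song's attribute vector."""
    seed_vec = catalog[seed]
    scores = {}
    for title, vec in catalog.items():
        if title == seed:
            continue
        cosine = np.dot(seed_vec, vec) / (np.linalg.norm(seed_vec) * np.linalg.norm(vec))
        scores[title] = round(float(cosine), 3)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(similar_songs("Song A", catalog))  # Song C ranks first: the closest attribute profile
```

The algorithm is the easy part; the hard part, scoring every song by hand, was done by people long before you ever hit search.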

Hardly describes the average eDiscovery project, right? Sure, it would be nice if that 2TB of data was already categorized and indexed when you start your search, but it isn’t.

Furthermore, what if they get it wrong? What if they label Jimmy Buffett as country, not rock? For that matter, what is rock? Is it Kiss, the Stones, AC/DC, Metallica, Gnarls Barkley, Tom Waits? What about the Beatles? How does Yesterday stack up against Yer Blues, or I Wanna Hold Your Hand against Golden Slumbers/Carry That Weight/The End? And what in the wide, wide world of sports is Revolution #9? Someone run that one by Yoko.

Spotify uses a different approach, which they call collaborative filtering: recommendations built from music liked by other users and from data gathered from millions of music blogs. For new music, Spotify analyzes the audio itself, training its algorithm to recognize different characteristics of the music, such as harmony or distorted guitars.
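For the collaborative filtering side, a stripped-down sketch looks something like this. The listeners, songs, and play data are invented; nothing here reflects Spotify's actual models.

```python
import numpy as np

# Rows = listeners, columns = songs; 1 means the listener saved or replayed the song.
songs = ["Ditty Wah Ditty", "Ooh La La", "Green Onions", "Moondance"]
plays = np.array([
    [1, 1, 0, 1],   # listener 1
    [1, 1, 1, 0],   # listener 2
    [0, 1, 1, 0],   # listener 3
    [1, 0, 1, 1],   # listener 4
])

def songs_like(seed, plays, songs):
    """Rank other songs by how closely their listener overlap matches the seed song's."""
    i = songs.index(seed)
    seed_col = plays[:, i]
    sims = []
    for j, title in enumerate(songs):
        if j == i:
            continue
        col = plays[:, j]
        cosine = np.dot(seed_col, col) / (np.linalg.norm(seed_col) * np.linalg.norm(col))
        sims.append((title, round(float(cosine), 2)))
    return sorted(sims, key=lambda kv: kv[1], reverse=True)

print(songs_like("Ditty Wah Ditty", plays, songs))
```

Note what is doing the work here: other people's listening behavior, not anything about my matter, my custodians, or my documents.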

Spotify recommendations start on the home screen with an AI system called Bandits for Recommendations as Treatments, or BaRT, which is the alternative to playlists. Like Pandora, it lets you add songs you find to your "channel," which is where both services resemble continuous active learning.
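Under the hood, a "bandit" recommender balances playing what it already thinks you like against trying something new to learn more, which is where the continuous active learning comparison comes from: feedback on what gets served shapes what gets served next. A toy epsilon-greedy version, with made-up genres and like rates:

```python
import random

random.seed(0)
# Each "arm" is a candidate category of song (or, in the CAL analogy, a bucket
# of documents). The true like rates are hidden from the algorithm; it only
# sees the thumbs-up / thumbs-down feedback it collects along the way.
true_like_rates = {"swamp rock": 0.7, "soul": 0.5, "novelty": 0.1}
plays = {arm: 0 for arm in true_like_rates}
likes = {arm: 0 for arm in true_like_rates}
epsilon = 0.1  # 10% of plays explore; the rest exploit the current best estimate

for _ in range(1000):
    if random.random() < epsilon:
        arm = random.choice(list(true_like_rates))                               # explore
    else:
        arm = max(plays, key=lambda a: likes[a] / plays[a] if plays[a] else 0.0)  # exploit
    reward = 1 if random.random() < true_like_rates[arm] else 0                  # simulated feedback
    plays[arm] += 1
    likes[arm] += reward

print({arm: round(likes[arm] / plays[arm], 2) for arm in plays if plays[arm]})  # learned estimates
print("most played:", max(plays, key=plays.get))
```

The feedback loop is the similarity; the difference is that in CAL it is your review calls, not a crowd's listening habits, that train the model.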

I found Ry Cooder in the artist list, saw one of my favorite songs of his, Ditty Wah Ditty, and created a "radio station." Spotify then showed me 50 songs, 7 by Ry and 43 others, including Ooh La La by the Faces, Yes We Can Can by Allen Toussaint, Who Do You Love by Bo Diddley, Big Chief by Professor Longhair, Green Onions by Booker T and the MGs, Moondance by Van, and Little Red Rooster by Howlin' Wolf. Huh?

Then I did a specific title search for “Happy”. Lots of results. Songs, albums, artists.  Lots.  But no Keith on Exile on Main Street.

Hmmmm. I’m thinking maybe I’m not on the same page with a lot of those 248 million users.

So, to return to the opening comment. Aren’t we looking at the wrong end of the sausage making process if we only look at and rate results?  If I go to the store and buy sausage, I don’t randomly buy 5 or 6 or even 10, take them home, throw them on the grill with some ribs and corn and then taste them all to decide which one I like.

First, I decide if I want creole sausage, hot links, beer brats, andouille (yummy), Jimmy Dean pork sausage, bratwurst, knockwurst, etc. Then I find the type I am looking for and check the labels for ingredients. Finally, I decide on a price point for the ones I am interested in. THEN I test the 2 or 3 that are left.

Shouldn't we do the same with AI? Because let's remember here, folks, that no two algorithms bring the same results. So no two AI processes bring the same results, the same documents, to the front of the line. Saying I have reliable results from ACME AI means, well, that I have reliable results from ACME AI. It tells me nothing about the results from RoadRunner AI.

In fact, it’s almost a guarantee that the results will be different. I’m not saying the accuracy or reliability will be different, I’m saying the actual results, the documents found, will be different. So how can I compare reliability scores if what they are saying is reliable is different from program to program?
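One way to see the problem is to compare what two tools actually return, not just the scores they report. A back-of-the-envelope sketch; the vendor names are borrowed from the paragraph above, and every document ID and relevance call is invented:

```python
# Hypothetical hit lists from two review tools run over the same collection,
# plus a hand-reviewed "gold" set of documents known to be relevant.
acme_hits = {"DOC-001", "DOC-004", "DOC-007", "DOC-010", "DOC-012"}
roadrunner_hits = {"DOC-002", "DOC-004", "DOC-009", "DOC-010", "DOC-015"}
gold_relevant = {"DOC-001", "DOC-002", "DOC-004", "DOC-009", "DOC-010", "DOC-012"}

def recall(hits, gold):
    """Fraction of the known-relevant documents a tool actually found."""
    return len(hits & gold) / len(gold)

overlap = acme_hits & roadrunner_hits
jaccard = len(overlap) / len(acme_hits | roadrunner_hits)

print("found by both tools:", sorted(overlap))                             # only 2 documents
print(f"overlap (Jaccard): {jaccard:.0%}")                                 # 25%
print(f"ACME recall: {recall(acme_hits, gold_relevant):.0%}")              # 67%
print(f"RoadRunner recall: {recall(roadrunner_hits, gold_relevant):.0%}")  # also 67%
```

Identical recall, two documents in common. The scores agree; the productions do not.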

How do I know the results will not be the same? Because a law librarian told me so. Growing up in Vermont, I remember many a long walk up and down the snow-covered hill where we lived to the old Carnegie-funded library, a massive brownstone building. One of many such libraries I visited growing up in New England.

I came to believe in the giddy days of my yute that librarians were the font of all wisdom. And as Gayle never tired of reminding me, “when you absolutely, positively need the answer overnight, ask a librarian.” So when I saw an article by Bob Ambrogi about a study by law librarians showing that different legal research platforms deliver surprisingly different results, I paid attention.

Bob reported on a draft research paper entitled The Algorithm as a Human Artifact: Implications for Legal [Re]Search. Some of the findings were presented by the author, Susan Nevelow Mart, director of the law library and associate professor at the University of Colorado Law School, in a program at the annual meeting of the American Association of Law Libraries. In 2017.

https://abovethelaw.com/2017/07/legal-research-services-vary-widely-in-results-study-finds/?rf=1

Bob noted that:

"Mart's exploration of the differences among research services was spurred in part by an email she received from Mike Dahn, senior vice president for Westlaw product management at Thomson Reuters, in which he noted that "all of our algorithms are created by humans." Why is that statement significant? Because if search algorithms are built by humans, then those humans made choices about how the algorithm would work. And those choices, Mart says, become the biases and assumptions that get coded into each system and that have implications for the results they deliver."

"Significant" is the right word. There was hardly any overlap in the cases that appeared in the top 10 results returned by each database: only about 7 percent of the cases were returned in the results of all six databases. Likewise, relevancy differed widely among the search engines, with the percentage of relevant cases delivered ranging from a high of 67% to a low of 40%.

And given that figure, how many of those unique cases were relevant? The best figure came from a vendor whose returned cases were 33.1 percent both unique and relevant, while at the low end just 8.2 percent of a vendor's unique cases were also found to be relevant.

Mart’s conclusion?  “Legal research has always been an endeavor that required redundancy in searching; one resource does not usually provide a full answer, just as one search will not provide every necessary result. This study clearly demonstrates that the need for redundancy in searches and resources has not faded with the rise of the algorithm.”

The answer, Mart believes, is that legal research companies need to be much more transparent about the biases in their algorithms. They need to employ what she called “algorithmic accountability”.  I’d suggest the same is true for AI.

Given that caution, here's another recent article on a way to look at validating AI by tying it first to the values of the underlying business model. It's called Defining "Value" – the Key to AI Success and you can find it at https://www.datasciencecentral.com/profiles/blogs/defining-value-the-key-to-ai-success. Of course it's from a librarian. In this case, Jeff Brandt, the CIO at Jackson Kelly and editor of Pinhawk Law Technology Daily Digest.

Now the article does define AI success in a manner that may not fit the eDiscovery workflow, but the point is that it asks the user to think like a data scientist. The value proposition is that "… If we want AI to work for us humans (versus us humans working for AI), then we must thoroughly define "value" before we start building our AI ML models."

Now unfortunately, thinking like a data scientist means using this chart.

But that's OK, because to help the process, NIST decided to coin an acronym for AI user trust characteristics in their model called Perceived System Trustworthiness. That's right. PST, an acronym we already use in legal data management.

And if that’s not confusing enough, then consider the following formula they devised for determining (their) PST:
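The draft's actual formula isn't reproduced here, and what follows is a loose paraphrase rather than NIST's own notation, but the gist is a user-weighted sum over trustworthiness characteristics such as accuracy, reliability, and explainability:

$$\text{PST} \approx \sum_{i=1}^{n} w_i \, c_i$$

where $c_i$ is the user's perception of how well the system delivers characteristic $i$, and $w_i$ is how much that characteristic matters for the task and the risk at hand.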

Well sure. I mean who doesn’t know that?

OK, maybe thinking like a data scientist isn’t the way to go. Maybe we should start by thinking like people who don’t speak math as their first language. But the point is we do need to examine how we look at AI and I’d suggest that the best place to start is at the front end.

This isn't Star Trek, nor should we just "… believe the hype." We have an ethical duty to truly understand this technology in order to be able to explain it to our clients and the Court, when required. I once wrote that AI should really stand for Attorney Intelligence (https://www.discoveredt.com/blog/ai-should-stand-for-attorney-intelligence).

Bob Ambrogi may have put it best in the article I mentioned earlier: "We can push companies to be more transparent about their algorithms, but in the meantime, we should remember that time-worn piece of advice: consider the source."