Tuesday, September 29, 2020

Pittsburgh’s child welfare predictive analytics model has a failure rate of up to 99.8% – according to their own study. (So why are they calling this success?)

Pittsburgh's predictive analytics algorithm slaps a "scarlet number" risk score on every child who is the subject of a report alleging neglect. And they're trying to do it to every child at birth.

Yesterday’s post to this blog noted that in Australia, an algorithm that wreaked havoc in the lives of poor people receiving public assistance was ruled illegal after it had a failure rate of 20 percent.  But apparently, in America, if you just whisper the words “child abuse” in everybody’s ears, even a failure rate of up to 99.8 percent is o.k.

          Here’s the thing about those predictive analytics algorithms that supposedly can predict who is going to abuse a child: They tend to fail spectacularly.

           In Los Angeles, the county decided not to roll out a predictive analytics model known as AURA after preliminary tests showed that 95 percent of the time when the algorithm predicted that a parent or guardian would do something terrible to a child – they didn’t.

           In Illinois, a program spreading across the country, Rapid Safety Feedback, managed to sound alarms on thousands of innocent families – and miss children in real danger.

           But what about Allegheny County, Pa. (metropolitan Pittsburgh)?  That’s the program that was supposed to be different from all the others.  That’s the one reporters all over the country have swooned over.  That’s the one where the algorithm is supposedly transparent (it isn’t). That’s the one certified ethical (but one of the ethics reviewers co-authored papers with one of the program’s designers).  Turns out the Allegheny Family Screening Tool doesn’t get it wrong 95 percent of the time.  AFST gets it wrong up to 99.8 percent of the time.  That’s according to a study co-authored by the people who created AFST in the first place.

           And yet, they’re touting this as a success. 

How AFST works

           AFST is used to screen reports of neglect and decide whether to send out a caseworker to investigate. (In Pennsylvania, any abuse report sent on from the state’s child abuse hotline must be investigated.) 

          AFST then consults a vast treasure trove of data gathered disproportionately on poor families without their informed consent and coughs up a risk score between 1 and 20 – a "scarlet number" that can wind up haunting a child for life.  And they're trying to do it to every child at birth.

Twenty is a risk level so high it literally flashes red on the screens of those responsible for deciding whether to send out an investigator – and the humans are strongly discouraged from overriding the algorithm.

           So of course if the risk score is 20, the investigators must be finding lots and lots of horrific child abuse, right?

Study findings

           Well, let’s see what the study found.

           The Pittsburgh study used a methodology similar to the study of AURA in Los Angeles.  The researchers applied the AFST algorithm to past cases.  Then they looked at records from Children’s Hospital of Pittsburgh, a division of the University of Pittsburgh Medical Center (UPMC), to see how many of the children identified by AFST as high risk or low risk turned up at the hospital with an injury the hospital deemed to be child abuse.

           For low-risk cases, two one-hundredths of one percent (0.02 percent) eventually turned up with such an injury.  But of the cases that got a risk score of 20, fully two-tenths of one percent (0.2 percent) – yes, two-tenths of one percent! – actually turned up at the hospital.
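
           For readers who want to check the arithmetic, here is a rough back-of-the-envelope sketch in Python.  The two rates are simply the percentages quoted above; the cohort of 10,000 score-20 cases is hypothetical, invented purely for illustration, and is not a figure taken from the study.

# Back-of-the-envelope check of the percentages discussed above.
# The two rates come from this post's reading of the study; the cohort
# size of 10,000 score-20 cases is hypothetical, for illustration only.

HIGH_RISK_RATE = 0.002    # 0.2 percent of score-20 children later seen with an injury deemed abuse
LOW_RISK_RATE = 0.0002    # 0.02 percent of low-risk children later seen with such an injury

cohort = 10_000                           # hypothetical number of score-20 cases
confirmed = cohort * HIGH_RISK_RATE       # children who later turned up with such an injury
everyone_else = cohort - confirmed        # families investigated on the strength of the score anyway

print(f"Of {cohort:,} hypothetical score-20 cases:")
print(f"  {confirmed:.0f} later appeared at the hospital with an injury deemed abuse")
print(f"  {everyone_else:.0f}, or {everyone_else / cohort:.1%}, did not")
print(f"Relative risk vs. low-risk cases: {HIGH_RISK_RATE / LOW_RISK_RATE:.0f}x")

           That tenfold relative difference is what the researchers tout; the 99.8 percent figure is what the same numbers look like in absolute terms.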

            In other words, the odds of impurity in a bar of Ivory Soap – famously advertised as 99 and 44/100 percent pure, which leaves 56/100 of one percent impure – are greater than the odds of a child labeled extremely high risk by AFST turning up at the Children's Hospital ER with an abuse-related injury.

           The conclusion of the researchers (most of whom also designed AFST) boils down to: See how great our algorithm is?  Those with a higher risk score were more likely to suffer an injury.  Indeed, the New Zealand “institute” run by one of the designers of the study put out a statement declaring:

 New research co-authored by Rhema Vaithianathan [the co-designer of AFST] and Diana Benavides-Prado confirms that children identified as at risk by the Allegheny Family Screening Tool, … are also at considerably heightened risk of injury, abuse and self-harm hospitalisation.

           Nowhere does the statement explain that “considerably heightened” means two-tenths of one percent.

So what about the other 99.8 percent?  Unless a human overrides the algorithm, the worker must go to the door, demand entry, and search the entire home, poking into cabinets and cupboards and refrigerators (which, at the moment, also means increasing the risk of spreading or contracting COVID-19).  They must interrogate every member of the family, often an enormously traumatic experience for a child.  And they may well strip-search the children.

          Roughly 99.8 percent of the time, it will be for nothing.  In addition to inflicting all that needless trauma, workers will have wasted time that could have been used to find the very few children in real danger. 

Disingenuous definitions 

          No doubt AFST proponents will argue that the 99.8 percent figure applies only to the most serious injuries – injuries that involved “hospitalization.”  The statement touting the results repeatedly uses that term, with no further explanation.

           But take a close look at how the study defines hospitalization.  The authors use the term interchangeably with “medical encounters.” While that can mean hospitalization, as in, it was so serious the child had to be admitted to the hospital, it also includes simple visits to an emergency room – which, for many poor people, is the family doctor. 

             It could also be argued that some injured children might show up at other ERs.  But most of them, especially those whose injury is serious or where there's a suspicion of abuse, are likely to wind up at the hospital where the study was done. That's because, according to the authors, "UPMC Children's Hospital is the sole provider of secondary [meaning specialized] care for children in the Allegheny County area."

           Of course, those with a risk score of 20 might also be more likely to be “neglected” – at least in the mind of a caseworker – which would make the true positive rate higher.  But neglect often simply means the family is poor.  We don’t need an algorithm or the child abuse police to find where the poor people are and help them, preferably with money.

             But even if you give AFST so much benefit of the doubt as to assume it's really five times more accurate than this study revealed, that would still mean AFST is wrong 99 percent of the time.
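
             The same kind of arithmetic covers that benefit-of-the-doubt scenario – again a sketch that starts from the quoted 0.2 percent figure, not from anything reported directly in the study:

# Give AFST the benefit of the doubt: assume it is really five times more
# accurate than the quoted 0.2 percent figure suggests.
generous_rate = 0.002 * 5                                  # 1 percent "true positives" among score-20 cases
print(f"Still wrong {1 - generous_rate:.0%} of the time")  # prints: Still wrong 99% of the time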

           None of this stopped the AFST co-designer/study co-author from declaring that their study “proves” AFST can detect actual child abuse, not just system involvement.  But even that isn’t clear.  The study defines child abuse as whatever someone at the hospital said was child abuse – and when it comes to making these judgments, hospitals in general, and UPMC in particular, haven’t always acted wisely.

No amount of tweaking the algorithm is going to improve this track record – and, in one sense, that’s good news.  What the study really shows, once again, is that “child abuse” of the sort that comes to mind when we hear those words actually is extremely rare.  Think of it: Of those children AFST rated at the very highest risk, up to 99.8 percent did not experience a child abuse injury that required either hospitalization or an ER visit.

           There is no way to refine an algorithm to find these very few needles in a huge haystack without sweeping into the net vast numbers of innocent families.

           That may be why even the editors of the journal that published the study, JAMA Pediatrics, appear skeptical.  In an editorial, they write:

Much harm can be done under the umbrella of good intentions, because big data is a big weapon. …  the concerns about the accuracy of the algorithm deployed should be of paramount importance, since the thread of historic biases in large data sets has become increasingly apparent. … The child abuse literature reports that both the evaluation of suspected abuse and subsequent diagnoses can contain racial biases. …

         The editors suggest a better approach they call 

the policy principle of proportionate universalism—broadly providing services or resources without targeting specific families or people. The target of strategies to decrease rates of child maltreatment would be better directed to community-based strategies that support children and families facing adversities and living in poverty.

           That sounds like a fancy way of saying: Find the poor people; send money.