Before you can even think about building an algorithm to read an X-ray or interpret a blood smear, the machine needs to know what's on an image. All the promise of AI in healthcare – an area that had attracted $11.3 billion in private investment by 2021 – cannot be realized without carefully labeled datasets that tell machines exactly what they are looking at.
Creating those tagged datasets is becoming an industry in itself, with some companies far north of unicorn status. Today, Encord, a small startup fresh out of Y Combinator, wants in on the action. Aiming to generate labeled data sets for computer vision projects, Encord has launched the beta version of its AI-assisted labeling program, CordVision. The launch follows pilot programs at Stanford Medicine, Memorial Sloan Kettering and King's College London. It has also been tested by Kheiron Medical and Viz AI.
Encord has developed a set of tools that allow radiologists to zoom in on DICOM images – DICOM being the format universally used to transmit medical images. And instead of requiring a radiologist to sit down and annotate an entire image, the software is designed to ensure that only the most important parts of each image need to be labeled.
Encord was founded in 2020 by Eric Landau, who has a background in applied physics, and Ulrik Stig Hansen. Hansen's master's thesis project at Imperial College London focused on visualizing large medical image data sets, and it was he who first noted how time-consuming it was to put together labeled data sets.
Those labeled data sets matter because they provide the "ground truths" from which algorithms learn. There are ways to build AI that don't require labeled datasets, but for the most part AI – especially in healthcare – has relied on supervised learning, which does.
To create a labeled data set, doctors often literally go through images one at a time, drawing polygons around relevant features. Other times the work is done with open-source tools or sensors. Either way, the scientific literature suggests this step is a major bottleneck for AI in healthcare – especially in radiology, a field where AI was expected to make great strides but has so far largely failed to deliver major paradigm shifts.
"I know there is a lot of skepticism [of AI in the medical world]. We think progress is very slow," Landau told MovieUpdates. "We think moving to an approach where you really think about the training data will help accelerate the progress of these models."
As the authors of a 2021 paper in Frontiers in Radiology note, it can take human labelers as much as 24 work-years to label a data set of about 100,000 images. A separate 2021 position paper issued by the European Association of Nuclear Medicine (EANM) and the European Association of Cardiovascular Imaging (EACVI) notes that "obtaining labeled data in medical image analysis can be time-consuming and expensive" – but it also points out that new techniques are emerging that could speed things up.
Ironically, those new techniques are themselves versions of artificial intelligence. For example, that 2021 Frontiers in Radiology article showed that an active learning approach could speed up the process by 87% – cutting the 100,000-image example from 24 work-years to about 3.2.
CordVision is essentially a version of an active learning process called micromodeling. The technique roughly works like this: a team labels a small, representative sample of the images; a specialized AI is trained on that sample and then applied to the wider pool, which it labels automatically. Human reviewers then check the AI's work instead of labeling everything from scratch.
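The loop described above can be sketched in a few lines of Python. To be clear, this is a hypothetical illustration, not Encord's implementation: the feature vectors, the nearest-centroid "micromodel" and the confidence threshold are stand-ins for a real vision model and its uncertainty estimate.

```python
# Sketch of a micromodel-style active-learning loop (illustrative only).
# A tiny hand-labeled seed set trains a crude model, which pre-labels the
# wider pool; only low-confidence items are routed back to human reviewers.

from statistics import fmean


def train_micromodel(labeled):
    """Fit one centroid per class from a small hand-labeled sample."""
    by_class = {}
    for features, label in labeled:
        by_class.setdefault(label, []).append(features)
    return {label: [fmean(col) for col in zip(*rows)]
            for label, rows in by_class.items()}


def predict(centroids, features):
    """Return (label, confidence) for one unlabeled item."""
    dists = {label: sum((a - b) ** 2 for a, b in zip(features, c)) ** 0.5
             for label, c in centroids.items()}
    best = min(dists, key=dists.get)
    worst = max(dists.values())
    confidence = 1 - dists[best] / worst if worst else 1.0
    return best, confidence


def pre_label(centroids, pool, threshold=0.5):
    """Auto-label the pool; send low-confidence items to human review."""
    auto, review = [], []
    for features in pool:
        label, conf = predict(centroids, features)
        (auto if conf >= threshold else review).append((features, label))
    return auto, review
```

In practice the "micromodel" would be a neural network and the pool would be images or video frames, but the shape of the workflow is the same: humans label a sliver of the data, the model proposes labels for the rest, and human effort shifts from annotation to review.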
Landau explains it well in a blog post on his Medium page: imagine creating an algorithm designed to detect Batman in Batman movies. One micromodel would be trained on five images of Christian Bale's Batman. Another might be trained to recognize Ben Affleck's Batman, and so on. Put together, those small parts build the bigger algorithm, which you then set loose on the films as a whole.
"That's something that we found works pretty well, because you can get away with very, very few annotations and kick-start the process," he said.
Encord has published data to back up Landau's claims. For example, a study conducted in collaboration with King's College London compared CordVision to a labeling program developed by Intel. Five labelers annotated 25,744 endoscopy video frames; the gastroenterologists using CordVision moved 6.4 times faster.
The method was also effective when applied to a test set of 15,521 COVID-19 X-rays. Humans labeled only 5% of the images, and the final AI labeling model reached an accuracy of 93.7%.
That said, Encord is far from the only company to have identified this bottleneck and attempted to use AI to speed up the labeling process. Established companies in this space are already reporting large valuations: Scale AI reached a valuation of $7.3 billion in 2021, and Snorkel has reached unicorn status.
The company's biggest competitor, as Landau admits, is probably Labelbox. Labelbox claimed about 50 customers when MovieUpdates covered it at the Series A stage. In January, the company closed a $110 million Series D, putting it near the $1 billion mark.
CordVision is still a very small fish, but it is swimming in a rising tide of data labeling. Landau says the company is targeting teams that still use open-source or in-house tools to create their own data labels.
To date, the company has raised $17.1 million in seed and Series A funding since graduating from Y Combinator, and has grown from its two founders to a team of 20. Encord, Landau says, isn't short on cash: it is not currently fundraising and believes its existing raises will be enough to carry the tool through commercialization.