DEPARTMENT OF HOMELAND SECURITY (DHS) SMALL BUSINESS INNOVATION RESEARCH (SBIR) PROGRAM FY24

Active: No
Status: Closed
Release Date: November 8th, 2023
Open Date: December 18th, 2023
Due Date(s): January 18th, 2024
Close Date: January 18th, 2024
Topic No.: DHS241-002

Topic

Data Labeling and Curation at Scale (DLCS) for Machine Learning Algorithms

Agency

Department of Homeland Security, Science and Technology Directorate

Program

Type: SBIR
Phase: Phase I
Year: 2024

Summary

The Department of Homeland Security (DHS) is seeking proposals for the topic "Data Labeling and Curation at Scale (DLCS) for Machine Learning Algorithms" as part of its Small Business Innovation Research (SBIR) program. The DHS Science & Technology Directorate (S&T) generates large volumes of data that are valuable for developing next-generation detection algorithms. Currently, data collection and annotation are time-consuming and labor-intensive processes. DHS is looking for innovative techniques to accelerate and improve its data collection, labeling, storing, and distribution processes. The focus should be on novel data ingestion, labeling, and curation techniques, with the ability to process various file formats and generate ground truth data. The solution should also support Government-approved cybersecurity standards. The project duration is not specified, but the solution is expected to scale for long-term use by DHS. The application due date for this Phase I solicitation is January 18, 2024. For more information, visit the SBIR topic link or the solicitation agency URL.

Description

The DHS Science & Technology Directorate (S&T) laboratories and DHS component operational partners generate large volumes (up to 10,000 or more measurements per day) of data from test events, prototype demonstrations, or targeted stream of commerce (SoC) data collections. These data are incredibly valuable to DHS and our R&D partners in supporting the development of next-generation detection algorithms, such as those used at airports for on-person and accessible property screening to detect explosives and prohibited items. Currently, any data that are collected must be hand-annotated and stored on physical hard disks. This process is extremely time- and labor-intensive, and it limits DHS's ability to develop curated data sets and share data with R&D partners. R&D partners must also accept the data in the formats and with the labels that were created by hand, as DHS does not currently have the capability to rapidly re-annotate or reformat existing data sets.

DHS is seeking innovative techniques to accelerate and bring additional flexibility to DHS's data collection, labeling, storing, and distribution processes. The current state of the art relies heavily on human labeling and on knowing the desired metadata and curation schemes a priori. Successful solutions will limit the amount of human intervention required to perform these tasks, instead relying on automated software to handle most routine activities. It is assumed that the provided solution may include certain commercial-off-the-shelf (COTS) modules, but the focus of the research should be on novel data ingestion, labeling, and curation techniques. Any COTS modules included should support Government-approved cybersecurity standards, such as FedRAMP approval and/or compliance with FIPS 140-3 specifications.

Capabilities of particular interest include the ability to ingest file formats of interest, such as Hierarchical Data Formats, Digital Imaging and Communications in Security (DICOS, an adaptation of Digital Imaging and Communications in Medicine), and other defined but unusual data types, and then to process the data to assess complexity, identify common features/defined labels, and generate ground truth data for these files. Areas of uncertainty may be flagged for human review at a future time (at which point the human-generated ground truth may be analyzed to enhance the automated tools). Once the data is stored, it should be easy to curate, reprocess (e.g., change file formats or ground truth formats), and distribute as packaged data sets. A successful solution should be able to scale significantly to support long-term use by DHS.
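
As an illustration only, the flag-for-human-review step described above could be sketched as a simple confidence-based triage. This is a minimal sketch under stated assumptions, not a DHS-specified design: the `AutoLabel` record, the `triage` function, the 0.9 threshold, and the example file names are all hypothetical, and a real solution would obtain its confidence scores from whatever automated labeling model it uses.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class AutoLabel:
    """Hypothetical record for one automatically generated label."""
    file_id: str       # identifier of the ingested file (assumed naming)
    label: str         # label proposed by the automated tool
    confidence: float  # tool's confidence in the label, in [0, 1]


def triage(
    labels: Iterable[AutoLabel], threshold: float = 0.9
) -> Tuple[List[AutoLabel], List[AutoLabel]]:
    """Split labels into accepted ground truth and a human-review queue.

    Labels at or above the (assumed) threshold are accepted automatically;
    everything else is flagged as an area of uncertainty for later review.
    """
    accepted: List[AutoLabel] = []
    review: List[AutoLabel] = []
    for item in labels:
        if not 0.0 <= item.confidence <= 1.0:
            raise ValueError(f"confidence out of range for {item.file_id}")
        if item.confidence >= threshold:
            accepted.append(item)
        else:
            review.append(item)
    return accepted, review


if __name__ == "__main__":
    # Hypothetical example inputs; file names are illustrative only.
    batch = [
        AutoLabel("scan_001.dcs", "knife", 0.97),
        AutoLabel("scan_002.h5", "laptop", 0.55),
    ]
    accepted, review = triage(batch)
    print("accepted:", [a.file_id for a in accepted])
    print("needs review:", [r.file_id for r in review])
```

Under this sketch, the human-generated labels for items in the review queue could later be compared against the automated proposals to tune the threshold or retrain the labeling tool, in line with the feedback loop the topic describes.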