Session Details
Plenary and Panel Sessions
Plenary I
Forum Introduction and Welcome
Date: Tuesday November 19th
Time: 8:30am-9:45am CST
Location: Kyle Field at Hall of Champions
Moderator: Daniel Anderson
Welcome remarks, forum overview, audience discussion.
Speakers:
- TAMU AgriLife Welcome: Dr. Jeff Savell, Vice Chancellor and Dean for Agriculture and Life Sciences
- REE Welcome: Dr. Chavonda Jacobs-Young, USDA Chief Scientist and Under Secretary for Research, Education, and Economics
- OCS Welcome: Dr. Deirdra Chester, Director
- APHIS Welcome: Dr. Michael Watson, Administrator
- NASS Welcome: Mr. Joseph Prusacki, Associate Administrator
- NIFA Welcome: Dr. Manjit Misra, Director
- ERS Welcome: Dr. Spiro Stefanou, Administrator
- ARS Welcome: Dr. Simon Liu, Administrator
- Forum Organizing Committee Chair: Dr. Marlen Eve
Plenary II
Moderated panel to explore AI that drives innovation and to challenge researchers to apply AI tools in their fields to speed innovation and increase efficiency
Date: Tuesday November 19th
Time: 10:00am-12:00pm CST
Location: Kyle Field at Hall of Champions
Moderator: Dr. Paul Bunje, Co-Founder and COO/CSO of Conservation X Labs
Panelists:
- Dr. Lauren Charles, Chief Data Scientist, leader of Applied Artificial Intelligence Systems group at DOE, PNNL
- Dr. Ali Fares, 1890 Land Grant Institutions, Endowed Professor Prairie View A&M University
- Dr. Gregory Hager, NSF lead for the Directorate for Computer and Information Science and Engineering
- Dr. Jason Holmberg, Co-Founder and Executive Director of Wild Me
- Dr. Hannah Kerner, Arizona State University, and NASA AI/ML Lead for NASA Harvest and NASA Acres
- Dr. Holger Klink, Director of the Cornell Univ K. Lisa Yang Center for Conservation Bioacoustics
Plenary III
Moderated Panel: Practical Applications of AI Across the USDA Research Portfolio
Date: Wednesday November 20th
Time: 8:00am-9:10am
Location: Kyle Field at Hall of Champions
Moderator: Dr. Paul Bunje, Co-Founder and COO/CSO of Conservation X Labs
Panelists:
- Hogland, John, Research Forester, USFS
- Liu, George, Research Biologist, ARS
- Lu, Renfu, Research Agricultural Engineer, ARS
- Sartore, Luca, Research Associate, NASS
- Woodward-Greene, Jennifer, Branch Chief of Indexing and Informatics, NAL
Plenary IV
Moderated Panel: Agency Leadership on Policy and Practice, Setting and Communicating Boundaries for Appropriate Use of AI
Date: Wednesday November 20th
Time: 9:20am-10:10am
Location: Kyle Field at Hall of Champions
Moderator: Daniel Anderson
Panelists:
- Dr. Deirdra Chester, Director, OCS
- Dr. Fredy Diaz, Deputy CDO, USDA Enterprise Data Management Center
- Dr. Venu Kalavacharla, Deputy Director IFPS, NIFA
- Dr. Simon Liu, Administrator, ARS
- Dr. Spiro Stefanou, Administrator, ERS
AI Opportunities at Federal Agencies
Session Date: Wednesday November 20th
Session Time: 3:00pm-4:00pm
Session Location: Ross
Session Moderator: Daniel Anderson
Panelists:
- Dr. Fredy Diaz, OCIO, Deputy CDO
- Dr. Kal Kalavacharla, NIFA, Deputy Director IFPS
- Dr. Simon Liu, Administrator, ARS
- Mr. Joe Parsons, Administrator, NASS
- Dr. Spiro Stefanou, Administrator, ERS
Plenary V Scientific Closing
Date: Wednesday November 20th
Time: 4:30pm-5:30pm
Location: Kyle Field at Hall of Champions
- Facilitator Summary with Daniel Anderson
- Review of Trainings
- Closing Remarks with Forum Co-Organizers Dr. Marlen Eve and Dr. Kal Kalavacharla
- Closing remarks from TAMU: Dr. Amir Ibrahim, Associate Director & Chief Scientific Officer Texas A&M AgriLife Research Small Grains Program
Oral Presentations
Applied Tools
Session Date: Wednesday November 20th
Session Time: 10:30am-12:00pm
Session Location: Ross
Session Moderator: Daniel Anderson
1. (10:30-11:00) - Raster Tools: Using Predictive AI To Create Useful Information
Hogland, John, USFS
Jesse Johnson, Fredrick Bunt
Big Data streams and predictive AI are fundamentally changing the way resource management decisions can be made. Remotely sensed data, ever-expanding computer technology, and enhanced processing techniques can provide natural resource managers with depictions of ecosystems at unprecedented spatial and temporal resolutions. While these sources of information are being leveraged to some extent to inform decision making, the sheer amount of data currently being collected has outpaced our ability to efficiently manipulate and use those data to their fullest for decision making. Newer tools, algorithms, and processing approaches are needed to realize the potential of predictive AI coupled with the volume, variety, and velocity of big data streams for natural resources. Important questions related to data scale, relevance, and transformation, as well as the types of tools needed to efficiently extract useful information for decision making, are at the forefront of data science and natural resource management. To that end we have developed a Python-based geospatial processing library called Raster Tools that automates delayed reading and parallel processing while seamlessly integrating popular machine learning libraries and predictive modeling techniques through Python's software ecosystem. To demonstrate the utility of our processing paradigm we highlight two case studies that use Raster Tools to perform spatial, statistical, and predictive AI. Within a natural resource setting, representative training data can be expensive to collect. Our first case study uses Raster Tools, Landsat 8 remotely sensed imagery, and USGS's national elevation dataset to address this issue by creating a well-spread and balanced sample. Our spread-and-balance technique encodes multidimensional predictor variable space into one dimension using pseudo-Hilbert space-filling curve distances, orders those distances, and systematically selects sample observations from the ordered distances.
To illustrate these concepts, we present a Jupyter notebook on Google's Colab. Our second case study uses Raster Tools and the sample collected in our first case study to build an ensemble of K-nearest neighbor (EKNN) models. We then demonstrate how to use our EKNN with Raster Tools to estimate percent forest cover and standard error for every 30-m pixel in the area around Custer Gallatin National Forest in Montana. As in our first case study, we demonstrate our processing approach and evaluate our outputs using Colab. At the forefront of data-driven decision making is the development of spatial, statistical, and machine learning techniques that fully leverage existing hardware and adopt newer processing strategies to integrate big data sources seamlessly and easily with the decision-making process. Packages such as Raster Tools facilitate this integration while also providing functionality that can further our understanding of natural resources and the computational framework to optimize and justify management decisions at both scale and extent. While these tools facilitate Big Data analytics, they also necessitate a broader understanding of the role of spatial data and analyses within decision making. Moreover, they highlight the need for easy access to and integration with various open-source and proprietary software systems. Attendees may access two companion Jupyter Notebooks that are readily available online via GitHub and that run in Colab.
- KNN Sample design - https://github.com/jshogland/SpatialModelingTutorials/blob/main/Notebooks/knn_sample_design.ipynb
- EKNN Modeling - https://github.com/jshogland/SpatialModelingTutorials/blob/main/Notebooks/knn_create_model.ipynb
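The spread-and-balance sampling described above can be sketched in a few lines. This illustration substitutes a Morton (Z-order) encoding for the pseudo-Hilbert space-filling curve used in the talk, since both map multidimensional predictor space to an orderable one-dimensional index; it is a minimal stand-in, not the Raster Tools implementation.

```python
import random

def morton_encode(x, y, bits=10):
    # Interleave the bits of quantized coordinates (Z-order curve),
    # a simpler stand-in for the pseudo-Hilbert encoding in the talk.
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

def spread_balanced_sample(points, n, bits=10):
    """Encode 2-D predictor space into one dimension, order observations
    along the curve, and systematically select n of them."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    xlo, xhi, ylo, yhi = min(xs), max(xs), min(ys), max(ys)
    scale = 2 ** bits - 1

    def quant(v, lo, hi):
        return int((v - lo) / (hi - lo + 1e-12) * scale)

    codes = [morton_encode(quant(x, xlo, xhi), quant(y, ylo, yhi), bits)
             for x, y in points]
    order = sorted(range(len(points)), key=codes.__getitem__)
    # Systematic selection along the ordering spreads the sample
    # across predictor space.
    step = len(points) / n
    return [order[int(k * step)] for k in range(n)]

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(1000)]
sample = spread_balanced_sample(pts, 50)
```

The notebooks linked above implement the full multidimensional version against real Landsat and elevation predictors.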
2. (11:00-11:30) - CropWizard and the Potential for Generative AI in Agriculture
Adve, Vikram S., AIFARMS National AI Institute, University of Illinois
AIFARMS Team
I will start with a very brief overview of the research in the AIFARMS national AI institute for Agriculture, funded by USDA NIFA. In the rest of the talk, I will present the CropWizard project in AIFARMS, which is exploring ways in which generative AI can be used for a wide range of agricultural tasks. Today, the CropWizard system supports interactive question answering and research for agricultural professionals, by consulting a large database of over 400,000 technical documents, including Extension publications and open-access research documents. CropWizard can also answer questions about images and invoke computational tools for quantitative questions; these capabilities are being improved and enhanced for accuracy and scope. Several external users and companies are exploring uses for CropWizard in production settings. Ongoing research is exploring ways in which generative AI can be used for more advanced reasoning, planning, and data discovery, in order to support advanced quantitative decision making for complex, open-ended questions in modern agriculture. More information about these topics is available at:
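CropWizard's ability to invoke computational tools for quantitative questions implies a routing step in front of retrieval. The sketch below is a purely illustrative heuristic router, not the AIFARMS implementation; the cue list is invented.

```python
def route(query):
    """Toy router in the spirit of CropWizard's tool invocation:
    quantitative questions go to a computational tool, everything else
    to document retrieval. The heuristic is purely illustrative."""
    quantitative_cues = ("how many", "average", "per acre", "percent")
    q = query.lower()
    if any(ch.isdigit() for ch in q) or any(cue in q for cue in quantitative_cues):
        return "tool"
    return "retrieval"
```

A production system would route with the LLM itself rather than keywords, but the control flow is the same: classify the question, then dispatch to the right capability.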
3. (11:30-12:00) - An AI-driven decision support tool for real-time Integrated Pest Management in agriculture
Singh, Arti, Iowa State University
Soumik Sarkar, Baskar Ganapathysubramanian, Muhammad Arbab Arshad, Hossein Zaremehrjerdi, Timilehin Ayanlade, Shivani Chiranjeevi, Lucas Nerone Rillo, Venkata Naresh Boddepalli, Yanben Shen, Talukder Jubery, Asheesh K Singh, Adarsh Krishnamurth.
We present an end-to-end artificial intelligence-driven decision support tool that revolutionizes Integrated Pest Management (IPM) by combining state-of-the-art computer vision models with an intelligent conversational agent. Our InsectID model, trained on 16 million images across 4,000 insect species, achieves robust identification capabilities with 97.2% accuracy under field conditions. The companion WeedID system, leveraging 15 million training images spanning 1,581 weed species, demonstrates 96.8% accuracy in diverse agricultural settings. These deep learning models incorporate uncertainty quantification and out-of-distribution detection to ensure reliable real-world performance. The PestIDBot decision support tool integrates the InsectID and WeedID applications with our specialized AgLLM, trained on comprehensive IPM literature and expert knowledge bases. PestIDBot provides context-aware responses to farmers' queries about pest management strategies, chemical interventions, biological control options, and economic thresholds. The platform delivers real-time insights through a unified mobile interface, enabling farmers to make rapid, informed decisions about pest management interventions. Our solution bridges the gap between advanced AI models and practical agricultural applications, demonstrating how integrated technological solutions can enhance sustainable pest management practices while remaining accessible to end-users.
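The uncertainty quantification mentioned above can be illustrated with a simple predictive-entropy abstention rule; the threshold and labels below are hypothetical, and the production models likely use more sophisticated out-of-distribution detectors.

```python
import math

def softmax(logits):
    # Numerically stable softmax over raw model scores.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classify_with_abstain(logits, labels, max_entropy=0.5):
    """Predict a label, or abstain (return None) when predictive entropy
    is high -- a simple proxy for the uncertainty quantification and
    out-of-distribution detection the models incorporate."""
    p = softmax(logits)
    entropy = -sum(q * math.log(q + 1e-12) for q in p)
    if entropy > max_entropy:
        return None  # defer to a human expert instead of guessing
    return labels[max(range(len(p)), key=p.__getitem__)]
```

Abstaining on high-entropy inputs is one standard way to keep field-deployed classifiers from confidently misidentifying species they were never trained on.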
Collaboration and Education
Session Date: Wednesday, November 20th
Session Time: 1:00pm-2:20pm
Session Location: Corps
Session Moderator: Alex Tarter
1. (1:00-1:20) - Smart Demo Farm Sites and Testbeds for AI enabled Technology Education
Khot, Lav, Washington State University
Bernardita Sallato, Markus Keller, R. Troy Peters, Manoj Karkee
This presentation will cover our team's efforts, tied to the Washington (WA) Tree Fruit Research Commission, the WA Wine Commission, and the USDA NIFA-funded AgAID Institute, to establish Smart Demo Farm Sites. These sites (e.g., Smart Apple Orchard, Smart Vineyard) are established not only to generate use-case-specific data for developing AI models but also to serve as testbeds for testing, evaluating, and validating emerging smart agricultural technologies through synergistic public-private partnerships. The presentation will also cover how these testbeds help disseminate knowledge to K-12 students and teachers, support undergraduate internship training, and advance grower education through on-site field days, workshops, and similar events, helping realize meaningful adoption of relevant smart agriculture technologies.
2. (1:20-1:40) - Assessing the Performance of Generative AI in Retrieving Information against Manually-Curated Genetic and Genomic Data
Sen, Taner, ARS
Elly Poretsky, Victoria Blake, Carson Andorf
Curated resources at centralized repositories provide high-value service to users by enhancing data veracity. Curation, however, comes with a cost, as it requires dedicated time and effort from personnel with deep domain knowledge. In this paper, we investigate the performance of a Large Language Model (LLM), ChatGPT, in extracting and presenting data compared with a human curator. To accomplish this task, we used a small set of journal articles on wheat genetics, focusing on traits, such as salinity tolerance and disease resistance, that are becoming increasingly important as climate change continues to impact agriculture globally. The 21 papers were curated by a professional curator for the GrainGenes database (https://wheat.pw.usda.gov). In parallel, we developed a ChatGPT-based retrieval-augmented generation (RAG) question-answering (QA) system and compared how ChatGPT performed in answering questions about traits and quantitative trait loci (QTLs). Our findings show that on average GPT-4 correctly categorized manuscripts 90% of the time and correctly extracted 82% of traits and 63% of marker-trait associations (MTAs). Furthermore, we assessed the ability of a ChatGPT-based DataFrame agent to filter and summarize curated wheat genetics data, showing the potential of human and computational curators working side by side. In one case study, GPT-4 was able to retrieve up to 91% of disease-related, human-curated QTLs across the whole genome and up to 96% across a specific genomic region through prompt engineering. We also observed that across most tasks, GPT-4 consistently outperformed GPT-3.5 while generating fewer hallucinations, suggesting that improvements in LLM models will make generative AI a much more accurate companion for curators extracting information from scientific literature.
Despite their limitations, LLMs demonstrated a potential to extract and present information to curators and users of biological databases, as long as users are aware of potential inaccuracies and the possibility of incomplete information extraction.
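A RAG pipeline like the one described pairs a retriever over curated documents with a generator. A minimal sketch of the retrieval half, using bag-of-words cosine similarity in place of the embedding search a production RAG system would use; the passages are invented stand-ins for curated GrainGenes text.

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def retrieve(question, passages, k=2):
    """Rank passages by similarity to the question: the retrieval half
    of a RAG pipeline. The generation half (here, ChatGPT) would then
    answer using only the top-k passages as context."""
    q = Counter(question.lower().split())
    ranked = sorted(passages,
                    key=lambda p: cosine(q, Counter(p.lower().split())),
                    reverse=True)
    return ranked[:k]

# Invented stand-ins for curated wheat-genetics passages
passages = [
    "QTL for salinity tolerance mapped on wheat chromosome 5A",
    "Harvest machinery maintenance schedule",
    "Marker-trait association for leaf rust resistance in wheat",
]
top = retrieve("which wheat QTL relate to salinity tolerance", passages, k=1)
```

Grounding the generator in retrieved, curated passages is what lets the QA system be audited against the human curator's records.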
3. (1:40-2:00) - Building an AI and Climate-Smart Agriculture and Forestry Community of Best Practice: A Collaborative Approach
Haag, Shawn, University of Minnesota
This presentation proposes a framework for creating a community of best practice focused on integrating AI tools into climate-smart agriculture and forestry. The initiative aims to foster collaboration among USDA researchers, external partners, and practitioners to share insights, align efforts, and accelerate the adoption of impactful AI applications. Through this community, stakeholders will engage in peer learning, applied research exchanges, and knowledge-sharing activities that promote practical AI use cases, such as carbon sequestration monitoring, soil health prediction, and precision farming. This session will explore how USDA and partner organizations can collaborate to build this community, and I will seek input on the community’s structure, implementation, and alignment with USDA priorities.
4. (2:00-2:20) - Uses of Artificial Intelligence in Agricultural Statistics
Abreu, Denise, NASS
Luca Sartore, Linda J. Young
Artificial intelligence (AI) is revolutionizing the agricultural sector in many impactful ways: from precision farming that helps farmers make data-driven decisions by analyzing information from sensors, drones, and satellites; to drones equipped with AI and computer vision technology that can monitor crop health and detect diseases; to autonomous tractors that plant, water, and harvest crops with minimal human intervention, thereby increasing efficiency and reducing labor costs for farmers. The National Agricultural Statistics Service (NASS) is the statistical arm of the United States Department of Agriculture (USDA) with a mission to provide timely, accurate, and useful statistics in service to U.S. agriculture. To achieve this mission, the Agency has been exploring the multifaceted uses of AI in statistical techniques with an emphasis on innovating both research and production activities in the development and dissemination of official statistics. The Agency's subject matter experts are using AI techniques, including neural networks, Random Forest, XGBoost, and quantum-inspired neural networks, to provide accurate crop predictions, aid in the development of environmental policy making, automate routine tasks, streamline workflows, and enhance quality control measures, among other tasks. This presentation will showcase multiple research and production endeavors at NASS and discuss the Agency's vision for the future.
Computer Vision Detection of Foreign Objects
Date: Tuesday November 19th
Session Time: 3:00pm-4:20pm
Session Location: Traditions
Session Moderator: Sunoj Shajahan
1. (3:00-3:20) - Getting Deep in the Weeds (with DeepWeeds)
McCollam, Gerald A., ARS
This presentation highlights a significant challenge in adopting automated weed control: accurately classifying weed species in their natural environments. It centers on a multiclass image dataset called DeepWeeds, which comprises 17,509 images. DeepWeeds was designed to capture the natural variation among weeds and their competing crops in situ. One key question we aimed to explore was whether various forms of data augmentation could improve learning and prediction. We employed transfer learning, image repetition, and image transformation techniques. We also examined different Generative Adversarial Network (GAN) models for generating synthetic images. These approaches were aimed at enriching the dataset and enhancing overall performance. A key finding was the impressive accuracy attained by our modified ResNet-50 model using k-fold cross-validation, even without the use of data augmentation. We demonstrate how this result contributes to a better understanding of the class distribution and design objectives of the DeepWeeds dataset. Furthermore, we outline a pathway for further improving our approach.
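K-fold cross-validation, as used to evaluate the modified ResNet-50, partitions the dataset so every image is held out exactly once. A minimal index-splitting sketch:

```python
def kfold_indices(n, k):
    """Partition n sample indices into k disjoint folds; each tuple is
    (train_indices, held_out_fold). This is the split scheme used to
    cross-validate a classifier without a separate test set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    return [(sorted(set(range(n)) - set(fold)), fold) for fold in folds]

# Five folds over a toy dataset of 10 image indices
splits = kfold_indices(10, 5)
```

Averaging accuracy over the k held-out folds gives a more honest estimate on an imbalanced multiclass dataset like DeepWeeds than a single random split.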
2. (3:20-3:40) - Semi-Automated Training of High-Speed Embedded AI Vision-Transformer Model
Holt, Greg, ARS
Mathew Pelletier, John Wanjura
This presentation details our advancements in machine-vision systems for detecting and removing plastic contamination from cotton, a critical issue for the U.S. cotton industry. The primary source of plastic contamination in marketable cotton bales stems from plastic module wrap used by John Deere round module harvesters. Despite efforts by cotton ginning personnel to remove these plastics during module unwrapping, contamination persists within the gin's processing system. To address this, we initially developed a machine-vision detection and removal system using low-cost color cameras. The system identifies plastic on the gin-stand feeder apron and activates air jets to remove it from the cotton stream. However, the system required extensive manual calibration and tuning, involving 30-50 computers running on low-cost ARM Linux systems, a challenging task for typical gin workers due to its technical complexity. To streamline this, we developed AI models to eliminate the need for manual calibration, significantly reducing manual input and improving system performance. Robust AI models require an extensive number of training images; for example, the GIT and BLIP foundational vision-to-caption models utilized over a million images in their training. Even with transfer learning techniques, a robust model still requires tens of thousands of images, each of which must be manually classified by a human technician. Given the inordinate amount of time and cost associated with manually annotating this many images, we developed a novel approach that leverages GIT and BLIP by running them in parallel and passing their outputs into a custom semantic classifier (trained on 500 images and validated on 2,500 images). The result was a slow but highly accurate AI image classifier capable of automatically classifying our larger image dataset.
This approach allowed for automatically classifying 20,000 images, which were then used to develop a high-speed Vision Transformer (ViT) model designed to autonomously detect difficult-to-identify plastics and "Hand-Intrusion-Events" (HID). The high-speed AI ViT model enables real-time image classification for our plastic detection system by providing the capability to eliminate false positives and thereby perform self-calibration, eliminating the need for skilled personnel, simplifying system operation, and enabling wider stakeholder adoption.
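The caption-fusion idea, running two caption models in parallel and semantically classifying their combined output, can be caricatured as follows. The real system trains a classifier on 500 labeled examples rather than matching keywords; the keyword list here is invented.

```python
def semantic_class(caption_a, caption_b, keywords=("plastic", "bag", "wrap")):
    """Toy stand-in for the custom semantic classifier: fuse the outputs
    of two caption models run in parallel and flag the image when either
    caption mentions a plastic-related term. The production classifier
    was trained on captions, not keywords; this list is illustrative."""
    text = f"{caption_a} {caption_b}".lower()
    return any(k in text for k in keywords)
```

The payoff of this auto-labeling stage is that a slow, accurate classifier can label the large dataset offline, leaving the fast ViT to run in real time at the gin.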
3. (3:40-4:00) - Artificial Intelligence (AI)-Driven Approaches to Phytoplasma Disease Diagnosis in Agriculture
Shao, Jonathan, ARS
Wei Wei
Phytoplasmas are unculturable, plant-pathogenic bacteria responsible for severe crop diseases, such as little cherry and grapevine yellows. These diseases lead to significant crop losses, affecting agricultural productivity, farmers' livelihoods, and global food security. Currently, diagnosing phytoplasma infections relies on labor-intensive molecular techniques, which are impractical for farmers and growers in the field due to the need for expert knowledge and specialized equipment. In response to these limitations, this study aims to develop an AI-based diagnostic system to improve early detection of phytoplasma infections. The study employs tomato plants infected with potato purple top (PPT) phytoplasma as a test case. A training dataset of 8,000 images (4,000 healthy and 4,000 PPT phytoplasma-infected) and a testing/unseen dataset of 1,600 images (800 healthy and 800 PPT phytoplasma-infected) were collected. The 8,000 training images were used in TensorFlow to train five convolutional neural network (CNN) models: four pre-trained architectures (VGG-16, Google Inception V3, NASNet, DenseNet201) using transfer learning techniques, and a custom CNN model. The pre-trained models achieved accuracy rates between 94% and 99%, while the custom model achieved 90% in differentiating healthy and phytoplasma-infected tomato plants. Validation using unseen data (1,600 images) demonstrated strong performance, and ensemble learning is being explored to further enhance accuracy. This research highlights the potential for AI to revolutionize the diagnosis of phytoplasma diseases, making detection faster, more accurate, and accessible to farmers in the field.
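The ensemble learning being explored could be as simple as majority voting across the five CNNs' per-image predictions. A sketch; the class labels are illustrative.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model class predictions for one image by majority
    vote; ties go to the class seen first. A minimal sketch of the
    ensemble step the abstract says is being explored."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical per-model outputs for a single leaf image
models_say = ["infected", "healthy", "infected", "infected", "healthy"]
```

Voting across architecturally diverse models (VGG-16, Inception, NASNet, DenseNet, custom CNN) tends to cancel each model's individual failure modes.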
4. (4:00-4:20)- Turfgrass Germplasm Improvement by Leveraging AI-Based High-Throughput Phenotyping Technologies
Barnaby, Jinyoung, ARS
Yonghyun Kim
Evaluating stress responses in large breeding populations is traditionally labor-intensive and time-consuming, often restricting phenotypic assessments to the final stages of stress. However, the rate of stress progression can vary widely across genotypes. To address this, we developed a low-cost, automated greenhouse-based red/green/blue (RGB) imaging system integrated with a machine learning-based image processing platform. This approach enables time- and labor-efficient monitoring of daily drought progression, facilitating the assessment of genetic performance in drought tolerance within a bentgrass hybrid population. The extensive temporal phenotypic data generated, combined with genomic information, supports quantitative trait loci (QTL) mapping—an essential step toward identifying candidate genes associated with physiological mechanisms of drought tolerance. Ultimately, this will lead to more effective and rapid selection of drought-resilient turfgrass germplasm. Tiller number is a crucial yield indicator across grass species. Many of our bentgrass hybrid lines produce over 1,000 tillers, necessitating an efficient counting method. To this end, we developed a tiller counting model using the You Only Look Once (YOLO)v8 convolutional neural network (CNN) framework, making tiller counting a practical yield metric for large-scale analyses of bentgrass, previously impractical due to labor demands. Together, these machine learning-based phenotyping systems greatly enhance breeding efficiency. Furthermore, the data generated through these systems will support genetic mapping in bentgrass breeding populations, facilitating the identification of genomic regions linked to drought tolerance and yield improvements. This work ultimately advances the development of improved turfgrass varieties for end users.
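Once a YOLOv8 model emits per-image detections, tiller counting reduces to thresholding on confidence. A sketch with hypothetical detections; real model output would supply the box/confidence pairs.

```python
def count_tillers(detections, conf_threshold=0.5):
    """Count per-image detections above a confidence threshold.
    Each detection stands in for the (bounding_box, confidence)
    pair a trained YOLOv8 model would emit for one tiller."""
    return sum(1 for _box, conf in detections if conf >= conf_threshold)

# Hypothetical detections for one pot image: (box corners, confidence)
dets = [((10, 10, 20, 40), 0.91), ((22, 8, 30, 42), 0.77), ((50, 12, 55, 35), 0.31)]
```

The confidence threshold trades missed tillers against double counts and would be tuned against hand counts on a validation set.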
Computer Vision II
Session Date: Wednesday November 20th
Session Time: 1:00pm-2:20pm
Session Location: Ross
Session Moderator: Alexander Hernandez
1. (1:00-1:20) - Automatic Species Classification and Diversity Analysis Using Physical Features and Machine Learning
Serfa Juan, Ronnie O., ARS
Lester O. Pordesimo, Alison R. Gerken
In agricultural environments, even post-harvest, effective pest management relies on timely identification and monitoring of insect species to minimize crop damage and economic losses. This study proposes an automated approach to species classification and diversity analysis by leveraging machine learning (ML) and physical feature extraction from high-resolution images. Several species of beetles are major pests in stored grain environments. Here, we develop a system that identifies and classifies species using image processing techniques to extract key morphological features such as body shape, antennae structure, elytra texture, and coloration. These features are then fed into advanced ML models, including support vector machines, kNN, random forests, Naïve Bayes, and decision trees, to classify species with high accuracy. A significant challenge in pest management is that multiple similar-looking insect species often coexist, contributing to infestations at varying levels of severity. To address this, the proposed system not only classifies individual species but also performs diversity analysis. By identifying and quantifying species richness and abundance, it enables the monitoring of mixed populations, which may require different pest control strategies. Features such as body aspect ratio, circularity, texture patterns, and segment length are automatically detected from images and used to distinguish between species, even in complex mixed-population scenarios. Techniques such as contour detection and segmentation are employed to isolate and measure specific body parts, which serve as key identifiers for each species. Machine learning models are trained on annotated datasets of five common stored-product pest beetle species: Maize Weevil, Rusty Grain Beetle, Sawtoothed Grain Beetle, Red Flour Beetle, and Lesser Grain Borer.
By incorporating both species-specific morphological characteristics and machine learning techniques, the system achieves robust classification performance. Additionally, diversity analysis is applied to monitor temporal shifts in species composition, aiding farmers in understanding changes in pest populations over time. This not only helps in early detection of new pest species but also tracks the effectiveness of pest control methods by observing species abundance before and after treatments. The integration of physical feature extraction with machine learning offers an efficient, automated solution for pest species identification and diversity analysis. The system can be deployed in real-time to aid farmers in making informed pest control decisions, enhancing sustainable agricultural practices and reducing the impact of insect infestations on crop yield.
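Two of the morphological features named above, circularity and body aspect ratio, have simple closed forms once a contour's area, perimeter, and bounding box have been measured by standard contour-detection routines. A sketch:

```python
import math

def shape_features(area, perimeter, width, height):
    """Two of the morphological features named in the abstract:
    circularity (4*pi*A / P^2, equal to 1.0 for a perfect circle) and
    body aspect ratio (long side over short side of the bounding box).
    Area, perimeter, and box sizes would come from contour detection."""
    circularity = 4 * math.pi * area / (perimeter ** 2)
    aspect_ratio = max(width, height) / min(width, height)
    return circularity, aspect_ratio

# Sanity check on a circle of radius 10: area pi*r^2, perimeter 2*pi*r
c, ar = shape_features(math.pi * 100, 2 * math.pi * 10, 20, 20)
```

An elongated beetle such as a Sawtoothed Grain Beetle would score low circularity and high aspect ratio, separating it from rounder species before texture features are even consulted.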
2. (1:20-1:40) - Spatial predictions of soil moisture across a longitudinal gradient in semiarid ecosystems using UAVs and RGB sensors
Duarte, Efrain, ARS
Alexander Hernandez, Peter Porter and Holden Brecht
Unmanned aerial vehicles (UAVs) offer an efficient method for assessing and monitoring physical phenomena, including soil moisture (SM), particularly in semiarid regions. UAV-based RGB sensors were used to collect high-resolution imagery, and hundreds of SM samples were gathered concurrently with the UAV flights across nine study sites over a large latitudinal gradient in the western USA. We evaluated the predictive power of RGB bands, texture metrics, and vegetation indices for estimating SM using machine learning algorithms. The model showed moderately acceptable predictive accuracy (R² = 0.63 under cross-validation; R² = 0.53 under fully independent validation). Texture metrics such as "mean" and "entropy," as well as the Excess Green Index (ExG), showed the greatest predictive power, while RGB bands showed minimal performance. The resulting spatial predictions showed high reliability (α < 0.01) for the states of Utah and California but poorer performance for Idaho and Montana. We provide linear equations for the conversion of raw digital number (DN) values to reflectance, facilitating remote sensing applications that benefit from simple and highly affordable UAV RGB imagery. Our protocol provides a robust pathway to modeling SM with cost-effective solutions for monitoring semiarid ecosystems.
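The Excess Green Index and the linear DN-to-reflectance conversion mentioned above are both one-liners. The gain and offset below are hypothetical placeholders for the per-band coefficients the authors provide.

```python
def excess_green(r, g, b):
    """Excess Green Index (ExG = 2g - r - b) computed on chromatic
    coordinates; one of the strongest soil-moisture predictors found
    in the study."""
    total = r + g + b
    if total == 0:
        return 0.0
    rn, gn, bn = r / total, g / total, b / total
    return 2 * gn - rn - bn

def dn_to_reflectance(dn, gain, offset):
    """Linear digital-number-to-reflectance conversion of the form the
    authors provide; gain and offset here are hypothetical placeholders
    for the published per-band coefficients."""
    return gain * dn + offset
```

Normalizing to chromatic coordinates before computing ExG reduces sensitivity to illumination differences between flights, which matters when imagery spans nine sites.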
DASH Enterprising AI and Phenotyping through Digital Ag Systems Hub
Session Date: Tuesday, November 19th
Session Time: 3:00pm-4:30pm
Session Location: Oak
Session Moderator: Amanda Hulse-Kemp
1. (3:00-3:15) - Enterprising AI and Phenotyping through DASH
Hulse-Kemp, Amanda, ARS
Steven Mirsky, Chris Reberg-Horton
As new methods for artificial intelligence and machine learning become available, it is critical to apply and integrate them into breeding and production agriculture. This integration accelerates breeding programs, enables more precise and sustainable agricultural management, and offers potential for mitigating and adapting to climate change. All activities in agriculture rely on being able to accurately phenotype, or measure, plant (or animal) characteristics in order to make decisions. Translation into the field currently lags in this space, leading to wasteful duplication of labor and resources and a mismatch between targets and deployed technology. We are proud to launch the new Agricultural Research Service initiative, the Digital Ag Science Hub (DASH), to target enterprising AI and ML technologies in breeding and production agriculture. We will share some initial targets of DASH, including working with USDA's SciNet team to expand capacity and utilization of the system. We look forward to working with groups across the agency and across USDA to address stakeholder-driven challenges.
2. (3:15-3:30) - MLOps for deploying and improving models
Reberg-Horton, Chris, North Carolina State University
Steven Mirsky and Amanda Hulse-Kemp
Computer vision is transforming agricultural research, sparking interest across diverse fields such as pest management and plant breeding, with the potential to revolutionize automated plant phenotyping. However, the challenge remains to move from proof-of-concept projects to practical, everyday tools for researchers. The DASH initiative is addressing this gap by developing a unified architecture that facilitates the training, reproduction, improvement, and deployment of machine learning (ML) models. This framework will integrate with platforms such as SciNet for large-scale model training, cloud and edge devices for deployment, and partnerships with key ARS initiatives. DASH will collaborate with SciNet, the AI Center of Excellence, and the Partnerships for Data Innovation (PDI) to ensure that these models are effectively scaled and accessible for agricultural scientists.
3. (3:30-3:45) - PlantMap3D: DASH use-case
Mirsky, Steven, ARS
Amanda Hulse-Kemp, Chris Reberg-Horton
The USDA ARS Digital Agriculture Science Hub (DASH) was established to accelerate the use of AI and phenotyping by enterprising technology development pipelines from proof-of-concept to broadly deployable solutions for crop production researchers, breeders, and farmers. Here, we explore the PlantMap3D pipeline as a use case for how DASH can support other enterprise efforts. PlantMap3D is an affordable, scalable technology for real-time mapping of plant species, biomass, density, and stress. It uses off-the-shelf RGB and stereo cameras combined with open-source software to collect data on crops, cover crops, and weeds. The technology requires high-resolution species and species-specific biomass training data. This is accomplished with an automated phenotyping robotic platform (BenchBot), plants grown in pots under a semi-field environment, images collected of plants grown in the field, and biomass collected by species. The system generates high-resolution maps by combining color imaging for species identification with stereo imaging for estimating plant height and biomass. Customizable for farmers, researchers, and consultants, PlantMap3D can be deployed as a hand-held device, a tractor-mounted system, or on robotic platforms. When combined with web-based decision support tools, PlantMap3D can enable precision weed and nutrient management. Modular, scalable systems like PlantMap3D enable broad access to and deployment of computer vision and AI for myriad technical user levels.
4. (3:45-4:00) - Ag Image Repository: A resource for the Ag Community
Kutugata, Matthew, ARS
Maria Laura Cangiano, Søren Kelstrup Skovsen, Muthu Bagavathiannan, Steven Mirsky, Chris Reberg-Horton
The Agricultural Image Repository (AgIR) is a resource designed to advance computer vision, artificial intelligence, and image-based phenotyping in precision agriculture. AgIR offers a diverse collection of annotated images covering key weeds, cover crops, and cash crops under various growth conditions. Images are categorized into two main scene types: Semi-Field, which allows high-throughput image collection using the BenchBot system, and real-world Field settings that capture natural growth conditions but are limited by manual collection. This combination provides scalability while ensuring real-world relevance, enabling the development of robust AI models for weed detection, crop monitoring, and plant trait analysis. AgIR’s collaborative effort spans nine states and ten research institutions, making it a valuable tool for researchers, developers, agronomists, and industry professionals working on data-driven solutions for sustainable agriculture.
(4:00-4:15) Demo - Visualize automated annotation pipelines
(4:15-4:30) Q&A Session
Data Integration and AI in Knowledge Management
A soil carbon use case
Session Date: Tuesday, November 19th
Session Time: 1:00pm-2:20pm
Session Location: Ross
Session Moderator: Dan Roberts
1. (1:00-1:20) - Standardized Data Pipeline for Soil Organic Carbon Knowledge Graphs
Stewart, Cathy, ARS
Soil organic carbon is critical for sustaining agricultural productivity while also storing atmospheric carbon dioxide. Despite its potential to contribute significantly to climate solutions, uncertainty remains around the measurement and accounting of soil organic carbon storage. National, standardized, long-term (decadal or more) soil change data sets are critical to addressing these questions and for calibrating and validating process-based simulation models. We present a standardized data pipeline based on our efforts to integrate Knowledge Graph methodology and structure with the National Agricultural Library Thesaurus (NALT) concept space to create a query-enabled interface for two decades of soil research data. This standardized data pipeline approach will enable streamlined, consistent access to a cloud-based database, easing the roll-out, maintenance, and discovery of soil organic carbon data by researchers and modelers from a variety of disciplines.
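As a minimal illustration of the kind of standardized pipeline described above, the sketch below flattens tabular soil-carbon records into subject-predicate-object triples keyed to controlled-vocabulary concepts (the concept IRIs and field names are hypothetical placeholders, not actual NALT identifiers):

```python
# Illustrative sketch: flatten tabular soil-carbon records into triples
# keyed to controlled-vocabulary concepts. The IRIs below are placeholders,
# not real NALT concept identifiers.

NALT = {
    "soil_organic_carbon": "https://example.org/nalt/concept/SOC",
    "depth_cm": "https://example.org/nalt/concept/soil-depth",
}

def rows_to_triples(rows):
    """Turn measurement dicts into (subject, predicate, object) triples."""
    triples = []
    for r in rows:
        subj = f"sample:{r['site']}/{r['year']}"      # sample identifier
        for field, value in r.items():
            if field in ("site", "year"):             # identifier fields
                continue
            pred = NALT.get(field, f"local:{field}")  # map to concept IRI
            triples.append((subj, pred, value))
    return triples

rows = [{"site": "A1", "year": 2004, "soil_organic_carbon": 12.3, "depth_cm": 30}]
triples = rows_to_triples(rows)
```

A real pipeline would emit RDF and resolve terms against the NALT concept space; the triple structure keyed to shared concepts is the part this sketch preserves.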
2. (1:20-1:40) - NALT for the Machine Age (N4MA): Cultivating a Semantic System for Data Interoperability
Woodward-Greene, Jennifer, ARS
The National Agricultural Library (NAL), in partnership with the USDA Agricultural Research Service’s Partnerships for Data Innovations (PDI) initiative, is developing a unified system for USDA semantic standards related to vocabulary and semantic modeling for knowledge management. This system aims to make it easier for agricultural researchers to access and apply these standards without needing to be experts in the underlying semantic techniques or technology. The goal is to enhance USDA data interoperability by harmonizing the varied agricultural domain vocabularies to one standard: NAL’s NALT controlled vocabulary. This will improve the quality of agricultural information search, discovery, aggregation, and normalization across the diversity of agricultural research domains. The system also focuses on optimizing computational and metadata curation for text and tabular data, to bring together the semantics for both research literature and research data. This is supported by innovations in the NALT, which has been transformed into a concept space based on the Simple Knowledge Organization System (SKOS) W3C standard. The NALT Concept Space allows for multiple domain-specific vocabularies within NALT, with associated properties, mappings to other standards, and curated SKOS collections derived from data sheets, making NALT a valuable resource for subject indexing and for tagging standard schemas representing experimental or program designs (i.e., which variables are tracked and how they are related). This semantic system aims to simplify the process of gathering and harmonizing semantic data from researchers, and to enhance the depth and reusability of knowledge capture in NALT in a sustainable and cost-effective manner.
3. (1:40-2:00) - SOCKG: A Knowledge Graph for Soil Carbon Modeling
Li, Chengkai, University of Texas at Arlington
The Soil Organic Carbon Knowledge Graph (SOCKG) enhances robust soil carbon modeling, which is crucial for voluntary carbon markets. These markets create incentives for climate-friendly practices by encouraging the retention of carbon in soil, reducing atmospheric carbon levels. Industry sectors, like energy and transportation, purchase carbon credits to offset unavoidable emissions, while farmers and land managers are rewarded for adopting practices that increase soil organic carbon (SOC). For these markets to function effectively, it is essential to have advanced soil carbon modeling technologies that accurately measure SOC content, predict changes, and link those changes to specific agricultural practices. High-quality data is central to this process. By integrating siloed data into a cohesive knowledge graph, linking it to broader datasets, and establishing infrastructure for its sustainability, SOCKG can play a transformative role in enabling and accelerating the growth of this emerging market.
4. (2:00-2:20) - The NOAA Cloud Archive and Digital Twins
Berkheimer, Ryan, NOAA
Recent advancements in scalable computational infrastructure and data management have led to a state where semantic web technology is being increasingly utilized to capture, define, and deliver data in more connected and valuable ways. This trend is demonstrated by the emergence of semantic digital twin pilots, ontology-driven governance, and integrated knowledge networks, such as the National Science Foundation’s Open Knowledge Network. The UN-GGIM Geoverse discussion paper summarized thinking around this state as an opportunity for transitioning to a largely automated and fully integrated virtual reality. It provides guidance on achieving this vision, emphasizing interoperable machine-to-machine communications, federated digital twins, participatory standards, data sovereignty, and democratized wisdom. We present an interoperable knowledge mesh developed at NOAA’s National Centers for Environmental Information (NCEI), called the Open Information Stewardship Service (OISS), aimed at integrating diverse NOAA domains. The OISS framework offers an object-oriented API for accessing a cloud-native knowledge graph, facilitating self-definition of patterns and automated contextualization of records. Utilizing an RDF-encoded dialect of information-related concepts defined by the Open Archival Information System (OAIS), the OISS enables users to define processes and generate contextualized records in an event-driven manner. Stored as JSON-LD files, the OISS supports equitable access patterns and fosters the creation of earth system digital twins and foundation models across NOAA.
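As a toy sketch of what an event-driven, JSON-LD-encoded record could look like (the field names and URN scheme are illustrative inventions, not the actual OISS/OAIS vocabulary):

```python
import json

# Illustrative sketch of a JSON-LD-style contextualized record produced in
# response to an event. Field names and the URN scheme are invented for
# illustration, not the real OISS/OAIS schema.
def make_record(event_id, dataset, context_terms):
    record = {
        "@context": context_terms,                 # term-to-IRI mappings
        "@id": f"urn:example:record:{event_id}",   # globally unique record id
        "@type": "ArchiveInformationPackage",
        "dataset": dataset,
    }
    return json.dumps(record)

doc = make_record("0001", "sst-monthly", {"dataset": "http://example.org/dataset"})
parsed = json.loads(doc)
```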
Disease Transmission Applications
Session Date: Tuesday, November 19th
Session Time: 1:00pm-2:20pm
Session Location: Traditions
Session Moderator: Lindsey Perkin
1. (1:00-1:20) - Geospatial predictions of equine West Nile outbreaks: Leveraging graph neural networks in the Southern Climate Region of the United States
Mooney, Amber, ARS
John Humphreys, Brian Stucky, Lee Cohnstaedt, Mel Boudreau, Chad Fautt, Amy Hudson
West Nile Virus (WNV) poses a significant public health threat to humans and livestock. Its transmission dynamics are influenced by various environmental factors. This study employs geospatial analysis and artificial intelligence techniques to identify and characterize WNV presence, focusing on counties with elevated disease risk in the Southern Climate Region, including the states of Kansas, Oklahoma, Texas, Arkansas, Louisiana, and Mississippi. The aim is to enhance our understanding of high-risk areas for targeted intervention strategies. We applied a graph neural network (GNN) with long short-term memory (LSTM) to identify counties at risk of equine WNV transmission by utilizing data on reported WNV cases in avian, mosquito, and sentinel species; host information such as avian richness and equine and human population density; and climate and environmental variables. The model was trained using data from 2002-2011 and 2013-2017, validated on 2018-2019, and then tested on 2012, the year of a large outbreak. The AI model was most successful when integrating an LSTM layer with five graph convolutional layers. Interestingly, there was little difference in variable importance among input features, indicating input layers may interact in complex ways across the geographic domain. Seasonal to sub-seasonal forecasts of dynamic input layers could be integrated into this AI framework to project regions of elevated risk in upcoming transmission seasons. This advancement would support targeted intervention strategies such as vaccine campaigns or mosquito control. This research contributes to the ongoing efforts to mitigate the impact of WNV by refining our understanding of the spatial determinants of virus transmission and informing evidence-based interventions.
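For readers unfamiliar with this architecture class, a toy NumPy sketch of the graph-convolution-plus-recurrence idea follows (illustrative only; the sizes, random weights, and fully connected adjacency are arbitrary, not the authors' trained model):

```python
import numpy as np

# Toy sketch: counties are graph nodes; each timestep's features are
# smoothed over the county adjacency (graph convolution), then folded
# into a running hidden state (recurrent update), yielding a per-county
# risk score. All weights here are random placeholders.

rng = np.random.default_rng(0)
n_counties, n_feats, n_steps = 5, 3, 4

A = np.ones((n_counties, n_counties))        # toy adjacency (fully connected)
A_hat = A / A.sum(axis=1, keepdims=True)     # row-normalized propagation

W_gc = rng.normal(size=(n_feats, 8))         # graph-conv weights
W_h = rng.normal(size=(8, 8))                # recurrent weights
w_out = rng.normal(size=8)                   # readout weights

h = np.zeros((n_counties, 8))
for t in range(n_steps):
    x_t = rng.normal(size=(n_counties, n_feats))  # climate/host inputs at t
    gc = np.tanh(A_hat @ x_t @ W_gc)              # neighborhood aggregation
    h = np.tanh(gc + h @ W_h)                     # simple recurrent update

risk = 1 / (1 + np.exp(-(h @ w_out)))             # per-county risk in (0, 1)
```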
2. (1:20-1:40) - Development and evaluation of a machine learning model to predict Rift Valley fever virus transmission risk for livestock
Willard, Lory, ARS
Heidi Tubbs, Karlyn Harrod, Bhaskar Bishnoi, Stephanie Schollaert Uz, Claudia Pittiglio, Assaf Anyamba, and Seth Gibson
Rift Valley fever (RVF) is a mosquito-borne viral hemorrhagic zoonosis largely confined to Africa and the Arabian Peninsula, which poses significant threats to public health, the agricultural economy, and food security worldwide. Rift Valley fever virus (RVFV) is transmitted to ungulate livestock primarily through the bite of infectious Aedes and Culex spp. mosquito vectors. Humans can become infected via handling or consumption of fluids or tissue of infectious livestock. Accurate forecasting of RVFV livestock transmission risk is crucial for mobilizing timely interventions and mitigating impacts. We previously implemented a threshold RVFV transmission model based on satellite-derived normalized difference vegetation index (NDVI) data that has been deployed for over a decade. Recently, to improve both the spatial and temporal accuracy of transmission risk, multiple machine learning models were developed utilizing a comprehensive suite of variables, including satellite-derived NDVI and rainfall datasets, human population and livestock distributions, soils and hydrologic data, and records of historical RVFV livestock cases. Recognizing that RVF outbreaks have been associated with El Niño–Southern Oscillation (ENSO) events across Africa, our model considers the unique teleconnection patterns of ENSO (El Niño and La Niña) across three regions of Africa (Southern Africa, Eastern Africa, and the Sahel). Classification models based on the presence or absence of RVF livestock cases were developed using random forest, XGBoost, k-nearest neighbor, support vector machine, and neural network algorithms. Model performance was evaluated by comparing accuracy and ROC curves generated with an independent test set of livestock case data. All models achieved accuracy scores of 80% or greater. Validation against historical livestock case data demonstrates the models' capability to identify high-risk periods and regions with improved precision and lead time, increasing the time available for health officials to implement mitigation measures in more precise locations.
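The multi-algorithm benchmarking loop described above can be sketched with scikit-learn on synthetic data (a hedged illustration: XGBoost and the neural network are omitted to keep dependencies minimal, and the synthetic features merely stand in for the NDVI/rainfall/ENSO suite):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic presence/absence data standing in for the real feature suite.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "random_forest": RandomForestClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
    "svm": SVC(probability=True, random_state=0),
}

# Fit each model, then score accuracy and ROC AUC on the held-out set.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    prob = model.predict_proba(X_te)[:, 1]
    scores[name] = (accuracy_score(y_te, model.predict(X_te)),
                    roc_auc_score(y_te, prob))
```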
3. (1:40-2:00) - Protein-protein interaction prediction and design in virus and antibody systems
Fenster, Jacob, ARS
Paul A. Azzinaro, Mark Dinhobl, Manuel V. Borca, Edward Spinard, Douglas P. Gladue
Recent AI protein structural prediction models (AlphaFold, RosettaFold, etc.) have enabled a significant increase in the success of computational prediction and design of protein-protein interactions. While these tools have high performance on many protein complexes, the successful prediction of virus-virus protein-protein interactions and antibody-antigen interactions remains challenging, due in part to the lack of genetic evolutionary information that is present in proteins that undergo traditional evolution through clonal or sexual reproduction in cellular organisms. This talk presents benchmarking data on predicting genome-wide virus-virus protein-protein interactions in the model Vaccinia virus to gain insight into the performance of AlphaFold2 in African Swine Fever Virus. In addition, in silico success rates of de novo designed viral epitopes to bind neutralizing antibodies for subunit vaccine development will be discussed.
4. (2:00-2:20) - Use of machine vision in an existing fruit packing house system for a quarantine pest as part of a systems approach for export to the U.S.
Simmons, Gregory, APHIS
Food Science Applications
Session Date: Wednesday, November 20th
Session Time: 10:30am-11:50am
Session Location: Reveille
Session Moderator: Jacob Washburn
1. (10:30-10:50) - Machine Learning with Ingredient-Level Food Trees Reveals Contributors to Systemic Inflammation in the American Diet
Larke, Jules, ARS
Danielle Lemay
Background: Methods for modeling the relationship between self-reported diet records and inflammation are limited and lack the rigor to adequately assess dietary complexity. Machine learning (ML) combined with alternative representations of diet may help to improve predictions of health outcomes over traditional methods. Objective: To determine if hierarchical ingredient-level representations of diet improve predictive models of systemic inflammation from a cross-sectional analysis using data on US adults (N=19,460) from the National Health and Nutrition Examination Survey (NHANES). Analysis: Mixed meal disaggregation was performed to generate an ingredient-level representation of diet, which was further annotated to produce a hierarchical data structure, or food tree. Hierarchical feature engineering selected the most informative food tree features for predicting systemic inflammation (C-reactive protein, CRP). ML models were used to assess the accuracy of predicting CRP from the food tree features compared with the Dietary Inflammatory Index (DII) score, and logistic regression was used to calculate marginal effects of ingredients identified from ML models. Results: Representation of diet as an ingredient-level food tree reduced dietary features from 6,412 unique foods to 566 unique ingredients. ML classifiers trained on food tree data predicted high versus low systemic inflammation (CRP tertile) with marginally higher accuracy (0.761) on held-out data compared with models trained using DII scores (0.757). Individual dietary components revealed contributions towards increased inflammation, including fruit punch, soda, and high-fat milk (marginal effects: 0.001 – 0.005, P < 0.05), and foods associated with decreased inflammation, such as herbal tea, coffee, brown rice, and pasta (marginal effects: -0.08 – -0.001, P < 0.05). Conclusions: Specific ingredients, selected from a food tree, perform as well as the DII at predicting systemic inflammation. Choices of common beverages and staples associated with inflammation varied in magnitude and direction, implying that specific dietary swaps (e.g., soda for tea/coffee, white rice for brown rice) have practical use for dietary guidance.
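A minimal sketch of the food-tree roll-up idea follows (the tiny tree, random intakes, and "high CRP" label below are invented for illustration, not NHANES data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative sketch: ingredient-level intakes are rolled up to parent
# nodes of a food tree before modeling. The tree and labels are invented.

rng = np.random.default_rng(1)
tree = {"soda": "beverages", "coffee": "beverages",
        "brown_rice": "grains", "pasta": "grains"}
ingredients = list(tree)

X_ing = rng.random((200, len(ingredients)))       # per-person daily intakes
groups = sorted(set(tree.values()))
# Roll ingredient columns up to their parent node by summing within groups.
X_tree = np.stack([X_ing[:, [i for i, ing in enumerate(ingredients)
                             if tree[ing] == g]].sum(axis=1)
                   for g in groups], axis=1)

# Toy binary outcome standing in for a high/low CRP split.
y = (X_tree[:, 0] > np.median(X_tree[:, 0])).astype(int)
clf = LogisticRegression().fit(X_tree, y)
acc = clf.score(X_tree, y)
```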
2. (10:50-11:10) - AI enabled detection of microbes in food systems
Nitin, Nitin, University of California Davis
Luyao Ma, Howard Park, Jiyoon Yi, Nicharee Wisuthiphaet
The presentation will focus on AI approaches using optical imaging to improve speed, sensitivity, and specificity for detecting microbes, including bacteria and yeast, in food systems, and their applications to enhance food safety and quality. The optical imaging approaches will focus on low-cost imaging measurements to acquire microbial data, and the data analysis discussion will cover various AI/machine learning approaches to detect and quantify the presence of target microbes in food systems. The presentation will also discuss opportunities for industrial applications by simulating the detection of bacteria and yeast in different food matrices, including fresh produce, dairy, and meat products. Finally, future steps to develop these technologies and translate them to field applications will be discussed.
3. (11:10-11:30) - Application of artificial intelligence to enhance potato breeding and genetics
Feldman, Max, ARS
Collins Wakholi, Devin Rippner, Mark Pavek, Manoj Karkee
Quantitative genetics and predictive breeding are data-intensive methods that associate haplotype inheritance and phenotypic characteristics in structured breeding populations. Scientists in the Temperate Tree Fruit and Vegetable Research Unit located in Prosser, WA and cooperators at Washington State University are using automation and machine vision to rapidly and inexpensively capture biologically important measurements from potato tubers. Our team developed an RGB-D imaging conveyor system that utilizes artificial intelligence to detect, track, capture, and extract measurements from images of individual potatoes. This platform was used to evaluate >75,000 individual tubers derived from ~1,300 samples (~32 breeding families). Our approach enables us to rigorously assess the inheritance of potato yield components (size of tubers, number of tubers), tuber shape descriptors, and potato skin color characteristics. We are currently working to train additional deep learning models to detect potato tuber defects including sprouting, growth cracks, secondary growth, and tuber greening.
4. (11:30-11:50) - Dietary polyphenol intake is associated with an altered gut microbiome and lower gastrointestinal inflammation and permeability
Wilson, Stephanie, Texas A&M University
Andrew Oliver (USDA ARS WHNRC), Danielle G. Lemay (USDA ARS WHNRC)
Background: Polyphenols are dietary bioactive compounds that can have anti-inflammatory and anti-oxidative properties. As most polyphenols reach the large intestine, they may influence inflammation within the gastrointestinal (GI) environment and impact the microbiome by shaping microbial community structure. However, few studies have assessed how polyphenol intake relates to GI health and the gut microbiome, particularly at a resolution higher than total polyphenol intake. Thus, we mapped diet data to FooDB to estimate intake of total polyphenols, polyphenol classes, and individual polyphenols in adults, then examined the relationship between polyphenol intake and markers of GI health and the gut microbiome. Methods: Healthy adults (n = 350) were recruited into an observational, cross-sectional study balanced for age, sex, and BMI (ClinicalTrials.gov: NCT02367287). We examined diet using multiple 24-hr dietary recalls (ASA24), then mapped ingredients to polyphenols within FooDB to estimate polyphenol intake. We analyzed whether dietary polyphenol intake at various resolution levels (total, class, compound) relates to systemic and GI inflammatory markers using standard and machine learning analyses. We also paired intake with microbial community profiles from fecal shotgun-sequenced metagenomes (n = 313) to assess whether microbial composition varied with different levels of polyphenol intake. Results: Mean total polyphenol intake was approximately 914 ± 50 (SE) mg per 1000 kcal/d, with flavonoids as the greatest class contributor at 495 ± 38 mg per 1000 kcal/d. Total polyphenol intake negatively associated with the GI inflammation marker fecal calprotectin (β=-0.004, p=0.04). At the class level, polyphenols classified as prenol lipids (β=-0.94, p<0.01) and phenylpropanoic acids (β=-0.92, p<0.01) negatively associated with lipopolysaccharide-binding protein, a measure of GI permeability. Using hierarchical feature engineering and random forest regression, we found a positive relationship between C-reactive protein and polyphenols classified as “cinnamic acids and derivatives”. Furthermore, we found that gut microbial composition differed between upper and lower quartiles of polyphenol intake (p=0.007), accounting for age, BMI, sex, and diet quality. The top differentiating taxa were a greater abundance of the family Clostridiaceae with high polyphenol intake and a greater abundance of Bacteroides stercoris with low intake. Conclusion: Our results indicate that polyphenol intake is associated with GI inflammation, GI permeability, and gut microbial community structure in healthy adults.
Genomics I
Session Date: Tuesday, November 19th
Session Time: 1:00pm-2:20pm
Session Location: Reveille
Session Moderator: Zhanyou Xu
1. (1:00-1:20) - Use of Artificial Intelligence for Vaccine Development Against Vector Pests: Challenges and Opportunities
Saelao, Perot, ARS
Bodine, D.M., Leucke, D., Bendele, K.G.
Conventional vaccine development can be costly and time-intensive. In a field where candidate antigens are rapidly evolving, the pace of development and identification of vaccine targets needs to be quick and efficient. Machine Learning (ML) and Artificial Intelligence (AI) have quickly become essential resources for identifying candidate antigens through methods such as reverse vaccinology. This presentation will provide a broad overview of the use of AI in vaccine development against pathogens and describe several examples from ARS research at the Veterinary Pest Genetics Research Unit. The overall goals of this presentation are to describe potential uses of AI with genomic data, foster ideas for collaboration in enhancing or expanding datasets for these methods, and identify areas and other systems within ARS research that could benefit from these applications.
2. (1:20-1:40) - The Promises and Pitfalls of Deep Learning Methods for Plant Phenotype Prediction
Washburn, Jacob, ARS
Daniel Kick
Phenotype prediction is a grand challenge of 21st-century biology! Predictive models and frameworks touch nearly every area of modern research and are particularly critical in agriculture for assessing crop loss risks, developing climate-smart and sustainable agricultural solutions, and informing breeding decisions. Within plant agriculture, the substantial influence of genotype-by-environment effects and diverse growing conditions compounds the challenge of prediction. Deep learning offers a promising approach to phenotypic prediction as it allows for incorporation of large amounts of data and diverse data types into a single model. Our recent findings on the application of deep learning in agriculture, supporting both trait measurement and prediction, will be presented, along with the limitations and promise of these techniques. The effects of different data types and qualities, “ensemble” methods, training methods, and the potential for biological insights from these models will be discussed.
3. (1:40-2:00) - Machine learning and analysis of genomic diversity of ‘Candidatus Liberibacter asiaticus’ strains
Chen, Jianchi, ARS
Adalberto A. Perez de Leon
We are currently performing active research projects to study the genetic diversity of “Candidatus Liberibacter asiaticus” (CLas), an alpha-proteobacterium associated with citrus Huanglongbing (HLB). HLB is a highly destructive disease in citrus production around the world. Because CLas is not culturable in vitro, DNA sequencing and analyses are the primary tools to study the bacterial diversity, which is crucial for HLB management. Genome sequence analyses involve data sets at the Mbp or Gbp level that generate multi-dimensional data outputs. Handling and interpreting such complex data sets is highly challenging through conventional approaches. Machine learning (ML) is a technology that uses computational statistics to resolve complex data problems. To illustrate this point, our recent publication (Huang et al. 2022, “Machine learning and analysis of genomic diversity of ‘Candidatus Liberibacter asiaticus’ strains from 20 citrus production states in Mexico,” Front. Plant Sci. 13:1052680, doi: 10.3389/fpls.2022.1052680) is discussed here. CLas samples were collected from 20 citrus-producing regions in Mexico and sequenced using the HiSeq platform, yielding multi-Gbp data. An unsupervised ML approach was implemented through principal component analysis (PCA) on average nucleotide identities (ANIs) of CLas whole genome sequences, and a supervised ML approach was implemented through sparse partial least squares discriminant analysis (sPLS-DA) on single nucleotide polymorphisms (SNPs) of coding genes of CLas. Two CLas Geno-groups were established that extended the current classification system of CLas strains. We are currently exploring neural network-based algorithms to further evaluate the CLas population diversity.
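The unsupervised step can be sketched as PCA over a pairwise ANI matrix (the values below are fabricated, with two strain groups built in to show the separation):

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative sketch: PCA on a pairwise average-nucleotide-identity (ANI)
# matrix to separate strain groups. The ANI values are fabricated so that
# strains {0, 1} and {2, 3} form two groups.
ani = np.array([
    [100.0,  99.9,  99.2,  99.1],
    [ 99.9, 100.0,  99.3,  99.2],
    [ 99.2,  99.3, 100.0,  99.9],
    [ 99.1,  99.2,  99.9, 100.0],
])

pcs = PCA(n_components=2).fit_transform(ani)

# Strains within a group should sit closer on PC1 than strains across groups.
d_within = abs(pcs[0, 0] - pcs[1, 0])
d_across = abs(pcs[0, 0] - pcs[2, 0])
```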
4. (2:00-2:20) - AI-ML in Genome-Phenome associations for Crop Resiliency
Hudson, Matthew, University of Illinois
Zhihai Zhang, Ryan Disney, Lucas Borges, Joao Viana, Kim Walden, Samuel Mintah, Todd Mockler, Andrew Leakey, Lisa Ainsworth, Andrea Eveland, Alex Lipka, Kaiyu Guan, Heng Ji
The Crop Resiliency thrust in AIFARMS focuses on the use of AI and machine learning to optimize our ability to predict, optimize and improve crop yields in a changing environment. In particular, we focus on improving genomic breeding methodologies and environmental sensing using AI, and several examples of this in different disciplines will be provided. Our ultimate goals are to improve genetic resiliency of crops and larger agricultural systems to environmental and weather extremes, to allow early prediction of yields and supply problems, and to measure carbon sequestration on local and regional scales.
Genomics II
Session Date: Wednesday, November 20th
Session Time: 10:30am-11:50am
Session Location: Oak
Session Moderator: Perot Saelao
1. (10:30-10:50) - Quantifying alfalfa digestibility with YOLOv8 and the Segment Anything Model (SAM)
Xu, Zhanyou, ARS
Brandon J. Weihs, Zhou Tang, Somshubhra Roy, Zezhong Tian, Deborah Jo Heuschele, Zhiwu Zhang, Zhou Zhang, Garett Heineck
The low digestibility of fiber in alfalfa (Medicago sativa L.) limits dry matter intake and energy availability in ruminant animal production systems. Previously, alfalfa plants were identified for low or high rapid (16 h) and low or high potential (96 h) in vitro neutral detergent fiber digestibility (IVNDFD) of plant stems. Here, two cycles of bidirectional selection for 16 h and 96 h IVNDFD were carried out. Two hundred fifty genotypes from the resulting populations were evaluated for solid vs. hollow stem characterization at three maturity stages. Each genotype was photographed with an RGB camera to record the number of stems, the size of each stem, and the area of the internal polygon hole. The number of stems was counted using You Only Look Once version 8 (YOLOv8) with more than 91% accuracy. Otsu’s automatic image thresholding algorithm and the Segment Anything Model (SAM) were used to segment each stem into two parts: the hollow stem’s central polygon and the stem’s outer layers. The medoid of the central polygon was identified by 5-cluster k-means classification, and the total number of pixels for the polygon hole and each whole stem was counted. The percentage of hollow pixels was estimated, associated with digestibility, and used as input for GWAS, genomic selection, and machine learning predictions. The application of AI transforms breeders’ subjective phenotyping into digital phenotyping and precision agriculture.
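The pixel-accounting step can be sketched with a from-scratch Otsu threshold on a synthetic stem image (illustrative only; the real pipeline first segments stems with YOLOv8 and SAM, and the image here is fabricated):

```python
import numpy as np

# Illustrative sketch: Otsu's threshold separates bright stem tissue from
# the dark hollow interior, then hollowness is a simple pixel ratio.

def otsu_threshold(img):
    """Return the gray level maximizing between-class variance."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    omega = np.cumsum(p)                    # class-0 probability
    mu = np.cumsum(p * np.arange(256))      # class-0 mean mass
    mu_t = mu[-1]                           # global mean
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1 - omega))
    return int(np.nanargmax(sigma_b))

img = np.full((40, 40), 200, dtype=np.uint8)   # bright stem tissue
img[15:25, 15:25] = 30                         # dark 10x10 hollow center

t = otsu_threshold(img)
hollow_pct = 100.0 * (img <= t).mean()         # percent hollow pixels
```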
2. (10:50-11:10) - Leveraging Genomics and AI to Differentiate Pest from Non-Pest Subspecies — A Boll Weevil Case Study
Perkin, Lindsey, ARS
Adama Tukuli, Zachary Cohen, Tyler Raszick, Gregory Sword, Robert Jones, Charles Suh, Xanthe Shirley, Julien Levy, Kiley Stout, Jayda Arriaga
The Boll Weevil Eradication Program, a national effort launched in 1978, has nearly eliminated the boll weevil, Anthonomus grandis grandis Boheman, from the United States. However, persistent populations in south Texas and Mexico continue to threaten cotton production in eradicated areas. Current control efforts rely on pheromone-baited traps and the organophosphate insecticide malathion. However, emerging environmental regulations are likely to restrict malathion use, and pheromone-baited traps, while effective for monitoring the boll weevil, also attract morphologically similar non-pest species that pose no risk to cotton. The most difficult non-pest weevil to discern from the boll weevil is its subspecies, Anthonomus grandis thurberiae Pierce, the thurberia weevil, which looks identical to the untrained eye. Accurately distinguishing these subspecies is critical for balancing eradication efforts against the cost of mitigation, thus we strive to develop easy-to-use tools for in-field use by eradication personnel. To obtain diagnostic molecular markers, our pipeline leverages AI models integrated with genomics. Starting with a genotype matrix derived from variant call format (VCF) data, we tested sixteen machine learning models and selected CatBoost for its effectiveness in handling sparse data and categorical variables without additional encoding. CatBoost effectively differentiated the boll weevil from the thurberia weevil with high accuracy, achieving a StratifiedKFold mean AUC of 0.99 and nearly perfect classification metrics in cross-validation, with precision, recall, and F1-score all reaching 1.00. Feature importance analysis in CatBoost further highlighted key SNPs associated with each subspecies. These SNPs were then mapped to reference genomes to retrieve associated protein sequences. Functional insights were derived using InterPro, while AlphaFold2 provided 3D structural predictions. To leverage these insights further, we used AlphaProteo/RoseTTAFold to design high-affinity protein binders targeting subspecies-specific proteins. These binders form the basis for developing diagnostic assays, such as lateral flow devices, to distinguish the boll weevil from non-pest weevils. Our AI-driven genomic approach demonstrates a versatile and robust tool for boll weevil management, while also being adaptable to other agricultural pest species.
3. (11:10-11:30) - AI-infused Breeding To Enhance Fruit Quality and Nutrition
Colantonio, Vincent, ARS
Anna Hermanns, Jillian Belluck, Andrew Horgan, James Giovannoni
Fleshy fruits serve as a crucial source of healthy foods in our diet. Improving access, affordability, and nutritional content of fleshy fruit products requires plant breeders to prioritize fruit quality traits in their breeding programs. However, the highly complex, quantitative nature of many quality components, such as color, flavor, texture, and nutritional composition, has historically led to the de-prioritization of these traits. Fortunately, machine learning and AI approaches are proving to be particularly useful for the measurement, prediction, and genetic dissection of fruit quality characteristics. Here, we will explore the use of machine learning and AI algorithms for improving quality in the model fruit crop, tomato. Examples from successful applications will be presented, including the development of AI-based phenomic prediction models for enhancing fruit flavor, computer vision tools for the measurement of fruit quality components, and the use of machine learning algorithms for genomic characterization of nutritional composition. Lastly, we will discuss considerations for setting up AI experiments useful for the improvement of fruit quality traits and how to successfully deploy these models in plant breeding programs.
4. (11:30-11:50) - Ideotype Breeding v2.0
Schnable, Patrick, Iowa State University
Nasla Saleem, Mozhgan Hadadi, Yan Zhou, Yawei Li, Adarsh Krishnamurthy, Baskar Ganapathysubramanian
Breeders have been tremendously successful at improving the performance of crops via selection, and geneticists can now readily identify genes responsible for traits of interest and use these genes to modify those traits. The challenge today is determining which traits and trait values breeders should select for, particularly in a world facing climate change. This presentation explores the potential of ideotype breeding to provide data-driven answers to this question. An ideotype is an idealized plant model expected to exhibit improved performance relative to existing plants. When the concept was first proposed in the late 1960s, it was not possible to define an optimal plant, and to a large extent this remains challenging. In response, we propose “Ideotype Breeding v2.0,” in which we define the existing ranges of phenotypic variation for all characteristics of a trait such as canopy architecture. We then use HTP procedural modeling (an approach used to create 3D models from sets of rules) to create 3D models of plants with all possible architectures based on trait values from existing germplasm. To define breeding targets, we model the efficiency of light capture by each type of virtual canopy. Subsequently, a genetic algorithm and further selection are used to optimize the canopy architecture.
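A toy genetic algorithm over a single made-up canopy trait illustrates the optimization step (the trait and fitness function are invented for illustration; the real pipeline scores light capture on full 3D procedural canopies):

```python
import random

# Toy genetic algorithm over a one-number "canopy trait" (a leaf angle),
# maximizing a made-up light-capture score that peaks at 40 degrees.

def light_capture(angle):
    return -((angle - 40.0) ** 2)        # fitness: best at 40 degrees

def evolve(generations=60, pop_size=30, seed=7):
    rng = random.Random(seed)
    pop = [rng.uniform(0, 90) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=light_capture, reverse=True)
        parents = pop[: pop_size // 2]   # elitist selection: keep top half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            child = (a + b) / 2 + rng.gauss(0, 1.0)   # crossover + mutation
            children.append(min(90.0, max(0.0, child)))
        pop = parents + children
    return max(pop, key=light_capture)

best = evolve()
```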
Large Language Models
Session Date: Wednesday, November 20th
Session Time: 10:30am-11:50am
Session Location: Corps
Session Moderator: Taner Sen
1. (10:30 - 10:50) - All We Need to Know About LLMs: Towards Securely Harnessing the Power of Generative AI in USDA
Park, John Y., ARS
Large Language Models (LLMs) are revolutionizing the way we work, offering unprecedented potential for improved communication, productivity, and personalization. However, to fully harness this transformative technology while mitigating potential risks, a deeper understanding of LLMs is crucial. While commercial LLMs offer powerful capabilities, they raise concerns about data privacy and security. User interactions with these models inevitably involve the transmission and storage of potentially sensitive data, including Personally Identifiable Information (PII), creating opportunities for misuse or breaches. This risk underscores the importance of exploring alternative solutions, such as open-source LLMs. Open-source models empower users with transparency and control over their data. By deploying LLMs within organizational data centers and leveraging open-source code, organizations can minimize data leakage and ensure the privacy of sensitive information. This presentation offers a comprehensive exploration of Large Language Models (LLMs), demystifying their inner workings, data utilization, and associated policies. We’ll examine popular closed-source LLM services such as ChatGPT, Claude, and Gemini, alongside open-source alternatives like Llama and Mistral, delving into their development and deployment. Our analysis will cover core architecture, training datasets, model characteristics, and weighting mechanisms, all presented from a user-centric perspective. Furthermore, we’ll conduct a SWOT analysis to evaluate the strengths, weaknesses, opportunities, and threats associated with these technologies. Next, we’ll highlight key LLM functionalities and discuss how they can be adapted to meet the unique research needs of the USDA. Finally, we aim to provide a roadmap of the steps for utilizing open-source LLMs within the organization, including secure storage and customization of user data.
By understanding the nuances of LLMs, including their potential benefits and risks, we can make informed decisions about their deployment and usage, ensuring a future where this powerful technology is used responsibly, ethically, and effectively to meet the unique needs of the USDA.
2. (10:50 - 11:10) - Navigating Genomes of Cereal Crops with Generative AI
Lazo, Gerard, ARS
Devadharshini Ayyappan, North Carolina State University, Raleigh, NC 101010 USA,
Parva K. Sharma, and Vijay K. Tiwari, University of Maryland, College Park, College Park, MD 20742 USA.
The age of generative artificial intelligence (GenAI) is now upon us, and we are learning new ways to incorporate it into our daily lives. The GrainGenes database has long housed information on the small grains (since 1992), covering topics such as genomes, genes, and traits, and has evolved incrementally with technological advances along the way. We are working on enhanced methods to access information and wish to determine whether GenAI can serve as a useful tool to propel our knowledge further. Improvements in high-throughput sequencing technologies have greatly increased the quality and depth of genome coverage and the breadth of species covered to gauge the diversity of cereal germplasm. Within many of the species represented there are highly studied germplasm and progenitors which will become crucial for better understanding the biology of these systems and for developing crop improvement strategies. There are now over seventy pseudomolecules available, or soon to be available, for wheat, barley, rye, and oat. Having these genomes available will allow us to survey relatedness between species in a pan-genome sense, utilizing the annotated transcriptomes of high-quality reference genomes. Early in the GenAI world (2023) there were constraints on the user with regard to required equipment, use fees, limited token-length access (which reduced the parsed data volume), and the quality of training associated with the large language models (LLMs) then available. The ability to step up these efforts has blossomed over 2024 with the availability of open-source tools with enhanced capabilities, providing a plethora of adaptive approaches to incorporate and analyze locally sourced data collections.
We have approached these studies on multiple fronts: through collections of research articles, building graphs based on database queries, and utilizing sequence annotations as paths for querying genome structure based on relevant genes associated with identifiable traits. Inter-specific crosses have played a role in bringing in new traits via gene introgression. The ability to survey cereals on a pan-genome scale may allow for new discoveries to aid this process. We have developed a capability for integrating the GFF3 files associated with genome descriptions as a roadmap for querying associated genes and traits using GenAI. We have also used collections of research papers resourced for the context-oriented topic of “cereal rust disease” to determine the extent to which GenAI could deliver on problem solving. Specially crafted prompts provided interesting guides about the topic, pointed to resources, and aided expanded discussions to enhance information discovery. Such prompts have been able to direct attention to molecular markers, chromosome locations, dominant and recessive alleles, and some descriptions based on the context provided. As this technology evolves, we are hopeful that further enhancements to GenAI will extend our capabilities. We plan to present the state of our findings and open discussions on how such tools might be useful for our future.
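The GFF3-as-roadmap approach described above relies on the standard nine-column GFF3 annotation format. As an illustrative sketch (not the GrainGenes pipeline; the file content and gene name below are made up), extracting gene records from a GFF3 document might look like:

```python
import csv
from io import StringIO

def parse_gff3_genes(gff3_text):
    """Yield (seqid, start, end, attributes) for each 'gene' feature in a
    GFF3 document (nine tab-separated columns; '#' lines are comments)."""
    for row in csv.reader(StringIO(gff3_text), delimiter="\t"):
        if not row or row[0].startswith("#") or len(row) != 9:
            continue
        seqid, source, ftype, start, end, score, strand, phase, attrs = row
        if ftype != "gene":
            continue
        # Attributes are semicolon-separated key=value pairs.
        attr_map = dict(kv.split("=", 1) for kv in attrs.split(";") if "=" in kv)
        yield seqid, int(start), int(end), attr_map

# Hypothetical two-feature GFF3 snippet (gene plus its mRNA child).
demo = ("##gff-version 3\n"
        "chr1A\tdemo\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=Lr34\n"
        "chr1A\tdemo\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1\n")
genes = list(parse_gff3_genes(demo))
```

A query layer (or a GenAI agent) could then match trait keywords against the extracted `Name`/`ID` attributes and return coordinates for follow-up.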
3. (11:10 - 11:30) - Use of AI in Agricultural Studies: Examples in leveraging LLMs to help enable scientists to conduct research more effectively
Lau, Jeekin, ARS
Large language models (LLMs) are the poster child for artificial intelligence; for example, much of the general public is familiar with ChatGPT. As a relatively new emerging technology, use cases are being deployed in many fields, including agricultural research. Three use case scenarios will show how we can leverage LLMs in our agricultural research: 1) using LLMs to explore different mathematical models for agricultural data, 2) using LLMs to help analyze large, complex trials, including different annotations for different software packages, and 3) commenting and annotating code. These three showcase examples may help other researchers see potential new uses of LLMs in their own research.
4. (11:30-11:50) - Large Language Models (LLMs) for research data curation
Campbell, Jacqueline, ARS
Dr. Steven Cannon, Research Geneticist (Plants), Corn Insect & Crop Genetics Research Unit
USDA databases like SoyBase, MaizeGDB, GrainGenes, and the i5k Workspace at NAL play a critical role in providing well-organized, easily accessible biological data to researchers across the world. However, extracting the knowledge within a published manuscript from unstructured text into a structured, human- and machine-readable form, known as biocuration, is a slow and painstaking process. Large Language Models (LLMs) have emerged as a promising tool to accelerate biocuration by automatically extracting knowledge from published manuscripts. Several research groups have evaluated LLMs for biocuration in human medicine and genetics, and a number of databases funded by the NSF have started using LLMs for biocuration. The databases within the USDA are an invaluable resource because each database has built, and continues to build, upon previous data in a well-organized and structured way. The large amount of curated data within each of these databases is optimal for AI-driven meta-analysis. I will present an introduction to two important topics at the USDA Forum on AI in Federal Agricultural Research: the responsible use of LLMs for biocuration, and future trends in AI research using large amounts of professionally curated data for AI-driven meta-analysis.
Modeling I
Session Date: Tuesday, November 19th
Session Time: 3:00pm-4:20pm
Session Location: Hullabaloo
Session Moderator: Luca Sartore
1. (3:00-3:20) - Prediction of aflatoxin contamination outbreaks in Texas corn by using mechanistic and machine learning models
Castano-Duque, Lina, ARS
Angela Avila, Brian Mack, H. Edwin Winzeler, Joshua Blackstock, Matthew D. Lebar, Geromy G. Moore, Phillip Ray Owens, Hillary L. Mehl, James Lindsay, Kanniah Rajasekaran, and Jianzhong Su
Aflatoxins are carcinogenic and mutagenic mycotoxins produced by fungi that contaminate the food supply under field or storage conditions. To predict mycotoxin outbreaks, we employed an ensemble of models to estimate the probability of high or low aflatoxin contamination in corn (maize) at the county level across Texas. Our models utilized high-throughput dynamic geospatial data from remote sensing satellites, soil property maps, and meteorological data at the county level. We developed three model ensemble analysis pipelines: two mechanistic models that used weekly aflatoxin risk indexes (ARI) as inputs, and one weather-dependent model. The ARI was determined using two approaches: (a) the AFLA-MAIZE mechanistic model, and (b) the Ratkowsky mechanistic model. The third model relied solely on weather input features (temperature, precipitation, and relative humidity). For the ARI-dependent models, the ARIs were weighted based on a corn phenological model that estimated planting times per growing season at the county level. The phenology model used satellite-acquired normalized difference vegetation index (NDVI) data to estimate corn growth curves via a 3rd-degree polynomial. In the second stage of the pipelines, we trained, tested, and validated gradient boosting and neural network models using as inputs ARI-only or weather-only features, soil properties, and county geodynamic latitude and longitude references. Our findings indicated that the AFLA-MAIZE and Ratkowsky mechanistic models had similar accuracy, sensitivity, and specificity for predicting aflatoxin outbreaks, and they were on par with the weather-only model. We recommend considering model sensitivity and specificity when evaluating mycotoxin outbreak models, as these metrics provide insights into false positive and false negative rates.
Our study concluded that Texas exhibits significant geographical variability in ARI and ARI-hotspot responses due to its diverse ecoregions (hot-dry, hot-humid, mixed-dry, and mixed-humid). This diversity leads to high temporal and latitudinal variability in weather and planting times, resulting in wide variation in the timing of corn development. For instance, peak corn flowering, which is crucial for predicting aflatoxin outbreaks (April, June, and July), occurs 2-3 months earlier in southern Texas than in northern Texas. Our weather-only-nnet model identified a correlation and hot-spot prevalence in the hot-humid areas of Texas, where high relative humidity in March and October led to increased aflatoxin events. Similarly, our Ratkowsky-ARI-GBM-standard model detected that high ARI, and consequently high temperature, in the early-to-mid corn growing season resulted in high aflatoxin contamination. We found that, depending on the ecoregion in Texas, there is a positive correlation between aflatoxin outbreaks and soil organic matter, pH, soil erodibility, and soil sodium adsorption ratio. Conversely, there is a negative relationship between aflatoxin outbreaks and available water holding capacity (AWC) and cation exchange capacity. Our results demonstrate intricate relationships between AWC, fungal communities, and plant health. It is possible that soil fungal communities are more diverse, and plants healthier, at high AWC, leading to fewer aflatoxin outbreaks. These findings suggest that any implementation of prediction and prevention strategies for mycotoxin outbreaks should consider these complex interactions across the geographical ecoregions of Texas.
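The phenology step described in this abstract fits a 3rd-degree polynomial to county-level NDVI time series to locate the corn growth peak. A minimal, self-contained illustration of that idea on synthetic weekly NDVI values (not the study's code; the least-squares fit is done via the normal equations):

```python
def polyfit3(x, y):
    """Least-squares cubic fit: solve the 4x4 normal equations (A^T A)c = A^T y
    by Gaussian elimination with partial pivoting. Returns [c0, c1, c2, c3]."""
    n = 4
    ata = [[sum(xi ** (i + j) for xi in x) for j in range(n)] for i in range(n)]
    aty = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(ata[r][col]))
        ata[col], ata[piv] = ata[piv], ata[col]
        aty[col], aty[piv] = aty[piv], aty[col]
        for r in range(col + 1, n):
            f = ata[r][col] / ata[col][col]
            for c in range(col, n):
                ata[r][c] -= f * ata[col][c]
            aty[r] -= f * aty[col]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        coef[r] = (aty[r] - sum(ata[r][c] * coef[c] for c in range(r + 1, n))) / ata[r][r]
    return coef

def peak_week(coef, weeks):
    """Week at which the fitted NDVI curve is maximal (proxy for peak canopy)."""
    return max(weeks, key=lambda w: sum(c * w ** i for i, c in enumerate(coef)))

# Synthetic NDVI curve peaking at week 8 of a 16-week window.
weeks = list(range(16))
ndvi = [0.2 + 0.6 * (1 - ((w - 8) / 8) ** 2) for w in weeks]
coef = polyfit3(weeks, ndvi)
```

The inferred peak week could then anchor the ARI weighting window, as the abstract describes for flowering time.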
2. (3:20-3:40) - Field to Fiber: AI-Driven Cotton Yield Predictions
Mitra, Alakananda, ARS
Sahila Beegum, David Fleisher, Vangimalla R. Reddy, Wenguang Sun, Chittaranjan Ray, Dennis Timlin, Arindam Malakar
The United States cotton industry is committed to implementing sustainable production strategies that reduce water, land, and energy consumption and enhance soil health and cotton yield. More climate-smart agriculture solutions are being developed across the globe to increase crop productivity and lower operational costs. However, accurate crop yield prediction is complex due to the intricate and nonlinear interplay of factors such as cultivar, soil type, management, pests and disease, climate, and weather patterns. In this study (Mitra et al., 2024), we employed a machine learning (ML) method to predict cotton yield that accounts for climatic change, soil diversity, cultivars, and fertilizer applications. The study used two types of data: field data and synthetic data. Field data were collected from the US southern cotton belt during the 1980s and 1990s. Synthetic data were generated using a process-based crop model, GOSSYM, to reflect the recent impacts of climate change from 2017 to 2022. The study areas include three southern states: Texas, Mississippi, and Georgia. A total of nine locations were selected based on cotton productivity. We used accumulated heat units (AHU) instead of time-series weather data to reduce computational complexity. Random Forest (RF) regressor, Support Vector Regression (SVR), Light Gradient Boosting Machine (LightGBM) regressor, Multiple Linear Regression (MLR), and neural networks were tested to find the best ML algorithm for the task. Cross-validation was performed to avoid overfitting. The RF regressor performed best, achieving an accuracy of 97.75%, with an R2 of roughly 0.98 and a root mean square error of 55.05 kg/ha. The results demonstrate that a simple and robust model can be developed and used to aid the cotton climate-smart effort.
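Accumulated heat units of the kind used above are commonly computed as growing degree days above a base temperature. The sketch below uses the 15.6 °C base often cited for cotton; the study's exact formulation may differ:

```python
def accumulated_heat_units(tmax, tmin, t_base=15.6):
    """Sum daily heat units max(0, (Tmax + Tmin)/2 - Tbase) over a season.
    15.6 degC (60 degF) is a common cotton base temperature; this is a
    generic growing-degree-day sketch, not the study's formula."""
    return sum(max(0.0, (hi + lo) / 2 - t_base) for hi, lo in zip(tmax, tmin))

# Three hypothetical days of highs/lows (degC):
# day 1: mean 25 -> 9.4 units; day 2: mean 27 -> 11.4; day 3: mean 14 -> 0
ahu = accumulated_heat_units([30, 32, 18], [20, 22, 10])
```

Collapsing a daily weather series into a single AHU scalar is what lets the ML models above drop the time dimension entirely.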
3. (3:40-4:00) - A Case Study Of Predicting Midrange Precipitation Using The K-Nearest Neighbors Method In The Southern Plains
Guidry-Stanteen, Sean, University of Texas at Arlington
Jianzhong Su, John Zhang, Paul Flanagan
Weather, particularly precipitation, is a force on which agriculture is wholly dependent. Our hypothesis is that weather variables such as precipitation and temperature have intrinsic periodicity and repetition over time. k-Nearest Neighbors (kNN) attempts to exploit this by looking into the past to see whether what happened recently ever happened before, and if it did, what happened next. This is done by compiling “features,” relevant data expressed numerically, into “feature vectors,” and matching them against historical data sets. While looking at daily data tends to be the most accurate, it can be skewed by extreme bouts of precipitation. A novel method of grouping data by several days was developed and tested. Overall, the method can generate desirable results given the correct number of days grouped in just the right way. The method has been tested on two weather stations in the Southern Plains region (Oklahoma), and the predictions are found to be reliable.
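The kNN analog idea above can be sketched as follows: the most recent days form a feature vector that is matched against every historical window of the same length, and the forecast is whatever followed the closest analog(s). Illustrative only, with made-up data:

```python
from math import dist

def knn_analog_forecast(history, window=3, k=1):
    """Forecast the next value by matching the most recent `window` days
    against all earlier windows and averaging what followed the k closest
    analogs (a sketch of the kNN idea, not the study's code)."""
    query = history[-window:]
    candidates = []
    for i in range(len(history) - window):
        feature = history[i:i + window]          # historical feature vector
        candidates.append((dist(feature, query), history[i + window]))
    candidates.sort(key=lambda t: t[0])          # nearest analogs first
    best = [nxt for _, nxt in candidates[:k]]
    return sum(best) / len(best)

# Toy series with a repeating pattern: (1, 2, 3) has always been followed by 4.
series = [1.0, 2.0, 3.0, 4.0, 0.0, 1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0]
forecast = knn_analog_forecast(series, window=3, k=1)
```

Varying `window` corresponds to the abstract's idea of grouping data by several days to blunt the effect of extreme precipitation events.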
4. (4:00-4:20) - A Spatially-Aware ViT-LSTM Architecture for Unearthing the Drivers of Soil Moisture Dynamics Across CONUS
Rahman, Mashrekur, ARS
Menberu B Meles, Scott Bradford, Grey S Nearing
Soil moisture dynamics play a crucial role in hydrological processes, influencing runoff generation, drought stress, and water management. To better understand the complex drivers of soil moisture dynamics, we present a novel hybrid architecture integrating Vision Transformers (ViT), spatial attention mechanisms, and Long Short-Term Memory (LSTM) networks. This architecture enables investigation of controlling factors across diverse landscapes in the Continental United States (CONUS) by incorporating spatial awareness at two levels: through ViT’s ability to capture spatial patterns and through explicit spatial attention between neighboring stations. We leverage a comprehensive set of environmental data sources, including in-situ measurements from the International Soil Moisture Network (ISMN), ERA5 climate reanalysis, USGS elevation products, MODIS land cover, and SoilGrids soil characteristics. Initial results from a one-year training period and three-month testing period (R² = 0.73, 0.72, 0.73 for 24h, 48h, and 72h predictions) reveal important insights about the hierarchical importance of different drivers across prediction windows. Our preliminary analysis shows that static physical properties (particularly slope and soil structure) and hydraulic characteristics maintain high importance across temporal scales, while the influence of dynamic weather features varies with prediction horizon. The model’s dual spatial attention mechanisms and temporal components enable discovery of both local and regional controls on soil moisture dynamics. The identified feature importance hierarchies provide initial insights into the spatiotemporal controls on soil moisture dynamics across CONUS. Ongoing work extends the training to the full temporal extent of available data to develop a more comprehensive understanding of these driving factors. 
This approach advances our fundamental understanding of soil moisture processes at continental scales, with implications for a future tool for land characterization and ecological site classification.
Modeling II
Session Date: Wednesday, November 20th
Session Time: 10:30am-11:50am
Session Location: Traditions
Session Moderator: Lory Willard
1. (10:30-10:50) - Deep neural networks autonomously learn and utilize critical environmental conditions to capture the impacts of climate change on wheat performance
Benke, Ryan, ARS
Linqian Han, Kimberly A. Garland-Campbell, Xianran Li
Environmental conditions play a critical role in shaping crop performance. Enviromic prediction (EP) aims to develop models that learn the hidden relationships between environmental conditions and crop traits, enabling forecasts of performance under new conditions. Here, we explored applications of deep learning for enviromic prediction. Deep neural networks (DNN) were trained on over 100,000 environmental features to predict wheat yield and other important traits recorded from 322 environments spanning two decades across 20 locations. The trained DNN models demonstrated a high level of prediction accuracy in both cross-validation and forecast schemes. To gain insights into the inner workings of these “black box” DNNs, we devised a bulk-feature sensitivity analysis and revealed that the DNNs autonomously learned and utilized key environmental conditions associated with phenotypic variation. Our findings highlight the potential of deep learning to develop interpretable artificial intelligence models that accurately predict crop performance, offering guidance towards optimizing agricultural practices under changing climate conditions.
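Bulk-feature sensitivity analyses of the kind mentioned above generally perturb groups of input features together and measure the resulting change in model output. A generic sketch with a toy model and made-up feature groups (the authors' exact procedure may differ):

```python
import random

def bulk_feature_sensitivity(predict, X, groups, seed=0):
    """For each named group of feature indices, permute those features
    across samples and report the mean absolute change in prediction.
    Larger values indicate the model relies more on that feature group."""
    rng = random.Random(seed)
    base = [predict(x) for x in X]
    perm = list(range(len(X)))
    sensitivities = {}
    for name, idxs in groups.items():
        rng.shuffle(perm)
        changed = []
        for i, x in enumerate(X):
            xp = list(x)
            for j in idxs:                  # swap in another sample's values
                xp[j] = X[perm[i]][j]
            changed.append(predict(xp))
        sensitivities[name] = sum(abs(a - b) for a, b in zip(changed, base)) / len(X)
    return sensitivities

# Toy "model": output depends strongly on feature 0, weakly on feature 2.
predict = lambda x: 5.0 * x[0] + 0.1 * x[2]
X = [(i / 10.0, float(i % 3), ((i * 7) % 5) / 5.0) for i in range(30)]
sens = bulk_feature_sensitivity(predict, X, {"heat": [0], "other": [1, 2]})
```

Applied to a trained DNN, ranking groups of environmental features this way is one route to the kind of interpretability the abstract describes.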
2. (10:50-11:10) - Improvements to Deep Isolation Forests for Identifying Anomalous Records
Sartore, Luca, NASS
Yumiko Siegfried, Valbona Bejleri
The presence of outliers in a dataset can bias the results of statistical analyses. To correct for outliers in agricultural data collected through repeated surveys, micro edits are manually performed. A set of constraints and decision rules is used to simplify the editing process. However, agricultural data are characterized by complex relationships that make revision and vetting challenging. Also, outlier detection methods used in survey data to identify the records that need editing do not address the mixed (i.e., continuous or categorical) nature of the variables. Isolation Forests (IF) have gradually increased in popularity as a distribution-free algorithm for screening high volumes of data with mixed-type variables. Although several variations have been proposed in the past decade, these improvements have seldom been tested together. In this paper, deep random architectures are used within generalized isolation forests to perform nonlinear dimensionality reduction and outlier detection at the record level. Nested complex-valued nonlinear transformations using activation functions are performed on random projections. The outputs of these nested processes are successively classified by improved generalized isolation trees and then combined using a scoring technique based on fuzzy logic. The performance of the proposed algorithm is tested on “raw” survey data for automatic early identification of anomalous records. Also, to assess the algorithm’s potential performance in a production environment, its outputs are compared to finalized “human-edited” data.
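For readers unfamiliar with the underlying mechanism: an Isolation Forest isolates points with recursive random splits, and anomalies need fewer splits, so shorter average path lengths map to higher scores. A bare-bones sketch of the classic algorithm (not the deep or generalized variants in this talk; each tree here is fit on the full sample rather than random subsamples):

```python
import math
import random

GAMMA = 0.5772156649  # Euler-Mascheroni constant

def _c(n):
    """Average unsuccessful-search path length in a BST of n points,
    used to normalize isolation depths (0 for n <= 1)."""
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + GAMMA) - 2.0 * (n - 1) / n

def _grow(points, rng, depth, max_depth):
    """Recursively partition points with random axis-aligned splits."""
    if len(points) <= 1 or depth >= max_depth:
        return ("leaf", len(points))
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return ("leaf", len(points))
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return ("node", dim, split,
            _grow(left, rng, depth + 1, max_depth),
            _grow(right, rng, depth + 1, max_depth))

def _path(tree, p, depth=0):
    if tree[0] == "leaf":
        return depth + _c(tree[1])
    _, dim, split, left, right = tree
    return _path(left if p[dim] < split else right, p, depth + 1)

def anomaly_score(data, point, n_trees=100, seed=0):
    """Isolation-forest score in (0, 1]; values near 1 indicate anomalies."""
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(len(data)))
    trees = [_grow(data, rng, 0, max_depth) for _ in range(n_trees)]
    avg_path = sum(_path(t, point) for t in trees) / n_trees
    return 2.0 ** (-avg_path / _c(len(data)))

# Dense grid of "typical" records; a far-away point isolates quickly.
data = [(x / 10.0, y / 10.0) for x in range(-5, 6) for y in range(-5, 6)]
score_inlier = anomaly_score(data, (0.0, 0.0))
score_outlier = anomaly_score(data, (5.0, 5.0))
```

The paper's contribution layers nonlinear random projections and fuzzy-logic score combination on top of this basic scheme.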
3. (11:10-11:30) - Integrating physics simulation and data-science approaches for monitoring airborne crop diseases
Ulmer, Lucas, ARS
Walter F. Mahaffee
Outbreaks of airborne crop diseases involve significant spatiotemporal uncertainty. Many crop diseases are difficult to detect until it is too late, necessitating conservative fungicide spray programs. The ability to identify where and when disease is present or likely to appear can help minimize downsides associated with disease control programs, including fungicide resistance, financial cost, and human and environmental health risks. Combining field data with predictive models may help narrow this search. Aerial spore sensors (“spore traps”) sample large masses of air over crops. Atmospheric dispersion models can be used to invert trap data and gain information about the likely origins of intercepted spores, and thus the spatial extent of a current infection. This work investigates the feasibility of one class of inversion techniques known as source term estimation (STE) for identifying the origins of particles sampled during a plume release experiment in an Oregon vineyard. A probabilistic “risk map” of the artificial infection is constructed from the trap data using Bayesian inference. We explore the impact of sensor network density and dispersion model quality on the accuracy of this risk map, finding that more accurate models may help compensate for data scarcity. Future prospects for regional-scale disease monitoring networks will also be discussed. By integrating multiple data streams (e.g., trap data, manual scouting results, weather and rainfall data, and host phenology from remote sensing and grower-provided photos) with physics-based models for dispersion and host-pathogen interaction in a Kalman filter-like procedure, we may be able to provide growers with better estimates of the spatiotemporal distribution of current and nascent outbreaks. Additionally, prospects for accelerating the dispersion simulations with regression-based ML models will be discussed.
4. (11:30-11:50) - Modeling Climate Change Impacts on Agriculture: Integrating CEAP, APEX, and Machine Learning for Adaptive Strategies
Osorio-Leyton, Javier M., Texas A&M University
Karen Maguire, Siwa Msagni
This research proposal integrates the Conservation Effects Assessment Program (CEAP) and the Agricultural Policy Environment eXtender (APEX) model to project climate scenarios and assess their impacts on agriculture. By using the APEX model in conjunction with the Regional Environment and Agriculture Programming (REAP) model at the Economic Research Service (ERS), the study aims to improve our understanding of how agricultural markets and systems respond to climate change, providing insights into how farmers can adapt practices and land use strategies. This work will strengthen the REAP model’s ability to address key research questions related to agricultural adaptation. Agricultural systems are complex, and the APEX model helps simulate these dynamics. However, its effectiveness is limited by the availability of spatially specific management data. To address this, the research introduces a machine learning-based methodology using the k-nearest neighbors (k-NN) algorithm. This algorithm imputes missing management practices by matching environmental variables such as topography, soil properties, and climate between the CEAP data and target sites across the continental U.S. The study uses CEAP data from two survey periods, combining data from 2003-2006 (CEAP-I) and 2012-2016 (CEAP-II). By calculating Euclidean distances between target and donor sites, the k-NN algorithm identifies the most suitable management practices, enabling more realistic simulations in the APEX model. This approach helps expand the geographic applicability of CEAP data and provides crucial inputs for climate adaptation research. Overall, the project enhances the REAP model’s capacity to explore questions about agricultural adaptation to climate change and environmental outcomes, offering valuable insights for the USDA and other stakeholders to support sustainable agricultural practices in a changing climate.
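The k-NN matching step described above can be sketched as follows: standardize the environmental variables so no single unit dominates, compute Euclidean distances from the target site to each donor site, and keep the k closest donors. The variables and site values below are hypothetical, not CEAP data:

```python
from math import dist
from statistics import mean, stdev

def nearest_donors(target, donors, k=2):
    """Rank donor sites by Euclidean distance to the target in z-scored
    feature space and return the k closest donor ids (a sketch of the
    k-NN matching idea, with made-up variables)."""
    names = list(donors)
    table = [donors[n] for n in names] + [target]
    cols = list(zip(*table))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) or 1.0 for c in cols]      # guard constant columns
    z = lambda row: [(v - m) / s for v, m, s in zip(row, mus, sds)]
    zt = z(target)
    return sorted(names, key=lambda n: dist(z(donors[n]), zt))[:k]

# Hypothetical donor sites: (slope %, clay %, annual rainfall mm)
donors = {
    "donor_A": (2.0, 30.0, 900.0),
    "donor_B": (9.0, 10.0, 400.0),
    "donor_C": (2.5, 28.0, 880.0),
}
best = nearest_donors((2.2, 29.0, 890.0), donors, k=2)
```

In the proposed workflow, the management practices recorded at the matched donor sites would then be imputed to the target site as APEX inputs.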
Multimodal Learning
Session Date: Wednesday, November 20th
Session Time: 2:30pm-3:50pm
Session Location: Corps
Session Moderator: Luca Sartore
1. (2:30-2:50) - All-in-One: Unifying Multimodal Generative and Discriminative AI in the 4D World
Wang, Yuxiong, University of Illinois
Generative AI has emerged as the new wave following discriminative AI, as exemplified by various powerful generative models including visual diffusion models and large language models (LLMs). While these models excel at generating images, text, and videos, mere creation is not the ultimate goal. A grand objective lies in understanding and making decisions in the world through the generation process. In this talk, I discuss our efforts towards unifying generative and discriminative learning, empowering autonomous agents to perceive, interact, and act in the 4D physical world. I begin by elaborating on how we advance generative modeling to be geometry-aware, physics-informed, and multimodal in the 4D world. Next, I delve into several representative strategies that exploit generative models to improve comprehension of the 4D world. These strategies include repurposing latent representations within generative models, treating them as data engines, and more broadly, formulating generative models, especially LLMs, as agents for problem-solving and decision-making. Finally, I explore how to synergize knowledge from different multimodal generative models in the context of modeling human-object interaction. Throughout the talk, I demonstrate the potential of multimodal generative AI in scaling up in-the-wild perception and decision-making across application domains such as agriculture, transportation, and robotics.
2. (2:50-3:10) - AIIRA: AI Institute for Resilient Agriculture
Ganapathysubramanian, Baskar, Iowa State University
AI Institute for Resilient Agriculture: Case study of building models using large scale multimodal data.
3. (3:10-3:30) - Advances in Multimodal AI and its Applications to Livestock and Beyond
Ahuja, Narendra, University of Illinois
AIFARMS Team
We present an overview of some of our recent research related to people, animals, and plants under AIFARMS Thrusts 2 and 3. The research aims to build the following capabilities, in various combinations:
1. 3D reconstruction of moving articulated objects, such as active humans (e.g., caretaker personnel) and farm animals (e.g., pigs and goats) from a single short video, while using text descriptions (e.g., captions) if available.
2. Hierarchical representation, and simultaneous recognition and segmentation of the activities in multi-activity videos, captured from an uncalibrated camera.
3. Joint audio-visual understanding of scenes, and video guided separation of sounds in audios containing mixed sounds from simultaneously active multiple sources.
4. Methods for explainable (white box), efficient (implementable on small devices at acceptable speeds), multilabel (multiple objects present, or activities happening, simultaneously), and metric learning (creating learning representations that reflect semantics), for problems in and out of training distribution (similar to problems previously seen as well as those not seen).
Protein Structure Prediction Applications
Session Date: Tuesday, November 19th
Session Time: 3:00pm-4:20pm
Session Location: Eagle
Session Moderator: Elly Poretsky
1. (3:00-3:20) - Exploiting AI/ML-based protein structure prediction tools for data-driven design of gluten digestive enzymes
Weigle, Austin, ARS
Chris P. Mattison, Gerard R. Lazo, Brenda Oppert
The status of AI-based molecular modeling tools has ushered in a new era of computational molecular design. Synthetic data can now be generated, and annotated, at high throughput to guide the experimental validation of valuable protein or molecular products. As a case study, we exploited multiple sequence alignment (MSA) subsampling in AlphaFold2 and RoseTTAFold2 to accurately predict diverse protein structure conformations. We applied this methodology to the structure-based design of an enzymatic digestive aid to remediate gluten sensitivity. Currently, gluten sensitivity is primarily treated by a strict gluten-free diet. To this end, we selected the main digestive cathepsin L enzyme from Tribolium castaneum (Tc) as our molecular design target. We employed a structure-based engineering strategy to shortlist mutations that can (1) improve Tc cathepsin pH-optimum activity in the acidity of the human stomach for probiotic use; and (2) improve Tc cathepsin thermostability for commercial applications. From our resulting conformational ensemble, we observed transitionary conformations relevant to cysteine protease structural biology. Given the residues associated with functional conformational change, we prepared Tc cathepsin variants for in silico annotation with desired pH optimum and thermostability. Our multi-objective approach will motivate experimental validation of Tc cathepsin variants with fine-tuned function for the digestion of consumed and commercially prepared gluten products.
2. (3:20-3:40) - Enhancing Resistance Gene Discovery With AlphaFold Multimer
Ingram, Thomas, ARS
Matthew N. Rouse, Matthew J. Moscou
Innate host resistance against wheat stem rust relies on recognition of the pathogen's avirulence proteins (AVRs) by host resistance proteins (SRs). Discovery of SR- and AVR-encoding genes currently relies on DNA/RNA-based bioinformatics techniques to narrow down candidates for experimental validation. Gene discovery in plants and pathogens with large genomes is hampered by linkage disequilibrium and often generates too many candidate genes to practically screen. AlphaFold Multimer provides in silico prediction of protein-protein binding. In wheat, secreted pathogen AVRs bind to the leucine-rich repeat (LRR) of nucleotide-binding site–leucine-rich repeat (NLR) proteins. Pathogen protein recognition is followed by a defense cascade, often leading to cell death. AlphaFold Multimer 2.3.1 was used to predict the interaction between known LRR host proteins and their known AVR counterparts (with signal peptides removed), as well as against control proteins. In five out of six protein-protein combinations, AlphaFold Multimer correctly predicted the host-pathogen protein interaction. In some scenarios the LRR region alone provided the best interface predicted template modeling (iptm+ptm) scores, while in others the LRR plus C-terminus provided the best prediction. The known AVR-SR gene combinations were consistently in the top 5% of iptm+ptm scores when screened against an arbitrary set of 50 SR proteins. Candidate unknown AVR and SR proteins were screened using AlphaFold Multimer for downstream functional assays.
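The screening criterion above (known pairs landing in the top 5% of iptm+ptm scores) amounts to a simple rank-threshold test over the candidate set. An illustrative sketch with made-up scores (note that AlphaFold-Multimer's reported "iptm+ptm" ranking confidence is the weighted combination 0.8·ipTM + 0.2·pTM; here we just consume the reported number):

```python
def top_fraction(scores, pair, frac=0.05):
    """Return True if `pair` ranks within the top `frac` of candidates
    by predicted confidence (e.g., AlphaFold-Multimer's iptm+ptm)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    cutoff = max(1, round(len(ranked) * frac))
    return pair in ranked[:cutoff]

# Hypothetical screen: one known AVR-SR pairing among 40 decoy pairings.
scores = {f"decoy_{i}": 0.30 + 0.005 * i for i in range(40)}   # 0.300 .. 0.495
scores["AvrSr35-Sr35"] = 0.82                                  # known interactor
flag = top_fraction(scores, "AvrSr35-Sr35", frac=0.05)
```

Running this over every candidate AVR against a panel of SR proteins reproduces the prioritization step that feeds the downstream functional assays.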
3. (3:40-4:00) - Empowering plant breeding: generative AI in the fight against plant diseases
Edwards, Jeremy, ARS
Li Wang, Yulin Jia
Plant diseases significantly threaten global food security, and there is a pressing need for effective resistance strategies in crops. Central to plant immunity is the interaction between pathogen avirulence (AVR) proteins and plant resistance (R) proteins, often following a gene-for-gene model. Identifying these specific protein interactions is challenging due to the limitations of traditional experimental methods. This presentation will explore the application of artificial intelligence, particularly AlphaFold 2, to predict protein structures and interactions between plant R proteins and pathogen AVR proteins. We demonstrate how AI accelerates basic discovery and mapping of gene networks and facilitates the discovery of new disease resistance alleles. Additionally, we discuss the potential of generative AI methods to achieve wide-spectrum durable resistance through design and optimization of novel R genes for deployment via gene editing. Our findings highlight the transformative potential of AI in advancing plant breeding and developing robust disease resistance.
Remote Sensing Applications
Session Date: Tuesday, November 19th
Session Time: 1:00pm-2:20pm
Session Location: Oak
Session Moderator: TBD
1. (1:00-1:20) - Leveraging Cloud Computing to Forecast Rangeland Fuel Dynamics
Reeves, Matthew, USFS
Robb Lankston
In this project we document our use of remote sensing and weather data to project fine fuel on a deployed AI platform for the Mojave Desert.
2. (1:20-1:40) - Using deep learning to map prairie dog colonies from remote sensing imagery
Kearney, Sean, ARS
Lauren Porensky, David Augustine, David Pellatz, Erika Peirce, Mikael Hiestand, Justin Derner
The ability of deep learning models to detect individual objects in air- and space-borne remote sensing imagery provides unprecedented opportunities to produce new map products for use by land managers at fine spatial scales. Much work remains to understand what kinds of objects can be detected, how well models transfer across locations and time periods, and what image specifications (e.g., resolution, spectral ranges, pre-processing) provide optimal results. We present methods, results, and lessons learned from studies conducted in two US Forest Service National Grasslands in which we trained deep convolutional neural networks (dCNNs) to detect individual black-tailed prairie dog (Cynomys ludovicianus) burrows using imagery from unoccupied aerial vehicles (UAVs; i.e., drones). The black-tailed prairie dog is both a keystone species of conservation concern and an agricultural pest. Thus, it is a wildlife species for which detailed monitoring is a high priority, especially in areas where public and private land ownership converges. Cost-effective ground-based mapping is difficult to achieve due to the remote and vast landscapes occupied by prairie dogs. We set up several studies to analyze (1) how burrow-detection accuracy changes depending on image resolution (from 2-30 cm ground sampling distance) and inputs (e.g., red-green-blue channels [RGB], the Normalized Difference Vegetation Index [NDVI] derived from multispectral channels, and a photogrammetrically derived topographic position index [TPI]) and (2) how well models transfer across seasons, years, and geographical regions with different plant communities. Specifically, we conducted UAV flights over multiple sites, seasons, and years totaling more than 30,000 acres, and trained a dCNN in Python using the DeepLabV3+ architecture with a ResNet34 encoder initialized with pretrained weights from the ImageNet dataset.
Burrow-detection accuracy remained stable up to 5-10 cm image resolution but declined substantially at coarser (>10 cm) resolutions. RGB and TPI together provided the best results; adding NDVI did not improve them. Models transferred better across time than across regions, although adding more data nearly always improved models, regardless of whether the additional data came from a new region or a new time period. In addition to presenting the results from our studies, we also discuss takeaways from our model-training procedures and considerations for developing secondary map products from our dCNN object-detection output.
3. (1:40-2:00) - Remote sensing of microbial quality of irrigation water for food safety: machine learning applications to the UAV-based imaging
Pachepsky, Yakov, ARS
Seokmin Hong, Mathew Stocker, Billie Morgan, Jaclyn Smith, Moon Kim
A growing proportion of produce-associated illnesses has been linked to the microbial quality of irrigation water. Escherichia coli is used as the microbial water quality (MWQ) indicator. MWQ monitoring is complicated by the high spatial and temporal variation of E. coli concentrations and requires substantial resources, yet results remain uncertain. UAV-based imaging provides dense spatiotemporal coverage of inland water. Essential metrics of E. coli aquatic habitats, such as chlorophyll concentration, turbidity, and dissolved organic matter content, can be retrieved from remote sensing imagery. This indicates the possibility of characterizing spatial variation of E. coli habitats, discovering persistent spatial patterns, and designing more efficient monitoring. Our goal was to investigate the applicability of RGB (GoPro), multispectral (MicaSense), and hyperspectral (Headwall) imaging for estimating spatial E. coli patterns. The study was performed on commercial irrigation ponds in Maryland and Georgia. Imaging of surface water sources can be complicated for various reasons, including spectral complexity, reflectance properties, calibration and validation needs, and cloud cover. The data are also highly imbalanced. Working with irrigation ponds, we determined that the efficiency of E. coli estimation can be enhanced by data pre- and postprocessing. Specifically, classifying a large number of multispectral images by their quality with the successive projection algorithm and the random forest algorithm leads to objective selection of imagery for further processing. With multispectral imagery, reflectance performed much worse as an input for E. coli estimation with the random forest (RF) algorithm than in situ water quality variables did. Replacing reflectance with remote sensing indices led to a drastic improvement in E. coli estimation accuracy.
Quantile splitting used to define the training and test datasets substantially improved E. coli retrieval results. Postprocessing of the modeling results, required because residuals depend on the absolute value of E. coli concentration and proposed earlier for nitrogen, also improved E. coli modeling results with RGB data and several machine learning algorithms, including the Gradient Boosting Machine and Extreme Gradient Boosting (XGB). New challenges in the MWQ arena arise with growing attention to antibiotic-resistant bacteria and cyanotoxins. Our pilot projects showed that the combination of UAV imaging and machine learning can effectively evaluate spatial distributions and find persistent spatial patterns of emerging MWQ pollutants.
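The quantile-splitting idea mentioned above — defining train and test sets so that both cover the full range of E. coli concentrations rather than leaving the rare high values out of one of them — can be sketched as follows. The binning scheme and the example concentrations are illustrative assumptions, not the authors' exact procedure:

```python
def quantile_split(samples, n_bins=4, test_every=4):
    """Split samples into train/test so both span all concentration quantiles.

    Sort by concentration, cut into quantile bins, then within each bin
    hold out every `test_every`-th sample for testing.
    """
    ordered = sorted(samples)
    bin_size = max(1, len(ordered) // n_bins)
    train, test = [], []
    for b in range(0, len(ordered), bin_size):
        for i, x in enumerate(ordered[b:b + bin_size]):
            (test if i % test_every == 0 else train).append(x)
    return train, test

# Fabricated E. coli concentrations (CFU/100 mL), heavily right-skewed
concentrations = [1, 3, 5, 8, 12, 20, 35, 60, 110, 250, 400, 900]
train, test = quantile_split(concentrations)
```

A naive random split on skewed data like this can easily leave the test set without any high-concentration samples; the quantile split guarantees each bin contributes to both sets.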
4. (2:00-2:20) - Using Super-Resolution to Extend the Spectral, Spatial, and Temporal Ranges of UAS Imagery
Masrur, Arif, ESRI
Peder A. Olsen, Paul R. Adler, Matthew W. Myers, Nathan Sedghi, Ray R. Weil
Unmanned Aircraft Systems (UAS) and satellites are important tools in precision management. However, while satellite imagery is too coarse for targeted applications, UAS becomes impractical for large areas and/or frequent coverage. Furthermore, a broad spectral range for UAS is currently only available with expensive sensors. This study proposed and demonstrated novel methodological approaches that use a super-resolution convolutional neural network (SRCNN) and statistical projection methods to fuse targeted UAS and satellite imagery across spatial, temporal, and spectral domains, allowing the best characteristics of both to be extracted and merged into a cost-effective platform for precision management. The main objective was to compare the performance of reconstructed high-resolution Sentinel-2 and spectrally extended UAS RGB (red, green, blue) imagery for improved estimation of cover crop biomass and nitrogen (N), relative to low-resolution Sentinel-2 and very high-resolution hyperspectral ground truth. Using an example case from winter cover cropping in Maryland, United States, our cross-validated random forest modeling results suggest that UAS data do not have to be collected at every location and time point for which satellite data are available; rather, a farmer could fly a subset of their fields with an RGB camera and leverage this to extend 1) the spectral range and resolution of available satellite imagery, 2) coverage across the whole farm, and 3) coverage across the growing season at the frequency of satellite flights. This would allow them to extend the spectral range of UAS-RGB over the critical red edge, near-infrared, and short-wave infrared regions, which improves biomass and N estimation (with an error reduction between ~14% and ~68%) and could also increase detection sensitivity to weeds, disease, and insect infestations.
A spectral resolution of about 100 nm is sufficient for characterizing both biomass and N; however, spectral range matters more for N than for biomass, whereas spatial resolution is very important for both. We have also shown that, without flying, the proposed Spatial-SRCNN yields biomass and N predictions that are better than predictions from actual UAS-RGB data. Thus, a farmer could stop flying UAS once a specialized spatial extension model has been trained from targeted UAS-RGB data. The robustness of our proposed strategies should be further investigated with other potential applications (e.g., detecting weeds, disease, and insect infestations) in precision farming. Future studies should also focus on further improving spectral, spatial, and temporal extension models using datasets with larger spatiotemporal coverage and various combinations of spectral bands and sensors.
Responsible Use of AI In USDA Research
Session Date: Tuesday, November 19th
Session Time: 3:00pm-4:20pm
Session Location: Corps
Session Moderator: Cynthia Parr
1. (3:00-3:20) - Responsible Use of AI for Researchers
Parr, Cynthia, ARS
Ricardo Millan
USDA has a robust artificial intelligence strategy that drives innovation through workforce empowerment and capacity-building, and that provides risk-based frameworks to ensure ethical and responsible use of AI. But USDA is not the only stakeholder in the arena, and a plethora of federal and industry policies and resources challenges researchers to understand what all of this means for them. When should they be contributing to an AI inventory? When and from whom do they need approval to proceed? The answers often depend on the nature of the work (scholarly research versus research towards operations; generative AI versus other kinds of AI) and which component agency funds the work. Ultimately, researchers want to know how they can spend more time doing exciting research, with appropriate safeguards, and less time on bureaucracy. In this talk, we review current USDA policies, analyze definitions, and highlight their impact on the research enterprise. We compare policies and practices from other federal research funders, and identify additional needs for guidance, shared tools, and processes for USDA-funded researchers. What are considerations for proposal or peer-review preparation? How can we avoid the risks of bias, privacy loss, cybersecurity threats, and unsustainability? We provide nuggets answering these questions, complementing deeper discussion in training workshops. Finally, we describe opportunities to engage on any of these topics with USDA staff by participating in the USDA Center of Excellence, and with the international community by participating in the Research Data Alliance.
2. (3:20-3:40) - Data Leakage and Dataset Shift: the twin gremlins of AI
Rivers, Adam, ARS
Papers that apply predictive machine learning to biological questions are increasingly filling journals. How do we evaluate the quality of this research and catch potential errors in the AI methods we apply? While there are many ways to flub an AI analysis, most published mistakes arise from data leakage and dataset shift. These errors falsely inflate performance metrics, so they are “naturally selected” in our current publishing ecosystem. Data leakage arises when information used to evaluate the model is shared with it during training. This issue can occur in subtle and unintuitive ways, even when data is split into testing and training data sets. The second issue arises when the data a model was trained on differs from the data it was applied to. This talk will explain the different types of errors associated with data leakage and dataset shift and provide examples and practical advice for identifying these issues in scientific literature and in your experimental designs.
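A minimal sketch of the most common form of leakage discussed here: computing preprocessing statistics (e.g., normalization parameters) on the full dataset before splitting, so the test set silently informs training. The code below contrasts the leaky and correct orders of operations on synthetic data:

```python
import random

def mean_sd(values):
    """Population mean and standard deviation."""
    m = sum(values) / len(values)
    return m, (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5

def normalize(values, mean, sd):
    return [(v - mean) / sd for v in values]

random.seed(0)
data = [random.gauss(0, 1) for _ in range(100)]
train, test = data[:80], data[80:]

# LEAKY: statistics computed on ALL data, so the test set informs scaling
m_all, sd_all = mean_sd(data)
leaky_test = normalize(test, m_all, sd_all)

# CORRECT: statistics computed on the training split only
m_tr, sd_tr = mean_sd(train)
clean_test = normalize(test, m_tr, sd_tr)
```

The two normalized test sets differ, and on real problems (especially with correlated or grouped samples) the leaky version systematically inflates reported performance.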
3. (3:40-4:00) - Assessing Use of Generative AI in Converting Legacy Software to Python and R
Tarter, Alex C., NASS
Linda J. Young
The USDA’s National Agricultural Statistics Service (NASS) uses hundreds of programs in various computer languages with thousands of lines of code to draw samples, analyze the data collected through surveys, and produce official statistics for its reports. Various computer programming languages have been used over time, some of which are expensive, no longer supported, or otherwise not recommended for contemporary use, including SAS-AF, Visual FoxPro, and Perl. The modernization process of converting NASS’s legacy production code to freeware, such as Python and R, is challenging and time consuming. Our research focuses on assessing whether Generative Artificial Intelligence (GenAI) can be used effectively to assist in the timely and accurate conversion of scripts into Python and R. As an early proof of concept, we are exploring the proportional time savings in the conversion process of the Genesis software from SAS-AF into Python, in which study participants from NASS’s Research & Development Division (RDD) are assisted by the ARS GovChat tool. Perceived time savings and challenges associated with the learning curve of the built-in language assistant are also explored to help establish best practices for future NASS-wide GenAI usage. Analysis of the completion time metrics intends to assess the viability of using GovChat to convert NASS’s legacy systems.
4. (4:00-4:20) - AI-assisted rapid reviews and summaries of ARS’s scientific research
Stucky, Brian, ARS
Lorna Drennen, Troy Hamilton, Haitao Huang, Stan Kosecki, Simon Liu, Joon Park, Cyndy Parr, Heather Savoy, Pam Starke-Reed
As USDA’s principal scientific research agency, the Agricultural Research Service (ARS) generates an enormous volume of research outputs every year. At any given time, ARS scientists are working on hundreds of active research projects which result in peer-reviewed publications, news releases, annual research accomplishments reports, and collaboration agreements with external research partners. ARS’s scientific leadership is frequently called upon to provide review-style syntheses or briefings of ARS’s scientific research on particular topics or questions. Writing scientific reviews is typically an extremely laborious process. We are investigating the efficacy of combining semantically indexed, text-based research products with generative large language models to reduce the labor required to write research reviews and summaries. In this talk, we will present our implementation approach, source datasets, and progress to date. We will also discuss methods for assessing and ensuring the quality of the results generated by our system.
Robotics and Sensors
Session Date: Wednesday, November 20th
Session Time: 1:00pm-2:20pm
Session Location: Oak
Session Moderator: Raymond Ansotegui
1. (1:00 - 1:20) - Enhancing Precision Agriculture: The Symbiosis of Sensors and AI for Smarter, Sustainable Farming
Tabassum, Shawana, University of Texas at Tyler
Ling, Kai-Shu, USDA-ARS, U.S. Vegetable Laboratory, Charleston, SC 29414
Precision agriculture technologies, including image sensors, unmanned aerial vehicles, and crop and soil sensors, rely on surface-level information and lack real-time chemical profiling, which makes them inadequate for detecting the early onset of crop or soil health issues. Advanced sensors can significantly improve artificial intelligence (AI) models in agriculture by enhancing the quality and accuracy of the data used for analysis. The Center for Smart Agriculture Technology (CeSAT) at the University of Texas at Tyler is developing low-cost, field-deployable sensors and integrated systems for monitoring crop biotic and abiotic stresses, water quality, and soil nutrient dynamics, which will support AI with accurate data collection, multi-modal data integration, and real-time monitoring and feedback. In collaboration with scientists at the USDA-ARS U.S. Vegetable Laboratory in Charleston, SC, CeSAT is currently conducting research on mountable sensors to help develop an autonomous robotic crop production system for fruiting crops (tomato and strawberry) in controlled environment agriculture. Data from these sensors will be integrated into a machine learning model to develop an extensive database. Clean, accurate sensor data enhance AI model training by allowing the model to identify patterns more reliably. Well-trained models are less prone to overfitting or underfitting, thus improving generalization.
2. (1:20-1:40) - From Labor to Automation: AI Innovations in Horseradish Weed Management
Shajahan, Sunoj, University of Illinois
Abhinav Pagadala, Elizabeth Wahle, Dennis Bowman, John F Reid
Illinois has a rich history as a leading producer of horseradish, dating back 150 years. However, a persistent challenge in horseradish production is the lack of effective weed management solutions throughout the growing season. While AI-based commercial solutions for weed management, such as John Deere's See & Spray, are emerging in commodity crop production, specialty crops like horseradish remain underserved. Horseradish requires season-long weed control, as it is planted from April to May and harvested from October through April of the following year, and the application of post-emergent herbicides is restricted. Currently, weed control relies heavily on manual labor and implements that can only pull weeds taller than the crop. Compounding these challenges, increasing herbicide resistance among weeds such as waterhemp and Palmer amaranth further complicates management efforts. This project aims to develop AI vision-based weed control solutions tailored specifically for horseradish producers, addressing the weed challenges they face. We are developing a mechanical weeding robot based on Farm-ng's Amiga platform, integrating advanced technologies for improved weed management. Our approach comprises three stages: lane detection and autonomous navigation, weed detection through AI-based object detection, and an actuator mechanism for removing weeds between rows. The platform is currently in the development phase, and we aim to have a fully functional weeding robot ready by next summer. We plan to perform frequent data collection in horseradish farms next year. These datasets will be valuable in developing a reliable AI model for object detection with improved inference times, as well as improving the mechanical weeder's response capabilities. By integrating advanced AI solutions, we hope to offer horseradish growers an effective alternative for weed management, ultimately improving productivity and sustainability in the industry.
3. (1:40-2:00) - Leveraging Anomaly Detection and Timeseries Data in Expert-In-The-Loop Self-Supervised Learning for Powdery Mildew Classification in Microscopy Imaging
Cadle-Davidson, Lance, ARS
Anna Underhill, Lance Cadle-Davidson, Yu Jiang
Blackbird is a microscopic imaging robot that significantly enhances the throughput of acquiring and analyzing microscopic images, originally developed for selecting disease-resistant grape cultivars. While this system has garnered interest across various crops and diseases, its reliance on supervised learning requires numerous labeled data instances for model training. Self-supervised learning (SSL) can leverage the available data but depends on the quality of the training dataset to learn the features relevant to a given task. Previous methods leverage techniques such as similarity to generate better datasets for the task. In this work, we leverage the spatiotemporal changes of disease infection to generate an optimized dataset for learning binary classifiers, enabling genetic insights into disease resistance mechanisms. We aim to develop an expert-in-the-loop, self-supervised-learning-based pipeline for binary classification of powdery mildew in microscopy imaging. Specifically, we evaluate anomaly detection (AD) to improve dataset selection for SSL and evaluate different input modalities to the SSL pipeline. Leaf disk images were collected using a leaf disk assay with Blackbird, capturing time-series images from 0 to 9 days after inoculation (DAI) at a 1-day interval. Anomaly detection is trained on the first day of imaging, when the leaves are healthy, and the reconstruction error for subsequent days is used to prepare the dataset for SSL. After SSL, an expert selects the best clusters in the projected dataset for the extraction of pseudo-labels. The best SSL model can achieve performance similar to zero-shot inference from a fully supervised model. Pseudo-labels are then used for supervised training and further evaluation. We show that AD provides better embeddings for the extraction of pseudo-labels, leading to a higher F1-score.
Further, we demonstrate that when the SSL pipeline leverages the time series data for improved embedding generation, it generates a projection that provides superior pseudo-labels and encodes different disease progressions. Furthermore, the developed pipeline enables genetic analysis of PM infection dynamics, providing valuable insights into plant disease response mechanisms. This innovative approach holds promise for advancing plant pathology research and breeding programs.
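The day-0 baseline idea in this pipeline can be sketched with a simple thresholding rule: fit the reconstruction-error distribution on healthy imagery and flag later images whose error exceeds it. The 3-SD threshold and the error values below are illustrative assumptions, not the authors' actual detector:

```python
def flag_anomalies(healthy_errors, new_errors, n_sd=3.0):
    """Flag images whose reconstruction error exceeds the healthy baseline.

    The detector sees only day-0 (healthy) images during training, so
    later images reconstruct poorly wherever infection changes the leaf.
    """
    m = sum(healthy_errors) / len(healthy_errors)
    sd = (sum((e - m) ** 2 for e in healthy_errors) / len(healthy_errors)) ** 0.5
    threshold = m + n_sd * sd
    return [e > threshold for e in new_errors], threshold

healthy = [0.10, 0.12, 0.11, 0.09, 0.10, 0.13, 0.11, 0.10]  # day-0 errors
day5 = [0.11, 0.45, 0.10, 0.52]                             # 5 DAI errors
flags, threshold = flag_anomalies(healthy, day5)
```

The flagged images (those well above the healthy baseline) are the ones that would feed the SSL dataset-selection step described above.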
Soil Science Applications
Session Date: Tuesday, November 19th
Session Time: 3:00pm-4:20pm
Session Location: Ross
Session Moderator: Curtis Ransom
1. (3:00-3:20) - Soil Health Classification Framework for Florida Soils using K-Means Clustering
Chatterjee, Amitava, ARS
Yaslin Gonjalez and Gabriel Maltais-Landry
Soil health assessments aim to quantify soil health indicators linked to specific ecosystem services such as crop production, nutrient dynamics, and water storage. The main challenge in interpreting soil health indicators is distinguishing between properties responding to management and inherent soil properties like topography. The goal of grouping similar soils is to increase the likelihood of detecting changes in soil health indicators linked to management by accounting for variation arising from inherent soil properties. This study explores the utility of cluster analysis to group similar soils and aid soil health classification for the state of Florida. Soil property data from the Soil Survey Geographic Database (SSURGO) were used for the k-means cluster analysis. Components that covered 15% of the map unit were selected, and depth-weighted and component-weighted averages from 0-10 cm were calculated for selected soil properties. The Calinski-Harabasz index (variance ratio criterion) was used to compare clustering solutions by identifying the value of K at which the index reaches its absolute maximum. Clustering into six conceptual groups based on the top five soil properties was determined to be the optimal algorithm output. The clusters could differentiate between zones of topsoil variation at the field scale. This approach could be easily adopted at other locations and scales to produce conceptual soil groups and associated maps to support soil health sampling and interpretation at the field scale.
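The variance ratio criterion used above to pick K can be computed directly. A minimal one-dimensional sketch with fabricated cluster assignments (the real analysis clusters multivariate SSURGO properties):

```python
def calinski_harabasz(clusters):
    """Variance ratio criterion: between-cluster dispersion over
    within-cluster dispersion, scaled by degrees of freedom.
    Higher values indicate a better-separated clustering."""
    points = [x for c in clusters for x in c]
    n, k = len(points), len(clusters)
    grand_mean = sum(points) / n
    means = [sum(c) / len(c) for c in clusters]
    between = sum(len(c) * (m - grand_mean) ** 2
                  for c, m in zip(clusters, means))
    within = sum(sum((x - m) ** 2 for x in c)
                 for c, m in zip(clusters, means))
    return (between / (k - 1)) / (within / (n - k))

tight = [[1.0, 1.1, 0.9], [10.0, 10.2, 9.8]]  # well-separated grouping
mixed = [[1.0, 10.2, 0.9], [10.0, 1.1, 9.8]]  # poorly separated grouping
```

Computing the index for candidate values of K and taking the maximum is exactly the model-selection step described in the abstract.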
2. (3:20-3:40) - Using a random forest regression model for determining plant available nutrient concentrations from portable x-ray fluorescence and Mehlich-III measurements
Blackstock, Joshua M., ARS
T. G. Brewer; M. Mancini, D. M. Miller; M. Berry; H. E. Winzeler; Z. Libohova; P. R. Owens
Determining plant-available soil nutrient concentrations (PASNC) by the Mehlich-III (MIII) method is time- and resource-intensive. Applying machine learning to develop pedotransfer functions (PDFs) that use paired portable x-ray fluorescence (pXRF) and MIII measurements to build predictive PASNC models has been successfully demonstrated in tropical and temperate soil studies, but the accuracy of the estimates varies with pedogenic characteristics, requiring further testing. In this study, we developed a random forest (RF) model to predict the PASNC of Ca, K, Mg, P, S, and Zn from the total elemental concentrations of soils measured by pXRF. Soil samples were collected at the Cornett Research Farm in the central U.S. state of Missouri. Coefficient of determination (R2) values using RF for Ca, K, Mg, P, S, and Zn were 0.76, 0.19, 0.71, 0.008, 0.79, and 0.29, respectively. Root mean square error percent error (RMSEPE) values for Ca, K, Mg, P, S, and Zn were 9.9, 36.5, 22.9, 41.4, 17.2, and 35.6%, respectively. Model performance was better for Ca, Mg, and S than for K, P, and Zn. The most commonly important pXRF parameters were transition metals, silica, and aluminosilicates, and we inferred this relationship to be driven by parent material mineralogy and differential weathering across the study area. Preliminary findings from this study are encouraging and suggest that pXRF-MIII PDFs could be used to predict PASNC in temperate soils. Increasing the number of samples and the spatial extent of temperate soils sampled would yield more accurate predictions. Increased accuracy will allow more rapid soil fertility and soil health assessments at finer spatial resolutions in temperate agricultural regions.
3. (3:40-4:00) - Machine Learning Algorithms for complex distributed soil hydrology models and digital soil mapping
Libohova, Zamir, ARS
Jiaxiao Wei, Jainy Bhatt, Joshua Blackstock, Edwin H. Winzeler, Marcelo Mancini, Jianzhong Su, Phillip R. Owens, Kabindra Adhikari, Amanda Ashworth
Machine learning and geostatistics are widely used tools in digital soil mapping (DSM) to predict soils and their properties at higher resolutions suitable for precision agriculture and other applications. Soil water movement, one of the major drivers of soil development, is represented by terrain features derived from high-resolution digital elevation models (DEMs) such as LiDAR and from remote sensing (RS). DSM's reliance on DEM and RS data, as well as on the quality and density of point field observations, often leads to high-resolution but static two-dimensional (2D) maps of soils and their properties. Distributed hydrological models are process-based, physically driven models that capture soil water movement over and through the soil at fine spatial and temporal resolutions. Using such models provides an opportunity to add two other dimensions (depth and time) to mapping soils and their properties, leading to a four-dimensional digital soil mapping (4DSM) approach, but utilizing outputs from hydrological model simulations for DSM is not well understood owing to model output complexity. Using a machine learning algorithm for time-series classification of model cells, i.e., pixels, based on spatiotemporal variation among cells could elucidate patterns not recognized through simple visual or single-pixel time-series analysis. The Distributed Hydrology Soil Vegetation Model (DHSVM) was applied to a pasture watershed to simulate soil moisture (SM) distribution with depth and water table depth (WTD) at a daily time step. The simulation generated moisture time series for three depths and multiple pixels across the watershed. Temporal patterns simulated by DHSVM matched measurements from moisture sensors and wells installed throughout the watershed. The multidimensional SM and WTD dataset was analyzed with the Dynamic Time Warping (DTW) machine learning algorithm, which identified similar time series at varying timescales for each depth and clustered them annually and seasonally.
The distinct clusters among seasons and with depth showed that spatiotemporal soil variability is lost when soils are assessed statically. This approach provides a step toward transforming the US Soil Survey Geographic Database (SSURGO) from 2D to 4D. However, distributed hydrological models are data demanding and comprise multiple feedbacks and links representing the complexity of the physical laws and processes governing soil water movement. Their potential can be fully realized only when the impact of hundreds of parameters (individually or in combination) on predictive power and accuracy can be assessed. Such sensitivity analysis can be supported by machine learning to improve predictive ability concerning hydrological dynamics and soil variability. By employing refined hyperparameters, a model-independent neural network can be established for digital soil mapping, as demonstrated by some of the preliminary results presented here.
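The DTW distance at the core of the clustering step can be sketched in pure Python. The recurrence below is the standard one; the soil-moisture traces are fabricated for illustration:

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two time series.

    Unlike point-wise Euclidean distance, DTW aligns series that share a
    pattern but are shifted or stretched in time (e.g., a delayed wetting
    front at a downslope pixel).
    """
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m]

wet_front = [0.1, 0.1, 0.4, 0.8, 0.4, 0.1]
delayed   = [0.1, 0.1, 0.1, 0.4, 0.8, 0.4]  # same pattern, one step later
flat      = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
```

Pixels whose traces have small pairwise DTW distances (here, the two wetting-front series despite their time offset) fall into the same cluster, which is what makes DTW suited to the seasonal and annual groupings described above.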
4. (4:00-4:20) - Leveraging Random Forest Models to Optimize Phosphorus Fertilization in Rainfed Corn: Insights from Texas Blackland Prairie Fields
Stevens, Bo Maxwell, ARS
Kabindra Adhikari, Peter J.A. Kleinman, Joshua M. McGrath, Kyle R. Mankin, Douglas R. Smith
Soil phosphorus (P) management is crucial for both crop yield optimization and environmental sustainability. In this study, we leveraged machine learning techniques, specifically random forest models, to analyze soil test data and optimize P fertilization strategies for rainfed corn (Zea mays L.) grown in Central Texas. The focus was on comparing predictions from two commonly used soil tests: Mehlich-3 and Haney (H3A) in five fields with varying soil P levels. Traditional recommendations often yield conflicting results, presenting challenges for precision agriculture. Random forest models were applied to four years of high-resolution yield data (2018-2022), including spatiotemporal soil characteristics, environmental variables, and soil test results. Our analysis demonstrated that random forest models, when incorporating multiple soil test metrics alongside environmental data, significantly outperformed traditional methods for predicting yield responses to P application. The Mehlich-3 test provided more consistent predictions of yield response, accurately identifying P needs 85% of the time, compared to 65% for H3A. A key insight from our analysis is the potential to reduce P fertilization rates by up to 50% without compromising yield in low-P fields. This approach not only optimizes yield but also minimizes environmental impacts associated with over-application of fertilizers.
Sustainability
Session Date: Wednesday, November 20th
Session Time: 10:30am-11:50am
Session Location: Hullabaloo
Session Moderator: Menberu Meles
1. (10:30-10:50) - Integrating domain knowledge and artificial intelligence for sustainable agriculture
Peng, Bin, Assistant Professor, Crop Sciences, University of Illinois
Kaiyu Guan, Licheng Liu, Zhenong Jin
2. (10:50-11:10) - Prediction of turfgrass quality using multispectral UAV imagery and Ordinal Forests: Validation using a fuzzy approach
Hernandez, Alexander, Research Computational Biologist, ARS
Shaun Bushman, Paul Johnson, Matthew Robbins and Kaden Patten
Protocols to evaluate turfgrass quality rely on visual ratings that, depending on the rater’s expertise, can be subjective and susceptible to positive and negative drifts. We developed seasonal (spring, summer, and fall) as well as inter-seasonal machine learning predictive models of turfgrass quality using multispectral and thermal imagery collected with unmanned aerial vehicles over two years as a proof of concept. We chose ordinal regression to develop the models instead of conventional classification to account for the ranked nature of the turfgrass quality assessments. We implemented a fuzzy correction of the resulting confusion matrices to ameliorate the probable drift of the field-based visual ratings. The best seasonal predictions were rendered by the fall model (multiclass AUC: 0.774, original kappa: 0.139, corrected kappa: 0.707). However, the best overall predictions were obtained when observations across seasons and years were used for model fitting (multiclass AUC: 0.872, original kappa: 0.365, corrected kappa: 0.872), clearly highlighting the need to integrate inter-seasonal variability to enhance model accuracy. Vegetation indices such as NDVI, GNDVI, RVI, and CGI and the thermal band can render as much information as a full array of predictors. Our protocol for modeling turfgrass quality can be followed to develop a library of predictive models for different settings where turfgrass quality ratings are needed.
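The fuzzy correction applied in this study is not reproduced here, but the underlying concern, that near-miss ordinal ratings should not be penalized like gross errors, can be illustrated with linearly weighted Cohen’s kappa, a standard agreement statistic for ranked classes. A minimal pure-Python sketch with invented ratings:

```python
def weighted_kappa(rater_a, rater_b, n_levels):
    # Linear-weighted Cohen's kappa for ordinal ratings coded 0..n_levels-1.
    n = len(rater_a)
    obs = [[0] * n_levels for _ in range(n_levels)]
    for a, b in zip(rater_a, rater_b):
        obs[a][b] += 1
    # Linear penalty: proportional to how many levels apart the two ratings are.
    w = [[abs(i - j) / (n_levels - 1) for j in range(n_levels)] for i in range(n_levels)]
    row = [sum(obs[i]) for i in range(n_levels)]
    col = [sum(obs[i][j] for i in range(n_levels)) for j in range(n_levels)]
    observed = sum(w[i][j] * obs[i][j] for i in range(n_levels) for j in range(n_levels))
    expected = sum(w[i][j] * row[i] * col[j] / n
                   for i in range(n_levels) for j in range(n_levels))
    return 1 - observed / expected

perfect = weighted_kappa([0, 1, 2, 1, 0], [0, 1, 2, 1, 0], 3)
drifted = weighted_kappa([0, 1, 2], [1, 2, 2], 3)  # ratings drifted up one level
```

Off-by-one disagreements incur half the penalty of two-level misses, which is the spirit of crediting ratings that drift by a single quality class.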
3. (11:10-11:30) - Development and implementation of AI tools for farm-to-table food safety
Qian, Luke, Cornell University
Wiedmann, Martin, Cornell University
Renata Ivanek, Jayadev Achary, Qing Zhao
This presentation will review applications of AI in food safety, including food safety risk prediction and monitoring as well as food safety optimization throughout the supply chain. However, AI technologies in food safety lag in commercial development because of obstacles such as limited data sharing and limited collaborative research and development efforts. Future actions should be directed toward applying data privacy protection methods, improving data standardization, and developing a collaborative ecosystem to drive innovation in AI applications to food safety. While the causes of limited use of AI in food safety are multi-faceted, one likely barrier is the lack of publicly available, well-curated datasets that different groups can use to develop and validate AI tools; other fields have addressed this issue through the distribution of publicly available datasets for developing and training AI tools. To address this gap, we have developed an initial dedicated Food Safety Machine Learning Repository, which will also be discussed as part of this presentation.
4. (11:30-11:50) - Machine learning applications in hydrology, with application to drought frequency, water quality, and flood forecasts
Livsey, Daniel, ARS
Examples of machine learning applications in hydrology will be discussed. Case examples will include prediction of past drought conditions in the Southwestern United States, water quality impacts in the Great Barrier Reef, and lake level forecasting in the Central United States. Following the case examples, discussion will cover how to (a) reduce monitoring costs using machine learning algorithms and (b) advance understanding of physical processes from empirical approaches.
Systems-Level Applications
Session Date: Tuesday, November 19th
Session Time: 1:00pm-2:20pm
Session Location: Corps
Session Moderator: Jacqueline Campbell
1. (1:00 - 1:20) - Examples for AI Use and Emerging Trends in Systems Dynamics Modeling
Papanicolaou, Thanos, ARS
Keshav Basneta, Peter L. O’Brien, Ken M. Wacha, Robert W. Malone, David W. Archer
Agroecosystems comprise environmental, economic, and social components with complex interactions that affect systemwide performance. Attempts to describe or predict how agroecosystems respond to management must account for these interconnected components, so approaches that are limited to a single discipline cannot capture the complexities necessary for a holistic understanding of performance. The goal of this research is to develop a system dynamics (SD) modeling framework that can provide quantitative measures of the consequences of management on each component of an agroecosystem. An SD framework is proposed with a description of model components, as well as an illustration of methodological steps to evaluate model performance through calibration, validation, and sensitivity testing. The structure of the model relies on a complex web of (i) stocks that describe the system status, (ii) flows that represent the directionality and rates of change, and (iii) auxiliary parameters that provide quantitative values to each component. The capacity of the model to adequately evaluate agroecosystem response is demonstrated using a case study investigating environmental, economic, and social indicators while manipulating multiple management practices, including cover crops, tillage, and integration of crop and livestock operations. Importantly, the SD model identified tradeoffs in the three indicators that accurately reflect producer experiences when making management decisions.
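The stock-and-flow structure described above reduces, numerically, to integrating rate equations forward in time. The sketch below is a generic single-stock Euler integration, not the authors’ model; the soil-carbon stock, constant inflow, and first-order loss are invented for illustration.

```python
def simulate(stock0, inflow, outflow, dt=0.1, t_end=10.0):
    """Euler integration of one stock: d(stock)/dt = inflow(t) - outflow(stock)."""
    stock, t, path = stock0, 0.0, [stock0]
    while t < t_end - 1e-9:
        stock += dt * (inflow(t) - outflow(stock))  # flows update the stock
        t += dt
        path.append(stock)
    return path

# Illustrative stock: soil organic carbon (Mg/ha) with a constant input flow
# and a first-order decay loss flow; it relaxes toward inflow/decay = 40.
path = simulate(stock0=50.0, inflow=lambda t: 2.0, outflow=lambda s: 0.05 * s)
```

A full SD model chains many such stocks together, with auxiliary parameters feeding the flow expressions; calibration and sensitivity testing then vary those parameters and compare simulated trajectories to observations.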
2. (1:20-1:40) - Multi-Task Learning for Low-Data Problems: Case Study in Cold-Hardiness Prediction
Fern, Alan, Oregon State University
Aseem Saxena, Kristen Goebel, Paola Pesantez-Cabrera, Rohan Ballapragada, Kin-Ho Lam, Markus Keller
There are many prediction problems in agriculture where large datasets are not available. One approach to addressing this issue is to combine multiple small datasets for related problems via multi-task and transfer learning. This talk will describe two applications of these approaches to the problem of cold-hardiness prediction for grapes and cherries, which is critical for frost-mitigation decision making. The results demonstrate the utility of combining multiple small datasets and have produced models that are currently available to crop managers on AgWeatherNet.
3. (1:40-2:00) - Machine Learning Algorithms for Digital Soil Mapping of Soil Properties
Winzeler, H. Edwin, University of Texas at Arlington
Marcelo Mancini, Joshua M. Blackstock, Jianzhong Su, Phillip R. Owens, Amanda Ashworth, Zamir Libohova
Accounting for spatial heterogeneity of soil properties relevant to soil fertility and crop management is a daunting task. Point samples must be extracted at appropriate density to account for changes over geographic space. The extent to which soils vary over space for a site is often not fully known, leading to problems of over-sampling and/or under-sampling to produce continuous prediction surfaces or maps. Over-sampling occurs when an excess of soil samples is extracted, leading to expensive and time-consuming analysis of many samples that are sometimes so similar in value and geographic space to neighboring samples that they contribute little to understanding the variability. Under-sampling occurs when the number of samples is insufficient to account for spatial patterns of geographic heterogeneity, leading to holes or gaps in the understanding of a site’s variability. In this study we intentionally developed an over-sampling approach to characterizing the variability of soil properties at the Stoneville ARS Long-Term Agroecosystem Research (LTAR) site in the Lower Mississippi River Basin (LMRB) network. We extracted and analyzed 2,145 soil samples for a 250-ha research site and analyzed soil properties including soil organic matter (SOM) content. We then applied a random forest (RF) machine learning model to the full dataset and to subsequent subsets that were randomly generated by iteratively and progressively omitting data points and assessing model performance. The objective was to find the sampling density whereby over-sampling grades into under-sampling. The machine learning models were trained using the point samples and reflectance data from the ESA Sentinel-2 program. To assess the usefulness of the machine learning models we also applied traditional geostatistical techniques of ordinary spherical kriging (OSK), Empirical Bayesian Regression Kriging (EBK), and inverse distance weighting (IDW).
We also assessed their performance with subsets of the original data to determine the inflection point or region at which over-sampling graded into under-sampling. The degradation of RF model performance as the sample size decreased (progressively smaller datasets) was compared to the corresponding degradation for the models made without machine learning (IDW, EBK, and OSK). This use-case study shows that models of soil properties made with machine learning informed by spatial statistics of satellite reflectance are much more robust than models made without machine learning, and therefore require a smaller sampling density to achieve accurate results. Additionally, the predicted spatial patterns of SOM reflected signatures of well-understood soil-landscape processes, indicating that incorporation of prior knowledge into machine learning can further contribute to optimization of sample size.
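The over- vs under-sampling tradeoff probed here can be mimicked on synthetic data: interpolate a surrogate smooth surface from progressively sparser sample grids and watch the error grow. The sketch below uses only IDW (one of the study’s geostatistical baselines) on an invented surface, with an exponent of 4 to keep the estimate local; it illustrates the degradation curve, not the study’s RF-plus-Sentinel-2 workflow.

```python
import math

def field(x, y):
    # Synthetic smooth surface standing in for a mapped soil property (e.g. SOM).
    return 3.0 + math.sin(x / 10.0) + math.cos(y / 15.0)

def idw(sample, x, y, power=4):
    # Inverse distance weighting; a higher power localizes the estimate.
    num = den = 0.0
    for sx, sy, sv in sample:
        d2 = (sx - x) ** 2 + (sy - y) ** 2
        if d2 == 0.0:
            return sv
        w = 1.0 / d2 ** (power / 2)
        num += w * sv
        den += w
    return num / den

def grid_sample(spacing):
    # Pretend soil cores are pulled on a regular grid at the given spacing (m).
    return [(float(x), float(y), field(x, y))
            for x in range(0, 100, spacing) for y in range(0, 100, spacing)]

# Fixed off-grid evaluation points over the 100 m x 100 m toy site.
test_pts = [(7.3, 11.1), (33.7, 52.9), (61.2, 18.4), (88.8, 77.7), (45.5, 95.1)]
errors = {}
for spacing in (5, 25):  # dense vs sparse sampling
    sub = grid_sample(spacing)
    errors[spacing] = sum(abs(idw(sub, x, y) - field(x, y))
                          for x, y in test_pts) / len(test_pts)
```

Sweeping the spacing (or, as in the study, randomly thinning an over-sampled point set) and plotting the error traces out the region where over-sampling grades into under-sampling.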
4. (2:00-2:20) - Modeling the dynamic fire management environment with FireCon, a daily fire control suitability surface
O’Connor, Christopher, USFS
Rahul Wadhwani, Eduardo Rodriguez, Matthew Whitley, Zack Holden
Successful wildfire containment is influenced by a range of physiographic and risk factors that can be known in advance, as well as dynamic factors that evolve throughout the incident. In the Western United States, a widely adopted approach utilizes the Potential Control Location suitability (PCL) model, an annually produced wildfire control probability surface that incorporates terrain, vegetation, and fire behavior under 90th percentile fire weather conditions. Although this model demonstrates significant effectiveness, it does not account for changing weather, fuel, and fire behavior conditions. This presentation introduces a framework that integrates daily weather, soil moisture, vapor pressure deficit, and potential evapotranspiration into the existing PCL model. We use daily hold and breach locations from fires sampled from the years 2020 to 2022 to train a neural network to adjust PCL maps up or down depending on current inputs. ROC statistics for the independent test set demonstrated improved differentiation of control and breach locations by up to 10% (ROC 0.47-0.59) over static PCL scores (ROC 0.49). We present a case study applied to the 2023 Six Rivers Complex fire in California, demonstrating its potential to enhance situational awareness and inform fire management strategies through real-time data integration.
Lightning Talks + Poster Sessions
Lightning Session I
Session Date: Tuesday, November 19th
Session Time: 1:00pm-2:20pm
Session Location: Hullabaloo
Session Moderator: Raymond Ansotegui
1. (1:00-1:05) - Machine learning assisted functional protein identification focused on binding per- and polyfluoroalkyl substances (PFAS)
Maul, Jude E., ARS
Clifton K. Fagerquist
Using a combination of open-source AI protein-folding models, next-generation space-filling molecular models, and machine-learning-assisted kinetic simulations, we will create libraries of predicted protein binding sites for all PFAS compounds from real and/or theoretical proteins.
2. (1:05-1:10) - PhosBoost: Improved phosphorylation prediction recall using gradient boosting and protein language models
Poretsky, Elly, ARS
Carson M. Andorf, Taner Z. Sen
Protein phosphorylation is a dynamic and reversible post-translational modification that regulates a variety of essential biological processes. The regulatory role of phosphorylation in cellular signaling pathways, protein–protein interactions, and enzymatic activities has motivated extensive research efforts to understand its functional implications. Experimental protein phosphorylation data in plants remains limited to a few species, necessitating a scalable and accurate prediction method. Here, we present PhosBoost, a machine-learning approach that leverages protein language models and gradient-boosting trees to predict protein phosphorylation from experimentally derived data. Trained on data obtained from a comprehensive plant phosphorylation database, qPTMplants, we compared the performance of PhosBoost to existing protein phosphorylation prediction methods, PhosphoLingo and DeepPhos. For serine and threonine prediction, PhosBoost achieved higher recall than PhosphoLingo and DeepPhos (.78, .56, and .14, respectively) while maintaining a competitive area under the precision-recall curve (.54, .56, and .42, respectively). PhosphoLingo and DeepPhos failed to predict any tyrosine phosphorylation sites, while PhosBoost achieved a recall score of .6. Despite the precision-recall tradeoff, PhosBoost offers improved performance when recall is prioritized while consistently providing more confident probability scores. A sequence-based pairwise alignment step improved prediction results for all classifiers by effectively increasing the number of inferred positive phosphosites. We provide evidence to show that PhosBoost models are transferable across species and scalable for genome-wide protein phosphorylation predictions. PhosBoost is freely and publicly available on GitHub.
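The recall and precision-recall numbers reported above come from standard confusion-matrix bookkeeping. A minimal sketch (toy labels and scores, not PhosBoost outputs) showing how lowering the decision threshold trades precision for the higher recall that PhosBoost prioritizes:

```python
def precision_recall(y_true, scores, threshold):
    # Binary confusion-matrix counts at a fixed score threshold.
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented phosphosite labels (1 = phosphorylated) and classifier probabilities.
y_true = [1, 1, 1, 0, 0, 1]
scores = [0.9, 0.7, 0.4, 0.6, 0.2, 0.3]
p_strict, r_strict = precision_recall(y_true, scores, 0.5)   # conservative cutoff
p_loose, r_loose = precision_recall(y_true, scores, 0.25)    # recall-oriented cutoff
```

Sweeping the threshold over all score values and integrating precision against recall yields the area under the precision-recall curve cited in the abstract.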
3. (1:10-1:15) - Generalizing applications of the Blackbird microscopy robot
Cadle-Davidson, Lance, ARS
Rafael Bidese, Yu Jiang
For many foliar pathogens, uncontrolled variables of field phenotyping prevent detection of genetic loci of minor and moderate effect sizes. In the SCRI-funded VitisGen grape breeding project, we developed controlled laboratory automated microscopy phenotyping pipelines to accurately quantify powdery mildew and downy mildew disease severity. Each microscopy robot (which we call Blackbird) images about 100 to 200 discs per hour depending on sample topology. While patch-based convolutional neural network analysis of hyphae or spores outperformed experts at microscopes for QTL detection, saliency mapping further improved the precision and confidence in disease quantification. In some cases, repeated measures over a timecourse revealed resistance loci that were undetected at single timepoints. Blackbird is now commercially available, and our collaborators have drastically expanded the diversity of applications. To help biologists make progress on many new use cases, we are developing unsupervised machine learning strategies to guide model training.
4. (1:15-1:20) - Exploring putative enteric methanogenesis inhibitors using molecular simulations and a graph neural network
Chowdhury, Ratul, Iowa State University
Randy Aryee, Noor S. Mohammed, Supantha Dey, Arunraj B., Swathi Nadendla, Karuna Anna Sajeevan, Matthew R. Beck, A. Nathan Frazier, Jacek A. Koziel, Thomas J. Mansell
Atmospheric methane (CH4) acts as a key contributor to global warming. As CH4 is a short-lived climate forcer (12-year atmospheric lifespan), its mitigation represents the most promising means to address climate change in the short term. Enteric CH4 (the biosynthesized CH4 from the rumen of ruminants) represents 5.1% of total global greenhouse gas (GHG) emissions, 23% of emissions from agriculture, and 27.2% of global CH4 emissions. Therefore, it is imperative to investigate methanogenesis inhibitors and their underlying modes of action. We hereby elucidate the detailed biophysical and thermodynamic interplay between anti-methanogenic molecules and cofactor F430 of methyl coenzyme M reductase and interpret the stoichiometric ratios and binding affinities of sixteen inhibitor molecules. We leverage this as a prior in a graph neural network to first functionally cluster these sixteen known inhibitors among ~54,000 bovine metabolites. We subsequently demonstrate a protocol to identify precursors to, and putative inhibitors for, methanogenesis, based on Tanimoto chemical similarity and membrane permeability predictions. This work lays the foundation for computational and de novo design of inhibitor molecules that retain/reject one or more biochemical properties of known inhibitors discussed in this study.
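Tanimoto chemical similarity, used above to shortlist candidate metabolites, reduces to a Jaccard index over molecular fingerprint bits. A minimal sketch with invented bit indices (real pipelines derive fingerprints with a cheminformatics toolkit such as RDKit):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

# Hypothetical structural-key fingerprints (the bit indices are invented).
known_inhibitor = {1, 4, 9, 12, 20}
candidate_close = {1, 4, 9, 12, 21}   # shares 4 of 6 distinct features
candidate_far = {2, 5, 30}            # shares none
```

Ranking a metabolite library by Tanimoto similarity to known inhibitors, then filtering by predicted membrane permeability, is one plausible reading of the screening protocol sketched in the abstract.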
5. (1:20-1:25) - Using AI, machine learning, and ‘omics’ methodologies to explore the ruminant microbiome, enteric methane emissions and mitigation interventions
Frazier, Anthony Nathan, ARS
Ratul Chowdhury, Matthew R. Beck, Aeriel D. Belk, Randy Aryee, Noor S. Mohammed, Supantha Dey, Arunraj B, Swathi Nadendla, Karuna A. Sajeevan, Thomas Mansell, Jacek A. Koziel
Agricultural food systems and livestock operations account for 30-40% of anthropogenic greenhouse gas (GHG) emissions. Enteric fermentation represents 46% of GHG emissions in the form of methane (CH4), primarily in ruminant livestock operations. Due to an increase in global population, there is a high probability of an increase in ruminant livestock to meet the necessary food demands. Given this, many research efforts have been made to combat enteric CH4 emissions. One area of recent interest is the ruminant microbiome. The rumen microbiota has a unique purpose in serving its host with the required nutrients and energy needed for performance and growth. However, these interactions also result in CH4 production and emissions. While many efforts have shown promise in reducing enteric CH4, there remains a need for integrating new methodologies into CH4 mitigation. Leveraging the strides of artificial intelligence (AI) and computational methods is a potential way to design novel anti-methanogenic compounds that empirical data have yet to elucidate. In our preliminary work, machine learning (ML) frameworks reasoning over bovine metabolites associated with the rumen microbiome have been shown to extract important molecular features for the intended functionality. Past research has indicated the ability of complex calculations via AI, and back-of-the-envelope Fermi calculations, to enhance our understanding of the microbiome’s relationship to animal production and to allow microbial biomarkers to be examined for CH4 emissions. Therefore, current collaborative efforts are exploring AI, ML, and ‘omics’ measurements to reduce CH4 emissions in ruminant livestock. These efforts are geared to increase our abilities both to predict microbial interactions resulting in CH4 emissions and to explore the biochemical signatures of putative methanogenesis inhibitors.
6. (1:25-1:30) - Image-Based Honey Bee Larvae Viral and Bacterial Diagnosis Using Machine Learning
Copeland, Duan, ARS
Brendon M. Mott, Oliver L. Kortenkamp, Robert J. Erickson, and Kirk E. Anderson
Honey bees are essential pollinators of ecosystems and agriculture worldwide. It is estimated that 50-80% of crops are pollinated by honey bees, generating a market valuation of approximately $20 billion in the U.S. alone. However, commercial beekeepers often face an uphill battle, losing around 50% of their hives annually, and must effectively manage disease and parasites to remain economically viable. Colony losses are described as multifactorial, involving combinations of environmental factors and various disease agents including bacteria, viruses, and fungi. Strongly associated with colony decline, these disease agents largely target honey bee larvae and are commonly known as brood diseases. Antibiotics are used to combat brood disease and are effective for the treatment of the bacterial pathogens European Foulbrood (EFB) and American Foulbrood (AFB), but both have evolved antibiotic resistance. Although efforts are in place to control and verify the use of antibiotics on honey bees, many undiagnosed brood diseases with a superficial resemblance to EFB (EFB-like disease caused by viruses) are often misdiagnosed due to similar symptomology. Under these circumstances, commercial beekeepers often prophylactically treat entire apiaries with antibiotics based on the field diagnosis of one or two weak colonies. This action results in dysbiosis of the native gut microbiome and, over the long term, continues selection for antibiotic resistance. Thus, correct field diagnosis of brood disease is challenging and requires years of experience to identify and differentiate various disease states according to subtle differences in larval symptomology. To explore the feasibility of an AI diagnosis tool, we collaborated with apiary inspectors and researchers from around the country to survey brood disease in their local apiaries. We photographed and sampled diseased larvae identified in the field as EFB or virus.
Using next generation sequencing of the larval microbiome and molecular viral screening, we created a dataset of imaged larvae with correct diagnoses to generate a machine learning/AI algorithm. Our approach leveraged transfer learning techniques, utilizing deep convolutional neural networks pre-trained on large-scale datasets. These networks, originally designed for general image classification tasks, were fine-tuned to discriminate between EFB and viral infections in unclassified diseased honey bee larval images. This proof-of-concept study highlights the potential of AI-driven diagnostics in apiculture, offering a tool that could significantly improve the accuracy and speed of brood disease diagnosis in the field. By enabling more precise diagnoses, this technology could lead to more targeted treatments, reduce unnecessary antibiotic use, and slow the development of antibiotic resistance in bee pathogens. Furthermore, this approach opens possibilities for continuous learning and improvement as more data is collected, potentially leading to even more accurate and comprehensive diagnostic capabilities in the future.
7. (1:30-1:35) - Bayesian Learning for Predicting Causal Relationships Among Rice Traits Under Varying Growing Conditions
Richardson, Jared, University of Texas at Arlington
Dr. Shannon Pinson, Dr. Jeremy Edwards, Dr. Jianzhong Su
This collaborative study leverages data-driven agriculture and computational tools to predict causal relationships among rice traits under various growing conditions. Utilizing a combination of multi-omic datasets, we developed unique mixed models for the calculations of Best Linear Unbiased Predictors (BLUPs) and employed Bayesian Networks for the construction of Directed Acyclic Graphs (DAGs). By integrating these datasets with additional single nucleotide polymorphism (SNP) data, we created probabilistic graphical models that illustrate relationships between traits and genetic SNP markers. These findings potentially provide valuable insights for plant pathology and agronomy, with implications for optimizing growing conditions and enhancing our understanding of genomic-trait interactions.
8. (1:35-1:40) - Leveraging AI and High-Performance Computation for Structural Prediction and Dynamics Analysis of Foodborne Bacterial Colicin-Immunity Protein Complexes in Food Production and Safety
Koirala, Mahesh, ARS
Clifton K. Fagerquist
Artificial Intelligence (AI) is revolutionizing our understanding of foodborne bacteria and playing a greater role in enhancing food production and safety. Our work emphasizes the importance and use of AI-powered tools such as AlphaFold2 to predict the 3D structures of bacterial protein complexes, specifically colicins D, E3, and E8 and their immunity protein cognates. Colicins are bacterial toxins produced by pathogenic Escherichia coli that target and kill competing bacteria, while immunity proteins protect the host bacteria from their own colicins. These protein complexes are crucial to microbial competition, survival, and regulation within bacterial communities, directly affecting the safety of food production environments. To explore the interaction dynamics of these protein complexes, molecular dynamics (MD) simulations were performed with GROMACS, leveraging the computational power of SCINet’s Ceres and Atlas systems with GPU nodes. These MD simulations provided detailed insight into the structural stability and binding behavior of colicin-immunity complexes over time. The ability to model such dynamics allows for a better understanding of how bacteria regulate intra- and inter-species competition, which is essential for controlling microbial populations in food-related environments. By integrating AI and high-performance computing, this research provides valuable knowledge on how colicin-immunity complexes function, offering potential strategies for managing microbial populations in food safety. This work will contribute to the advancement of food safety protocols, enhance microbial control, and promote better practices in food safety, ensuring positive benefits for public health.
9. (1:40-1:45) - Developing AI-Enhanced Statistical Process Control Tools for Controlling Salmonella In Poultry
Stasiewicz, Matthew, University of Illinois
Cecil Barnett-Neefs, Minho Kim, Erin Kealey, Brad Yang, Renato Orsi, Cristina Resendiz Moctezuma, Martin Wiedmann
Formal statistical process control (SPC) methods, like control charts, have not been widely applied to food safety. It may be possible to use SPC for controlling Salmonella in raw poultry because industry best practices for monitoring and preventative control include detection of this hazard directly in the product. This project adapted statistical process control charts to Salmonella testing and total plate counts (TPC, indicators of quality and sanitary dress) in commercial poultry processing data. Salmonella and TPC data provided by one large processor, for one facility, over about one year, were cleaned, adapted to account for assay limits of detection and quantification ranges, and then subset by processing stage. Salmonella and TPC data were plotted on run charts and evaluated for control chart metrics like mean (centerline), deviation (control limits), and the eight Nelson rules to identify special causes of variation. To identify additional metrics for SPC, classification trees were used to analyze whether 28 calculated data features were associated with Salmonella prevalence. SPC methods identified at least 5 clusters of special causes of variation in TPC data. Two of these clusters corresponded with obvious process changes, such as a switch in wash chemical formulation, suggesting potential for SPC to identify key process shifts. Similarly, the classification tree analyses identified one significant factor associated with higher Salmonella prevalence, time between bird harvest and cutting into parts, that may be a new target for control. Using SPC to detect trends in quality indicators and Salmonella contamination over time can reveal patterns that may be lost in the sheer volume of information, allowing for more specific corrective actions. This method can be supplemented with methods such as decision trees to identify additional external factors for control.
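The control-chart mechanics are simple to sketch. Below, a centerline and three-sigma limits are computed from a baseline window, and Nelson rule 1 (a single point beyond three sigma) flags a spike; the log TPC values are invented, and the project applies all eight Nelson rules plus limit-of-detection adjustments not shown here.

```python
import statistics

def control_limits(values, sigmas=3):
    # Centerline and +/- three-sigma control limits from a baseline window.
    center = statistics.mean(values)
    s = statistics.stdev(values)
    return center - sigmas * s, center, center + sigmas * s

def rule1_violations(values, lcl, ucl):
    # Nelson rule 1: any single point beyond three sigma of the centerline.
    return [i for i, v in enumerate(values) if v < lcl or v > ucl]

# Illustrative log10 TPC readings by processing day, with one spike appended
# (e.g. a sanitation failure or wash-chemical change).
baseline = [4.1, 4.3, 4.0, 4.2, 4.1, 4.4, 4.2, 4.0, 4.3, 4.1]
lcl, center, ucl = control_limits(baseline)
readings = baseline + [6.5]
flagged = rule1_violations(readings, lcl, ucl)
```

A flagged index tells the processor which day's data constitutes a special cause of variation worth a targeted corrective action, rather than a routine fluctuation.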
Lightning Session II
Session Date: Tuesday, November 19th
Session Time: 3:00pm-4:20pm
Session Location: Reveille
Session Moderator: Raymond Ansotegui
1. (3:00-3:05) - Exploring DeepVariant Space: Assessing the accuracy of custom-trained variant-calling models across species
Arnold, Haley, ARS
Sheina Sim
Variant calling has become so ubiquitous in genetic studies that it has become rare to find one that does not rely on accurate calls. While there have been many advancements in this process over the years, machine learning now enables even greater accuracy in variant calling. DeepVariant not only has a highly accurate built-in variant-calling model for whole-genome sequences, but also has the ability to custom-train variant-calling models for specific datasets. DeepVariant has been shown to improve the accuracy of variant calls made by other software through custom model training. Here, we explore the performance of custom models trained with DeepVariant with varying amounts of training data, parameters, and species, in order to assess the wider applicability of machine-learning-trained variant-calling models.
2. (3:05-3:10) - Using Machine Learning to Predict Planting Dates
Avila, Angela, University of Texas at Arlington
Jianzhong Su, Lina Castano-Duque, Gary Marek, Prasanna Gowda
This study presents an approach to estimating crop planting dates by integrating ground-based time-series Leaf Area Index (LAI) measurements with satellite images, using machine learning. Eighteen years of time-series ground-measured LAI data, collected in Bushland, Texas, are used to represent the pure growth of crops. These data are costly and time-consuming to collect. To unify each year of LAI growth over time, we use third-degree polynomials. To leverage a computationally fast model, neural networks, for associating crop time-series growth with planting dates, more data are necessary. Data augmentation is used to simulate possible LAI growth curves for training the neural network. Training also requires the output, planting dates, which necessitates theoretical planting dates for the theoretical curves. Features of the curves are extracted and used in multiple linear regression to predict these theoretical planting dates. In a time comparison, predicting 17,000 planting dates using feature extraction from growth curves would take approximately 30 minutes, compared to 1 second using a trained neural network. While the neural network model is based on pure LAI growth, satellite data are more practical and abundant. To link satellite information with LAI, we use Orthogonal Canonical Correlation Analysis (OCCA), which maps satellite data to LAI by finding optimal linear transformations that maximize the correlation between the two data views. The OCCA map combined with the trained neural network then allows us to estimate crop planting dates using purely time-series satellite images.
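The third-degree polynomial fit used to unify each year’s LAI series is ordinary least squares. A self-contained pure-Python sketch, with an invented noiseless LAI-like cubic (the day-of-season values and coefficients are illustrative, not the Bushland data), that recovers known coefficients via the normal equations:

```python
def polyfit3(ts, ys):
    """Least-squares cubic fit via the normal equations and Gaussian elimination."""
    # Moment matrix A[i][j] = sum(t^(i+j)) and right-hand side b[i] = sum(y * t^i).
    A = [[sum(t ** (i + j) for t in ts) for j in range(4)] for i in range(4)]
    b = [sum(y * t ** i for t, y in zip(ts, ys)) for i in range(4)]
    # Forward elimination with partial pivoting.
    for col in range(4):
        piv = max(range(col, 4), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, 4):
            f = A[r][col] / A[col][col]
            for c in range(col, 4):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution; coef[i] multiplies t**i.
    coef = [0.0] * 4
    for r in range(3, -1, -1):
        coef[r] = (b[r] - sum(A[r][c] * coef[c] for c in range(r + 1, 4))) / A[r][r]
    return coef

# Noiseless samples of a known cubic standing in for one season's LAI curve.
ts = [float(t) for t in range(10)]
ys = [0.5 + 0.2 * t + 0.03 * t ** 2 - 0.002 * t ** 3 for t in ts]
coef = polyfit3(ts, ys)
```

Once each season is summarized by four coefficients, curve features (peak timing, early-season slope, and so on) can be extracted and regressed against planting dates, as the abstract describes.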
3. (3:10-3:15) - Identification of plant-parasitic nematodes using deep learning tools
Waldo, Benjamin, ARS
Vikram Rangarajan and Fereshteh Shahoveisi
Plant-parasitic nematodes are an important threat to turfgrass. Left unmanaged, they can cause serious reductions in the quality and playability of golf greens. Nematode diagnostics depend on accurate identification of specimens extracted from soil, but the nematology expertise required to make reliable identifications is often limited in plant diagnostic laboratories. In this study, we evaluated the performance of the EfficientNetV2, MobileNetV3, ResNet101, and Swin Transformer V2 image classification model architectures for identification of plant-parasitic nematode images. Models were trained, tested, and validated on a dataset of 5,400 plant-parasitic nematode images across seven genera associated with turfgrass, captured using inverted and compound microscopes. Images were cropped, and augmentation was performed using image rotations and random brightness. Classification accuracy was highest for EfficientNetV2 and Swin Transformer V2 at 95% and lowest for ResNet101 at 86%. The current findings demonstrate the potential application of deep learning tools for accurate nematode identification to aid in diagnostics.
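The rotation and brightness augmentation described above is easy to sketch without a deep learning framework; a real pipeline would use a library such as torchvision, and the images here are tiny invented grayscale grids rather than microscope captures.

```python
import random

def rotate90(img):
    # 90-degree clockwise rotation: reverse the rows, then transpose.
    return [list(row) for row in zip(*img[::-1])]

def adjust_brightness(img, factor):
    # Scale pixel values, clipping to the 0-255 grayscale range.
    return [[min(255, max(0, round(p * factor))) for p in row] for row in img]

def augment(img, rng):
    # Random quarter-turn rotation plus random brightness in [0.8, 1.2].
    for _ in range(rng.randrange(4)):
        img = rotate90(img)
    return adjust_brightness(img, rng.uniform(0.8, 1.2))

img = [[10, 20], [30, 40]]
augmented = augment(img, random.Random(0))
```

Applying such transforms at training time multiplies the effective size of a 5,400-image dataset and discourages the classifier from keying on orientation or illumination quirks of a particular microscope.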
4. (3:15-3:20) - Landscape Ecological Site Group Mapping Using Gradient Boosted Learning Applied to Climate and Soil Data
Meles, Menberu, ARS
D. Phillip Guertin
An Ecological Site Group (ESG) is a framework used in rangeland and ecosystem management to classify similar Ecological Sites (ESs) based on their responses to land management, disturbances, conservation practices, and environmental changes. Traditionally, ESGs have been assigned by expert groups. This study demonstrates the use of machine learning, specifically the XGBoost algorithm, to predict ESGs, replacing the need for expert-assigned classifications. By utilizing key soil properties and climate data, XGBoost accurately predicted ESGs at selected NRI points within Major Land Resource Areas (MLRAs) 65 and 69, achieving 95% and 99% accuracy, respectively, and outperforming methods like decision trees and random forests. The model was further scaled to the landscape level using SSURGO soil map units, providing a broad, scalable, and data-driven alternative to traditional expert-based approaches.
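XGBoost itself is not reproduced here, but the residual-fitting principle behind gradient boosting can be shown in a few dozen lines of pure Python. The sketch below boosts regression stumps on an invented one-dimensional "climate gradient" separating two site groups; a real run would call the xgboost library on the soil and climate covariates named above.

```python
def best_stump(xs, ys):
    # Single threshold split on a 1-D input, minimizing squared error.
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def boost(xs, ys, rounds=50, lr=0.3):
    # Gradient boosting for squared error: each stump fits the current residuals.
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        resid = [y - p for y, p in zip(ys, pred)]
        s = best_stump(xs, resid)
        stumps.append(s)
        pred = [p + lr * s(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

# Toy climate covariate with group 0 at low values and group 1 at high values.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
model = boost(xs, ys)
```

Thresholding the boosted score at 0.5 recovers the two toy groups; XGBoost adds regularization, multi-feature trees, and multiclass objectives on top of this same loop.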
5. (3:20-3:25) - Community-Led Assessments Using Remote Sensing: Monitoring the Impacts of Climate Change on Cloudberry (Rubus chamaemorus)
Kassama, Sire, ARS
Claire Friedrichsen, Sean Gleason, Lynn Marie Church, Grace Hunter, Bryan Jones Jr., Jaqueline Cleveland, Katie Pisarello, Warren Jones
Rural Indigenous Alaskan villages have long wrestled with food insecurity. During early colonization, restrictions on subsistence food practices strained local consumption patterns, and today, climate instability continues to threaten Indigenous food production systems. Rising sea levels, coastal erosion, decreased winter snowpack, and increased rates of permafrost thaw threaten the growing range of cloudberry and increase variability in its harvests. Cloudberry (Rubus chamaemorus) is a vital subsistence plant and a significant source of fiber, vitamin C, and other micronutrients. The purpose of this research is to integrate oral histories of Indigenous community members with unmanned aerial system (UAS) imagery to monitor and predict future cloudberry harvests. Red-green-blue (RGB) UAS surveys were conducted in a small Indigenous village, Quinhagak, which has been particularly impacted by climate change in the last decade and has strong research capacity through Nalaquq, a Yup’ik-owned 14(h) ANCSA corporation subsidiary. Nalaquq has spearheaded efforts to train community members in UAS use for search and rescue; community members therefore have extensive training in UAS data collection methods, skills that are transferable to other tasks. This empowers the community to independently contribute to monitoring the landscape. In the summer of 2024, approximately 25 ethnographic interviews were conducted to identify major berry-picking areas, attitudes toward climate change, how climate change has altered berry harvests, and values associated with managing cloudberry. UAS surveys were designed to include areas with variable elevation and proximity to major waterways to capture different landscapes and differing berry loads. Field surveys were conducted to estimate berry size and weight per patch to assist in developing a harvest index.
Together, these data streams will inform predictive yield models to help determine approximate yields for berry patches accessible to the community. Under changing climatic conditions in the region, this study provides a framework to support and sustain important Indigenous knowledge and practices associated with subsistence harvesting.
6. (3:25-3:30) - Future Directions in Generative Social Science
Gillette, Shana, APHIS
Generative Artificial Intelligence tools such as ChatGPT have the potential to increase understanding of human behavior and social phenomena by supporting new approaches for measurement and interaction. For example, AI assistants can support survey administration by monitoring for bots and assigning individuals to groups. AI interviewers can interact with survey respondents, formulating supplemental questions and designing treatments based on responses in real time. AI analysts can generate personalized feedback for survey respondents and analyze respondent feedback within a broader social context. AI tools, such as Vegapunk, can be embedded in survey software to interact with survey participants. AI also has a potential role in training, interventions, and research trials. Researchers are exploring the use of AI tools in evaluating and customizing training, as well as the application of AI to Just-in-Time Interventions, Multimodal Adaptive Interventions, and Micro-Randomized Trials. However, the black-box nature of existing generative AI technologies makes it difficult to assess and replicate results. Therefore, open-source and transparent approaches are needed to ensure credible and reliable AI-driven social science.
7. (3:30-3:35) - CameraTrapDetector: detect, classify, and count animals in camera trap images using deep learning
Burns, Amira, ARS
Ryan S. Miller, Hailey Wilmer
Camera traps are a widespread, non-invasive, cost-effective method for monitoring animal populations, and the researchers who use them span diverse disciplines and geographies. The time and labor required to manually classify the potentially millions of images generated by a single camera array presents a significant challenge. Reducing this burden facilitates larger, longer-lasting camera trap arrays, resulting in more comprehensive analyses and better decision frameworks. To address this challenge, a multi-agency USDA team has developed CameraTrapDetector, a free, open-source tool that deploys computer vision models at the class, family, and species (Nclasses=63, mAP(50-95)=0.878, F1=0.919) taxonomic levels to detect, classify, and count animals in camera trap images. The tool is available as an R package with an R Shiny interface, a desktop application, or a command-line Python script for easy integration into many analytical pipelines. The tool lets users retain complete data privacy, and the developers maintain a transdisciplinary, multi-institutional working group of camera trap researchers to advance best practices. An iterative training cycle uses state-of-the-art computer vision approaches, adds new images from project partners to train new models, and incorporates user feedback and goals into the tool’s development. A primary goal, and challenge, for the models is generalization to out-of-site images; results are less accurate and more variable than metrics for (unseen) in-site test images. Results on test data (Nclasses=12) show major improvements in generalization from the version 2 model (mAR = 0.195, range 0.07-0.98) to the version 3 model (mAR = 0.606, range 0.05-1.00). Faster, more accurate, more generalizable models allow CameraTrapDetector users to turn raw images into quantifiable data for answering questions about animal presence, population size, and movement.
Our open-source pipeline can also be leveraged to train species-specific computer vision models to answer questions about animal behavior or disease detection. By automating image processing, CameraTrapDetector speeds up research and redirects critical human resources to more analytical tasks.
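The detection metrics quoted above (mAP, F1) rest on two standard computations: intersection-over-union (IoU) between predicted and ground-truth boxes, and the harmonic mean of precision and recall. This is a generic sketch of those computations, not CameraTrapDetector code.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def f1(tp, fp, fn):
    """F1: harmonic mean of precision and recall, from match counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# A predicted box counts as a true positive when its IoU with a ground-truth
# box exceeds a threshold; mAP(50-95) averages precision over IoU thresholds
# from 0.5 to 0.95.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))
```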
Poster Displays
Date: Wednesday November 20th
Time: 7:00am-10:30am
Location: Hall of Champions at Kyle Field
Digital posters can be found with the corresponding abstracts for:
- Lightning Session I
- Lightning Session II
Trainings
Introduction to Machine Learning for Science
Course Date: Wednesday November 20th
Course Time: 1:00pm-4:00pm
Course Location: Hullabaloo
Course Instructors: ARS SCINet Office
Machine learning underlies the vast majority of modern AI methods, including the ever-expanding applications of deep learning and generative AI. This workshop will give participants a hands-on introduction to the basic concepts and techniques needed to understand machine learning and to apply machine learning methods to scientific research. Participants will learn how to train, evaluate, and use a variety of machine learning models for data analysis tasks. This session will also help participants critically evaluate the use and application of machine learning in science.
Prerequisites: Familiarity with basic Python concepts and Jupyter notebooks. We will offer virtual training for these skills before the Forum begins.
AI Project and Product Management
Course Date: Wednesday November 20th
Course Time: 1:00pm-4:00pm
Course Location: Reveille
Course Instructors: Nick Pallotta, PMP, NASS, Head of Workforce Performance and Staff Development
Effective project management is crucial for the successful implementation of AI initiatives. This workshop provides a framework for managing AI projects from inception to completion, integrating project management methodologies with the unique challenges and opportunities AI projects present. Attendees will explore the CRISP-DM and OSEMN frameworks, as well as key challenges unique to AI projects, such as defining a performance-based project scope, building a successful team, and planning model support for long-term success. This workshop is ideal for AI project managers and business leaders looking to guide technical resources toward successful implementation of AI.
Data Preparation and Quality Assessment in Genome Assembly and Annotation
Course Date: Wednesday November 20th
Course Time: 1:00pm-4:00pm
Course Location: Traditions
Course Instructors: Genome Informatics Facility at Iowa State University
In this workshop, participants will explore techniques for evaluating the accuracy and completeness of genome assemblies and annotations, helping attendees understand key metrics and statistical methods used to assess the quality of genomic data. Knowing how to evaluate a genome will ensure reliable results for downstream, AI-based analyses like gene model prediction, variant detection, or comparative studies. Participants will also learn how to extract the transcripts and proteins from their genomes, to enable a variety of downstream AI-based applications, such as protein structure prediction. By the end of the workshop, attendees will be better equipped with the practical skills necessary to evaluate genomes and annotations for a range of bioinformatics applications.
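One contiguity metric commonly used in assembly assessment is N50, the length L such that contigs of length at least L cover at least half of the total assembly. A minimal, generic sketch (the contig lengths below are hypothetical):

```python
def n50(contig_lengths):
    """N50: smallest length L such that contigs >= L span at least half
    of the total assembly length (a standard contiguity metric)."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length

# Hypothetical contig lengths in bp:
print(n50([100, 200, 300, 400, 500]))  # 400
```

Higher N50 generally indicates a more contiguous assembly, though it says nothing about correctness, which is why completeness and accuracy metrics are assessed alongside it.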
Prerequisites: Familiarity with basic command-line concepts. We will offer virtual training for these skills before the Forum begins.
Predicting functional roles of proteins using AI-driven bioinformatics tools
Course Date: Thursday November 21st
Course Time: 9:00am-12:00pm, 1:30pm-4:30pm
Course Location: Corps
Course Instructors: Genome Informatics Facility at Iowa State University
In this hands-on workshop, participants will learn how to predict the functional roles of proteins by analyzing their sequence data using state-of-the-art bioinformatics tools powered by AI. The focus will be on understanding how AI-based methods are applied to predict protein characteristics and other downstream uses for gene annotations. Two such examples will be predicting signal peptides (indicators of protein secretion) and subcellular localization (where the protein operates in the cell). Participants will use sample datasets to explore how computational models can interpret protein sequences and provide insights into their biological roles. By the end of the session, attendees will have the knowledge and skills to functionally annotate proteins in any gene annotation.
Prerequisites: Familiarity with basic command-line concepts. We will offer virtual training for these skills before the Forum begins.
From reads to variants: a pipeline for variant calling using DeepVariant
Course Date: Thursday November 21st
Course Time: 9:00am-12:00pm
Course Location: Ross
Course Instructors: ARS scientists Sheina Sim and Craig Carlson
DeepVariant is a DNA sequence variant caller that uses a convolutional neural network (CNN) to call genotypes relative to a reference genome assembly. In this workshop, we will discuss a workflow for calling variants from whole-genome data for multiple individuals. This workflow involves trimming and filtering raw reads, mapping them to a reference assembly, calling variants for each individual, merging the variants of all individuals into a single variant call format file (.vcf), and filtering the resulting variant file. We will guide participants through this pipeline step by step, providing generalized commands for each phase of the process, as well as strategies for optimizing cluster usage and reducing compute time. The final product will be a .vcf containing variants for all individuals, which can be used for downstream analyses, along with a solid understanding of how to perform variant detection using DeepVariant.
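As a generic illustration of the final filtering step, the sketch below keeps VCF records that meet a minimum QUAL threshold. Field positions follow the VCF specification; the records are hypothetical, and real pipelines typically use dedicated tools (e.g., bcftools) rather than hand-rolled parsers.

```python
def filter_vcf(lines, min_qual=20.0):
    """Yield VCF header lines unchanged and records passing a QUAL filter.

    QUAL is the 6th tab-separated column of a VCF record; '.' means
    missing, which we treat as failing the filter here.
    """
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        qual = line.split("\t")[5]
        if qual != "." and float(qual) >= min_qual:
            yield line

# Hypothetical records for illustration:
records = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t55.0\tPASS\t.",
    "chr1\t200\t.\tC\tT\t5.0\tPASS\t.",
]
kept = list(filter_vcf(records))  # headers plus the QUAL=55.0 record
```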
Prerequisites: Familiarity with basic command-line concepts. We will offer virtual training for these skills before the Forum begins. Understanding of genomic sequencing. General optimism.
Protein Structure Prediction, Search, and Analysis with AI
Course Date: Thursday November 21st
Course Time: 1:30pm-4:30pm
Course Location: Ross
Course Instructors: ARS scientists Hye-Seon Kim and Carson Andorf
In this workshop, participants will learn how to use cutting-edge, AI-based tools for analyzing protein structure and function. The workshop will start by exploring 3D protein structure prediction using AlphaFold for alignment-based structure prediction and ESMFold for single-sequence structure prediction. Participants will then learn how to use FoldSeek for structure-based protein similarity search. The last part of the workshop will bring all of these concepts together by using PanEffect to explore how genetic variations in protein sequence can influence an organism’s phenotype.
Prerequisites: Familiarity with basic command-line concepts. We will offer virtual training for these skills before the Forum begins.
Computer Vision: Introduction and Image Classification
Course Date: Thursday November 21st
Course Time: 9:00am-12:00pm
Course Location: Hullabaloo
Course Instructors: ARS SCINet Office
This workshop will teach participants the concepts and tools they need to begin applying modern, deep learning-based computer vision methods to their own scientific research. This will be an interactive, hands-on workshop that will offer plenty of opportunities for practice and experiential learning. By the end of the session, participants will have trained and evaluated a state-of-the-art image classification model on a custom image dataset.
Prerequisites: Familiarity with basic machine learning concepts. The workshop on November 20 will provide this background, if needed. Familiarity with basic Python concepts and Jupyter notebooks. We will offer virtual training for these skills before the Forum begins.
Computer Vision: Object Detection and Instance Segmentation
Course Date: Thursday November 21st
Course Time: 1:30pm - 4:30pm
Course Location: Hullabaloo
Course Instructors: ARS SCINet Office
In this workshop, participants will learn the key concepts and techniques needed to use modern, deep learning-based computer vision methods for object detection and instance segmentation. Learners will practice training and evaluating state-of-the-art computer vision models on custom image datasets. This workshop is intended as a continuation of “Computer Vision: Introduction and Image Classification”, but participants do not need to take the earlier workshop if they already have a basic knowledge of machine learning and computer vision concepts.
Prerequisites: Familiarity with basic machine learning concepts. The workshop on November 20 will provide this background, if needed. Familiarity with basic computer vision concepts (e.g., an understanding of how image data are structured in computer memory). The morning computer vision workshop will provide this background. Familiarity with basic Python concepts and Jupyter notebooks. We will offer virtual training for these skills before the Forum begins.
Data Management Planning for AI Projects
Course Date: Thursday November 21st
Course Time: 9:00am-11:00am
Course Location: Reveille
Course Instructors: NAL
This workshop will help participants learn how to address AI-related research data (e.g., training datasets) in Data Management Plans (DMPs). There will be a brief presentation on DMPs by the National Agricultural Library (NAL) followed by examples of DMPs for research involving AI and discussions on common challenges and solutions for developing DMPs.
Spatial Modeling with Machine Learning
Course Date: Thursday November 21st
Course Time: 1:30pm-4:30pm
Course Location: Reveille
Course Instructors: ARS SCINet Office
This workshop will explore examples of spatial modeling tasks (e.g., spatial interpolation from point data to gridded data) with machine learning methods. The content of the session will primarily focus on the spatial component (e.g., how to include spatial proximity as a predictor), although machine learning concepts will be discussed as relevant. The goals of this session are to 1) introduce key concepts about incorporating spatial data in machine learning and 2) provide examples in Python of how to manipulate spatial datasets for use in machine learning functions, compare the performance of machine learning approaches for spatial prediction, and visualize observed spatial data and the prediction results.
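As a baseline for the point-to-grid interpolation task mentioned above, inverse-distance weighting (IDW) is one of the simplest methods: nearby observations get larger weights. Machine learning approaches effectively replace this fixed weighting with a learned model. A generic stdlib sketch with hypothetical observations:

```python
import math

def idw(points, x, y, power=2.0):
    """Inverse-distance-weighted estimate at (x, y) from
    (px, py, value) observations: weight = 1 / distance**power."""
    num = den = 0.0
    for px, py, value in points:
        d = math.hypot(x - px, y - py)
        if d == 0:
            return value  # query point coincides with an observation
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

# Two hypothetical observations; the midpoint gets the average value.
obs = [(0, 0, 10.0), (1, 0, 20.0)]
print(idw(obs, 0.5, 0.0))  # 15.0
```

Evaluating a grid of (x, y) query points against held-out observations is the same comparison workflow used to benchmark machine learning interpolators.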
Prerequisites: Familiarity with basic machine learning concepts. The workshop on November 20 will provide this background, if needed. Familiarity with basic Python concepts and Jupyter notebooks. We will offer virtual training for these skills before the Forum begins.