WEBVTT

1
00:00:04.660 --> 00:00:17.940
E3410 x8539 Conf Room: Good morning, everyone. My name's Simon Malcolm. I'm the Deputy Assistant Director for Biological Sciences here at the National Science Foundation. It's my pleasure to welcome you all to this morning's

2
00:00:17.940 --> 00:00:35.630
E3410 x8539 Conf Room: BIO Distinguished Lecture. The NSF Biological Sciences Directorate supports cutting-edge biological research that spans different

3
00:00:36.651 --> 00:00:56.659
E3410 x8539 Conf Room: temporal and geographical scales, and it also supports the physical and human infrastructure needed to conduct that research. The aim of the Biological Sciences Distinguished Lecture series is to bring in researchers who have cutting-edge research to share with us.

4
00:00:56.660 --> 00:01:25.769
E3410 x8539 Conf Room: And often this research is not completely aligned with the biological sciences as we have them today, recognizing that a lot of the research that happens in other disciplines can also play a role in biology. So it's my pleasure to welcome today's lecture, which is presented by the Division of Molecular and Cellular Biosciences, and we're going to have two lectures today. We have Dr. Scott Jackson and Dr. Ethan Pickering,

5
00:01:26.151 --> 00:01:55.530
E3410 x8539 Conf Room: both of whom are at Bayer Crop Science. And did you have a celebration when Bayer Leverkusen won the Bundesliga, or not? Okay, good — I'm pleased you celebrated that. So Dr. Jackson is the Genetic Pipeline Design Lead at Bayer, where he leads a team of researchers who work to design optimal crop improvement strategies. He holds an M.S. and a Ph.D. from the University of Wisconsin-Madison,

6
00:01:55.690 --> 00:02:16.910
E3410 x8539 Conf Room: and conducted a postdoctoral fellowship at the University of Minnesota. Prior to joining Bayer, he held faculty positions at Purdue University and then the University of Georgia, leading the Center for Applied Genetic Technologies at UGA. His research has focused on understanding the evolutionary history of plant genomes, allowing us to better engineer crops for the future,

7
00:02:16.910 --> 00:02:27.059
E3410 x8539 Conf Room: and this work has been funded by several NSF awards, not surprisingly, including those focused on soybean, rice, and peanut. Peanuts — always a good idea when you're in Georgia.

8
00:02:27.060 --> 00:02:44.150
E3410 x8539 Conf Room: Dr. Ethan Pickering leads Bayer's AI genomics modeling team, where his work focuses on building novel AI models, architectures, and other tools that overcome challenges in crop genomics. He is a lecturer at MIT, where he previously was a postdoc,

9
00:02:44.465 --> 00:03:02.469
E3410 x8539 Conf Room: and he also holds degrees from Case Western Reserve University and Caltech. So, as I mentioned, research in biology can be advanced by tools developed in other disciplines, and one such tool — which we're hearing a lot about these days, of course — is the focus of today's lecture: artificial intelligence, or AI.

10
00:03:02.822 --> 00:03:15.520
E3410 x8539 Conf Room: So today Scott and Ethan are going to discuss how they combine the fields they cover — AI and biology — to advance plant and animal breeding, and how the two fields can work in concert.

11
00:03:15.660 --> 00:03:29.510
E3410 x8539 Conf Room: During the talk they will weave in their career journeys and work that highlights the ways in which AI and biology can work together. So it's my pleasure to welcome Scott and Ethan to NSF.

12
00:03:29.780 --> 00:03:31.430
E3410 x8539 Conf Room: Okay, thank you.

13
00:03:32.560 --> 00:03:33.420
E3410 x8539 Conf Room: [inaudible]

14
00:03:34.050 --> 00:03:44.569
E3410 x8539 Conf Room: Perfect. Great to be back here. I guess the one thing that was not in my bio — and it's actually not in my bio slide either — is that I did a couple of stints here at NSF as a rotating program officer.

15
00:03:44.660 --> 00:03:46.969
E3410 x8539 Conf Room: Some of the people in the room were

16
00:03:47.240 --> 00:03:54.203
E3410 x8539 Conf Room: here when I did those they were in Arlington at the old place, for security is much easier. But just

17
00:03:56.750 --> 00:04:13.409
E3410 x8539 Conf Room: there's some message at the top there. Maybe just a quick note: I spent 19 years in academia and the last 5 years in industry, and one thing I learned about industry is you start almost every talk or presentation with a bio slide — your journey and how you got there.

18
00:04:13.820 --> 00:04:23.879
E3410 x8539 Conf Room: And so I have this up, and I didn't have the soccer — Bayer Leverkusen, which just won, for the first time in their 100-plus-year history, the German championship in football, slash, soccer,

19
00:04:24.100 --> 00:04:27.040
E3410 x8539 Conf Room: which is really cool. So

20
00:04:27.310 --> 00:04:38.349
E3410 x8539 Conf Room: I did my graduate work at Wisconsin, worked on potato and a number of other things, working on chromosome biology, cytogenetics, but within a plant breeding program. So my Ph.D. is in plant breeding, but I did a lot of work on chromosome biology,

21
00:04:38.830 --> 00:04:44.630
E3410 x8539 Conf Room: did a postdoc at Minnesota — I've got the footballs, er, the mascots for all the places I've been up there —

22
00:04:44.920 --> 00:04:47.520
E3410 x8539 Conf Room: so I was a Gopher for 2 years.

23
00:04:48.084 --> 00:04:58.609
E3410 x8539 Conf Room: For those that haven't lived in Minneapolis, it's cold in the winter — very cold. And up there I worked on wild rice, and that started

24
00:04:58.650 --> 00:05:06.420
E3410 x8539 Conf Room: part of my future lab that worked on rice species — wild relatives of cultivated rice — and how you might use information from those wild relatives to improve rice.

25
00:05:07.310 --> 00:05:14.519
E3410 x8539 Conf Room: I got a faculty position at Purdue in 2001, and started the exact same day that Cliff Weil started on the faculty at Purdue.

26
00:05:15.177 --> 00:05:17.440
E3410 x8539 Conf Room: I know he looks much older.

27
00:05:17.867 --> 00:05:29.960
E3410 x8539 Conf Room: And they hired me to work on soybean. The only thing I knew about soybean when I went there was that when you drove down the highway in the Midwest, the tall stuff was corn and the short stuff was probably soybean — and that's literally all I knew about soybean.

28
00:05:30.100 --> 00:05:36.199
E3410 x8539 Conf Room: And they took a chance and hired me, and I spent the next 15-plus years working on soybean and other legumes.

29
00:05:36.370 --> 00:05:37.600
E3410 x8539 Conf Room: And

30
00:05:38.790 --> 00:05:52.650
E3410 x8539 Conf Room: One interesting aspect: this was the early days of sequencing, and we were part of the group that helped sequence soybean and some other legumes. And it sort of ties into some discussions we've been having this morning around workforce training in machine learning and AI, and bringing that to bear on biological questions and problems.

31
00:05:52.780 --> 00:05:55.010
E3410 x8539 Conf Room: As we were generating all this genomic data —

32
00:05:55.370 --> 00:06:06.280
E3410 x8539 Conf Room: biologists were generating it but didn't know what to do with it. How do you get the mathematicians and computer scientists and data scientists to have an interest in these biological problems? So we went through the same learning curve, but 20 years ago.

33
00:06:07.100 --> 00:06:14.850
E3410 x8539 Conf Room: I went to Georgia, became a Bulldog in 2011, after a year here — my first stint as a rotating program officer —

34
00:06:16.093 --> 00:06:25.598
E3410 x8539 Conf Room: and finally I was at a place that won a national championship with their football team. And I worked on peanut, another legume — and there's the mascot there, Uga.

35
00:06:26.430 --> 00:06:36.690
E3410 x8539 Conf Room: I joined Bayer 5 years ago this August, and I think my bio is a little bit off — I actually lead the North America soybean and cotton pipelines now, as of a year and a half ago.

36
00:06:37.150 --> 00:06:49.109
E3410 x8539 Conf Room: So on the R&D scale I'm more on the development side of it now, and I spend a lot of time with commercial partners and the growers, talking about our products, what it is they need, and how we're going to deliver those from a genetic perspective.

37
00:06:49.680 --> 00:07:10.650
E3410 x8539 Conf Room: So that's my academic career: one slide, 19 years, a bunch of students and postdocs — not all of them. My training is in plant breeding; my passion was chromosome biology. And as you start sequencing genomes, you're able to tie the sequence to understanding how chromosomes behave, how they function, what the structure is.

38
00:07:10.660 --> 00:07:16.440
E3410 x8539 Conf Room: And then one of my other passions is polyploidy, which is prevalent in plants. So we did a lot of sequencing of polyploid plants,

39
00:07:16.480 --> 00:07:30.152
E3410 x8539 Conf Room: looking at the structural fates of polyploid genes — the fates of genes in polyploids — and other aspects of how polyploids evolve, and then trying to use that information to understand how to improve crops in a more efficient way.

40
00:07:33.170 --> 00:07:36.110
E3410 x8539 Conf Room: I moved to Bayer in 2019.

41
00:07:36.810 --> 00:07:42.509
E3410 x8539 Conf Room: And this is getting to the purpose of the talk today. So, to talk about

42
00:07:43.890 --> 00:07:54.240
E3410 x8539 Conf Room: my background: I'd been in plant breeding and genomics for a number of years before being hired at Bayer. When I was first hired at Bayer, I was leading a group in R&D focused on, how do we use genomic information?

43
00:07:54.590 --> 00:08:03.430
E3410 x8539 Conf Room: Where do we generate that genomic information, and how do we use it more efficiently? What tools do we put on top of that to make better decisions in breeding pipelines, to get the products to our growers that they want?

44
00:08:04.260 --> 00:08:13.730
E3410 x8539 Conf Room: And I very quickly realized the scale, the scope, the pace of everything happening in industry is dramatically, dramatically different than academia.

45
00:08:14.216 --> 00:08:29.909
E3410 x8539 Conf Room: When you think about a genetic experiment in academia, it's 3 reps, 3 locations, 3 years. We don't even talk about those numbers at all — we're talking 60 to 80 reps a year and tens of thousands of genetic entities within those reps,

46
00:08:30.450 --> 00:08:41.730
E3410 x8539 Conf Room: and having genetic information on all of those. And so what we have is a massive pipeline pushing hundreds of thousands of progeny through on an annual basis,

47
00:08:41.870 --> 00:08:44.530
E3410 x8539 Conf Room: in in very, in various steps of that pipeline.

48
00:08:44.720 --> 00:08:45.850
E3410 x8539 Conf Room: And if you

49
00:08:46.310 --> 00:08:56.360
E3410 x8539 Conf Room: think about breeding, basically, it's a large funding. You create a bunch of progeny. You take them, take through various cycles of testing to get down to the very few that you want at the end. That's sort of like looking for a needle in the haystack.

50
00:08:56.840 --> 00:09:11.110
E3410 x8539 Conf Room: You create a huge pile of hay, and you want to find that one winner. So you spend the next 10 years after you create this huge pile trying to figure out which of the hundreds of thousands you created is going to be the one that becomes a successful variety or hybrid.

51
00:09:11.820 --> 00:09:18.830
E3410 x8539 Conf Room: And we generate a lot of data along the way — genotyping things, sequencing things, collecting phenotypic data.

52
00:09:18.870 --> 00:09:23.959
E3410 x8539 Conf Room: And you can begin to build automation and tools around that to connect things together and be able to

53
00:09:24.290 --> 00:09:32.140
E3410 x8539 Conf Room: impute genetic information, infer what the phenotype might be based on relatives and prodding your grandparents of that that entity.

54
00:09:32.480 --> 00:09:39.960
E3410 x8539 Conf Room: And so we built a lot of resources; we hired a lot of data scientists and computer scientists to help build this infrastructure and these models, to tie these things together.

55
00:09:40.200 --> 00:09:42.939
E3410 x8539 Conf Room: But at the end of the day we're still looking for that needle.

56
00:09:43.070 --> 00:09:55.410
E3410 x8539 Conf Room: And so we get a little bit more efficient at using these things to find the needle, but we're still making hundreds and hundreds of thousands of progeny, genotyping them, testing them, trying to get down to those few needles that we want to move forward.

57
00:09:56.740 --> 00:10:06.769
E3410 x8539 Conf Room: So maybe just on this slide here: that thing that looks like a cross-section of a brain is actually a representation of our maize germplasm based on genetic information,

58
00:10:06.990 --> 00:10:15.010
E3410 x8539 Conf Room: and it looks like 2 lobes of a brand. Those are the male and female hedonic pools. So we create hybrids. And those are the 2 pools that we breed with them.

59
00:10:17.020 --> 00:10:27.299
E3410 x8539 Conf Room: So as you can imagine, over the past 20 years, as we built and scaled this infrastructure to try and find these needles in this massive number of plant entities that we generate,

60
00:10:27.470 --> 00:10:36.279
E3410 x8539 Conf Room: we create a lot of automation to to get them to collect the data. We need everything from the the genetics all the way down to how they perform the field.

61
00:10:37.350 --> 00:10:49.180
E3410 x8539 Conf Room: And so we have centers where the seeds are sent and the seeds are chipped — they take a small section out of a seed and genotype that section — and then we move that seed forward either into

62
00:10:49.350 --> 00:10:54.760
E3410 x8539 Conf Room: a waste can, because we don't want to plant it, or we put in a greenhouse or field based on the genetic information that we get from that chip.

63
00:10:55.080 --> 00:10:59.049
E3410 x8539 Conf Room: And this is all automated in central lab facilities.

64
00:11:00.880 --> 00:11:21.199
E3410 x8539 Conf Room: Once we go from the millions of seeds that we chip annually and get genetic information on, to knowing which of the hundreds of thousands we'll actually plant, those get sent to a central packaging facility, which looks a lot like an Amazon warehouse — conveyor belts, automation. These things come in, and they get packaged into what we call sets.

65
00:11:21.440 --> 00:11:26.219
E3410 x8539 Conf Room: The cassettes can get sent out to to centers planning centers around the world.

66
00:11:26.760 --> 00:11:38.300
E3410 x8539 Conf Room: and they're planted in the fields. Bottleneck assessment. We know where every plot, every seat, you know, genotype of everything in the in that, in the in the, in, the, in that field? And we know where it is geographically geocated.

67
00:11:39.090 --> 00:11:55.070
E3410 x8539 Conf Room: And then we collect data throughout the season. How does it perform? How does it perform under stress? How does it perform with various disease pressures? We fly UAVs, or drones, to collect that. When does it flower? When does it mature? When is it setting seed? All these other things — we collect all this data.

68
00:11:55.920 --> 00:12:07.769
E3410 x8539 Conf Room: So we start with millions of projects with genotype plant hundreds of thousands to start collecting scientific information. And over the next 7, 8, 9 years. When are those hundreds of thousands down to the 10 or 20 that we're gonna move towards commercial products.

69
00:12:08.750 --> 00:12:10.290
E3410 x8539 Conf Room: It's an expensive process.

70
00:12:10.390 --> 00:12:12.789
E3410 x8539 Conf Room: Generate lots and lots and lots of data.

71
00:12:13.590 --> 00:12:18.130
E3410 x8539 Conf Room: Lot of this is automated within within large greenhouses. So the one here in Marana

72
00:12:18.270 --> 00:12:23.319
E3410 x8539 Conf Room: so 5 or 10 acres I can't remember. It's 10 acres under glass, 10

73
00:12:24.227 --> 00:12:35.619
E3410 x8539 Conf Room: all automated to to to to to start cycling populations more rapidly to move the genetics of a population doing multiple cycles per year rather than one cycle per year and planting in the field.

74
00:12:35.975 --> 00:12:40.090
E3410 x8539 Conf Room: So we can move the genetics of a population more quickly and then move them out into the field for testing.

75
00:12:42.020 --> 00:12:53.179
E3410 x8539 Conf Room: So if you think about breeding over time, going back to domestication thousands years ago, where people are picking things that didn't, for the seats didn't fall on the ground. So we got not shattering. Those are sort of major changes

76
00:12:53.430 --> 00:13:03.690
E3410 x8539 Conf Room: to breeding. In the early 19 hundreds we started applying statistical models. Hybrid seed was first developed in 1920, 1930. And commercially, in 1940, s. 1950, S.

77
00:13:04.330 --> 00:13:06.990
E3410 x8539 Conf Room: We started applying

78
00:13:09.030 --> 00:13:19.480
E3410 x8539 Conf Room: modern harvesting tools, catching yield as they come up to Harvester. We start doing local markers in the the nineties, and really full full blast in the 2,000.

79
00:13:20.050 --> 00:13:23.489
E3410 x8539 Conf Room: And those are sort of like evolutions. And how we've done plant improvement.

80
00:13:24.510 --> 00:13:26.640
E3410 x8539 Conf Room: At Bayer Monsanto

81
00:13:27.190 --> 00:13:30.530
E3410 x8539 Conf Room: Bear bought Monstano 5 years ago since fair.

82
00:13:31.040 --> 00:13:42.449
E3410 x8539 Conf Room: but they they sort of break it into breeding 1.0 2.0 and 3.1 point 0 is just they acquired a lot of genetics and germ plasminc companies to get the genetics. Get that? Get those tools to start creating those those winning varieties.

83
00:13:43.150 --> 00:13:52.980
E3410 x8539 Conf Room: Reading 2.0 and 3 point are really, really about increasing the precision. So with knowing where you're planting things predicting where you want to plant them, based on what the expected performances

84
00:13:53.360 --> 00:14:01.900
E3410 x8539 Conf Room: breeding 3.0 has really around the digital enablement. So all the automation around c chipping, getting genetic information on all the millions of progenies at the very beginning

85
00:14:01.910 --> 00:14:05.680
E3410 x8539 Conf Room: to know which one the which ones you want to plant in those initial stages of testing

86
00:14:06.660 --> 00:14:15.080
E3410 x8539 Conf Room: and what we're the phase we're in now. And this is where we're Ethan's gonna take over here in a minute is really thinking more about design.

87
00:14:15.160 --> 00:14:29.509
E3410 x8539 Conf Room: So can we flip this breeding strategy from creating millions of progeny trying to get down to those 10? They're gonna be the winners. Can we think more intentionally about how we create those populations at the beginning, knowing what our growers need? And can we design the genetics more intentionally

88
00:14:29.570 --> 00:14:33.190
E3410 x8539 Conf Room: using modern tools? All the data that we've generated over the past 10 years

89
00:14:33.220 --> 00:14:37.719
E3410 x8539 Conf Room: to note to more, to to create the the the chances

90
00:14:37.890 --> 00:14:42.402
E3410 x8539 Conf Room: and reduce the haystack to get those needles that they're gonna be those winners in in the growers fields.

91
00:14:43.362 --> 00:14:47.797
E3410 x8539 Conf Room: So with that, I'm gonna turn over to Ethan.

92
00:14:49.294 --> 00:14:57.205
E3410 x8539 Conf Room: Alright. Thanks for really excited to be here. Written number of, you know, Nsf proposals and things like that. And seeing this

93
00:14:57.520 --> 00:15:06.962
E3410 x8539 Conf Room: all over the place, and having the opportunity to actually go talking and stuff so nice. Oh, thanks, it's gonna help a lot. So

94
00:15:07.500 --> 00:15:31.920
E3410 x8539 Conf Room: let's say, I think we have a couple of slides to push through here. I just wanna quickly acknowledge. So I get to lead an AI genomics research team right now there, and a number of different Phd, researchers who've done phenomenal work of last year, too. Just wanna make sure I mentioned them Bobby and Katie Alexis Katiana, shiny but kouchering so a little bit about myself since we always have these timelines and

95
00:15:33.025 --> 00:15:45.030
E3410 x8539 Conf Room: scott gave a little bit of background, so I'll do it as well. Even though we look the same age. Mine's a lot more abbreviated. In time, and let me move something here real fast.

96
00:15:45.030 --> 00:16:09.310
E3410 x8539 Conf Room: So so my youth was actually in in agriculture. So I grew up on a vegetable farm in Ohio, and we're primarily growing sweetcorn. Really enjoyed it a lot. But I started to recognize biology was for very unpredictable, complex, and going all over in different directions. But some of the machines that we were using and kind of the engineering that was around, agriculture

97
00:16:09.310 --> 00:16:34.050
E3410 x8539 Conf Room: was much more predictable, and something that you could fix it really fix a broken implement or something, whereas with the biology had a little bit more luck. So I decided to take a route into engineering at Case Western. And then calc mit. And this was really looking at, being much more based in math and physics and calculus to explain the physics. And then AI, to explain some of the other

98
00:16:34.050 --> 00:16:45.019
E3410 x8539 Conf Room: components. And this was all about trying to learn. How are we going to be able to predict something, build some model to predict an outcome, and if you can bridge, predict it.

99
00:16:45.090 --> 00:17:10.160
E3410 x8539 Conf Room: then you can start designing for very intentionally and then I had the unpredictable move that got a call one day about a position at Bayer, and whether or not I'd be interested, starting to go back into these messy biological complex problems that are not so predictable. So it's been a very uncomfortable jump into the unpredictable aspect. But it's been a lot of fun. And so one of the things about

100
00:17:10.190 --> 00:17:12.640
E3410 x8539 Conf Room: jumping into the biological domain.

101
00:17:12.849 --> 00:17:21.630
E3410 x8539 Conf Room: One of the questions that I get very often. It's it's consistent. And I have to wrestle with every days I'll get the question.

102
00:17:22.480 --> 00:17:47.827
E3410 x8539 Conf Room: can you interpret your model? Can you give us the interpretation of your model? And generally that answer today is going to be. No, it's a nonlinear AI model. No? Well, not. I cannot give you an interpretation, not today. But that's not necessarily the purpose. It's for prediction, not necessarily interpretation. So I'm gonna make a couple of arguments about why, that's particularly important here.

103
00:17:48.320 --> 00:17:50.240
E3410 x8539 Conf Room: oh, this isn't clicking anymore.

104
00:17:51.020 --> 00:17:51.725
E3410 x8539 Conf Room: So

105
00:17:52.660 --> 00:18:08.979
E3410 x8539 Conf Room: in the background of playing around in physics for a long time and being very interested in physics and calculus. I think it's interesting to look back at how physics changed in time, and how that was, how it developed. So for most of history, physics was a field of philosophy.

106
00:18:09.609 --> 00:18:19.520
E3410 x8539 Conf Room: So there are 3 branches. You had physics, then you had logic and ethics, and if you were to propose anything physics, you had to reason with that between ethics and logics.

107
00:18:19.520 --> 00:18:44.419
E3410 x8539 Conf Room: logic, and a human experience. And so you were not able to propose something unless you could interpret it and explain it within all 3 parts of the field. And so this was a very qualitative over quantitative approach to how physics was described, and that was for 2 millennia starting with Aristotle and the Aristotelian physics all the way up until Copernicus, Copernicus, and Galileo were starting

108
00:18:44.420 --> 00:18:50.249
E3410 x8539 Conf Room: to change some things. There's really Newton and Leibniz when they introduce calculus

109
00:18:50.270 --> 00:18:57.060
E3410 x8539 Conf Room: and calculus absolutely transformed the way that physics move forward and how things were designed.

110
00:18:58.260 --> 00:19:06.960
E3410 x8539 Conf Room: But there's a what calculus was not seen is necessarily a golden. It wasn't perfect right off the bat. So

111
00:19:07.170 --> 00:19:35.409
E3410 x8539 Conf Room: like neural networks. And AI, this lack of interpretability also plagued calculus when it was originally introduced. And I really like this? this quote here, that calculus is often taught as if it is a pristine thing emerging Athena like complete and hole from the head of suits. It is not, it's take. It took over 200 years for us to actually create the foundations of modern calculus, and there was a lot of concern about how it worked.

112
00:19:35.410 --> 00:19:59.869
E3410 x8539 Conf Room: So in particular noon, and Leven said, Hey, here's a tool. It predicts particularly accurately. It works very well, and it works very effortlessly, but they couldn't articulate or explain or interpret this to the various philosophers and physicists of the seventeenth century. And so a lot of people push back on this and really what noon. And Leibniz said back, well, this wasn't exactly our goal.

113
00:20:01.340 --> 00:20:08.203
E3410 x8539 Conf Room: but the engineers and I'm an engineer. So I really like this? Kind of approach, said, Well, whatever that's fine.

114
00:20:08.640 --> 00:20:33.749
E3410 x8539 Conf Room: I don't. We don't care necessarily about the interpretability, but if we can predict accurately or predict something we can design. And this is gonna be really nice. And we can move forward. And it was this interaction between those new designs. Those new steps that engineers took that provided a lot of the data that essentially created the foundations of calculus which took about 200 years before. We had the modern calculus set of work for

115
00:20:33.750 --> 00:20:38.709
E3410 x8539 Conf Room: work with now. So I think the the purpose here is to really mention that I think

116
00:20:39.670 --> 00:21:03.910
E3410 x8539 Conf Room: this is a provocative statement that in interpretability is not necessarily the goal of what we're trying to do with AI. But that prediction is the goal. And here's my statement that I believe neural networks or AI will be to biology. What calculus was provides us a way to start interpreting or predicting from some input variables, some downstream output variables.

117
00:21:04.398 --> 00:21:31.510
E3410 x8539 Conf Room: And there's a particular reason for why neural networks, I think, are are unique and useful for biology versus calculus with physics. And because physics has a ton of classical laws. And it's relatively the universe is always seeking equilibrium. So it's kind of fall that's rolling down the hill the entire time. It's relatively elegant. It's relatively, and calculus is also very elegant.

118
00:21:31.510 --> 00:21:44.479
E3410 x8539 Conf Room: When we look at biology we don't have all these laws. And I really like this. This is something pulled out of the dissertation from 2022 from a caltex student

119
00:21:44.770 --> 00:22:12.149
E3410 x8539 Conf Room: that life perpetuates its existence out of equilibrium against the will. The second law firm, and I think that aspect there, against the will of what thermodynamics wants to do is why biology is so complex, and why we've had such a hard time understanding it from other tools like calculus, because it's using it is fighting. And if you've ever seen a fight. It's never elegant. It's always something crazy. That's it's going up this

120
00:22:12.220 --> 00:22:16.580
E3410 x8539 Conf Room: inclined march. So conflict of biology.

121
00:22:16.690 --> 00:22:23.229
E3410 x8539 Conf Room: this complex neural nets and AI are. So I think it's the right tool for us to start predicting.

122
00:22:24.979 --> 00:22:48.370
E3410 x8539 Conf Room: So now this kind of gets more into just the general motivation of of why we're doing this in agriculture and such and we know that agriculture must adapt faster than ever. We have a number of different pressures going on. We have massive population increase. It's gonna require 60% increase in agricultural production. We have ever changing growing conditions that we have to deal with, we have

123
00:22:48.890 --> 00:23:05.810
E3410 x8539 Conf Room: larger spreads of disease due to globalization. We need to make sure that with regulations that we meet the societal demands for how food is produced. And finally, we have to do all this, somehow, the 60% increase and all those other constraints without blowing up the planet

124
00:23:06.328 --> 00:23:27.319
E3410 x8539 Conf Room: and when we look at generally what we have with respect to data that's in agriculture, I think we have a really great opportunity to start accelerating even faster about. How we start designing because of all the different data sets that are popping up across the planet and the different opportunities that we can hopefully pull from that data.

125
00:23:29.240 --> 00:23:34.309
E3410 x8539 Conf Room: now, Scott was mentioning this, and I think, and so I'll go somewhat quickly here. But

126
00:23:34.450 --> 00:23:55.669
E3410 x8539 Conf Room: when we look at agricultural data, we see it increasing in a number of different ways. So it's not only in scale, but it's in resolution, and it's source and type. And so this demands that we have likewise advancements in modeling capabilities. In particular, on the AI side of things. So I like. This is kind of a nice example of

127
00:23:55.670 --> 00:24:12.584
E3410 x8539 Conf Room: what were the genomic resolution resolutions that you could get that skill or a big company? And we're very close to seeing the ability to look at full full assemblies for a lot of the different lines that we're we're producing.

128
00:24:12.960 --> 00:24:36.229
E3410 x8539 Conf Room: Now, this is a similar we see it not only in base pair resolution, but also transcript domics, gene expression data that's coming on other advancements and and gene ontology. And as we continue to learn about gene interactions. And there's a similar story here between weather soil, management and imaging. So we're getting all this data. It's all increasing in scale.

129
00:24:36.230 --> 00:24:59.269
E3410 x8539 Conf Room: and all has different data types. And so that traditionally would be a problem. Because we have base pairs here, we have time series weather data. We have care categorical management approaches. We have scalar variables that we see in the soil. All of these are very different data sources.

130
00:24:59.886 --> 00:25:13.119
E3410 x8539 Conf Room: And so AI provides a really unique, flexible opportunity where you can start synthesizing all these different multimodal data streams into one particular architecture to help you design.

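The multimodal synthesis described here can be sketched in miniature. Everything below is illustrative, not Bayer's pipeline: each "encoder" is a hand-rolled summary standing in for a learned per-modality encoder, and all feature names and values are made up.

```python
# Toy sketch of fusing the four modalities mentioned (markers, time-series
# weather, categorical management, scalar soil values) into one feature
# vector that a single model can consume.

def encode_weather(series):
    # time series -> simple summary statistics (min, max, mean)
    return [min(series), max(series), sum(series) / len(series)]

def encode_management(code, vocab):
    # categorical practice -> one-hot vector
    return [1.0 if code == v else 0.0 for v in vocab]

def fuse(markers, weather, management, soil, vocab):
    # concatenate per-modality encodings into one input vector
    return (list(markers) + encode_weather(weather)
            + encode_management(management, vocab) + list(soil))

x = fuse(markers=[0, 1, 2],           # SNP dosages (hypothetical)
         weather=[10.0, 20.0, 30.0],  # e.g. daily temperatures
         management="no_till",
         soil=[6.5],                  # e.g. pH
         vocab=["till", "no_till"])
# x now has 3 + 3 + 2 + 1 = 9 numeric features
```

A real system would learn the per-modality encoders jointly with the downstream model rather than hand-coding them, but the shape of the fusion is the same.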
131
00:25:14.180 --> 00:25:40.000
E3410 x8539 Conf Room: Today, I'll show a couple of quick examples, just focusing on the G part. So we're just gonna focus on the genomics and what we can do modeling genomics to a phenotype. The phenotypes that I'll mention here are observations like, say, yield, height, or disease resistance, and we'll be using a genotype vector of some sort, at some resolution, to map to that phenotype. And then we, of course, have some noise.

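The setup just described, a genotype mapped to an observed phenotype plus noise, can be written as a toy additive model. The marker effects and coding below are invented purely for illustration; real genomic prediction models learn these effects from data rather than being handed them.

```python
import random

def phenotype(genotype, effects, noise_sd=0.5):
    """Toy additive genotype-to-phenotype map: y = sum(g_i * b_i) + noise.

    genotype: 0/1/2 marker dosages (hypothetical SNP coding)
    effects:  per-marker additive effects (assumed known here only for
              illustration; in practice these are what the model learns)
    """
    signal = sum(g * b for g, b in zip(genotype, effects))
    return signal + random.gauss(0.0, noise_sd)

effects = [0.8, -0.3, 0.5]               # made-up marker effects
g = [2, 0, 1]                            # one toy genotype
y = phenotype(g, effects, noise_sd=0.0)  # noise off: deterministic part only
# signal = 2*0.8 + 0*(-0.3) + 1*0.5 = 2.1
```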
132
00:25:40.650 --> 00:26:01.520
E3410 x8539 Conf Room: and there's 4 pieces of this approach of going from genotype to phenotype that we'll care about. The first one is the architecture of an AI model. The architecture is the bones: it gives the structure, and most of the properties that we can expect out of a model will be embedded in the design of the architecture.

133
00:26:01.520 --> 00:26:17.839
E3410 x8539 Conf Room: And we'll show a, well, I think it's a cool approach, where we start putting biologically informed components into our architecture to increase prediction accuracy.

134
00:26:17.960 --> 00:26:46.739
E3410 x8539 Conf Room: the second one is loss functions. Loss functions are the learning criteria that you use for your AI model, and they're very important because they define the design question that you care about. So we should make sure that our loss functions align with those goals. And then I'll also show 2 quick other approaches. Active learning is an idea in AI, very similar to genomic selection,

135
00:26:46.960 --> 00:27:07.869
E3410 x8539 Conf Room: where you have an AI model and you have your system, and you're gonna allow them to interact with each other. So they get to talk, and they get to update and continue to progress towards some downstream goal. And then we'll say a couple of quick things about large language models and their applications right now. So, jumping into the architecture:

136
00:27:08.830 --> 00:27:24.560
E3410 x8539 Conf Room: so one of the questions that we wanted to answer was, could we start embedding domain knowledge into our models? First, when we look at the left side, the data that we have at scale at Bayer, we have

137
00:27:24.560 --> 00:27:46.030
E3410 x8539 Conf Room: tens of millions of phenotypes, these being yield, disease, etc., and we have perhaps over 100,000 unique genotypes. But these are at marker level, so we have very coarse information. It might only be 10,000 base pairs or something along those lines. So we're missing a lot of what's really going on in the genotypes that we care about.

138
00:27:46.690 --> 00:28:03.619
E3410 x8539 Conf Room: Now, on the other hand, when we look at domain knowledge, things like, say, gene regulatory networks or gene ontology terms, these provide some really high-fidelity information: things that we clearly know, or at least at this point in time believe are particularly important.

139
00:28:03.720 --> 00:28:16.319
E3410 x8539 Conf Room: Those are really high-fidelity pieces of information. But the problem is that we have very little data to build a model. So if we have gene expression data, typically we might only have a couple of different genotypes. So you can't really make a design model with that.

140
00:28:16.880 --> 00:28:36.599
E3410 x8539 Conf Room: So we said, well, what if you could combine those 2? You could take the general structure of a neural net with all of these parameters, and we can embed that domain knowledge in the center of it and make the model have to learn to predict through this particular graph. And so,

141
00:28:36.870 --> 00:28:48.456
E3410 x8539 Conf Room: to give a couple more reasons why we think this is a good idea, not only from a biological standpoint but from a mathematical standpoint: graphs are very attractive for this.

142
00:28:49.353 --> 00:29:11.349
E3410 x8539 Conf Room: One problem with off-the-shelf AI models, and we see a lot of off-the-shelf AI models being used, which is a bit of a concern; I would say we wanna be very particular about how we're using our AI models, is that we're gonna get over-parameterization. Now, if we build in a graph, we can reduce that complexity substantially.

143
00:29:11.750 --> 00:29:15.319
E3410 x8539 Conf Room: the other problem with off-the-shelf AI models, and

144
00:29:15.540 --> 00:29:27.749
E3410 x8539 Conf Room: pretty much all AI models, is that they struggle with understanding very long-range interactions. So if we know that we have some gene on, say, chromosome 1, and another gene on chromosome 10, they're billions of base pairs away.

145
00:29:27.940 --> 00:29:29.669
E3410 x8539 Conf Room: An AI model

146
00:29:30.180 --> 00:29:46.519
E3410 x8539 Conf Room: generally is never going to be able to pick that up. It's never gonna be able to understand that. If we have a graph, we can call out those known interactions very quickly and very explicitly. And so that provides a very big advantage.

147
00:29:47.470 --> 00:30:06.049
E3410 x8539 Conf Room: So here's an example of building one of these bio-GNNs, or bioinformed GNNs, as we call them. This is all open-source data, actually. We built the graph from the Gene Ontology resource. So we asked, okay, here are various genes that we have in the maize genome:

148
00:30:06.050 --> 00:30:18.820
E3410 x8539 Conf Room: build us a graph of all the different interactions. And then we took that graph and linked it up to the marker sets that we have, so that base pairs within a certain distance are going to be linked to that gene.

149
00:30:18.850 --> 00:30:38.449
E3410 x8539 Conf Room: And then there were some that were just really far away. We didn't necessarily need to do this, but they're really far away, and so we put them into their own little neural net. This was using the Genomes to Fields data set, and we were able to see somewhere around 15 to 20 percent improvement in our root mean squared error

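A minimal sketch of the masking idea behind these biologically informed networks, assuming a hypothetical two-gene, three-marker graph (this is not the actual architecture described in the talk): marker-to-gene connections exist only where the gene ontology graph links them, which is how the parameter count drops relative to a fully connected layer.

```python
# "Biologically informed" layer sketch: weights exist only where a known
# graph allows them. The graph, weights, and marker values are made up.

def masked_forward(markers, weights, mask):
    """Gene-node activations; mask[i][j] = 1 iff marker j maps to gene i."""
    out = []
    for i, row in enumerate(mask):
        out.append(sum(weights[i][j] * markers[j]
                       for j in range(len(markers)) if row[j]))
    return out

mask = [[1, 1, 0],       # gene 0 <- markers 0 and 1
        [0, 0, 1]]       # gene 1 <- marker 2 only
weights = [[0.5, -0.2, 0.0],
           [0.0, 0.0, 1.5]]
genes = masked_forward([1.0, 2.0, 3.0], weights, mask)
# gene 0: 0.5*1.0 - 0.2*2.0 = 0.1 ; gene 1: 1.5*3.0 = 4.5
```

In a full model the gene-node activations would feed further layers toward the phenotype, and the zeroed connections are what encode the domain knowledge and cut the over-parameterization mentioned above.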
150
00:30:38.960 --> 00:31:06.169
E3410 x8539 Conf Room: for yield, plant height, and ear height. What I'm most excited about with this approach is that it's organism-agnostic. So there's a ton of other gene ontology graphs that you could build for a number of other different data sets that exist out there, and start to continuously learn across other organisms about what these graphs could look like. These graphs are not unique; there's most likely not one silver-bullet graph. But you could tune these to

151
00:31:06.320 --> 00:31:24.659
E3410 x8539 Conf Room: explicit questions that you care about. Here we cared about yield, so we kind of just have to have everything. But if we cared about something much more specific, say something like flowering time, we could build a graph that's very explicitly defined for flowering time, and we don't really care about a number of other interactions, perhaps.

152
00:31:26.680 --> 00:31:44.259
E3410 x8539 Conf Room: So now, to loss functions. This is gonna be the most mathematical component of this, and I'll go a little bit quicker through it given how much time we have. So, to talk about loss functions, which are learning functions: the general goal of creating a loss function, whenever you have any model, is that

153
00:31:44.300 --> 00:32:10.050
E3410 x8539 Conf Room: you want your observed values to align with your predicted values. You want to be along this perfect prediction line, and so anything above this line is over-predicted, anything below this line is under-predicted. And the goal is that you want to push these as close together as possible. So typically when we train a model, we'll use something like mean squared error or mean absolute error, and generally this is just gonna take all the points and try to squish them together.

154
00:32:10.610 --> 00:32:21.289
E3410 x8539 Conf Room: But when we look at a lot of the data that we work with, and the design goal that we care about for genomic selection and crop improvement, the issue is that, if we look at all the data that we have,

155
00:32:21.686 --> 00:32:43.130
E3410 x8539 Conf Room: we're typically trying to improve, say, yield, and most of our data does not sit anywhere near the upper bounds of the things that we really care about, the products that we want to design. So what can this lead to? It can lead to very poor tail-wise prediction, because mean

156
00:32:43.700 --> 00:32:51.750
E3410 x8539 Conf Room: squared error tends to emphasize the region where all the data points exist. And

157
00:32:51.820 --> 00:32:57.369
E3410 x8539 Conf Room: if the bulk of the data is anti-correlated in any way with the tail events, the tails will just spread out.

158
00:32:57.410 --> 00:33:02.840
E3410 x8539 Conf Room: And that means that for what we're trying to design for, we're not gonna be very good at predicting.

159
00:33:02.940 --> 00:33:32.290
E3410 x8539 Conf Room: There's a second case of this, and I observe this one all the time, especially in agricultural data, and I'd argue this is perhaps even worse. This is compression, where we have observed data that extends over a pretty long span, and our model is only able to predict over a much shorter span. So it doesn't even understand the edges whatsoever, in both of those tails, whether that be yield or, say, disease resistance.

160
00:33:32.470 --> 00:33:36.570
E3410 x8539 Conf Room: So what we can do if we're thinking about this from a design perspective.

161
00:33:37.223 --> 00:33:51.800
E3410 x8539 Conf Room: We can actually create loss functions that prioritize learning about the tails, while giving up a little bit on the mean. So there's no free lunch, but you're allowed to

162
00:33:51.800 --> 00:34:11.870
E3410 x8539 Conf Room: pivot yourself towards what you actually want to design for. And there's some interesting work here; this comes out of the MIT lab that we're working with, on extreme events: how do you tease out extreme and rare events from different systems with AI? And one of the ways is to build these loss functions.

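One simple way to realize such a tail-prioritizing loss, purely as an illustration and not the specific weighting scheme from the MIT work mentioned, is to up-weight the squared error of samples that fall in the upper tail of the observed distribution; `tau` and `boost` below are hypothetical parameters.

```python
# Illustrative tail-weighted loss: samples at or above the tau-quantile of
# the observed values get `boost`x weight, trading a little accuracy in the
# bulk for better learning in the tail (the "no free lunch" in the talk).

def tail_weighted_sq_error(y_true, y_pred, tau=0.8, boost=5.0):
    cutoff = sorted(y_true)[int(tau * (len(y_true) - 1))]
    total = norm = 0.0
    for yt, yp in zip(y_true, y_pred):
        w = boost if yt >= cutoff else 1.0
        total += w * (yt - yp) ** 2
        norm += w
    return total / norm

y_true = [0.0, 1.0, 2.0, 10.0]   # one rare high performer in the tail
y_pred = [0.0, 1.0, 2.0, 5.0]    # model compresses the tail toward the bulk
loss = tail_weighted_sq_error(y_true, y_pred)
# the tail miss now dominates: loss exceeds the plain MSE of 6.25
```

A production version would make the weights a continuous function of the observed value, as the speaker notes below, rather than a hard threshold.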
163
00:34:12.730 --> 00:34:17.489
E3410 x8539 Conf Room: So I'm gonna jump past some of this. I had a proof, but

164
00:34:17.969 --> 00:34:30.560
E3410 x8539 Conf Room: I don't know, we'll pass on the proof, even if this is the right crowd for it. [Audience] It's a very elegant way to build in

165
00:34:31.000 --> 00:34:33.010
E3410 x8539 Conf Room: known constraints. Is that

166
00:34:33.618 --> 00:34:36.009
E3410 x8539 Conf Room: exactly? Yeah. Yup.

167
00:34:36.270 --> 00:34:43.870
E3410 x8539 Conf Room: So with the loss function, I'm just trying to understand what you're telling us:

168
00:34:44.520 --> 00:35:10.520
E3410 x8539 Conf Room: one option would be just to ignore half of the data and only focus on the data that's in the region that you want. That's not what you're doing; you're weighting the data somehow. Yeah, so there can be cases where that data in the middle is very useful for understanding the extremes.

169
00:35:10.680 --> 00:35:29.229
E3410 x8539 Conf Room: But sometimes there's data that comes at the expense of understanding those extremes. So that's why we weight it that way. And we weighted it in this very particular way to make sure it's a continuous distribution, so that in the case where everything's perfectly correlated,

170
00:35:29.320 --> 00:35:30.580
E3410 x8539 Conf Room: It still works

171
00:35:30.700 --> 00:35:41.104
E3410 x8539 Conf Room: great across the entire span. So yeah, we still don't want to throw data away by just completely ignoring it; we'd likely be missing a lot of information.

172
00:35:42.639 --> 00:35:57.799
E3410 x8539 Conf Room: so here's an example of using that for disease. Disease is a great case: any problematic disease by definition means that resistance is going to be rare. If resistance weren't rare, the disease wouldn't be problematic, and we wouldn't really care so much.

173
00:35:58.210 --> 00:36:17.520
E3410 x8539 Conf Room: And in all these cases, resistance is over here, in this tail, with very limited data; mostly everything here is susceptible. And if you use the standard genomics model, you get this compression effect: you see everything being compressed to the mean,

174
00:36:17.844 --> 00:36:43.189
E3410 x8539 Conf Room: the average value. And so your model is just giving you tons of average values out left and right. But you can start pulling and teasing out these resistance components of the genetics by adding in one of these loss functions. It's really hard to see with the green here, but this ends up removing the compression and gives you more of a diagonal line on your predicted

175
00:36:43.190 --> 00:36:47.600
E3410 x8539 Conf Room: and observed. And so here we don't have this over-prediction toward the mean.

176
00:36:47.600 --> 00:37:06.230
E3410 x8539 Conf Room: So now we're telling our model that it needs to focus very explicitly on what makes things rare. In this case, for diseases, this means that we can now move way faster when we see these problematic diseases, in terms of finding the right germplasm and then breeding those.

177
00:37:08.627 --> 00:37:18.792
E3410 x8539 Conf Room: One other part here on genomic selection that's very useful, after teasing that out: we can start implementing some ideas of active learning.

178
00:37:19.530 --> 00:37:36.370
E3410 x8539 Conf Room: I think many people are probably familiar with genomic selection. We typically will go test some set of genetics, observe the phenotypes, then train some model, and then we try to use that model to choose the next set of genotypes to put out in the field.

179
00:37:36.992 --> 00:37:54.589
E3410 x8539 Conf Room: Now, this has traditionally taken this approach, where this is what we call the acquisition function; it tells you which new genetics you want to put out in the field. And traditionally, we just exploited: the model says this is the best, so let's put that out there.

180
00:37:54.590 --> 00:38:14.350
E3410 x8539 Conf Room: but that doesn't allow the model to ever learn about other interesting ideas that are out there. So we need to make sure we start embedding some exploration terms, so that we're not biasing our model just to one particular solution, but allowing it to search the space much more dynamically.

181
00:38:14.490 --> 00:38:36.840
E3410 x8539 Conf Room: And we've done a bit of analysis of various different genomic data sets. And really, all this GIF is trying to show is that the blue blob is all the data, and the model is picking out all these red points, which are the high performers. It can do it at extremely efficient levels if it has an exploration term, and so this is perhaps maybe one-tenth of the data.

182
00:38:36.840 --> 00:38:44.589
E3410 x8539 Conf Room: So there's massive accelerations that we potentially see if we do appropriate exploration and active learning techniques.

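The exploit-versus-explore acquisition idea can be sketched as an upper-confidence-bound rule, where the score is the predicted mean plus a bonus proportional to the model's uncertainty. The candidate line names, means, and uncertainties below are made up; a real genomic selection loop would get them from the trained model.

```python
# UCB-style acquisition sketch: score = predicted mean + beta * uncertainty.
# beta = 0 is pure exploitation; beta > 0 adds the exploration term the
# talk argues for. All candidate statistics here are hypothetical.

def acquire(candidates, beta=1.0):
    """candidates: (name, predicted_mean, uncertainty) tuples; pick argmax."""
    return max(candidates, key=lambda c: c[1] + beta * c[2])[0]

pool = [("line_A", 9.0, 0.1),   # current favorite, well characterized
        ("line_B", 8.0, 2.0),   # slightly worse mean, poorly explored
        ("line_C", 5.0, 0.5)]

acquire(pool, beta=0.0)   # pure exploitation -> "line_A"
acquire(pool, beta=1.0)   # exploration bonus -> "line_B"
```

The exploration weight is what lets the loop spend some field plots on uncertain genotypes, which is where the efficiency gains described above come from.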
183
00:38:45.440 --> 00:38:49.810
E3410 x8539 Conf Room: And then the final thing, large language models. I have to say it because everyone's doing it.

184
00:38:50.607 --> 00:38:53.609
E3410 x8539 Conf Room: so one of the things that we're interested in.

185
00:38:54.149 --> 00:39:20.960
E3410 x8539 Conf Room: there is that you have a massive genome, and you need to find what are the interesting regions for us to go and edit. And so we've been using some of the large language models to find segments that have unmethylated regions, accessible chromatin, conserved non-coding sequences, and transcription factor binding sites. And so we use these models to try to figure out where those are, and then say, hey, that's a good high-value

186
00:39:21.249 --> 00:39:35.700
E3410 x8539 Conf Room: editing target, and then we go try to collect that data. Now, we have the nice advantage that we have a ton of data on the very specific germplasm that we wanna make specific edits to. So that really helps with building some of these models.

187
00:39:37.633 --> 00:40:03.029
E3410 x8539 Conf Room: And just to start wrapping up here: I talked all about genetics, but there's so much opportunity in the soil, weather, and management components here, as well as imaging, to either image some of these things like weather or management practices, or imaging that gives you much better high-resolution phenotyping, and phenotypes that we have yet to even start modeling or observing.

188
00:40:03.475 --> 00:40:07.830
E3410 x8539 Conf Room: And those will all fit very nicely into the AI architectures.

189
00:40:09.211 --> 00:40:26.049
E3410 x8539 Conf Room: I like showing this slide that we have. I didn't build this slide, somebody else did, but it's kinda nice to show the progress of what we've done. But I think, even though we've come a very long way, the next steps are gonna have to go beyond bushels per acre. It's not just gonna be about efficiency, but about other things:

190
00:40:26.408 --> 00:40:43.629
E3410 x8539 Conf Room: how can we make sure that we support the livelihoods of farmers, and also meet the regulations and societal pressures on how food's produced, and other sustainability metrics? And I think being able to synthesize all these different data streams is gonna be very critical to going beyond

191
00:40:43.630 --> 00:40:45.798
E3410 x8539 Conf Room: the traditional bushel per acre.

192
00:40:46.380 --> 00:40:49.910
E3410 x8539 Conf Room: So a last couple comments here about

193
00:40:50.140 --> 00:41:06.270
E3410 x8539 Conf Room: where I think, maybe on the education and training side, things must shift to recognize the opportunity of these data-driven methods. I would say first that AI in ag requires a bit of a perspective change, and this is

194
00:41:06.270 --> 00:41:29.580
E3410 x8539 Conf Room: that interpretability and explainability are important, and things that we should continue to ask questions about, but they should not undermine the capability of prediction and design. Sometimes you see that because something can't be interpreted or explained, we don't move forward with it; but for prediction and design, we don't necessarily need interpretability and explainability, at least not today. We'll give it some time.

195
00:41:29.830 --> 00:41:42.507
E3410 x8539 Conf Room: The next part is formalizing quantitative design goals, and really making sure that our design goals are aligning perfectly with what we're doing with the tools that we have.

196
00:41:42.990 --> 00:42:05.180
E3410 x8539 Conf Room: And that's more of an engineering perspective here, of trying to teach these creative solutions with clear assumptions and hypotheses and boundaries that we wanna operate in. And the third one is that we still need to make sure we identify problems from deep biological domain knowledge. I think one of the most interesting things over the last 2 years being at Bayer is

197
00:42:05.220 --> 00:42:18.239
E3410 x8539 Conf Room: the very critical conversations that I've had with a lot of career biologists, who have been absolutely amazing in terms of figuring out what are the problems we can solve.

198
00:42:18.240 --> 00:42:39.770
E3410 x8539 Conf Room: So this deep biological domain knowledge can't go away here, in this discussion of these kind of 3 items going forward. And maybe if I leave one last thing: this is kind of how I see it, as a work of art of some sort. And I think the engineering mindset really comes in building the frame, setting the boundary conditions and the design goal.

199
00:42:40.174 --> 00:42:51.910
E3410 x8539 Conf Room: AI is really the tool, and biology provides all the different colors and interesting components that we can use to start painting this picture going forward.

200
00:42:51.930 --> 00:42:55.690
E3410 x8539 Conf Room: So with that, I think that was the end of what we had.

201
00:42:56.130 --> 00:42:57.220
E3410 x8539 Conf Room: thanks. Bye, bye.

202
00:43:04.470 --> 00:43:16.129
E3410 x8539 Conf Room: I can hear, yeah. Thank you, Scott; thank you, Ethan. So we now have a little bit of time for some questions. Do we have any questions in the room?

203
00:43:17.170 --> 00:43:21.649
E3410 x8539 Conf Room: Please, do we need a microphone down here?

204
00:43:21.760 --> 00:43:24.310
E3410 x8539 Conf Room: So the people in here is also in. Yeah.

205
00:43:27.370 --> 00:43:29.260
E3410 x8539 Conf Room: here it comes.

206
00:43:30.802 --> 00:43:34.477
E3410 x8539 Conf Room: Yeah. Oh, this microphone is working.

207
00:43:38.990 --> 00:43:57.220
E3410 x8539 Conf Room: Yeah, hi, thanks. I'm Chris Aguissa, I'm a plant physiologist in IOS. So my question to you is: I imagine that in your data you're looking at yield and disease resistance, but you probably are, I assume, integrating data from the environment as well.

208
00:43:57.420 --> 00:44:06.320
E3410 x8539 Conf Room: And I imagine that you guys have amazing sensors and, you know, measurements of all the differences in environmental conditions during the day, during the seasons.

209
00:44:06.380 --> 00:44:26.499
E3410 x8539 Conf Room: So how hard is it to integrate all this into yield and disease resistance? And with the precision that I assume you guys have, either in greenhouses or fields, is it better to look at things very specifically? Or is it better to look at all the changes, all the complex changes in the environment? Is it

210
00:44:26.660 --> 00:44:47.199
E3410 x8539 Conf Room: in a way better to look at all the noise at once, or is it better to be very specific? So it will depend on your design goal. In the case where we want a germplasm that operates really well in a very select region, then we can be very specific for that. If we want this to cover many broad acres,

211
00:44:47.200 --> 00:44:58.590
E3410 x8539 Conf Room: then we're no longer going for a very specific performance, but now a distribution of performances. So we wanna make sure that that germplasm is gonna operate in a number of different environments.

212
00:44:58.912 --> 00:45:19.900
E3410 x8539 Conf Room: And so that changes your design goal. You're still specific, but it's just a different set: now you're specific over a wide range of conditions, whereas before you were specific over a smaller range. So yeah, whenever you're training these models,

213
00:45:20.060 --> 00:45:43.039
E3410 x8539 Conf Room: when you have your architecture and your data, there's a finite amount of learning that can be achieved, and you have to choose exactly where you want that learning to explicitly go. And so I think it brings a lot more to the table if you define that very clearly. But on the concept of just more data coming through with the environment,

214
00:45:43.040 --> 00:45:54.440
E3410 x8539 Conf Room: there is a little bit of a caveat to that one. So, for example, I did a lot of fluid mechanics in my PhD and postdoc, and

215
00:45:54.930 --> 00:46:14.609
E3410 x8539 Conf Room: those are really complex systems; if you look over the last 30 years of weather, 30 years of weather is nowhere near enough to really understand how weather is operating. So we need a lot more data on the environmental side to be particularly accurate or high-fidelity with what's going on.

216
00:46:14.690 --> 00:46:26.308
E3410 x8539 Conf Room: So I think it's great that we're continuously getting more information about the environment, but for the total weather scenarios, we probably still have to box those in just a little bit more.

217
00:46:34.880 --> 00:46:36.700
E3410 x8539 Conf Room: so just wondering

218
00:46:37.186 --> 00:46:48.983
E3410 x8539 Conf Room: are you introducing anything new into the equation, you know, with synthetic biology, synthetic genes? Because what occurs to me is, you're bringing all this,

219
00:46:49.620 --> 00:46:52.220
E3410 x8539 Conf Room: you know, 1 million dollar technologies.

220
00:46:52.510 --> 00:46:59.720
E3410 x8539 Conf Room: All this information. But preceding you has been millions of years of evolution and 4,000 years of farming.

221
00:47:00.010 --> 00:47:09.319
E3410 x8539 Conf Room: And I wonder, if you're just using the same set of genes, how much design space there is to actually move into

222
00:47:09.340 --> 00:47:10.629
E3410 x8539 Conf Room: With all these.

223
00:47:10.950 --> 00:47:13.639
E3410 x8539 Conf Room: you know, high high tech approaches.

224
00:47:13.750 --> 00:47:16.980
E3410 x8539 Conf Room: and as you generate new

225
00:47:17.800 --> 00:47:25.310
E3410 x8539 Conf Room: types, and I suppose, I'm not a plant biologist, but new types of different species, were you sacrificing

226
00:47:25.530 --> 00:47:33.470
E3410 x8539 Conf Room: in terms of, for example, taste right? Because you don't have a new design space to move into?

227
00:47:33.730 --> 00:47:44.270
E3410 x8539 Conf Room: Yeah. So depending on how we define that problem, we may sacrifice. There's a potential that we, well, you might have a better answer for this one. Okay.

228
00:47:44.890 --> 00:48:09.630
E3410 x8539 Conf Room: So yeah, if we only care about yield, and that's the only thing that we're measuring, and that's what the model is going after, it's not guaranteed that everything else goes away, but it is definitely a risk that everything else goes away. Now, we typically have a lot more than just yield that we're designing for; there are a number of other metrics that exist, and all of those kind of go into the calculation of a multi-objective

229
00:48:10.153 --> 00:48:16.199
E3410 x8539 Conf Room: design principle. Maybe you were getting at a different point there, about:

230
00:48:16.580 --> 00:48:19.890
E3410 x8539 Conf Room: Have we pretty much seen most of the genomic

231
00:48:20.240 --> 00:48:41.059
E3410 x8539 Conf Room: we squeezed everything out of there? I don't think that's true, but we could say, from a traditional breeding standpoint, let's assume that is true. I think editing just by itself, and what we can do there, is going to completely change that and introduce a whole new set of variation that is going to continue

232
00:48:41.460 --> 00:48:50.300
E3410 x8539 Conf Room: to move the boundaries. So even if that were the case, I think the new technology is going to change that. We often think of genomes as very static.

233
00:48:50.310 --> 00:49:02.369
E3410 x8539 Conf Room: Yeah, they're not. They continue to evolve even within breeding programs. So you get, you know, new recombination, gene duplications; the genome's dynamic, transposons moving, changing how genes work.

234
00:49:02.530 --> 00:49:08.490
E3410 x8539 Conf Room: And that continues to drive the variation that these models are gonna have to continue capturing, because it continues to evolve over time.

235
00:49:08.970 --> 00:49:15.230
E3410 x8539 Conf Room: I'm reminded of a paper back in the late 90s from my postdoc advisor, who was a chief science officer at USDA for a while.

236
00:49:15.510 --> 00:49:31.519
E3410 x8539 Conf Room: There's a breeding program in barley at Minnesota. They've had the same genetics for, I think, 60-some years, and they continue to make yield improvements. And the question was, where's it coming from? By the model, it should stop, but it keeps moving. And so there are all these other processes; it's a dynamic genome.

237
00:49:32.060 --> 00:49:33.480
E3410 x8539 Conf Room: There's things happening.

238
00:49:40.160 --> 00:49:57.269
E3410 x8539 Conf Room: So first, I just have to say, I loved the talks all the way through, and I'm so happy I get to go to dinner with you so I can pick your brain, because being limited to one question is difficult. But there was one thing in one of your slides I thought was really interesting, as it combines here

239
00:49:57.270 --> 00:50:15.679
E3410 x8539 Conf Room: with what you're showing on this slide, which is that importance around computational thinking, engineering thinking, and biology thinking. There's one more piece that I think you're in a really cool spot for, which is genetics thinking, and specifically with maize researchers, because

240
00:50:15.950 --> 00:50:31.219
E3410 x8539 Conf Room: you work with folks that have to sort of plan experiments years in advance, and have this really limited number of iteration cycles, a constraint you don't see as much on the data, computational thinking, or engineering thinking side. But you had this number up there,

241
00:50:31.240 --> 00:50:45.890
E3410 x8539 Conf Room: 26. We have 26 more years before we hit 2050 and all of these dire warnings that come out there, and I was just sort of wondering, how do you think about what you can do within that timeframe

242
00:50:45.890 --> 00:51:04.369
E3410 x8539 Conf Room: as you start today, and specifically around how you set yourself up for the most success in the future? And do you have any predictions about, you know, where you'll be in 26 years, where we'll be in 26 years, as we sort of think forward, both in the technology and the programs that are in play right now? Just what are your thoughts, and how do you think about that?

243
00:51:04.400 --> 00:51:23.859
E3410 x8539 Conf Room: Yeah, on the 26-year piece: this was one of my concerns coming to Bayer originally. I was used to experiments where we had simulations that we would have the AI work with, and so

244
00:51:23.860 --> 00:51:52.520
E3410 x8539 Conf Room: you get results back within a couple of minutes. And now you go, okay, well, if we had 2 inbred lines ready to go and put that in the field as a hybrid, you have to wait a whole year to get that data back. Now, if you wanna design a new inbred, you have to cross it, go through a number of different processes; the shortest timeframe, I think, is 3 years to get a hybrid data point. So if we were to start something today, we're not gonna get that data point for 3 years.

245
00:51:52.780 --> 00:52:00.330
E3410 x8539 Conf Room: So I think that's a really interesting question, and I think it really underlines the aspect of:

246
00:52:00.910 --> 00:52:29.630
E3410 x8539 Conf Room: we have to look really far downstream and ask the question, are we exploring enough? And this is something that maybe the public sector will be better at, because we get down to points where we have to be able to provide a certain set of products that are gonna be high-performing, and sometimes we just have to exploit. We have to take the things that we have right now and sometimes can't look downstream. So I think that's a big question of risk

247
00:52:29.710 --> 00:52:37.440
E3410 x8539 Conf Room: that I hope we can solve. But yeah, there's a lot involved in asking someone to make a decision 3, 4, 5 years downstream.

248
00:52:38.050 --> 00:52:41.736
E3410 x8539 Conf Room: I don't know if I answered that in any useful way, but...

249
00:52:42.360 --> 00:52:46.790
E3410 x8539 Conf Room: So I generally think that with modeling, or

250
00:52:47.150 --> 00:53:12.880
E3410 x8539 Conf Room: AI modeling, you're mostly able to look for predictions; you don't extrapolate. You can look within your data set, but you're not really able to predict far outside of it. Is that true, given the constraints and the loss functions you're incorporating in the models? So if I'm really looking for something new that's going to enable me to do agriculture in

251
00:53:13.000 --> 00:53:17.949
E3410 x8539 Conf Room: 2050, will I be able to find that? Or do I really need to ask,

252
00:53:18.318 --> 00:53:21.819
E3410 x8539 Conf Room: how do I push those models so that I get that extrapolation?

253
00:53:21.900 --> 00:53:28.330
E3410 x8539 Conf Room: Yeah. So when you do this active learning approach, the goal is to be

254
00:53:28.420 --> 00:53:45.109
E3410 x8539 Conf Room: operating on the edge. You're trying to find the edges continuously, and so you are trying to extrapolate. And one of the things that has definitely been hard for me to discuss, multiple times, is that your model, in extrapolation, is going to be wrong. If you're working in extreme events,

255
00:53:45.240 --> 00:54:04.960
E3410 x8539 Conf Room: they're so rare that 99.5% of the time you're going to be wrong. And a lot of people don't like that answer, but it's a reality: you're going to be wrong most of the time. But if you run the statistics and you run a number of different simulations, you can see that being wrong is worth it, because you're gaining

256
00:54:05.100 --> 00:54:29.333
E3410 x8539 Conf Room: understanding and data assets that are much more interesting, because they're diverse and they're answering a question in the space of genetics, essentially. So I think that pairs well with your question: most of the time, in extrapolation, the models are wrong, and that's a really hard discussion to have. But it's usefully wrong, versus

257
00:54:29.690 --> 00:54:35.088
E3410 x8539 Conf Room: being right and not moving anything forward.
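[Editor's note] The active-learning loop described here, repeatedly querying where the model is least certain and accepting that those individual predictions are usually wrong, can be sketched in a few lines. This is a hypothetical toy, not the speakers' actual pipeline: a bootstrap ensemble of polynomial fits stands in for whatever surrogate model is used, and `true_yield` is an invented 1-D stand-in for a field trial.

```python
import warnings
import numpy as np

warnings.simplefilter("ignore")  # suppress polyfit conditioning warnings on tiny bootstraps
rng = np.random.default_rng(0)

def true_yield(x):
    # invented 1-D "genetics -> yield" response, unknown to the learner
    return np.sin(3 * x) + 0.5 * x

def fit_ensemble(X, y, n_models=20):
    # bootstrap ensemble of cubic fits as a cheap uncertainty proxy
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), len(X))
        models.append(np.polyfit(X[idx], y[idx], 3))
    return models

def predict(models, X):
    preds = np.stack([np.polyval(m, X) for m in models])
    return preds.mean(axis=0), preds.std(axis=0)

# start with a few labelled points, then always query where the ensemble
# disagrees most (the "edges"), even though predictions there are often wrong
X_pool = np.linspace(-2, 2, 201)
X = np.array([-1.5, -0.75, 0.0, 0.75, 1.5])
y = true_yield(X)
for _ in range(10):
    models = fit_ensemble(X, y)
    _, std = predict(models, X_pool)
    x_next = X_pool[np.argmax(std)]        # highest-uncertainty candidate
    X = np.append(X, x_next)
    y = np.append(y, true_yield(x_next))   # the "field trial" returns the label

mean, _ = predict(fit_ensemble(X, y), X_pool)
rmse = np.sqrt(np.mean((mean - true_yield(X_pool)) ** 2))
print("labelled points:", len(X), "pool RMSE:", round(rmse, 3))
```

The point of the sketch is the acquisition rule, not the model: the queried points cluster where the ensemble is most uncertain, which is exactly where its predictions are least trustworthy, yet each such label is the most informative one available.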

258
00:54:36.944 --> 00:54:48.404
E3410 x8539 Conf Room: And you need 3 years to prove that you were wrong, and you have to convince your leadership of this.

259
00:54:49.630 --> 00:54:50.720
E3410 x8539 Conf Room: So

260
00:54:50.810 --> 00:55:03.229
E3410 x8539 Conf Room: Fantastic talk, thank you so much. You're very fortunate in having the luxury of all of these data, and years and years, decades, of research on maize.

261
00:55:03.410 --> 00:55:09.390
E3410 x8539 Conf Room: So given your experience of working with maize, recognizing that,

262
00:55:09.400 --> 00:55:13.809
E3410 x8539 Conf Room: as the climate continues to change, we're going to have to bring in additional crops.

263
00:55:13.830 --> 00:55:17.169
E3410 x8539 Conf Room: be they orphan crops, be they new crops:

264
00:55:17.560 --> 00:55:32.910
E3410 x8539 Conf Room: what recommendations would you give to researchers who are working on some of these orphan or new crops that we're now developing, to leverage AI to the best extent possible and improve them as quickly as possible?

265
00:55:33.410 --> 00:55:59.960
E3410 x8539 Conf Room: Well, I'd say maybe the first point, on this kind of active learning question, is being willing to spread out your data and allow the model to paint the lines in between it. That's going to mean that the limited resources you have to work with are spent as efficiently as possible towards building something that can predict well. So if you're starting from scratch and you have that opportunity,

266
00:55:59.960 --> 00:56:10.326
E3410 x8539 Conf Room: have a mindset that data is an asset, and take a risk-analysis approach to really get the best data. And if you do that:

267
00:56:10.920 --> 00:56:30.339
E3410 x8539 Conf Room: for most of the data sets that exist out there, if you do a kind of historical analysis on them, you only need about 5 to 10% of the data to get the accuracy, and you could throw away the other 90%. So there's a ton of experiments, a ton of wasted data. Taking an approach like that, I think, is very critical from the AI side.
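[Editor's note] The "you only need 5 to 10% of the data" claim can be illustrated with a subsampling experiment: fit the same model on random fractions of a training set and compare held-out accuracy against the full-data fit. Everything below is simulated for illustration (the ridge model, the Gaussian "genotype" features, and the noise level are all assumptions); the fractions that actually suffice depend entirely on how redundant a real archive is.

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical stand-in for a historical trial archive: many plots, but only
# a handful of truly informative directions, so most records are redundant
n, p = 2000, 20
G = rng.normal(size=(n, p))                 # "genotype" features
beta = rng.normal(size=p)                   # true effects
y = G @ beta + rng.normal(scale=2.0, size=n)

def ridge_fit(X, y, lam=1.0):
    # closed-form ridge regression
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

# independent held-out test set
G_test = rng.normal(size=(500, p))
y_test = G_test @ beta + rng.normal(scale=2.0, size=500)

def r2(w):
    resid = y_test - G_test @ w
    return 1 - resid.var() / y_test.var()

full_r2 = r2(ridge_fit(G, y))
accs = {}
for frac in (0.05, 0.10, 0.50):
    idx = rng.choice(n, int(frac * n), replace=False)
    accs[frac] = r2(ridge_fit(G[idx], y[idx]))

print("full:", round(full_r2, 3),
      {f: round(a, 3) for f, a in accs.items()})
```

In this simulated setting the 5% and 10% subsets land within a few points of the full-data accuracy, which is the shape of the historical-analysis result being described, though here it is baked in by the simulation's redundancy assumption.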

268
00:56:30.880 --> 00:56:32.360
E3410 x8539 Conf Room: And I'll add,

269
00:56:32.790 --> 00:56:35.810
E3410 x8539 Conf Room: to some extent this is happening. So obviously, corn

270
00:56:36.130 --> 00:56:44.249
E3410 x8539 Conf Room: for Bayer is the most profitable crop, with the most resources going to it. So a lot of things get developed in corn, maize, and then propagated

271
00:56:44.400 --> 00:56:46.929
E3410 x8539 Conf Room: down to soy, and then rice and wheat, and

272
00:56:47.480 --> 00:56:53.070
E3410 x8539 Conf Room: and even into the vegetables. We have a veg division that breeds something like 70 different vegetables.

273
00:56:53.110 --> 00:57:04.510
E3410 x8539 Conf Room: So over time these things spread. We first introduced all the genotyping and seed chipping, which started in corn and then made its way down to the other crops. Same with all the genotyping and genomic resources.

274
00:57:05.470 --> 00:57:16.470
E3410 x8539 Conf Room: And if you think about orphan crops, cover crops, and things like that, which we're investing in, and other companies are investing in: those technologies that he's talking about are also making their way into those

275
00:57:16.530 --> 00:57:18.650
E3410 x8539 Conf Room: orphan crops as well. So yeah.

276
00:57:18.790 --> 00:57:35.680
E3410 x8539 Conf Room: it's happening, slowly. Alright, well, thank you again. It was a wonderful presentation, and I'm sure there'll be additional opportunities to meet with NSF staff for the rest of the day. So let's thank

277
00:57:38.070 --> 00:57:39.050
E3410 x8539 Conf Room: sure. Yep.

278
00:57:42.750 --> 00:57:49.599
E3410 x8539 Conf Room: Okay, that was great. Nice.

