Chemical knowledge and reasoning of large language models vs. chemist expertise (nature.com)
96 points by bookofjoe 2 days ago | 69 comments
PlasmonOwl 9 hours ago [-]
Ok so I am always interested in these papers as a chemist. Often, we find that LLMs are terrible at chemistry. This is because the lived experience of a chemist is fundamentally different from the education they receive. Often, a masters student takes 6 months to become productive at research in a new sub field. A PhD, around 3 months.

Most chemists will begin to develop an intuition. This is where the issues develop.

This intuition is a combination of the chemist's mental model, and how the sensory environment stimulates that. As a polymer chemist in a certain system, maybe brown means I see scattering, hence particles. My system is supposed to be homogeneous so I bin the reaction.

It is often known that good grades don’t make good researchers. That’s because researchers aren’t doing rote recall.

So the issue is this: we ask the LLM, how many proton environments are in this NMR spectrum?

We should ask: I’m intercalating Li into a perovskite using BuLi. Why does the solution turn pink?

Workaccount2 6 hours ago [-]
I think a huge reason why LLMs are so far ahead in programming is because programming exists entirely in a known and totally severed digital environment outside our own. To become a master programmer all you need is a laptop and an internet connection. The nature of it existing entirely in a parallel digital universe just lends itself perfectly to training.

All of that is to say that I don't think the classic engineering fields have some kind of knowledge or intuition that is truly inaccessible to LLMs, I just think that it is in a form that is too difficult right now to train on. However if you could train a model on them, I strongly suspect they would get to the same level they are at today with software.

alganet 5 hours ago [-]
> I think a huge reason why LLMs are so far ahead in programming

Are they? Last time I checked (couple of seconds ago), they still made silly mistakes and hallucinated wildly.

Example: https://imgur.com/a/Cj2y8km (AI teaching me about the Coltrane operator, that obviously does not exist).

gcanko 4 hours ago [-]
You're using the worst model when it comes to programming, not sure what point you're trying to prove here. That's why when someone starts ranting about how useless AI models are when it comes to coding I always assume they're just using inferior models.
alganet 4 hours ago [-]
My question was very simple. Suitable for a simpler model.

I can come up with prompts that make better models hallucinate (see post below).

I don't understand your objection. This is a known fact, LLMs hallucinate shit regardless of the model size.

CamperBob2 4 hours ago [-]
LLMs are getting better. Are you?

Nothing matters in this business except the first couple of time derivatives.

alganet 4 hours ago [-]
Maybe I'm not.

However, I'm discussing this within the context of the study presented in the paper, not some future yet-to-be-achieved performance expectation.

If we step outside the context of the paper (not advised), I think any average developer is better than an LLM at energy efficiency. LLMs cheat by consuming more resources than a human. "Better" is quite relative. So, let's keep it reasonable.

aoeusnth1 5 hours ago [-]
Are you intentionally sandbagging the LLMs to prove a point, or do you really think 4o-mini is good enough for programming?

Even 2.5 flash easily gets this https://imgur.com/a/OfW30eL

alganet 4 hours ago [-]
The point is that I can make them hallucinate quite easily. And they don't demonstrate knowing their own limitations.

For example, 2.5 Flash fails to explain the difference between the short ternary operator (null coalescing) and the Elvis operator.

https://imgur.com/a/xKjuoqV

Even when I specify a language (therefore clearing the confusion, supposedly), it still fails to even recognize the Elvis operator by its toupee, and mixes up the explanation (it doesn't even understand what I asked).

https://imgur.com/a/itr87hM

So, the point I'm trying to make is that they're not any better for programming than they're for chemistry.

CamperBob2 4 hours ago [-]
Flash is the wrong model for questions like that -- not that you care -- but if you'd like to share the actual prompt you gave it, I'll try it in 2.5 Pro.
alganet 4 hours ago [-]
"explain me the difference between the short ternary operator and the Elvis operator"

When it failed, I replied: "in PHP".

You don't seem to understand what I'm trying to say and instead are trying to defend LLMs for a fault that is a fact known in the industry at large.

I'm sure that in short time, I could make 2.5 Pro hallucinate as well. If not on this question, on others.

This behavior is in line with the paper's conclusions:

> many models are not able to reliably estimate their own limitations.

(see Figure 3, they tested a variety of models of different qualities).

This is the kind of question a junior developer can answer with simple google searches, or by reading the PHP manual, or just by testing it on a REPL. Why do we need a fancy model in order to answer such a simple inquiry? Would a beginner know that the answer is incorrect and he should use a different model?

Also, from the paper:

> For very relevant topics, the answers that models provide are wrong.

> Given that the models outperformed the average human in our study, we need to rethink how we teach and examine chemistry.

That's true for programming as well. It outperforms the average human, but then it makes silly mistakes that could confuse beginners. It displays confidence in being plain wrong.

The study also used manually curated questions for evaluation, so my prompt is not some dirty trick. It's totally in line with the context of this discussion.

CamperBob2 3 hours ago [-]
It's better than it was a year ago, as you'd have discovered for yourself if you used current models. Nothing else matters.

See if this looks any better (I don't know PHP): https://g.co/gemini/share/7849517fdb89

If it doesn't, what specifically is incorrect?

alganet 1 hours ago [-]
What I expect from a human is to ask "in which language?", because it makes a difference. If no language was supplied, I expect a brief summary of null coalescing and shorthand ternary options with useful examples in the most popular languages.

--

The JavaScript example should have mentioned the use of `||` (or operator) to achieve the same effect of a shorthand ternary. It's common knowledge.

In PHP specifically, `??` allows you to null coalesce array keys and other types of complex objects. You don't need to write `isset($arr[1]) ? $arr[1] : "ipsum"`, you can just `$arr[1] ?? "ipsum"`. TypeScript has it too and I would expect anyone answering about JavaScript to mention that, since it's highly relevant for the ecosystem.

Also in PHP, there is the `?:` that is similar to what `||` does in JavaScript in an assignment context, but due to type juggling, it can act as a null coalesce operator too (although not for arrays or complex types).

The PHP example they present, therefore, is plain wrong and would lead to a warning for trying to access an unset array key. Something that the `??` operator (not mentioned in the response) would solve.

I would go as far as explaining null conditional accessors as well: `$foo?->bar` or `foo?.bar`. Those are often called Elvis operators colloquially and fall within the same overall problem-solving category.
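
To make the distinction concrete, here is a minimal JavaScript sketch of the three operators discussed above. The helper names (`withOr`, `withNullish`, `readBar`) are made up for illustration; only the operators themselves come from the discussion. It covers the JavaScript side only and should not be read as a description of PHP's `?:` or `??`.

```javascript
// Hypothetical helper names, used only to label each operator's behavior.

// `||` falls back on ANY falsy value (0, "", false, null, undefined, NaN).
function withOr(value, fallback) {
  return value || fallback;
}

// `??` falls back only when the value is null or undefined.
function withNullish(value, fallback) {
  return value ?? fallback;
}

// `?.` yields undefined instead of throwing when the receiver is null/undefined.
function readBar(foo) {
  return foo?.bar;
}

const arr = ["lorem"];

console.log(withOr(0, "ipsum"));        // "ipsum": 0 is falsy, so the fallback wins
console.log(withNullish(0, "ipsum"));   // 0: 0 is not null/undefined, so it is kept
console.log(arr[1] ?? "ipsum");         // "ipsum": a missing index reads as undefined
console.log(readBar(null));             // undefined: no TypeError is thrown
console.log(readBar({ bar: 1 }));       // 1
```

The first two calls are where the operators disagree: a legitimate `0` or empty string survives `??` but is replaced by `||`, which is why the choice between them matters for defaults.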

The LLM answer is a dangerous mix of incomplete and wrong. It could lead a beginner to adopt an old bad practice, or leave a beginner without a more thorough explanation. Worst of all, the LLM makes those mistakes with confidence.

--

What I think is going on is that null handling is such a basic task, that programmers learn it in the first few years of their careers and almost never write about it. There's no need to. I'm sure a code-completion LLM can code using those operators effectively, but LLMs cannot talk about them consistently. They'll only get better at it if we get better at it, and we often don't need to write about it.

In this particular Elvis operator thing, there has been no significant improvement in the correctness of the answer in __more than 2 whole years__. Samples from ChatGPT in 2023 (note my image date): https://imgur.com/UztTTYQ https://imgur.com/nsqY2rH.

So, _for some things_, contrary to what you suggested before, LLMs are not getting that much better.

CamperBob2 4 hours ago [-]
They aren't getting any better at programming, so they naturally assume the LLMs aren't, either.
CGMthrowaway 46 minutes ago [-]
>the lived experience of a chemist is fundamentally different from the education they receive. Most chemists will begin to develop an intuition.

Is this a documentation problem? The LLMs are only trained on what is written down. Seems to track with another comment further down quoting:

"Models are limited in ability to answer knowledge-intensive questions, probably because the required knowledge cannot easily be accessed via papers but rather by lookup in specialized databases, which the humans used to answer such questions"

fuzzfactor 1 hours ago [-]
>using BuLi. Why does the solution turn pink?

I would say odds are because of an impurity. My first guess might be the solvent if there is more in action than reagents or reactants. Maybe could be confirmed or denied by some carefully figured filtration beforehand, which might not even be that difficult. I doubt I would try much further than that unless it was a bad problem.

Although for instance an alternate simple purification like distillation is pretty much routine for pure aniline to get some colorless material, and that's some pretty rough stuff to handle.

Now I once was a young chemist facing AI, I ended up highly focused on going forward in ways that would not be "taken over" by AI, and I knew I couldn't be slow or recession still might catch up with me, plus the 1990's were approaching fast ;)

By the mid 1990's I figured there's no way the stuff they have in this paper had not been well investigated.

I always knew it would take people that had way more megabytes than I could afford.

Sheesh, did I overestimate the progress people were making when I wasn't looking.

CamperBob2 4 hours ago [-]
Just out of curiosity (not knowing anything about butyllithium other than what I've read on 'Things I Won't Work With'), is this answer from o3-pro even close?

https://chatgpt.com/share/685041db-c324-800b-afc6-5cb2c5ef31...

calibas 17 hours ago [-]
I'm sure an LLM knows more about computer science than a human programmer.

Not to say the LLM is more intelligent or better at coding, but that computer science is an incredibly broad field (like chemistry). There's simply so much to know that the LLM has an inherent advantage. It can be trained with huge amounts of generalized knowledge far faster than a human can learn.

Do you know every common programming language? The LLM does, plus it can code in FRACTRAN, Brainfuck, Binary lambda calculus, and a dozen other obscure languages.

It's very impressive, until you realize the LLM's knowledge is a mile wide and an inch deep. It has vast quantities of knowledge, but lacks depth. A human that specializes in a field is almost always going to outperform an LLM in that field, at least for the moment.

mumbisChungo 16 hours ago [-]
It's impressive until you realize its limitations.

Then it becomes impressive again once you understand how to productively use it as a tool, given its limitations.

X6S1x6Okd1st 5 hours ago [-]
Also, those limitations keep dropping every six months
logifail 14 hours ago [-]
> Do you know every common programming language?

A long time ago my OH was introduced to someone who claimed "to speak seven languages fluently".

Her response at the time was "Do they have anything interesting to say in any of them?"

dandellion 10 hours ago [-]
As a foreign English speaker, it's a huge pet peeve when people use acronyms without having used the full sentence before. Especially when the acronym is already a word or expression and looking it up just returns a bunch of useless examples (oh!). Eventually I'll find out the meaning (other half), and it always turns out they only saved a total of six or seven letters, which can be typed in less than 0.5 seconds, but in exchange they made their sentence more or less incomprehensible for a large group of people.
dylan604 7 hours ago [-]
As a native English speaker, I had no idea what OH was either. I’ve seen SO for significant other and not stack overflow, and I’ve seen reference to better half not just other half. By that choice, I am left to assume this person feels they are the better half which says a lot about them.
djtango 4 hours ago [-]
As a native speaker, you probably scratched your head and worked out what could fit in that gap and eventually worked it out. Then you'll grumble because the other speaker didn't choose your preferred diction.

As a non native speaker you'll probably just feel upset/hopeless/angry.

From my experience, "non-native" here includes people who are "fluent".

So we arrive at the situation where my OH-SO beloved wife is fluent in English and is definitely better than me at writing clearly constructed English essays but when it comes to usage of random idioms/slang or understanding local (and foreign!) English accents I have a very clear advantage.

dylan604 4 hours ago [-]
actually, no, other half never popped into my head. i only got it from seeing other comments in the thread of people confused by it as well.
daveguy 7 hours ago [-]
> By that choice, I am left to assume this person feels they are the better half which says a lot about them.

What a ridiculous assumption.

Maybe they consider themselves and their partner to be equal halves of a whole. You know, the definition of half.

glenneroo 9 hours ago [-]
OTOH we are one of today's "lucky" 10,000? And future searches will possibly lead to this post, further reducing friction to using this acronym. Also newly trained LLMs will also be able to answer quicker. Yay?

I wonder how acronyms such as OTOH even become so well known that they can be used without fear of not being understood? When is that threshold reached? Is using OH now the beginning of a new well-known acronym? I guess only time will tell...

theelous3 8 hours ago [-]
the far more common and acceptable-to-use-without-introduction acronym for this is SO (significant other)

And to answer the question - the threshold is when people stop complaining about the use :)

catigula 8 hours ago [-]
I've literally never seen "OTOH" in my life. Anyhow, if you really feel your sentence can't do without it you can say "conversely" which is pretty short and clear.
mitb6 5 hours ago [-]
OTOH dates back to the 90s and has since remained very common in internet writing. It is more surprising that you've never seen it than that someone used it.

It also isn't an exact synonym of "conversely".

catigula 5 hours ago [-]
There aren't any exact synonyms in English.

I've been an extensive internet user for decades and I don't have it in memory, so I'm not sure how to feel about your assertion. I'm not the only person saying this.

andruby 4 hours ago [-]
> There aren't any exact synonyms in English.

I'm sure that depends on the tolerance. "assist" and "help"? "dog" and "canine"? "purchase" and "buy"?

catigula 4 hours ago [-]
I'm not just being pedantic, it's a fairly mainstream assertion in linguistics. I don't find those words synonymous. They have different performative content. I don't know if this applies to other languages.
dylan604 7 hours ago [-]
We are not in a text chat using T9 on a numeric keypad where typing is painful. There’s no need for acronyms now except as an attempt not to look old, or out of plain laziness. We’re also not limited to 140 chars, so no advantage there either.
Shadowmist 7 hours ago [-]
Paste the comment into an LLM and ask it what it means. Don’t use Google.
arcanemachiner 12 hours ago [-]
> OH

Other half? I've never seen this acronym before.

Upvoter33 8 hours ago [-]
sounds snarky and defensive, tbh
timschmidt 16 hours ago [-]
> Do you know every common programming language? The LLM does, plus it can code in FRACTRAN, Brainfuck, Binary lambda calculus, and a dozen other obscure languages.

Not only this, but they're surprisingly talented at reading compiled binaries in a dozen different machine codes and bytecodes. I have seen one one-shot an applet rewrite from compiled Java bytecode to modern JavaScript.

catigula 7 hours ago [-]
And herein lies the fundamental power of the LLM and why it can even solve "impressive" problems: it is able to navigate a space that humans can't trivially - massive amounts of information and ability to parse through walls of simple logic/text.

LLMs are at their best when the context capacity of the human is stretched and the task doesn't really take any reasoning but requires an extraction of some basic, common pattern.

dylan604 7 hours ago [-]
> it is able to navigate a space that humans can't trivially - massive amounts of information and ability to parse through walls of simple logic/text.

That’s the very reason we built computers. If an LLM did not also meet this definition, there would be no point in it existing.

catigula 7 hours ago [-]
You're not the first person to suggest that LLMs have no reason to exist.
anthk 14 hours ago [-]
Binwalk, Unicorn... as if that was advanced wizardry. Unix systems have had file(1) since forever and binutils from and to every arch.
Energiekomin 9 hours ago [-]
Yes it is and you compare apples with pineapples.

file can't program in brainfuck while doing basic binary analysis.

Binwalk and Unicorn can't do that either. And they can't write to you in multiple natural languages either.

yMEyUyNE1 11 hours ago [-]
> There's simply so much to know that the LLM has an inherent advantage.

But do they understand it? I mean, a child may use swear words, but does it understand what they mean? In another comment, somebody's OH also mentioned artistic abilities and the utility of the words spoken.

ben_w 2 hours ago [-]
Does a submarine swim?

It doesn't matter to my employment prospects if the AI "understands" or "thinks", whatever is meant by that, but rather if potential employers reckon it's good enough to not bother employing me.

esafak 17 hours ago [-]
But the LLM can already connect things that you can not, by virtue of its breadth. Some may disagree, but I think it will soon go deeper too.
anthk 14 hours ago [-]
So impressive that every complex SUBLEQ code I've tried with an LLM failed really fast.
6LLvveMx2koXfwn 16 hours ago [-]
Received 01 April 2024

Accepted 26 March 2025

Published 20 May 2025

Probably normal but shows the built in obsolescence of the peer review journal article model in such a fast moving field.

eesmith 14 hours ago [-]
How so?

To me it looks like the paper was submitted last year but the peer reviewers identified issues with the paper which required revision before the final acceptance in March.

We can see the paper was updated since the 1 April 2024 version as it includes o1-preview (released September 2024, I believe), and GPT‑3.5 Turbo from August. I think a couple of other tested versions also post-date 1 April.

Thus, one possible criticism might have been (and I stress that I am making this up) that the original paper evaluated only 3 systems, and didn't reflect the full diversity of available tools.

In any case, the main point of the paper was not the specific results of AI models available by the end of last year, but the development of a benchmark which can be used to evaluate models in general.

How has that work been made obsolete?

bufferoverflow 13 hours ago [-]
How so? All the models they've tested are obsolete, multiple generations behind high-end versions.

(Though even these obsolete models did better than the best humans and domain experts).

eesmith 13 hours ago [-]
As I wrote, the main point of the paper was not the specific model evaluation, but the development of a benchmark which can be used to test new models.

Good benchmark development is hard work. The paper goes into the details of how it was carried out.

Now that the benchmark is available, you or anyone else could use it to evaluate the current high-end versions, and measure how the performance has changed over time.

You could also use their paper to help understand how to develop a new benchmark, perhaps to overcome some limitations in the benchmark.

That benchmark and the contents of that paper are not obsolete until there is a better benchmark and description of how to build benchmarks.

rotis 11 hours ago [-]
Yes, this paper and many others will be forgotten as soon as they leave the front page. Afterwards no one refers to articles like these here. People just talk about anecdotes and personal experiences. Not that I think this is bad.
Jimmc414 15 hours ago [-]
shows the value of preprint servers like arxiv.org and chemrxiv.org
pu_pe 13 hours ago [-]
Nice benchmark but the human comparison is a little lacking. They claim to have surveyed 19 experts, though the vast majority of them have only a master's degree. This would be akin to comparing LLM programming expertise to a sample of programmers with less than 5 years of experience.

I'm also not sure it's a fair comparison to average human results like that. If you quiz physicians on a broad variety of topics, you shouldn't expect cardiologists to know that much about neurology and vice-versa. This is what they did here, it seems.

KSteffensen 12 hours ago [-]
I'll get some downvotes for this but PhD vs master's degree difference is mostly work experience, an element of workload hazing and snobbery.

Somebody with a masters degree and 5 years of work experience will likely know more than a freshly graduated PhD

698969 10 hours ago [-]
I think the breadth vs depth thing applies here as well, the PhD will know more about the topic they're researching of course.
eesmith 12 hours ago [-]
Sure, but all we know is that these "13 have a master’s degree (and are currently enroled in Ph.D. studies)". We only know they have at least "2 years of experience in chemistry after their first university-level course in chemistry."

How does that qualify them as "domain experts"? What domain is their expertise? All of chemistry?

marcodiego 6 hours ago [-]
> [..] models are [...] limited in [...] ability to answer knowledge-intensive questions [...], they did not memorize the relevant facts. [...] This is probably because the required knowledge cannot easily be accessed via papers [...] but rather by lookup in specialized databases [...], which the humans [...] used to answer such questions [...]. This indicates that there is [...] room for improving [...] by training [...] on more specialized data sources or integrating them with specialized databases.

> [...] our analysis shows [...] performance of models is correlated with [...] size [...]. This [...] also indicates that chemical LLMs could, [...], be further improved by scaling them up.

Does that mean the world of chemists will be eaten by LLMs? Will LLMs just improve chemists' output or productivity? I'd be scared if this happened in my area of work.

X6S1x6Okd1st 5 hours ago [-]
It's increasingly looking like if you're young enough most knowledge work will be eaten by LLMs (or the thing that comes next) within your lifetime.

Hopefully we'll see humans assisted by AI & induced demand for a good while, but the idea that people work unassisted in knowledge work is gonna go the way of artisan clothing

hooverd 4 hours ago [-]
so much for those birth rates
gavinray 9 hours ago [-]
I asked several LLMs after jailbreaking with prompts to provide viable synthesis routes for various psychoactive substances and they did a remarkable job.

This was neat to see but also raised some eyebrows from me. A clever kid with some pharmacology knowledge and basic organic chemistry understanding could get up to no good.

Especially since you can ask the model to use commonly available reagents + precursors and for synthesis routes that use the least amount of equipment and glassware.

dylan604 7 hours ago [-]
My limited bit of knowledge of both chemistry and LLMs would tell me that subtly incorrect chemistry can have disastrous effects, and since being subtly incorrect is an LLM superpower, this is precisely the inevitable outcome.
Workaccount2 5 hours ago [-]
You need a decent amount of experience to make psychoactive substances. Chemistry is one of those things that looks like you just follow the steps, but in practice requires a ton of intuition and "feeling it". You can see this if you watch NileRed on youtube, he is a pretty experienced chemist, and even then still flops all the time trying to replicate reactions right out of the book.

Besides, the books PiHKAL and TiHKAL lay out how to make most psychoactive substances, and those books have been online for free for decades now.[1][2] Maybe there are easier routes and easier to acquire precursor recipes, but I doubt those would be hard to find. The hardest part by far is the chemistry intuition.

[1]https://erowid.org/library/books_online/pihkal/pihkal.shtml [2]https://erowid.org/library/books_online/tihkal/tihkal.shtml

gavinray 3 hours ago [-]
TiHKAL and PiHKAL are full of syntheses that require equipment and reagents far beyond what a hobbyist would be able to source.

There are various "one-pot" techniques for certain compounds if one is sufficiently clever.

For example, a certain cathinone can be produced by combining ephedrine/pseudoephedrine with a household product that reduces secondary alcohols to ketones and letting it sit.

refurb 3 hours ago [-]
Which LLMs?

I’m a chemist and I asked it to show me the structure for a common molecule and it kept getting it really wrong

sgt101 11 hours ago [-]
Also, books, books are really good for finding knowledge!

Seriously, viewing LLMs as a cultural technology casts them as a super interactive indexing system. I find that's a useful lens to use to understand this kind of study.

AvAn12 8 hours ago [-]
How much of this is because Scale AI and others have had human “taskers” create huge amounts of domain-specific content for OpenAI and other foundation model providers?
fuzzfactor 2 days ago [-]
Nothing to see here unless you have some kind of unsatisfied interest in the future of AI :\

This is all highly academic, and I'm highly industrial so take this with a grain of salt. Sodium salt or otherwise, your choice ;)

If you want things to be accomplished at the bench, you want any simulation to be made by those who have not been away from the bench for that many decades :)

Same thing with the industrial environment, some people have just been away from it for too long regardless of how much familiarity they once had. You need to brush up, sometimes the same plant is like a whole different world if you haven't been back in a while.

mistrial9 19 hours ago [-]
BASF Group - will they speak in public? probably not, given what is at stake IMHO