Thursday, May 5, 2016

Two Billion Compounds?

I've been cracking my skull against a peculiar problem this week:
How many unique compounds have ever been made?*

I'm referring to those produced by humankind, over the past 250 years - give or take a decade - of formal chemistry effort. CAS claims 100 million molecules in its collection and predicts, at the current rate of registration, another 650 million over the next 50 years.

Photo: Berries by the side of the road, 2016. Not counted in billions.

Certainly other databases exist, a well-curated larger example being ChemSpider (34 million), but I'm sure the Venn diagram for that against CAS overlaps quite a bit. Ditto PubChem, which according to ChemConnector had over 37 million structures in 2009, but lots of errors, duplicates, and isotopomers, to hear him tell it. Outside the med-chem arena, there are exciting new collections such as the Aspuru-Guzik lab's Clean Energy Project, which aims to identify photovoltaic materials. Surely the assembled collection of privately-held corporate data from all chemistry, pharma, biotech, and engineering firms must include another windfall: ~200 million compounds?

So, let's try a thought exercise - say we limit the set of what we call "made," or synthesized. We won't consider polymers, whether natural (DNA, polysaccharides) or artificial (Teflon, urethanes). As for screening collections, libraries, and combinatorial chemistry: unless someone produced >1 mg, I'm leaving it out. Metal complexes and salts are in, since most of the time inorganic and formulations colleagues still produce quantities you can hold and measure (and get a melting point on!).

Granted, by referring explicitly to the public and private chemistry databases, I'm not including dark reactions, those failed experiments or perhaps non-optimal yields that never make it to publication. Based on my lab career (and that of my hood-mates), I'd say there's a comfortable 5-10 molecules made for every 1 that gets reported somewhere. Of course, since many of those are literature preps or repeat reactions, I don't think it inflates the count that much; truly novel molecules tend to creep into papers and patents somehow.

Chemical space gurus, I apologize - I only want to count things that have been bottled, columned, purified, and analyzed. Large computational data sets of billions - unless they've been made and characterized - aren't up for consideration. Neither are metabolites isolated from plants or microbes; no fair counting what we relied on other organisms to make. S'posing this means we also leave out decomposition products and geological materials.

So them's the rules: 1 mg produced and characterized, non-polymeric, must have been made or produced with human hands. Salts and metals are in, along with isotopomers and stereoisomers.

What do readers and commenters think? My guess is in the title of this post.


*On the Twitter, Peter Kenny points out that I should, in truth, be asking after compounds, not molecules. Fair enough.
** Another reader points out that ZINC15, the database of "stuff you can buy now," only includes ~10M at present.


  1. It's always an interesting question to consider chemical space, and I also find it interesting to think about what counts as a 'real' chemical. You always need to define your question somehow, which immediately raises new problems.

    Fair enough to leave out natural products (if I understand right, you want a measure of synthetic effort). If the alkaloid you aren't counting comes off the column as a white crystalline HCl salt, will you count that?

    If you have a solution of a sugar derivative and on crystallisation you get one set of crystals of alpha molecules and another set of beta ones, are you counting it twice?

    The second question is why you should be careful about using chemical databases. How many glucoses are in ChemSpider? As well as 'alpha' and 'beta' there is also 'undefined', as well as the open chain ones etc. CAS has a similar problem for some substances, e.g. one record for a racemic mix and one for each of the enantiomers, even though in solution the molecules rapidly interconvert. Each has a different structure and a different name, but are they actually different compounds?

    What about isomeric substitution?

    "How many unique molecules have ever been made?" Do you mean 'molecule', or do you mean 'compound' or 'chemical substance'?

    Interesting question, which I can't help you answer so I'm quibbling instead!

  2. All completely fair questions. In order:
    Yes, I'd count the salt.
    Yes, we'll count fluxional centers, for instance in sugars.
    Agreed that DBs are inherently unreliable for stereo and mixtures.
    Isomers would be in.

    I guess I mean "chemical substance." Obviously, every time I make a mmol, it's billions of "molecules," huh?

    Thanks for the questions.

  3. I'm going to take the Fermi approach here, and summarily ignore all your rules (sorry).

    Start with the assumption that the number of chemists in the world has approximately tracked with the total world population for the past 250 years.

    Currently, there are ~50,000 chemists in the US. Let's assume 10% are actively making new chemical matter. That means roughly 0.00167% of the population (about 1 person in 60,000, taking a US population of ~300 million) is making new chemicals. If we extrapolate this backwards, we can assume that the same portion of the population was making new chemicals for the last 250 years (not a great assumption, but probably not off by THAT much).

    If we break the last 250 years into half centuries, and "integrate" over each of those 50-year blocks, we can get a decent idea of how many total people have been making new molecules over the past quarter millennium.

    Lastly, I picked an arbitrary number of new molecules a chemist might make in a given year. More than 10, probably less than 100. So let's take the geometric mean, 32.

    Running all the numbers gives you 333 million molecules.

    1. I love the population-based analysis. Very nice.

      That said, I contest the estimates of 10% and 32 cmpds per year. I suspect both are actually higher. Would you be comfortable with 600 million?

      (Either way, still below 1B)

    2. I am definitely willing to cede a factor of 2 in either parameter. What's a couple hundred million compounds between friends, anyway?
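
For anyone who wants to poke at the Fermi estimate in comment 3, here is a minimal sketch of the arithmetic. The 50,000 chemists, the 10% share, and the 10-to-100 compounds-per-year range come from the comment. Everything else is an assumption made for illustration: the ~300 million US population is only implied by the comment's numbers, the world population figures are rough round values, and the "integration" is a simple trapezoid rule over 50-year blocks from 1750 to 2000.

```python
import math

# Inputs taken from the comment.
US_CHEMISTS = 50_000
ACTIVE_SHARE = 0.10          # share assumed to be making new chemical matter

# Assumption: not stated in the comment, only implied by its arithmetic.
US_POPULATION = 300e6

# Assumption: rough world population at each half-century mark.
WORLD_POP = [
    (1750, 0.79e9),
    (1800, 0.98e9),
    (1850, 1.26e9),
    (1900, 1.65e9),
    (1950, 2.50e9),
    (2000, 6.10e9),
]


def active_fraction():
    """Fraction of the population actively making new compounds."""
    return US_CHEMISTS * ACTIVE_SHARE / US_POPULATION


def compounds_per_chemist_year(low=10, high=100):
    """Geometric mean of the low and high guesses (about 32)."""
    return math.sqrt(low * high)


def person_years(points):
    """Trapezoid-rule integral of population over time, in person-years."""
    total = 0.0
    for (y0, p0), (y1, p1) in zip(points, points[1:]):
        total += 0.5 * (p0 + p1) * (y1 - y0)
    return total


def fermi_estimate():
    """Total compounds = person-years x active fraction x compounds per year."""
    return person_years(WORLD_POP) * active_fraction() * compounds_per_chemist_year()


if __name__ == "__main__":
    print(f"Active fraction: {active_fraction():.3g}")
    print(f"Estimated compounds made: {fermi_estimate():.3g}")
```

With these assumed inputs the result lands in the hundreds of millions, the same order of magnitude as the 333 million quoted in the comment; the exact figure shifts with whichever population numbers you plug in. Doubling either `ACTIVE_SHARE` or the compounds-per-year guess doubles the total, which is the "factor of 2" being haggled over in the replies.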