Two Plus Two

Nov 13, 2024

"On reading," by Simon Wain-Hobson, is a weekly discussion of scientific papers and news articles around gain of function research in virology.


4 Comments

Tommy Cleary

Nov 13, 2024

The Bioinformatics and Institutional limits of the questions you ask are important…but the mistakes which result in LAI are often not only in Virology but also in the data science linked to the Error Detection Paths and Patterns.

Dear NLM Officer,

Re: SARS-CoV-2 reference sequence suppression and other COVID Origin research GenBank data omissions

I am writing to learn how the SARS-CoV-2 reference sequence came to be suppressed in May 2023.

<<May 6, 2023 02:21 PM; suppressed; This record was removed by RefSeq staff. Please contact info@ncbi.nlm.nih.gov for further details.>>

https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=girevhist

Also, I would like to know more about the deleted WH01 sequence from Lili Ren, discussed in the article below.

https://www.science.org/content/article/first-sars-cov-2-genome-deposited-us-database-earlier-than-previously-known

<<Verifying the quality of sequencing data submissions is necessary to maintain the integrity of such databases managed by NIH and ensures that users have access to trusted and reliable data.>>

<<Thus, the sequence was never made publicly available on GenBank.

In the interim, another submission to GenBank from a different submitter was received and published on January 12, 2020.

That submission published on January 12 provided the genetic sequence for SARS-CoV-2.

The sequence published on January 12, 2020, was nearly identical to the sequence that was submitted by Lili Ren.>>

And from the GenBank correspondence here:

https://d1dth6e84htgma.cloudfront.net/Ford_H2_316_20240111_152518_05f9837537.pdf

Was a GI number given to this sequence?

The GenBank correspondence states:

<<AUTODELETED group 7385146

Groups deleted (use these grids to undelete if necessary):>>

Can a version of this submission from group 7385146 please be undeleted and made available, so that a more exact comparison to the SARS-CoV-2 reference sequence can be calculated? <<nearly identical>> is not very precise.

Thank you.
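As a rough illustration of what a more exact comparison than <<nearly identical>> could look like, the sketch below computes percent identity and lists the differing positions for two pre-aligned sequences. The two strings are made-up toy data, not the WH01 submission or the reference sequence, and the function name `percent_identity` is hypothetical. It assumes the sequences have already been aligned to equal length by some other tool.

```python
def percent_identity(a: str, b: str) -> tuple[float, list[int]]:
    """Return (percent identity, 1-based differing positions) for two
    sequences that are ALREADY aligned to the same length."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    a, b = a.upper(), b.upper()
    # A column counts as a difference when the two characters disagree
    # (a gap '-' against a base also counts as a difference).
    diffs = [i + 1 for i, (x, y) in enumerate(zip(a, b)) if x != y]
    identity = 100.0 * (len(a) - len(diffs)) / len(a)
    return identity, diffs


# Toy example only: 20 made-up bases with a single substitution at the end.
toy_reference = "ACGTACGTACGTACGTACGT"
toy_submission = "ACGTACGTACGTACGTACGA"

identity, diff_positions = percent_identity(toy_reference, toy_submission)
print(f"{identity:.2f}% identical; differing positions: {diff_positions}")
```

Reporting the exact positions alongside the percentage is the point: two genomes can both be described as nearly identical while differing at different sites.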

Are any of the submissions from Lili Ren WH01 series available on this database archive perhaps?

https://web.archive.org/web/20200222054741/https:/bigd.big.ac.cn/ncov/genome/

Further, given that

<<Verifying the quality of sequencing data submissions is necessary to maintain the integrity of such databases managed by NIH and ensures that users have access to trusted and reliable data.>>

Can the GenBank indexers please provide some more information about BioProject: 1097963?

Now it is not public:

<<The following ID is not public in BioProject: 1097963

ID:1097963>>

https://web.archive.org/web/20240705052322/https://www.ncbi.nlm.nih.gov/bioproject/?term=cybersecurity

But previously it was public…and available when searching for cybersecurity resources of GenBank:

https://web.archive.org/web/20240807055019/https://www.ncbi.nlm.nih.gov/bioproject/?term=Cybersecurity

This BioProject included a link to an IT education company…has this training now been completed by the GenBank indexers?

https://maetechacademy.edu.my/course.html

Given the stringent quality control NIH conducts, how is it that this helpful BioProject is no longer available to the public?

Was it perhaps Spam or some other failure of the NIH’s high standards of data integrity and/or cybersecurity?

Can the NLM GenBank resources please provide more streamlined support services and resources for reporting suspect submissions that have made it past the NIH’s strict data integrity processes?

Data contamination is an important issue to avoid in GenBank submissions.

But JAMOGK000000000.1 Pseudomonas aeruginosa was removed by GenBank even though this contamination was found in related submissions with AI/ML tools:

<<This record was removed because the sequence was determined to be contaminated. Please contact info@ncbi.nlm.nih.gov for further details>>

But under GenBank’s stringent data integrity rules, has there been no attempt to identify the actual contaminant here in the JAMOGK01 series of data?

https://www.ncbi.nlm.nih.gov/Traces/wgs/JAMOGK01?display=contigs

In previous contaminated submissions it was obvious where the PLA had contaminated the records with data relevant to COVID Origin:

For example NY5541 urine sample from 2019

https://www.ncbi.nlm.nih.gov/Traces/wgs/JAMOHC01?display=contigs&page=1&state=dead

And… NY5537 collected 2019 sputum sample submitted by Zhou,D:

Submitted (19-MAY-2022) State Key Laboratory of Pathogen and

Biosecurity, Beijing Institute of Microbiology and Epidemiology,

No. 20, Dongdajie, Fengtai, Beijing, Beijing 100071, China

https://www.ncbi.nlm.nih.gov/nuccore/JAMOGK010000088.1?report=GenBank

See <<Breaking: SARS-CoV-2 Spike found in bacteria samples taken from China, 2019

January 20, 2023 >> Adeno News article

https://web.archive.org/web/20230122173319/https://adeno-news.com/2023/01/20/breaking-sars-cov-2-spike-found-in-bacteria-samples-taken-from-china-2019/

So too with JAMOGK01, can you please identify the contaminant and make this contamination public BEFORE you comply with requests from the PLA to remove data…

If you can flag the contaminant codes now, when you are aware of what they are, this would be helpful.

Thank you.

Finally;

I am interested in COVID origin research.

Prof Zhengli Shi of WIV was quite clear that I should examine all available data on GenBank before asking for access to WIV’s virus databases.

See previous correspondence on this issue:

<<Case CAS-1324284-Y8N8F1 - National Library of Medicine Customer Service confirmation TRACKING:000435001291518>>

This is quite challenging as some of the data in GenBank are hidden from view.

I would like to be able to see that data, and I would like to know how it became suppressed.

For example: the pre-print <<Spread and Geographic Structure of SARS-related Coronaviruses in Bats and the Origin of Human SARS Coronavirus; Yu2018unpublished>>

and the data set and correspondence with GenBank from the submitting authors are important to ongoing COVID Origin research.

Now the title of the pre-print gives zero results…due to suppression.

No items found.

On August 09, 2022 this paper gave 163 search results in GenBank as seen in this archive.

https://web.archive.org/web/20220809085043/https:/www.ncbi.nlm.nih.gov/nuccore/?term=Spread+and+Geographic+Structure+of+SARS-related+Coronaviruses+in+++++++++++++Bats+and+the+Origin+of+Human+SARS+Coronavirus

By basic bioinformatics analysis we can see the recoverable series of nucleotide & protein submissions by the authors

<<Yu,P., Hu,B., Li,B., Luo,D., Zhu,G., Zhang,L., Holmes,E.C., Shi,Z. and Cui,J.>> extends from the suppressed record:-

<<GI 1769824624>> Record suppressed: spike protein [Bat SARS-like coronavirus] - Protein - NCBI (nlm.nih.gov)

And

<<GI 1769824623>> Record suppressed: Bat SARS-like coronavirus strain Rs161465_Guangdong spike protein (S) - Nucleotide - NCBI (nlm.nih.gov)

To:-

<<GI 1769824592>> where it is interrupted by an unrelated sequence placed on 25-OCT-2018 that is not suppressed:

Salmonella enterica subsp. enterica serovar Infantis strain FSIS170230 - Nucleotide - NCBI (nlm.nih.gov)

Then to <<GI 1769824316>> Record suppressed: Bat SARS-like coronavirus strain Rs5725_Yunnan ORF8 gene, complete cds

Followed again by the unrelated <<GI 1769824315>>

Also placed on 25-OCT-2019: Salmonella enterica subsp. enterica serovar Infantis strain FSIS170230
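The GI range walked through above can be counted directly. The sketch below is only an illustration: it takes the endpoints and the two unrelated Salmonella GI numbers exactly as quoted in this letter, and it assumes (this is an assumption, not something confirmed by GenBank) that every other GI number between those endpoints belongs to the suppressed series.

```python
# Endpoints and interruptions as quoted in the letter.
first_gi = 1769824315  # lowest GI mentioned (unrelated Salmonella record)
last_gi = 1769824624   # highest GI mentioned (suppressed spike protein record)
unrelated_gis = {1769824592, 1769824315}  # the two Salmonella records

# Inclusive count of GI numbers between the two endpoints.
span = last_gi - first_gi + 1

# ASSUMPTION: everything in the range that is not one of the two
# unrelated records is part of the suppressed series.
suppressed_candidates = [
    gi for gi in range(first_gi, last_gi + 1) if gi not in unrelated_gis
]

print(f"GI numbers in range: {span}")
print(f"Candidates after removing unrelated records: {len(suppressed_candidates)}")
```

Under that assumption the range holds 310 GI numbers, of which 308 remain once the two unrelated records are set aside.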

This series of 308 suppressed GI GenBank submissions is extensive, but it is not yet the full set. The 163 nucleotide results should correspond to 326 GI numbers (one nucleotide and one protein record each), and only 308 have been recovered, so by my calculations 18 GI submissions are still missing.

This means that of the at least 163 nucleotide and protein sequence pairs submitted for this preprint and placed in GenBank, only 154 can be recovered for analysis by examining the series of GI numbers at this stage.

How many original GI data points were placed for this preprint?

Also, where are the final nine nucleotide and protein sequence pairs that were searchable on August 09, 2022?

Does that mean at least eighteen GI numbers are missing and, due to GenBank suppression, cannot be found?
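The accounting in the paragraphs above can be checked with a few lines of arithmetic. This is a minimal sketch using only the figures stated in this letter; the one assumption is that each nucleotide search result corresponds to exactly two GI numbers (one nucleotide record and one protein record).

```python
nucleotide_results = 163  # search results archived on August 09, 2022
gi_per_pair = 2           # ASSUMPTION: one nucleotide GI + one protein GI per result
recovered_gi = 308        # suppressed GI numbers recovered from the series

expected_gi = nucleotide_results * gi_per_pair  # GI numbers that should exist
missing_gi = expected_gi - recovered_gi         # GI numbers not yet located
recovered_pairs = recovered_gi // gi_per_pair   # complete pairs recovered
missing_pairs = nucleotide_results - recovered_pairs

print(f"expected GI numbers: {expected_gi}")
print(f"missing GI numbers:  {missing_gi}")
print(f"recovered pairs:     {recovered_pairs}")
print(f"missing pairs:       {missing_pairs}")
```

The results agree with the figures in the letter: 326 expected, 18 missing, 154 pairs recovered, and 9 pairs outstanding.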

Perhaps there is another way to recover these missing files?

Was there an earlier GI number series that was placed when the preprint was originally submitted?

When was this series linked to <<Yu2018unpublished>> originally submitted?

Given that cybersecurity concerns were highlighted by Prof ZLShi as a reason for limiting access to WIV’s extensive bat virus databases, can you please reassure me that the missing data from <<Yu2018unpublished>> is safe, and send links to the remaining suppressed and missing sequence submissions?

Thank you for your assistance.

Kind regards

Mr Tommy Cleary

Postgrad Student UNDA.


Tommy Cleary

Nov 13, 2024

3/-

Part of the solution is here in the Synthetic Markers of tail codes Cyphers in Baric Lab products…

https://northernvirginiamag.com/culture/culture-features/2022/10/14/cia-kryptos-sculpture-cipher/

The main error of Virology and associated Science is not being direct enough to finish the Math and calculate not only risk of LAI in an age of Synthetics…but also to extrapolate this to determine the Extinction level risk of LAI.


Tommy Cleary

Nov 13, 2024

2/2

Citations to come.

I have to post and edit and add links?

Expand on what David Relman, Ralph Baric and Tom Inglesby seem to want to say…as well as what Eddie Holmes et al refuse to say…

Given our one home here how can the extinction level risks of GOF be simply ignored?

The Theory of Mind of Biological Weapons is explored here by Baric

https://www.jcvi.org/sites/default/files/assets/projects/synthetic-genomics-options-for-governance/Baric-Synthetic-Viral-Genomics.pdf


Tommy Cleary

Nov 13, 2024 (Edited)

Dear Simon,

Teaching and learning are entwined…

<<Aside 3

It is assumed that the reader knows something of the GOF controversy in virology. To ensure the essays remain short, they are best read as a series. Are the essays too dense or difficult to absorb? Comments please. Suggestions for an article around which a future essay could be crafted would be welcome.>>

Your essays are straightforward and easy to understand.

The reason you leave certain topics alone is not.

Dual Use Research of Concern is implicitly involved with Science and the arguments for and against GOF but also deeply and irreparably linked to the Dark Side…Biological Weapons and the extinction level danger of synthetics.
