On snotcasting

snot·cast | \ ˈsnät-ˌkast \ | snotcast; snotcasting

intransitive verb; Definition: to publicly interpret nasal swab data for other people's later consumption (see PODCAST; SNOT, sense 1)

So, I’ve been watching the NYC Variant report for a long time (since March 2021) and I’d like to share what I’ve been doing.

I’ll be extensively using the strain aliases released by the WHO if they exist for the given strain. I’ll use the Pangolin (Phylogenetic Assignment of Named Global Outbreak Lineages) name otherwise.

Strain alias | Corresponding Pango Lineage | Media Nickname
Beta | B.1.351 | “South Africa”

Also, there are two organizations I refer to over and over.

  • PRL: Pandemic Response Lab. A lab started by the city, located at the Alexandria Center for Life Science in Manhattan (Start of lab press release) (Website)
  • GISAID: A worldwide initiative, started in 2008, to share virus data. (Website)

I use the word “sparkline” a lot. It means “little graph in a cell that doesn’t have axes”.

The Data from NYC

February 22, 2021 was the first “New York City COVID-19 Cases Caused by SARS-CoV-2 Variants Report” and it only mentions the Alpha strain. Same with the week after. March 10th has three strains: Alpha, Beta, and Iota. This is the birth of the white table that ballooned from these three rows and columns to a hulking monstrosity in June. You can see this below. The reports contain sequencing numbers from both PRL and GISAID.

Swipe from left to right to compare the two screenshots
The left is from March 10, the right is from June 1
Rows are weeks, columns are strains

The first one I looked at personally was the week after this one, where it mentions GISAID numbers for Alpha, Beta, Eta, Iota, and Zeta, but no Gammas yet. I was very interested in tracking these and other virulent strains because they would provide an easy-to-understand correction to the risk-level messaging as vaccines started to get out through the start of 2021. Politicians were making statements about the effectiveness of the vaccines, and people were starting to question the need for the protocols that had brought this pandemic down to a dull roar. But the table of PRL data was hard to read.

The Sheet: Conditional format and sparklines for clarity

The Sheet isn’t the main offering I have for the public and has lost some of its punch now that the data is stale (last updated June 10th; more on that below), but it still exists and is where I do some of the public calculations I haven’t yet moved to GitHub. It is located at the short link and you can make a copy to see how all of it works. I tried to make it super easy to just poke in the next week of data without having to adjust formulas for the new row.

Spreadsheet slickness sidebar (skip if you don’t Spreadsheet): See that black row in the first picture? All of the calculations and graphing include it (and ignore it) in their ranges. When you add a row in the middle of a range used by an existing formula, the range expands to include the new row while keeping the same endpoints. This is useful when you are adding rows to the end of something with a subtotal/filter/formula that you don’t want to have to update weekly.

I’m pretty proud of column AI (above right) with the scaled stacked bar in orange and brown. The maximum is set by the row with the most tests (as can be seen above left), so all of the bars are to scale with each other. The variant reports themselves never totaled the variant percentages to quickly communicate the amount of general variant spread, which I guess was advisable considering some strains may not be super virulent, but they also never disclosed the number of “other” strains. I only totaled them myself because Google Sheets sparklines allow only alternating colors, and that wouldn’t communicate much for what eventually grew to 14 strains. I’ll leave the true reason for that up to your judgement, but there was quite an interesting data release that suddenly brought several strains from the “other” category onto the report. More on that now.
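The column-AI scaling idea can be sketched outside Sheets. Here is a minimal Python version with made-up numbers (not the real report data): every week's bar is scaled against the busiest week, so bar lengths stay comparable across rows.

```python
# Sketch of the column-AI trick: scale every week's stacked bar against the
# week with the most sequenced tests. All numbers here are hypothetical.
def scaled_bar(total, assigned, max_total, width=40):
    """Text stacked bar: '#' = named-variant tests, '.' = everything else."""
    bar_len = round(width * total / max_total)   # whole bar scales to busiest week
    var_len = round(bar_len * assigned / total)  # variant share within the bar
    return "#" * var_len + "." * (bar_len - var_len)

weeks = [
    ("Apr 19-25", 955, 520),    # (label, total sequenced, variant-assigned)
    ("Apr 26-May 2", 700, 410),
    ("May 3-9", 480, 300),
]
max_total = max(total for _, total, _ in weeks)
for label, total, assigned in weeks:
    print(f"{label:>13} |{scaled_bar(total, assigned, max_total)}")
```

The busiest week gets the full width; a week with half the tests gets a bar half as long, which is exactly the comparison the city's per-week percentages alone don't give you.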

Sunday, May 16, 2021 Data release

The variant reports were typically released on Wednesdays, or maybe a day later here or there. (Sidebar: if anyone knows the release dates of the PDFs over time, let me know; my info is incomplete.) This one came out days late, on a weekend. I’m not exactly certain why the date of May 11th is associated with this report, but that’s the date on the PDF. Perhaps it was the scheduled release date.

Several strains were added to this report, but the thing that stands out most to me is the addition of B.1.526.1 (a Variant of Interest per the CDC, not the WHO) dating all the way back to March 15–21: six weeks of data that wasn’t shared with us. Of considerable note, knowing now what we didn’t then: Delta (B.1.617.2) was lurking unshared in the data behind the report from the week before, and that one seems to matter far more. That Variant of Interest shows me that the city will obscure data from us if it decides the data isn’t relevant. We need transparency in case this happens with a strain that is really virulent (like Delta) and we aren’t told for a long time (as with B.1.526.1).

May 4 report (left) compared with May 11 (right).
Note how “955” total is present in both rows for the week of April 19–25.
Note the retroactive appearance of strains. Delta is B.1.617.2

Best/worst part? I found a stinking typo in the PDF. Because I conditionally format (data represented on a color spectrum) on The Sheet and remember trends, it wasn’t hard to notice that the “34 (3.6%)” for B.1.526/B.1.526.2 S:E484K+ was a lot lower than in prior weeks and was likely a typo. They silently corrected it (along with updating a few other numbers, it seems) by replacing the PDF on their website. I have both copies on my Dropbox.

A comparison of the original 102.6 kB PDF (left) and the updated 82.0 kB PDF (right)

Here’s what adding the new strains in this report did to the totals of variant and unassigned data for the weeks before.

On the left is before the May 11 report and on the right is after

All this leaves me with is an unsettled feeling when I think the city isn’t giving all of the information it has. Oh, and this was the next day:

June 10th overhaul

New Data Page on Variants/Strains

Our new Variants Data page tracks the testing and spread in NYC of variants of the virus that causes COVID-19. NYC is monitoring strains — types of variants that show meaningful differences in how they function — and other variants that have been identified and reported here. (The city’s new Variants page, after June 10)

I was more than disappointed that the format of the data had changed, but took solace in the new data representation I had worked out the week before; I will cover that in my last section. Here I would like to focus on how the data is represented by the city on this new page and why that might be. I will also show you what it might look like if it summarized the data differently.

The new city variants page table, as it looked on June 10th and the week afterward.

The table summarizes data from the last four weeks with equal weight. This seems like a bad idea because a contagion with a 4–14 day incubation period should be monitored week to week if that data is available (which it is). On top of that, there’s a data lag: the last day of sequenced tests comes from about 10 days before any data release. If the report is late, this lag increases.

This is the last data report that made it to my spreadsheet, from June 1. Note, from top to bottom: strain name; counts and percentages compared to the total, with percentages highlighted by a color scale; green sparklines indicating the trends from the last five weeks; orange bar sparklines showing trends for the entire monitoring period; and cumulative counts totaled from all weeks.

I found a news story from NBC New York from that day that reports the average and not the latest week (below). This can be very misleading, because an average will sit below the latest value of anything that is rising. We are expecting things to rise; that is why we are monitoring. For any steady rise from zero, I would expect the average to report about half of the actual latest value (think of the average speed of a constantly accelerating object). Reporting it this way doesn’t make sense at all.
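The understatement is easy to demonstrate with a toy series. This Python sketch uses made-up weekly numbers (not the city's data) for a strain whose prevalence roughly doubles each week, pooling four weeks the way the new table does:

```python
# Hypothetical weekly (variant count, total sequenced) pairs for a strain
# whose prevalence roughly doubles each week.
weeks = [(5, 500), (10, 500), (20, 500), (40, 500)]

# Latest-week prevalence vs. the four-week pooled figure the city reports.
latest_pct = 100 * weeks[-1][0] / weeks[-1][1]
pooled_pct = 100 * sum(c for c, _ in weeks) / sum(t for _, t in weeks)

print(f"latest week:      {latest_pct:.2f}%")
print(f"four-week pooled: {pooled_pct:.2f}%")
```

Here the pooled number is less than half of the latest week's, the same shape of gap as the real 4.9% vs. 8.33% discrepancy discussed below.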

Here are the views I think I can contribute with my abilities:

  1. Showing the effect of the average on the chart by making a version of the chart with a different average or no average. I call these Variant Data Table spoofs.
  2. Making stacked histograms of the data to make clear the trends over time that aren’t being reflected. That was my intention behind the strain-by-strain “last five weeks” sparklines on The Sheet (see image above).

The title of this news story reflects a rounding of the four-week average (4.9%) instead of the buried most recent data available at the time (the week ending 5/29; on the 6/10 release, it was 8.33%. An update to this data from 6/17, removing a single test result from Alpha, changes it to 8.42%.)

Variants Data Table Spoofs

I started by taking the data from variant-epi-data.csv and importing it into a fresh tab of The Sheet (link). I created a cell that controls the range of data that is totaled up from week to week. That cell is specifically linked; make a copy, change it, and the data changes. Finally, I used a color picker, a bar sparkline, and some resizing to match the design of the city display. I applied this formatting and took screenshots. (I’m not sure how to make the creation of these graphics live in GitHub instead of a spreadsheet, but this works for now and it matches the data.) I included an extra digit of precision, and you can see that this number rounds to the city number when the four-week window is active.
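The control-cell idea translates directly to code. Here is a Python sketch with stand-in rows (the real column names in variant-epi-data.csv may differ); changing `WINDOW_WEEKS` switches between reproducing the city display and the one-week spoof:

```python
import csv
import io

# Stand-in rows for variant-epi-data.csv; real column names may differ.
raw = """week_ending,variant,count,total_sequenced
2021-06-05,Delta,9,120
2021-06-05,Alpha,50,120
2021-05-29,Delta,8,96
2021-05-29,Alpha,40,96
"""

WINDOW_WEEKS = 1  # the "control cell": 4 mimics the city display, 1 is the spoof

rows = list(csv.DictReader(io.StringIO(raw)))
window = sorted({r["week_ending"] for r in rows}, reverse=True)[:WINDOW_WEEKS]
in_window = [r for r in rows if r["week_ending"] in window]

# Each week's total repeats on every row of that week, so dedupe by week first.
total = sum({r["week_ending"]: int(r["total_sequenced"]) for r in in_window}.values())
for variant in sorted({r["variant"] for r in in_window}):
    count = sum(int(r["count"]) for r in in_window if r["variant"] == variant)
    print(f"{variant}: {count} ({100 * count / total:.2f}%)")
```

With `WINDOW_WEEKS = 1` only the most recent week contributes to the denominator, which is exactly what the right-hand spoof screenshots below show.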

A few explanatory notes on these spoofs from June 10th:

  • The percentages of Eta and Zeta are not disclosed to us in variant-epi-data.csv, only on the display on that Variants page from the city.
  • ¯\_(ツ)_/¯ is a kaomoji, a series of Unicode characters taken out of context for their visual similarity to, often, a gesture. This one is a shrug. It is intended to point out that these two strains are treated specially: we only know their four-week averages.
  • “We didn’t see it during this time” is clearer than an empty bar.

June 10th spoof comparison

The left is made with the four week average and matches the city.
The right is made with the most recent week of data only.
Note: These percentages were accurate on the day of the 6/10 release.

June 17th spoof comparison

The left is made with the four week average and matches the city.
The right is made with the most recent week of data only.
Note: These percentages were accurate on the day of the 6/17 release.

June 24th spoof comparison

The left is made with the four week average and matches the city.
The right is made with the most recent week of data only.
Note: These percentages were accurate on the day of the 6/24 release.

Stacked Histograms with gnuplot on GitHub

I know how to use GitHub from a brief stint as a programmer, and I used gnuplot in college. I became frustrated with the limitations of the data visualization available to me in Sheets and tried doing this the week before the data change; on GitHub the visualization process is in full view, which evokes the transparency I wish for from the city. I chose the colors originally with the help of a color-picking website. gnuplot is deliberately uncapitalized and completely free, and there are guides to using it from many university science and math programs.

I copied the city data by “forking” the repository linked at the bottom of the red and orange table. This gives me my own copy of all of the data the city makes available, with the ability to add my own files and pull the upstream changes the city makes as new data comes in. I also copied the data to a spreadsheet.

I’m trying to do all of my calculations in the public eye so they can be trusted and so others can repeat my methods on other sets of data or perhaps correct my mistakes. By cloning my repository, you get my recipe and the data. All you need is to install gnuplot and run it on any .p file in the repository, from the folder the file is in so that it has access to the data. You can edit any .p file to change the settings, the colors, the style, the title. If you want to do this publicly, you can fork the city data that is upstream of me, or you can fork my repository itself. I would love collaborators!

This is the most relevant folder. You can take a look around; I tried to explain myself with the README files that automatically display as simple webpages below the file listings. I added my note on top of the note provided by the city so that both are still visible. (Note: “../” in filenames refers to the folder above.) I’m going to paste that readme here:

  • There’s a human-readable data file called ../variant-epi-data-readable.csv
    • It has the dates and total count sequenced (from ../cases-sequenced.csv)
    • I made the dates given use month abbreviations and I added two rows for the weeks that are yet to happen.
  • This folder called visualization/ contains my files
    • visualization/all-weeks-plotted.p plots all strains given in ../variant-epi-data-readable.csv
      • This is gnuplot code. How to install gnuplot.
      • It plots the strains in a stacked histogram.
      • visualization/all-weeks-plotted.png is the resulting image.
      • Download it to view it or scroll down.
    • visualization/four-weeks-plotted.p makes a graph that highlights the last four weeks of data that the city is just lumping together on their public display.
    • visualization/ignored-strains.p is a graph that uses the data from before they eliminated strains that are not very prevalent.
      • This graph will not be updated because there is no data to update it with.
      • Note that the top of this graph reaches 10%, which is not the same as other graphs you might see from me.
  • The folder below, visualization/spoofs, contains my simulations of what the city display would look like if it used a one-week window instead of a four-week window.

Images (June 24th data)

Changes in back data

I’ve noticed that the variant count data is changing with every update the city pushes out. I can’t think of a good reason for this but I’ll present my findings in the images below.

The three-color pictures show a red color for a decrease, a yellow color for no change, and a green color for an increase. The displayed values are those in the later week. The last row is white because it’s new data at that point and has no comparison.

The color scale pictures show a deeper red color for counts that went down more than others and a deeper green color for counts that went up more than others. No change or small changes tend towards white. The displayed values are the changes in the counts or percentages (using final minus initial). Percentages aren’t highlighted because a change in one count can tweak many denominators in a given week (row). The last week is omitted from this view.

These are all computed using a different tab of The Sheet mentioned above and you can check my work.
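The comparison itself boils down to subtracting one snapshot of the table from the next. A minimal Python sketch of the three-color scheme, using hypothetical counts for a single strain (final minus initial, as above):

```python
# Two snapshots of one strain's weekly counts (hypothetical numbers),
# diffed the same way as the three-color pictures.
june_10 = {"Jan 4-10": 3, "Jan 11-17": 5, "Jun 7-13": 12}
june_17 = {"Jan 4-10": 4, "Jan 11-17": 5, "Jun 7-13": 14}

def classify(delta):
    # red = decrease, yellow = no change, green = increase
    return "red" if delta < 0 else ("yellow" if delta == 0 else "green")

for week, before in june_10.items():
    delta = june_17[week] - before  # final minus initial
    print(f"{week}: {june_17[week]} ({delta:+d}, {classify(delta)})")
```

The color-scale pictures are the same diff, just mapped onto a gradient by the size of `delta` instead of its sign.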

June 10 to June 17 Comparison

You can see that there are only changes of ±2 at maximum for all weeks, but some of these weeks are far in the past.

June 17 to June 24 Comparison

Of particular note are the 500+ tests added to the early weeks of January, while recent weeks total less than one hundred tests each. Notably, they’re primarily of no noted strain, i.e. “Other”.

Questions I’m left with that need answering

I’m adding to this list over time.

  1. Should New York City be averaging its last four weeks of data like this to hide gaps in its own data, or is that bad practice?
  2. At what prevalence threshold are we allowed to see a strain in variant-epi-data.csv?
  3. Can we be sure that new strains will be added as they appear? There’s a precedent of retroactive disclosures (see the May 16 Data Release section above).
  4. Will this sequencing continue, or is the lowering sequencing rate a reflection of the end of this monitoring?
  5. Why are counts as far back as the first week of sequencing changing? As I write this specific sentence, it is June 24th, and counts have changed over the past two commits since June 10th. Is this going to continue? What does it mean?
