intransitive verb; Definition: to publicly interpret nasal swab data for other people for later consumption (see PODCAST, SNOT definition 1 sense 1)
So, I’ve been watching the NYC Variant report for a long time (since March 2021) and I’d like to share what I’ve been doing.
I’ll be extensively using the strain aliases released by the WHO if they exist for the given strain. I’ll use the Pangolin (Phylogenetic Assignment of Named Global Outbreak Lineages) name otherwise.
|Strain alias||Corresponding Pango Lineage||Media Nickname|
Also, there are two organizations I refer to over and over.
- PRL: Pandemic Response Lab. It’s a lab started by the city located at the Alexandria Center for Life Science in Manhattan (Start of lab press release) (Website)
- GISAID: Started in 2008, worldwide initiative to share virus data. (Website)
I use the word “sparkline” a lot. It means “little graph in a cell that doesn’t have axes”.
The Data from NYC
February 22, 2021 was the first “New York City COVID-19 Cases Caused by SARS-CoV-2 Variants Report” and it only mentions the Alpha strain. Same with the week after. March 10th has three strains: Alpha, Beta, and Iota. This is the birth of the white table that ballooned from these three rows and columns to a hulking monstrosity in June. You can see this below. The reports contain sequencing numbers from both PRL and GISAID.
The first one I looked at personally was the week after this one, where it mentions GISAID numbers from Alpha, Beta, Eta, Iota, Zeta, but no Gammas yet. I was very interested in tracking this strain and other virulent strains because they would provide an easy-to-understand correction to the risk level messaging as vaccines started to get out through the start of 2021. Politicians were making statements about the effectiveness of the vaccines and people were starting to question the need for the protocols that have brought this pandemic down to a dull roar. But the table of PRL data was hard to read.
The Sheet: Conditional format and sparklines for clarity
The Sheet isn’t the main offering I have to the public and has lost some of its punch now that the data is stale (June 10th. More on that below.) but it still exists and is where I do some of the public calculations I haven’t yet moved to GitHub. It is located at the short link tinyurl.com/variants-nyc-sheet and you can make a copy to see how all of it works. I tried to make it so it was super easy to just poke in the next week of data without having to adjust formulas for the new row.
Spreadsheet slickness sidebar (skip if you don’t Spreadsheet): See that black row in the first picture? All of the calculations and graphing include it (and ignore it) in the ranges. When you add a row in the middle of a range in an existing equation, the range expands to include the new row with the same endpoints. Useful for when you are adding rows to the end of something that has a subtotal/filter/formula that you don’t want to have to update weekly.
I’m pretty proud of column AI (above right) with the scaled stacked bar in orange and brown. The maximum is set by the row with the most tests (as can be seen above left), so all of the bars are to scale with each other. The variant reports themselves never totaled the variant percentages to quickly communicate the amount of general variant spread, which I guess was advisable considering some strains may not be super virulent, but they also never disclosed the number of “other” strains. I only totaled them myself because Google Sheets sparklines only allow alternating colors and that wouldn’t communicate much for what eventually grew to 14 strains. I’ll leave the true reason for that up to your judgement, but there was quite an interesting data release that suddenly brought several strains from the “other” category onto the report. More on that now.
Sunday, May 16 2021 Data release
The variant reports were typically released on Wednesdays, or a day later here and there. (Sidebar: If anyone knows the release dates of the PDFs over time let me know, my info is incomplete.) This one came out days late, on a weekend. I’m not exactly certain why the date of May 11th is associated with this report, but that’s the date that is on the PDF. Perhaps it was the scheduled release date.
Several strains got added to this report, but the thing that stands out the most to me is the addition of B.1.526.1 (Variant of Interest, CDC not WHO) dating all the way back to March 15–21, six weeks of data that wasn’t shared with us. Of considerable note knowing now what we didn’t then, Delta (B.1.617.2) was lurking in the data unshared by the report from the week before and that one seems to matter way more. That variant of interest shows me that the city will obscure data from us if it decides it isn’t relevant. We need transparency in case this happens to a strain that is really virulent (like Delta) and we aren’t told for a long time (like B.1.526.1).
Best/worst part? I found a stinking typo in the PDF. Because I conditionally format (data represented on a color spectrum) on The Sheet and remember trends, it wasn’t hard to notice that the “34 (3.6%)” for B.1.526/B.1.526.2 S:E484K+ was a lot lower than it was in prior weeks and was likely a typo. They silently corrected it (along with updating a few other numbers, it seems) by replacing the PDF on their website. I have both copies on my Dropbox.
Here’s what adding the new strains in this report did to the total count of variants and unassigned data for the weeks before.
All this leaves me with is an unsettled feeling when I think the city isn’t giving all of the information it has. Oh, and this was the next day:
June 10th overhaul
I was more than disappointed that the format of the data had changed, but took solace in the new data representation that I had worked out the week before. I will cover that in my last section. Here I would like to focus on the way the data is represented by the city on this new page and why that might be. I will also show you what it may look like if it summarized the data differently.
The table summarizes data from the last four weeks with equal weight. This seems like a bad idea because a contagion with a 4–14 day incubation period should be monitored from week to week if that data is available (which it is). On top of that, there’s a data lag: The last day of sequenced tests comes from about 10 days before any data release. If the report is late, this lag increases.
I’ve found a news story from NBC New York from that day that reports on the average and not the latest week (below). This can be very dangerous because the average will be lower than the latest week for anything that is rising, and rising strains are exactly what we are monitoring for. For any steady rise that starts from zero, I would expect the average to report about half of the actual latest value (example: the average speed of a constantly accelerating object is half its final speed). Reporting only the average doesn’t make sense at all.
I see two views that I can contribute with my abilities:
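To put a number on that, here is a minimal Python sketch with made-up weekly percentages (not city data). It assumes a strain's share rises by the same amount every week, and it averages the weekly percentages directly, which is simpler than pooling test counts the way a four-week table would. The name `window_average` is hypothetical.

```python
def window_average(values, window=4):
    """Mean of the last `window` entries of a weekly series."""
    recent = values[-window:]
    return sum(recent) / len(recent)


# Hypothetical weekly share (%) of a strain rising steadily from zero.
weekly_share = [0.0, 2.0, 4.0, 6.0, 8.0]

latest = weekly_share[-1]                                   # newest week only
four_week = window_average(weekly_share, window=4)          # last four weeks
since_zero = window_average(weekly_share, window=len(weekly_share))

print(latest, four_week, since_zero)
```

With these made-up numbers the latest week is 8.0, the four-week average is 5.0, and the average all the way back to zero is 4.0. So the "about a half" figure holds when the window reaches back to the start of the rise. A four-week window on a strain that has been rising for longer understates by less, but it still understates.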
- Showing the effect of the average on the chart by making a version of the chart with a different average or no average. I call these Variant Data Table spoofs.
- Making stacked histograms of the data to make the trends over time clear that aren’t being reflected. That was my intention behind the strain by strain “last five week” sparklines on The Sheet (see image above).
Variants Data Table Spoofs
I started by taking the data from variant-epi-data.csv and importing it into a fresh tab of The Sheet (link). I created a cell that controls the range of data that is totaled up from week to week. That cell is specifically linked at tinyurl.com/choose-your-window. Make a copy, change the cell, and the data changes. Finally, I used a color picker, a bar sparkline, and some resizing to match the design of the city display. I applied this formatting and took screenshots. (I’m not sure how to make the creation of these graphics live in GitHub instead of in a spreadsheet but this works for now and it matches the data.) I included an extra digit of precision and you can see that this number rounds to the city number when the four-week window is active.
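The window idea can also be sketched outside a spreadsheet. The Python below is only an illustration under assumptions: the row layout (a `total` key plus one key per strain), the function name `window_percentages`, and all of the counts are hypothetical and are not the real columns or values of variant-epi-data.csv. It pools counts over the chosen number of weeks and then takes each strain's share.

```python
def window_percentages(rows, window):
    """Pool counts over the last `window` weeks; return each strain's share in percent."""
    recent = rows[-window:]
    total = sum(row["total"] for row in recent)
    if total == 0:
        raise ValueError("no sequenced tests in the chosen window")
    strains = [key for key in recent[0] if key != "total"]
    return {
        strain: round(100 * sum(row[strain] for row in recent) / total, 1)
        for strain in strains
    }


# Hypothetical weekly counts, oldest first (not the city's numbers).
rows = [
    {"total": 400, "Alpha": 160, "Delta": 4},
    {"total": 300, "Alpha": 105, "Delta": 9},
    {"total": 200, "Alpha": 60, "Delta": 16},
    {"total": 100, "Alpha": 25, "Delta": 21},
]

print(window_percentages(rows, 4))  # four-week window
print(window_percentages(rows, 1))  # latest week only
```

In this made-up example the rising strain shows as 5.0% over four weeks but 21.0% in the latest week, which is the kind of gap the one-week spoofs are meant to expose.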
A few explanatory notes on these spoofs from June 10th:
- The percentages of Eta and Zeta are not disclosed to us in variant-epi-data.csv, only on the display on that Variants page from the city.
- ¯\_(ツ)_/¯ is a kaomoji, a series of unicode characters taken out of context for their visual similarity to, often, a gesture. This one is a shrug. It is intended to point out that these two strains are treated specially: we only know the four-week average.
- “We didn’t see it during this time” is clearer than an empty bar.
June 10th spoof comparison
June 17th spoof comparison
June 24th spoof comparison
Stacked Histograms with gnuplot on GitHub
I know how to use GitHub from a brief stint as a programmer, and I used gnuplot in college. I became frustrated with the limitations of the data visualization I had available in Sheets and tried doing this the week before the data change. On GitHub the visualization process is in full view, which evokes the transparency I wish for from the city. I originally chose the colors with the help of coolors.co, a color picking website. gnuplot is deliberately uncapitalized and completely free, and there are guides to using it available from many university science and math programs.
I copied the city data by “forking” the repository linked at the bottom of the red and orange table. This gives me my own copy of all of the data the city makes available, with the ability to add my own files and pull the upstream changes that the city makes as new data comes in. I also copied the data to a spreadsheet.
I’m trying to do all of my calculations in the public eye so they can be trusted and so others can repeat my methods for other sets of data or perhaps correct my mistakes. By cloning my repository, you can have my recipe and the data. All you need to do is install gnuplot and run it on any .p file in the repository, from the folder that file is in, so that it has access to the data. You can edit any .p file to change the settings, the colors, the style, the title. If you want to do this publicly, you can fork the city data that is upstream of me or you can fork my repository itself. I would love collaborators!
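If you would rather script that step than type it by hand, here is a small Python sketch. It assumes gnuplot is installed and on your PATH, and it relies only on the standard `gnuplot <script>` invocation. The helper names `gnuplot_invocation` and `run_all` are hypothetical and are not part of the repository. The point it illustrates is running each script from its own folder so relative data paths inside the script resolve.

```python
import shutil
import subprocess
from pathlib import Path


def gnuplot_invocation(script):
    """Return (command, working_directory) for running one gnuplot script."""
    script = Path(script)
    # Run from the script's own folder so relative data paths resolve.
    return ["gnuplot", script.name], script.parent


def run_all(folder):
    """Run every .p file found directly inside `folder`."""
    if shutil.which("gnuplot") is None:
        raise RuntimeError("gnuplot is not installed or not on PATH")
    for script in sorted(Path(folder).glob("*.p")):
        command, cwd = gnuplot_invocation(script)
        subprocess.run(command, cwd=cwd, check=True)


# Example (not executed here): run_all("visualization")
```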
This is the most relevant folder. You can take a look around, I tried to explain myself with the README.md files that automatically display simple webpages below the files themselves. I added my note on top of the note provided by the city so that both are still visible. (Note: “../” refers to the folder above in filenames). I’m going to paste that readme here:
- There’s a human-readable data file called
- It has the dates and total count sequenced (from
- I made the dates given use month abbreviations and I added two rows for the weeks that are yet to happen.
- This folder called visualization/ contains my files
- visualization/all-weeks-plotted.p plots all strains given in
- This is gnuplot code. How to install gnuplot.
- It plots the strains in a stacked histogram.
- visualization/all-weeks-plotted.png is the resulting image.
- Download it to view it or scroll down.
- ../visualization/four-weeks-plotted.p makes a graph that highlights the last four weeks of data that the city is just lumping together on their public display.
- visualization/ignored-strains.p is a graph that uses the data before they eliminated strains that are not very prevalent.
- This graph will not be updated because there is no data to update it with.
- Note the top of the graph reaches 10% and this is not the same as other graphs you might see from me.
- The folder below, visualization/spoofs, contains my simulations of what the city display would look like if it were using a one week window instead of a four week window.
Images (June 24th data)
Changes in back data
I’ve noticed that the variant count data is changing with every update the city pushes out. I can’t think of a good reason for this but I’ll present my findings in the images below.
The three-color pictures show a red color for a decrease, a yellow color for no change, and a green color for an increase. The displayed values are those in the later week. The last row is white because it’s new data at that point and has no comparison.
The color scale pictures show a deeper red color for counts that went down more than others and a deeper green color for counts that went up more than others. No change or small changes tend towards white. The displayed values are the changes in the counts or percentages (using final minus initial). Percentages aren’t highlighted because a change in one count can tweak many denominators in a given week (row). The last week is omitted from this view.
These are all computed using a different tab of The Sheet mentioned above and you can check my work.
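For anyone who wants to repeat the comparison without a spreadsheet, here is a rough Python sketch of the same idea. The data layout (a dict of weeks, each holding a dict of strain counts), the function name `compare_releases`, and every number are hypothetical. It computes later minus earlier for each week and strain, labels the direction, and skips weeks that only exist in the later release.

```python
def compare_releases(earlier, later):
    """Map (week, strain) to (change, label), where change = later - earlier."""
    changes = {}
    for week, counts in later.items():
        if week not in earlier:
            continue  # brand-new week: nothing to compare against
        for strain, count in counts.items():
            delta = count - earlier[week].get(strain, 0)
            if delta > 0:
                label = "increase"
            elif delta < 0:
                label = "decrease"
            else:
                label = "no change"
            changes[(week, strain)] = (delta, label)
    return changes


# Hypothetical counts from two consecutive data releases.
earlier = {
    "Jan 3-9": {"Alpha": 2, "Other": 100},
    "Jan 10-16": {"Alpha": 5, "Other": 120},
}
later = {
    "Jan 3-9": {"Alpha": 2, "Other": 110},
    "Jan 10-16": {"Alpha": 4, "Other": 120},
    "Jan 17-23": {"Alpha": 7, "Other": 90},
}

for key, value in compare_releases(earlier, later).items():
    print(key, value)
```

The three labels correspond to the green, yellow, and red cells in the three-color pictures, and the numeric change is what the color scale pictures display.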
June 10 to June 17 Comparison
You can see that there are only changes of ±2 at maximum for all weeks, but some of these weeks are far in the past.
June 17 to June 24 Comparison
Of particular note is the 500+ tests added to the early weeks of January, while recent tests total fewer than one hundred per week. Notably, the added tests are primarily ones with no noted strain of the virus, i.e. “Other”.
Questions I’m left with that need answering
I’m adding to this list over time.
- Should New York City be averaging its last four weeks of data like this to hide gaps in its own data or is that bad practice?
- At what prevalence threshold are we allowed to see a strain on
- Can we be sure that new strains will be added as they appear? There’s a precedent of retroactive disclosures (see May 16 Data Release section above)
- Will this sequencing continue, or is this falling sequencing rate a reflection of the ending of this monitoring?
- Why are counts changing as far back as the first week of sequencing? As I write this specific sentence, it is June 24th and counts have changed over the past two commits since June 10th. Is this going to continue? What does it mean?