It’s the dream of data journalists to get their hands on a data set that’s ready-made to analyze and publish from the first moment they open it.
A little over a year ago, my colleagues and I received a large CSV file in the neighborhood of 150 gigabytes. You cannot open a file like that in Excel to see it (unless you only want to see the first million rows). If you have a computer that can open it in another program, it is incredibly slow to process. So my colleague Aaron Williams wrote a bash script to cut the file into million-row chunks and then a Python script to convert them all to Parquet files so we could do analysis on hundreds of millions of data points in parallel.
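For readers who want to try the chunking step themselves, here is a minimal sketch in Python using only the standard library. It is not Aaron's script (his was written in bash); the function name `split_csv`, the chunk file names, and the assumption that the file has a single header row are all mine. It streams the file row by row, so the whole thing never has to fit in memory.

```python
import csv
from pathlib import Path


def split_csv(source, out_dir, rows_per_chunk=1_000_000):
    """Stream `source` and write chunk files that each repeat the header row.

    Hypothetical helper for illustration; assumes the first row is a header.
    Returns the list of chunk paths in the order they were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []

    with open(source, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:
            return chunk_paths  # empty file, nothing to split

        out_file = None
        writer = None
        rows_in_chunk = 0
        for row in reader:
            # Start a new chunk file on the first row or when the current one is full.
            if writer is None or rows_in_chunk >= rows_per_chunk:
                if out_file is not None:
                    out_file.close()
                path = out_dir / f"chunk_{len(chunk_paths):05d}.csv"
                out_file = open(path, "w", newline="", encoding="utf-8")
                writer = csv.writer(out_file)
                writer.writerow(header)
                chunk_paths.append(path)
                rows_in_chunk = 0
            writer.writerow(row)
            rows_in_chunk += 1

        if out_file is not None:
            out_file.close()

    return chunk_paths
```

Each chunk can then be converted to Parquet in a separate pass with a library such as pandas or pyarrow. That step is left out here because it depends on which of those you have installed and on the column types in your data.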
When we ran initial analysis on the files, we found that the approximately 380 million rows of data were remarkably clean. That allowed us to do analysis and build public data pages quickly and with confidence. But this is an outlier and what you get is rarely that clean.
Data journalists are used to getting data with many errors, chief among them misspellings. This is unsurprising. A lot of data is compiled by humans typing information into free-form text fields, and humans make a lot of mistakes. Spell check is a nice feature of some software but not all of it. And so when we get data, sometimes there are 57 variations of the city of Philadelphia, as there were in the data released by the U.S. Small Business Administration on loans issued under the Paycheck Protection Program.
For data like this, there are tools that can be used to clean the data for analysis and publication. The most commonly used tool is OpenRefine (previously Google Refine). It is software commonly taught for cleaning data through a variety of methods in a very visual space that lets a user control which values are treated as the same and which are fundamentally different things with slightly different spellings.
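To give a feel for what that kind of clustering does, here is a rough sketch in the spirit of key-collision "fingerprint" clustering, one of the methods OpenRefine offers. This is a simplified stand-in written for illustration, not OpenRefine's own code, and the city spellings in the usage example are made up rather than taken from the PPP data. The idea is to boil each value down to a normalized key and group the values that share one.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Collapse case, accents, punctuation, and word order into one key."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    # Drop accents and any other non-ASCII characters left after normalization.
    text = text.encode("ascii", "ignore").decode("ascii")
    # Delete punctuation, then sort the unique words so order does not matter.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster(values):
    """Group the original values by their fingerprint key."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical spellings, for illustration only.
    cities = ["Philadelphia", "PHILADELPHIA ", "philadelphia.", "Phila", "Philadelpia"]
    for key, members in cluster(cities).items():
        print(key, members)
```

Note what this does and does not catch. Differences in capitalization, stray spaces, and punctuation collapse into one group, but a true misspelling like the made-up "Philadelpia" stays in its own group. Closing that gap takes fuzzy matching and, usually, a human deciding which groups really are the same thing.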
A newer tool, and the one I primarily use and teach to others, is dedupe, created and maintained by the folks at DataMade. Dedupe allows for more control and replication and is great for people of all skill levels.
It doesn’t hurt to play around with both or others and figure out what works for you. At the end of the day, the tool that works best for you is the best tool. And once it is clean, you can be much more certain that the remaining steps in your work will be much easier.
Over the weekend, I went on a trip out of the city to buy some lumber for a few upcoming projects. My primary goal was to find good lumber for a tabletop, and I found a lovely live-edge slab that is better than anything I was hoping to find.
The ten-foot-tall piece of English Walnut immediately caught my eye. A nice section of it would make for a perfect top for a dining table. Instead of cutting out the seven-foot top (so it could fit on top of an SUV) with a straight cut across, my lumber guy showed me how to cut a light curve that is perfectly circular to emphasize the live edge of the wood. It really looks lovely. He charged me one beer for this knowledge, payable on my next visit. I’ll bring a twelve-pack.
The real beauty of this piece is that the wood is already perfectly flat and the top and bottom are parallel to each other. So the wood requires basically nothing to begin attaching legs and aprons for a dining table. But this is an outlier and what you get is rarely that clean.
For a previous project building desks, I needed a lot of walnut. Like nearly a full car’s worth (thanks to Jim Webster for the help). The 56 pieces were of all shapes and sizes and were exceptionally rough. This is par for the course with lumber. Lumber shops get rough planks from lumber yards and the expectation is that you’ll clean them up yourself.
The first step I took with each board was to break out my jointer and make certain that one side of the board and one edge were squared flat. This is a precise process that requires a user to control the wood well, but it gets you into position. Then, I put the boards through a planer to square off the other side and get the board to the same thickness as the other 55 boards. This part makes the boards really look like cleaned boards.
Finally, I cut the uncleaned edge of the board to be parallel and to the correct size using the table saw. After all that work, the boards were ready for building. And once it is clean, you can be much more certain that the remaining steps in your work will be much easier.
datawork
The NYPD Files — “After New York state repealed a law that kept police disciplinary records secret, ProPublica sought records from the civilian board that investigates complaints by the public about New York City police officers. The board provided us with the closed cases of every active-duty police officer who had at least one substantiated allegation against them. The records span decades, from September 1985 to January 2020. We have created a database of complaints that can be searched by name or browsed by precinct or nature of the allegations.” [ProPublica]
Barr claimed feds in KC made 200 arrests in two weeks. That’s not even close to true. — “The number baffled many in Kansas City, including local officials who said they could not vouch for it. Speaking with McClatchy after the Wednesday event, the senior Justice Department official clarified that the 200 figure included arrests dating back to December 2019. It also included, the official said, both state and FBI arrests in joint operations.” [The Kansas City Star]
Florida collects more data on COVID hospital patients than it shares with the public — “Florida was one of the last three states to report current COVID hospitalization data to the public. The state health department has long included “cumulative hospitalizations” in its county reports, but that information does not allow public health experts to track the disease’s progression in a community or its strain on local hospitals. On Friday afternoon, there were about 9,200 patients hospitalized with a primary diagnosis of COVID statewide, including about 2,000 in Miami-Dade and nearly 1,300 in Broward, according to the state.” [Miami Herald]
At college health centers, students battle misdiagnoses and inaccessible care — “To assess the landscape of student health services at roughly 1,700 four-year residential campuses, The Post interviewed more than 200 students, parents and health officials and examined thousands of pages of medical records and court documents and 5,500 reviews of student health centers posted on Google. College students reported they commonly waited days or weeks for appointments and were routinely provided lackluster care. Dozens of students ended up hospitalized — and some near death — for mistakes they said were made at on-campus clinics, including misdiagnosed cases of appendicitis at Kansas State University and meningitis at the University of Arkansas.” [The Washington Post]
As a copy editor, I felt this one in my soul.