Tag Archives: web scraping

Another Excel Horror Story

I was trying to create a list of officially approved Health Maintenance Organisations (HMOs) in Nigeria. After jotting down what data I wanted to collect and creating a schema, I paused to decide on how to initiate the approach. I wanted to first of all have it as a CSV file and then figured that the cheapest way to start would be to be “graphical” about it. I opted to go for MS Excel, since I could easily save the results in the desired format. After all, I’m an Office 365 subscriber, so why not give it a try?

If you know anything about me, you are probably aware of my aversion to Excel. After a long romance, our separation was both violent and traumatic. But today I said to myself that I would not be unduly nasty and give it a shot. I told myself, there is no doubt that Excel is a great application and it’s used my millions with great effect.

I found the website of the National Health Insurance Scheme (NHIS) and the page that lists the HMOs. Good. I could have two windows open, the web page on the left and Excel on the right, plug into some good music and in a few minutes of copy-pasting, I should be able to acquire the data.

After a few minutes — and when I got to the phone numbers — Excel started off with one of our old quarrels. Somehow, we could never get to agree on how to handle phone numbers. First, it turned the numbers into scientific notation. Then I tried to set the input type from “General” to “Text” to allow, leading zeros. Then I had to click on the action prompt to indicate that I didn’t want formatted text. Even though I applied my settings to the columns that were to accept phone numbers, whenever I hit the next row, I had to start all over again. Arrrrrgh!

I now chastised myself for thinking that Excel was a changed person. How stupid I was! So I had to vent…

Sometimes we do silly things but don’t know why. This was one of them. I’m reasonably comfortable with R, and practically kicked myself knowing that with the rvest package, and a little peeping around for HTML tags and/or CSS selectors using the SelectorGadget, I could more efficiently grab the data I so badly needed.

Here’s the code I eventually used to get the job done:


nhisHtml <- read_html("https://www.nhis.gov.ng/hmo-contacts/")

tableTag <- html_nodes(nhisHtml, "table")
tblElements <- html_table(tableTag)
myDf <- tblElements[[1]]
write.csv(myDf, "data.csv")

What on earth was I thinking to even attempt using Excel for this task?

Leave a comment

Filed under Computers & Internet