Semi-structured data on the left, Pandas dataframe and graph on the right - image by author

These days much of the data you find on the internet are nicely formatted as JSON, Excel files or CSV. I needed a simple dataset to illustrate my articles on data visualisation in Python and Julia and decided upon weather data (for London, UK) that was publicly available from the UK Met Office.

The problem was that it was a text file that looked like a CSV file but was really formatted for a human reader. So I needed to do a bit of cleaning and tidying before I could create a Pandas dataframe and plot graphs.

This article is about the different techniques that I used to transform this semi-structured text file into a Pandas dataframe with which I could perform data analysis and plot graphs.

I could, no doubt, have converted the file with a text editor, but that would have been very tedious. Instead, I decided it would be more fun to do it programmatically with Python and Pandas. Also, and perhaps more importantly, writing a program to download and format the data meant that I could automatically keep it up to date with no extra effort.

First, there was the structure of the file. The data were tabulated but preceded by a free-format description, so this was the first thing that had to go. Secondly, the column names were in two rows rather than the one that is conventional in a spreadsheet file. Then, although it looked a bit like a CSV file, there were no delimiters: the data were separated by a variable number of blank spaces. Lastly, the number of data columns changed part way through the file. The data range from 1948 to the current time but the figures for 2020 were labelled ‘Provisional’ in an additional column. In the early years some data were missing and those missing values were represented by a string of dashes.
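To make the problems described above concrete, here is a minimal sketch of how such a file might be read with Pandas. It is not necessarily the approach the rest of the article takes. The sample text, the column names, the number of lines to skip and the use of `---` as the missing-value marker are all assumptions made for illustration, and the numbers are invented placeholders rather than real Met Office figures.

```python
import io

import pandas as pd

# Hypothetical sample that mimics the layout described in the text:
# a free-format description, two header rows, space-separated values,
# dashes for missing data and an extra 'Provisional' column on late rows.
# The values are invented placeholders, not real measurements.
SAMPLE = """\
Example Station
Free-format description line
Another description line
   yyyy  mm   tmax    tmin      af    rain     sun
              degC    degC    days      mm   hours
   1948   1    8.9     3.3     ---    85.0     ---
   1948   2    7.9     2.2     ---    26.0     ---
   2020   1   10.0     4.0       1    50.0    40.0  Provisional
"""

# Assumed column names; the last one holds the optional 'Provisional' label.
COLUMNS = ["year", "month", "tmax", "tmin", "air_frost", "rain", "sun", "status"]


def load_weather(text, n_skip=5):
    """Parse the semi-structured text into a dataframe.

    n_skip is the number of leading lines to discard (description plus the
    two header rows); 5 is right for SAMPLE but is an assumption otherwise.
    """
    return pd.read_csv(
        io.StringIO(text),
        skiprows=n_skip,      # drop the description and both header rows
        header=None,          # the header rows were skipped, so supply names
        names=COLUMNS,        # short rows get NaN in the trailing column
        sep=r"\s+",           # any run of blank spaces separates fields
        na_values=["---"],    # treat the dashes as missing data
    )


df = load_weather(SAMPLE)
print(df)
```

Supplying one name more than the early rows have fields lets the short rows and the longer ‘Provisional’ rows live in the same dataframe, with the extra column simply empty where no label was present.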