Guansong Wang's Website

Collecting and Cleansing AQI Data

Tags: 2015ChinaAQI 2015-08-21

Collect and cleanse hourly AQI of five cities of China (where U.S. embassy locations: Beijing, Shanghai, Guangzhou, Shenyang, Chengdu) from 2008 to 2014.

Collect and cleanse hourly AQI of five cities of China (where U.S. embassy locations: Beijing, Shanghai, Guangzhou, Shenyang, Chengdu) from 2008 to 2014.

Link to raw and cleansed dataset:Baidu Cloud.

Data Source: stateair.net
Create Date: 20150406
Updata Date: 20150407

Defination of the dataset and introduction can be downloaded from this link: http://www.stateair.net/web/historical/1/1.html

Data use statementhttp://www.stateair.net/web/historical/1/1.html

Files:

City_PM25_From_To.csv
CSV dataset of cleansed data of _city_
USEmbassyAQI.csv
CSV dataset with five cities combied
USEmbassyAQI.RData
In Rdata format
Raw/*
Raw dataset

Cleansing Process:

  • Delete the first three rows of text.
  • Delete the "Parameter" column, since all have the same value "PM2.5".
  • Delete "DATE(LST)" column, keep "Year", "Month", "Day" and "Hour" columns.
  • Delete "Unit" column, since all have the same value "ug/m3".
  • Delete "QC Name", since all are blank.
  • Transfer "Duration" column, which takes two values of "Valid" and "Missing" to a boolean column "Valid".

Update 20150407:

There are 17 pairs of duplicates (same location and time), all of which are at 3:00 am. The duplicates have different values of AQI and the 2:00 am records are missing for all the cases, so correct the first appearance of 3:00 am as 2:00 am record. The following data is adjusted:

Row Site Year Month Day Hour Value Valid
6675Beijing2009383179TRUE
15579Beijing2010314375TRUE
24315Beijing20113133351TRUE
50523Beijing201439355TRUE
59355Chengdu20123113-999FALSE
68091Chengdu20133103176TRUE
76827Chengdu201439365TRUE
85683Guangzhou20113133-999FALSE
94419Guangzhou20123113-999FALSE
103155Guangzhou2013310392TRUE
111891Guangzhou201439323TRUE
120747Shanghai20113133-999FALSE
129483Shanghai2012311389TRUE
138219Shanghai20133103107TRUE
146955Shanghai201439355TRUE
155739Shenyang20133103-999FALSE
164475Shenyang201439353TRUE

Moreover, there are records with value -999 but not marked as invalid in Shanghai, such as:

Row Site Year Month Day Hour Value Valid
127905Shanghai201215949TRUE
127906Shanghai2012151049TRUE
127907Shanghai20121511-999TRUE
127908Shanghai20121512-999TRUE
127909Shanghai2012151350TRUE
127910Shanghai2012151450TRUE

The two records in the middle is invalid but not properly marked. There are totally 296 cases like this that are adjusted by marking "Valid=FALSE".