Researching Methods for Scraping Government Tax Revenue From the Web

Tue Aug 02 2016
Brian Dumbacher, Mathematical Statistician, Economic Statistical Methods Division, and Cavan Capps, Big Data Lead, Associate Directorate for Research and Methodology
The Quarterly Summary of State and Local Government Tax Revenue is a sample survey conducted by the U.S. Census Bureau that collects data on tax revenue collections from state and local governments. Much of the data are publicly available on government websites. In fact, instead of responding via questionnaire, some respondents direct survey analysts to their websites to obtain the data. Going directly to websites for those data can reduce respondent burden and aid data review.

It would be useful to have a tool that automatically collects, or scrapes, relevant data from the web. Developing such a tool is challenging. There are thousands of government websites but very little standardization in structure or publication format. A large majority of government publications are in Portable Document Format (PDF), a file type that is not easily analyzed. Finally, both web and PDF documents have constantly changing formats.

To solve this problem, researchers at the Census Bureau are studying and applying methods for unstructured data, text analytics and machine learning. These methods belong to the realm of "Big Data." Big Data refers to large and frequently generated datasets representing a variety of structures. As opposed to designed survey data, Big Data are "found" or "organic" data. Typically, these data are created for one purpose, such as a click log, a social media post or an online PDF report, but are repurposed for something else, such as inferring behavior. Because the data were not designed for statistical inference, they often present unique challenges.

The goal of this research is to develop a web crawler with machine learning that performs three tasks:

  1. Crawls through a government website and discovers all PDFs.
  2. Classifies each PDF according to whether it contains relevant data on tax revenue collections.
  3. Extracts the relevant data, organizes it and stores it in a database.

For task 1, we used Apache Nutch, an open-source web crawler. In a production environment, the process will scale up by distributing the work over many computers and then combining the results.
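
The core of the discovery step is finding links to PDF files as the crawler visits pages. The sketch below is a minimal, single-machine illustration of that idea using only the Python standard library; it is not Nutch itself, and the example URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PDFLinkParser(HTMLParser):
    """Collects absolute URLs of links ending in .pdf from a page's <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().endswith(".pdf"):
                # Resolve relative links against the page's own URL.
                self.pdf_links.append(urljoin(self.base_url, value))


def find_pdf_links(html, base_url):
    """Return all PDF links discovered in one page of HTML."""
    parser = PDFLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_links
```

A full crawler would repeatedly fetch pages, apply this extraction, and enqueue newly discovered pages; Nutch additionally distributes that loop over many machines.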

For task 2, we developed a technique to convert PDF documents to text and reorganize the output. A classification model applied to the converted text determines whether the document has relevant data on tax revenue collections. This model uses the occurrence of key sequences of words, such as "statistical report" and "sales tax income," along with other text analysis techniques.
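
As a rough sketch of the classification step, the snippet below counts occurrences of key phrases in the extracted text and applies a weighted-score rule. The two phrases quoted above come from this article; the third phrase, the weights, and the threshold are illustrative assumptions standing in for a trained model.

```python
# "statistical report" and "sales tax income" are mentioned in the article;
# "tax revenue" and all weights/thresholds are illustrative assumptions.
KEY_PHRASES = {
    "statistical report": 1.0,
    "sales tax income": 2.0,
    "tax revenue": 1.5,
}


def phrase_counts(text, phrases=KEY_PHRASES):
    """Count occurrences of each key phrase in lowercased text."""
    lowered = text.lower()
    return {p: lowered.count(p) for p in phrases}


def is_relevant(text, threshold=2.0):
    """Weighted phrase-count rule standing in for a trained classifier."""
    counts = phrase_counts(text)
    score = sum(KEY_PHRASES[p] * c for p, c in counts.items())
    return score >= threshold
```

In practice these phrase counts would feed a statistical model fit to labeled documents rather than a fixed threshold.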

For task 3, we are considering various ideas. Relevant data would probably be found in tables and in close proximity to key sequences of words. We will explore table identification methods based on the distribution of terminology in the PDF, as well as additional modeling that maps the nonstandard data in PDFs to standard definitions in Census Bureau publications.
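
One simple way to operationalize "tables near key words" is to flag numeric-dense lines of extracted text that sit close to a line containing a key term. The sketch below illustrates that heuristic; the keyword, density threshold, and context window are assumptions for illustration, not the Bureau's method.

```python
def digit_density(line):
    """Fraction of a line's non-space characters that are digits."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isdigit() for c in chars) / len(chars)


def table_candidates(text, keyword="tax", min_density=0.3, context=2):
    """Return indices of numeric-dense lines within `context` lines of a
    line containing `keyword` -- likely table rows of tax figures.
    All thresholds here are illustrative assumptions."""
    lines = text.splitlines()
    key_lines = {i for i, ln in enumerate(lines) if keyword in ln.lower()}
    candidates = []
    for i, ln in enumerate(lines):
        near_key = any(abs(i - k) <= context for k in key_lines)
        if near_key and digit_density(ln) >= min_density:
            candidates.append(i)
    return candidates
```

Flagged lines could then be parsed into rows and columns and mapped to the standard definitions used in Census Bureau publications.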

The Census Bureau looks forward to continuing this web scraping research and exploring new machine learning algorithms that reduce respondent burden, speed survey processing and improve data collection.

To learn more about the research methods for scraping government tax revenue from the web, please join us at the Joint Statistical Meetings on August 2, 2016.
