Internal scripts used for download tasks

Python module to download EPA AQS Data hosted at https://www.epa.gov/aqs

The module can be used as a library of functions to be called from other Python scripts.

The data is downloaded from https://aqs.epa.gov/aqsweb/airdata/download_files.html

The tool adds a column containing a uniquely generated Monitor Key

The only method likely to be useful to an external user is download_aqs_data()

transfer(reader: DictReader, writer: DictWriter, flt=None, header: bool = True)[source]

Specific to EPA AQS Data

Rewrites the CSV content, adding a Monitor Key column and optionally filtering rows with a caller-provided predicate (for example, one based on a list of parameter codes)

Parameters:
  • reader – Input data as an instance of csv.DictReader

  • writer – Output source should be provided as csv.DictWriter

  • flt – Optionally, a callable function returning True for rows that should be written to the output and False for those that should be omitted

  • header – whether to write the header row first

Returns:

Nothing
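
A minimal usage sketch (assuming transfer is imported from this module; the AQS column names and the exact name of the added Monitor Key field are illustrative assumptions):

    import csv

    # Hedged sketch: copy an AQS annual file, keeping only ozone rows
    # (parameter code 44201). Column names follow the standard AQS CSV
    # layout but are not confirmed by this module's documentation.
    with open("annual_conc_by_monitor_2020.csv", newline="") as f_in, \
            open("ozone_2020.csv", "w", newline="") as f_out:
        reader = csv.DictReader(f_in)
        writer = csv.DictWriter(
            f_out, fieldnames=list(reader.fieldnames) + ["Monitor Key"])
        transfer(reader, writer,
                 flt=lambda row: row["Parameter Code"] == "44201")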

add_monitor_key(row: Dict)[source]

Internal method to generate and add a unique Monitor Key

Parameters:

row – a row of AQS CSV file

Returns:

Nothing, modifies the given row in place
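
An illustrative call; the column names below are hypothetical stand-ins for typical AQS headers, and the key format itself is internal to the module:

    # The field names are assumptions, not part of this API.
    row = {"State Code": "01", "County Code": "073", "Site Num": "0023",
           "Parameter Code": "44201", "POC": "1"}
    add_monitor_key(row)   # mutates `row`, adding the generated Monitor Key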

download_data(task: DownloadTask)[source]

A utility method to download the content of a given URL to the given file

Parameters (carried by the given DownloadTask):
  • url – Source URL

  • target – Target file path

  • parameters – An optional list of EPA AQS Parameter codes to include in the output

  • append – whether to append to an existing file

Returns:

Nothing

destination_path(destination: str, path: str) → str[source]

A utility method to construct the destination file path

Parameters:
  • destination – Destination directory

  • path – Source path in URL

Returns:

Path on a file system
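
A hedged sketch of the intended use; the exact naming scheme for the resulting path is defined by the implementation:

    # Combine the download directory with the file name taken from the
    # source URL path (illustrative only).
    target = destination_path("/data/aqs", "annual_conc_by_monitor_2020.zip")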

collect_annual_downloads(destination: str, path: str, contiguous_year_segment: List, parameters: List) → DownloadTask[source]

A utility method to collect all URLs that should be downloaded for a given list of years and EPA AQS parameters

Parameters:
  • destination – Destination directory for downloads

  • path – path element

  • contiguous_year_segment – a list of contiguous years that can be saved in the same file

  • parameters – List of EPA AQS Parameter codes

  • downloads – The resulting collection of downloads that have to be performed

Returns:

downloads list

collect_daily_downloads(destination: str, ylabel: str, contiguous_year_segment: List, parameter) → DownloadTask[source]

A utility method to collect all URLs that should be downloaded for a given list of years and EPA AQS parameters

Parameters:
  • destination – Destination directory for downloads

  • ylabel – a label to use for years in the destination path

  • contiguous_year_segment – a list of contiguous years that can be saved in the same file

  • parameters – List of EPA AQS Parameter codes

  • downloads – The resulting collection of downloads that have to be performed

Returns:

downloads list

collect_aqs_download_tasks(context: AQSContext)[source]

Main entry into the library

Parameters (taken from the given AQSContext):
  • aggregation – Type of time aggregation: annual or daily

  • years – a list of years to include; if None, all years are included

  • destination – Destination Directory

  • parameters – List of EPA AQS Parameter codes. For annual aggregation it can be empty, in which case all data is downloaded. Required for daily aggregation. Can contain integer codes, mnemonic instances of the Parameter enum, or both.

  • merge_years

Returns:

as_stream(url: str, extension: str = '.csv', params=None, mode=None)[source]

Returns the content of a URL as a stream. If the content is in ZIP format (but not gzip), a temporary file is created

Parameters:
  • url – URL

  • extension – optional; when the content is zip-encoded, the extension of the zip entry to return

  • params – Optional. A dictionary, list of tuples or bytes to send as a query string.

  • mode – optional parameter to specify the desired mode: text or binary. Possible values: ‘t’ or ‘b’

Returns:

Content of the URL or a zip entry
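
A usage sketch, assuming the returned stream behaves like a regular file object (the URL is one of the standard AQS download archives):

    url = "https://aqs.epa.gov/aqsweb/airdata/annual_conc_by_monitor_2020.zip"
    # Request the archive's .csv entry as a text stream.
    stream = as_stream(url, extension=".csv", mode="t")
    header = stream.readline()
    stream.close()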

as_content(url: str, params=None, mode=None)[source]

Returns a byte or text block with the URL content

Parameters:
  • url – URL

  • params – Optional. A dictionary, list of tuples or bytes to send as a query string.

  • mode – optional parameter to specify the desired return format: text or binary. Possible values: ‘t’ or ‘b’; default is binary

Returns:

Content of the URL
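
A short sketch fetching a page as decoded text:

    html = as_content("https://aqs.epa.gov/aqsweb/airdata/download_files.html",
                      mode="t")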

as_csv_reader(url: str, mode=None) → DictReader[source]

A utility method to return the CSV content of the URL as a csv.DictReader

Parameters:

url – URL

Returns:

an instance of csv.DictReader
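
A usage sketch; the column name in the loop is an assumed AQS header, not part of this API:

    reader = as_csv_reader(
        "https://aqs.epa.gov/aqsweb/airdata/annual_conc_by_monitor_2020.zip")
    for row in reader:
        print(row["Parameter Code"])   # assumed AQS column name
        break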

file_as_stream(filename: str, extension: str = '.csv', mode=None)[source]

Returns the content of a file as a stream. If the content is in ZIP format (but not gzip), a temporary file is created

Parameters:
  • filename – path to file

  • extension – optional; when the content is zip-encoded, the extension of the zip entry to return

  • mode – optional parameter to specify the desired mode: text or binary. Possible values: ‘t’ or ‘b’

Returns:

Content of the file or a zip entry

file_as_csv_reader(filename: str)[source]

A utility method to return the CSV content of the file as a csv.DictReader

Parameters:

filename – path to file

Returns:

an instance of csv.DictReader
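
A minimal sketch with a plain .csv file; per file_as_stream above, zipped content is presumably handled the same way:

    reader = file_as_csv_reader("/data/aqs/annual_conc_by_monitor_2020.csv")
    for row in reader:
        print(row)   # each row is a dictionary keyed by the CSV header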

fopen(path: str, mode: str)[source]

A wrapper to open various types of files

Parameters:
  • path – Path to file

  • mode – Opening mode

Returns:

file-like object

check_http_response(r: Response)[source]

An internal method that raises an exception if the HTTP response is not OK

Parameters:

r – Response

Returns:

nothing, raises an exception if response is not OK
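
A usage sketch, assuming Response here is requests.Response:

    import requests

    r = requests.get("https://aqs.epa.gov/aqsweb/airdata/download_files.html")
    check_http_response(r)   # raises if the status is not OK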

download(url: str, to: IO)[source]

A utility method to download large binary data to a file-like object

Parameters:
  • url – Source URL

  • to – a writable file-like object receiving the data
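
A minimal sketch, streaming an archive into a local file opened in binary mode:

    with open("/tmp/annual_conc_by_monitor_2020.zip", "wb") as out:
        download(
            "https://aqs.epa.gov/aqsweb/airdata/annual_conc_by_monitor_2020.zip",
            out)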

is_downloaded(url: str, target: str, check_size: int = 0) → bool[source]

Checks if the same data has already been downloaded

Parameters:
  • url – URL with data

  • target – Destination of the downloads

  • check_size – Use the default value (0) if the target size should equal the source size. If several URLs are combined into one download, specify a positive integer to check that the destination file size is greater than that value. A negative value disables the size check

Returns:

True if the destination file exists and is newer than URL content
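
This method combines naturally with download() above; a hedged sketch:

    url = "https://aqs.epa.gov/aqsweb/airdata/annual_conc_by_monitor_2020.zip"
    target = "/data/aqs/annual_conc_by_monitor_2020.zip"
    if not is_downloaded(url, target):
        with open(target, "wb") as out:
            download(url, out)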

write_csv(reader: DictReader, writer: DictWriter, transformer=None, filter=None, write_header: bool = True)[source]

Rewrites the CSV content optionally transforming and filtering rows

Parameters:
  • reader – Input data as an instance of csv.DictReader

  • writer – Output source should be provided as csv.DictWriter

  • transformer – An optional callable that transforms a row in place

  • filter – Optionally, a callable function returning True for rows that should be written to the output and False for those that should be omitted

  • write_header – whether to write the header row first

Returns:

Nothing
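
A usage sketch; the transformer and filter below are hypothetical examples, not part of this API:

    import csv

    def clean(row):
        # hypothetical transformer: strip whitespace from every value, in place
        for key, value in row.items():
            row[key] = value.strip()

    with open("in.csv", newline="") as f_in, \
            open("out.csv", "w", newline="") as f_out:
        reader = csv.DictReader(f_in)
        writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)
        write_csv(reader, writer, transformer=clean,
                  filter=lambda row: any(row.values()))   # drop empty rows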

class Collector[source]
class CSVWriter(out_stream)[source]
class ListCollector[source]
basename(path)[source]

Returns the name of a file or an archive entry without its extension

Parameters:

path – a path to a file or archive entry

Returns:

base name without full path or extension
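
Expected behavior, illustratively:

    name = basename("/data/aqs/annual_conc_by_monitor_2020.csv")
    # name == "annual_conc_by_monitor_2020"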

is_readme(name: str) → bool[source]

Checks if a file is a documentation file. This method is used to extract some metadata from documentation provided as Markdown files

Parameters:

name – file name to check

Returns:

True if the file is recognized as a documentation file

get_entries(path: str) → Tuple[List, Callable][source]

Returns a list of entries in an archive or files in a directory

Parameters:

path – path to a directory or an archive

Returns:

Tuple with the list of entry names and a method to open these entries for reading
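
A hedged sketch, assuming the returned callable takes an entry name and yields a readable stream:

    entries, opener = get_entries("/data/aqs/airdata.zip")
    for name in entries:
        stream = opener(name)   # assumed signature: opener(entry_name)
        print(stream.readline())
        stream.close()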

get_readme(path: str)[source]

Looks for a README file in the specified path

Parameters:

path – a path to a folder or an archive

Returns:

a file that is possibly a README file

is_dir(path: str) → bool[source]

Determines if a certain path specification refers to a collection of files or to a single entry. Examples of collections are folders (directories) and archives

Parameters:

path – path specification

Returns:

True if specification refers to a collection of files
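
Illustrative expectations:

    is_dir("/data/aqs")      # a directory is a collection -> True
    is_dir("airdata.zip")    # an archive is presumably a collection -> True
    is_dir("data.csv")       # a single file -> False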

class CSVFileWrapper(file_like_object, sep=',', null_replacement='NA')[source]

A wrapper around a CSV reader that:

  • Counts characters and lines read

  • Logs the progress of the file being read

  • Performs on-the-fly replacement of null and special values
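
A hedged sketch, assuming the wrapper can be read like the file it wraps and therefore fed to the csv module:

    import csv

    with open("annual_conc_by_monitor_2020.csv", newline="") as f:
        wrapped = CSVFileWrapper(f, sep=",", null_replacement="NA")
        reader = csv.DictReader(wrapped)
        for row in reader:
            print(row)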