archive

This module supports the Archive class, a generic structure representing a collection of archived emails, typically from a single mailing list.

class bigbang.archive.Archive(data, archive_dir='/home/docs/checkouts/readthedocs.org/user_builds/bigbang-py/checkouts/latest/archives/', mbox=False)

Bases: object

A representation of a mailing list archive.

Initialize an Archive object.

The behavior of the constructor depends on the type of its first argument, data.

If data is a Pandas DataFrame, it is treated as a representation of email messages with columns for Message-ID, From, Date, In-Reply-To, References, and Body. The created Archive becomes a wrapper around a copy of the input DataFrame.

If data is a string, it is interpreted as a path to either a single .mbox file (if the optional argument mbox is True) or a directory of files in mbox format. Note that the file extensions need not be .mbox; frequently they will be .txt.

Upon initialization, the Archive object drops duplicate entries and sorts its member variable data by Date.
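The DataFrame form and the cleanup step described above can be sketched with plain pandas. This does not call bigbang: the sample messages and the helper name normalize_messages are invented for illustration, and the real constructor may differ in details such as how duplicates are identified.

```python
import pandas as pd


def normalize_messages(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: copy the input, drop duplicate rows, sort by Date."""
    out = df.copy()  # work on a copy so the caller's frame is left untouched
    out["Date"] = pd.to_datetime(out["Date"], utc=True)
    out = out.drop_duplicates()
    return out.sort_values("Date")


# Made-up messages using the column names listed above; the third row
# duplicates the first.
raw = pd.DataFrame(
    {
        "Message-ID": ["<2@example.org>", "<1@example.org>", "<2@example.org>"],
        "From": ["bob@example.org", "alice@example.org", "bob@example.org"],
        "Date": [
            "2020-01-02T10:00:00Z",
            "2020-01-01T09:00:00Z",
            "2020-01-02T10:00:00Z",
        ],
        "In-Reply-To": ["<1@example.org>", None, "<1@example.org>"],
        "References": ["<1@example.org>", None, "<1@example.org>"],
        "Body": ["Reply", "Hello", "Reply"],
    }
)

cleaned = normalize_messages(raw)
print(cleaned[["Message-ID", "Date"]])
```

A frame shaped like raw is the kind of input the DataFrame branch of the constructor expects.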

Parameters
  • data (pandas.DataFrame, or str) –

  • archive_dir (str, optional) – Defaults to CONFIG.mail_path

  • mbox (bool) –

activity = None
add_affiliation(rel_email_affil)

Uses a DataFrame of email affiliation information and adds it to the archive’s data table.

The email affiliation data is expected to have a regular format, with columns:

  • email - strings, complete email addresses

  • affiliation - strings, names of the affiliated organizations

  • min_date - datetime, the starting date of the affiliation

  • max_date - datetime, the end date of the affiliation

Note that this mutates the dataframe in self.data to add the affiliation data.

rel_email_affil : pandas.DataFrame
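To make the expected table shape concrete, here is a minimal standard-library sketch of a date-ranged affiliation lookup. The record type Affiliation, the function affiliation_on, and the inclusive date bounds are assumptions made for illustration; how bigbang itself matches messages to affiliations is not specified here.

```python
from datetime import date
from typing import List, NamedTuple, Optional


class Affiliation(NamedTuple):
    """Hypothetical record mirroring the four columns described above."""
    email: str
    affiliation: str
    min_date: date
    max_date: date


def affiliation_on(email: str, when: date, table: List[Affiliation]) -> Optional[str]:
    """Return the organization for `email` on `when`, or None if none matches.

    Assumption: both ends of the date range are inclusive.
    """
    for row in table:
        if row.email == email and row.min_date <= when <= row.max_date:
            return row.affiliation
    return None


table = [
    Affiliation("alice@example.org", "Example Corp", date(2018, 1, 1), date(2019, 12, 31)),
    Affiliation("alice@example.org", "Example University", date(2020, 1, 1), date(2021, 12, 31)),
]

print(affiliation_on("alice@example.org", date(2019, 6, 1), table))
print(affiliation_on("alice@example.org", date(2020, 6, 1), table))
print(affiliation_on("bob@example.org", date(2020, 6, 1), table))
```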

compute_activity(clean=True)

Return the computed activity.

data = None
entities = None
get_activity(resolved=False)

Get the activity matrix of an Archive.

Columns of the returned DataFrame are the senders of emails. Rows are indexed by ordinal date. Cells are the number of emails sent by each sender on each date.

If resolved is true, then default entity resolution is run on the activity matrix before it is returned.
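The shape of such an activity matrix can be illustrated with plain pandas. The sketch below is not bigbang's implementation: the sample data is made up, entity resolution is left out, and pandas.crosstab is just one convenient way to count messages per sender and day.

```python
import pandas as pd

# Made-up messages: alice writes twice on the first day and once on the second.
messages = pd.DataFrame(
    {
        "From": [
            "alice@example.org",
            "bob@example.org",
            "alice@example.org",
            "alice@example.org",
        ],
        "Date": pd.to_datetime(
            [
                "2020-01-01 09:00",
                "2020-01-01 12:00",
                "2020-01-01 15:00",
                "2020-01-02 08:00",
            ]
        ),
    }
)

# Convert each timestamp to its proleptic Gregorian ordinal (an integer day).
ordinal_day = messages["Date"].map(lambda ts: ts.toordinal())

# Rows: ordinal dates. Columns: senders. Cells: message counts.
activity = pd.crosstab(ordinal_day, messages["From"])
print(activity)
```

Days on which a sender wrote nothing appear as zeros, so the matrix can be summed along either axis.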

get_personal_headers(header='From')

Returns a dataframe with a row for every message of the archive, containing column entries for:

  • The personal header specified. Defaults to “From”. Could be “Reply-To”.

  • The email address extracted from the From field

  • The domain of the From field

This dataframe is computed the first time this method is called and then cached.

Parameters

header (string, default "From") –

Returns

data

Return type

pandas.DataFrame
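The three columns can be pictured with the standard library's email.utils.parseaddr. This is a sketch rather than bigbang's code: the function name split_personal_header is hypothetical, and obfuscated addresses (for example "alice at example.org", as some list archives write them) are not handled.

```python
from email.utils import parseaddr
from typing import Tuple


def split_personal_header(value: str) -> Tuple[str, str, str]:
    """Return (raw header value, email address, domain) for one header."""
    _display_name, address = parseaddr(value)
    # Everything after the last "@" is taken as the domain, if there is one.
    domain = address.rpartition("@")[2] if "@" in address else ""
    return value, address, domain


headers = ["Alice Example <alice@lists.example.org>", "bob@example.com"]
rows = [split_personal_header(h) for h in headers]
for row in rows:
    print(row)
```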

get_threads(verbose=False)

Get threads.

preprocessed = None
resolve_entities(inplace=True)

Return data with resolved entities.

Parameters

inplace (bool, default True) –

Returns

None if inplace is True; otherwise the data with resolved entities.

Return type

pandas.DataFrame or None

save(path, encoding='utf-8')

Save data to csv file.

threads = None
exception bigbang.archive.ArchiveWarning

Bases: BaseException

Base class for exceptions specific to the Archive class.

exception bigbang.archive.MissingDataException(value)

Bases: Exception

bigbang.archive.archive_directory(base_dir, list_name)

Creates a new archive directory for the given list_name unless one already exists. Returns the path of the archive directory.
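The create-unless-exists behavior can be sketched with os.makedirs. The function name ensure_archive_directory is hypothetical, and the assumption that the directory is simply base_dir joined with list_name is an illustration, not a statement about bigbang's layout.

```python
import os
import tempfile


def ensure_archive_directory(base_dir: str, list_name: str) -> str:
    """Create base_dir/list_name if it is missing and return its path."""
    path = os.path.join(base_dir, list_name)
    os.makedirs(path, exist_ok=True)  # no error if the directory already exists
    return path


with tempfile.TemporaryDirectory() as base:
    first = ensure_archive_directory(base, "example-list")
    second = ensure_archive_directory(base, "example-list")  # second call is a no-op
    created = os.path.isdir(first)

print(first == second, created)
```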

Returns the footer of a DataFrame of emails.

A footer is a string occurring at the tail of most messages. Messages can be a DataFrame or a Series.

bigbang.archive.load(path)
bigbang.archive.load_data(name: str, archive_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/bigbang-py/checkouts/latest/archives/', mbox: bool = False)

Load the data associated with an archive name, given as a string.

Attempt to open {archives-directory}/NAME.csv as data.

Failing that, if the name is a URL, it will try to derive the list name from that URL and load the .csv again.

Parameters
  • name (str) –

  • archive_dir (str, default CONFIG.mail_path) –

  • mbox (bool, default False) – If true, expects and opens an mbox file at this path

Returns

data

Return type

pandas.DataFrame
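The name-or-URL lookup can be pictured as follows. This sketch collapses the two attempts into a single path computation, and the rule used to derive a list name from a URL (taking the last path segment) is an assumption; the function name csv_path_for is hypothetical.

```python
import os
from urllib.parse import urlparse


def csv_path_for(name: str, archive_dir: str) -> str:
    """Return the .csv path that would be tried for a list name or list URL."""
    parsed = urlparse(name)
    if parsed.scheme in ("http", "https"):
        # Assumed rule: the list name is the last non-empty path segment.
        name = parsed.path.rstrip("/").rsplit("/", 1)[-1]
    return os.path.join(archive_dir, name + ".csv")


print(csv_path_for("example-list", "archives"))
print(csv_path_for("https://mail.example.org/pipermail/example-list/", "archives"))
```

Under this assumption both calls resolve to the same file.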

bigbang.archive.messages_to_dataframe(messages)

Turn a list of parsed messages into a dataframe of message data, indexed by message-id, with column-names from headers.
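As an illustration of the idea, the sketch below parses two made-up messages with the standard email package and collects their headers in a dictionary keyed by Message-ID. The function name messages_to_records and the extra "Body" key are assumptions for this example, not bigbang's behavior.

```python
from email import message_from_string

RAW = [
    "Message-ID: <1@example.org>\n"
    "From: alice@example.org\n"
    "Date: Wed, 01 Jan 2020 09:00:00 +0000\n"
    "Subject: Hello\n"
    "\n"
    "First message.\n",
    "Message-ID: <2@example.org>\n"
    "From: bob@example.org\n"
    "Date: Thu, 02 Jan 2020 10:00:00 +0000\n"
    "In-Reply-To: <1@example.org>\n"
    "Subject: Re: Hello\n"
    "\n"
    "Second message.\n",
]


def messages_to_records(messages):
    """Map each Message-ID to a dict of its remaining headers plus the body."""
    records = {}
    for msg in messages:
        row = {key: str(value) for key, value in msg.items() if key != "Message-ID"}
        row["Body"] = msg.get_payload()  # plain string for non-multipart messages
        records[msg["Message-ID"]] = row
    return records


parsed = [message_from_string(text) for text in RAW]
records = messages_to_records(parsed)
print(sorted(records["<2@example.org>"]))
```

A dictionary of this shape can be turned into a frame indexed by Message-ID with pandas.DataFrame.from_dict(records, orient="index").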

bigbang.archive.open_list_archives(archive_name: str, archive_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/bigbang-py/checkouts/latest/archives/', mbox: bool = False) → pandas.core.frame.DataFrame

Return a DataFrame of all email messages contained in the specified directory.

Parameters
  • archive_name (str) – the name of a subdirectory of the directory specified in argument archive_dir. This directory is expected to contain files with extensions .txt, .mail, or .mbox. These files are all expected to be in mbox format, i.e. a series of blocks of text, each starting with headers (colon-separated key-value pairs) followed by an email body.

  • archive_dir (str) – directory containing all messages.

  • mbox (bool, default False) – True if there’s an mbox file already available for this archive.

Returns

data

Return type

pandas.DataFrame
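For readers unfamiliar with the mbox layout, the sketch below writes a tiny two-message file and reads every matching file in a directory with the standard mailbox module. It shows the file format only: the function name read_mbox_directory is hypothetical, and bigbang's own parsing may differ.

```python
import mailbox
import os
import tempfile

# Two messages in mbox form: each starts with a "From " separator line,
# followed by headers, a blank line, and the body.
MBOX_TEXT = (
    "From alice@example.org Wed Jan  1 09:00:00 2020\n"
    "From: alice@example.org\n"
    "Message-ID: <1@example.org>\n"
    "Subject: Hello\n"
    "\n"
    "First message.\n"
    "\n"
    "From bob@example.org Thu Jan  2 10:00:00 2020\n"
    "From: bob@example.org\n"
    "Message-ID: <2@example.org>\n"
    "Subject: Re: Hello\n"
    "\n"
    "Second message.\n"
)


def read_mbox_directory(directory):
    """Return all messages from .txt, .mail, and .mbox files in a directory."""
    messages = []
    for filename in sorted(os.listdir(directory)):
        if filename.endswith((".txt", ".mail", ".mbox")):
            box = mailbox.mbox(os.path.join(directory, filename), create=False)
            try:
                messages.extend(box)  # iterating a mailbox yields its messages
            finally:
                box.close()
    return messages


with tempfile.TemporaryDirectory() as archive_dir:
    path = os.path.join(archive_dir, "2020-January.txt")
    with open(path, "w", newline="\n") as handle:
        handle.write(MBOX_TEXT)
    subjects = [msg["Subject"] for msg in read_mbox_directory(archive_dir)]

print(subjects)
```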