archive¶
This module supports the Archive class, a generic structure representing a collection of archived emails, typically from a single mailing list.
-
class
bigbang.archive.
Archive
(data, archive_dir='/home/docs/checkouts/readthedocs.org/user_builds/bigbang-py/checkouts/v0.4.0/archives/', mbox=False)¶ Bases:
object
A representation of a mailing list archive.
Initialize an Archive object.
The behavior of the constructor depends on the type of its first argument, data.
If data is a Pandas DataFrame, it is treated as a representation of email messages with columns for Message-ID, From, Date, In-Reply-To, References, and Body. The created Archive becomes a wrapper around a copy of the input DataFrame.
If data is a string, then it is interpreted as a path to either a single .mbox file (if the optional argument mbox is True) or else to a directory of files in mbox format. Note that the file extensions need not be .mbox; frequently they will be .txt.
Upon initialization, the Archive object drops duplicate entries and sorts its member variable data by Date.
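The clean-up step described above can be sketched with plain pandas. This is only an illustration of "copy, drop duplicates, sort by Date" on made-up sample data; it does not call bigbang and is not the library's implementation.

```python
import pandas as pd

# Hypothetical sample data using the column names listed above.
messages = pd.DataFrame(
    {
        "Message-ID": ["<3@x>", "<1@x>", "<3@x>"],
        "From": ["a@example.org", "b@example.org", "a@example.org"],
        "Date": pd.to_datetime(["2020-01-03", "2020-01-01", "2020-01-03"]),
        "In-Reply-To": ["", "", ""],
        "References": ["", "", ""],
        "Body": ["third", "first", "third"],
    }
)

# Work on a copy so the caller's frame is untouched, drop rows that are
# exact duplicates, then order the remaining rows chronologically.
cleaned = messages.copy().drop_duplicates().sort_values("Date")
```

The original `messages` frame still holds three rows afterwards, which mirrors the "wrapper around a copy" behavior described for DataFrame input.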
- Parameters
data (pandas.DataFrame, or str) –
archive_dir (str, optional) – Defaults to CONFIG.mail_path
mbox (bool) –
-
activity
= None¶
-
add_affiliation
(rel_email_affil)¶ Uses a DataFrame of email affiliation information and adds it to the archive’s data table.
The email affiliation data is expected to have a regular format, with columns:
email - strings, complete email addresses
affiliation - strings, names of organizations of affiliation
min_date - datetime, the starting date of the affiliation
max_date - datetime, the end date of the affiliation
Note that this mutates the dataframe in self.data to add the affiliation data.
- Parameters
rel_email_affil (pandas.DataFrame) –
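As a rough illustration of the expected input, the sketch below builds a small affiliation table and looks up the affiliation that was active on each message's date. The sample data, the `lookup_affiliation` helper, and the `email` and `Date` columns on the message frame are assumptions made for this example; the matching logic inside add_affiliation itself may differ.

```python
import pandas as pd

# Hypothetical message data: one sender address and one date per message.
data = pd.DataFrame(
    {
        "email": ["a@example.org", "a@example.org", "b@example.org"],
        "Date": pd.to_datetime(["2019-06-01", "2021-06-01", "2020-01-01"]),
    }
)

# Affiliation table with the four columns described above.
rel_email_affil = pd.DataFrame(
    {
        "email": ["a@example.org", "a@example.org"],
        "affiliation": ["Org One", "Org Two"],
        "min_date": pd.to_datetime(["2019-01-01", "2021-01-01"]),
        "max_date": pd.to_datetime(["2019-12-31", "2021-12-31"]),
    }
)


def lookup_affiliation(email, date, affil):
    """Return the affiliation valid for `email` on `date`, or None."""
    match = affil[
        (affil["email"] == email)
        & (affil["min_date"] <= date)
        & (affil["max_date"] >= date)
    ]
    return match["affiliation"].iloc[0] if len(match) else None


# Add one affiliation value per message, mutating `data` in place.
data["affiliation"] = [
    lookup_affiliation(email, date, rel_email_affil)
    for email, date in zip(data["email"], data["Date"])
]
```

Messages from an address with no matching date range end up with a missing value in the new column.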
-
compute_activity
(clean=True)¶ Return the computed activity.
-
data
= None¶
-
entities
= None¶
-
get_activity
(resolved=False)¶ Get the activity matrix of an Archive.
Columns of the returned DataFrame are the Senders of emails. Rows are indexed by ordinal date. Cells are the number of emails sent by each sender on each date.
If resolved is true, then default entity resolution is run on the activity matrix before it is returned.
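The shape of such an activity matrix can be sketched with pandas alone. The sample messages are made up, and `pd.crosstab` is used here only as one convenient way to count messages per sender per day; it is not necessarily how get_activity computes its result.

```python
import pandas as pd

# Hypothetical messages: sender and timestamp only.
messages = pd.DataFrame(
    {
        "From": ["a@example.org", "b@example.org", "a@example.org", "a@example.org"],
        "Date": pd.to_datetime(
            [
                "2020-01-01 09:00",
                "2020-01-01 10:00",
                "2020-01-01 11:00",
                "2020-01-02 08:00",
            ]
        ),
    }
)

# Convert each timestamp to its ordinal day number (days since 0001-01-01).
ordinal = messages["Date"].map(lambda stamp: stamp.toordinal())

# Rows: ordinal dates. Columns: senders. Cells: number of messages sent.
activity = pd.crosstab(ordinal, messages["From"])
```

Sender and day combinations with no messages appear as zeros rather than missing values.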
-
get_personal_headers
(header='From')¶ Returns a dataframe with a row for every message of the archive, containing column entries for:
The personal header specified. Defaults to “From”. Could be “Reply-To”.
The email address extracted from the From field
The domain of the From field
This dataframe is computed the first time this method is called and then cached.
- Parameters
header (string, default "From") –
- Returns
data
- Return type
pandas.DataFrame
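To show what extracting the address and domain from a personal header involves, here is a minimal sketch using the standard library's `email.utils.parseaddr`. The helper name `split_personal_header` is hypothetical, and the method itself may parse headers differently.

```python
from email.utils import parseaddr


def split_personal_header(value):
    """Return (email_address, domain) parsed from a header such as From."""
    _name, address = parseaddr(value)
    # Everything after the last "@" is treated as the domain.
    domain = address.rpartition("@")[2] if "@" in address else ""
    return address, domain


address, domain = split_personal_header("Jane Doe <jane@example.org>")
```

An empty or unparseable header yields empty strings for both values, so callers can filter those rows out afterwards.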
-
get_threads
(verbose=False)¶ Get threads.
-
preprocessed
= None¶
-
resolve_entities
(inplace=True)¶ Return data with resolved entities.
- Parameters
inplace (bool, default True) –
- Returns
Returns None if inplace == True
- Return type
pandas.DataFrame or None
-
save
(path, encoding='utf-8')¶ Save data to csv file.
-
threads
= None¶
-
exception
bigbang.archive.
ArchiveWarning
¶ Bases:
BaseException
Base class for Archive class specific exceptions
-
exception
bigbang.archive.
MissingDataException
(value)¶ Bases:
Exception
-
bigbang.archive.
archive_directory
(base_dir, list_name)¶ Creates a new archive directory for the given list_name unless one already exists. Returns the path of the archive directory.
Returns the footer of a DataFrame of emails.
A footer is a string occurring at the tail of most messages. Messages can be a DataFrame or a Series.
-
bigbang.archive.
load
(path)¶
-
bigbang.archive.
load_data
(name: str, archive_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/bigbang-py/checkouts/v0.4.0/archives/', mbox: bool = False)¶ Load the data associated with an archive name, given as a string.
Attempt to open {archives-directory}/NAME.csv as data.
Failing that, if the name is a URL, it will try to derive the list name from that URL and load the .csv again.
- Parameters
name (str) –
archive_dir (str, default CONFIG.mail_path) –
mbox (bool, default False) – If true, expects and opens an mbox file at this path
- Returns
data
- Return type
pandas.DataFrame
-
bigbang.archive.
messages_to_dataframe
(messages)¶ Turn a list of parsed messages into a dataframe of message data, indexed by message-id, with column-names from headers.
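The transformation described here can be approximated with the standard `email` package and pandas. This sketch assumes the parsed messages are `email.message.Message` objects and that the body is stored in a `Body` column; both are assumptions for illustration, not a statement of what messages_to_dataframe does internally.

```python
import email

import pandas as pd

# Two hypothetical raw messages: headers, a blank line, then the body.
raw_messages = [
    "Message-ID: <1@example.org>\nFrom: a@example.org\nSubject: Hello\n\nFirst body\n",
    "Message-ID: <2@example.org>\nFrom: b@example.org\nSubject: Re: Hello\n"
    "In-Reply-To: <1@example.org>\n\nSecond body\n",
]
parsed = [email.message_from_string(raw) for raw in raw_messages]

rows = []
for msg in parsed:
    # One column per header name; headers absent from a message become NaN.
    row = dict(msg.items())
    row["Body"] = msg.get_payload()
    rows.append(row)

# Index the resulting table by Message-ID.
frame = pd.DataFrame(rows).set_index("Message-ID")
```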
-
bigbang.archive.
open_list_archives
(archive_name: str, archive_dir: str = '/home/docs/checkouts/readthedocs.org/user_builds/bigbang-py/checkouts/v0.4.0/archives/', mbox: bool = False) → pandas.core.frame.DataFrame¶ Return all email messages contained in the specified directory as a DataFrame.
- Parameters
archive_name (str) – the name of a subdirectory of the directory specified in argument archive_dir. This directory is expected to contain files with extensions .txt, .mail, or .mbox. These files are all expected to be in mbox format, i.e. a series of blocks of text starting with headers (colon-separated key-value pairs) followed by an email body.
archive_dir (str) – directory containing all messages.
mbox (bool, default False) – True if there’s an mbox file already available for this archive.
- Returns
data
- Return type
pandas.DataFrame
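For orientation, the sketch below reads mbox-formatted files with the extensions named above using the standard library's `mailbox` module. The temporary directory, the file name, and the sample messages are invented for the example, and open_list_archives may read files in a different way.

```python
import mailbox
import os
import tempfile

# Two hypothetical messages in mbox format: each starts with a "From " line.
mbox_text = (
    "From a@example.org Wed Jan  1 09:00:00 2020\n"
    "Message-ID: <1@example.org>\n"
    "From: a@example.org\n"
    "\n"
    "First body\n"
    "\n"
    "From b@example.org Thu Jan  2 10:00:00 2020\n"
    "Message-ID: <2@example.org>\n"
    "From: b@example.org\n"
    "\n"
    "Second body\n"
)

with tempfile.TemporaryDirectory() as archive_dir:
    # Archive files often carry a .txt extension despite being mbox data.
    with open(os.path.join(archive_dir, "2020-January.txt"), "w") as handle:
        handle.write(mbox_text)

    messages = []
    for file_name in sorted(os.listdir(archive_dir)):
        if file_name.endswith((".txt", ".mail", ".mbox")):
            box = mailbox.mbox(os.path.join(archive_dir, file_name))
            messages.extend(box)  # iterating a mailbox yields its messages
            box.close()

message_ids = [message["Message-ID"] for message in messages]
```

A list of parsed messages like this one is the kind of input that a function such as messages_to_dataframe could then turn into a table.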