Email Threading

An example illustrating the use of the email threading algorithm on the Fedora mailing list.

from __future__ import print_function

from time import time
import sys
import platform

import pandas as pd
import requests

pd.options.display.float_format = '{:,.3f}'.format

if platform.system() == 'Windows' and sys.version_info > (3, 0):
    print('This example currently fails on Windows with PY3 (issue #')
    sys.exit()

dataset_name = "fedora_ml_3k_subset"  # see list of available datasets

BASE_URL = "http://localhost:5001/api/v0"  # FreeDiscovery server URL

0. Load the test dataset

url = BASE_URL + '/example-dataset/{}'.format(dataset_name)
print(" GET", url)
res = requests.get(url)
res = res.json()

# To use a custom dataset, simply specify the following variables
data_dir = res

Out:

GET http://localhost:5001/api/v0/example-dataset/fedora_ml_3k_subset

1. Parse emails

url = BASE_URL + '/feature-extraction'
print(" POST", url)
res = requests.post(url, json={'parse_email_headers': True}).json()

dsid = res['id']
print("   => received {}".format(list(res.keys())))
print("   => dsid = {}".format(dsid))

url = BASE_URL+'/feature-extraction/{}'.format(dsid)
print(" POST", url)
requests.post(url, json={'data_dir': data_dir})

Out:

POST http://localhost:5001/api/v0/feature-extraction
   => received ['id']
   => dsid = b06fe87954794b29
 POST http://localhost:5001/api/v0/feature-extraction/b06fe87954794b29

2. Thread Emails

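url = BASE_URL + '/email-threading/'
print(" POST", url)
t0 = time()
res = requests.post(url, json={'parent_id': dsid}).json()

mid = res['id']
print("     => model id = {}".format(mid))


def print_thread(container, depth=0):
    # Print one message per line, indented by its depth in the thread.
    # NOTE: the 'subject' and 'children' keys are assumed from the printed
    # output; adjust them to the fields actually returned by the server.
    print(''.join(['> ' * depth,
                   container['subject'],
                   ' (id={})'.format(container['id'])]))
    for child in container['children']:
        print_thread(child, depth + 1)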

Out:

POST http://localhost:5001/api/v0/email-threading/
     => model id = 268913466c904c3d

Threading examples

See https://www.redhat.com/archives/rhl-devel-list/2008-October/thread.html for ground truth data. Mailman has a maximum threading depth of 3, unlike FreeDiscovery.
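Because Mailman caps the nesting depth while FreeDiscovery does not, it can be useful to measure how deep a thread actually goes before comparing the two. The following is a minimal sketch, assuming each thread container is a dict with a 'children' list; the key names, the helper functions and the sample data are hypothetical and are not taken from the FreeDiscovery API.

```python
# Hypothetical thread tree; the keys 'id', 'subject' and 'children'
# are assumptions about the response layout, not confirmed API fields.
sample_thread = {
    'id': 1, 'subject': 'Example topic',
    'children': [
        {'id': 2, 'subject': 'Re: Example topic', 'children': [
            {'id': 3, 'subject': 'Re: Example topic', 'children': []},
        ]},
        {'id': 4, 'subject': 'Re: Example topic', 'children': []},
    ],
}


def thread_depth(container):
    """Return the number of reply levels below this message (0 for a leaf)."""
    children = container.get('children', [])
    if not children:
        return 0
    return 1 + max(thread_depth(child) for child in children)


def thread_size(container):
    """Return the total number of messages in the thread."""
    return 1 + sum(thread_size(c) for c in container.get('children', []))


depth = thread_depth(sample_thread)
size = thread_size(sample_thread)
print(depth, size)
```

A thread whose depth exceeds the limit of the reference archive will necessarily be displayed flatter there, so a difference in nesting alone does not indicate a threading error.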

for idx in [-1, -2, -3, -4, -5]:  # get latest threads
    print(' ')
    # assumption: the response lists the thread roots under 'data'
    print_thread(res['data'][idx])

Out:

Strange ext3 problem (id=3049)
> Re: Strange ext3 problem (id=3055)

Dia has .la files (id=3039)
> Re: Dia has .la files (id=3040)
> > Re: Dia has .la files (id=3048)
> > > Re: Dia has .la files (id=3047)
> > > > Re: Dia has .la files (id=3051)
> > > > > Re: Dia has .la files (id=3052)
> > > > > > Re: Dia has .la files (id=3057)
> > > > > > > Re: Dia has .la files (id=3060)
> > > > > > Re: Dia has .la files (id=3061)
> > > Re: Dia has .la files (id=3062)

PackageKit 0.3.10 into F9 (id=3032)
> Re: PackageKit 0.3.10 into F9 (id=3034)

rawhide report: 20081031 changes (id=3019)
> Re: rawhide report: 20081031 changes (id=3021)
> > Re: rawhide report: 20081031 changes (id=3023)
> > > Re: rawhide report: 20081031 changes (id=3025)
> > > > Re: rawhide report: 20081031 changes (id=3026)
> Re: rawhide report: 20081031 changes (id=3022)
> > Re: rawhide report: 20081031 changes (id=3027)
> > > Re: rawhide report: 20081031 changes (id=3028)
> > > Re: rawhide report: 20081031 changes (id=3029)
> > > > Re: rawhide report: 20081031 changes (id=3030)
> > > > > Re: rawhide report: 20081031 changes (id=3031)
> > > > > Re: rawhide report: 20081031 changes (id=3035)
> > > > > > Re: rawhide report: 20081031 changes (id=3037)
> > Re: rawhide report: 20081031 changes (id=3033)
> > Re: rawhide report: 20081031 changes (id=3036)

Seeking comaintainers (id=3002)
> Re: Seeking comaintainers (id=3004)
> > Re: Seeking comaintainers (id=3006)
> > > Re: Seeking comaintainers (id=3008)
> Re: Seeking comaintainers (id=3005)
> > Re: Seeking comaintainers (id=3009)
> Re: Seeking comaintainers (id=3007)
> Re: Seeking comaintainers (id=3010)
> > Re: Seeking comaintainers (id=3011)
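The indented listing above is convenient to read, but a flat table of (id, parent id, depth) rows is often easier to compare against ground truth. The sketch below rebuilds the last thread shown above as nested dicts and flattens it in pre-order. The 'subject' and 'children' keys and the `node` and `flatten_thread` helpers are hypothetical, introduced only for illustration.

```python
def node(id_, subject, children=()):
    """Build a hypothetical thread container (key names are assumptions)."""
    return {'id': id_, 'subject': subject, 'children': list(children)}


RE = 'Re: Seeking comaintainers'

# The "Seeking comaintainers" thread, transcribed from the printed output.
thread = node(3002, 'Seeking comaintainers', [
    node(3004, RE, [node(3006, RE, [node(3008, RE)])]),
    node(3005, RE, [node(3009, RE)]),
    node(3007, RE),
    node(3010, RE, [node(3011, RE)]),
])


def flatten_thread(container, parent_id=None, depth=0):
    """Yield one (id, parent_id, depth, subject) row per message, pre-order."""
    yield (container['id'], parent_id, depth, container['subject'])
    for child in container.get('children', []):
        for row in flatten_thread(child, container['id'], depth + 1):
            yield row


rows = list(flatten_thread(thread))
for row in rows:
    print(row)
```

The resulting list of tuples can be passed to `pd.DataFrame(rows, columns=['id', 'parent_id', 'depth', 'subject'])` if a tabular view is preferred.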

Total running time of the script: ( 0 minutes 11.619 seconds)