PuppyIR API Reference

puppy.service

This module contains classes for building a service.

ServiceManager

class puppy.service.ServiceManager(config)

Manages a collection of search services for a PuppyIR Service

add_search_service(obj)

Add a search service

remove_search_service(service_or_name)

Removes an existing search service

SearchService

class puppy.service.SearchService(service_manager, name)

Models the configuration of a QueryFilter pipeline, Search Engine, and a ResultFilter pipeline.

add_filters(*filters)

Add one or more filters. Detects each filter's type (e.g., QueryFilter, ResultModifier) and places it in the appropriate pipeline.
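As a rough illustration of the type detection described above, the sketch below dispatches on isinstance. The classes and attribute names are hypothetical stand-ins defined only for this example; they are not the real puppy classes, and the actual add_filters implementation may differ.

```python
class QueryFilter(object):
    """Stand-in for puppy.query.QueryFilter (hypothetical, for illustration)."""


class ResultModifier(object):
    """Stand-in for puppy.result.ResultModifier (hypothetical, for illustration)."""


class PipelineSketch(object):
    """Keeps one list per pipeline; only two pipelines are shown for brevity."""

    def __init__(self):
        self.query_filters = []
        self.result_modifiers = []

    def add_filters(self, *filters):
        # Route each filter to a pipeline according to its base class.
        for f in filters:
            if isinstance(f, QueryFilter):
                self.query_filters.append(f)
            elif isinstance(f, ResultModifier):
                self.result_modifiers.append(f)
            else:
                raise TypeError("Unknown filter type: %r" % (f,))


pipeline = PipelineSketch()
query_filter, result_modifier = QueryFilter(), ResultModifier()
pipeline.add_filters(result_modifier, query_filter)
```

The caller can pass filters in any order; each one ends up in the pipeline that matches its type.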

add_query_filter(query_filter)

Add filter to query filter pipeline.

add_query_modifier(query_modifier)

Add modifier to query modifier pipeline.

add_result_filter(result_filter)

Add filter to result filter pipeline.

add_result_modifier(result_modifier)

Add modifier to result modifier pipeline.

clear_filters()

Remove all existing filters.

replace_filters(*filters)

Replace existing filters with new filters.

search(query, offset=0, highlight=False)

Search with query and result filter pipelines active.

Parameters:

  • query (puppy.model.Query): search query
  • offset (int): result offset

Returns:

  • results (puppy.model.Response): search results

simplesearch(query, offset=0)

Search without query and result filter pipelines.

Parameters:

  • query (puppy.model.Query): search query
  • offset (int): result offset

Returns:

  • results (puppy.model.Response): search results

puppy.search

SearchEngine

class puppy.search.SearchEngine(service, **args)

Abstract search engine interface.

configure_opener()

Configure urllib2 opener with network proxy

search(query, pos=1)

Perform a search.

Parameters:

  • query (puppy.model.Query): query object
  • pos (int): result offset

Returns:

  • results (puppy.model.Response): results of the search

puppy.search.exceptions

SearchEngineError

class puppy.search.exceptions.SearchEngineError(searchEngineName, error, **extras)

Use for exceptions in which the search engine wrapper fails - this can be for multiple reasons, for example: the lack of a proxy server in config or a search service being down. Callers should respond to this in a way that fails gracefully.

ApiKeyError

class puppy.search.exceptions.ApiKeyError(searchEngineName, apiFieldName)

Use for exceptions in which the API key for a wrapper that requires one has not been supplied. Callers should respond in a way that makes the developer aware of the issue, so that the necessary steps can be taken to rectify it; this error is not intended for users of an application.

puppy.search.engine

Bing

class puppy.search.engine.Bing(service, site=None, source='web', adult='Strict', market='en-GB', resultsPerPage=10, lat=None, lon=None, radius=5, sites=None, **args)

Bing search engine wrapper.

Note: you can only use location-based searching with the source types ‘web’ and ‘phonebook’; however, with ‘web’ it doesn’t appear to have any effect.

Parameters:

  • site: restrict the search to specific websites
  • source (str): the options are web, image and news
  • adult (str): safe-search level; the default is Strict, and changing it is not recommended
  • market (str): which region’s results are prioritised, e.g. en-GB for the UK
  • resultsPerPage (int): how many results per page
  • lat (double): the latitude of the place you want to search in
  • lon (double): the longitude of the place you want to search in
  • radius (int): the radius around lat and lon to retrieve results from; the limit is 0-250 miles

BingSite

class puppy.search.engine.BingSite(service, site=None, source='web')

Bing site search engine wrapper.

Parameters:

  • site: if you wish to search specific websites for results
  • source (str): web, image, news are the options

BingV2 (API Version 2.2)

class puppy.search.engine.BingV2(service, source='Web', adult='Strict', market='en-GB', resultsPerPage=8, filters=None, sortBy=None, newsCategory=None, sites=None, **args)

Bing search engine wrapper for Version 2.X of the API - allowing for News, Web, Image and Video results to be retrieved

One of the key advantages of this wrapper is access to the new API features, including the ability to combine multiple sources into a mash-up, e.g. source=“Web+Image” gets results from both the web and image search services.

You must include your application’s Bing ID in your service manager config to use this service. It should be under the identifier “bing_api_key”.

If you are using the ‘Spell’ source, you must set the ‘market’ parameter to match the language you are querying in, e.g. en-GB for UK English or nl-NL for Dutch.

Parameters:

  • source (str): what source the results should come from, valid options are: Web, News, Video, Image, Spell, RelatedSearch.
  • adult (str): Strict is the default, not recommended to change this
  • market (str): For UK: en-GB, For Netherlands: nl-NL etc
  • resultsPerPage (int): How many results per page

– Image and Video Only Parameters –

  • filters (str): filter options separated by ‘+’; you can only have one of each type (see the Bing API documentation for what these are)

– Video and News Only Parameters –

  • sortBy (str): sort news by either ‘Date’ or ‘Relevance’

– News Only Parameters –

  • newsCategory (str): what sort of news is wanted - see BingAPI for list of options, for example: ‘rt_ScienceAndTechnology’

Digg

class puppy.search.engine.Digg(service, resultsPerPage=8, sort=None, topic=None, media='all', max_date=None, min_date=None, **args)

Digg search engine wrapper.

Parameters:

  • resultsPerPage (int): How many results per page
  • sort (str): how to sort results (see the Digg site for a list of the options), e.g. ‘submit_date-desc’ sorts by the item’s submit date
  • topic (str): restrict the search to a specific topic (see Digg site for a list of them)
  • media (str): options are: ‘all’, ‘news’, ‘videos’, ‘images’
  • max_date (unix timestamp - converted to str): latest date results returned were posted
  • min_date (unix timestamp - converted to str): earliest date results returned were posted
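The date parameters above are Unix timestamps passed as strings. A minimal sketch of producing such values with the standard library follows; the helper name and the example dates are hypothetical, and treating the datetimes as UTC is an assumption.

```python
import calendar
from datetime import datetime


def to_unix_str(dt):
    """Convert a naive datetime (assumed to be in UTC) to a Unix timestamp string."""
    return str(calendar.timegm(dt.timetuple()))


# Hypothetical date range: results posted during July 2011.
min_date = to_unix_str(datetime(2011, 7, 1))
max_date = to_unix_str(datetime(2011, 8, 1))
```

The resulting strings could then be passed as the min_date and max_date arguments of the wrapper.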

EmmaSearch

class puppy.search.engine.EmmaSearch(service, age='v', resultsPerPage=10, **args)

EmmaSearch search engine.

Parameters:

  • age (str): values - ‘v’ for adults (shows all ‘a’ and ‘k’ results too), ‘a’ for teenagers, and ‘k’ for children
  • resultsPerPage (int): How many results per page - the default for the emma search service is 10

Flickr

class puppy.search.engine.Flickr(service, sortBy='relevance', safeSearch=3, mediaType='photos', resultsPerPage=8, bbox=None, **args)

Flickr search engine.

You must include your application’s Flickr ID in your service manager config to use this service. It should be under the identifier “flickr_api_key”.

Parameters:

  • sortBy (str): how we sort results, default is relevance see Flickr API for more details
  • safeSearch (int): default is 3, i.e. strict, not recommended to change this
  • mediaType (str): all, photos, videos are the options
  • resultsPerPage (int): How many results per page
  • bbox (str): the bounding box to search within, in the format ‘swLongitude,swLatitude,neLongitude,neLatitude’, with each name replaced by the value of the corresponding corner
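To make the bbox format concrete, here is a small sketch that joins the four corner values in the order given above. The helper name and the coordinates are hypothetical and chosen only for illustration.

```python
def make_bbox(sw_lon, sw_lat, ne_lon, ne_lat):
    """Build a 'swLongitude,swLatitude,neLongitude,neLatitude' string."""
    return ",".join(str(v) for v in (sw_lon, sw_lat, ne_lon, ne_lat))


# Hypothetical bounding box; longitude comes before latitude for each corner.
bbox = make_bbox(-4.35, 55.80, -4.15, 55.92)
```

Note that longitude precedes latitude in each pair, which is the reverse of the more common ‘lat,lon’ ordering.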

Google Geocode

class puppy.search.engine.GoogleGeocode(service, sensor='false', **args)

GoogleGeocode search service.

Parameters:

  • sensor (str): whether your device has a GPS sensor; the options are ‘false’ (the default, not recommended to change) and ‘true’, and the value must be lowercase

Google (deprecated)

class puppy.search.engine.Google(service, **args)

Google search engine.

Google has regrettably retired this search API

Code is left here for reference purposes

Google Books

class puppy.search.engine.GoogleBooks(service, resultsPerPage=8, langRestrict=None, filter=None, orderBy='relevance', printType=None, **args)

Google’s Books search engine api.

See the documentation for how to specify advanced queries, e.g. Hobbit+inauthor:Tolkien

Parameters:

  • resultsPerPage (int): How many results per page
  • langRestrict (str): restrict results to a certain language, e.g. ‘en’ for English
  • filter (str): filter volumes by type/availability; valid values are ‘partial’, ‘full’, ‘free-ebooks’, ‘paid-ebooks’ and ‘ebooks’
  • orderBy (str): order either by ‘relevance’ or ‘newest’
  • printType (str): restrict the results by type of media; the options are ‘all’, ‘books’ and ‘magazines’

Guardian

class puppy.search.engine.Guardian(service, orderBy='newest', **args)

Guardian search engine.

Warning: ‘StandFirst’ is the result field used for the description; it is a form of abstract for the news story. It can, however, contain HTML tags, so care needs to be taken when processing these results outside the framework.

Parameters:

  • orderBy (str): the options are - ‘newest’, ‘oldest’ and ‘relevance’

iTunes

class puppy.search.engine.ITunes(service, country='gb', lang='en_gb', media=None, resultsPerPage=8, explicit=False, **args)

iTunes search engine wrapper - allowing for Track, Album and Artist search results to be retrieved

If you change either lang or country, change the other to match, e.g. if you set lang to ‘en_gb’ you should also set country to ‘gb’, and vice versa.

Parameters:

  • country (str): which iTunes store to search, e.g. ‘gb’ for the UK or ‘us’ for the USA
  • lang (str): the language the results should be returned in
  • media (str): the media type you want to search for, e.g. ‘movie’ (see the iTunes documentation for the others)
  • resultsPerPage (int): how many results per page
  • explicit (boolean): whether to return results marked as including explicit content (not recommended to change this)

LastFM

class puppy.search.engine.LastFM(service, source='track', resultsPerPage=8, artist=None, **args)

LastFM search engine wrapper - allowing for Track, Album and Artist search results to be retrieved

You must include your application’s LastFM ID in your service manager config to use this service. It should be under the identifier “last_fm_api_key”.

Parameters:

  • source (str): What to search for, valid types: ‘track’, ‘album’ and ‘artist’
  • resultsPerPage (int): How many results per page

– Track Only Parameters –

  • artist (str): the artist for the tracks you are searching for

OpenSearch

class puppy.search.engine.OpenSearch(service, url, **args)

OpenSearch search engine.

Picassa

class puppy.search.engine.Picassa(service, resultsPerPage=8, access='public', kind='photo', **args)

Picassa search engine.

Parameters:

  • resultsPerPage (int): select how many results per page
  • access (str): public, private (it is not recommended to change to private), all, visible
  • kind (str): photo is the only working option

Rotten Tomatoes

class puppy.search.engine.RottenTomatoes(service, resultsPerPage=8, **args)

RottenTomatoes search engine.

You must include your application’s Rotten Tomatoes ID in your service manager config to use this service. It should be under the identifier “rotten_tomatoes_api_key”.

Parameters:

  • resultsPerPage (int): How many results per page

SimpleWikipedia

class puppy.search.engine.SimpleWikipedia(service, resultsPerPage=8, **args)

Simple Wikipedia search engine.

Parameters:

  • resultsPerPage (int): How many results per page - note with Wiki only one page of results is returned.

Solr

class puppy.search.engine.Solr(service, url, **args)

Solr search engine.

SoundCloud

class puppy.search.engine.SoundCloud(service, resultsPerPage=8, order=None, tags=None, filter=None, genres=None, types=None, bpmFilter=None, durationFilter=None, createdFilter=None, **args)

SoundCloud search engine wrapper for a music sharing application, allowing tracks to be searched for.

You must include your API key for SoundCloud in your service manager config to use this service. It should be under the identifier “soundcloud_api_key”.

Parameters:

  • resultsPerPage (int): the number of results to return for a search query
  • order (str): the order to return results in; valid values are ‘created_at’ and ‘hotness’ (the latter being the popularity of tracks)
  • tags (str): a comma-separated string of tags to look for along with the query
  • filter (str): filter by access category; valid values are ‘all’, ‘public’, ‘private’, ‘streamable’ and ‘downloadable’
  • genres (str): a comma-separated string of genres to look for along with the query (see the SoundCloud site for a list of genres)
  • types (str): a comma-separated string of types of track to look for along with the query (see the SoundCloud site for a list of types; examples are ‘live’ or ‘demo’)
  • bpmFilter (dict): filters by beats per minute; the fields are ‘from’ and ‘to’, both with int values
  • durationFilter (dict): filters by duration of the track; the fields are ‘from’ and ‘to’, both with int values in milliseconds
  • createdFilter (dict): filters by when the track was created; each field value is a string of the format ‘yyyy-mm-dd hh:mm:ss’
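The three dict-valued filters above could be built as in the sketch below. The values are hypothetical, and using ‘from’ and ‘to’ as the field names of createdFilter is an assumption carried over from the other two filters, since the description only specifies the string format.

```python
from datetime import datetime


def format_created(dt):
    """Format a datetime as 'yyyy-mm-dd hh:mm:ss'."""
    return dt.strftime("%Y-%m-%d %H:%M:%S")


# Tracks between 90 and 120 beats per minute.
bpm_filter = {"from": 90, "to": 120}

# Tracks between one and five minutes long, expressed in milliseconds.
duration_filter = {"from": 60 * 1000, "to": 5 * 60 * 1000}

# Field names 'from' and 'to' are assumed here (see the note above).
created_filter = {
    "from": format_created(datetime(2011, 1, 1, 0, 0, 0)),
    "to": format_created(datetime(2011, 7, 1, 9, 30, 0)),
}
```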

Spotify

class puppy.search.engine.Spotify(service, resultType='tracks', **args)

Spotify search engine.

Parameters:

  • resultType (str): what result type should be returned, the options are: ‘tracks’, ‘albums’, ‘artists’

Twitter

class puppy.search.engine.Twitter(service, language='en', type='mixed', geocode=None, resultsPerPage=9, includeEntities=False, **args)

Twitter search engine.

Parameters:

  • language (str): the language code, e.g. en for English or de for German
  • type (str): what sort of results to get; the options are mixed, recent and popular
  • geocode (str): restricts results to those around a specific location
  • includeEntities (boolean): if this is true then a lot of meta-data is included (mentions, associated images, associated urls)
  • resultsPerPage (int): results per page

The geocode format is ‘latitude,longitude,radius’, for example ‘37.781157,-122.398720,1mi’.
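A small helper can assemble a geocode string in the format shown above. This is only a sketch: the function name is hypothetical, and fixing the coordinates to six decimal places is a choice made here to reproduce the example, not a documented requirement.

```python
def make_geocode(lat, lon, radius, unit="mi"):
    """Build a 'latitude,longitude,radius' string such as '37.781157,-122.398720,1mi'."""
    if unit not in ("mi", "km"):
        raise ValueError("unit must be 'mi' or 'km'")
    return "%.6f,%.6f,%s%s" % (lat, lon, radius, unit)


geocode = make_geocode(37.781157, -122.398720, 1)
```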

WebSpellChecker

Register for an API key here: http://www.webservius.com/services/spellcheck/spellcheck

class puppy.search.engine.WebSpellChecker(service, language='en_GB', **args)

Web Spell Checker’s search engine api.

You must include your application’s Web Spell Checker API key in your service manager config to use this service. It should be under the identifier “web_spell_api_key”.

Parameters:

  • language (str): the language/dictionary to check against, e.g. ‘en_US’ for American English or ‘nl_NL’ for Dutch (this is case-sensitive)

Wikipedia

class puppy.search.engine.Wikipedia(service, resultsPerPage=8, wikiLanguage='en', **args)

Wikipedia search engine.

Parameters:

  • resultsPerPage (int): How many results per page - note with Wiki only one page of results is returned.
  • wikiLanguage (str): which wiki API you want to search; the default is en (English), and nl (Dutch) is another example

Wordnik

class puppy.search.engine.Wordnik(service, source='Definitions', resultsPerPage=8, sourceDictionaries=None, **args)

Wordnik search engine wrapper for their dictionary-based API. This wrapper allows for searching for spelling corrections, examples of the usage of a word (in web results), and definitions of a word.

This API supports English only; other languages are not supported.

You must include your API key for Wordnik in your service manager config to use this service. It should be under the identifier “wordnik_api_key”.

With sourceDictionaries (see below) you can select multiple values, e.g. ‘ahd,webster’, but this will just return the first definition from ahd or, if it doesn’t have one, from webster.

Parameters:

  • source (str): what source the results should come from, valid options are: ‘Suggestions’, ‘Examples’, ‘Definitions’
  • resultsPerPage (int): How many (the maximum number) results to return

– Definitions Only Parameters –

  • sourceDictionaries (str): the dictionary to search, if blank it defaults to the first definition. Other options are: ‘all’, ‘ahd’, ‘century’, ‘wiktionary’, ‘webster’, ‘wordnet’

Yahoo

class puppy.search.engine.Yahoo(service, **args)

Yahoo search engine.

You must include your application’s Yahoo ID in your service manager config to use this service. It should be under the identifier “yahoo_api_key”.

YouTube

class puppy.search.engine.YouTube(service, **args)

YouTube search engine.

YouTubeV2 (API Version 2.0)

class puppy.search.engine.YouTubeV2(service, resultsPerPage=8, safeSearch='strict', orderBy='relevance', format=None, location=None, locationRadius=None, onlyLocation=False)

YouTube search engine API version 2.

The orderBy parameter allows results to be ordered by their relevance to a given language; see below for more.

Parameters:

  • resultsPerPage (int): results per page
  • safeSearch (str): the default is strict; it’s not recommended to change this
  • orderBy (str): rating, viewCount, relevance or relevance_lang_<languageCode>
  • format (int): defines whether videos must conform to a standard, for example 5 means only videos that can be embedded
  • location (str): defines the location the videos should be from, in the format ‘lat,lon’
  • locationRadius (str): the radius around the location within which results should be returned, in the format ‘<radius><unit>’; the valid units are m, km, ft and mi
  • onlyLocation (boolean): only return results with a location (i.e. a geotag)

Replace <languageCode> with a language code, e.g. ‘en’ for English or ‘nl’ for Dutch.

Whoosh wrappers

The following two wrappers both require Whoosh to be installed; for instructions on installing Whoosh see 1.2 - Requirements and Installation.

Because Whoosh is an extra rather than a requirement, you cannot import them from ‘puppy.search.engine’ like the above wrappers; instead, import them using the code below:

from puppy.search.engine.whooshQueryEngine import WhooshQueryEngine
from puppy.search.engine.whooshQuerySuggestEngine import WhooshQuerySuggestEngine

Whoosh Query Engine

class puppy.search.engine.whooshQueryEngine.WhooshQueryEngine(service, whoosh_query_index_dir='', resultsPerPage=8, **args)

Whoosh Query log search engine.

Parameters:

  • resultsPerPage (int): select how many results per page
  • whoosh_query_index_dir (str): the absolute path of the directory where queries are indexed

Whoosh Query Suggest Engine

class puppy.search.engine.whooshQuerySuggestEngine.WhooshQuerySuggestEngine(service, whoosh_query_index_dir='', resultsPerPage=8, **args)

Whoosh Query log search engine.

Parameters:

  • resultsPerPage (int): select how many results per page
  • whoosh_query_index_dir (str): the absolute path of the directory where queries are indexed

puppy.model

Response

class puppy.model.Response(results={})

Data model for search results. Response has four main attributes:

  • feed: dictionary of information about the search results {title, description, etc}
  • entries: list of search results [{title, link, summary, etc}, ...]
  • namespaces: list of namespaces [“http://a9.com/-/spec/opensearch/1.1/”, ...]
  • version: source type of original results “rss/atom/json”

get_itemsperpage()

Returns the number of results per page, as reported by the search engine (usually, 10, except for Google, 8)

This number is used mainly by page algorithms.

Returns:

  • opensearch_itemsperpage: the itemsperpage value

get_startindex()

Returns the start item for the current “page”, as reported by the search engine. It is usually 0 or items per page * page number

This number is used mainly by page algorithms.
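The start index arithmetic described above can be sketched as follows. The function names are hypothetical, and zero-based page numbering is an assumption; a paging algorithm would typically feed in the values reported by get_itemsperpage() and get_totalresults().

```python
def start_index(page_number, items_per_page):
    """Start item for a zero-based page number: items per page * page number."""
    return page_number * items_per_page


def page_count(total_results, items_per_page):
    """Number of pages needed to show all results (ceiling division)."""
    return -(-total_results // items_per_page)


first_page_start = start_index(0, 10)
fourth_page_start = start_index(3, 8)
pages_needed = page_count(25, 10)
```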

Returns:

  • opensearch_startindex: the startindex value

get_totalresults()

Returns the total number of results, as reported by the search engine.

This number is used mainly by page algorithms.

Returns:

  • opensearch_totalresults: the total_results value

static parse_feed(xml_feed)

Parses an RSS/Atom feed of OpenSearch results

static parse_json_suggestions(json_doc)

Parse a JSON document of OpenSearch suggestions

static parse_xml_suggestions(xml_doc)

Parse an XML document of OpenSearch suggestions

to_atom()

Creates XML from an OpenSearch Response.

Returns:

  • response_xml (str): OpenSearch Response as an Atom feed

to_json()

Creates JSON from a Response object.

Returns:

  • response_json (str): Response as JSON

to_rss()

Creates an RSS feed from a Response object.

Returns:

  • response_xml (str): Response as RSS feed

Query

class puppy.model.Query(search_terms)

OpenSearch Query.

Models an OpenSearch Query element.

See: http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_Query_element

static parse_xml(oss_xml)

Parse OpenSearch Query XML.

Parameters:

  • oss_xml (str): OpenSearch Query XML

Returns:

  • puppy.model.OpenSearch.Query

TODO code Query.parse_xml()

write_xml()

Creates XML for OpenSearch Query.

Returns:

  • query_xml (str): OpenSearch Query as XML

TODO code Query.write_xml()

Description

class puppy.model.Description

OpenSearch Description.

Models an OpenSearch Description document.

See: http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_description_document

static parse_xml(oss_xml)

Parse OpenSearch Description XML.

Parameters:

  • oss_xml (str): OpenSearch Description XML

Returns:

  • puppy.model.OpenSearch.Description

TODO code Description.parse_xml()

write_xml()

Creates XML for an OpenSearch Description document.

Returns:

  • description_xml (str): OpenSearch Description document as XML

TODO code Description.write_xml()

puppy.query

QueryFilter

class puppy.query.QueryFilter(order=0)

Base class for filters that can reject queries, e.g., by detecting profanity.

QueryModifier

class puppy.query.QueryModifier(order=0)

Base class for all query modifiers

puppy.query.exceptions

QueryRejectionError

class puppy.query.exceptions.QueryRejectionError

Raise when a filter rejects a query, e.g., because profanity is detected.

QueryFilterError

class puppy.query.exceptions.QueryFilterError

Use for exceptions in which the filter operationally failed and the filter’s function cannot be realized. Callers should respond to this as if a modification or rejection decision cannot be made, as opposed to puppy.query.exceptions.QueryRejectionError, in which case the query should not be issued.

QueryModifierError

class puppy.query.exceptions.QueryModifierError

Use for exceptions in which the modifier operationally failed and the modifier’s function cannot be realized. Callers should respond to this as if a modification or rejection decision cannot be made, as opposed to puppy.query.exceptions.QueryRejectionError, in which case the query should not be issued.
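The difference between a rejection and an operational failure can be sketched from the caller’s side as below. The exception classes are stand-ins defined for this example, filters are modelled as plain callables, and the three status strings are hypothetical; how an application treats an undecided query is left to the caller.

```python
class QueryRejectionError(Exception):
    """Stand-in: a filter decided the query must not be issued."""


class QueryFilterError(Exception):
    """Stand-in: a filter failed operationally, so no decision was made."""


def check_query(query, filters):
    """Return 'accepted', 'rejected' or 'undecided' for a query."""
    status = "accepted"
    for query_filter in filters:
        try:
            query_filter(query)
        except QueryRejectionError:
            # A definite decision: do not issue the query.
            return "rejected"
        except QueryFilterError:
            # No decision could be made; keep checking the other filters.
            status = "undecided"
    return status


def passes(query):
    pass


def rejects(query):
    raise QueryRejectionError("profanity detected")


def fails(query):
    raise QueryFilterError("filter service unavailable")
```

Here a rejection always wins over an operational failure, because it is the only outcome that represents an actual decision about the query.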

puppy.query.filter

BlackListFilter

class puppy.query.filter.BlackListFilter(order=0, terms='')

The BlackList filter checks whether the query contains any terms on the black list; if so, the query is rejected.

Parameters:

  • order (int): filter precedence
  • terms: a string containing all the blacklisted terms, separated by spaces
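The check described above amounts to a set membership test over the query terms. The sketch below is an illustration only: the function name is hypothetical, and matching whole terms case-insensitively is an assumption rather than documented behaviour.

```python
def contains_blacklisted(search_terms, blacklist):
    """Return True if any query term appears in the space-separated blacklist."""
    banned = set(blacklist.lower().split())
    return any(term in banned for term in search_terms.lower().split())


rejected = contains_blacklisted("very Bad query", "bad worse")
accepted = not contains_blacklisted("a perfectly fine query", "bad worse")
```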

WDYL Profanity Filter

class puppy.query.filter.WdylProfanityQueryFilter(order=0)

Rejects queries containing profanity using WDYL (by Google).

What this does is query the service, which returns a JSON response of true or false depending upon the presence, or not, of profanity.

Warning: there is a marked delay in waiting for a response from this service - overuse can lead to poor performance.

Parameters:

  • order (int): filter precedence

SuggestionFilter

class puppy.query.filter.SuggestionFilter(order=0)

Creates a set of suggestions based upon the query search terms.

As of July 2011, Sergio’s web service no longer responds and is therefore not usable.

Parameters:

  • order (int): filter precedence

WhooshQueryLogger

About the Whoosh Query Logger

The Whoosh Query Logger, like the search engine wrappers for Whoosh, requires Whoosh to be installed; for instructions on installing Whoosh see 1.2 - Requirements and Installation.

Because Whoosh is an extra rather than a requirement, you cannot import it from ‘puppy.query.filter’ like the above filters; instead, import the Whoosh Query Logger using the code below:

from puppy.query.filter.whooshQueryLogger import WhooshQueryLogger

class puppy.query.filter.whooshQueryLogger.WhooshQueryLogger(order=0, whoosh_query_index_dir='', unique=True)

Logs queries in a Whoosh index. If there is no index in the given directory, a Whoosh index is created to store queries, with the schema:

Schema(title=ID(unique=True, stored=True), content=TEXT(stored=True), ncontent=NGRAM(stored=True), issued=DATETIME(stored=True))

Parameters:

  • order (int): filter precedence
  • whoosh_query_index_dir (string): path to the directory of the index
  • unique (boolean): if True, only unique queries are stored; otherwise all queries are stored

puppy.query.modifier

SpellingModifier

class puppy.query.modifier.SpellingCorrectingModifier(order=0)

This modifies queries by replacing misspelt words with the first “correct” spelling found.

Parameters:

  • order (int): modifier precedence
  • language (string): defines which dictionary to use; it defaults to en_US, change this as required

Warning: this requires the PyEnchant library to be installed.

TermExpansionModifier

class puppy.query.modifier.TermExpansionModifier(order=0, terms='')

Expands original query terms with extra terms.

Parameters:

  • order (int): modifier precedence
  • terms (string): the terms to be appended to the query

KidsModifier

class puppy.query.modifier.KidsModifier(order=0, modifiers=None)

Base class for QueryModifiers aiming to modify queries to be more child-directed, e.g., by appending “for kids” to the query, creating Q -> Q'. After modification, the Google Suggest service is checked for the presence of Q'; if it exists as a frequent query, Q' is returned to the caller; otherwise, Q (the original query) is returned (hence a null operation).
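The modify-then-verify logic described above can be sketched as follows. This is not the library’s implementation: the function name and parameters are hypothetical, and the suggestion lookup is injected as a callable so that the example runs without contacting any web service.

```python
def kidify(query, suffix="for kids", is_frequent=lambda candidate: False):
    """Return the modified query only if it is known as a frequent query."""
    candidate = "%s %s" % (query, suffix)
    # Fall back to the original query (a null operation) when the check fails.
    return candidate if is_frequent(candidate) else query


# Hypothetical stand-in for a suggestion service lookup.
known_queries = {"dinosaurs for kids"}

modified = kidify("dinosaurs", is_frequent=known_queries.__contains__)
unchanged = kidify("tax returns", is_frequent=known_queries.__contains__)
```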

KidifyQueryModifier

class puppy.query.modifier.KidsModifier(order=0, modifiers=None)

Base class for QueryModifiers aiming to modify queries to be more child-directed, e.g., by appending “for kids” to the query, creating Q -> Q'. After modification, the Google Suggest service is checked for the presence of Q'; if it exists as a frequent query, Q' is returned to the caller; otherwise, Q (the original query) is returned (hence a null operation).

puppy.result

ResultFilter

class puppy.result.ResultFilter(order=0)

Abstract result filter.

ResultModifier

class puppy.result.ResultModifier(order=0)

Change result.

puppy.result.exceptions

ResultFilterError

class puppy.result.exceptions.ResultFilterError

Use for exceptions in which the filter operationally failed and the filter’s function cannot be realized. Callers should respond to this as if a rejection decision cannot be made.

ResultModifierError

class puppy.result.exceptions.ResultModifierError

Use for exceptions in which the modifier operationally failed and the modifier’s function cannot be realized. Callers should respond to this as if a modification cannot be made to the result.

puppy.result.filter

Age Filter

class puppy.result.filter.AgeFilter(age, ageField=None, ageTolerance=3, minAgeField='minAge', maxAgeField='maxAge', order=0, rejectUnclassified=False)

Filters search results based on either a specific age or if the age is within an age range defined by the result.

Note: there is no default value for ‘age’; it must be passed to this filter so that it can be customised for the application using it.

Options:

  • order (int): filter precedence
  • age (int): the age of the user the results should be filtered for
  • ageField (str): the field name for the age in the results
  • ageTolerance (int): if results just have an age field, this defines the tolerance for accepting results, e.g. within 3 years of the ‘age’ parameter; must be >= 0
  • maxAgeField (str): the field name for the maximum age in the results
  • minAgeField (str): the field name for the minimum age (if used)
  • rejectUnclassified (boolean): if set to true, results without an age classification will be rejected automatically
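One way to read the options above is sketched below, with each result modelled as a plain dict. The function and its parameter names are hypothetical, and the order in which the single age field and the age range are checked is an assumption.

```python
def accept_result(result, age, age_field=None, age_tolerance=3,
                  min_age_field="minAge", max_age_field="maxAge",
                  reject_unclassified=False):
    """Decide whether a result dict is suitable for a user of the given age."""
    if age_field and age_field in result:
        # Single age: accept if within the tolerance of the user's age.
        return abs(result[age_field] - age) <= age_tolerance
    if min_age_field in result and max_age_field in result:
        # Age range: accept if the user's age falls inside it.
        return result[min_age_field] <= age <= result[max_age_field]
    # No age classification on this result.
    return not reject_unclassified


close_enough = accept_result({"age": 9}, 10, age_field="age")
too_old = accept_result({"age": 15}, 10, age_field="age")
in_range = accept_result({"minAge": 6, "maxAge": 12}, 10)
```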

Duplicate Filter

class puppy.result.filter.DuplicateFilter(order=0, existingResults=[])

Filters search results and rejects ones already stored by an application. This is done by default by checking the link field of new results against a list of ones currently stored by the application. If found, they are rejected.

Options:

  • order (int): defines when, in the pipeline, this filter will be executed
  • existingResults (list of str): URLs already stored in the application; results with these links are rejected to avoid getting them again
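The link comparison described above can be sketched in a few lines. Results are modelled here as dicts with a ‘link’ key, matching the entries described for puppy.model.Response; the function name and URLs are hypothetical.

```python
def drop_duplicates(entries, existing_links):
    """Keep only entries whose link is not already stored by the application."""
    seen = set(existing_links)
    return [entry for entry in entries if entry.get("link") not in seen]


stored = ["http://example.com/a"]
new_entries = [
    {"title": "A", "link": "http://example.com/a"},
    {"title": "B", "link": "http://example.com/b"},
]
fresh = drop_duplicates(new_entries, stored)
```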

ExclusionFilter

class puppy.result.filter.ExclusionFilter(order=0, terms='', customFields=[])

Filters search results based on a list of words to exclude; if any of these are found, the result in question is rejected.

Options:

  • order (int): defines when, in the pipeline, this filter will be executed
  • terms (str): terms that, if appearing in the result, will cause it to be rejected; separated by ‘+’
  • customFields (list of str): extra fields in the results to filter with the exclusion list, dependent upon their existence in the search service results
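A sketch of how a ‘+’-separated exclusion list might be applied to a result follows. The function name, the default fields, and the case-insensitive substring matching are all assumptions made for illustration; the real filter may match differently.

```python
def is_excluded(entry, terms, fields=("title", "summary")):
    """Return True if any '+'-separated term occurs in the chosen fields."""
    words = [term.lower() for term in terms.split("+") if term]
    text = " ".join(str(entry.get(field, "")) for field in fields).lower()
    return any(word in text for word in words)


scary = is_excluded({"title": "Scary Movie", "summary": "a film"}, "scary+violent")
harmless = is_excluded({"title": "Puppies"}, "scary+violent")
# Extra fields play the role of customFields in this sketch.
custom = is_excluded({"title": "Clip", "tags": "violent"}, "scary+violent",
                     fields=("title", "summary", "tags"))
```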

ProfanityFilter

class puppy.result.filter.WdylProfanityFilter(order=0, customFields=[])

Filters results with profanity in them by using the WDYL service.

Pros:
  • no hardcoded blacklist; the service provider does the work of keeping the service effective
Cons:
  • URL call, which can mean delay; effort should be made to parallelize the pipeline so that this effect is minimal

Parameters:

  • order (int): filter precedence
  • customFields (list of str): extra fields in the results to filter with the exclusion list, dependent upon their existence in the search service results

SuitabilityFilter

This filter evaluates a result on its suitability for children by assigning it a score of 0 (unsuitable) to 1.0 (100% suitable). For an example of how to use this filter check out the SeSu prototype - see 1.3 - Running Prototypes for details on how to install and run this prototype.

N.B. this filter requires Java to be installed and present on the system path (see: 1.2 - Requirements and Installation for more).

class puppy.result.filter.SuitabilityFilter(order=0, threshold=0.0)

Filters search results based on the results’ suitability for children.

Parameters:

  • order (int): filter precedence
  • threshold (double): confidence score to accept a page (e.g. 0.5)

puppy.result.modifier

BlackListModifier

class puppy.result.modifier.BlackListResultModifier(order=0, terms='', customFields=[])

Processes result entry content and replaces blacklisted words

Options:

  • order (int): modifier precedence
  • terms (str): terms that, if appearing in the result, will be replaced with ***
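The replacement described above can be sketched with a regular expression. The function name is hypothetical, and both the space-separated term list and the case-insensitive whole-word matching are assumptions; the real modifier may behave differently.

```python
import re


def mask_terms(text, terms):
    """Replace each blacklisted word in text with ***."""
    for term in terms.split():
        # \b limits the match to whole words; re.escape guards special characters.
        pattern = r"\b%s\b" % re.escape(term)
        text = re.sub(pattern, "***", text, flags=re.IGNORECASE)
    return text


masked = mask_terms("A darn fine day", "darn heck")
```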

puppy.logging

QueryLogger

class puppy.logging.QueryLogger(search_service, log_mode=0, log_dir=None, log_period='midnight', log_maxbytes=1000000000)

Logs queries for a SearchService.

The QueryLogger will log all queries submitted to a SearchService, sending them to:

  1. current directory, if there is no given log_dir
  2. specific directory, if a log_dir filepath is given (by constructor or config)

The QueryLogger has five logging modes:

  1. OneBigFile - single file that grows endlessly
  2. Rotational - files rotate when the log file size reaches 1GB
  3. Timed - files rotate every day at midnight
  4. Permanent Rotating - files rotate when the log file size limit is reached, taking a unique name for each new log
  5. Gzip Permanent Rotating - same as above but using gzip compression

create_logger()

Create a new logger with a specific handler

get_log_dir()

Find the log_dir if none was passed in the constructor.

Checks the service config files, then defaults to creating a log directory in the current working directory

log(query)

Logs a query using a simple [ISO Timestamp, Query Terms] format
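For orientation, three of the logging modes above map naturally onto handlers from Python’s standard logging package, as in the sketch below. This is an illustration rather than the QueryLogger’s actual code: the string mode names, the backupCount value, and the comma-separated entry layout are assumptions.

```python
import logging
import logging.handlers
from datetime import datetime


def make_handler(mode, path, max_bytes=1000000000, period="midnight"):
    """Pick a stdlib handler for a (hypothetical) mode name."""
    if mode == "rotational":
        # Roll over when the file reaches max_bytes (1GB by default).
        return logging.handlers.RotatingFileHandler(
            path, maxBytes=max_bytes, backupCount=5, delay=True)
    if mode == "timed":
        # Roll over once a day at midnight.
        return logging.handlers.TimedRotatingFileHandler(
            path, when=period, delay=True)
    # One big file: a single file that grows without rotation.
    return logging.FileHandler(path, delay=True)


def format_entry(query_terms, when=None):
    """Format a log entry as 'ISO timestamp, query terms'."""
    when = when or datetime.now()
    return "%s, %s" % (when.isoformat(), query_terms)


entry = format_entry("red pandas", datetime(2011, 7, 1, 12, 0, 0))
```

Passing delay=True defers opening the log file until the first record is written, so constructing a handler does not create an empty file.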

EventLogger

class puppy.logging.EventLogger(application_name, log_mode=0, log_dir=None, log_period='midnight', log_maxbytes=1000000000)

The EventLogger will log all events submitted to it from an application (either standalone or Django), sending them to:

  1. current directory, if there is no given log_dir
  2. specific directory, if a log_dir filepath is given by the constructor

The EventLogger has five logging modes:

  1. OneBigFile - single file that grows endlessly
  2. Rotational - files rotate when the log file size reaches 1GB by default; this can be changed via log_maxbytes
  3. Timed - files rotate every day at midnight
  4. Permanent Rotating - files rotate when the log file size limit is reached, taking a unique name for each new log
  5. Gzip Permanent Rotating - same as above but using gzip compression

create_logger()

Create a new logger with a specific handler

get_log_dir(log_dir)

Works out what the log directory will be. There are three cases:

  1. A log dir is given by the constructor and exists - use it
  2. A log dir is given but does not exist - make it and use it
  3. A log dir is not given - create one from the current path

log(identifier, action, **data)

Logs an event using a simple [ISO Timestamp, Identifier, Action, Data] format

  • Identifier (str): what identifies this log entry to a user, e.g. IP address, Cookie Number etc
  • Action (str): the action the user has done, e.g. page request
  • Data (str): data associated with the action done