API Documentation
Dedupe Objects
Class for active learning deduplication. Use deduplication when you have data that can contain multiple records that can all refer to the same entity.
class Dedupe(variable_definition[, data_sample[, num_cores]])
Initialize a Dedupe object with a field definition.
Parameters: - variable_definition (dict) – A variable definition is a list of dictionaries describing the variables that will be used for training a model.
- num_cores (int) – the number of cpus to use for parallel processing, defaults to the number of cpus available on the machine
- data_sample – DEPRECATED

# initialize from a defined set of fields
variables = [
    {'field' : 'Site name', 'type': 'String'},
    {'field' : 'Address', 'type': 'String'},
    {'field' : 'Zip', 'type': 'String', 'has missing': True},
    {'field' : 'Phone', 'type': 'String', 'has missing': True}
]

deduper = dedupe.Dedupe(variables)
sample(data[, sample_size=15000[, blocked_proportion=0.5[, original_length]]])
In order to learn how to deduplicate your records, dedupe needs a sample of your records to train on. This method takes a mixture of a random sample of pairs of records and a selection of pairs of records that are much more likely to be duplicates.
Parameters: - data (dict) – A dictionary-like object indexed by record ID where the values are dictionaries representing records.
- sample_size (int) – Number of record tuples to return. Defaults to 15,000.
- blocked_proportion (float) – The proportion of record pairs to be sampled from similar records, as opposed to randomly selected pairs. Defaults to 0.5.
- original_length – If data is a subsample of all your data, original_length should be the size of your complete data. If not specified, original_length defaults to the length of data.
deduper.sample(data_d, 150000, .5)
uncertainPairs()
Returns a list of pairs of records from the sample of record pairs that Dedupe is most curious to have labeled.
This method is mainly useful for building a user interface for training a matching model.
> pair = deduper.uncertainPairs()
> print pair
[({'name' : 'Georgie Porgie'}, {'name' : 'Georgette Porgette'})]
markPairs(labeled_examples)
Add user-labeled pairs of records to training data and update the matching model.
This method is useful for building a user interface for training a matching model or for adding training data from an existing source.
Parameters: labeled_examples (dict) – a dictionary with two keys, match and distinct; the values are lists that can contain pairs of records.

labeled_examples = {'match' : [],
                    'distinct' : [({'name' : 'Georgie Porgie'},
                                   {'name' : 'Georgette Porgette'})]
                   }
deduper.markPairs(labeled_examples)
train([recall=0.95[, index_predicates=True]])
Learn the final pairwise classifier and blocking rules. Requires that adequate training data has already been provided.
Parameters: - recall (float) –
The proportion of true dupe pairs in our training data that the learned blocks must cover. If we lower the recall, there will be pairs of true dupes that we will never directly compare.
recall should be a float between 0.0 and 1.0; the default is 0.95.
- index_predicates (bool) –
Should dedupe consider predicates that rely upon indexing the data? Index predicates can be slower and take substantial memory.
Defaults to True.
deduper.train()
writeTraining(file_obj)
Write json data that contains labeled examples to a file object.
Parameters: file_obj (file) – File object.

with open('./my_training.json', 'w') as f:
    deduper.writeTraining(f)
readTraining(training_file)
Read training from a previously saved training data file object.
Parameters: training_file (file) – File object containing training data

with open('./my_training.json') as f:
    deduper.readTraining(f)
cleanupTraining()
Delete data we used for training. data_sample, training_pairs, training_data, and activeLearner can be very large objects. When you are done training you may want to free up the memory they use.

deduper.cleanupTraining()
threshold(data[, recall_weight=1.5])
Returns the threshold that maximizes the expected F score, a weighted average of precision and recall, for a sample of data.
Parameters: - data (dict) – a dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names
- recall_weight (float) – sets the tradeoff between precision and recall. I.e. if you care twice as much about recall as you do precision, set recall_weight to 2.
> threshold = deduper.threshold(data, recall_weight=2)
> print threshold
0.21
match(data[, threshold=0.5[, max_components=30000]])
Identifies records that all refer to the same entity. Returns tuples containing a sequence of record ids and a corresponding sequence of confidence scores as floats between 0 and 1. The record_ids within each set should refer to the same entity, and the confidence score is a measure of our confidence that a particular entity belongs in the cluster.
This method should only be used for small to moderately sized datasets. For larger data, use matchBlocks.
Parameters: - data (dict) – a dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names
- threshold (float) –
a number between 0 and 1 (default is 0.5). We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold.
Lowering the number will increase recall, raising it will increase precision
- max_components (int) – Dedupe splits records into connected components and then clusters each component. Clustering uses about N^2 memory, where N is the size of the components. Max components sets the maximum size of a component dedupe will try to cluster. If a component is larger than max_components, dedupe will try to split it into smaller components. Defaults to 30K.
> duplicates = deduper.match(data, threshold=0.5)
> print duplicates
[((1, 2, 3), (0.790, 0.860, 0.790)), ((4, 5), (0.720, 0.720)), ((10, 11), (0.899, 0.899))]
blocker.index_fields
A dictionary of the Index Predicates that will be used for blocking. The keys are the fields the predicates will operate on.
blocker.index(field_data, field)
Indexes the data from a field for use in an index predicate.
Parameters: - field data (set) – The unique field values that appear in your data.
- field (string) – The name of the field
for field in deduper.blocker.index_fields:
    field_data = set(record[field] for record in data)
    deduper.index(field_data, field)
blocker(data)
Generate the predicates for records. Yields tuples of (predicate, record_id).
Parameters: data (list) – A sequence of tuples of (record_id, record_dict). Can often be created by data_dict.items().

> data = [(1, {'name' : 'bob'}), (2, {'name' : 'suzanne'})]
> blocked_ids = deduper.blocker(data)
> print list(blocked_ids)
[('foo:1', 1), ..., ('bar:1', 100)]
matchBlocks(blocks[, threshold=.5])
Partitions blocked data and returns a list of clusters, where each cluster is a tuple of record ids.
Keyword arguments
Parameters: - blocks (list) –
Sequence of record blocks. Each record block is a tuple containing records to compare. Each block should contain two or more records. Along with each record, there should also be information on the blocks that cover that record.
For example, if we have three records:
(1, {'name' : 'Pat', 'address' : '123 Main'})
(2, {'name' : 'Pat', 'address' : '123 Main'})
(3, {'name' : 'Sam', 'address' : '123 Main'})
and two predicates: “Whole name” and “Whole address”. These predicates will produce the following blocks:
# Block 1 (Whole name)
(1, {'name' : 'Pat', 'address' : '123 Main'})
(2, {'name' : 'Pat', 'address' : '123 Main'})

# Block 2 (Whole name)
(3, {'name' : 'Sam', 'address' : '123 Main'})

# Block 3 (Whole address)
(1, {'name' : 'Pat', 'address' : '123 Main'})
(2, {'name' : 'Pat', 'address' : '123 Main'})
(3, {'name' : 'Sam', 'address' : '123 Main'})
So, the blocks you feed to matchBlocks should look like this, after filtering out the singleton block.
blocks = (((1, {'name' : 'Pat', 'address' : '123 Main'}, set([])),
           (2, {'name' : 'Pat', 'address' : '123 Main'}, set([]))),
          ((1, {'name' : 'Pat', 'address' : '123 Main'}, set([1])),
           (2, {'name' : 'Pat', 'address' : '123 Main'}, set([1])),
           (3, {'name' : 'Sam', 'address' : '123 Main'}, set([]))))

deduper.matchBlocks(blocks)
Within each block, dedupe will compare every pair of records. This is expensive. Checking to see if two sets intersect is much cheaper, and if the block coverage information for two records does intersect, that means that this pair of records has been compared in a previous block, and dedupe will skip comparing this pair of records again.
- threshold (float) –
Number between 0 and 1 (default is .5). We will only consider record pairs as duplicates if their estimated duplicate likelihood is greater than the threshold.
Lowering the number will increase recall, raising it will increase precision.
classifier
By default, the classifier is an L2 regularized logistic regression classifier. If you want to use a different classifier, you can overwrite this attribute with your custom object. Your classifier object must have fit and predict_proba methods, like sklearn models.

from sklearn.linear_model import LogisticRegression

deduper = dedupe.Dedupe(fields)
deduper.classifier = LogisticRegression()
thresholdBlocks(blocks, recall_weight=1.5)
Returns the threshold that maximizes the expected F score, a weighted average of precision and recall, for a sample of blocked data.
For larger datasets, you will need to use the thresholdBlocks and matchBlocks methods. These methods require you to create blocks of records. See the documentation for the matchBlocks method for how to construct blocks.

threshold = deduper.thresholdBlocks(blocked_data, recall_weight=2)

Keyword arguments
Parameters: - blocks (list) – See matchBlocks
- recall_weight (float) – Sets the tradeoff between precision and recall. I.e. if you care twice as much about recall as you do precision, set recall_weight to 2.
writeSettings(file_obj[, index=False])
Write a settings file that contains the data model and predicates to a file object.
Parameters: - file_obj (file) – File object.
- index (bool) – Should the indexes of index predicates be saved? You will probably only want to call this after indexing all of your records.
with open('my_learned_settings', 'wb') as f:
    deduper.writeSettings(f, index=True)
loaded_indices
Indicates whether indices for index predicates were loaded from a settings file.
StaticDedupe Objects
Class for deduplication using saved settings. If you have already trained dedupe, you can load the saved settings with StaticDedupe.
class StaticDedupe(settings_file[, num_cores])
Initialize a Dedupe object with saved settings.
Parameters: - settings_file (file) – A file object containing settings info produced from the Dedupe.writeSettings() of a previous, active Dedupe object.
- num_cores (int) – the number of cpus to use for parallel processing, defaults to the number of cpus available on the machine
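For reference, a minimal loading sketch, assuming 'my_learned_settings' was written earlier by Dedupe.writeSettings():

with open('my_learned_settings', 'rb') as f:
    deduper = dedupe.StaticDedupe(f)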
threshold(data[, recall_weight=1.5])
Returns the threshold that maximizes the expected F score, a weighted average of precision and recall, for a sample of data.
Parameters: - data (dict) – a dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names
- recall_weight (float) – sets the tradeoff between precision and recall. I.e. if you care twice as much about recall as you do precision, set recall_weight to 2.
> threshold = deduper.threshold(data, recall_weight=2)
> print threshold
0.21
match(data[, threshold=0.5[, max_components=30000]])
Identifies records that all refer to the same entity. Returns tuples containing a sequence of record ids and a corresponding sequence of confidence scores as floats between 0 and 1. The record_ids within each set should refer to the same entity, and the confidence score is a measure of our confidence that a particular entity belongs in the cluster.
This method should only be used for small to moderately sized datasets. For larger data, use matchBlocks.
Parameters: - data (dict) – a dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names
- threshold (float) –
a number between 0 and 1 (default is 0.5). We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold.
Lowering the number will increase recall, raising it will increase precision
- max_components (int) – Dedupe splits records into connected components and then clusters each component. Clustering uses about N^2 memory, where N is the size of the components. Max components sets the maximum size of a component dedupe will try to cluster. If a component is larger than max_components, dedupe will try to split it into smaller components. Defaults to 30K.
> duplicates = deduper.match(data, threshold=0.5)
> print duplicates
[((1, 2, 3), (0.790, 0.860, 0.790)), ((4, 5), (0.720, 0.720)), ((10, 11), (0.899, 0.899))]
blocker.index_fields
A dictionary of the Index Predicates that will be used for blocking. The keys are the fields the predicates will operate on.
blocker.index(field_data, field)
Indexes the data from a field for use in an index predicate.
Parameters: - field data (set) – The unique field values that appear in your data.
- field (string) – The name of the field
for field in deduper.blocker.index_fields:
    field_data = set(record[field] for record in data)
    deduper.index(field_data, field)
blocker(data)
Generate the predicates for records. Yields tuples of (predicate, record_id).
Parameters: data (list) – A sequence of tuples of (record_id, record_dict). Can often be created by data_dict.items().

> data = [(1, {'name' : 'bob'}), (2, {'name' : 'suzanne'})]
> blocked_ids = deduper.blocker(data)
> print list(blocked_ids)
[('foo:1', 1), ..., ('bar:1', 100)]
matchBlocks(blocks[, threshold=.5])
Partitions blocked data and returns a list of clusters, where each cluster is a tuple of record ids.
Keyword arguments
Parameters: - blocks (list) –
Sequence of record blocks. Each record block is a tuple containing records to compare. Each block should contain two or more records. Along with each record, there should also be information on the blocks that cover that record.
For example, if we have three records:
(1, {'name' : 'Pat', 'address' : '123 Main'})
(2, {'name' : 'Pat', 'address' : '123 Main'})
(3, {'name' : 'Sam', 'address' : '123 Main'})
and two predicates: “Whole name” and “Whole address”. These predicates will produce the following blocks:
# Block 1 (Whole name)
(1, {'name' : 'Pat', 'address' : '123 Main'})
(2, {'name' : 'Pat', 'address' : '123 Main'})

# Block 2 (Whole name)
(3, {'name' : 'Sam', 'address' : '123 Main'})

# Block 3 (Whole address)
(1, {'name' : 'Pat', 'address' : '123 Main'})
(2, {'name' : 'Pat', 'address' : '123 Main'})
(3, {'name' : 'Sam', 'address' : '123 Main'})
So, the blocks you feed to matchBlocks should look like this, after filtering out the singleton block.
blocks = (((1, {'name' : 'Pat', 'address' : '123 Main'}, set([])),
           (2, {'name' : 'Pat', 'address' : '123 Main'}, set([]))),
          ((1, {'name' : 'Pat', 'address' : '123 Main'}, set([1])),
           (2, {'name' : 'Pat', 'address' : '123 Main'}, set([1])),
           (3, {'name' : 'Sam', 'address' : '123 Main'}, set([]))))

deduper.matchBlocks(blocks)
Within each block, dedupe will compare every pair of records. This is expensive. Checking to see if two sets intersect is much cheaper, and if the block coverage information for two records does intersect, that means that this pair of records has been compared in a previous block, and dedupe will skip comparing this pair of records again.
- threshold (float) –
Number between 0 and 1 (default is .5). We will only consider record pairs as duplicates if their estimated duplicate likelihood is greater than the threshold.
Lowering the number will increase recall, raising it will increase precision.
classifier
By default, the classifier is an L2 regularized logistic regression classifier. If you want to use a different classifier, you can overwrite this attribute with your custom object. Your classifier object must have fit and predict_proba methods, like sklearn models.

from sklearn.linear_model import LogisticRegression

deduper = dedupe.Dedupe(fields)
deduper.classifier = LogisticRegression()
thresholdBlocks(blocks, recall_weight=1.5)
Returns the threshold that maximizes the expected F score, a weighted average of precision and recall, for a sample of blocked data.
For larger datasets, you will need to use the thresholdBlocks and matchBlocks methods. These methods require you to create blocks of records. See the documentation for the matchBlocks method for how to construct blocks.

threshold = deduper.thresholdBlocks(blocked_data, recall_weight=2)

Keyword arguments
Parameters: - blocks (list) – See matchBlocks
- recall_weight (float) – Sets the tradeoff between precision and recall. I.e. if you care twice as much about recall as you do precision, set recall_weight to 2.
writeSettings(file_obj[, index=False])
Write a settings file that contains the data model and predicates to a file object.
Parameters: - file_obj (file) – File object.
- index (bool) – Should the indexes of index predicates be saved? You will probably only want to call this after indexing all of your records.
with open('my_learned_settings', 'wb') as f:
    deduper.writeSettings(f, index=True)
loaded_indices
Indicates whether indices for index predicates were loaded from a settings file.
RecordLink Objects
Class for active learning record linkage.
Use RecordLinkMatching when you have two datasets that you want to merge. Each dataset, individually, should contain no duplicates. A record from the first dataset can match one and only one record from the second dataset and vice versa. A record from the first dataset need not match any record from the second dataset and vice versa.
For larger datasets, you will need to use the thresholdBlocks and matchBlocks methods. These methods require you to create blocks of records.
For RecordLink, each block should be a pair of dictionaries of records. Each block consists of all the records that share a particular predicate, as output by the blocker method of RecordLink.
Within a block, the first dictionary should consist of records from the first dataset, with the keys being record ids and the values being the record. The second dictionary should consist of records from the second dataset.
Example
> data_1 = {'A1' : {'name' : 'howard'}}
> data_2 = {'B1' : {'name' : 'howie'}}
...
> blocks = defaultdict(lambda : ({}, {}))
>
> for block_key, record_id in linker.blocker(data_1.items()) :
> blocks[block_key][0].update({record_id : data_1[record_id]})
> for block_key, record_id in linker.blocker(data_2.items()) :
> if block_key in blocks :
> blocks[block_key][1].update({record_id : data_2[record_id]})
>
> blocked_data = blocks.values()
> print blocked_data
[({'A1' : {'name' : 'howard'}}, {'B1' : {'name' : 'howie'}})]
class RecordLink(variable_definition[, data_sample[, num_cores]])
Initialize a RecordLink object with a variable definition.
Parameters: - variable_definition (dict) – A variable definition is a list of dictionaries describing the variables that will be used for training a model.
- num_cores (int) – the number of cpus to use for parallel processing, defaults to the number of cpus available on the machine
- data_sample – DEPRECATED
We assume that the fields you want to compare across datasets have the same field name, as in the sketch below.
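A minimal initialization sketch; the 'name' and 'address' fields are assumptions, chosen to mirror the Dedupe example above:

variables = [
    {'field' : 'name', 'type' : 'String'},
    {'field' : 'address', 'type' : 'String'}
]

linker = dedupe.RecordLink(variables)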
sample(data_1, data_2[, sample_size=150000[, blocked_proportion=0.5[, original_length_1[, original_length_2]]]])
In order to learn how to link your records, dedupe needs a sample of your records to train on. This method takes a mixture of a random sample of pairs of records and a selection of pairs of records that are much more likely to be duplicates.
Parameters: - data_1 (dict) – A dictionary of records from first dataset, where the keys are record_ids and the values are dictionaries with the keys being field names.
- data_2 (dict) – A dictionary of records from second dataset, same form as data_1
- sample_size (int) – The size of the sample to draw. Defaults to 150,000
- blocked_proportion (float) – The proportion of record pairs to be sampled from similar records, as opposed to randomly selected pairs. Defaults to 0.5.
- original_length_1 – If data_1 is a subsample of your first dataset, original_length_1 should be the size of the complete first dataset. If not specified, original_length_1 defaults to the length of data_1
- original_length_2 – If data_2 is a subsample of your second dataset, original_length_2 should be the size of the complete second dataset. If not specified, original_length_2 defaults to the length of data_2
linker.sample(data_1, data_2, 150000)
threshold(data_1, data_2, recall_weight)
Returns the threshold that maximizes the expected F score, a weighted average of precision and recall, for a sample of data.
Parameters: - data_1 (dict) – a dictionary of records from first dataset, where the keys are record_ids and the values are dictionaries with the keys being field names.
- data_2 (dict) – a dictionary of records from second dataset, same form as data_1
- recall_weight (float) – sets the tradeoff between precision and recall. I.e. if you care twice as much about recall as you do precision, set recall_weight to 2.
> threshold = deduper.threshold(data_1, data_2, recall_weight=2)
> print threshold
0.21
match(data_1, data_2, threshold)
Identifies pairs of records that refer to the same entity. Returns tuples containing a set of record ids and a confidence score as a float between 0 and 1. The record_ids within each set should refer to the same entity, and the confidence score is the estimated probability that the records refer to the same entity.
This method should only be used for small to moderately sized datasets. For larger data, use matchBlocks.
Parameters: - data_1 (dict) – a dictionary of records from first dataset, where the keys are record_ids and the values are dictionaries with the keys being field names.
- data_2 (dict) – a dictionary of records from second dataset, same form as data_1
- threshold (float) –
a number between 0 and 1 (default is 0.5). We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold.
Lowering the number will increase recall, raising it will increase precision
matchBlocks(blocks[, threshold=.5])
Partitions blocked data and returns a list of clusters, where each cluster is a tuple of record ids.
Keyword arguments
Parameters: - blocks (list) –
Sequence of record blocks. Each record block is a tuple containing two sequences of records: the records from the first dataset and the records from the second dataset. Within each block there should be at least one record from each dataset. Along with each record, there should also be information on the blocks that cover that record.
For example, if we have two records from dataset A and one record from dataset B:
# Dataset A
(1, {'name' : 'Pat', 'address' : '123 Main'})
(2, {'name' : 'Sam', 'address' : '123 Main'})

# Dataset B
(3, {'name' : 'Pat', 'address' : '123 Main'})
and two predicates: “Whole name” and “Whole address”. These predicates will produce the following blocks:
# Block 1 (Whole name)
(1, {'name' : 'Pat', 'address' : '123 Main'})
(3, {'name' : 'Pat', 'address' : '123 Main'})

# Block 2 (Whole name)
(2, {'name' : 'Sam', 'address' : '123 Main'})

# Block 3 (Whole address)
(1, {'name' : 'Pat', 'address' : '123 Main'})
(2, {'name' : 'Sam', 'address' : '123 Main'})
(3, {'name' : 'Pat', 'address' : '123 Main'})
So, the blocks you feed to matchBlocks should look like this,
blocks = (([(1, {'name' : 'Pat', 'address' : '123 Main'}, set([]))],
           [(3, {'name' : 'Pat', 'address' : '123 Main'}, set([]))]),
          ([(1, {'name' : 'Pat', 'address' : '123 Main'}, set([1])),
            (2, {'name' : 'Sam', 'address' : '123 Main'}, set([]))],
           [(3, {'name' : 'Pat', 'address' : '123 Main'}, set([1]))]))

linker.matchBlocks(blocks)
Within each block, dedupe will compare every pair of records. This is expensive. Checking to see if two sets intersect is much cheaper, and if the block coverage information for two records does intersect, that means that this pair of records has been compared in a previous block, and dedupe will skip comparing this pair of records again.
- threshold (float) –
Number between 0 and 1 (default is .5). We will only consider record pairs as duplicates if their estimated duplicate likelihood is greater than the threshold.
Lowering the number will increase recall, raising it will increase precision.
uncertainPairs()
Returns a list of pairs of records from the sample of record pairs that Dedupe is most curious to have labeled.
This method is mainly useful for building a user interface for training a matching model.
> pair = deduper.uncertainPairs()
> print pair
[({'name' : 'Georgie Porgie'}, {'name' : 'Georgette Porgette'})]
markPairs(labeled_examples)
Add user-labeled pairs of records to training data and update the matching model.
This method is useful for building a user interface for training a matching model or for adding training data from an existing source.
Parameters: labeled_examples (dict) – a dictionary with two keys, match and distinct; the values are lists that can contain pairs of records.

labeled_examples = {'match' : [],
                    'distinct' : [({'name' : 'Georgie Porgie'},
                                   {'name' : 'Georgette Porgette'})]
                   }
deduper.markPairs(labeled_examples)
train([recall=0.95[, index_predicates=True]])
Learn the final pairwise classifier and blocking rules. Requires that adequate training data has already been provided.
Parameters: - recall (float) –
The proportion of true dupe pairs in our training data that the learned blocks must cover. If we lower the recall, there will be pairs of true dupes that we will never directly compare.
recall should be a float between 0.0 and 1.0; the default is 0.95.
- index_predicates (bool) –
Should dedupe consider predicates that rely upon indexing the data? Index predicates can be slower and take substantial memory.
Defaults to True.
deduper.train()
writeTraining(file_obj)
Write json data that contains labeled examples to a file object.
Parameters: file_obj (file) – File object.

with open('./my_training.json', 'w') as f:
    deduper.writeTraining(f)
readTraining(training_file)
Read training from a previously saved training data file object.
Parameters: training_file (file) – File object containing training data

with open('./my_training.json') as f:
    deduper.readTraining(f)
cleanupTraining()
Delete data we used for training. data_sample, training_pairs, training_data, and activeLearner can be very large objects. When you are done training you may want to free up the memory they use.

deduper.cleanupTraining()
classifier
By default, the classifier is an L2 regularized logistic regression classifier. If you want to use a different classifier, you can overwrite this attribute with your custom object. Your classifier object must have fit and predict_proba methods, like sklearn models.

from sklearn.linear_model import LogisticRegression

deduper = dedupe.Dedupe(fields)
deduper.classifier = LogisticRegression()
thresholdBlocks(blocks, recall_weight=1.5)
Returns the threshold that maximizes the expected F score, a weighted average of precision and recall, for a sample of blocked data.
For larger datasets, you will need to use the thresholdBlocks and matchBlocks methods. These methods require you to create blocks of records. See the documentation for the matchBlocks method for how to construct blocks.

threshold = deduper.thresholdBlocks(blocked_data, recall_weight=2)

Keyword arguments
Parameters: - blocks (list) – See matchBlocks
- recall_weight (float) – Sets the tradeoff between precision and recall. I.e. if you care twice as much about recall as you do precision, set recall_weight to 2.
writeSettings(file_obj[, index=False])
Write a settings file that contains the data model and predicates to a file object.
Parameters: - file_obj (file) – File object.
- index (bool) – Should the indexes of index predicates be saved? You will probably only want to call this after indexing all of your records.
with open('my_learned_settings', 'wb') as f:
    deduper.writeSettings(f, index=True)
loaded_indices
Indicates whether indices for index predicates were loaded from a settings file.
StaticRecordLink Objects
Class for record linkage using saved settings. If you have already trained a record linkage instance, you can load the saved settings with StaticRecordLink.
class StaticRecordLink(settings_file[, num_cores])
Initialize a RecordLink object with saved settings.
Parameters: - settings_file (file) – File object containing settings data produced from the RecordLink.writeSettings() of a previous, active RecordLink object.
- num_cores (int) – the number of cpus to use for parallel processing, defaults to the number of cpus available on the machine

with open('my_settings_file', 'rb') as f:
    linker = StaticRecordLink(f)
threshold(data_1, data_2, recall_weight)
Returns the threshold that maximizes the expected F score, a weighted average of precision and recall, for a sample of data.
Parameters: - data_1 (dict) – a dictionary of records from first dataset, where the keys are record_ids and the values are dictionaries with the keys being field names.
- data_2 (dict) – a dictionary of records from second dataset, same form as data_1
- recall_weight (float) – sets the tradeoff between precision and recall. I.e. if you care twice as much about recall as you do precision, set recall_weight to 2.
> threshold = deduper.threshold(data_1, data_2, recall_weight=2)
> print threshold
0.21
match(data_1, data_2, threshold)
Identifies pairs of records that refer to the same entity. Returns tuples containing a set of record ids and a confidence score as a float between 0 and 1. The record_ids within each set should refer to the same entity, and the confidence score is the estimated probability that the records refer to the same entity.
This method should only be used for small to moderately sized datasets. For larger data, use matchBlocks.
Parameters: - data_1 (dict) – a dictionary of records from first dataset, where the keys are record_ids and the values are dictionaries with the keys being field names.
- data_2 (dict) – a dictionary of records from second dataset, same form as data_1
- threshold (float) –
a number between 0 and 1 (default is 0.5). We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold.
Lowering the number will increase recall, raising it will increase precision
matchBlocks(blocks[, threshold=.5])
Partitions blocked data and returns a list of clusters, where each cluster is a tuple of record ids.
Keyword arguments
Parameters: - blocks (list) –
Sequence of record blocks. Each record block is a tuple containing two sequences of records: the records from the first dataset and the records from the second dataset. Within each block there should be at least one record from each dataset. Along with each record, there should also be information on the blocks that cover that record.
For example, if we have two records from dataset A and one record from dataset B:
# Dataset A
(1, {'name' : 'Pat', 'address' : '123 Main'})
(2, {'name' : 'Sam', 'address' : '123 Main'})

# Dataset B
(3, {'name' : 'Pat', 'address' : '123 Main'})
and two predicates: “Whole name” and “Whole address”. These predicates will produce the following blocks:
# Block 1 (Whole name)
(1, {'name' : 'Pat', 'address' : '123 Main'})
(3, {'name' : 'Pat', 'address' : '123 Main'})

# Block 2 (Whole name)
(2, {'name' : 'Sam', 'address' : '123 Main'})

# Block 3 (Whole address)
(1, {'name' : 'Pat', 'address' : '123 Main'})
(2, {'name' : 'Sam', 'address' : '123 Main'})
(3, {'name' : 'Pat', 'address' : '123 Main'})
So, the blocks you feed to matchBlocks should look like this,
blocks = (([(1, {'name' : 'Pat', 'address' : '123 Main'}, set([]))],
           [(3, {'name' : 'Pat', 'address' : '123 Main'}, set([]))]),
          ([(1, {'name' : 'Pat', 'address' : '123 Main'}, set([1])),
            (2, {'name' : 'Sam', 'address' : '123 Main'}, set([]))],
           [(3, {'name' : 'Pat', 'address' : '123 Main'}, set([1]))]))

linker.matchBlocks(blocks)
Within each block, dedupe will compare every pair of records. This is expensive. Checking to see if two sets intersect is much cheaper, and if the block coverage information for two records does intersect, that means that this pair of records has been compared in a previous block, and dedupe will skip comparing this pair of records again.
- threshold (float) –
Number between 0 and 1 (default is .5). We will only consider record pairs as duplicates if their estimated duplicate likelihood is greater than the threshold.
Lowering the number will increase recall, raising it will increase precision.
classifier
By default, the classifier is an L2 regularized logistic regression classifier. If you want to use a different classifier, you can overwrite this attribute with your custom object. Your classifier object must have fit and predict_proba methods, like sklearn models.

from sklearn.linear_model import LogisticRegression

deduper = dedupe.Dedupe(fields)
deduper.classifier = LogisticRegression()
thresholdBlocks(blocks, recall_weight=1.5)
Returns the threshold that maximizes the expected F score, a weighted average of precision and recall, for a sample of blocked data.
For larger datasets, you will need to use the thresholdBlocks and matchBlocks methods. These methods require you to create blocks of records. See the documentation for the matchBlocks method for how to construct blocks.

threshold = deduper.thresholdBlocks(blocked_data, recall_weight=2)

Keyword arguments
Parameters: - blocks (list) – See matchBlocks
- recall_weight (float) – Sets the tradeoff between precision and recall. I.e. if you care twice as much about recall as you do precision, set recall_weight to 2.
writeSettings(file_obj[, index=False])
Write a settings file that contains the data model and predicates to a file object.
Parameters: - file_obj (file) – File object.
- index (bool) – Should the indexes of index predicates be saved? You will probably only want to call this after indexing all of your records.
with open('my_learned_settings', 'wb') as f:
    deduper.writeSettings(f, index=True)
loaded_indices
Indicates whether indices for index predicates were loaded from a settings file.
Gazetteer Objects
Class for active learning gazetteer matching.
Gazetteer matching is for matching a messy data set against a ‘canonical dataset’, i.e. one that does not have any duplicates. This class is useful for such tasks as matching messy addresses against a clean list.
The interface is the same as for RecordLink objects except for a couple of methods.
class Gazetteer
index(data)
Add records to the index of records to match against. If a record in canonical_data has the same key as a previously indexed record, the old record will be replaced.
Parameters: data (dict) – a dictionary of records where the keys are record_ids and the values are dictionaries with the keys being field_names
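A minimal sketch, where gazetteer is assumed to be an already trained Gazetteer instance and canonical_data is a hypothetical dictionary of clean records keyed by record id:

canonical_data = {'X1' : {'name' : 'howard', 'address' : '123 Main'},
                  'X2' : {'name' : 'suzanne', 'address' : '456 Oak'}}

gazetteer.index(canonical_data)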
unindex(data)
Remove records from the index of records to match against.
Parameters: data (dict) – a dictionary of records where the keys are record_ids and the values are dictionaries with the keys being field_names
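Correspondingly, a hedged sketch of removing one of the hypothetical records indexed above:

gazetteer.unindex({'X2' : {'name' : 'suzanne', 'address' : '456 Oak'}})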
match(messy_data, threshold=0.5, n_matches=1)
Identifies pairs of records that could refer to the same entity. Returns tuples containing tuples of possible matches, with a confidence score for each match. The record_ids within each tuple should refer to potential matches from a messy data record to canonical records. The confidence score is the estimated probability that the records refer to the same entity.
Parameters: - messy_data (dict) – a dictionary of records from a messy dataset, where the keys are record_ids and the values are dictionaries with the keys being field names.
- threshold (float) –
a number between 0 and 1 (default is 0.5). We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold.
Lowering the number will increase recall, raising it will increase precision
- n_matches (int) – the maximum number of possible matches from canonical_data to return for each record in messy_data. If set to None all possible matches above the threshold will be returned. Defaults to 1
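For illustration, a hedged usage sketch; messy_data is assumed to be a dictionary of records sharing field names with the indexed canonical records:

matches = gazetteer.match(messy_data, threshold=0.5, n_matches=2)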
threshold(messy_data, recall_weight=1.5)
Returns the threshold that maximizes the expected F score, a weighted average of precision and recall, for a sample of data.
Parameters: - messy_data (dict) – a dictionary of records from a messy dataset, where the keys are record_ids and the values are dictionaries with the keys being field names.
- recall_weight (float) – Sets the tradeoff between precision and recall. I.e. if you care twice as much about recall as you do precision, set recall_weight to 2.
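A short sketch mirroring the threshold examples above; messy_data is the same assumed messy dictionary:

threshold = gazetteer.threshold(messy_data, recall_weight=2)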
matchBlocks(blocks, threshold=.5, n_matches=1)
Partitions blocked data and returns a list of clusters, where each cluster is a tuple of record ids.
Parameters: - blocks (list) –
Sequence of record blocks. Each record block is a tuple containing two sequences of records: the records from the messy dataset and the records from the canonical dataset. Within each block there should be at least one record from each dataset. Along with each record, there should also be information on the blocks that cover that record.
For example, if we have two records from a messy dataset and one record from a canonical dataset:
# Messy
(1, {'name' : 'Pat', 'address' : '123 Main'})
(2, {'name' : 'Sam', 'address' : '123 Main'})

# Canonical
(3, {'name' : 'Pat', 'address' : '123 Main'})
and two predicates: “Whole name” and “Whole address”. These predicates will produce the following blocks:
# Block 1 (Whole name)
(1, {'name' : 'Pat', 'address' : '123 Main'})
(3, {'name' : 'Pat', 'address' : '123 Main'})

# Block 2 (Whole name)
(2, {'name' : 'Sam', 'address' : '123 Main'})

# Block 3 (Whole address)
(1, {'name' : 'Pat', 'address' : '123 Main'})
(2, {'name' : 'Sam', 'address' : '123 Main'})
(3, {'name' : 'Pat', 'address' : '123 Main'})
So, the blocks you feed to matchBlocks should look like this,
blocks = (([(1, {'name' : 'Pat', 'address' : '123 Main'}, set([]))],
           [(3, {'name' : 'Pat', 'address' : '123 Main'}, set([]))]),
          ([(1, {'name' : 'Pat', 'address' : '123 Main'}, set([1])),
            (2, {'name' : 'Sam', 'address' : '123 Main'}, set([]))],
           [(3, {'name' : 'Pat', 'address' : '123 Main'}, set([1]))]))

linker.matchBlocks(blocks)
- threshold (float) –
Number between 0 and 1 (default is .5). We will only consider record pairs as duplicates if their estimated duplicate likelihood is greater than the threshold.
Lowering the number will increase recall, raising it will increase precision.
- n_matches (int) – the maximum number of possible matches from canonical_data to return for each record in messy_data. If set to None all possible matches above the threshold will be returned. Defaults to 1
clustered_dupes = deduper.matchBlocks(blocked_data, threshold)
uncertainPairs()
Returns a list of pairs of records from the sample of record pairs that Dedupe is most curious to have labeled.
This method is mainly useful for building a user interface for training a matching model.
> pair = deduper.uncertainPairs()
> print pair
[({'name' : 'Georgie Porgie'}, {'name' : 'Georgette Porgette'})]
markPairs(labeled_examples)
Add user-labeled pairs of records to training data and update the matching model.
This method is useful for building a user interface for training a matching model or for adding training data from an existing source.
Parameters: labeled_examples (dict) – a dictionary with two keys, match and distinct; the values are lists that can contain pairs of records.

labeled_examples = {'match' : [],
                    'distinct' : [({'name' : 'Georgie Porgie'},
                                   {'name' : 'Georgette Porgette'})]
                   }
deduper.markPairs(labeled_examples)
train([recall=0.95[, index_predicates=True]])
Learn the final pairwise classifier and blocking rules. Requires that adequate training data has already been provided.
Parameters: - recall (float) –
The proportion of true dupe pairs in our training data that the learned blocks must cover. If we lower the recall, there will be pairs of true dupes that we will never directly compare.
recall should be a float between 0.0 and 1.0; the default is 0.95.
- index_predicates (bool) –
Should dedupe consider predicates that rely upon indexing the data? Index predicates can be slower and take substantial memory.
Defaults to True.
deduper.train()
writeTraining(file_obj)
Write json data that contains labeled examples to a file object.
Parameters: file_obj (file) – File object.

with open('./my_training.json', 'w') as f:
    deduper.writeTraining(f)
readTraining(training_file)
Read training from a previously saved training data file object.
Parameters: training_file (file) – File object containing training data

with open('./my_training.json') as f:
    deduper.readTraining(f)
cleanupTraining()
Delete data we used for training. data_sample, training_pairs, training_data, and activeLearner can be very large objects. When you are done training you may want to free up the memory they use.

deduper.cleanupTraining()
classifier
By default, the classifier is an L2 regularized logistic regression classifier. If you want to use a different classifier, you can overwrite this attribute with your custom object. Your classifier object must have fit and predict_proba methods, like sklearn models.

from sklearn.linear_model import LogisticRegression

deduper = dedupe.Dedupe(fields)
deduper.classifier = LogisticRegression()
thresholdBlocks(blocks, recall_weight=1.5)
Returns the threshold that maximizes the expected F score, a weighted average of precision and recall, for a sample of blocked data.
For larger datasets, you will need to use the thresholdBlocks and matchBlocks methods. These methods require you to create blocks of records. See the documentation for the matchBlocks method for how to construct blocks.

threshold = deduper.thresholdBlocks(blocked_data, recall_weight=2)

Keyword arguments
Parameters: - blocks (list) – See matchBlocks
- recall_weight (float) – Sets the tradeoff between precision and recall. I.e. if you care twice as much about recall as you do precision, set recall_weight to 2.
writeSettings(file_obj[, index=False])
Write a settings file that contains the data model and predicates to a file object.
Parameters: - file_obj (file) – File object.
- index (bool) – Should the indexes of index predicates be saved? You will probably only want to call this after indexing all of your records.
with open('my_learned_settings', 'wb') as f:
    deduper.writeSettings(f, index=True)
loaded_indices
Indicates whether indices for index predicates were loaded from a settings file.
StaticGazetteer Objects
Class for gazetteer matching using saved settings. If you have already trained a gazetteer instance, you can load the saved settings with StaticGazetteer.
This class has the same interface as StaticRecordLink except for a couple of methods.
class StaticGazetteer
index(data)
Add records to the index of records to match against. If a record in canonical_data has the same key as a previously indexed record, the old record will be replaced.
Parameters: data (dict) – a dictionary of records where the keys are record_ids and the values are dictionaries with the keys being field_names
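A minimal sketch of the typical flow, assuming 'my_learned_settings' was written by Gazetteer.writeSettings() and canonical_data is a hypothetical dictionary of clean records:

with open('my_learned_settings', 'rb') as f:
    gazetteer = dedupe.StaticGazetteer(f)

gazetteer.index(canonical_data)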
unindex(data)
Remove records from the index of records to match against.
Parameters: data (dict) – a dictionary of records where the keys are record_ids and the values are dictionaries with the keys being field_names
match(messy_data, threshold=0.5, n_matches=1)
Identifies pairs of records that could refer to the same entity. Returns tuples containing tuples of possible matches, with a confidence score for each match. The record_ids within each tuple should refer to potential matches from a messy data record to canonical records. The confidence score is the estimated probability that the records refer to the same entity.
Parameters: - messy_data (dict) – a dictionary of records from a messy dataset, where the keys are record_ids and the values are dictionaries with the keys being field names.
- threshold (float) –
a number between 0 and 1 (default is 0.5). We will consider records as potential duplicates if the predicted probability of being a duplicate is above the threshold.
Lowering the number will increase recall, raising it will increase precision
- n_matches (int) – the maximum number of possible matches from canonical_data to return for each record in messy_data. If set to None all possible matches above the threshold will be returned. Defaults to 1
threshold(messy_data, recall_weight=1.5)
Returns the threshold that maximizes the expected F score, a weighted average of precision and recall, for a sample of data.
Parameters: - messy_data (dict) – a dictionary of records from a messy dataset, where the keys are record_ids and the values are dictionaries with the keys being field names.
- recall_weight (float) – Sets the tradeoff between precision and recall. I.e. if you care twice as much about recall as you do precision, set recall_weight to 2.
matchBlocks(blocks, threshold=.5, n_matches=1)
Partitions blocked data and returns a list of clusters, where each cluster is a tuple of record ids.
Parameters: - blocks (list) –
Sequence of record blocks. Each record block is a tuple containing two sequences of records: the records from the messy dataset and the records from the canonical dataset. Within each block there should be at least one record from each dataset. Along with each record, there should also be information on the blocks that cover that record.
For example, if we have two records from a messy dataset and one record from a canonical dataset:
# Messy
(1, {'name' : 'Pat', 'address' : '123 Main'})
(2, {'name' : 'Sam', 'address' : '123 Main'})

# Canonical
(3, {'name' : 'Pat', 'address' : '123 Main'})
and two predicates: “Whole name” and “Whole address”. These predicates will produce the following blocks:
# Block 1 (Whole name)
(1, {'name' : 'Pat', 'address' : '123 Main'})
(3, {'name' : 'Pat', 'address' : '123 Main'})

# Block 2 (Whole name)
(2, {'name' : 'Sam', 'address' : '123 Main'})

# Block 3 (Whole address)
(1, {'name' : 'Pat', 'address' : '123 Main'})
(2, {'name' : 'Sam', 'address' : '123 Main'})
(3, {'name' : 'Pat', 'address' : '123 Main'})
So, the blocks you feed to matchBlocks should look like this,
blocks = (([(1, {'name' : 'Pat', 'address' : '123 Main'}, set([]))],
           [(3, {'name' : 'Pat', 'address' : '123 Main'}, set([]))]),
          ([(1, {'name' : 'Pat', 'address' : '123 Main'}, set([1])),
            (2, {'name' : 'Sam', 'address' : '123 Main'}, set([]))],
           [(3, {'name' : 'Pat', 'address' : '123 Main'}, set([1]))]))

linker.matchBlocks(blocks)
- threshold (float) –
Number between 0 and 1 (default is .5). We will only consider record pairs as duplicates if their estimated duplicate likelihood is greater than the threshold.
Lowering the number will increase recall, raising it will increase precision.
- n_matches (int) – the maximum number of possible matches from canonical_data to return for each record in messy_data. If set to None all possible matches above the threshold will be returned. Defaults to 1
clustered_dupes = deduper.matchBlocks(blocked_data, threshold)
classifier
By default, the classifier is an L2 regularized logistic regression classifier. If you want to use a different classifier, you can overwrite this attribute with your custom object. Your classifier object must have fit and predict_proba methods, like sklearn models.

from sklearn.linear_model import LogisticRegression

deduper = dedupe.Dedupe(fields)
deduper.classifier = LogisticRegression()
thresholdBlocks(blocks, recall_weight=1.5)
Returns the threshold that maximizes the expected F score, a weighted average of precision and recall, for a sample of blocked data.
For larger datasets, you will need to use the thresholdBlocks and matchBlocks methods. These methods require you to create blocks of records. See the documentation for the matchBlocks method for how to construct blocks.

threshold = deduper.thresholdBlocks(blocked_data, recall_weight=2)

Keyword arguments
Parameters: - blocks (list) – See matchBlocks
- recall_weight (float) – Sets the tradeoff between precision and recall. I.e. if you care twice as much about recall as you do precision, set recall_weight to 2.
writeSettings(file_obj[, index=False])
Write a settings file that contains the data model and predicates to a file object.
Parameters: - file_obj (file) – File object.
- index (bool) – Should the indexes of index predicates be saved? You will probably only want to call this after indexing all of your records.
with open('my_learned_settings', 'wb') as f:
    deduper.writeSettings(f, index=True)
loaded_indices
Indicates whether indices for index predicates were loaded from a settings file.
Convenience Functions
consoleLabel(matcher)
Train a matcher instance (Dedupe or RecordLink) from the command line.
Example

> deduper = dedupe.Dedupe(variables)
> deduper.sample(data)
> dedupe.consoleLabel(deduper)
trainingDataLink(data_1, data_2, common_key[, training_size])
Construct training data for consumption by RecordLink.markPairs() from already linked datasets.
Parameters: - data_1 (dict) – a dictionary of records from first dataset, where the keys are record_ids and the values are dictionaries with the keys being field names.
- data_2 (dict) – a dictionary of records from second dataset, same form as data_1
- common_key (str) – the name of the record field that uniquely identifies a match
- training_size (int) – the rough limit of the number of training examples, defaults to 50000
Warning
Every match must be identified by the sharing of a common key. This function assumes that if two records do not share a common key then they are distinct records.
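A hedged usage sketch; 'unique_id' stands in for whatever common key identifies matches in your already linked datasets:

training_pairs = dedupe.trainingDataLink(data_1, data_2, 'unique_id', training_size=50000)
linker.markPairs(training_pairs)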
trainingDataDedupe(data, common_key[, training_size])
Construct training data for consumption by Dedupe.markPairs() from an already deduplicated dataset.
Parameters: - data (dict) – a dictionary of records, where the keys are record_ids and the values are dictionaries with the keys being field names
- common_key (str) – the name of the record field that uniquely identifies a match
- training_size (int) – the rough limit of the number of training examples, defaults to 50000
Warning
Every match must be identified by the sharing of a common key. This function assumes that if two records do not share a common key then they are distinct records.
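A hedged usage sketch, with 'unique_id' again standing in for your common key:

training_pairs = dedupe.trainingDataDedupe(data, 'unique_id', training_size=50000)
deduper.markPairs(training_pairs)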
canonicalize(record_cluster)
Constructs a canonical representation of a duplicate cluster by finding canonical values for each field.
Parameters: record_cluster (list) – A list of records within a duplicate cluster, where the records are dictionaries with field names as keys and field values as values
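A hedged sketch of canonicalizing one cluster; the records are made up for illustration:

record_cluster = [{'name' : 'Pat Smith', 'address' : '123 Main St'},
                  {'name' : 'Pat Smith', 'address' : '123 Main Street'}]

canonical_record = dedupe.canonicalize(record_cluster)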