KeyError exception caused by a Corporation label returned by probablepeople #3

@mlollo

I get a KeyError exception while dedupe is calculating the threshold. One of my records is wrong and contains a corporation name, but that shouldn't cause an exception in parseratorvariable. (I'm using the Person Name FieldType.)

The following exception is raised:

Exception in thread Thread-6:
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/local/Cellar/python3/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/computer/Documents/projects/dedupe/.venv/lib/python3.6/site-packages/dedupe/core.py", line 76, in __call__
    filtered_pairs = self.fieldDistance(record_pairs)
  File "/Users/computer/Documents/projects/dedupe/.venv/lib/python3.6/site-packages/dedupe/core.py", line 101, in fieldDistance
    distances = self.data_model.distances(records)
  File "/Users/computer/Documents/projects/dedupe/.venv/lib/python3.6/site-packages/dedupe/datamodel.py", line 82, in distances
    record_2[field])
  File "/Users/computer/Documents/projects/dedupe/.venv/lib/python3.6/site-packages/parseratorvariable/__init__.py", line 90, in comparator
    variable_type = self.variable_types[variable_type_1]
KeyError: 'Corporation'

Either parseratorvariable does not handle the case where probablepeople returns an unexpected label, or probablepeople does not raise an error when the label does not correspond to the type 'person'.
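
The failure mode can be illustrated without dedupe installed. The sketch below is a stand-in, not the library's code: `fake_tagger`, `VARIABLE_TYPES` and `lookup_variable_type` are hypothetical names, and only the 'Person' and 'Corporation' labels come from the report above. It shows a lookup table that only knows 'Person' failing on 'Corporation', and a membership check avoiding the exception.

```python
# Hypothetical stand-in for the lookup that fails in the traceback above.
# Assumption: a person-name field only registers rules for the 'Person' label.
VARIABLE_TYPES = {"Person": {"offset": 0}}


def fake_tagger(value):
    """Hypothetical tagger: returns (parsed parts, label)."""
    label = "Corporation" if value.endswith(("Inc", "LLC")) else "Person"
    return {"raw": value}, label


def lookup_variable_type(value):
    """Return the comparison rules for value, or None for an unknown label."""
    _, label = fake_tagger(value)
    if label not in VARIABLE_TYPES:  # guard that prevents the KeyError
        return None
    return VARIABLE_TYPES[label]


# Unguarded lookup: fails the same way as the traceback.
try:
    VARIABLE_TYPES[fake_tagger("Acme Widgets Inc")[1]]
    unguarded_error = None
except KeyError as exc:
    unguarded_error = exc.args[0]

guarded_result = lookup_variable_type("Acme Widgets Inc")
person_result = lookup_variable_type("Jane Doe")
```

Whichever side ends up owning the fix, the point is the same: the label has to be checked before it is used as a dictionary key.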

For more, see datamade/probablepeople#74.

For those who want to patch this with a workaround, what I did was replace the comparator method in parseratorvariable/__init__ (line 54).
Add these lines wherever you use the dedupe library:

import dedupe
import numpy
import parseratorvariable
from probableparsing import RepeatedLabelError

def comparator(self, field_1, field_2):
    distances = numpy.zeros(self.expanded_size)
    i = 0

    if not (field_1 and field_2):
        return distances

    distances[i] = 1
    i += 1

    try:
        parsed_variable_1, variable_type_1 = self.tagger(field_1)
        parsed_variable_2, variable_type_2 = self.tagger(field_2)
    except RepeatedLabelError as e:
        if self.log_file:
            import csv
            # Python 3: write text rather than bytes so the log holds the raw string
            with open(self.log_file, 'a', encoding='utf8', newline='') as f:
                writer = csv.writer(f)
                writer.writerow([e.original_string])
        distances[i:3] = [1, 0]
        distances[-1] = self.compareString(field_1, field_2)
        return distances

    if 'Ambiguous' in (variable_type_1, variable_type_2):
        distances[i:3] = [1, 0]
        distances[-1] = self.compareString(field_1, field_2)
        return distances
    elif variable_type_1 != variable_type_2:
        distances[i:3] = [0, 0]
        distances[-1] = self.compareString(field_1, field_2)
        return distances
    elif variable_type_1 == variable_type_2:
        distances[i:3] = [0, 1]

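    # Workaround: the tagger can return a label (e.g. 'Corporation') that
    # this field type has no entry for. Fall back to plain string comparison
    # instead of raising KeyError at the lookup below.
    if variable_type_1 not in self.variable_types:
        distances[i:3] = [1, 0]
        distances[-1] = self.compareString(field_1, field_2)
        return distances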
    
    i += 2

    variable_type = self.variable_types[variable_type_1]

    distances[i:i + self.n_type_indicators] = variable_type['indicator']
    i += self.n_type_indicators

    i += variable_type['offset']
    for j, dist in enumerate(variable_type['compare'](parsed_variable_1,
                                                      parsed_variable_2),
                             i):
        distances[j] = dist

    unobserved_parts = numpy.isnan(distances[i:j + 1])
    distances[i:j + 1][unobserved_parts] = 0
    unobserved_parts = (~unobserved_parts).astype(int)
    distances[(i + self.n_parts):(j + 1 + self.n_parts)] = unobserved_parts

    return distances

parseratorvariable.ParseratorType.comparator = comparator

With the patch applied, you can use dedupe as usual.
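
A less invasive variant of the same idea would be to wrap the existing method instead of copying it. This is only a sketch under assumptions: `DummyType`, `with_fallback` and `plain_fallback` are hypothetical names, and it has not been run against parseratorvariable itself, where the fallback would have to return a numpy array with the same layout as the real comparator.

```python
import functools


def with_fallback(method, fallback):
    """Wrap a comparator so a KeyError falls back to fallback(self, a, b)."""
    @functools.wraps(method)
    def wrapper(self, field_1, field_2):
        try:
            return method(self, field_1, field_2)
        except KeyError:
            return fallback(self, field_1, field_2)
    return wrapper


class DummyType:
    """Hypothetical stand-in for a field type with a label lookup table."""
    variable_types = {"Person": 1.0}

    def comparator(self, field_1, field_2):
        label = "Corporation" if "Inc" in field_1 else "Person"
        return [self.variable_types[label]]  # KeyError for 'Corporation'


def plain_fallback(self, field_1, field_2):
    # Hypothetical fallback value meaning "no structured comparison available".
    return [0.0]


# Same monkey-patching pattern as above, but the original method is kept.
DummyType.comparator = with_fallback(DummyType.comparator, plain_fallback)

person = DummyType().comparator("Jane Doe", "Jane Do")
corporation = DummyType().comparator("Acme Inc", "Acme Inc")
```

The trade-off is that catching KeyError around the whole method can also hide unrelated KeyErrors, which the explicit membership check in the patch above does not.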
