Due: 2024-12-01 23:59:00 EDT
In this assignment, we’ll explore hashtables and sorting, and compare the efficiency of different methods of removing duplicate data. Note that the data set we will be working with this time is very large (685,725 rows and 116 columns). Using Excel or Numbers to look at it is NOT recommended - it will take forever to open. Vim is really your best option here. For the same reason, you should write efficient code - pay attention to the big-O and avoid unnecessary loops!
One important, and potentially expensive, task when handling data is data deduplication, whose goal is to make sure that each item in a data set appears only once. We’ll explore three different ways this can be done and determine which one is most efficient.
The New York City police department follows a practice of regularly stopping, questioning, and frisking people on the street. A 2011 lawsuit brought against this practice by the NYCLU successfully curbed the department’s frequent use of it, though concerns remain.
We’ll be looking at the data collected in 2011 at the height of the NYPD’s practice of stop-and-frisk. Each row in the data represents a single stop by a police officer, and the attributes (columns) include information about the stop (e.g., whether a weapon was found), information about the person stopped (e.g., age, race, etc.), and information about the stop location (e.g., the address). The dataset can be downloaded from https://drive.google.com/file/d/1NfEsYwWwkXAHPndVVQ0XO79M3xny9IbM/view?usp=sharing and the data specification from https://www.nyc.gov/assets/nypd/downloads/zip/analysis_and_planning/stop-question-frisk/SQF-File-Documentation.zip
The deduplication goal we will consider here is to collect a list of individuals who were stopped during the year. One way to think about this is that we would like to know which people were stopped multiple times by the NYPD in 2011.
In order to determine if two rows contain the same person, we need to develop a method that checks the information that should be unique to each person and compares it to determine equality. Use the equality test that we discussed in class.
Requirements:
Create a class Incident.java to hold a single row from the CSV. It does not need to hold all fields in the CSV - you should store only the fields that will be necessary in order to deduplicate the data.
Create a method boolean equals(Object o) that returns true if the compared items are the same person according to the way you have decided to check uniqueness.
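For illustration, here is a minimal sketch of what equals might look like, assuming (purely as placeholders) that you store a date-of-birth field and a sex field - substitute whichever fields we discussed in class:

// Sketch only - dob and sex are placeholder fields, not necessarily the required ones.
@Override
public boolean equals(Object o) {
    if (this == o) return true;               // an item is trivially the same person as itself
    if (!(o instanceof Incident)) return false;
    Incident other = (Incident) o;
    return this.dob.equals(other.dob)         // compare only the person-identifying fields
        && this.sex.equals(other.sex);
}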
Requirements:
Create a class Dataset.java. Your constructor for this class should take the name of a CSV file as input and should do the work of reading in the data from the CSV and parsing it into an ArrayList<Incident>. If the CSV doesn’t exist, use a try-catch to print a warning and terminate.
Create a method public ArrayList<Incident> getIncidents() which returns the array list you filled in the constructor.
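A rough sketch of the constructor and getter follows. The parsing inside the loop is a placeholder (a plain split on commas does not handle quoted fields that contain commas), and the Incident constructor taking the whole row is an assumption - adapt it to whatever fields your Incident class keeps.

import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Scanner;

public class Dataset {
    private ArrayList<Incident> incidents = new ArrayList<>();

    public Dataset(String csvFile) {
        try {
            Scanner sc = new Scanner(new File(csvFile));
            if (sc.hasNextLine()) sc.nextLine();              // skip the header row
            while (sc.hasNextLine()) {
                String[] cols = sc.nextLine().split(",");     // placeholder parsing
                incidents.add(new Incident(cols));            // Incident keeps only the fields it needs
            }
            sc.close();
        } catch (FileNotFoundException e) {
            System.out.println("Could not open file: " + csvFile);
            System.exit(1);                                   // warn and terminate
        }
    }

    public ArrayList<Incident> getIncidents() {
        return incidents;
    }
}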
You will now add five data deduplication methods to Dataset.java. Each of these methods (described below) should return an ArrayList of your Incident objects that contains only the non-duplicated individuals who were stop-and-frisked.
One way to find the duplicates in a data set is to compare each item to every other item in the data set, marking duplicates as you go. Make sure not to count the comparison of an item to itself as a duplicate. We will call this the “all pairs” method of data deduplication.
Requirements:
Create a method ArrayList<Incident> allPairsDeduplication() in Dataset.java.
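One possible shape for this method, assuming the incidents list built in your constructor:

public ArrayList<Incident> allPairsDeduplication() {
    ArrayList<Incident> unique = new ArrayList<>();
    for (int i = 0; i < incidents.size(); i++) {
        Incident current = incidents.get(i);
        boolean duplicate = false;
        for (int j = 0; j < i && !duplicate; j++) {    // j < i, so an item is never compared to itself
            if (current.equals(incidents.get(j))) {
                duplicate = true;
            }
        }
        if (!duplicate) {
            unique.add(current);                        // keep only the first occurrence
        }
    }
    return unique;
}

This is O(n^2) comparisons, which is why it struggles past roughly 100k rows (see the plotting note later in this assignment).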
Another way to find the duplicates is to create a key for each item in the data set and insert the items into a hashtable. Items that hash to the same key will overwrite each other, resulting in deduplication.
Create a method public String key() in Incident.java. The choice of key will determine whether two items are considered to be duplicates. Use the fields we discussed in class.
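A minimal sketch, again using placeholder fields - the key must be built from exactly the same fields your equals method compares:

// Sketch only - dob and sex are placeholders for the fields chosen in class.
public String key() {
    return dob + "|" + sex;   // a delimiter keeps distinct field combinations from colliding
}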
Requirements:
Use the ProbeHashMap.java class, which you can download here:
wget https://raw.githubusercontent.com/BMC-CS-151/BMC-CS-151.github.io/main/labs/lab09/ProbeHashMap.java
You’ll also need the following files:
wget https://raw.githubusercontent.com/BMC-CS-151/BMC-CS-151.github.io/main/labs/lab09/AbstractHashMap.java
wget https://raw.githubusercontent.com/BMC-CS-151/BMC-CS-151.github.io/main/labs/lab09/Entry.java
wget https://raw.githubusercontent.com/BMC-CS-151/BMC-CS-151.github.io/main/labs/lab09/Map.java
wget https://raw.githubusercontent.com/BMC-CS-151/BMC-CS-151.github.io/main/labs/lab09/AbstractMap.java
Your program should first create a ProbeHashMap with size 1000003 (use the one-parameter constructor) and then insert the items into this hash table. Remember that the key should include all fields used in deduplication.
The deduplication method signature should be: ArrayList<Incident> hashLinearDeduplication().
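A sketch of this method, assuming the textbook-style ProbeHashMap API (a one-parameter constructor, put, and values) and the incidents field from the constructor:

public ArrayList<Incident> hashLinearDeduplication() {
    ProbeHashMap<String, Incident> map = new ProbeHashMap<>(1000003);
    for (Incident inc : incidents) {
        map.put(inc.key(), inc);             // duplicates share a key, so later copies overwrite earlier ones
    }
    ArrayList<Incident> unique = new ArrayList<>();
    for (Incident inc : map.values()) {      // whatever survives in the table is unique
        unique.add(inc);
    }
    return unique;
}

The hashing statistics described next can be printed right after the insertion loop, using the getters you add to ProbeHashMap.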
Besides the list of non-duplicates, also collect the following statistics about hashing: 1) average number of probes during insertions, 2) max number of probes during insertions, and 3) load factor after insertions. Note that you will need to update the ProbeHashMap class to compute these values.
Create the following functions:
public double getAvgProbes()
public int getMaxProbes()
public double getLoadFactor()
Print out these statistics once after you have inserted all elements into the hashtable in the following format:
Average number of probes: XXX
Max number of probes: XXX
Load factor: XXX
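How you count probes depends on how the provided ProbeHashMap structures its probing loop, so the following is only one possible set of additions; the capacity field name is assumed from the textbook AbstractHashMap, and the recordProbes helper is hypothetical.

private long totalProbes = 0;   // probes summed over all insertions
private long numInserts = 0;    // number of insertions
private int maxProbes = 0;      // worst single insertion

// call once per insertion, passing the number of slots examined for that insertion
private void recordProbes(int probes) {
    totalProbes += probes;
    numInserts++;
    if (probes > maxProbes) maxProbes = probes;
}

public double getAvgProbes() { return (double) totalProbes / numInserts; }

public int getMaxProbes() { return maxProbes; }

public double getLoadFactor() { return (double) size() / capacity; }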
Create a DoubleHashMap.java class which extends AbstractHashMap and implements a hash table that supports double hashing on collision. Starting from ProbeHashMap and modifying it to add a secondary hash function is recommended. For your double hash, create a function public int f(int i, K k). Use the definition for f that we discussed in class. For this assignment, set q = 1000001 in your double hash function.
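The exact definition of f should match what we covered in class; a common textbook form is f(i, k) = i * (q - (h(k) mod q)), and a sketch under that assumption looks like:

// Sketch assuming the textbook secondary hash d(k) = q - (h(k) mod q).
private static final int Q = 1000001;    // q value required for this assignment

public int f(int i, K k) {
    int h = Math.abs(k.hashCode());
    int d = Q - (h % Q);                 // secondary step size, never zero
    return i * d;                        // offset added to the home slot on the i-th probe
}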
Similarly, create the following functions:
public double getAvgProbes()
public int getMaxProbes()
public double getLoadFactor()
The deduplication method signature in Dataset.java should be: ArrayList<Incident> hashDoubleDeduplication().
Did the hashing statistics change? Discuss in your README (see details in
Section 2).
Now, experiment with a smaller value of q. Did your hashing statistics change? Discuss in your README. Remember to change q back to 1000001 before submitting to Gradescope.
The last method for deduplication that we’ll look at sorts the data to check for duplicates. Once data is sorted, it is easier to check for duplicates since they will be adjacent. First, we’ll need some functions that allow us to sort.
Java’s sorting methods rely on .compareTo, so Incident should implement Comparable.
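One simple option (a sketch, not a requirement) is to reuse key() so the sort order and your notion of “duplicate” agree:

// Assumes Incident implements Comparable<Incident>.
@Override
public int compareTo(Incident other) {
    return this.key().compareTo(other.key());
}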
To begin, download Utils.java:
wget https://raw.githubusercontent.com/BMC-CS-151/BMC-CS-151.github.io/main/hws/hw07/Utils.java
Fill in the specified methods.
Create two deduplication methods in Dataset.java: ArrayList<Incident> quickSortDeduplication() and ArrayList<Incident> builtinSortDeduplication(). The quickSortDeduplication method should call Utils.quickSort(ArrayList<Incident>); the builtinSortDeduplication method should use Collections.sort() to sort your data.
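A sketch of both methods, assuming Utils.quickSort sorts the list in place; the adjacentUniques helper is not required by the assignment, it just avoids repeating the adjacent-duplicates pass (you will need import java.util.Collections in Dataset.java for the built-in sort):

public ArrayList<Incident> builtinSortDeduplication() {
    ArrayList<Incident> sorted = new ArrayList<>(incidents);   // copy so the original order is preserved
    Collections.sort(sorted);
    return adjacentUniques(sorted);
}

public ArrayList<Incident> quickSortDeduplication() {
    ArrayList<Incident> sorted = new ArrayList<>(incidents);
    Utils.quickSort(sorted);
    return adjacentUniques(sorted);
}

private ArrayList<Incident> adjacentUniques(ArrayList<Incident> sorted) {
    ArrayList<Incident> unique = new ArrayList<>();
    for (int i = 0; i < sorted.size(); i++) {
        if (i == 0 || !sorted.get(i).equals(sorted.get(i - 1))) {   // duplicates are adjacent after sorting
            unique.add(sorted.get(i));
        }
    }
    return unique;
}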
You will now explore the complexity of the deduplication methods you’ve developed. The answers to these questions should be given in your README file.
Create a Driver07.java class which calls the deduplication methods you defined in Dataset. Test your deduplication methods by printing the number of entries before and after deduplication. All methods should result in a deduplicated list of the same size. Using your driver, test your various methods and make observations to answer the following questions:
Requirements:
How much time (in milliseconds) does each deduplication method take when run on the SQF data set?
Create a chart (with Excel or any other plotting software) showing the number of rows processed on the x-axis and the corresponding time used in milliseconds on the y-axis for each of these methods (where each method is a series on the graph). Note that this requires that you collect time data in your program using the following code snippet that you saw when we first talked about complexity and analysis of algorithms.
long startTime = System.currentTimeMillis();
// run code
long endTime = System.currentTimeMillis();
long elapsed = endTime - startTime;
Save the graph, name it complexity.png, and include it in your submission on Gradescope.
Your allPairs deduplication technique likely cannot handle more than 100k entries; do not plot anything beyond that.
You will receive a stop-and-frisk records file on the command line as a single
argument like this:
2011.csv
As in your other assignments, make sure that your program works correctly even if the CSV file is not in your current directory.
Note that we may choose to test your work using data from a different year, so be sure to actually read the data in from the command line and not hard-code the file name.
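A minimal driver sketch under these constraints - the file name comes only from args[0], and each method is timed with the snippet shown earlier (the method names match the ones defined above):

public class Driver07 {
    public static void main(String[] args) {
        Dataset data = new Dataset(args[0]);          // never hard-code the CSV name
        System.out.println("Entries before deduplication: " + data.getIncidents().size());

        long startTime = System.currentTimeMillis();
        int unique = data.hashLinearDeduplication().size();
        long elapsed = System.currentTimeMillis() - startTime;
        System.out.println("Entries after deduplication: " + unique + " (" + elapsed + " ms)");

        // ...repeat the same timing pattern for the other four methods
    }
}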
Requirements:
Submit the following files to the assignment called HW07 on Gradescope:
Dataset.java
DoubleHashMap.java
Driver07.java
Incident.java
ProbeHashMap.java
Utils.java
any other java files you created (but NOT AbstractHashMap.java, AbstractMap.java, Entry.java, or Map.java)
complexity.png
Make sure to name these files exactly what we specify here. Otherwise,
our autograders might not work and we might have to take points off.
DO NOT submit a .docx or .pdf file.
DO NOT submit any .class files or any of the csv files.
DO NOT submit any of the provided interfaces. We will use the ones
we provide for testing.
You will likely want to submit the files from your own laptop/computer.
To copy the files from goldengate
to your own machine,
open up a terminal on your own computer. If you have a Mac or Linux machine, you can use the Terminal application; if you have Windows, you can use the PowerShell app.
Run the following command:
$ scp <username>@goldengate.cs.brynmawr.edu:/home/<username>/cs151/homeworks/hw07/* .
Make sure to replace <username>
with your username. scp is a command to securely copy files from one machine to another - in this case, from goldengate to your own machine. The trailing . tells scp to place the copied files in your current directory.