Due: 2024-12-01 23:59:00 EDT
In this assignment, we’ll explore hashtables and sorting, and compare the efficiency of different methods of removing duplicate data. Note that the data set we will be working with this time is very large (685,725 rows and 116 columns). Using Excel or Numbers to look at it is NOT recommended - it will take forever to open. Vim is really your best option here. For the same reason, you should write efficient code - pay attention to the big-O and avoid unnecessary loops!
One important, and potentially expensive, task when handling data is data deduplication, whose goal is to make sure that each item in a data set appears only once. We’ll explore three different ways this can be done and determine which one is most efficient.
The New York City police department follows a practice of regularly stopping, questioning, and frisking people on the street. A 2011 lawsuit brought against this practice by the NYCLU successfully curbed the department’s frequent use of it, though concerns remain.
We’ll be looking at the data collected in 2011 at the height of the NYPD’s practice of stop-and-frisk. Each row in the data represents a single stop by a police officer, and the attributes (columns) include information about the stop (e.g., whether a weapon was found), information about the person stopped (e.g., age, race, etc.), and information about the stop location (e.g., the address). The dataset can be downloaded from https://drive.google.com/file/d/1NfEsYwWwkXAHPndVVQ0XO79M3xny9IbM/view?usp=sharing and the data specification from https://www.nyc.gov/assets/nypd/downloads/zip/analysis_and_planning/stop-question-frisk/SQF-File-Documentation.zip
The deduplication goal we will consider here is to collect a list of individuals who were stopped during the year. One way to think about this is that we would like to know which people were stopped multiple times by the NYPD in 2011.
In order to determine if two rows contain the same person, we need to develop a method that checks the information that should be unique to each person and compares it to determine equality. Use the equality test that we discussed in class.
Requirements:
Create a class Incident.java to hold a single row from the CSV. It does not need to hold all fields in the CSV - you should store only the fields that will be necessary in order to deduplicate the data.
Create a method boolean equals(Object o) that returns true if the compared items are the same person according to the way you have decided to check uniqueness.
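For illustration, here is a minimal sketch of what equals might look like, assuming (purely as placeholders) that you store a date-of-birth field and a sex field - substitute whichever fields we discussed in class:

// Sketch only - dob and sex are placeholder fields, not necessarily the required ones.
@Override
public boolean equals(Object o) {
    if (this == o) return true;               // an item is trivially the same person as itself
    if (!(o instanceof Incident)) return false;
    Incident other = (Incident) o;
    return this.dob.equals(other.dob)         // compare only the person-identifying fields
        && this.sex.equals(other.sex);
}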
Requirements:
Create a class Dataset.java. Your constructor for this class should take the name of a CSV file as input and should do the work of reading in the data from the CSV and parsing it into an ArrayList<Incident>. If the CSV doesn’t exist, use a try-catch to print a warning and terminate.
Create a method public ArrayList<Incident> getIncidents() which returns the array list you filled in the constructor.
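A rough sketch of the constructor and getter follows. The parsing inside the loop is a placeholder (a plain split on commas does not handle quoted fields that contain commas), and the Incident constructor taking the whole row is an assumption - adapt it to whatever fields your Incident class keeps.

import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Scanner;

public class Dataset {
    private ArrayList<Incident> incidents = new ArrayList<>();

    public Dataset(String csvFile) {
        try {
            Scanner sc = new Scanner(new File(csvFile));
            if (sc.hasNextLine()) sc.nextLine();              // skip the header row
            while (sc.hasNextLine()) {
                String[] cols = sc.nextLine().split(",");     // placeholder parsing
                incidents.add(new Incident(cols));            // Incident keeps only the fields it needs
            }
            sc.close();
        } catch (FileNotFoundException e) {
            System.out.println("Could not open file: " + csvFile);
            System.exit(1);                                   // warn and terminate
        }
    }

    public ArrayList<Incident> getIncidents() {
        return incidents;
    }
}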
You will now add five data deduplication methods to Dataset.java. Each of these methods (described below) should return an ArrayList of your Incident objects that contains only the non-duplicated individuals who were stop-and-frisked.
One way to find the duplicates in a data set is to compare each item to every other item in the data set, marking duplicates as you go. Make sure not to count the comparison of an item to itself as a duplicate. We will call this the “all pairs” method of data deduplication.
Requirements:
Create a method ArrayList<Incident> allPairsDeduplication() in Dataset.java.
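One possible shape for this method, assuming the incidents list built in your constructor:

public ArrayList<Incident> allPairsDeduplication() {
    ArrayList<Incident> unique = new ArrayList<>();
    for (int i = 0; i < incidents.size(); i++) {
        Incident current = incidents.get(i);
        boolean duplicate = false;
        for (int j = 0; j < i && !duplicate; j++) {    // j < i, so an item is never compared to itself
            if (current.equals(incidents.get(j))) {
                duplicate = true;
            }
        }
        if (!duplicate) {
            unique.add(current);                        // keep only the first occurrence
        }
    }
    return unique;
}

This is O(n^2) comparisons, which is why it struggles past roughly 100k rows (see the plotting note later in this assignment).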
Another way to find the duplicates is to create a key for each item in the data set and insert the items into a hashtable. Items that hash to the same key will overwrite each other, resulting in deduplication.
Create a method public String key() in Incident.java. The choice of key will determine whether two items are considered to be duplicates. Use the fields we discussed in class.
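A minimal sketch, again using placeholder fields - the key must be built from exactly the same fields your equals method compares:

// Sketch only - dob and sex are placeholders for the fields chosen in class.
public String key() {
    return dob + "|" + sex;   // a delimiter keeps distinct field combinations from colliding
}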
Requirements:
Use the ProbeHashMap.java class, which you can download here:
wget https://raw.githubusercontent.com/BMC-CS-151/BMC-CS-151.github.io/main/labs/lab09/ProbeHashMap.java
You’ll also need the following files:
wget https://raw.githubusercontent.com/BMC-CS-151/BMC-CS-151.github.io/main/labs/lab09/AbstractHashMap.java
wget https://raw.githubusercontent.com/BMC-CS-151/BMC-CS-151.github.io/main/labs/lab09/Entry.java
wget https://raw.githubusercontent.com/BMC-CS-151/BMC-CS-151.github.io/main/labs/lab09/Map.java
wget https://raw.githubusercontent.com/BMC-CS-151/BMC-CS-151.github.io/main/labs/lab09/AbstractMap.java
Your program should first create a ProbeHashMap with size 1000003 (use the one-parameter constructor) and then insert the items into this hash table. Remember that the key should include all fields used in deduplication.
The deduplication method signature should be: ArrayList<Incident> hashLinearDeduplication().
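A sketch of this method, assuming the textbook-style ProbeHashMap API (a one-parameter constructor, put, and values) and the incidents field from the constructor:

public ArrayList<Incident> hashLinearDeduplication() {
    ProbeHashMap<String, Incident> map = new ProbeHashMap<>(1000003);
    for (Incident inc : incidents) {
        map.put(inc.key(), inc);             // duplicates share a key, so later copies overwrite earlier ones
    }
    ArrayList<Incident> unique = new ArrayList<>();
    for (Incident inc : map.values()) {      // whatever survives in the table is unique
        unique.add(inc);
    }
    return unique;
}

The hashing statistics described next can be printed right after the insertion loop, using the getters you add to ProbeHashMap.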
Besides the list of non-duplicates, also collect the following statistics about hashing: 1) average number of probes during insertions, 2) max number of probes during insertions, and 3) load factor after insertions. Note that you will need to update the ProbeHashMap class to compute these values.
Create the following functions:
public double getAvgProbes()
public int getMaxProbes()
public double getLoadFactor()
Print out these statistics once after you have inserted all elements into the hashtable in the following format:
Average number of probes: XXX
Max number of probes: XXX
Load factor: XXX
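How you count probes depends on how the provided ProbeHashMap structures its probing loop, so the following is only one possible set of additions; the capacity field name is assumed from the textbook AbstractHashMap, and the recordProbes helper is hypothetical.

private long totalProbes = 0;   // probes summed over all insertions
private long numInserts = 0;    // number of insertions
private int maxProbes = 0;      // worst single insertion

// call once per insertion, passing the number of slots examined for that insertion
private void recordProbes(int probes) {
    totalProbes += probes;
    numInserts++;
    if (probes > maxProbes) maxProbes = probes;
}

public double getAvgProbes() { return (double) totalProbes / numInserts; }

public int getMaxProbes() { return maxProbes; }

public double getLoadFactor() { return (double) size() / capacity; }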
Create a DoubleHashMap.java class which extends AbstractHashMap and implements a hash table that supports double hashing on collision. Starting from ProbeHashMap and modifying it to add a secondary hash function is recommended. For your double hash, create a function public int f(int i, K k). Use the definition for f that we discussed in class. For this assignment, set q = 1000001 in your double hash function.
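The exact definition of f should match what we covered in class; a common textbook form is f(i, k) = i * (q - (h(k) mod q)), and a sketch under that assumption looks like:

// Sketch assuming the textbook secondary hash d(k) = q - (h(k) mod q).
private static final int Q = 1000001;    // q value required for this assignment

public int f(int i, K k) {
    int h = Math.abs(k.hashCode());
    int d = Q - (h % Q);                 // secondary step size, never zero
    return i * d;                        // offset added to the home slot on the i-th probe
}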
Similarly, create the following functions:
public double getAvgProbes()
public int getMaxProbes()
public double getLoadFactor()
The deduplication method signature in Dataset.java should be: ArrayList<Incident> hashDoubleDeduplication().
Did the hashing statistics change? Discuss in your README (see details in
Section 2).
Now, experiment with a smaller value of q. Did your hashing statistics change? Discuss in your README. Remember to change q back to 1000001 before submitting to Gradescope.
The last method for deduplication that we’ll look at sorts the data to check for duplicates. Once data is sorted, it is easier to check for duplicates since they will be adjacent. First, we’ll need some functions that allow us to sort.
Java’s sorting methods rely on .compareTo, so Incident should implement Comparable.
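One simple option (a sketch, not a requirement) is to reuse key() so the sort order and your notion of “duplicate” agree:

// Assumes Incident implements Comparable<Incident>.
@Override
public int compareTo(Incident other) {
    return this.key().compareTo(other.key());
}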
To begin, download Utils.java:
wget https://raw.githubusercontent.com/BMC-CS-151/BMC-CS-151.github.io/main/hws/hw07/Utils.java
Fill in the specified methods.
Create two deduplication methods in Dataset.java: ArrayList<Incident> quickSortDeduplication() and ArrayList<Incident> builtinSortDeduplication(). The quickSortDeduplication method should call Utils.quickSort(ArrayList<Incident>); the builtinSortDeduplication method should use Collections.sort() to sort your data.
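A sketch of both methods, assuming Utils.quickSort sorts the list in place; the adjacentUniques helper is not required by the assignment, it just avoids repeating the adjacent-duplicates pass (you will need import java.util.Collections in Dataset.java for the built-in sort):

public ArrayList<Incident> builtinSortDeduplication() {
    ArrayList<Incident> sorted = new ArrayList<>(incidents);   // copy so the original order is preserved
    Collections.sort(sorted);
    return adjacentUniques(sorted);
}

public ArrayList<Incident> quickSortDeduplication() {
    ArrayList<Incident> sorted = new ArrayList<>(incidents);
    Utils.quickSort(sorted);
    return adjacentUniques(sorted);
}

private ArrayList<Incident> adjacentUniques(ArrayList<Incident> sorted) {
    ArrayList<Incident> unique = new ArrayList<>();
    for (int i = 0; i < sorted.size(); i++) {
        if (i == 0 || !sorted.get(i).equals(sorted.get(i - 1))) {   // duplicates are adjacent after sorting
            unique.add(sorted.get(i));
        }
    }
    return unique;
}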
You will now explore the complexity of the deduplication methods you’ve developed. The answers to these questions should be given in your README file.
Create a Driver07.java class which calls the deduplication methods you defined in Dataset. Test your deduplication methods by printing the number of entries before and after deduplication. All methods should result in a deduplicated list of the same size. Using your driver, test your various methods and make observations to answer the following questions:
Requirements:
How much time (in milliseconds) does each deduplication method take when run on the SQF data set?
Create a chart (with Excel or any other plotting software) showing the number of rows processed on the x-axis and the corresponding time used in milliseconds on the y-axis for each of these methods (where each method is a series on the graph). Note that this requires that you collect time data in your program using the following code snippet that you saw when we first talked about complexity and analysis of algorithms.
long startTime = System.currentTimeMillis();
// run code
long endTime = System.currentTimeMillis();
long elapsed = endTime - startTime;
Save the graph, name it complexity.png, and include it in your submission on Gradescope.
Your allPairs deduplication technique likely cannot handle more than 100k entries; do not plot anything beyond that.
You will receive a stop-and-frisk records file on the command line as a single
argument like this:
2011.csv
As in your other assignments, make sure that your program works correctly even if the CSV file is not in your current directory.
Note that we may choose to test your work using data from a different year, so be sure to actually read the data in from the command line and not hard-code the file name.
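A minimal driver sketch under these constraints - the file name comes only from args[0], and each method is timed with the snippet shown earlier (the method names match the ones defined above):

public class Driver07 {
    public static void main(String[] args) {
        Dataset data = new Dataset(args[0]);          // never hard-code the CSV name
        System.out.println("Entries before deduplication: " + data.getIncidents().size());

        long startTime = System.currentTimeMillis();
        int unique = data.hashLinearDeduplication().size();
        long elapsed = System.currentTimeMillis() - startTime;
        System.out.println("Entries after deduplication: " + unique + " (" + elapsed + " ms)");

        // ...repeat the same timing pattern for the other four methods
    }
}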
Requirements:
Submit the following files to the assignment called HW07 on Gradescope:
Dataset.java
DoubleHashMap.java
Driver07.java
Incident.java
ProbeHashMap.java
Utils.java
any other java files you created (but NOT AbstractHashMap.java, AbstractMap.java, Entry.java, or Map.java)
complexity.png
Make sure to name these files exactly what we specify here. Otherwise,
our autograders might not work and we might have to take points off.
DO NOT submit a .docx or .pdf file.
DO NOT submit any .class files or any of the csv files.
DO NOT submit any of the provided interfaces. We will use the ones
we provide for testing.
You will likely want to submit the files from your own laptop/computer.
To copy the files from goldengate
to your own machine,
open up a terminal on your own computer. If you have a Mac or Linux machine, you can use the Terminal application; if you have Windows, you can use the PowerShell app.
Run the following command:
$ scp <username>@goldengate.cs.brynmawr.edu:/home/<username>/cs151/homeworks/hw07/* .
Make sure to replace <username>
with your username. scp is a command to securely copy files from one machine to another - in this case, from goldengate to your own machine. The trailing . tells scp to place the copied files in your current directory.