Interface DataSet

  • All Superinterfaces:
    java.lang.Iterable<java.util.List<Attribute>>
    All Known Implementing Classes:
    MySQLDataSet, TextFileDataSet

    public interface DataSet
    extends java.lang.Iterable<java.util.List<Attribute>>
    DataSet is a container made up of examples. Each example (or DataRecord) is a set of named attributes, either discrete or continuous, that represents a particular case of some class of object. DataSet contains methods to manipulate those DataRecords and to calculate the information gain and entropy of the data set.
     TODO:
     - methods to read from .names/.data files
     - methods to export in format .names/.data
     - methods to read from database tables?
     - optimizations in the sorting to avoid sorting multiple times.
     
    • Method Summary

      Modifier and Type Method Description
      Attribute allTheSame()
      Returns the most common output attribute if the rest of the attributes are exactly the same over the whole data set.
      boolean allTheSameOutput()  
      void close()
      Closes the underlying data source if possible.
      java.util.HashMap<Attribute,java.lang.Integer> getFrequencies(int lo, int hi, int fieldIndex)
      Gets a map between the different values of the attribute at the fieldIndex and their respective frequencies.
      int getItemsCount()  
      MetaData getMetaData()  
      int getOutputIndex()  
      DataSet getSubset(int lo, int hi)
      Creates a slice of this data set, [lo, hi), as a new data set.
      java.lang.Iterable<java.util.List<Attribute>> sortOver(int fieldIndex)
      Sorts the data set over fieldIndex as the primary key, using the output index to break any ties.
      java.lang.Iterable<java.util.List<Attribute>> sortOver(int lo, int hi, int fieldIndex)
      Sorts the data set over fieldIndex as the primary key, using the output index to break any ties, and limits the elements to [lo, hi).
      DataSet[] splitKeepingRelation(double proportion)
      Splits this data set into two new data sets, keeping the proportion between the output classes.
      • Methods inherited from interface java.lang.Iterable

        forEach, iterator, spliterator
    • Method Detail

      • getOutputIndex

        int getOutputIndex()
        Returns:
        The index of the field used as output.
      • getItemsCount

        int getItemsCount()
        Returns:
        The total count of elements in this data set.
      • getMetaData

        MetaData getMetaData()
        Returns:
        A MetaData object containing information about the attributes on the data set.
      • sortOver

        java.lang.Iterable<java.util.List<Attribute>> sortOver(int fieldIndex)
        Sorts the data set over fieldIndex as the primary key, using the output index to break any ties. This index is remembered for future internal reference.
        Parameters:
        fieldIndex - The field to sort over
        Returns:
        An iterable representation of this data set sorted over the field index.
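        The ordering described above (primary key plus output tie-breaker) can be sketched with chained comparators. This is only an illustration, not the library's implementation: records are modelled as List<String> instead of List<Attribute>, and the class SortOverSketch with its explicit outputIndex parameter is hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in: each record is a List<String>, not the library's List<Attribute>.
class SortOverSketch {

    // Returns a sorted copy: fieldIndex is the primary key, outputIndex breaks ties.
    static List<List<String>> sortOver(List<List<String>> records, int fieldIndex, int outputIndex) {
        Comparator<List<String>> byField = Comparator.comparing(r -> r.get(fieldIndex));
        Comparator<List<String>> byOutput = Comparator.comparing(r -> r.get(outputIndex));
        List<List<String>> sorted = new ArrayList<>(records);
        sorted.sort(byField.thenComparing(byOutput));
        return sorted;
    }

    public static void main(String[] args) {
        List<List<String>> records = List.of(
                List.of("sunny", "no"),
                List.of("rainy", "yes"),
                List.of("sunny", "yes"),
                List.of("rainy", "no"));
        // Sort over field 0, using field 1 (the assumed output field) to break ties.
        for (List<String> record : sortOver(records, 0, 1)) {
            System.out.println(record);
        }
    }
}
```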
      • sortOver

        java.lang.Iterable<java.util.List<Attribute>> sortOver(int lo,
                                                               int hi,
                                                               int fieldIndex)
        Sorts the data set over fieldIndex as the primary key, using the output index to break any ties, and limits the elements to [lo, hi). This index is remembered for future internal reference.
        Parameters:
        lo - The lower bound (inclusive) of the data set to be returned
        hi - The upper bound (exclusive) of the data set to be returned.
        fieldIndex - The field to sort over
        Returns:
        An iterable representation of this data set sorted over the field index, limited to [lo, hi).
      • getSubset

        DataSet getSubset(int lo,
                          int hi)
        Creates a slice of this data set, [lo, hi), as a new data set.
        Parameters:
        lo - The lower bound (inclusive) of the data set to be returned
        hi - The upper bound (exclusive) of the data set to be returned.
        Returns:
        A new data set that is a copy of the elements of this one in [lo, hi).
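        The half-open range [lo, hi) includes lo and excludes hi, so the slice holds hi - lo elements. A minimal sketch of that convention follows, using a plain java.util.List as a stand-in for the data set; SubsetSketch is a hypothetical name, and the copy is made with the standard subList call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a data set: a plain list of records.
class SubsetSketch {

    // Copies the records in [lo, hi): index lo is included, index hi is excluded.
    static <T> List<T> getSubset(List<T> records, int lo, int hi) {
        return new ArrayList<>(records.subList(lo, hi));
    }

    public static void main(String[] args) {
        List<String> records = List.of("r0", "r1", "r2", "r3", "r4");
        System.out.println(getSubset(records, 1, 3)); // prints [r1, r2]
    }
}
```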
      • allTheSameOutput

        boolean allTheSameOutput()
        Returns:
        True if every element has the same class (the value at the output index).
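        As an illustration of this check, the sketch below compares every record's output value with the first one. Records are modelled as List<String>, the class SameOutputSketch is hypothetical, and treating an empty data set as uniform is an assumption; the source does not say what the library does in that case.

```java
import java.util.List;

// Hypothetical stand-in: each record is a List<String>.
class SameOutputSketch {

    // True when every record holds the same value at outputIndex.
    static boolean allTheSameOutput(List<List<String>> records, int outputIndex) {
        if (records.isEmpty()) {
            return true; // assumption: an empty data set counts as uniform
        }
        String first = records.get(0).get(outputIndex);
        for (List<String> record : records) {
            if (!first.equals(record.get(outputIndex))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<List<String>> records = List.of(
                List.of("sunny", "yes"),
                List.of("rainy", "yes"));
        System.out.println(allTheSameOutput(records, 1)); // prints true
    }
}
```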
      • allTheSame

        Attribute allTheSame()
        Returns the most common output attribute if all the other attributes are exactly the same over the whole data set. If even a single record differs from the rest in a single attribute, this method returns null.
        Returns:
        The most common output attribute, or null if any record differs from the rest.
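        One way to read this contract: compare every non-output field against the first record, and only if they all match report the majority output value. The sketch below follows that reading. It uses List<String> records and a hypothetical class AllTheSameSketch; returning null for an empty data set and leaving ties between equally common outputs unspecified are both assumptions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: each record is a List<String>.
class AllTheSameSketch {

    // Returns the most common output value when all non-output fields are identical
    // across records; returns null as soon as one field differs.
    static String allTheSame(List<List<String>> records, int outputIndex) {
        if (records.isEmpty()) {
            return null; // assumption: nothing to report for an empty data set
        }
        List<String> first = records.get(0);
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> record : records) {
            for (int i = 0; i < record.size(); i++) {
                if (i != outputIndex && !record.get(i).equals(first.get(i))) {
                    return null;
                }
            }
            counts.merge(record.get(outputIndex), 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        // On a tie, whichever value the map yields first wins (unspecified order).
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}
```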
      • splitKeepingRelation

        DataSet[] splitKeepingRelation(double proportion)
        Splits this data set into two new data sets, keeping the proportion between the output classes. The first data set contains the given proportion of the original data set. For instance, if the data set has 100 elements distributed between 2 classes in a 60/40 proportion, the first data set will contain (60*proportion + 40*proportion) elements and the second data set will contain the rest. This method is useful for generating training/test sets from one large data set.
        Parameters:
        proportion - the proportion of elements of each class to keep in the first data set.
        Returns:
        an array of two elements holding the data sets described above.
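        The 60/40 example can be worked through with a small stratified split. The sketch below is an assumption-laden illustration, not the library's code: records are List<String>, StratifiedSplitSketch is a hypothetical name, the proportion is taken to be a fraction between 0 and 1, per-class counts are rounded to the nearest integer, and no shuffling is done.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: each record is a List<String>.
class StratifiedSplitSketch {

    // Returns two lists: the first keeps `proportion` of each class, the second the rest.
    static List<List<List<String>>> split(List<List<String>> records, int outputIndex, double proportion) {
        // Group the records by their output class, preserving first-seen order.
        Map<String, List<List<String>>> byClass = new LinkedHashMap<>();
        for (List<String> record : records) {
            byClass.computeIfAbsent(record.get(outputIndex), k -> new ArrayList<>()).add(record);
        }
        List<List<String>> first = new ArrayList<>();
        List<List<String>> second = new ArrayList<>();
        for (List<List<String>> group : byClass.values()) {
            // Assumption: proportion lies in [0, 1] and counts are rounded to the nearest integer.
            int keep = (int) Math.round(group.size() * proportion);
            first.addAll(group.subList(0, keep));
            second.addAll(group.subList(keep, group.size()));
        }
        return List.of(first, second);
    }

    public static void main(String[] args) {
        List<List<String>> records = new ArrayList<>();
        for (int i = 0; i < 60; i++) records.add(List.of("x" + i, "A"));
        for (int i = 0; i < 40; i++) records.add(List.of("x" + i, "B"));
        List<List<List<String>>> parts = split(records, 1, 0.7);
        // 60*0.7 + 40*0.7 = 42 + 28 = 70 in the first part, 30 in the second.
        System.out.println(parts.get(0).size() + " " + parts.get(1).size()); // prints 70 30
    }
}
```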
      • getFrequencies

        java.util.HashMap<Attribute,java.lang.Integer> getFrequencies(int lo,
                                                                      int hi,
                                                                      int fieldIndex)
        Gets a map between the different values of the attribute at the fieldIndex and their respective frequencies. It limits the count space to [lo, hi).
        Parameters:
        lo - The lower bound (inclusive) of the range to count
        hi - The upper bound (exclusive) of the range to count.
        fieldIndex - The field to count
        Returns:
        a map with the different values and their respective frequencies.
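        A frequency map of this kind is the usual input to the entropy calculation that the interface description mentions. The sketch below counts values over [lo, hi) and then computes Shannon entropy in bits from the counts. It is illustrative only: records are List<String>, and both FrequencySketch and its entropy helper are hypothetical names that do not appear in the documented interface.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: each record is a List<String>.
class FrequencySketch {

    // Counts how often each value of fieldIndex occurs among records lo (inclusive) to hi (exclusive).
    static Map<String, Integer> getFrequencies(List<List<String>> records, int lo, int hi, int fieldIndex) {
        Map<String, Integer> frequencies = new HashMap<>();
        for (int i = lo; i < hi; i++) {
            frequencies.merge(records.get(i).get(fieldIndex), 1, Integer::sum);
        }
        return frequencies;
    }

    // Shannon entropy in bits: the sum of -p * log2(p) over the observed values.
    static double entropy(Map<String, Integer> frequencies) {
        int total = 0;
        for (int count : frequencies.values()) {
            total += count;
        }
        double result = 0.0;
        for (int count : frequencies.values()) {
            double p = (double) count / total;
            result -= p * (Math.log(p) / Math.log(2));
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> records = List.of(
                List.of("sunny", "yes"),
                List.of("rainy", "no"),
                List.of("sunny", "no"),
                List.of("rainy", "yes"));
        Map<String, Integer> frequencies = getFrequencies(records, 0, 4, 1);
        System.out.println(frequencies.get("yes") + " " + frequencies.get("no")); // prints 2 2
        System.out.println(entropy(frequencies));
    }
}
```

        A field whose values are evenly split between two classes gives an entropy of 1 bit, and a field with a single value gives 0.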
      • close

        void close()
        Closes the underlying data source if possible.