What is best practice for high number of properties

hharrington · May 24, 2023, 6:42pm

Hi,
according to this doc ( Limits on Table Column Count and Row Size )
I should keep each table to around 30 columns. I have 279 properties so I’d need 10 tables. That seems like a lot of tables to manage and also 10 relationships i’d need to manage. Can you offer any insight the best method to handle this amount of data? fewer tables means higher space and more latency but more tables means more join operations which also has an overhead as well. If you have any links to articles about this I’d love to have them.

thanks,
HH

mark-piller · May 24, 2023, 6:54pm

Hello,

Could you please shed some light on where the number 279 comes from?

When I think about tables and properties, the first step is to identify all the entities in the system. They typically translate to tables. Then for each entity, I think about properties that define an entity (for example, an entity Person would have the properties name, age, phoneNumber, etc). Properties map to columns. Then come the relationships (a Person may have one or more Address entities, etc).

Regards,
Mark

hharrington · May 24, 2023, 10:06pm

Mark,

sure I have a bunch of soil samples each will have a sample_id which is unique each sample will have fields for mounts of trace elements like iron (fe), potassium (K), al (aluminum), etc these will all be a decimal number between 90000 - 0.001 and also a notes field (which i realize now can be broken out into its own table ) so thats about 75 trace elements and 75 trace notes

Ive got a similar table for oxides they are also a decimal same range as above. there is also a notes field which again I can break into two tables there are 15 trace elements and 15 notes fields I can break those up into two tables as well

I have another 60 fields with other soil info like moisture, a bunch of different organic compounds with the same decimal range and I can break this up into 30 measurements and 30 notes

finally I have the sample info 40 fields that contain (sample_id) and state, county, longitude, latitude most of these will be string fields. I guess I could break out country, state, county, province, district and maybe a few others into their own table

now I"m at 10 ish tables i expect to have 200k entries and most of the app will be running searches using all the attributes ( or a combination of them anyway) to find samples that match whatever they need. and then searching for distance for samples given a long,lat.
Hope that helps.

Thanks for your time, I have never done dev work so I do have a learning curve.

HH

mark-piller · May 25, 2023, 1:56am

Hi HH,

Let’s focus on the soil samples first. Do the properties of a soil sample change over time? or once they are taken, they stay unchanged for the same sample?

Additionally, do you envision running database queries to retrieve soil samples based on the trace elements they contain?

The reason I ask these questions is because the DB structure to store the data may vary depending on your answers.

Regards,
Mark

hharrington · May 25, 2023, 1:49pm

Mark,

in general a sample should be static unless there is a new sample taken but that should be rare it is possible a new sample from a new location might be added. this data has been compiled over a decade so changes to already existing samples will be close to nill. New samples may happen maybe 30 over a two year span but as of this moment there a no plans for new samples since the existing data is not being used ( hence my foray with Backendless )

There are two types of query’s that will be used.

find all samples that match ( filters) and show all samples within a radius of [insert location]

the samples search will be run against multiple trace elements, oxides, and the organic compounds
for example show all samples with FE < 1.2 and SI > 3.4 and moisture between x and Y and SiO2 (oxide) < 2

so most searches are going to hit all three categories. Since there are so many things that can be used as a filter, i was planning to list them in a few drop down lists along with <,>,=, between for each and limit the total number of filters to something sane ( like 10 ish) this depends a bit on input from the company once I have a POC working.

I know this means my where clause will be a nightmare, I’m panning to have each drop down filter store its name and operator (name will be the column name and operator will be <,>,=, between in pagedata. Then have the search button add them all together to form the ware clause ignoring any that are null.

Let me know if I can provide more information to help out.

HH

mark-piller · May 25, 2023, 5:04pm

Hi HH,

Based on the information you shared, I recommend storing all element “readings” in a single column of the JSON type. For example, the data stored there could look like this:

{
  "potassium":4500,
  "iron":1200,
  "aluminum":5500
}

This way the schema becomes much lighter and you do not lose any of the querying capabilities. Moreover, you can expand the amount of data stored in JSON depending on the soil sample type (if there is such a thing).

For more information about JSON data in the database, read the following documentation:
https://backendless.com/docs/rest/data_json_overview.html

Hope this helps.

Mark

hharrington · May 25, 2023, 5:27pm

Mark,

I had thought about doing just that, but read that using json for items frequently accessed can lead to performance issues and is better adapted for unstructured data. For Backendless, will the performance lost from json access offset the performance gained by not needing to use as many joins?

What about data duplication? i’d need each field name (trace elements ) in every sample row right? so i’d end up with “potassium” as a string for each row. or would that make no real difference ( in terms of performance )

Am I off base here? I’m not a dba so I’m trying to educate myself.

Thanks for taking a look,
HH

mark-piller · May 25, 2023, 7:44pm

HH,

I am not aware of any performance loss resulting from using the JSON column type. Additionally, from the descriptions you provided, I do not see a need for any related tables (i.e. joins).

As for duplication, there would be none. Your JSON would contain only the elements which are needed to describe a specific soil sample. If one sample has potassium, the JSON in that row would have a value for it, other soil samples without potassium would omit that key/value in the JSON structure.

Regards,
Mark

hharrington · May 25, 2023, 8:01pm

Mark,

thanks for looking at this. I thought looking at a related table would require a join operation between the tables. does Backendless do this differently or am I just misinformed?

about duplication, If I have a bunch of samples with iron and store them in a json I will duplicate “iron” for each row that has it. or is this just a scale thing? IE if I duplicate the string “potassium” across 100k rows that’s 900000 characters which would be about 1 MB of storage, not enough to worry about.

thanks,
HH

sergey.kuk · May 26, 2023, 8:50am

hello @hharrington

I thought looking at a related table would require a join operation between the tables. does Backendless do this differently or am I just misinformed?

Relation in backendless is a join operations.

IE if I duplicate the string “potassium” across 100k rows that’s 900000 characters which would be about 1 MB of storage, not enough to worry about.

It is not enough to worry about if you do not suppose to use JSON column in the where clause. In another case, it may be a problem because JSON columns have no indexes. So if you have a huge number of records and search by JSON column you may have a performance issue. So the question you may ask is what is the huge number, and I can’t say for sure. It should be tested.

hharrington · May 31, 2023, 1:01pm

I"m thinking of creating a table with 3 fields sampleid, attribute_name, and value

looking at the trace group, i’d have 70 entries in this table for each single sample_id so I’ll have a 4-5 million rows in this table. I was thinking i could put an index on each column to speed up searching. Then I’d add a one to many relationship (using the sampleid field) between the sample info table to the trace table.

this would reduce the number of columns and not use json so I’d not pay the performance hit for searching json objects or for having a table with 70 columns.

does this sound reasonable ?